Metadata-Version: 2.1
Name: pfdo_run
Version: 3.2.4
Summary: Run arbitrary CLI on each nested dir of an inputdir
Home-page: https://github.com/FNNDSC/pfdo_med2image
Author: FNNDSC
Author-email: dev@babymri.org
License: MIT
Platform: UNKNOWN
License-File: LICENSE

pfdo_run 3.2.4
==================

.. image:: https://badge.fury.io/py/pfdo_run.svg
    :target: https://badge.fury.io/py/pfdo_run

.. image:: https://travis-ci.org/FNNDSC/pfdo_run.svg?branch=master
    :target: https://travis-ci.org/FNNDSC/pfdo_run

.. image:: https://img.shields.io/badge/python-3.5%2B-blue.svg
    :target: https://badge.fury.io/py/pfdo_run

.. contents:: Table of Contents


Overview
---------

``pfdo_run`` explores an input directory tree for files and directories of interest and applies a user-specified CLI command at each hit. Outputs are typically saved in an output tree that mirrors the input tree.

Internally, ``pfdo_run`` leverages the ``pftree`` infrastructure to perform the space exploration and allows for callback methods to be applied at each stage of ``read``, ``analyze`` and ``write`` for valid target hits.

Additionally, ``pfdo_run`` can apply some further functions to its hits, such as ``md5`` hashing, string replacement, extension removal and more. See below for more detail.
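
The idea of an output tree that mirrors the input tree can be illustrated with a minimal Python sketch. This is not the ``pftree`` implementation; the helper name ``mirrored_dirs`` is hypothetical and the sketch only assumes that every input directory containing files maps to the same relative location under the output root.

.. code:: python

    import os

    def mirrored_dirs(input_root, output_root):
        """Yield (input_dir, output_dir, files) for each directory holding files.

        Hypothetical helper: the output directory is the input directory's
        path relative to input_root, re-rooted at output_root.
        """
        for current, _subdirs, files in os.walk(input_root):
            if not files:
                continue
            relative = os.path.relpath(current, input_root)
            output_dir = os.path.normpath(os.path.join(output_root, relative))
            yield current, output_dir, sorted(files)

Under these assumptions, a file found in ``<inputDir>/a/b`` would have its results written to ``<outputDir>/a/b``.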

Installation
------------

Dependencies
~~~~~~~~~~~~

The following dependencies must be available on your host system or in your python3 virtual environment (they are installed automatically when ``pfdo_run`` is installed from PyPI):

-  ``pfmisc`` (various misc modules and classes for the pf* family of objects)
-  ``pftree`` (create a dictionary representation of a filesystem hierarchy)
-  ``pfdo``   (the base module that does the core interfacing with ``pftree``)

Using ``PyPI``
~~~~~~~~~~~~~~

The best method of installing this script and all of its dependencies is
to fetch it from PyPI:

.. code:: bash

        pip3 install pfdo_run

CLI specification
-----------------

Any text in the CLI prefixed with a percent char ``%`` is interpreted in one of two ways.

First, any CLI argument of ``pfdo_run`` itself can be accessed via ``%``. Thus, for example, a ``%outputDir`` in the ``--exec`` string is expanded to the ``outputDir`` of ``pfdo_run``.

Secondly, three internal ``%`` variables are available:

* ``%inputWorkingDir``  - the current input tree working directory
* ``%outputWorkingDir`` - the current output tree working directory
* ``%inputWorkingFile`` - the current file being processed

These internal variables allow for contextual specification of values. For example, a simple ``touch`` command could be specified as:

.. code:: bash

    --exec "touch %outputWorkingDir/%inputWorkingFile"

or a command to convert an input ``png`` to an output ``jpg`` using the ImageMagick ``convert`` utility:

.. code:: bash

    --exec "convert %inputWorkingDir/%inputWorkingFile
                    %outputWorkingDir/%inputWorkingFile.jpg"
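
To make the substitution concrete, here is a minimal Python sketch of how ``%`` tags in an ``--exec`` string could be expanded. It is an illustration only and not the ``pfdo_run`` implementation; the function name ``expand_tags`` and the sample values are hypothetical.

.. code:: python

    import re

    def expand_tags(template, values):
        """Replace each %name in template with values[name].

        Tags with no entry in values are left untouched. Function tags such
        as %_rmext_... start with an underscore and are not matched here.
        """
        def substitute(match):
            name = match.group(1)
            return str(values.get(name, match.group(0)))

        return re.sub(r"%([A-Za-z]\w*)", substitute, template)

    # Hypothetical values for one file in the input tree.
    values = {
        "inputWorkingDir": "/in/a",
        "outputWorkingDir": "/out/a",
        "inputWorkingFile": "img.png",
    }
    command = expand_tags(
        "convert %inputWorkingDir/%inputWorkingFile "
        "%outputWorkingDir/%inputWorkingFile.jpg",
        values,
    )
    # command is now "convert /in/a/img.png /out/a/img.png.jpg"

Note that with plain tag expansion the original extension is kept, so the output file in this sketch is ``img.png.jpg``.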

Special Functions
-----------------

Furthermore, ``pfdo_run`` can apply some internal functions to a tag. The template for specifying a function to apply is:

.. code::

    %_<functionName>[|arg1|arg2|...]_<tag>

Thus, a function is identified by a ``<functionName>`` that is prefixed and suffixed by an underscore ``_`` and appears in front of the tag to process. Any arguments to the ``<functionName>`` are separated by pipe ``|`` characters.

For example, a string snippet that contains

.. code:: bash

    %_strrepl|.|-_inputWorkingFile.txt

will replace all occurrences of ``.`` in the ``%inputWorkingFile`` with ``-``. Note that the trailing ``.txt`` is not part of the tag and is preserved in the final result.

The following functions are available:

.. code:: html

    %_md5[|<len>]_<tagName>
    Apply an 'md5' hash to the value referenced by <tagName> and optionally
    return only the first <len> characters.

    %_strmsk|<mask>_<tagName>
    Apply a simple mask pattern to the value referenced by <tagName>. Chars
    that are "*" in the mask are passed through unchanged. The mask and its
    target should be the same length.

    %_strrepl|<target>|<replace>_<tagName>
    Replace the string <target> with <replace> in the value referenced by
    <tagName>.

    %_rmext_<tagName>
    Remove the "extension" of the value referenced by <tagName>. This
    of course only makes sense if the <tagName> denotes something with
    an extension!

    %_name_<tag>
    Replace the value referenced by <tag> with a name generated by the
    faker module.

Functions cannot currently be nested.
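
The behaviour described above can be approximated with a short Python sketch. This is an assumption-laden illustration, not the code ``pfdo_run`` uses: the names ``fn_md5``, ``fn_strmsk``, ``fn_strrepl``, ``fn_rmext`` and ``apply_functions`` are hypothetical, ``%_name_`` is omitted to avoid a dependency on ``faker``, and arguments containing ``_`` or ``|`` are not handled.

.. code:: python

    import hashlib
    import os
    import re

    def fn_md5(value, length=None):
        """md5 hex digest of value, optionally cut to the first <length> chars."""
        digest = hashlib.md5(value.encode()).hexdigest()
        return digest[:int(length)] if length else digest

    def fn_strmsk(value, mask):
        """Pass chars through where the mask has '*', else use the mask char."""
        return "".join(v if m == "*" else m for v, m in zip(value, mask))

    def fn_strrepl(value, target, replace):
        """Replace every occurrence of target with replace."""
        return value.replace(target, replace)

    def fn_rmext(value):
        """Drop the final extension, if any."""
        return os.path.splitext(value)[0]

    FUNCTIONS = {
        "md5": fn_md5,
        "strmsk": fn_strmsk,
        "strrepl": fn_strrepl,
        "rmext": fn_rmext,
    }

    # %_<functionName>[|arg1|arg2|...]_<tag>
    FUNCTION_TAG = re.compile(r"%_([a-z0-9]+)((?:\|[^|_]*)*)_([A-Za-z]\w*)")

    def apply_functions(template, values):
        """Expand every function tag in template using values[tag]."""
        def substitute(match):
            name, raw_args, tag = match.groups()
            args = raw_args.split("|")[1:] if raw_args else []
            return FUNCTIONS[name](str(values[tag]), *args)

        return FUNCTION_TAG.sub(substitute, template)

    values = {"inputWorkingFile": "brain.001.dcm"}
    replaced = apply_functions("%_strrepl|.|-_inputWorkingFile.txt", values)
    # replaced is now "brain-001-dcm.txt"

The last lines reproduce the ``%_strrepl`` example given earlier: every ``.`` inside the tag value becomes ``-``, while the literal ``.txt`` after the tag is left alone.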

Command line arguments
----------------------

.. code:: html


        -I|--inputDir <inputDir>
        Input base directory to traverse.

        -O|--outputDir <outputDir>
        The output root directory that will contain a tree structure identical
        to the input directory, and each "leaf" node will contain the analysis
        results.

        --exec <CLIcmdToExec>
        The command line expression to apply at each directory node of the
        input tree. See the "CLI specification" section for more information.

        [-i|--inputFile <inputFile>]
        An optional <inputFile> specified relative to the <inputDir>. If
        specified, then do not perform a directory walk, but process only
        this file.

        [-f|--fileFilter <someFilter1,someFilter2,...>]
        An optional comma-delimited string to select files of interest
        from the <inputDir> tree. Each token in the expression is applied in
        turn over the space of files in a directory location, and only files
        that contain this token string in their filename are preserved.

        [-d|--dirFilter <someFilter1,someFilter2,...>]
        An additional filter that will further limit any files to process to
        only those files that exist in leaf directory nodes that have some
        substring of each of the comma separated <someFilter> in their
        directory name.

        [--analyzeFileIndex <someIndex>]
        An optional string to control which file(s) in a specific directory
        the analysis is applied to. The default is "-1" which implies
        *ALL* files in a given directory. The valid values of <someIndex> are:

            "m":   only the "middle" file in the returned file list
            "f":   only the first file in the returned file list
            "l":   only the last file in the returned file list
            "<N>": the file at index N in the file list. If this index
                   is out of bounds, no analysis is performed.
            "-1":  all files.

        [--outputLeafDir <outputLeafDirFormat>]
        If specified, will apply the <outputLeafDirFormat> to the output
        directories containing data. This is useful to blanket describe
        final output directories with some descriptive text, such as
        'anon' or 'preview'.

        This is a formatting spec, so

            --outputLeafDir 'preview-%s'

        where %s is the original leaf directory node, will prefix each
        final directory containing output with the text 'preview-' which
        can be useful in describing some features of the output set.

        [--threads <numThreads>]
        If specified, break the innermost analysis loop into <numThreads>
        threads.

        [--noJobLogging]
        If specified, then suppress the logging of per-job output. Usually
        each job that is run will have, in the output directory, three
        additional files:

                %inputWorkingFile-returncode
                %inputWorkingFile-stderr
                %inputWorkingFile-stdout

        By specifying this option, the above files are not recorded.

        [-x|--man]
        Show full help.

        [-y|--synopsis]
        Show brief help.

        [--json]
        If specified, output a JSON dump of final return.

        [--followLinks]
        If specified, follow symbolic links.

        -v|--verbosity <level>
        Set the app verbosity level.

            0: No internal output;
            1: Run start / stop output notification;
            2: As with level '1' but with simpleProgress bar in 'pftree';
            3: As with level '2' but with list of input dirs/files in 'pftree';
            5: As with level '3' but with explicit file logging for
                    - read
                    - analyze
                    - write
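
The ``--analyzeFileIndex`` and ``--outputLeafDir`` options described above can be illustrated with a small Python sketch. It is not taken from the ``pfdo_run`` source; the function names are hypothetical, and the choice of ``len(files) // 2`` as the "middle" file is an assumption.

.. code:: python

    import os

    def select_files(files, analyze_file_index="-1"):
        """Pick the files to analyze from one directory's file list."""
        if not files:
            return []
        if analyze_file_index == "-1":
            return list(files)
        if analyze_file_index == "f":
            return [files[0]]
        if analyze_file_index == "l":
            return [files[-1]]
        if analyze_file_index == "m":
            # Assumption: "middle" means the element at len // 2.
            return [files[len(files) // 2]]
        index = int(analyze_file_index)
        # Out-of-bounds indices select nothing, so no analysis is performed.
        return [files[index]] if 0 <= index < len(files) else []

    def output_leaf_dir(output_dir, leaf_format):
        """Apply a '%s' format spec to the last component of output_dir."""
        parent, leaf = os.path.split(output_dir.rstrip("/"))
        return os.path.join(parent, leaf_format % leaf)

With these definitions, ``select_files(["a", "b", "c"], "m")`` gives ``["b"]`` and ``output_leaf_dir("/out/sub/100307", "preview-%s")`` gives ``/out/sub/preview-100307`` on a POSIX system.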


Examples
--------

Perform a ``pfdo_run`` down some input directory and convert all input ``jpg`` files to ``png`` in the output tree:

.. code:: bash

    pfdo_run                                                \
        -I /var/www/html/data --fileFilter jpg              \
        -O /var/www/html/png                                \
        --exec "convert %inputWorkingDir/%inputWorkingFile
        %outputWorkingDir/%_rmext_inputWorkingFile.png"     \
        --threads 0 --printElapsedTime

The above will find all files in the tree structure rooted at ``/var/www/html/data`` that contain the string ``jpg`` anywhere in the filename. For each file found, ``convert`` is called and the converted file is stored in the same tree location in the output directory as the original input.

Note the special construct ``%_rmext_inputWorkingFile.png``: the ``%_rmext_`` designates a built-in function to apply to the tag value, in this case to "remove the extension" from the ``%inputWorkingFile`` string.
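
Unless ``--noJobLogging`` is given, a run like the one above leaves a ``-returncode`` file next to each result. A quick way to look for failed jobs afterwards is sketched below in Python. The helper ``failed_jobs`` is hypothetical, and the sketch assumes that each ``-returncode`` file holds the command's exit status as text, with ``0`` meaning success; check the files from an actual run before relying on this.

.. code:: python

    import os

    def failed_jobs(output_root):
        """Return sorted (path, status) pairs for every non-zero returncode file.

        Assumption: each '<file>-returncode' contains the exit status as text.
        """
        failures = []
        for current, _subdirs, files in os.walk(output_root):
            for name in files:
                if not name.endswith("-returncode"):
                    continue
                path = os.path.join(current, name)
                with open(path) as handle:
                    status = handle.read().strip()
                if status != "0":
                    failures.append((path, status))
        return sorted(failures)

For the example above, this would be called as ``failed_jobs("/var/www/html/png")``.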

Consider an example where only one file in a branched inputdir
space is to be preserved:

.. code:: bash

    pfdo_run                                                \
        -I $(pwd)/raw -O $(pwd)/out                         \
        -d 100307 -f " "                                    \
        --exec "cp %inputWorkingDir/brain.mgz
        %outputWorkingDir/brain.mgz"                        \
        --threads 0 --verbosity 3 --noJobLogging

Here, the input directory space is pruned for a directory leaf node that contains the string ``100307``. The exec command essentially copies the file ``brain.mgz`` in that target directory to the corresponding location in the output tree.

Finally the elapsed time and a JSON output are printed.



