Metadata-Version: 2.1
Name: split_file_reader
Version: 0.0.2
Summary: A package to read parted names on disk.
Home-page: https://gitlab.com/Reivax/split_file_reader
Author: Xavier Halloran
Author-email: sfr@reivax.us
License: UNKNOWN
Description: # SplitFileReader
        
        A python module to transparently read files that have been split on disk, without combining them.  Exposes the 
        `readable`, `read`, `writable`, `write`, `tellable`, `tell`, `seekable`, `seek`, `open` and `close` functions, as well
        as a Context Manager and an Iterable.
        
        ### Usage
        #### Simple Example
        List all of the files within a TAR file that has been broken into multiple parts.
        ```python
        import tarfile
        from split_file_reader import SplitFileReader
        
        filepaths = [
            "./files/archives/files.tar.000",
            "./files/archives/files.tar.001",
            "./files/archives/files.tar.002",
            "./files/archives/files.tar.003",
        ]
        
        with SplitFileReader(filepaths) as sfr:
            with tarfile.open(fileobj=sfr, mode="r") as tf:
                for tff in tf.filelist:
                    print("File in archive: ", tff.name)
        ```
        
        #### Text files.
        The `SplitFileReader` works only on binary data, but does support the use of the `.io.TextIOWrapper`.
        
        The `SplitFileReader` may also be given a glob for the filepaths.
        ```python
        import glob
        from io import TextIOWrapper
        
        from split_file_reader import SplitFileReader
        
        with SplitFileReader(glob.glob("./files/plaintext/Adventures_In_Wonderland.txt.*")) as sfr:
            with TextIOWrapper(sfr) as text_wrapper:
                for line in text_wrapper:
                    print(line, end='')
        ```
        
        These files may be anywhere on disk, or across multiple disks.
        
        SplitFileReader does not support writing, `writable` will always return `False`, calls to `write` will raise an
        IOError.
        
        ### Use case
        
        Many large files are distributed in multiple parts, especially archives.  The general solution to reassembly is to call
        `cat` from the terminal, and pipe them into a single cohesive file; however for various reasons this may not always be
        possible or desirable.  If the full set of files is larger than the entire disk; if there is not enough space left to
        `cat` them all together; or if only a small set of the payload data is required.
        ```bash
        cat ./files/archives/files.zip.* > ./files/archives/file.zip
        ```
        
        In these scenarios, using the `SplitFileReader` will provide an alternative solution, enabling random access throughout
        the archive without making a single file on disk.
        
        #### Github and Gitlab Large File Size
        
        Github and Gitlab (as well as other file repositories) impose file size limits.  By parting these files into
        sufficiently small chunks, the `SplitFileReader` will be able to make transparent use of them, as though they were a
        single cohesive file.  This will remove any requirements to host these files with pre-fetch or pre-commit scripts, or
        any other "setup" mechanism to make use of them.
        
        #### Symmetric Download
        Some HTTP file servers set maximum transfer windows.  With the `SplitFileReader`, each piece of data can be streamed
        into its own file, and then used directly, without the need to reassemble them; by piping each file stream directly to 
        disk.  The files will then be immediately available for use, without a recombination step.
        
        #### Other Uses
        Because the file type is transparent to the class, even CSV Files can be split and processed this way, provided
        that the column headers are only present on the first file.  The CSV does not even need to be split along the rows, it 
        can be split at any point (and even mid character for multi-byte characters). 
        
        This library supports only binary read modes; to support decoding, wrap a String Buffer or other decoding system.
        Because the component files may be split at any byte offset, it is possible that files are split mid-character.  This
        will be transparant to any module wrapped around the SplitFileReader.
        
        #### Random Access
        This module allows for random access of the data, allowing for Tar or Zip files to be extracted without first combining
        them.
        
        ```python
        sfr = split_file_reader.open(filepaths)
        with zipfile.ZipFile(sfr, "r") as zf:
            print(zf.filelist)
        sfr.close()
        ```
        Or, for text files:
        ```python
        with SplitFileReader(filepaths) as sfr,\
                io.TextIOWrapper(sfr, encoding="utf-8") as tiow:
            for line in tiow:
                print(line, end='')
        ```
        
        #### Streaming Access
        The `SplitFileReader` can be used in a stream-only format, which disables the `seek` functionality.  It allows one to 
        call `iter()` on the object, and then call`next()` to produce a stream of bytes; or, it may be wrapped in a `for` loop.
        
        ```python
        with SplitFileReader(filepaths) as sfr:
            for b in sfr:
                print("{:02X}", b)
        ```
        Or, to produce fixed amounts of data, the `set_iter_size(size)` function can be called, which will read up to the `size` 
        amount of data.  `set_iter_size` may be called at any point, even inside the loop.
        ```python
        with SplitFileReader(filepaths) as sfr:
            sfr.set_iter_size(16)
            for byte_list in sfr:
                print(" ".join("{:02X}".format(x) for x in byte_list))
        ```
        
        Additionally, adding the `streaming_only=True` argument to the initializer will force this mode, but will not create
        an iterable.   `iter()` must still be called, either explicitly, or implied via a loop. 
        
        An existiing `SplitFileReader` instance may be converted to Streaming mode at any time, but may not be converted back
        to random-access mode.
        
        #### Constructor Arguments
        - `files`: a list of zero or more strings, with either a fully qualified explicit location, or a relative location.
        These file paths are whatever `builtins.open()` would need.
          - An empty list will always read nothing, and finish iterating immediately.
          - A list with a single file will simply wrap a single file, as a pass-through.
          - Otherwise, each of these files will be opened, one at a time, in the given order.
        - `mode`: this must be `rb` or `r`.  It is only left for programs that explicitly set the `mode` argument.
        - `stream_only`: Disables the `seek()` method.  The `__init__` will still not return an iterator, must still use 
        `__iter__` for that.  Mutually exclusive with `validate_all_readable`
        - `validate_all_readable`: Seek to every file in the `files` list, and check if readbale.  Calls `test_all_readable`
        method at the end of the constructor.  Mutually exclusive with `stream_only`
        
        #### Context Manager
        The `SplitFileReader` allows for a Context Manager.  It simply calls `close()` at exit.
        
        ## Command Line Invocation
        The module may be used via the command line for some simple processing of certain archive types.  Presently, only Tar
        and Zip formats are supported, and they must have been split via the `split` command, or other binary split mechanism.
        
        
        ```
        usage:  [-h] [-a {zip,z,tar,t,tgz,tbz,txz}] [-p <password>]
                (-t | -l | -x <destination> | -r <filename>)
                <filepath> [<filepath> ...]
        
        Identify and process parted archives without manual concat. This command line
        capability provides supports only Tar and Zip files; but not 7z or Rar.
        Designed to work for files that have been split via the `split` utility, or
        any other binary cut; but does not support Zip's built-in split capability.
        The python module supports any arbitrarily split files, regardless of type.
        
        positional arguments:
          <filepath>            In-order list of the parted files on disk. Use shell
                                expansion, such as ./files.zip.*
        
        optional arguments:
          -h, --help            show this help message and exit
          -a {zip,z,tar,t,tgz,tbz,txz}, --archive {zip,z,tar,t,tgz,tbz,txz}
                                Archive type, either zip, tar, tgz, tbz, or txz
          -p <password>, --password <password>
                                Zip password, if needed
          -t, --test            Test the archive, using the module's built-in test.
          -l, --list            List all the payload files in the archive.
          -x <destination>, --extract <destination>
                                Extract the entire archive to filepath <destination>.
          -r <filename>, --read <filename>
                                Read out payload file contents to stdout.
        ```
        
        #### Examples
        To display the contents of the Zip files included in the test suite of this modules, run
        ```bash
        python3 -m split_file_reader -azip --list ./files/archives/files.zip.*
        ```
        The bash autoexpansion of the `*` wildcard will fill in the files in order, correctly.  `--list` will print out the 
        names of the payload fiels within the zip archive, and the `-azip` flag instructs the module to expect the `Zip`
         archive type.
        
        ### Mechanics
        #### File Descriptors
        The `SplitFileReader` will make use of only a single File Descriptor at a time.  In random-access mode, the default
        mode, as the file pointer moves over file boundaries, the existing File Descriptor will be closed before a new one is
        opened.  For functions that regularly seek and read over a file boundary, the File Descriptors will be opened and 
        closed often.  For streaming mode, once a file's File Descriptor is closed, a new one will not be created.
        
        Just like with `open()`, a File Descriptor will be kept open unless `close()` is called on the object.  Using the
        Context Managed version with the `with` keyword will automatically close the last file descriptor.  `SplitFileReader`
        exposes a `close()` method for this.
        
        Reading beyond the end of the list of files will cause `read()` to return nothing, but will not close the last File
        Descriptor.  A `read()` call that crosses the file boundaries will close one and open another, transparently to the
        calling Python code, but will always keep one File Descriptor open.  The same applies to `seek()`.
        
        #### Concurrency
        The `SplitFileReader` is not designed for concurrent or threaded access, it behaves the same as any other file that
        has been opened via `open()` (and in fact uses the `builtins.open()` to operate.)  However, since the data it operates
        against is read-only, multiple `SplitFileReader`s can be opened against the same data at the same time.
        
        #### Caveats
        While this class can open any arbitrarily split data, Zip chunks that are produced by the `zip` command are *not* simple
        binary chunks.  They are logically divided in a separate way.  Zip files that have been parted via the `split` command,
        after or during their creation, will work just fine.
        
        Because the `SplitFileReader` allows random-access to the component files, the `files` list must also be random-access,
        indexable, and contain only filepaths.  It cannot be generator.
        
Platform: any
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Operating System :: OS Independent
Classifier: Topic :: System :: Filesystems
Requires-Python: >=3.5
Description-Content-Type: text/markdown
