Metadata-Version: 1.1
Name: mmmeta
Version: 0.1.1
Summary: A simple toolkit for managing local state against remote metadata.
Home-page: https://github.com/simonwoerpel/mmmeta
Author: Simon Wörpel
Author-email: simon.woerpel@medienrevolte.de
License: MIT
Description: mmmeta
        ======
        
        Handle meta information and local state about files from remote
        (read-only) locations.
        
        example usecase
        ---------------
        
        It’s better explained by a concrete example:
        
        **Server** scrapes documents and stores them with metadata
        
        **Client1** wants to download all files with
        ``document_type="contract"``
        
        **Client2** wants to import all documents scraped not longer than 1 week
        ago into a database, but only the ones that are not imported yet
        
        synopsis
        ~~~~~~~~
        
        To clarify the terms used in this manual:
        
        -  **files**: actual files (like pdfs…)
        -  **metadata files**: json files that contain metadata for actual files
        -  **metadata db**: sqlite database containing metadata for all files
           from the remote
        -  **remote**: the “source of truth” where files, metadata files and
           metadata db are stored. A remote can still be a local folder on the
           same machine…
        -  **client**: A client that has read-only access to the remote
        -  **state db**: sqlite database, stored only on the client, containing
           local state for files
        -  `store <#store>`__: A simple implementation of a key-value store for
           additional information
        -  **metadir**: a directory named ``_mmmeta`` that is
           `synced <#synchronization>`__ between remote and client and contains
           metadata db, store, and (on the client) state db
        
        how does this scenario work?
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        
        **Server** 1. Stores a metadata json file for each file 2. Generates
        (and updates) a *metadir*
        
        **Client1** 1. Syncs remote *metadir* 2. Merge remote *metadata db* with
        local *state db* 3. Query *state db* for given criteria 4. For each
        result download the actual file from the remote
        
        **Client2** 1. Syncs remote *metadir* 2. Merge remote *metadata db* with
        local *state db* 3. Query the *state db* for remote metadata
        ``retrieved_at=<date>`` and local state ``imported=False``
        
        **mmmeta** automates *almost ;)* all of this:
        
        implementation of this scenario with mmmeta
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        
        Server
        ^^^^^^
        
        scrapes document and stores them with metadata
        
        a) Server stores files locally
        ''''''''''''''''''''''''''''''
        
        If the files (and their metadata) are stored locally, metadata
        generation is as easy as looping through all of the json files and
        generate the database out of it. This can be done via command line
        inside the directory of the *metadata files*:
        
        ::
        
           mmmeta generate
        
        This will loop through all json files create a sqlite database in
        ``./_mmmeta/meta.db``
        
        For other path locations, see `initialization <#initialization>`__
        
        When new *metadata files* are added, simply re-run this command. It will
        just update the *meta db* without deleting existing entries, which means
        the old *metadata files* don’t need to stay on the server (see next
        situation).
        
        b) Server downloads files locally but then pushes into a cloud
        ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
        
        Here we don’t have all the files locally, only a subset (the new
        downloaded ones).
        
        First, `synchronize <#synchronization>`__ cloud *metadir* to local (aka
        the server).
        
        Then update metadata as described above:
        
        ::
        
           mmmeta generate
        
        Last, `synchronize <#synchronization>`__ the updated *metadir* back to
        the cloud.
        
        c) Server directly pushes files to cloud
        ''''''''''''''''''''''''''''''''''''''''
        
        Here, we don’t have any file and its metadata locally (on the server).
        Updating the *meta db* happens within the python code of the
        application:
        
        First, `synchronize <#synchronization>`__ cloud *metadir* to local (aka
        the server).
        
        Then, run your application…
        
        .. code:: python
        
           from mmmeta import mmmeta
        
           m = mmmeta("./path/to/metadir")
        
           for data in scraper:
               m.files.insert(**data)
               # or upsert, if you want:
               m.files.upsert(**data, [keys])  # e.g. "content_hash"
        
        This will update the *meta db* in the *metadir*
        
        Last, `synchronize <#synchronization>`__ the updated *metadir* back to
        the cloud.
        
        Client1
        ^^^^^^^
        
        wants to download all files with ``document_type="contract"``
        
        First, `synchronize <#synchronization>`__ remote *metadir* to local.
        
        Then,
        
        .. code:: python
        
           from mmmeta import mmmeta
        
           m = mmmeta("./path/to/metadir")
        
           for file in m.files(document_type="contract"):
               download(file["public_url"])
        
           def download(url):
               # implement download based on remote storage
               # url will be, based on storage, something like:
               # - file:///path/to/file.pdf (remote is local filesystem)
               # - s3://bucket/path/to/file.pdf (remote is aws cloud storage)
               # - https://remote.com/path/to/file.pdf
               # ...
        
        The
        
        Client2
        ^^^^^^^
        
        wants to import all documents scraped not longer than 1 week ago into a
        database, but only the ones that are not imported yet
        
        Therefore the client uses a local state db in the mmmeta.
        
        First, `synchronize <#synchronization>`__ remote metadata db to local
        
        Then, update meta to local state: via command-line:
        
        ::
        
           MMMETA=./path/to/metadir mmmeta update
        
        or programmatically:
        
        .. code:: python
        
           from mmmeta import mmmeta
        
           m = mmmeta("./path/to/metadir/")
           m.update()
        
        After that, remote metadata and local state are merged and easy usable
        like this:
        
        .. code:: python
        
           for file in m.files.find(retrieved_at=<date>, imported=False):
               process_import(file)
               file["imported"] = True
               file.save()
        
        The ``files`` object on a metadir is a wrapper to a `dataset
        table <https://dataset.readthedocs.io/en/latest/api.html#table>`__ with
        all its functionallity, instead that it yields ``File`` objects that you
        can use to alter the state of the files in the database as described in
        the example above.
        
        Initialization
        ~~~~~~~~~~~~~~
        
        On the *client*:
        
        When **mmmeta** is `initialized <#initialization>`__ with a ``path``,
        the directory ``path/_mmmeta`` will be the *metadir*
        
        ``path`` can be set via env var:
        
        ::
        
           MMMETA=./path/ mmmeta update
        
        or in scripts:
        
        .. code:: python
        
           from mmmeta import mmmeta
        
           m = mmmeta("./path/")
        
        On the *remote*:
        
        Same as client, but for the *metadata files* either recursively inside
        ``path`` unless other specified via env var ``MMMETA_FILES_ROOT``
        
        This means, on the *remote* the *metadata files* and the *metadir* don’t
        need to be in the same path location.
        
        Or, speaking of clouds: *metadir* and *actual files* can exist in
        different buckets.
        
        Synchronization
        ^^^^^^^^^^^^^^^
        
        This package is totally agnostic about the remote storage backend (could
        be a local filesystem location or cloud storage) and doesn’t handle any
        of the local <-> remote synchronization.
        
        Therefore the synchronization of the *metadir* ``./foo/_mmmeta`` is up
        to you with the tool of your choice.
        
        Store
        -----
        
        ``mmmeta`` ships with a simple key-value-store that can be used by both
        the *remote* and *client* to store some additional data. The store lives
        in the *metadir* ``./foo/_mmmeta/_store``
        
        You can store any values in it:
        
        .. code:: python
        
           from mmmeta import mmmeta
        
           m = mmmeta("./path/to/metadir/")
           m.store["new_files"] = 17
        
        any machine that `synchronizes <#synchronization>`__ the metadir can
        read these values:
        
        .. code:: python
        
           from mmmeta import mmmeta
        
           m = mmmeta("./path/to/metadir/")
           new_files = m.store["new_files"]  # 17
        
        For storing timestamps, there is a shorthand via the ``touch`` function:
        
        .. code:: python
        
           m.touch("my_ts_key")
        
        This will save the value of the current ``datetime.now()`` to the key
        ``my_ts_key``. The values are typed (``int``, ``float`` or
        ``timestamp``), so you can easily do something like this:
        
        .. code:: python
        
           from mmmeta import mmmeta
        
           m = mmmeta("./path/to/metadir/")
        
           if m.store["remote_last_updated"] > m.store["local_last_updated"]:
               # run scraper
        
        Installation
        ------------
        
        Requires python3. Virtualenv use recommended.
        
        Additional dependencies will be installed automatically:
        
        ::
        
           pip install mmmeta
        
        After this, you should be able to execute in your terminal:
        
        ::
        
           mmmeta --help
        
        You should as well be able to import it in your python scripts:
        
        .. code:: python
        
           from mmmeta import mmmeta
        
        cli
        ---
        
        .. code:: bash
        
           Usage: mmmeta [OPTIONS] COMMAND [ARGS]...
        
           Options:
             --metadir TEXT     Base path for reading meta info and storing state
                                [default: <current/working/dir>]
             --files-root TEXT  Base path for actual files to generate metadir from
                                [default: <current/working/dir>]
             --help             Show this message and exit.
        
           Commands:
             generate
             inspect
             update
        
        developement
        ------------
        
        Install testing requirements:
        
        ::
        
           make install
        
        Test:
        
        ::
        
           make test
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Utilities
