Metadata-Version: 2.1
Name: url_metadata
Version: 0.1.3
Summary: A cache which saves URL metadata and summarizes content
Home-page: https://github.com/seanbreckenridge/url_metadata
Author: Sean Breckenridge
Author-email: seanbrecke@gmail.com
License: http://www.apache.org/licenses/LICENSE-2.0
Description: [![PyPi version](https://img.shields.io/pypi/v/url_metadata.svg)](https://pypi.python.org/pypi/url_metadata) [![Python3.7|Python 3.8](https://img.shields.io/pypi/pyversions/url_metadata.svg)](https://pypi.python.org/pypi/url_metadata) [![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg?style=flat-square)](http://makeapullrequest.com)
        
        A cache which saves URL metadata and summarizes content
        
        This is meant to provide more context to any of my tools which use URLs. If I [watched some YouTube video](https://github.com/seanbreckenridge/mpv-sockets/blob/master/DAEMON.md) and I have a URL, I'd like to have the subtitles for it, so I can do a text search over all the videos I've watched. If I [read an article](https://github.com/seanbreckenridge/ffexport), I want the article text! This requests, parses, and abstracts away that data for me locally, so I can just do:
        
        ```python
        >>> from url_metadata import metadata
        >>> m = metadata("https://pypi.org/project/beautifulsoup4/")
        >>> len(m.info["images"])
        46
        >>> m.info["title"]
        'beautifulsoup4'
        >>> m.text_summary[:57]
        "Beautiful Soup is a library that makes it easy to scrape"
        ```
        
        If I ever request the same URL again, that info is grabbed from a local directory cache instead.
        
        ---
        
        ## Installation
        
        Requires `python3.7+`
        
        To install with pip, run:
        
            pip install url_metadata
        
        ---
        
        This uses:
        
        - [`lassie`](https://github.com/michaelhelmick/lassie) to get generic metadata: the title, description, OpenGraph information, and links to images/videos on the page
        - [`readability`](https://github.com/buriy/python-readability) to extract a cleaned-up summary of the HTML content
        - [`bs4`](https://pypi.org/project/beautifulsoup4/) to convert the parsed HTML to text (to allow for nicer text searching)
        - [`youtube_subtitles_downloader`](https://github.com/seanbreckenridge/youtube_subtitles_downloader) to get manual/autogenerated captions (converted to a `.srt` file) from YouTube URLs
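        The HTML-to-text step that the last two libraries handle can be approximated with the standard library alone. This is only a rough sketch of the idea using `html.parser` (it strips tags, but does not extract the main article the way `readability` does):

        ```python
        from html.parser import HTMLParser

        class TextExtractor(HTMLParser):
            """Collect visible text, skipping <script> and <style> contents."""

            def __init__(self):
                super().__init__()
                self.chunks = []
                self._skip_depth = 0

            def handle_starttag(self, tag, attrs):
                if tag in ("script", "style"):
                    self._skip_depth += 1

            def handle_endtag(self, tag):
                if tag in ("script", "style") and self._skip_depth:
                    self._skip_depth -= 1

            def handle_data(self, data):
                if not self._skip_depth and data.strip():
                    self.chunks.append(data.strip())

        def html_to_text(html):
            """Return the visible text of an HTML fragment, whitespace-normalized."""
            parser = TextExtractor()
            parser.feed(html)
            return " ".join(parser.chunks)
        ```

        For example, `html_to_text("<p>Beautiful <b>Soup</b></p>")` returns `'Beautiful Soup'`.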
        
        ---
        
        ### Usage
        
        In Python, this can be configured by using the `url_metadata.URLMetadataCache` class:
        
        ```python
           URLMetadataCache(loglevel: int = logging.WARNING,
                            subtitle_language: str = 'en',
                            sleep_time: int = 5,
                            cache_dir: Union[str, pathlib.Path, NoneType] = None) -> url_metadata.URLMetadataCache:
               """
               Main interface to the library
               Supply 'cache_dir' to override the default location.
               """
        
           get(self, url: str) -> url_metadata.model.Metadata
               """
               Gets metadata/summary for a URL.
               Saves the parsed information in a local data directory.
               If the URL already has cached data locally, returns that instead.
               """
        
           in_cache(self, url: str) -> bool
               """
               Returns True if the URL already has cached information
               """
        
           request_data(self, url: str) -> url_metadata.model.Metadata
               """
               Given a URL:
        
               If this is a YouTube URL, requests the YouTube subtitles
               Uses lassie to grab metadata
               Parses the HTML text with readability
               Uses bs4 to convert that text into a plaintext summary
               """
        ```
        
        For example:
        
        ```python
        import logging
        from url_metadata import URLMetadataCache
        
        # make requests every 2 seconds
        # debug logs
        # save to a folder in my home directory
        cache = URLMetadataCache(loglevel=logging.DEBUG, sleep_time=2, cache_dir="~/mydata")
        c = cache.get("https://github.com/seanbreckenridge/url_metadata")
        # just request information, don't read/save to cache
        data = cache.request_data("https://www.wikipedia.org/")
        ```
        
        ---
        
        The CLI interface lets you configure much the same, and provides some utility commands to get/list information from the cache.
        
        ```
        $ url_metadata
        Usage: url_metadata [OPTIONS] COMMAND [ARGS]...
        
        Options:
          --cache-dir PATH          Override default directory cache location
          --debug / --no-debug      Increase log verbosity
          --sleep-time INTEGER      How long to sleep between requests
          --subtitle-language TEXT  Subtitle language for Youtube captions
          --help                    Show this message and exit.
        
        Commands:
          cachedir  Prints the location of the local cache directory
          export    Print all cached information as JSON
          get       Get information for one or more URLs
          list      List all cached URLs
        ```

        ### CLI Examples
        
        The `get` command emits `JSON`, so it can be used with other tools (e.g. [`jq`](https://stedolan.github.io/jq/)), like:
        
        ```shell
        $ url_metadata get "https://click.palletsprojects.com/en/7.x/arguments/" \
            | jq -r '.[] | .text_summary' | head -n5
        Arguments
        Arguments work similarly to options but are positional.
        They also only support a subset of the features of options due to their
        syntactical nature. Click will also not attempt to document arguments for
        you and wants you to document them manually
        ```
        
        ```shell
        $ url_metadata export | jq -r '.[] | .info | .title'
        seanbreckenridge/youtube_subtitles_downloader
        Arguments — Click Documentation (7.x)
        ```
        
        ```shell
        $ url_metadata list --location
        /home/sean/.local/share/url_metadata/data/b/a/a/c8e05501857a3c7d2d1a94071c68e/000
        /home/sean/.local/share/url_metadata/data/9/4/4/1c380792a3d62302e1137850d177b/000
        ```
        
        ```shell
        # to make a backup of the cache directory
        $ tar -cvzf url_metadata.tar.gz "$(url_metadata cachedir)"
        ```
        
        The CLI is accessible through the `url_metadata` script or `python3 -m url_metadata`.
        
        ---
        
        This stores all of that information as individual files in a cache directory (located using [`appdirs`](https://github.com/ActiveState/appdirs)). In particular, it `MD5`-hashes the URL and stores information like:
        
        ```
        .
        └── 7
            └── b
                └── 0
                    └── d952fd7265e8e4cf10b351b6f8932
                        └── 000
                            ├── epoch_timestamp.txt
                            ├── key
                            ├── metadata.json
                            ├── subtitles.srt
                            ├── summary.html
                            └── summary.txt
        ```
        
        You're free to delete any of the directories in the cache if you want. This doesn't maintain a strict index; it hashes the URL and then searches for a matching `key` file. See the comments [here](https://github.com/seanbreckenridge/url_metadata/blob/master/src/url_metadata/cache.py) for implementation details.
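        The directory for a URL can be derived from its hash alone. This sketch mirrors the layout shown above (the first three hex digits of the `MD5` hash become nested directories, the remaining digits the leaf name); it is an illustration of the scheme, not the library's actual code, and the trailing `000` is assumed here to be an index for colliding URLs:

        ```python
        import hashlib
        from pathlib import Path

        def cache_path(base_dir, url):
            """Map a URL to its nested cache directory, following the layout above."""
            digest = hashlib.md5(url.encode("utf-8")).hexdigest()  # 32 hex characters
            # first three characters become nested directories, the rest the leaf name;
            # "000" is assumed to be an index used when different URLs collide
            return Path(base_dir) / digest[0] / digest[1] / digest[2] / digest[3:] / "000"
        ```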
        
        By default, this waits 5 seconds between requests. Since all the info is cached, I use this by requesting info for everything from one data source (e.g. my bookmarks, or videos I've watched recently) in a loop in the background, which saves all the information to my computer. The next time I run that same loop, it doesn't have to make any requests; it just reads everything from the local cache.
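        A background loop like that can be sketched as follows (`fetch_all` is a hypothetical helper, not part of the library); it sleeps only after cache misses, so a re-run over already-cached URLs finishes without waiting:

        ```python
        import time

        def fetch_all(cache, urls, sleep_time=5):
            """Request metadata for each URL, sleeping only between real requests."""
            results = []
            for url in urls:
                was_cached = cache.in_cache(url)
                results.append(cache.get(url))
                if not was_cached:
                    time.sleep(sleep_time)  # be polite between actual network requests
            return results
        ```

        On the first run this pauses between every uncached URL; on later runs everything is already cached, so the loop returns immediately.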
        
        Originally created for [`HPI`](https://github.com/seanbreckenridge/HPI).
        
        ---
        
        ### Testing
        
            git clone 'https://github.com/seanbreckenridge/url_metadata'
            cd ./url_metadata
            git submodule update --init
            pip install '.[testing]'
            mypy ./src/url_metadata/
            pytest
        
Keywords: url cache metadata youtube subtitles
Platform: UNKNOWN
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Description-Content-Type: text/markdown
Provides-Extra: testing
