Metadata-Version: 2.1
Name: dfsummarizer
Version: 0.1.6
Summary: Python command line application to summarize a CSV or TSV dataset.
Home-page: http://john-hawkins.github.io
Author: John Hawkins
Author-email: johnc@getting-data-science-done.com
License: MIT
Project-URL: Documentation, http://dfsummarizer.readthedocs.io
Project-URL: Source, https://github.com/john-hawkins/dfsummarizer
Project-URL: Tracker, https://github.com/john-hawkins/dfsummarizer/issues
Description: dfsummarizer  
        =====================================================
        
        [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
        ![build](https://github.com/john-hawkins/dfsummarizer/workflows/build/badge.svg)
        [![PyPI](https://img.shields.io/pypi/v/dfsummarizer.svg)](https://pypi.org/project/dfsummarizer)
        [![Documentation Status](https://readthedocs.org/projects/dfsummarizer/badge/?version=latest)](https://dfsummarizer.readthedocs.io/en/latest/?badge=latest)
        
        This is an application to summarize the variables in a data frame.
        It will accept a CSV, TSV or XLS file and produce a table summarizing 
        all columns individually.
        
        This was motivated by the fact that the default summary of a pandas data
        frame (`describe()`) ignores non-numeric columns and omits several common
        analytical measures: the number of unique values, the number of missing
        values, min and max dates, and min, mean and max string lengths.
        
        Output can be generated as either LaTeX or Markdown.
        
        Released and distributed via setuptools/PyPI/pip for Python 3.
         
        Additional detail is available in the [companion blog post](https://john-hawkins.github.io/posts/2020/07/dfsummarizer-dataframe-summarizer-application/).
        
        ## Notes
        
        The initial implementation can handle larger files by chunking the data and iteratively
        building statistics. All statistics are robust to chunking except for the estimate of the
        proportion of unique values, which is approximate. We use a simple implementation of the
        Flajolet-Martin algorithm based on the implementation by [Javia Jinkal](https://github.com/javiajinkal/Flajolet-Martin).
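        
        As a rough illustration of building statistics chunk by chunk, the sketch below
        accumulates the null share and the min, mean and max string length of one column
        using only running totals. It is not the package's actual code: the function names
        `iter_chunks` and `summarize_lengths`, the tiny chunk size, and the rule that an
        empty field counts as missing are all assumptions made for the example.
        
        ```python
        import csv
        import io
        from itertools import islice
        
        
        def iter_chunks(reader, chunk_size):
            """Yield lists of at most chunk_size rows from a row iterator."""
            while True:
                chunk = list(islice(reader, chunk_size))
                if not chunk:
                    return
                yield chunk
        
        
        def summarize_lengths(handle, column, chunk_size=2):
            """Accumulate null share and min/mean/max string length chunk by chunk."""
            rows = nulls = total_len = 0
            min_len = max_len = None
            for chunk in iter_chunks(csv.DictReader(handle), chunk_size):
                for row in chunk:
                    rows += 1
                    value = row[column]
                    if value == "":  # assumption: an empty field means missing
                        nulls += 1
                        continue
                    n = len(value)
                    total_len += n
                    min_len = n if min_len is None else min(min_len, n)
                    max_len = n if max_len is None else max(max_len, n)
            seen = rows - nulls
            return {
                "nulls": nulls / rows if rows else 0.0,
                "min": min_len,
                "mean": total_len / seen if seen else None,
                "max": max_len,
            }
        
        
        # Hypothetical sample data, held in memory instead of a file on disk.
        text = "id,state\nS001,NSW\nS002,\nS003,VIC\nS004,QLD\nS005,NSW\n"
        summary = summarize_lengths(io.StringIO(text), "state")
        print(summary)
        ```
        
        Because only counts, sums and extremes are carried between chunks, the result
        does not depend on the chunk size. The number of unique values cannot be tracked
        this way without remembering every value seen, which is why an estimator is used.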
        
        This [review article by Phillip Gibbons](https://www.cs.cmu.edu/~gibbons/Phillip%20B.%20Gibbons_files/Distinct-Values-Estimation-over-Data-Streams-PBGibbons.pdf) gives a great overview of the alternatives.
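        
        The distinct-value estimation mentioned in these notes can be sketched as follows.
        This is a minimal Flajolet-Martin style estimator written for illustration, not the
        code used by dfsummarizer or by the referenced repository. The function names, the
        use of salted `hashlib.blake2b` as the hash family, and the choice of 64 hash
        functions are assumptions. Each hash function keeps a bitmap of the trailing-zero
        counts it has seen, and the estimate is `2 ** R / 0.77351`, where `R` is the average
        position of the lowest unset bit across bitmaps.
        
        ```python
        import hashlib
        
        PHI = 0.77351  # correction factor from the Flajolet-Martin analysis
        
        
        def trailing_zeros(n, width=32):
            """Number of trailing zero bits in n (width if n is zero)."""
            if n == 0:
                return width
            return (n & -n).bit_length() - 1
        
        
        def lowest_zero_bit(bitmap):
            """Index of the lowest bit that is not set in bitmap."""
            r = 0
            while bitmap & (1 << r):
                r += 1
            return r
        
        
        def fm_estimate(values, num_hashes=64):
            """Estimate the number of distinct values in a single pass."""
            bitmaps = [0] * num_hashes
            for value in values:
                data = str(value).encode("utf-8")
                for seed in range(num_hashes):
                    # A different salt per seed gives an independent 32-bit hash.
                    digest = hashlib.blake2b(
                        data, digest_size=4, salt=seed.to_bytes(2, "big")
                    ).digest()
                    h = int.from_bytes(digest, "big")
                    bitmaps[seed] |= 1 << trailing_zeros(h)
            mean_r = sum(lowest_zero_bit(b) for b in bitmaps) / num_hashes
            return 2 ** mean_r / PHI
        
        
        # Hypothetical stream with 1000 distinct values.
        data = [f"user-{i}" for i in range(1000)]
        estimate = fm_estimate(data)
        print(round(estimate))
        ```
        
        The bitmaps are small integers that can be updated one chunk at a time, and repeated
        values leave them unchanged, so memory use does not grow with the number of rows.
        The price is that the result is an approximation rather than an exact count.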
        
        
        ## Usage
        
        You can use this application in multiple ways.
        
        Use the runner:
        
        ```
        ./dfsummarizer-runner.py markdown data/test.csv > markdown_test.md
        ```
        
        This command was used to generate the Markdown [output test file](markdown_test.md).
        
        Invoke the directory as a package:
        
        ```
        python -m dfsummarizer markdown data/test.csv
        ```
        
        Or simply install the package and use the command line application directly.
        
        
        ## Installation
        
        Installation from the source tree:
        
        ```
        python setup.py install
        ```
        
        Or via pip from PyPI:
        
        ```
        pip install dfsummarizer
        ```
        
        
        Now the `dfsummarizer` command is available:
        
        ```
        dfsummarizer markdown test.csv
        ```
        
        This will produce a Markdown table summarizing the contents of the CSV
        file `test.csv`.
        
        
        ## Acknowledgements
        
        Python package built using the
        [bootstrap cmdline template](https://github.com/jgehrcke/python-cmdline-bootstrap)
        by [jgehrcke](https://github.com/jgehrcke).
        
        
        
        | Name     | Type   | Unique Vals | Nulls   | Mode                     |  Min       |  Mean      |  Max       |
        | ----     | ------ | ----------- | ------- | ----                     |  ---       |  ----      |  ---       |
        | id       | Char   |           6 |    0.0% |                     S001 |          4 |        4.0 |          4 |
        | opening  | Date   |           6 |    0.0% |      2019-01-01 00:00:00 | 2019-01-01 | 2019-04-18 | 2019-07-12 |
        | first    | Bool   |           2 |   16.7% |                       NO |        0.0 |        0.4 |        1.0 |
        | last     | Bool   |           2 |   50.0% |                      NaN |          0 |      0.333 |          1 |
        | state    | Char   |           3 |   16.7% |                      NSW |        3.0 |        3.0 |        3.0 |
        | balance  | Float  |           5 |    0.0% |                    500.0 |      200.0 |    1093.55 |     4230.9 |
        | duration | Float  |           3 |   33.3% |                     24.0 |       12.0 |       21.0 |       24.0 |
        | years    | Int    |           3 |    0.0% |                        2 |          2 |        3.0 |          4 |
        | flag     | Float  |           2 |   66.7% |                      NaN |        1.0 |        1.0 |        1.0 |
        | comments | Char   |           6 |    0.0% | Combined savings account |          9 |     21.167 |         35 |
        
Platform: UNKNOWN
Description-Content-Type: text/markdown
