Metadata-Version: 2.1
Name: forma
Version: 0.0.1
Summary: Automatic format error detection on tabular data
Home-page: https://github.com/dpoulopoulos/forma/tree/master/
Author: Dimitris Poulopoulos
Author-email: dimitris.a.poulopoulos@gmail.com
License: Apache Software License 2.0
Description: # Forma
        > Automatic format error detection on tabular data.
        
        
        Forma is an open-source library, written in Python, that enables automatic, domain-agnostic detection of format errors in tabular data. The library is a by-product of the research project [BigDataStack](https://bigdatastack.eu/).
        
        ## Install
        
        Run `pip install forma` to install the library in your environment.
        
        ## How to use
        
        We will work with the popular [MovieLens](https://grouplens.org/datasets/movielens/) dataset.
        
        ```python
        import pandas as pd
        
        # load the ratings file; fields are separated by '::', which requires the python parsing engine
        col_names = ['user_id', 'movie_id', 'rating', 'timestamp']
        ratings_df = pd.read_csv('../data/ratings.dat', delimiter='::', names=col_names, engine='python')
        ```
        
        ```python
        ratings_df.head()
        ```
        
        
        
        
        <div>
        <table border="1" class="dataframe">
          <thead>
            <tr style="text-align: right;">
              <th></th>
              <th>user_id</th>
              <th>movie_id</th>
              <th>rating</th>
              <th>timestamp</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <th>0</th>
              <td>1</td>
              <td>1193</td>
              <td>5</td>
              <td>978300760</td>
            </tr>
            <tr>
              <th>1</th>
              <td>1</td>
              <td>661</td>
              <td>3</td>
              <td>978302109</td>
            </tr>
            <tr>
              <th>2</th>
              <td>1</td>
              <td>914</td>
              <td>3</td>
              <td>978301968</td>
            </tr>
            <tr>
              <th>3</th>
              <td>1</td>
              <td>3408</td>
              <td>4</td>
              <td>978300275</td>
            </tr>
            <tr>
              <th>4</th>
              <td>1</td>
              <td>2355</td>
              <td>5</td>
              <td>978824291</td>
            </tr>
          </tbody>
        </table>
        </div>
        
        
        
        Initialize the detector, fit it on the data, and run detection. The result is a pandas DataFrame with an extra column `p`, which records the estimated probability that the row contains a format error.
        
        ```python
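        # assumes FormatDetector and PatternGenerator have already been imported from forma
        # initialize the detector
        detector = FormatDetector()
        # fit the detector on the first 100 rows
        detector.fit(ratings_df[:100], PatternGenerator(other='leaf'))
        # compute the format error probability for each fitted row
        assessed_df = detector.detect()
        
        # inspect the first rows of the result
        assessed_df.head()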
        ```
        
            100%|██████████| 4/4 [00:00<00:00, 222.64it/s]
        
        
        
        
        
        <div>
        <table border="1" class="dataframe">
          <thead>
            <tr style="text-align: right;">
              <th></th>
              <th>user_id</th>
              <th>movie_id</th>
              <th>rating</th>
              <th>timestamp</th>
              <th>p</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <th>0</th>
              <td>1</td>
              <td>1193</td>
              <td>5</td>
              <td>978300760</td>
              <td>0.041667</td>
            </tr>
            <tr>
              <th>1</th>
              <td>1</td>
              <td>661</td>
              <td>3</td>
              <td>978302109</td>
              <td>0.128333</td>
            </tr>
            <tr>
              <th>2</th>
              <td>1</td>
              <td>914</td>
              <td>3</td>
              <td>978301968</td>
              <td>0.128333</td>
            </tr>
            <tr>
              <th>3</th>
              <td>1</td>
              <td>3408</td>
              <td>4</td>
              <td>978300275</td>
              <td>0.041667</td>
            </tr>
            <tr>
              <th>4</th>
              <td>1</td>
              <td>2355</td>
              <td>5</td>
              <td>978824291</td>
              <td>0.041667</td>
            </tr>
          </tbody>
        </table>
        </div>
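        
        Because the result is an ordinary pandas DataFrame, the `p` column can be used to pull out the rows that most likely contain a format error. The snippet below is a minimal sketch of that idea, not part of Forma itself: it rebuilds the five rows shown above by hand so that it runs without the dataset, and the `threshold` value is a hypothetical cut-off chosen only for illustration.
        
        ```python
        import pandas as pd
        
        # Hand-built stand-in for the detector output (the five rows shown above).
        assessed_df = pd.DataFrame({
            'user_id': [1, 1, 1, 1, 1],
            'movie_id': [1193, 661, 914, 3408, 2355],
            'rating': [5, 3, 3, 4, 5],
            'timestamp': [978300760, 978302109, 978301968, 978300275, 978824291],
            'p': [0.041667, 0.128333, 0.128333, 0.041667, 0.041667],
        })
        
        # Hypothetical cut-off, not a value recommended by Forma; tune it for your data.
        threshold = 0.1
        
        # Keep the rows whose error probability exceeds the cut-off, most suspicious first.
        suspicious_df = (
            assessed_df[assessed_df['p'] > threshold]
            .sort_values('p', ascending=False, kind='stable')
        )
        
        print(suspicious_df[['movie_id', 'p']])
        ```
        
        In practice you would apply the same filter directly to the DataFrame returned by `detector.detect()` and review the selected rows by hand.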
        
        
        
Keywords: error detection,machine learning
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Requires-Python: >=3.6
Description-Content-Type: text/markdown
