Metadata-Version: 2.1
Name: pd2ml
Version: 0.1.0
Summary: A Python package that provides an efficient way to upload pandas DataFrame to MySQL and download from the database table into a DataFrame.
Home-page: https://github.com/TanyeeZhang/pd2ml
Author: Tanyee Zhang
Author-email: 676761828@qq.com
License: UNKNOWN
Description: <div align="center">
         <div>
           <img src="https://user-images.githubusercontent.com/32212649/88253383-5510de00-cce4-11ea-8773-467c51b6c0bf.png">
          </div>
        </div>
        
        ## What is it?
        
        **pd2ml** (**p**andas **D**ataFrame **to** **M**ySQL **L**oader) is a Python package that provides an efficient way to upload Pandas `DataFrame` to `MySQL` and download from the database table into a DataFrame. The application of MySQL statements `LOAD DATA LOCAL INFILE` and `SELECT INTO OUTILE` are the most essential reason for such efficient upload and download.It turns out that the advantages of pd2ml will gradually become apparent as the amount of data increases, so it will be a good assistant when dealing with massive amounts of data.
        
        To make a digression, the name pd2ml can be pronounced as pudumal (/'pudum蓴:/), which is a homonym for purdy cat in Chinese, and this is also the inspiration for the logo of this package.
        
        ## How to get it
        
        You can install using the pip package manager by running
        ```sh
        pip install pd2ml
        ```
        Note that it is necessary for pd2ml to run in the `Python3` environment.
        
        ## Dependencies
        
        - [pandas](https://pandas.pydata.org/)
        - [NumPy](https://www.numpy.org)
        - [SQLAlchemy](https://www.sqlalchemy.org/)
        
        ## Examples
        
        It is veary easy to get started and easy to use.
        
        Firstly, start by reading a csv file into your pandas DataFrame, e.g. by using
        ```python
        import pandas as pd
        import sqlalchemy as sa
        engine = sa.create_engine('mysql+pymysql://username:pwd@localhost:3306/db?charset=utf8&local_infile=1')
        df = pd.read_csv('stock.csv')
        ```
        To upload `df` to database, run
        ```python
        from pd2ml import Loader
        Loader(engine).load_to(df, 'stock')
        ```
        To download `df` from databse, run
        ```python
        from pd2ml import Loader
        df = Loader(engine).load_from('stock')
        print(df)
        ```
        In particular, it supports pandas extension, so you can also use it like this
        ```python
        import pd2ml
        # Call it directly just like any other DataFrame's native methods.
        # load to db
        df.load_to('stock', engine)
        # load from db
        df = pd.DataFrame()
        df = df.load_from('stock', engine)
        print(df)
        ```
        What's more, it also works well with multi-processing or multi-threading
        ```python
        # from multiprocessing.pool import ThreadPool as Pool
        from multiprocessing.pool import Pool
        from utils import split_dataframe, load_to, load_from_
        # load to db
        with Loader(engine):
            for d in split_dataframe(df, chunk_size):
                pool.apply_async(load_to, (d, 'stock'))
        # load from db
        with Loader(engine):
            for table in tables:
                result = pool.apply_async(load_from_, (table, get_engine, ))
                result_arr.append(result)
        ```
        For more details about examples, please see [here](https://github.com/TanyeeZhang/pd2ml/tree/master/examples).
        
        ## Tips
        
        To ensure that pd2ml works well, here are some tips and suggestions
        - It is essential to add parameters `infile_local=1` when connecting to the database
        - To make sure pd2ml works, it must be set `secure-file-priv=""` in MySQL configuration file `my.ini` or `my.cnf`
        - Meanwhile, to maximize efficiency, the value `innodb_buffer_pool_size` may be adjusted appropriately in configuration file
        
        ## Performance
        
        To estimate the performance of pd2ml, I did a preliminary rough test on my laptop.
        
        #### Test Environment
        
        - MySQL 5.7
        - Windows 10
        - ThinkPad of 4 cores and 8G RAM
        
        #### Test Items
        
        I created a table named stock, this table has 6 fields, of which 2 primary keys, 3 indexes. Then
        
        - Use pd2ml and multi-value insert SQL (`INSERT INTO ... VALUES (), () ...`) to insert 10000/100000/1000000 records 10 times into the empty table respectively, the average elapsed time is shown in the figure
        
        ![compare_1](https://user-images.githubusercontent.com/32212649/89024505-ce997380-d357-11ea-99c5-86049a12fe7d.png)
        
        - Use p2dml and multi-value replace SQL (`REPLACE INTO ... VALUES (), () ...`) to insert 10000/100000/1000000 records 10 times into the non-empty table, the average elapsed time is shown in the figure
        
        ![compare_2](https://user-images.githubusercontent.com/32212649/89024771-38198200-d358-11ea-948f-5993fc5095e5.png)
        
        - Use pd2ml and pandas native method `read_sql_table` to read 10000/100000/1000000 records 10 times from the table as DATAFRAME, the average elapsed time is shown in the figure
        
        ![compare_3](https://user-images.githubusercontent.com/32212649/89024806-48316180-d358-11ea-963d-6c29cefc14a7.png)
        
        #### Conclusion
        
        As the amount of data increases, the speed advantage of pd2ml becomes more and more obvious. The efficiency of writing into the database is at least 20% faster, and reading from the databse is improved three to four times.
        
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Description-Content-Type: text/markdown
