Metadata-Version: 2.1
Name: mzcn
Version: 1.0
Summary: Facilitating the design, comparison and sharingof deep text matching models.
Home-page: https://github.com/yingdajun/mzcn/
Author: mzcn Authors:英大俊
Author-email: 2227495940@qq.com
License: Apache 2.0
Description: 
        # mzcn
        
        中文版本的matchzoo-py
        
        本库包是基于matchzoo-py的库包做的二次开发开源项目，MatchZoo 是一个通用的文本匹配工具包，它旨在方便大家快速的实现、比较、以及分享最新的深度文本匹配模型。
        <br>
        由于matchzoo-py面向英文预处理较为容易，中文处理则需要进行一定的预处理。为此本人在借鉴学习他人成功的基础上，改进了matchzoo-py包，开发mzcn库包。
        <br>
        mzcn库包对中文文本语料进行只保留文本、去除表情、去除空格、去除停用词等操作，使得使用者可以快速进行中文文本语料进行预处理，使用方法和matchzoo-py基本一致。
        
        # 快速入手
        
        ## 定义损失函数和指标
        
        
        ```python
        import torch
        import numpy as np
        import pandas as pd
        import mzcn as mz
        print('matchzoo version', mz.__version__)
        ranking_task = mz.tasks.Ranking(losses=mz.losses.RankHingeLoss())
        ranking_task.metrics = [
            mz.metrics.NormalizedDiscountedCumulativeGain(k=3),
            mz.metrics.NormalizedDiscountedCumulativeGain(k=5),
            mz.metrics.MeanAveragePrecision()
        ]
        print("`ranking_task` initialized with metrics", ranking_task.metrics)
        ```
        
            C:\Users\Administrator\Anaconda3\lib\requests\__init__.py:80: RequestsDependencyWarning: urllib3 (1.25.11) or chardet (3.0.4) doesn't match a supported version!
              RequestsDependencyWarning)
            
        
            matchzoo version 1.0
            `ranking_task` initialized with metrics [normalized_discounted_cumulative_gain@3(0.0), normalized_discounted_cumulative_gain@5(0.0), mean_average_precision(0.0)]
            
        
        ## 准备输入数据
        
        
        ```python
        def load_data(tmp_data,tmp_task):
        	df_data = mz.pack(tmp_data,task=tmp_task)
        	return df_data
        ##数据集，并且进行相应的预处理
        train=pd.read_csv('./data/train_data.csv').sample(100)
        dev=pd.read_csv('./data/dev_data.csv').sample(50)
        test=pd.read_csv('./data/test_data.csv').sample(30)
        train_pack_raw = load_data(train,ranking_task)
        dev_pack_raw = load_data(dev,ranking_task)
        test_pack_raw=load_data(test,ranking_task)
        ```
        
        ## 数据集预处理
        
        
        ```python
        preprocessor = mz.models.aNMM.get_default_preprocessor()
        ```
        
        
        ```python
        train_pack_processed = preprocessor.fit_transform(train_pack_raw)
        dev_pack_processed = preprocessor.transform(dev_pack_raw)
        test_pack_processed = preprocessor.transform(test_pack_raw)
        ```
        
            Processing text_left with chain_transform of ChineseRemoveBlack => ChineseSimplified => ChineseEmotion => IsChinese => ChineseStopRemoval => ChineseTokenizeDemo => Tokenize => Lowercase => PuncRemoval:   0%| | 0/92 [00:00<?, ?it/s]Building prefix dict from the default dictionary ...
            Loading model from cache C:\Users\ADMINI~1\AppData\Local\Temp\jieba.cache
            Loading model cost 1.062 seconds.
            Prefix dict has been built successfully.
            Processing text_left with chain_transform of ChineseRemoveBlack => ChineseSimplified => ChineseEmotion => IsChinese => ChineseStopRemoval => ChineseTokenizeDemo => Tokenize => Lowercase => PuncRemoval: 100%|█| 92/92 [00:01<00:00, 61.25it/s]
            Processing text_right with chain_transform of ChineseRemoveBlack => ChineseSimplified => ChineseEmotion => IsChinese => ChineseStopRemoval => ChineseTokenizeDemo => Tokenize => Lowercase => PuncRemoval: 100%|█| 93/93 [00:00<00:00, 216.90it/s]
            Processing text_right with append: 100%|████████████████████████████████████████████| 93/93 [00:00<00:00, 92741.39it/s]
            Building FrequencyFilter from a datapack.: 100%|████████████████████████████████████| 93/93 [00:00<00:00, 46575.55it/s]
            Processing text_right with transform: 100%|█████████████████████████████████████████| 93/93 [00:00<00:00, 46503.37it/s]
            Processing text_left with extend: 100%|█████████████████████████████████████████████| 92/92 [00:00<00:00, 15340.54it/s]
            Processing text_right with extend: 100%|████████████████████████████████████████████| 93/93 [00:00<00:00, 93073.32it/s]
            Building Vocabulary from a datapack.: 100%|██████████████████████████████████████| 817/817 [00:00<00:00, 203900.18it/s]
            Processing text_left with chain_transform of ChineseRemoveBlack => ChineseSimplified => ChineseEmotion => IsChinese => ChineseStopRemoval => ChineseTokenizeDemo => Tokenize => Lowercase => PuncRemoval: 100%|█| 92/92 [00:00<00:00, 218.14it/s]
            Processing text_right with chain_transform of ChineseRemoveBlack => ChineseSimplified => ChineseEmotion => IsChinese => ChineseStopRemoval => ChineseTokenizeDemo => Tokenize => Lowercase => PuncRemoval: 100%|█| 93/93 [00:00<00:00, 227.51it/s]
            Processing text_right with transform: 100%|█████████████████████████████████████████| 93/93 [00:00<00:00, 46536.66it/s]
            Processing text_left with transform: 100%|██████████████████████████████████████████| 92/92 [00:00<00:00, 30685.96it/s]
            Processing text_right with transform: 100%|█████████████████████████████████████████| 93/93 [00:00<00:00, 31014.57it/s]
            Processing length_left with len: 100%|██████████████████████████████████████████████| 92/92 [00:00<00:00, 92138.48it/s]
            Processing length_right with len: 100%|█████████████████████████████████████████████| 93/93 [00:00<00:00, 46497.83it/s]
            Processing text_left with chain_transform of ChineseRemoveBlack => ChineseSimplified => ChineseEmotion => IsChinese => ChineseStopRemoval => ChineseTokenizeDemo => Tokenize => Lowercase => PuncRemoval: 100%|█| 45/45 [00:00<00:00, 202.82it/s]
            Processing text_right with chain_transform of ChineseRemoveBlack => ChineseSimplified => ChineseEmotion => IsChinese => ChineseStopRemoval => ChineseTokenizeDemo => Tokenize => Lowercase => PuncRemoval: 100%|█| 50/50 [00:00<00:00, 215.62it/s]
            Processing text_right with transform: 100%|█████████████████████████████████████████| 50/50 [00:00<00:00, 49920.30it/s]
            Processing text_left with transform: 100%|██████████████████████████████████████████| 45/45 [00:00<00:00, 11257.53it/s]
            Processing text_right with transform: 100%|█████████████████████████████████████████| 50/50 [00:00<00:00, 50135.12it/s]
            Processing length_left with len: 100%|██████████████████████████████████████████████| 45/45 [00:00<00:00, 22512.37it/s]
            Processing length_right with len: 100%|█████████████████████████████████████████████| 50/50 [00:00<00:00, 12510.60it/s]
            Processing text_left with chain_transform of ChineseRemoveBlack => ChineseSimplified => ChineseEmotion => IsChinese => ChineseStopRemoval => ChineseTokenizeDemo => Tokenize => Lowercase => PuncRemoval: 100%|█| 30/30 [00:00<00:00, 209.93it/s]
            Processing text_right with chain_transform of ChineseRemoveBlack => ChineseSimplified => ChineseEmotion => IsChinese => ChineseStopRemoval => ChineseTokenizeDemo => Tokenize => Lowercase => PuncRemoval: 100%|█| 28/28 [00:00<00:00, 209.05it/s]
            Processing text_right with transform: 100%|█████████████████████████████████████████| 28/28 [00:00<00:00, 28062.25it/s]
            Processing text_left with transform: 100%|██████████████████████████████████████████| 30/30 [00:00<00:00, 10006.29it/s]
            Processing text_right with transform: 100%|█████████████████████████████████████████| 28/28 [00:00<00:00, 14031.12it/s]
            Processing length_left with len: 100%|███████████████████████████████████████████████| 30/30 [00:00<00:00, 7504.12it/s]
            Processing length_right with len: 100%|█████████████████████████████████████████████| 28/28 [00:00<00:00, 13924.65it/s]
            
        
        ## 生成训练数据
        
        
        ```python
        trainset = mz.dataloader.Dataset(
            data_pack=train_pack_processed,
            mode='pair',
            num_dup=2,
            num_neg=1
        )
        devset = mz.dataloader.Dataset(
            data_pack=dev_pack_processed
        )
        ```
        
        ## 生成管道
        
        
        ```python
        padding_callback = mz.models.aNMM.get_default_padding_callback()
        
        trainloader = mz.dataloader.DataLoader(
            dataset=trainset,
            stage='train',
            callback=padding_callback,
        )
        devloader = mz.dataloader.DataLoader(
            dataset=devset,
            stage='dev',
            callback=padding_callback,
        )
        ```
        
        ## 定义模型
        
        
        ```python
        model = mz.models.aNMM()
        model.params['task'] = ranking_task
        model.params["embedding_output_dim"]=100
        model.params["embedding_input_dim"]=preprocessor.context["embedding_input_dim"]
        model.params['dropout_rate'] = 0.1
        model.build()
        print(model)
        ```
        
            aNMM(
              (embedding): Embedding(319, 100, padding_idx=0)
              (matching): Matching()
              (hidden_layers): Sequential(
                (0): Sequential(
                  (0): Linear(in_features=200, out_features=100, bias=True)
                  (1): ReLU()
                )
                (1): Sequential(
                  (0): Linear(in_features=100, out_features=1, bias=True)
                  (1): ReLU()
                )
              )
              (q_attention): Attention(
                (linear): Linear(in_features=100, out_features=1, bias=False)
              )
              (dropout): Dropout(p=0.1, inplace=False)
              (out): Linear(in_features=1, out_features=1, bias=True)
            )
            
        
        ## 模型训练
        
        
        ```python
        optimizer = torch.optim.Adam(model.parameters(), lr = 3e-4)
        
        trainer = mz.trainers.Trainer(
            model=model,
            optimizer=optimizer,
            trainloader=trainloader,
            validloader=devloader,
            validate_interval=None,
            epochs=10
        )
        
        trainer.run()
        ```
        
        
            HBox(children=(IntProgress(value=0, max=1), HTML(value='')))
        
        
            [Iter-1 Loss-1.000]:
              Validation: normalized_discounted_cumulative_gain@3(0.0): 0.2121 - normalized_discounted_cumulative_gain@5(0.0): 0.2121 - mean_average_precision(0.0): 0.2121
            
            
        
        
            HBox(children=(IntProgress(value=0, max=1), HTML(value='')))
        
        
            [Iter-2 Loss-1.000]:
              Validation: normalized_discounted_cumulative_gain@3(0.0): 0.2121 - normalized_discounted_cumulative_gain@5(0.0): 0.2121 - mean_average_precision(0.0): 0.2121
            
            
        
        
            HBox(children=(IntProgress(value=0, max=1), HTML(value='')))
        
        
            [Iter-3 Loss-1.000]:
              Validation: normalized_discounted_cumulative_gain@3(0.0): 0.2121 - normalized_discounted_cumulative_gain@5(0.0): 0.2121 - mean_average_precision(0.0): 0.2121
            
            
        
        
            HBox(children=(IntProgress(value=0, max=1), HTML(value='')))
        
        
            [Iter-4 Loss-1.000]:
              Validation: normalized_discounted_cumulative_gain@3(0.0): 0.2121 - normalized_discounted_cumulative_gain@5(0.0): 0.2121 - mean_average_precision(0.0): 0.2121
            
            
        
        
            HBox(children=(IntProgress(value=0, max=1), HTML(value='')))
        
        
            [Iter-5 Loss-1.000]:
              Validation: normalized_discounted_cumulative_gain@3(0.0): 0.2121 - normalized_discounted_cumulative_gain@5(0.0): 0.2121 - mean_average_precision(0.0): 0.2121
            
            
        
        
            HBox(children=(IntProgress(value=0, max=1), HTML(value='')))
        
        
            [Iter-6 Loss-1.000]:
              Validation: normalized_discounted_cumulative_gain@3(0.0): 0.2121 - normalized_discounted_cumulative_gain@5(0.0): 0.2121 - mean_average_precision(0.0): 0.2121
            
            
        
        
            HBox(children=(IntProgress(value=0, max=1), HTML(value='')))
        
        
            [Iter-7 Loss-1.000]:
              Validation: normalized_discounted_cumulative_gain@3(0.0): 0.2121 - normalized_discounted_cumulative_gain@5(0.0): 0.2121 - mean_average_precision(0.0): 0.2121
            
            
        
        
            HBox(children=(IntProgress(value=0, max=1), HTML(value='')))
        
        
            [Iter-8 Loss-1.000]:
              Validation: normalized_discounted_cumulative_gain@3(0.0): 0.2121 - normalized_discounted_cumulative_gain@5(0.0): 0.2121 - mean_average_precision(0.0): 0.2121
            
            
        
        
            HBox(children=(IntProgress(value=0, max=1), HTML(value='')))
        
        
            [Iter-9 Loss-1.000]:
              Validation: normalized_discounted_cumulative_gain@3(0.0): 0.2121 - normalized_discounted_cumulative_gain@5(0.0): 0.2121 - mean_average_precision(0.0): 0.2121
            
            
        
        
            HBox(children=(IntProgress(value=0, max=1), HTML(value='')))
        
        
            [Iter-10 Loss-1.000]:
              Validation: normalized_discounted_cumulative_gain@3(0.0): 0.2121 - normalized_discounted_cumulative_gain@5(0.0): 0.2121 - mean_average_precision(0.0): 0.2121
            
            Cost time: 3.3411495685577393s
            
        
        # Install
        
        由于mzcn是依赖于matchzoo-py模型，所以一共有两种途径安装mzcn
        
        ### Install MatchZoo-py from Pypi:
        pip install mzcn
        
        ### Install MatchZoo-py from the Github source:
        
        git clone https://github.com/yingdajun/mzcn.git
        <br>
        cd mzcn
        <br>
        python setup.py install
        
        # Citation
        
        本人是第一次写库包，水平有限，希望能给大家带来使用的帮助，如果有不足的地方请多指教
        这里是所有引用过的库包
        
        ## matchzoo-py
        
        @inproceedings{Guo:2019:MLP:3331184.3331403,
         author = {Guo, Jiafeng and Fan, Yixing and Ji, Xiang and Cheng, Xueqi},
         title = {MatchZoo: A Learning, Practicing, and Developing System for Neural Text Matching},
         booktitle = {Proceedings of the 42Nd International ACM SIGIR Conference on Research and Development in Information Retrieval},
         series = {SIGIR'19},
         year = {2019},
         isbn = {978-1-4503-6172-9},
         location = {Paris, France},
         pages = {1297--1300},
         numpages = {4},
         url = {http://doi.acm.org/10.1145/3331184.3331403},
         doi = {10.1145/3331184.3331403},
         acmid = {3331403},
         publisher = {ACM},
         address = {New York, NY, USA},
         keywords = {matchzoo, neural network, text matching},
        } 
        
        ## CSDN的作者：SK-Berry的博文
        
        https://blog.csdn.net/sk_berry/article/details/104984599
        
Keywords: text matching models
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Operating System :: POSIX :: Linux
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3.6
Description-Content-Type: text/markdown
Provides-Extra: tests
