Metadata-Version: 2.1
Name: similarity_check
Version: 0.1.3
Summary: package for measuring the similarity of two texts
Author-email: meshari <meshari34343@gmail.com>
License-File: LICENSE
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Requires-Python: >=3.7
Requires-Dist: gensim
Requires-Dist: pandas
Requires-Dist: pyemd
Requires-Dist: sentence-transformers
Description-Content-Type: text/markdown

# similarity_check

similarity_check is a Python package for measuring the similarity of two texts.

## Installation

Use the package manager [pip](https://pip.pypa.io/en/stable/) to install similarity_check.

```bash
pip install similarity_check 
```

## Usage
### sentence_tranformer
#### documentation
* sentence_tranformer(source, target, source_col=None, target_col=None, only_include=None, model=None, lang='en'):
  * parameters:
    * source: a dataframe or list of input texts to find the closest matches for.
    * target: a dataframe or list of target texts to compare against.
    * target_group (optional): group ids for the targets, so that only a single target is matched per group; can be either a list of ids or a column name in the target dataframe.
    * source_col (partially optional): the source column name to match on, *must be specified for dataframe matching*.
    * target_col (partially optional): the target column name to match on, *must be specified for dataframe matching*.
    * only_include (optional): used only for dataframe matching; a list of column names to include in the target matches (pass an empty list to get only target_col).
    * model (optional): a SentenceTransformer model object to use instead of the default one, see the [documentation](https://www.sbert.net/) for more details.
    * lang (optional): the language of the model ('en'|'ar').
* sentence_tranformer.clean_data(remove_punct=True, remove_stop_words=True, stemm=False, lang='en'):
  * parameters:
    * remove_punct: boolean flag indicating whether to remove punctuation.
    * remove_stop_words: boolean flag indicating whether to remove stop words.
    * stemm: boolean flag indicating whether to apply stemming.
    * lang: the language of the text to clean ('en'|'ar').
* sentence_tranformer.match(topn=1, threshold=0.5, return_match_idx=False):
  * parameters:
    * topn: the number of matches to return.
    * threshold: the lowest acceptable score; matches below it are ignored.
    * return_match_idx: return an extra column for each match containing the index of the match within the targets.
  * returns:
    * a dataframe with three columns (source, target, score), two extra columns for each additional match (target_2, score_2, ...), and an optional extra column for each match containing the match index, if return_match_idx is set to True.
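Conceptually, sentence embeddings are compared with cosine similarity, and the topn/threshold parameters filter and rank the scored targets. The following is a minimal pure-Python sketch of that matching logic, for illustration only; the function names `cosine` and `top_matches` are invented here and are not part of the package's API:

```python
from math import sqrt

def cosine(u, v):
    # cosine similarity between two equal-length vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_matches(source_vec, target_vecs, topn=1, threshold=0.5):
    # score every target, drop those below the threshold,
    # and return the best `topn` as (target_index, score) pairs
    scored = [(i, cosine(source_vec, v)) for i, v in enumerate(target_vecs)]
    scored = [(i, s) for i, s in scored if s >= threshold]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# toy 2-d "embeddings" standing in for real sentence vectors
targets = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
matches = top_matches([1.0, 0.1], targets, topn=2, threshold=0.5)
```

Here `matches` keeps only the two targets whose similarity clears 0.5, ordered best first, which mirrors how the extra `target_2, score_2, ...` columns are populated.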
#### examples
The examples below use English so the output renders in the correct format; for Arabic matching, pass lang='ar' when constructing the sentence_tranformer object.
##### using lists
```python
from similarity_check.checkers import sentence_tranformer

X = ['test', 'remove test']
y =  ['tests', 'stop the test', 'testing']

### arabic example:
# X = ['حذف الاختبار', 'اختبار']
# y =  ['اختبارات', 'ايقاف الاختبار']
# st = sentence_tranformer(X, y, lang='ar')
# st.clean_data(lang='ar')

st = sentence_tranformer(X, y)
st.clean_data()
match_df = st.match(topn=4, return_match_idx=True, threshold=0.6)
```
output:
| source      |    score | prediction    |   match_idx |    score_2 | prediction_2   |   match_idx_2 |    score_3 | prediction_3   |   match_idx_3 | score_4   | prediction_4   | match_idx_4   |
|:------------|---------:|:--------------|------------:|-----------:|:---------------|--------------:|-----------:|:---------------|--------------:|:----------|:---------------|:--------------|
| test        | 0.922843 | tests         |           0 |   0.908599 | testing        |             2 |   0.721023 | stop the test  |             1 |           |                |               |
| remove test | 0.728872 | stop the test |           1 | nan        |                |           nan | nan        |                |           nan |           |                |               |
##### using dataframes
```python
import pandas as pd

X = pd.DataFrame({
    'text': ['test', 'remove test'],
    'id': [1, 2],
}
)
y = pd.DataFrame({
    'new_text': ['tests', 'stop the test', 'testing'],
    'new_id': [1, 2, 3],
    'tags': ['pos', 'neg', 'pos'],
    'num': [10, 22, 40],
    'day': [3, 5, 2],
}
)

st = sentence_tranformer(X, y, source_col='text', target_col='new_text', only_include=['new_id'])
st.clean_data()
match_df = st.match(topn=4, threshold=0.6)
```
output:
| text        |   id |   score_1 | new_text_1    |   new_id_1 |   score_2 | new_text_2   |   new_id_2 |   score_3 | new_text_3    |   new_id_3 | score_4   | new_text_4   | new_id_4   |
|:------------|-----:|----------:|:--------------|-----------:|----------:|:-------------|-----------:|----------:|:--------------|-----------:|:----------|:-------------|:-----------|
| test        |    1 |  0.922843 | tests         |          1 |  0.908599 | testing      |          3 |  0.721023 | stop the test |          2 |           |              |            |
| remove test |    2 |  0.728872 | stop the test |          2 |           |              |            |           |               |            |           |              |            |
### word_mover_distance
#### english
```python
# for medical use #
# from gensim.models import KeyedVectors
# download the model from here: https://github.com/ncbi-nlp/BioSentVec
# model = KeyedVectors.load_word2vec_format('BioWordVec_PubMed_MIMICIII_d200.vec.bin', binary=True)

# for general usage #
import gensim.downloader as api
from similarity_check.checkers import word_mover_distance

model = api.load('glove-wiki-gigaword-300')

X = ['test now', 'remove test']
y =  ['tests', 'stop the test']

wmd = word_mover_distance(X, y, model)
wmd.clean_data()
match_df = wmd.match(topn=3)
```
#### arabic
```python
from gensim.models import Word2Vec
from similarity_check.checkers import word_mover_distance

# download the embedding from here: https://github.com/bakrianoo/aravec (N-Grams Models, Wikipedia-SkipGram, Vec-Size:300)
model = Word2Vec.load('full_grams_sg_300_wiki/full_grams_sg_300_wiki.mdl')
# take the keyed vectors as the model
model = model.wv

X = ['حذف الاختبار', 'اختبار']
y =  ['اختبارات', 'ايقاف الاختبار']

wmd = word_mover_distance(X, y, model)
wmd.clean_data()
match_df = wmd.match(topn=3)
match_df
```
#### documentation
* word_mover_distance(source_names, target_names, model):
  * parameters:
    * source_names: a list of input texts to find the closest matches for.
    * target_names: a list of target texts to compare against.
    * model: a keyed vectors model (embeddings) to use, see the [documentation](https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html) for more details.
* word_mover_distance.clean_data(remove_punct=True, remove_stop_words=True, stemm=False, lang='en'):
  * parameters:
    * remove_punct: boolean flag indicating whether to remove punctuation.
    * remove_stop_words: boolean flag indicating whether to remove stop words.
    * stemm: boolean flag indicating whether to apply stemming.
    * lang: the language of the text to clean ('en'|'ar').
* word_mover_distance.match(topn=1, return_match_idx=False):
  * parameters:
    * topn: the number of matches to return.
    * return_match_idx: return an extra column for each match containing the index of the match within the target_names.
  * returns:
    * a dataframe with three columns (source, target, score), two extra columns for each additional match (target_2, score_2, ...), and an optional extra column for each match containing the match index, if return_match_idx is set to True.
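The clean_data flags map onto standard preprocessing steps: stripping punctuation, lowercasing, and removing stop words. A minimal pure-Python sketch of that behavior, for illustration only (the `clean` function and the tiny `STOP_WORDS` set are invented here and are not the package's actual implementation, which uses full stop-word lists and optional stemming):

```python
import string

# tiny illustrative stop-word list; real cleaners use full language-specific sets
STOP_WORDS = {"the", "a", "an", "to", "of"}

def clean(text, remove_punct=True, remove_stop_words=True):
    # mirrors the clean_data flags: strip punctuation, lowercase,
    # then optionally drop stop words
    if remove_punct:
        text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.lower().split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return " ".join(tokens)

print(clean("Stop the test!"))  # → "stop test"
```

Cleaning both the source and target texts this way before matching reduces spurious distance caused by punctuation and filler words.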
    
## Contributing
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

## License
[MIT](https://choosealicense.com/licenses/mit/)