# similarity_check

similarity_check is a Python package for measuring the similarity of two texts.

## Installation

Use the package manager [pip](https://pip.pypa.io/en/stable/) to install similarity_check.

```bash
pip install similarity_check 
```

## Usage
### sentence transformer
#### documentation
* sentence_tranformer(source, target, source_col=None, target_col=None, threshold=0.5, only_include=None, model=None, lang='en'):
  * source: dataframe or list of input texts to find the closest match for.
  * target: dataframe or list of target texts to compare with.
  * target_group: group ids for the target, used to match only a single target per group; provide either a list of ids
  or the column name in the target dataframe.
  * source_col: the source column name used to match, *must be specified for dataframe matching*.
  * target_col: the target column name used to match, *must be specified for dataframe matching*.
  * threshold: the minimum score; matches scoring below it are ignored.
  * only_include: used only for dataframe matching; a list of target column names to include in the matches output.
  * model (optional): a sentence transformer model object to use instead of the default one; for more details, see [sbert.net](https://www.sbert.net/).
  * lang (optional): the language of the model ('en'|'ar').
* sentence_tranformer.clean_data(remove_punct=True, remove_stop_words=True, stemm=False, lang='en'):
  * remove_punct: boolean flag indicating whether to remove punctuation.
  * remove_stop_words: boolean flag indicating whether to remove stop words.
  * stemm: boolean flag indicating whether to apply stemming.
  * lang: language of the text to clean ('en'|'ar').
* sentence_tranformer.match(topn=1, return_match_idx=False):
  * topn: number of matches to return.
  * return_match_idx: return an extra column for each match containing the index of the match within the target.
  * returns: a data frame with 3 columns (source, target, score), two extra columns for each extra match (target_2, score_2 ...), and an optional extra column for each match containing the match index if return_match_idx is set to True.
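
To make the roles of `topn` and `threshold` concrete, here is a minimal, self-contained sketch of top-n matching on cosine similarity. It is not the package's implementation: the helper names `cosine` and `top_matches` and the toy 2-d vectors are made up for illustration, and treating a score equal to the threshold as a match is an assumption.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_matches(source_vecs, target_vecs, topn=1, threshold=0.5):
    """For each source vector, return up to `topn` (target_index, score)
    pairs whose score is at least `threshold`, best match first."""
    results = []
    for s in source_vecs:
        scored = [(idx, cosine(s, t)) for idx, t in enumerate(target_vecs)]
        # Drop matches below the threshold (assumption: equal scores are kept).
        kept = [pair for pair in scored if pair[1] >= threshold]
        kept.sort(key=lambda pair: pair[1], reverse=True)
        results.append(kept[:topn])
    return results


# Toy "embeddings", invented for illustration only.
sources = [[1.0, 0.0], [0.0, 1.0]]
targets = [[1.0, 0.0], [1.0, 1.0], [-1.0, 0.0]]

matches = top_matches(sources, targets, topn=2, threshold=0.6)
for source_idx, found in enumerate(matches):
    print(source_idx, found)
```

A source can receive fewer than `topn` matches when too few targets reach the threshold, which is why the output tables in the examples contain empty cells.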
#### examples
The examples below use English only so that the output renders in the correct format; for Arabic matching, set the `lang` argument of `sentence_tranformer` to 'ar'.
##### using lists
```python
from similarity_check.checkers import sentence_tranformer

X = ['test', 'remove test']
y = ['tests', 'stop the test', 'testing']

# Arabic example:
# X = ['حذف الاختبار', 'اختبار']
# y = ['اختبارات', 'ايقاف الاختبار']
# st = sentence_tranformer(X, y, lang='ar')


st = sentence_tranformer(X, y)
st.clean_data()
match_df = st.match(topn=4, return_match_idx=True, threshold=0.6)
```
output:
| source      |    score | prediction    |   match_idx |    score_2 | prediction_2   |   match_idx_2 |    score_3 | prediction_3   |   match_idx_3 | score_4   | prediction_4   | match_idx_4   |
|:------------|---------:|:--------------|------------:|-----------:|:---------------|--------------:|-----------:|:---------------|--------------:|:----------|:---------------|:--------------|
| test        | 0.922843 | tests         |           0 |   0.908599 | testing        |             2 |   0.721023 | stop the test  |             1 |           |                |               |
| remove test | 0.728872 | stop the test |           1 | nan        |                |           nan | nan        |                |           nan |           |                |               |
##### using dataframes
```python
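import pandas as pd
from similarity_check.checkers import sentence_tranformer

# source dataframe: the texts to find matches for
X = pd.DataFrame({
    'text': ['test', 'remove test'],
    'id': [1, 2],
})
# target dataframe: the candidate texts plus extra columns
y = pd.DataFrame({
    'new_text': ['tests', 'stop the test', 'testing'],
    'new_id': [1, 2, 3],
    'tags': ['pos', 'neg', 'pos'],
    'num': [10, 22, 40],
    'day': [3, 5, 2],
})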

st = sentence_tranformer(X, y, source_col='text', target_col='new_text', only_include=['new_id'])
st.clean_data()
match_df = st.match(topn=4, threshold=0.6)
```
output:
| text        |   id |   score_1 | new_text_1    |   new_id_1 |   score_2 | new_text_2   |   new_id_2 |   score_3 | new_text_3    |   new_id_3 | score_4   | new_text_4   | new_id_4   |
|:------------|-----:|----------:|:--------------|-----------:|----------:|:-------------|-----------:|----------:|:--------------|-----------:|:----------|:-------------|:-----------|
| test        |    1 |  0.922843 | tests         |          1 |  0.908599 | testing      |          3 |  0.721023 | stop the test |          2 |           |              |            |
| remove test |    2 |  0.728872 | stop the test |          2 |           |              |            |           |               |            |           |              |            |
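
The wide output above has one group of columns per match rank. If one row per match is easier to work with, the rows can be reshaped. The sketch below is a hedged illustration, not part of the package: it assumes the rows were exported as dictionaries (for example with pandas' `DataFrame.to_dict('records')`), that the column names follow the table above, and the helper `to_long` with its `id_cols` parameter is hypothetical.

```python
import math


def to_long(rows, id_cols, topn):
    """Turn wide match rows into one record per (source row, match rank)."""
    long_rows = []
    for row in rows:
        for rank in range(1, topn + 1):
            suffix = f"_{rank}"
            score = row.get(f"score{suffix}")
            # Skip ranks without a match (missing, None, or NaN score).
            if score is None or (isinstance(score, float) and math.isnan(score)):
                continue
            record = {col: row[col] for col in id_cols}
            record["rank"] = rank
            # Copy every column of this rank, dropping the rank suffix.
            for key, value in row.items():
                if key.endswith(suffix):
                    record[key[: -len(suffix)]] = value
            long_rows.append(record)
    return long_rows


# Rows mirroring the output table above.
rows = [
    {"text": "test", "id": 1,
     "score_1": 0.922843, "new_text_1": "tests", "new_id_1": 1,
     "score_2": 0.908599, "new_text_2": "testing", "new_id_2": 3,
     "score_3": 0.721023, "new_text_3": "stop the test", "new_id_3": 2},
    {"text": "remove test", "id": 2,
     "score_1": 0.728872, "new_text_1": "stop the test", "new_id_1": 2,
     "score_2": float("nan")},
]

long_rows = to_long(rows, id_cols=["text", "id"], topn=4)
for record in long_rows:
    print(record)
```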
### word mover distance
#### english
```python
# for medical use #
# from gensim.models import KeyedVectors
# download the model from here: https://github.com/ncbi-nlp/BioSentVec
# model = KeyedVectors.load_word2vec_format('BioWordVec_PubMed_MIMICIII_d200.vec.bin', binary=True)

# for general usage #
import gensim.downloader as api
from similarity_check.checkers import word_mover_distance

model = api.load('glove-wiki-gigaword-300')

X = ['test now', 'remove test']
y = ['tests', 'stop the test']

wmd = word_mover_distance(X, y, model)
wmd.clean_data()
match_df = wmd.match(topn=3)
```
#### arabic
```python
from gensim.models import Word2Vec
from similarity_check.checkers import word_mover_distance

# download the embedding from here: https://github.com/bakrianoo/aravec (N-Grams Models, Wikipedia-SkipGram, Vec-Size:300)
model = Word2Vec.load('full_grams_sg_300_wiki/full_grams_sg_300_wiki.mdl')
# use the KeyedVectors as the model
model = model.wv

X = ['حذف الاختبار', 'اختبار']
y = ['اختبارات', 'ايقاف الاختبار']

wmd = word_mover_distance(X, y, model)
wmd.clean_data()
match_df = wmd.match(topn=3)
match_df
```
#### documentation
* word_mover_distance(source_names, target_names, model):
  * source_names: a list of input texts to find the closest match for.
  * target_names: a list of target texts to compare with.
  * model (optional): a keyed vectors model (embeddings) to use; for more details, see the [gensim Word2Vec tutorial](https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html).
* word_mover_distance.clean_data(remove_punct=True, remove_stop_words=True, stemm=False, lang='en'):
  * remove_punct: boolean flag indicating whether to remove punctuation.
  * remove_stop_words: boolean flag indicating whether to remove stop words.
  * stemm: boolean flag indicating whether to apply stemming.
  * lang: language of the text to clean ('en'|'ar').
* word_mover_distance.match(topn=1, return_match_idx=False):
  * topn: number of matches to return.
  * return_match_idx: return an extra column for each match containing the index of the match within the target_names.
  * returns: a data frame with 3 columns (source, target, score), two extra columns for each extra match (target_2, score_2 ...), and an optional extra column for each match containing the match index if return_match_idx is set to True.
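
As a rough illustration of what the `clean_data` flags documented above control, the sketch below strips punctuation and stop words from English text using only the standard library. It is an assumption-laden stand-in, not the package's code: the helper `clean_text`, the lowercasing step, and the tiny `STOP_WORDS` set are all hypothetical, and stemming is left out.

```python
import string

# Tiny illustrative stop word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "is", "of", "to"}


def clean_text(text, remove_punct=True, remove_stop_words=True):
    """Lowercase `text`, then optionally strip punctuation and stop words."""
    text = text.lower()  # assumption: cleaning is case-insensitive
    if remove_punct:
        # Delete every ASCII punctuation character.
        text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    if remove_stop_words:
        tokens = [tok for tok in tokens if tok not in STOP_WORDS]
    return " ".join(tokens)


cleaned = [clean_text(t) for t in ["Stop the test!", "Remove test."]]
print(cleaned)  # ['stop test', 'remove test']
```

Cleaning both the source and target texts the same way keeps the comparison consistent, since the similarity is computed on the cleaned strings.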
## Contributing
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

## License
[MIT](https://choosealicense.com/licenses/mit/)