Metadata-Version: 2.1
Name: drfp
Version: 0.3.2
Summary: An NLP-inspired chemical reaction fingerprint based on basic set arithmetic.
Home-page: https://github.com/reymond-group/drfp
Author: Daniel Probst
Author-email: daniel.probst@hey.com
License: MIT
Project-URL: Documentation, https://github.com/reymond-group/drfp
Project-URL: Source, https://github.com/reymond-group/drfp
Project-URL: Twitter, https://twitter.com/skepteis
Platform: any
Classifier: Development Status :: 5 - Production/Stable
Classifier: Topic :: Scientific/Engineering :: Chemistry
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Environment :: Console
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: POSIX :: Linux
Classifier: Operating System :: Unix
Classifier: Operating System :: MacOS
Classifier: Operating System :: Microsoft :: Windows
Description-Content-Type: text/markdown; charset=UTF-8
Provides-Extra: testing
License-File: LICENSE.txt

![test workflow](https://github.com/reymond-group/drfp/actions/workflows/tests.yml/badge.svg)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.5268144.svg)](https://doi.org/10.5281/zenodo.5268144)


# DRFP


An NLP-inspired chemical reaction fingerprint based on basic set arithmetic.

Read the associated [open access article](https://pubs.rsc.org/en/content/articlehtml/2022/dd/d1dd00006c)


## Description

Predicting the nature and outcome of reactions using computational methods is an important tool to accelerate chemical research. The recent application of deep learning-based learned fingerprints to reaction classification and reaction yield prediction has shown an impressive increase in performance compared to previous methods such as DFT- and structure-based fingerprints. However, learned fingerprints require large training data sets, are inherently biased, and are based on complex deep learning architectures. Here we present the differential reaction fingerprint *DRFP*. The *DRFP* algorithm takes a reaction SMILES as an input and creates a binary fingerprint based on the symmetric difference of two sets containing the circular molecular n-grams generated from the molecules listed left and right from the reaction arrow, respectively, without the need for distinguishing between reactants and reagents. We show that *DRFP* outperforms DFT-based fingerprints in reaction yield prediction and other structure-based fingerprints in reaction classification, and reaching the performance of state-of-the-art learned fingerprints in both tasks while being data-independent.

## Getting Started
The best way to start exploring DRFP is on binder.
A notebook that gets you started on creating and using DRFP:

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/reymond-group/drfp/HEAD?filepath=notebooks%2F01_fingerprinting.ipynb)

A notbook that explains how you can use SHAP to analyse and interpret your machine learning models when using DRFP:

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/reymond-group/drfp/HEAD?filepath=notebooks%2F02_model_explainability.ipynb)

## Installation and Usage
*DRFP* can be installed from pypi using `pip install drfp`. However, it depends on [RDKit](https://www.rdkit.org/) which is best [installed using conda](https://www.rdkit.org/docs/Install.html).

Once DRFP is installed, there are two ways you can use it. You can use the cli app `drfp` or the library provided by the package.

### CLI
```bash
drfp my_rxn_smiles.txt my_rxn_fps.pkl -d 512
```

This will create a pickle dump containing an numpy ndarray containing DRFP fingerprints with a dimensionality of 512. To also export the mapping, use the flag `--mapping`. This will create the additional file `my_rxn_fps.map.pkl`. You can call `drfp --help` to show all available flags and options.

### Library
Following is a basic exmple of how to use DRFP in a Python script.
```python
from drfp import DrfpEncoder

rxn_smiles = [
    "CO.O[C@@H]1CCNC1.[C-]#[N+]CC(=O)OC>>[C-]#[N+]CC(=O)N1CC[C@@H](O)C1",
    "CCOC(=O)C(CC)c1cccnc1.Cl.O>>CCC(C(=O)O)c1cccnc1",
]

fps = DrfpEncoder.encode(rxn_smiles)
```

The variable `fps` now points to a list containing the fingerprints for the two reaction SMILES as numpy arrays.


## Documentation

The library contains the class `DrfpEncoder` with one public method `encode`.

| `DrfpEncoder.encode()` | Description | Type | Default |
|-|-|-|-|
| `X` | An iterable (e.g. a list) of reaction SMILES or a single reaction SMILES to be encoded | `Iterable` or `str` |  |
| `n_folded_length` | The folded length of the fingerprint (the parameter for the modulo hashing) | `int` | `2048` |
| `min_radius` | The minimum radius of a substructure (0 includes single atoms) | `int` | `0` |
| `radius` | The maximum radius of a substructure | `int` | `3` |
| `rings` | Whether to include full rings as substructures | `bool` | `True` |
| `mapping` |  Return a feature to substructure mapping in addition to the fingerprints. If true, the return signature of this method is `Tuple[List[np.ndarray], Dict[int, Set[str]]]` | `bool` | `False` |

# Reproduce
Want to reproduce the results in our paper? You can find all the data in the `data` folder and encoding and training scripts in the `scripts` folder.

# Cite Us
```
@article{probst2022reaction,
  title={Reaction Classification and Yield Prediction using the Differential Reaction Fingerprint DRFP},
  author={Probst, Daniel and Schwaller, Philippe and Reymond, Jean-Louis},
  journal={Digital Discovery},
  year={2022},
  publisher={Royal Society of Chemistry}
}
```
