Metadata-Version: 2.1
Name: easy-entrez
Version: 0.3.4
Summary: Python REST API for Entrez E-Utilities: stateless, easy to use, reliable.
Home-page: https://github.com/krassowski/easy-entrez
Author: Michal Krassowski
Author-email: krassowski.michal+pypi@gmail.com
License: MIT
Keywords: entrez,pubmed,e-utilities,ncbi,rest,api,dbsnp,literature,mining
Classifier: Development Status :: 5 - Production/Stable
Classifier: License :: OSI Approved :: GNU Lesser General Public License v3 (LGPLv3)
Classifier: Operating System :: Microsoft :: Windows
Classifier: Operating System :: POSIX :: Linux
Classifier: Operating System :: MacOS
Classifier: Topic :: Utilities
Classifier: Topic :: Database
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Typing :: Typed
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Description-Content-Type: text/markdown
Provides-Extra: with_progress_bars
Provides-Extra: with_parsing_utils
License-File: LICENSE

# easy-entrez

![tests](https://github.com/krassowski/easy-entrez/workflows/tests/badge.svg)
![CodeQL](https://github.com/krassowski/easy-entrez/workflows/CodeQL/badge.svg)
[![Documentation Status](https://readthedocs.org/projects/easy-entrez/badge/?version=latest)](https://easy-entrez.readthedocs.io/en/latest/?badge=latest)

Python REST API for Entrez E-Utilities, aiming to  be easy to use and reliable.

Easy-entrez:

- makes common tasks easy thanks to simple Pythonic API,
- is typed and integrates well with mypy,
- is tested on Windows, Mac and Linux across Python 3.7, 3.8, 3.9 and 3.10,
- is limited in scope, allowing to focus on the reliability of the core code,
- does not use the stateful API as it is [error-prone](https://gitlab.com/ncbipy/entrezpy/-/issues/7) as seen on example of the alternative *entrezpy*.


### Examples

```python
from easy_entrez import EntrezAPI

entrez_api = EntrezAPI(
    'your-tool-name',
    'e@mail.com',
    # optional
    return_type='json'
)

# find up to 10 000 results for cancer in human
result = entrez_api.search('cancer AND human[organism]', max_results=10_000)

# data will be populated with JSON or XML (depending on the `return_type` value)
result.data
```

See more in the [Demo notebook](./Demo.ipynb) and [documentation](https://easy-entrez.readthedocs.io/en/latest).

For a real-world example (i.e. used for [this publication](https://www.frontiersin.org/articles/10.3389/fgene.2020.610798/full)) see notebooks in [multi-omics-state-of-the-field](https://github.com/krassowski/multi-omics-state-of-the-field) repository.

#### Fetching genes for a variant from dbSNP

Fetch the SNP record for `rs6311`:

```python
rs6311 = entrez_api.fetch(['rs6311'], max_results=1, database='snp').data[0]
rs6311
```

Display the result:

```python
from easy_entrez.parsing import xml_to_string

print(xml_to_string(rs6311))
```

Find the gene names for `rs6311`:

```python
namespaces = {'ns0': 'https://www.ncbi.nlm.nih.gov/SNP/docsum'}
genes = [
    name.text
    for name in rs6311.findall('.//ns0:GENE_E/ns0:NAME', namespaces)
]
print(genes)
```

> `['HTR2A']`

Fetch data for multiple variants at once:

```python
result = entrez_api.fetch(['rs6311', 'rs662138'], max_results=10, database='snp')
gene_names = {
    'rs' + document_summary.get('uid'): [
        element.text
        for element in document_summary.findall('.//ns0:GENE_E/ns0:NAME', namespaces)
    ]
    for document_summary in result.data
}
print(gene_names)
```

> `{'rs6311': ['HTR2A'], 'rs662138': ['SLC22A1']}`

#### Obtaining the chromosomal position from SNP rsID number

```python
from pandas import DataFrame

result = entrez_api.fetch(['rs6311', 'rs662138'], max_results=10, database='snp')

variant_positions = DataFrame([
    {
        'id': 'rs' + document_summary.get('uid'),
        'chromosome': chromosome,
        'position': position
    }
    for document_summary in result.data
    for chrom_and_position in document_summary.findall('.//ns0:CHRPOS', namespaces)
    for chromosome, position in [chrom_and_position.text.split(':')]
])

variant_positions
```

> |    | id       |   chromosome |   position |
> |---:|:---------|-------------:|-----------:|
> |  0 | rs6311   |           13 |   46897343 |
> |  1 | rs662138 |            6 |  160143444 |


#### Converting full variation/mutation data to tabular format

Parsing utilities can quickly extract the data to a `VariantSet` object
holding pandas `DataFrame`s with coordinates and alternative alleles frequencies:

```python
from easy_entrez.parsing import parse_dbsnp_variants

variants = parse_dbsnp_variants(result)
variants
```

> `<VariantSet with 2 variants>`

To get the coordinates:

```python
variants.coordinates
```

> | rs_id    | ref   | alts   |   chrom |       pos |   chrom_prev |   pos_prev | consequence                                                                  |
> |:---------|:------|:-------|--------:|----------:|-------------:|-----------:|:-----------------------------------------------------------------------------|
>| rs6311   | C     | A,T    |      13 |  46897343 |           13 |   47471478 | upstream_transcript_variant,intron_variant,genic_upstream_transcript_variant |
>| rs662138 | C     | G      |       6 | 160143444 |            6 |  160564476 | intron_variant                                                               |

For frequencies:

```python
variants.alt_frequencies.head(5)  # using head to only display first 5 for brevity
```

> |    | rs_id   | allele   |   source_frequency |   total_count | study       |     count |
> |---:|:--------|:---------|-------------------:|--------------:|:------------|----------:|
> |  0 | rs6311  | T        |           0.44349  |          2221 | 1000Genomes |   984.991 |
> |  1 | rs6311  | T        |           0.411261 |          1585 | ALSPAC      |   651.849 |
> |  2 | rs6311  | T        |           0.331696 |          1486 | Estonian    |   492.9   |
> |  3 | rs6311  | T        |           0.35     |            14 | GENOME_DK   |     4.9   |
> |  4 | rs6311  | T        |           0.402529 |         56309 | GnomAD      | 22666     |


#### Obtaining the SNP rs ID number from chromosomal position

You can use the query string directly:

```python
results = entrez_api.search(
    '13[CHROMOSOME] AND human[ORGANISM] AND 31873085[POSITION]',
    database='snp',
    max_results=10
)
print(results.data['esearchresult']['idlist'])
```

> `['59296319', '17076752', '7336701', '4']`

Or pass a dictionary (no validation of arguments is performed, `AND` conjunction is used):

```python
results = entrez_api.search(
    dict(chromosome=13, organism='human', position=31873085),
    database='snp',
    max_results=10
)
print(results.data['esearchresult']['idlist'])
```

> `['59296319', '17076752', '7336701', '4']`

The base position should use the latest genome assembly (GRCh38 at the time of writing);
you can use the position in previous assembly coordinates by replacing `POSITION` with `POSITION_GRCH37`.
For more information of the arguments accepted by the SNP database see the [entrez help page](https://www.ncbi.nlm.nih.gov/snp/docs/entrez_help/) on NCBI website.

#### Find PubMed ID from DOI

When searching GWAS catalog PMID is needed over DOI. You can covert one to the other using:

```python
def doi_term(doi: str) -> str:
    """Prepare DOI for PubMed search"""
    doi = (
        doi
        .replace('http://', 'https://')
        .replace('https://doi.org/', '')
    )
    return f'"{doi}"[Publisher ID]'


result = entrez_api.search(
    doi_term('https://doi.org/10.3389/fcell.2021.626821'),
    database='pubmed',
    max_results=1
)
result.data['esearchresult']['idlist']
```

> `['33834021']`

### Installation

Requires Python 3.6+. Install with:


```bash
pip install easy-entrez
```

If you wish to enable (optional, tqdm-based) progress bars use:

```bash
pip install easy-entrez[with_progress_bars]
```

If you wish to enable (optional, pandas-based) parsing utilities use:

```bash
pip install easy-entrez[with_parsing_utils]
```

### Alternatives

You might want to try:

- [biopython.Entrez](https://biopython.org/docs/1.74/api/Bio.Entrez.html) - biopython is a heavy dependency, but probably good choice if you already use it
- [pubmedpy](https://github.com/dhimmel/pubmedpy) - provides interesting utilities for parsing the responses
- [entrez](https://github.com/jordibc/entrez) - appears to have a comparable scope but quite different API
- [entrezpy](https://gitlab.com/ncbipy/entrezpy) - this one did not work well for me (hence this package), but may have improved since
