Metadata-Version: 1.2
Name: snps
Version: 1.2.3
Summary: tools for reading, writing, merging, and remapping SNPs
Home-page: https://github.com/apriha/snps
Author: Andrew Riha
Author-email: apriha@gmail.com
License: BSD 3-Clause License
Project-URL: Changelog, https://github.com/apriha/snps/releases
Project-URL: Issue Tracker, https://github.com/apriha/snps/issues
Description: .. image:: https://raw.githubusercontent.com/apriha/snps/master/docs/images/snps_banner.png
        
        |build| |codecov| |docs| |pypi| |python| |downloads|
        
        snps
        ====
        tools for reading, writing, merging, and remapping SNPs 🧬
        
        Features
        --------
        Input / Output
        ``````````````
        - Read raw data (genotype) files from a variety of direct-to-consumer (DTC) DNA testing
          sources with a `SNPs <https://snps.readthedocs.io/en/latest/snps.html#snps.snps.SNPs>`_
          object
        - Read and write VCF files (e.g., convert `23andMe <https://www.23andme.com>`_ to VCF)
        - Merge raw data files from different DNA tests, identifying discrepant SNPs in the
          process with a
          `SNPsCollection <https://snps.readthedocs.io/en/latest/snps.html#snps.snps_collection.SNPsCollection>`_
          object
        - Read data in a variety of formats (e.g., files, bytes, compressed with ``gzip`` or ``zip``)
        - Handle several variations of file types, validated via
          `openSNP parsing analysis <https://github.com/apriha/snps/tree/master/analysis/parse-opensnp-files>`_
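        
        As a rough illustration of the compressed-input case above, the following sketch tells
        ``gzip`` and ``zip`` data apart by their leading magic bytes, using only the standard
        library. The ``sniff_compression`` helper is hypothetical and not part of the ``snps`` API;
        how ``snps`` itself detects compression may differ::
        
            import gzip
        
            # Leading "magic" bytes defined by the gzip and zip file formats.
            GZIP_MAGIC = b"\x1f\x8b"
            ZIP_MAGIC = b"PK\x03\x04"  # local file header of a non-empty zip archive
        
        
            def sniff_compression(data):
                """Return 'gzip', 'zip', or 'none' for a bytes object (hypothetical helper)."""
                if data.startswith(GZIP_MAGIC):
                    return "gzip"
                if data.startswith(ZIP_MAGIC):
                    return "zip"
                return "none"
        
        
            # Made-up header line standing in for a raw data file.
            raw = b"rsid\tchromosome\tposition\tgenotype\n"
            packed = gzip.compress(raw)
            kind = sniff_compression(packed)  # gzip output always begins with GZIP_MAGIC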
        
        Build / Assembly Detection and Remapping
        ````````````````````````````````````````
        - Detect the build / assembly of SNPs (supports builds 36, 37, and 38)
        - Remap SNPs between builds / assemblies
        
        Data Cleaning
        `````````````
        - Fix several common issues when loading SNPs
        - Sort SNPs based on chromosome and position
        - Deduplicate RSIDs
        - Deduplicate alleles in the non-PAR regions of the X and Y chromosomes for males
        - Assign PAR SNPs to the X or Y chromosome
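        
        To make the sorting and deduplication steps above concrete, here is a minimal
        standard-library sketch. It is not the ``snps`` implementation: the ``chrom_sort_key`` and
        ``clean`` helpers, the tuple layout, the chromosome ordering (1-22, X, Y, MT), and the choice
        to keep the first occurrence of a duplicated RSID are all assumptions made for illustration::
        
            def chrom_sort_key(chrom):
                """Order chromosomes 1-22 numerically, then X, Y, MT (assumed order)."""
                special = {"X": 23, "Y": 24, "MT": 25}
                if chrom.isdigit():
                    return int(chrom)
                return special.get(chrom, 26)  # unknown labels sort last
        
        
            def clean(records):
                """Deduplicate (rsid, chrom, pos, genotype) tuples by rsid, then sort."""
                seen = set()
                unique = []
                for rec in records:
                    if rec[0] not in seen:  # keep the first occurrence of each rsid
                        seen.add(rec[0])
                        unique.append(rec)
                return sorted(unique, key=lambda r: (chrom_sort_key(r[1]), r[2]))
        
        
            # Made-up records; "rs1" appears twice.
            records = [
                ("rs3", "X", 500, "AG"),
                ("rs1", "2", 300, "CC"),
                ("rs2", "1", 900, "TT"),
                ("rs1", "2", 300, "CC"),
                ("rs4", "1", 100, "GG"),
            ]
            cleaned = clean(records)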
        
        Supported Genotype Files
        ------------------------
        ``snps`` supports `VCF <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3137218/>`_ files and
        genotype files from the following DNA testing sources:
        
        - `23andMe <https://www.23andme.com>`_
        - `Ancestry <https://www.ancestry.com>`_
        - `Código 46 <https://codigo46.com.mx>`_
        - `DNA.Land <https://dna.land>`_
        - `Family Tree DNA <https://www.familytreedna.com>`_
        - `Genes for Good <https://genesforgood.sph.umich.edu>`_
        - `LivingDNA <https://livingdna.com>`_
        - `Mapmygenome <https://mapmygenome.in>`_
        - `MyHeritage <https://www.myheritage.com>`_
        - `Sano Genetics <https://sanogenetics.com>`_
        
        Additionally, ``snps`` can read a variety of "generic" CSV and TSV files.
        
        Dependencies
        ------------
        ``snps`` requires `Python <https://www.python.org>`_ 3.5+ and the following Python packages:
        
        - `numpy <http://www.numpy.org>`_
        - `pandas <http://pandas.pydata.org>`_
        - `atomicwrites <https://github.com/untitaker/python-atomicwrites>`_
        
        Installation
        ------------
        ``snps`` is `available <https://pypi.org/project/snps/>`_ on the
        `Python Package Index <https://pypi.org>`_. Install ``snps`` (and its required
        Python dependencies) via ``pip``::
        
            $ pip install snps
        
        Examples
        --------
        Download Example Data
        `````````````````````
        First, let's set up logging to get some helpful output:
        
        >>> import logging, sys
        >>> logger = logging.getLogger()
        >>> logger.setLevel(logging.INFO)
        >>> logger.addHandler(logging.StreamHandler(sys.stdout))
        
        Now we're ready to download some example data from `openSNP <https://opensnp.org>`_:
        
        >>> from snps.resources import Resources
        >>> r = Resources()
        >>> paths = r.download_example_datasets()
        Downloading resources/662.23andme.340.txt.gz
        Downloading resources/662.ftdna-illumina.341.csv.gz
        
        Load Raw Data
        `````````````
        Load a `23andMe <https://www.23andme.com>`_ raw data file:
        
        >>> from snps import SNPs
        >>> s = SNPs('resources/662.23andme.340.txt.gz')
        
        The ``SNPs`` class accepts a path to a file or a bytes object. A ``Reader`` class attempts to
        infer the data source and load the SNPs. The loaded SNPs are available via a ``pandas.DataFrame``:
        
        >>> df = s.snps
        >>> df.columns.values
        array(['chrom', 'pos', 'genotype'], dtype=object)
        >>> df.index.name
        'rsid'
        >>> len(df)
        991786
        
        ``snps`` also attempts to detect the build / assembly of the data:
        
        >>> s.build
        37
        >>> s.build_detected
        True
        >>> s.assembly
        'GRCh37'
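        
        One plausible way to detect a build is to compare the position of a well-known SNP against
        its known position in each build. The sketch below illustrates that idea only; it is not
        the detection logic used by ``snps``. The ``guess_build`` helper and ``KNOWN_POSITIONS``
        table are hypothetical, and the two positions are the ones shown for ``rs3094315`` in the
        remapping example in this document::
        
            # Positions of rs3094315 in builds 37 and 38, taken from this document.
            # A realistic table would cover more SNPs and more builds.
            KNOWN_POSITIONS = {"rs3094315": {37: 752566, 38: 817186}}
        
        
            def guess_build(positions):
                """Return the build whose known position matches, or None if none does.
        
                positions: dict mapping rsid -> observed position (hypothetical input).
                """
                for rsid, by_build in KNOWN_POSITIONS.items():
                    observed = positions.get(rsid)
                    for build, pos in by_build.items():
                        if observed == pos:
                            return build
                return None
        
        
            build = guess_build({"rs3094315": 752566})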
        
        Remap SNPs
        ``````````
        Let's remap the SNPs to change the assembly / build:
        
        >>> s.snps.loc["rs3094315"].pos
        752566
        >>> chromosomes_remapped, chromosomes_not_remapped = s.remap_snps(38)
        Downloading resources/GRCh37_GRCh38.tar.gz
        >>> s.build
        38
        >>> s.assembly
        'GRCh38'
        >>> s.snps.loc["rs3094315"].pos
        817186
        
        SNPs can be remapped between Build 36 (``NCBI36``), Build 37 (``GRCh37``), and Build 38
        (``GRCh38``).
        
        Merge Raw Data Files
        ````````````````````
        The dataset consists of raw data files from two different DNA testing sources. Let's combine
        these files using a ``SNPsCollection``.
        
        >>> from snps import SNPsCollection
        >>> sc = SNPsCollection("resources/662.ftdna-illumina.341.csv.gz", name="User662")
        Loading resources/662.ftdna-illumina.341.csv.gz
        >>> sc.build
        36
        >>> chromosomes_remapped, chromosomes_not_remapped = sc.remap_snps(37)
        Downloading resources/NCBI36_GRCh37.tar.gz
        >>> sc.snp_count
        708092
        
        As the data gets added, it's compared to the existing data, and SNP position and genotype
        discrepancies are identified. (The discrepancy thresholds can be tuned via parameters.)
        
        >>> sc.load_snps(["resources/662.23andme.340.txt.gz"], discrepant_genotypes_threshold=300)
        Loading resources/662.23andme.340.txt.gz
        27 SNP positions were discrepant; keeping original positions
        151 SNP genotypes were discrepant; marking those as null
        >>> len(sc.discrepant_snps)  # SNPs with discrepant positions and genotypes, dropping dups
        169
        >>> sc.snp_count
        1006960
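        
        The comparison described above can be sketched with plain dictionaries. This is a
        simplified illustration, not the ``SNPsCollection`` implementation: the ``merge`` helper and
        its data layout are hypothetical, genotypes are assumed to be non-null strings, and
        thresholds are ignored. It keeps the original position when positions disagree and nulls
        the genotype when genotypes disagree, mirroring the log messages shown::
        
            def merge(existing, incoming):
                """Merge rsid -> (chrom, pos, genotype) dicts, reporting discrepancies."""
                merged = dict(existing)
                discrepant_positions, discrepant_genotypes = [], []
                for rsid, (chrom, pos, genotype) in incoming.items():
                    if rsid not in merged:
                        merged[rsid] = (chrom, pos, genotype)  # new SNP, just add it
                        continue
                    old_chrom, old_pos, old_genotype = merged[rsid]
                    if (old_chrom, old_pos) != (chrom, pos):
                        discrepant_positions.append(rsid)  # original position is kept
                    # Compare alleles without regard to order, so "AG" matches "GA".
                    if sorted(old_genotype) != sorted(genotype):
                        discrepant_genotypes.append(rsid)
                        merged[rsid] = (old_chrom, old_pos, None)  # mark as null
                return merged, discrepant_positions, discrepant_genotypes
        
        
            # Made-up data: rs2 disagrees on both position and genotype.
            existing = {"rs1": ("1", 100, "AG"), "rs2": ("1", 200, "CC")}
            incoming = {
                "rs1": ("1", 100, "GA"),
                "rs2": ("1", 250, "CT"),
                "rs3": ("2", 50, "TT"),
            }
            merged, bad_positions, bad_genotypes = merge(existing, incoming)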
        
        Save SNPs
        `````````
        Ok, so far we've remapped the SNPs to the same build and merged the SNPs from two files,
        identifying discrepancies along the way. Let's save the merged dataset consisting of over 1M
        SNPs to a CSV file:
        
        >>> saved_snps = sc.save_snps()
        Saving output/User662_GRCh37.csv
        
        Moreover, let's get the reference sequences for this assembly and save the SNPs as a VCF file:
        
        >>> saved_snps = sc.save_snps(vcf=True)
        Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.1.fa.gz
        Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.2.fa.gz
        Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.3.fa.gz
        Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.4.fa.gz
        Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.5.fa.gz
        Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.6.fa.gz
        Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.7.fa.gz
        Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.8.fa.gz
        Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.9.fa.gz
        Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.10.fa.gz
        Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.11.fa.gz
        Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.12.fa.gz
        Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.13.fa.gz
        Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.14.fa.gz
        Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.15.fa.gz
        Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.16.fa.gz
        Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.17.fa.gz
        Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.18.fa.gz
        Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.19.fa.gz
        Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.20.fa.gz
        Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.21.fa.gz
        Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.22.fa.gz
        Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.X.fa.gz
        Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.Y.fa.gz
        Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.MT.fa.gz
        Saving output/User662_GRCh37.vcf
        
        All `output files <https://snps.readthedocs.io/en/latest/output_files.html>`_ are saved to the
        output directory.
        
        Documentation
        -------------
        Documentation is available `here <https://snps.readthedocs.io/>`_.
        
        Acknowledgements
        ----------------
        Thanks to Mike Agostino, Padma Reddy, Kevin Arvai, `openSNP <https://opensnp.org>`_,
        `Open Humans <https://www.openhumans.org>`_, and `Sano Genetics <https://sanogenetics.com>`_.
        
        .. https://github.com/rtfd/readthedocs.org/blob/master/docs/badges.rst
        .. |build| image:: https://travis-ci.com/apriha/snps.svg?branch=master
           :target: https://travis-ci.com/apriha/snps
        .. |codecov| image:: https://codecov.io/gh/apriha/snps/branch/master/graph/badge.svg
           :target: https://codecov.io/gh/apriha/snps
        .. |docs| image:: https://readthedocs.org/projects/snps/badge/?version=latest
           :target: https://snps.readthedocs.io/
        .. |pypi| image:: https://img.shields.io/pypi/v/snps.svg
           :target: https://pypi.python.org/pypi/snps
        .. |python| image:: https://img.shields.io/pypi/pyversions/snps.svg
           :target: https://www.python.org
        .. |downloads| image:: https://pepy.tech/badge/snps
           :target: https://pepy.tech/project/snps
        
Keywords: snps dna chromosomes bioinformatics
Platform: any
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: End Users/Desktop
Classifier: Intended Audience :: Healthcare Industry
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: BSD License
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Utilities
Requires-Python: >=3.5
