Metadata-Version: 1.1
Name: bio-hansel
Version: 2.5.0
Summary: Subtype microbial whole-genome sequencing (WGS) data using SNV targeting k-mer subtyping schemes.
Home-page: https://github.com/phac-nml/biohansel
Author: Peter Kruczkiewicz
Author-email: peter.kruczkiewicz@gmail.com
License: Apache Software License 2.0
Description: |logo|
        
        |conda| |nbsp| |pypi| |nbsp| |license| |nbsp| |nbsp| Master:|citest-master| |nbsp| Development:|citest-dev| |rtd|
        
        
        
        .. |logo| image:: logo.png
            :target: https://github.com/phac-nml/biohansel
        .. |pypi| image:: https://badge.fury.io/py/bio-hansel.svg
            :target: https://pypi.python.org/pypi/bio_hansel/
        .. |license| image:: https://img.shields.io/badge/License-Apache%20v2.0-blue.svg
            :target: http://www.apache.org/licenses/LICENSE-2.0
        .. |citest-dev|  image:: https://travis-ci.org/phac-nml/biohansel.svg?branch=development
            :target: https://travis-ci.org/phac-nml/biohansel
        .. |citest-master| image:: https://travis-ci.org/phac-nml/biohansel.svg?branch=master
            :target: https://travis-ci.org/phac-nml/biohansel
        .. |conda|   image:: https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg
            :target: https://bioconda.github.io/recipes/bio_hansel/README.html
        .. |nbsp| unicode:: 0xA0
            :trim:
        .. |rtd| image:: https://readthedocs.org/projects/bio-hansel/badge/?version=latest&style=flat
            :target: https://bio-hansel.readthedocs.io/en/readthedocs/
        
        Subtype microbial whole-genome sequencing (WGS) data using SNV targeting k-mer subtyping schemes.
        
        Includes 33 bp k-mer SNV subtyping schemes for *Salmonella enterica* subsp. enterica serovars Heidelberg, Enteritidis, and Typhimurium genomes developed by Genevieve Labbe et al., and for *S*. ser Typhi adapted from `Wong et al. <https://www.nature.com/articles/ncomms12827>`_, `Britto et al. <https://journals.plos.org/plosntds/article?id=10.1371/journal.pntd.0006408>`_, `Rahman et al. <https://journals.plos.org/plosntds/article?id=10.1371/journal.pntd.0008036>`_, and `Klemm et al. <https://mbio.asm.org/content/9/1/e00105-18>`_.
        
        Works on genome assemblies (FASTA files) or reads (FASTQ files)! Accepts Gzipped FASTA/FASTQ files as input!
        
        Also includes a *Mycobacterium tuberculosis* lineage scheme adapted from `Coll et al. <https://www.nature.com/articles/ncomms5812>`_ by Daniel Kein.
        
        
        Citation
        ========
        
        If you find the ``biohansel`` tool useful, please cite as:
        
        .. epigraph::
        
            Rapid and accurate SNP genotyping of clonal bacterial pathogens with BioHansel.
            Geneviève Labbé, Peter Kruczkiewicz, Philip Mabon, James Robertson, Justin Schonfeld, Daniel Kein, Marisa A. Rankin, Matthew Gopez, Darian Hole, David Son, Natalie Knox, Chad R. Laing, Kyrylo Bessonov, Eduardo Taboada, Catherine Yoshida, Kim Ziebell, Anil Nichani, Roger P. Johnson, Gary Van Domselaar and John H.E. Nash.
            bioRxiv 2020.01.10.902056; doi: https://doi.org/10.1101/2020.01.10.902056
        
        
        Read the Docs
        =============
        
        More in-depth information on running and installing biohansel can be found on the `biohansel readthedocs page <https://bio-hansel.readthedocs.io/en/readthedocs/>`_.
        
        
        Requirements and Dependencies
        =============================
        
        Each new build of ``biohansel`` is automatically tested on Linux using `Continuous Integration <https://travis-ci.org/phac-nml/bio_hansel/branches>`_. ``biohansel`` has also been confirmed to work on macOS (versions 10.13.5 Beta and 10.12.6) when installed with Conda_.
        
        These are the dependencies required for ``biohansel``:
        
        - Python_ (>=v3.6)
        
          - numpy_ >=1.12.1
          - pandas_ >=0.20.1
          - pyahocorasick_ >=1.1.6
          - attrs_
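        
        A quick way to see which of these are importable in the current environment is sketched below. It is only an illustration: the helper name ``missing_modules`` is hypothetical, and the import names ``ahocorasick`` (for pyahocorasick_) and ``attr`` (for attrs_) are assumptions. It checks for presence only, not for the minimum versions listed above.
        
        .. code-block:: python
        
            import importlib.util
            import sys
        
            def missing_modules(names):
                """Return the module names that cannot be found on the import path."""
                return [name for name in names if importlib.util.find_spec(name) is None]
        
            if __name__ == "__main__":
                # biohansel requires Python >= 3.6
                if sys.version_info < (3, 6):
                    print("Python 3.6 or newer is required")
                # Import names are assumptions: pyahocorasick -> ahocorasick, attrs -> attr
                print(missing_modules(["numpy", "pandas", "ahocorasick", "attr"]))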
        
        
        Installation
        ============
        
        With Conda_
        -----------
        
        Install ``biohansel`` from Bioconda_ with Conda_ (`Conda installation instructions <https://bioconda.github.io/#install-conda>`_):
        
        .. code-block:: bash
        
            # setup Conda channels for Bioconda and Conda-Forge (https://bioconda.github.io/#set-up-channels)
            conda config --add channels defaults
            conda config --add channels conda-forge
            conda config --add channels bioconda
            # install biohansel
            conda install bio_hansel
        
        With pip_ from PyPI_
        ---------------------
        
        Install ``biohansel`` from PyPI_ with pip_:
        
        .. code-block:: bash
        
            pip install bio_hansel
        
        With pip_ from GitHub
        ---------------------
        
        Or install the latest master branch version directly from GitHub:
        
        .. code-block:: bash
        
            pip install git+https://github.com/phac-nml/biohansel.git@master
        
        Install into Galaxy_ (version >= 17.01)
        ---------------------------------------
        
        Install ``biohansel`` from the main Galaxy_ toolshed:
        
        https://toolshed.g2.bx.psu.edu/view/nml/biohansel/ba6a0af656a6
        
        
        Usage
        =====
        
        If you run ``hansel -h``, you should see the following usage statement:
        
        .. code-block::
        
            usage: hansel [-h] [-s SCHEME] [--scheme-name SCHEME_NAME]
                          [-M SCHEME_METADATA] [-p forward_reads reverse_reads]
                          [-i fasta_path genome_name] [-D INPUT_DIRECTORY]
                          [-o OUTPUT_SUMMARY] [-O OUTPUT_KMER_RESULTS]
                          [-S OUTPUT_SIMPLE_SUMMARY] [--force] [--json]
                          [--min-kmer-freq MIN_KMER_FREQ]
                          [--max-kmer-freq MAX_KMER_FREQ]
                          [--low-cov-depth-freq LOW_COV_DEPTH_FREQ]
                          [--max-missing-kmers MAX_MISSING_KMERS]
                          [--min-ambiguous-kmers MIN_AMBIGUOUS_KMERS]
                          [--low-cov-warning LOW_COV_WARNING]
                          [--max-intermediate-kmers MAX_INTERMEDIATE_KMERS] 
                          [--max-degenerate-kmers MAX_DEGENERATE_KMERS] [-t THREADS]
                          [-v] [-V]
                          [F [F ...]]
        
            Subtype microbial genomes using SNV targeting k-mer subtyping schemes.
            Includes schemes for Salmonella enterica spp. enterica serovar Heidelberg, Enteritidis, Typhi, and Typhimurium subtyping. Also includes a Mycobacterium tuberculosis scheme called 'tb_lineage'.
            Developed by Geneviève Labbé, Peter Kruczkiewicz, Philip Mabon, James Robertson, Justin Schonfeld, Daniel Kein, Marisa A. Rankin, Matthew Gopez, Darian Hole, David Son, Natalie Knox, Chad R. Laing, Kyrylo Bessonov, Eduardo Taboada, Catherine Yoshida, Kim Ziebell, Anil Nichani, Roger P. Johnson, Gary Van Domselaar and John H.E. Nash.
        
            positional arguments:
              F                     Input genome FASTA/FASTQ files (can be Gzipped)
        
            optional arguments:
              -h, --help            show this help message and exit
              -s SCHEME, --scheme SCHEME
                                    Scheme to use for subtyping (built-in: "heidelberg",
                                    "enteritidis", "typhi", "typhimurium", "tb_lineage"; OR user-specified:
                                    /path/to/user/scheme)
              --scheme-name SCHEME_NAME
                                    Custom user-specified SNP substyping scheme name
              -M SCHEME_METADATA, --scheme-metadata scheme_metadata
                                    Scheme subtype metadata table (.TSV format accepted;
                                    must contain column called "subtype")
              -p forward_reads reverse_reads, --paired-reads forward_reads reverse_reads
                                    FASTQ paired-end reads
              -i fasta_path genome_name, --input-fasta-genome-name fasta_path genome_name
                                    fasta file path to genome name pair
              -D INPUT_DIRECTORY, --input-directory INPUT_DIRECTORY
                                    directory of input fasta files (.fasta|.fa|.fna) or
                                    FASTQ files (paired FASTQ should have same basename
                                    with "_\d\.(fastq|fq)" postfix to be automatically
                                    paired) (files can be Gzipped)
              -o OUTPUT_SUMMARY, --output-summary OUTPUT_SUMMARY
                                    Subtyping summary output path (tab-delimited)
              -O OUTPUT_KMER_RESULTS, --output-kmer-results OUTPUT_KMER_RESULTS
                                    Subtyping kmer matching output path (tab-delimited)
              -S OUTPUT_SIMPLE_SUMMARY, --output-simple-summary OUTPUT_SIMPLE_SUMMARY
                                    Subtyping simple summary output path
              --force               Force existing output files to be overwritten
              --json                Output JSON representation of output files
              --min-kmer-freq MIN_KMER_FREQ
                                    Min k-mer freq/coverage
              --max-kmer-freq MAX_KMER_FREQ
                                    Max k-mer freq/coverage
              --low-cov-depth-freq LOW_COV_DEPTH_FREQ
                                    Frequencies below this coverage are considered low
                                    coverage
              --max-missing-kmers MAX_MISSING_KMERS
                                    Decimal proportion of maximum allowable missing kmers
                                    before being considered an error. (0.0 - 1.0)
              --min-ambiguous-kmers MIN_AMBIGUOUS_KMERS
                                    Minimum number of missing kmers to be considered an
                                    ambiguous result
              --low-cov-warning LOW_COV_WARNING
                                    Overall kmer coverage below this value will trigger a
                                    low coverage warning
              --max-intermediate-kmers MAX_INTERMEDIATE_KMERS
                                    Decimal proportion of maximum allowable missing kmers
                                    to be considered an intermediate subtype. (0.0 - 1.0)
              --max-degenerate-kmers MAX_DEGENERATE_KMERS
                                    Maximum number of scheme k-mers allowed before
                                    quitting with a usage warning. Default is 100,000
              -t THREADS, --threads THREADS
                                    Number of parallel threads to run analysis (default=1)
              -v, --verbose         Logging verbosity level (-v == show warnings; -vvv ==
                                    show debug info)
              -V, --version         show program's version number and exit
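        
        When ``hansel`` is driven from a script, the options above can be assembled into an argument list. The following is a minimal sketch, not part of ``biohansel`` itself: the function name ``build_hansel_command`` and its parameters are hypothetical, and only flags listed in the usage statement are used.
        
        .. code-block:: python
        
            import shlex
        
            def build_hansel_command(scheme, input_dir, summary_path, kmer_results_path, threads=1):
                """Assemble a hansel argument list from options shown in the usage statement."""
                return [
                    "hansel",
                    "-s", scheme,
                    "-t", str(threads),
                    "-o", summary_path,
                    "-O", kmer_results_path,
                    "-D", input_dir,
                ]
        
            if __name__ == "__main__":
                cmd = build_hansel_command(
                    "heidelberg", "/path/to/fastas_or_fastqs/",
                    "results.tab", "match_results.tab", threads=4,
                )
                # Print a copy-pasteable command line; to execute it instead,
                # pass the list to subprocess.run(cmd, check=True)
                print(" ".join(shlex.quote(part) for part in cmd))
        
        Building a list rather than a single string avoids shell quoting problems when paths contain spaces.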
        
        
        
        
        Example Usage
        =============
        
        Analysis of a single FASTA file
        -------------------------------
        
        .. code-block:: bash
        
            hansel -s heidelberg -vv -o results.tab -O match_results.tab /path/to/SRR1002850.fasta
        
        
        Contents of ``results.tab``:
        
        .. code-block::
        
            sample  scheme  subtype all_subtypes    kmers_matching_subtype  are_subtypes_consistent inconsistent_subtypes   n_kmers_matching_all    n_kmers_matching_all_total  n_kmers_matching_positive   n_kmers_matching_positive_total n_kmers_matching_subtype    n_kmers_matching_subtype_total  file_path
            SRR1002850  heidelberg  2.2.2.2.1.4 2; 2.2; 2.2.2; 2.2.2.2; 2.2.2.2.1; 2.2.2.2.1.4  1037658-2.2.2.2.1.4; 2154958-2.2.2.2.1.4; 3785187-2.2.2.2.1.4   True        202 202 17  17  3   3   SRR1002850.fasta
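        
        Because the summary is tab-delimited with a header row, it can be read with Python's standard ``csv`` module. The sketch below is illustrative only: ``read_summary`` is a hypothetical helper, and the in-memory example reproduces just four of the columns shown above.
        
        .. code-block:: python
        
            import csv
            import io
        
            def read_summary(handle):
                """Parse a tab-delimited biohansel summary into a list of dicts keyed by column name."""
                return list(csv.DictReader(handle, delimiter="\t"))
        
            if __name__ == "__main__":
                # Abbreviated stand-in for results.tab (only four of the columns shown above)
                example = io.StringIO(
                    "sample\tscheme\tsubtype\tare_subtypes_consistent\n"
                    "SRR1002850\theidelberg\t2.2.2.2.1.4\tTrue\n"
                )
                for row in read_summary(example):
                    print(row["sample"], row["subtype"], row["are_subtypes_consistent"])
        
        For a real run, pass ``open("results.tab", newline="")`` in place of the in-memory example.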
        
        
        Contents of ``match_results.tab``:
        
        .. code-block::
        
            kmername    stitle  pident  length  mismatch    gapopen qstart  qend    sstart  send    evalue  bitscore    qlen    slen    seq coverage    is_trunc    refposition subtype is_pos_kmer sample  file_path   scheme
            775920-2.2.2.2  NODE_2_length_512016_cov_46.4737_ID_3   100.0   33  0   0   1   33  474875  474907  2.0000000000000002e-11  62.1    33  512016  GTTCAGGTGCTACCGAGGATCGTTTTTGGTGCG   1.0 False   775920  2.2.2.2 True    SRR1002850  SRR1002850.fasta   heidelberg
            negative3305400-2.1.1.1 NODE_3_length_427905_cov_48.1477_ID_5   100.0   33  0   0   1   33  276235  276267  2.0000000000000002e-11  62.1    33  427905  CATCGTGAAGCAGAACAGACGCGCATTCTTGCT   1.0 False   negative3305400 2.1.1.1 False   SRR1002850  SRR1002850.fasta   heidelberg
            negative3200083-2.1 NODE_3_length_427905_cov_48.1477_ID_5   100.0   33  0   0   1   33  170918  170950  2.0000000000000002e-11  62.1    33  427905  ACCCGGTCTACCGCAAAATGGAAAGCGATATGC   1.0 False   negative3200083 2.1 False   SRR1002850  SRR1002850.fasta   heidelberg
            negative3204925-2.2.3.1.5   NODE_3_length_427905_cov_48.1477_ID_5   100.0   33  0   0   1   33  175760  175792  2.0000000000000002e-11  62.1    33  427905  CTCGCTGGCAAGCAGTGCGGGTACTATCGGCGG   1.0 False   negative3204925 2.2.3.1.5   False   SRR1002850  SRR1002850.fasta   heidelberg
            negative3230678-2.2.2.1.1.1 NODE_3_length_427905_cov_48.1477_ID_5   100.0   33  0   0   1   33  201513  201545  2.0000000000000002e-11  62.1    33  427905  AGCGGTGCGCCAAACCACCCGGAATGATGAGTG   1.0 False   negative3230678 2.2.2.1.1.1 False   SRR1002850  SRR1002850.fasta   heidelberg
            negative3233869-2.1.1.1.1   NODE_3_length_427905_cov_48.1477_ID_5   100.0   33  0   0   1   33  204704  204736  2.0000000000000002e-11  62.1    33  427905  CAGCGCTGGTATGTGGCTGCACCATCGTCATTA   1.0 False   
            [Next 196 lines omitted.]
        
        
        Analysis of a single FASTQ readset
        ----------------------------------
        
        .. code-block:: bash
        
            hansel -s heidelberg -vv -t 4 -o results.tab -O match_results.tab -p SRR5646583_forward.fastqsanger SRR5646583_reverse.fastqsanger
        
        
        Contents of ``results.tab``:
        
        .. code-block::
        
            sample  scheme  subtype all_subtypes    kmers_matching_subtype  are_subtypes_consistent inconsistent_subtypes   n_kmers_matching_all    n_kmers_matching_all_total  n_kmers_matching_positive   n_kmers_matching_positive_total n_kmers_matching_subtype    n_kmers_matching_subtype_total  file_path
            SRR5646583  heidelberg  2.2.1.1.1.1 2; 2.2; 2.2.1; 2.2.1.1; 2.2.1.1.1; 2.2.1.1.1.1  1983064-2.2.1.1.1.1; 4211912-2.2.1.1.1.1    True        202 202 20  20  2   2   SRR5646583_forward.fastqsanger; SRR5646583_reverse.fastqsanger
        
        
        Contents of ``match_results.tab``:
        
        .. code-block::
        
            seq freq    sample  file_path   kmername    is_pos_kmer subtype refposition is_kmer_freq_okay   scheme
            ACGGTAAAAGAGGACTTGACTGGCGCGATTTGC   68  SRR5646583 SRR5646583_forward.fastqsanger; SRR5646583_reverse.fastqsanger    21097-2.2.1.1.1 True    2.2.1.1.1   21097   True    heidelberg
            AACCGGCGGTATTGGCTGCGGTAAAAGTACCGT   77  SRR5646583 SRR5646583_forward.fastqsanger; SRR5646583_reverse.fastqsanger    157792-2.2.1.1.1    True    2.2.1.1.1   157792  True    heidelberg
            CCGCTGCTTTCTGAAATCGCGCGTCGTTTCAAC   67  SRR5646583 SRR5646583_forward.fastqsanger; SRR5646583_reverse.fastqsanger    293728-2.2.1.1  True    2.2.1.1 293728  True    heidelberg
            GAATAACAGCAAAGTGATCATGATGCCGCTGGA   91  SRR5646583 SRR5646583_forward.fastqsanger; SRR5646583_reverse.fastqsanger    607438-2.2.1    True    2.2.1   607438  True    heidelberg
            CAGTTTTACATCCTGCGAAATGCGCAGCGTCAA   87  SRR5646583 SRR5646583_forward.fastqsanger; SRR5646583_reverse.fastqsanger    691203-2.2.1.1  True    2.2.1.1 691203  True    heidelberg
            CAGGAGAAAGGATGCCAGGGTCAACACGTAAAC   33  SRR5646583 SRR5646583_forward.fastqsanger; SRR5646583_reverse.fastqsanger    944885-2.2.1.1.1    True    2.2.1.1.1   944885  True    heidelberg
            [Next 200 lines omitted.]
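        
        For reads, each row reports the observed frequency (``freq``) of a k-mer, which is the quantity that options such as ``--min-kmer-freq`` act on. As a rough sketch of how these rows could be inspected, the hypothetical helper below lists k-mers whose frequency falls under a chosen cutoff. The cutoff of 40 is arbitrary and is not a ``biohansel`` default.
        
        .. code-block:: python
        
            import csv
            import io
        
            def low_frequency_kmers(handle, min_freq):
                """Return (kmername, freq) pairs whose observed frequency is below min_freq."""
                low = []
                for row in csv.DictReader(handle, delimiter="\t"):
                    freq = int(row["freq"])
                    if freq < min_freq:
                        low.append((row["kmername"], freq))
                return low
        
            if __name__ == "__main__":
                # Two rows from the example output above, reduced to two columns
                example = io.StringIO(
                    "kmername\tfreq\n"
                    "21097-2.2.1.1.1\t68\n"
                    "944885-2.2.1.1.1\t33\n"
                )
                # 40 is an arbitrary illustrative threshold, not a biohansel default
                print(low_frequency_kmers(example, min_freq=40))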
        
        Analysis of all FASTA/FASTQ files in a directory
        ------------------------------------------------
        
        .. code-block:: bash
        
            hansel -s heidelberg -vv --threads <n_cpu> -o results.tab -O match_results.tab -D /path/to/fastas_or_fastqs/
        
        
        ``biohansel`` will only attempt to analyze the FASTA/FASTQ files within the specified directory and will not descend into any subdirectories!
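        
        The help text for ``-D`` states that paired FASTQ files are matched when they share a basename followed by a ``_\d\.(fastq|fq)`` postfix. The sketch below shows one way such a rule could be applied to a list of filenames. It is not ``biohansel``'s actual implementation: the function name is hypothetical, and the optional ``.gz`` suffix is an assumption based on the note that files can be Gzipped.
        
        .. code-block:: python
        
            import re
            from collections import defaultdict
        
            # Pattern taken from the -D help text; the optional ".gz" suffix is an assumption
            PAIRED_SUFFIX = re.compile(r"_\d\.(fastq|fq)(\.gz)?$")
        
            def group_paired_fastqs(filenames):
                """Group FASTQ filenames that share a basename before the "_<digit>.fastq" postfix."""
                groups = defaultdict(list)
                for name in sorted(filenames):
                    match = PAIRED_SUFFIX.search(name)
                    if match:
                        groups[name[:match.start()]].append(name)
                # Keep only basenames with exactly two files (a forward/reverse pair)
                return {base: files for base, files in groups.items() if len(files) == 2}
        
            if __name__ == "__main__":
                names = ["SRR5646583_1.fastq", "SRR5646583_2.fastq", "SRR1002850.fasta"]
                print(group_paired_fastqs(names))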
        
        Metadata addition to analysis
        -----------------------------
        
        Add subtype metadata to your analysis results with ``-M your-subtype-metadata.tsv``:
        
        .. code-block:: bash
        
            hansel -s heidelberg \
              -M your-subtype-metadata.tsv \
              -o results.tab \
              -O match_results.tab \
              -D ~/your-reads-directory/
        
        Your metadata table **must** contain a column named ``subtype``, e.g.
        
        .. list-table::
           :header-rows: 1
        
           * - subtype
             - host_association
             - geoloc
             - genotype_alternative
           * - 1
             - human
             - Canada
             - A
           * - 2
             - cow
             - USA
             - B
        
        ``biohansel`` accepts metadata table files with the following formats and extensions:
        
        .. list-table:: 
           :header-rows: 1
        
           * - Format
             - Extension
             - Example Filename
           * - Tab-delimited table/tab-separated values (TSV)
             - ``.tsv``
             - ``my-metadata-table.tsv``
           * - Tab-delimited table/tab-separated values (TSV)
             - ``.tab``
             - ``my-metadata-table.tab``
           * - Comma-separated values (CSV)
             - ``.csv``
             - ``my-metadata-table.csv``
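        
        Before passing a table to ``-M``, it can be worth confirming that the required ``subtype`` column is present. The following is a minimal sketch under stated assumptions: the helper ``has_subtype_column`` is hypothetical, and the delimiter for each extension is inferred from the formats listed above rather than taken from ``biohansel``'s own parsing code.
        
        .. code-block:: python
        
            import csv
            import io
        
            # Delimiters inferred from the formats and extensions listed above
            DELIMITERS = {".tsv": "\t", ".tab": "\t", ".csv": ","}
        
            def has_subtype_column(handle, extension):
                """Return True if the table's header row contains a "subtype" field."""
                reader = csv.reader(handle, delimiter=DELIMITERS[extension])
                header = next(reader, [])
                return "subtype" in header
        
            if __name__ == "__main__":
                # Header and first row of the example metadata table, as CSV
                table = io.StringIO(
                    "subtype,host_association,geoloc,genotype_alternative\n"
                    "1,human,Canada,A\n"
                )
                print(has_subtype_column(table, ".csv"))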
        
        
        Development
        ===========
        
        
        Get the latest development code using Git from GitHub:
        
        .. code-block:: bash
        
            git clone https://github.com/phac-nml/biohansel.git
            cd biohansel/
            git checkout development
            # Create a virtual environment (virtualenv) for development
            virtualenv -p python3 .venv
            # Activate the newly created virtualenv
            source .venv/bin/activate
            # Install biohansel into the virtualenv in "editable" mode
            pip install -e .
        
        
        Run tests with pytest_:
        
        .. code-block:: bash
        
            # In the biohansel/ root directory, install pytest for running tests
            pip install pytest
            # Run all tests in tests/ directory
            pytest
            # Or run a specific test module
            pytest -s tests/test_qc.py
        
        
        
        Legal
        =====
        
        Copyright Government of Canada 2017
        
        Written by: National Microbiology Laboratory, Public Health Agency of Canada
        
        Licensed under the Apache License, Version 2.0 (the "License"); you may not use
        this work except in compliance with the License. You may obtain a copy of the
        License at:
        
        http://www.apache.org/licenses/LICENSE-2.0
        
        Unless required by applicable law or agreed to in writing, software distributed
        under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR
        CONDITIONS OF ANY KIND, either express or implied. See the License for the
        specific language governing permissions and limitations under the License.
        
        Contact
        =======
        
        **Gary van Domselaar**: gary.vandomselaar@canada.ca
        
        
        .. _PyPI: https://pypi.org/project/bio-hansel/
        .. _Conda: https://conda.io/docs/
        .. _Bioconda: https://bioconda.github.io/
        .. _pip: https://pip.pypa.io/en/stable/quickstart/
        .. _numpy: http://www.numpy.org/
        .. _pandas: http://pandas.pydata.org/
        .. _pyahocorasick: http://pyahocorasick.readthedocs.io/en/latest/
        .. _attrs: http://www.attrs.org/en/stable/
        .. _Python: https://www.python.org/
        .. _Galaxy: https://galaxyproject.org/
        .. _pytest: https://docs.pytest.org/en/latest/
        
Keywords: Salmonella enterica Heidelberg Enteritidis SNP kmer subtyping Aho-Corasick
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Topic :: Scientific/Engineering
