Metadata-Version: 2.1
Name: TileSeqMut
Version: 0.3.8
Summary: Analysis scripts for TileSeq sequencing data
Home-page: https://github.com/RyogaLi/tilseq_mutcount
Author: ROUJIA LI
Author-email: roujia.li@mail.utoronto.ca
License: UNKNOWN
Description: ## TileSeq mutation count package
        
        This package parses input sequencing files (fastq) according to a user-provided `parameter.csv` file.
        The output of this pipeline is the mutation counts for each pair of fastq files.
        
        ## Dependencies
        
        `python 3.6/3.7/3.8 (tested mainly under py3.6)`
        
        `R 3.4.4+`
        
        `Bowtie2 Bowtie2-build`
        
        ## Installation 
        
        The alpha version is available by running:
        
        `python -m pip install TileSeqMut`
        
        If you are using conda, you can set up the environment before installing the package: 
        
        `conda install -n <env_name> pandas biopython seaborn`
        
        Then install the package `cigar` with `pip install cigar`. (Sometimes `cigar` is not available through conda or
        testpypi, in which case it needs to be installed manually.)
        
        You will also need the script `csv2json.R`, which is installed as part of
        [tileseqMave](https://github.com/jweile/tileseqMave). Make sure `csv2json.R` can be found in `$PATH`.
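        
        Before a first run, it can help to confirm that the external tools are actually visible on `$PATH`. The following is a minimal sketch, not part of the package: the helper name `find_missing_tools` is hypothetical, and the executable names `bowtie2`, `bowtie2-build` and `Rscript` are assumptions based on the dependency list above.
        
        ```python
        import shutil
        
        # Executables assumed from the Dependencies and Installation sections.
        REQUIRED_TOOLS = ["bowtie2", "bowtie2-build", "Rscript", "csv2json.R"]
        
        
        def find_missing_tools(tools=REQUIRED_TOOLS):
            """Return the tools that cannot be located on $PATH."""
            return [tool for tool in tools if shutil.which(tool) is None]
        
        
        if __name__ == "__main__":
            missing = find_missing_tools()
            if missing:
                print("Not found on PATH:", ", ".join(missing))
            else:
                print("All required tools were found.")
        ```
        
        If anything is reported as missing, adjust `$PATH` (or activate the right conda environment) before running `tileseq_mut`.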
        
        ### Execution
        ---
        
        After installation, you can run the package: 
        
        ```
        tileseq_mut -p ~/path/to/paramSheet.csv -o ~/path/to/output_folder -f ~/path/to/fastq_file_folder/ -name name_of_the_run
        ```
        
        **Examples:**
        
        * This command analyzes the fastq files in the folder `$HOME/tileseq_data/WT/` and creates a time-stamped output folder with the prefix `MTHFR_test` in `$HOME/dev/tilseq_mutcount/output/` (using all default parameters, see below).
        
        ``` bash
        # on DC
        tileseq_mut -p $HOME/dev/tilseq_mutcount/190506_param_MTHFR.csv -o $HOME/dev/tilseq_mutcount/output/ -f $HOME/tileseq_data/WT/ -name MTHFR_test
        
        # on BC2
        tileseq_mut -p $HOME/dev/tilseq_mutcount/190506_param_MTHFR.csv -o $HOME/dev/tilseq_mutcount/output/ -f $HOME/tileseq_data/WT/ -name MTHFR_test -env BC2
        ```
        
        
        **Parameters**
        
        * Run `tileseq_mut --help`
        
        ``` bash
        # Required:
        
        -p PARAM, parameter csv or json file (please see details in the input files section)
        -o OUTPUT, Output directory
        -f FASTQ, Path to fastq files you want to analyze (only required when you are running alignment)
        -n NAME, RUN_NAME
        
        # Optional:
        
        -h, --help list all the args
        
        -env ENVIRONMENT, Name of the cluster on which to run this script (default = DC); you can pick from DC, BC2 or GURU.
        --skip_alignment, Skip alignment for this run. The alignment output must already exist, and the output path should be the output folder generated by a previous run
        -log LOG_LEVEL, Log level of the logging object: debug, info, warning, error, critical (default = debug)
        -r1, R1 SAM file
        -r2, R2 SAM file
        -at, Alignment time required (default = 8h)
        -mt, Mutation calling time required (default = 36h)
        -override, This flag is used when converting the parameter sheet (csv2json). Please provide this flag if you only have one replicate.
        ```
        
        **Example of skipping alignment:**
        
        ```
        tileseq_mut -p $HOME/dev/tilseq_mutcount/190506_param_MTHFR.csv -o /home/rothlab1/rli/dev/tilseq_mutcount/output/190506_MTHFR_WT_2020-01-29-17-07-04/ --skip_alignment
        ```
        
        **Example of running on one pair of SAM files:**
        
        ```
        tileseq_mut -r1 /home/rothlab1/rli//dev/tilseq_mutcount/output/MTHFR_test_2020-03-19-11-28-12/sam_files/45_S45_R1_001.sam -r2 /home/rothlab1/rli//dev/tilseq_mutcount/output/MTHFR_test_2020-03-19-11-28-12/sam_files/45_S45_R2_001.sam -o /home/rothlab1/rli//dev/tilseq_mutcount/output/MTHFR_test_2020-03-19-11-28-12/MTHFR_test_2020-03-19-14-32-21_mut_count -p /home/rothlab1/rli//dev/tilseq_mutcount/paramsheets/190506_MTHFR_WT.csv --skip_alignment -n MTHFR_test
        ```
        
        ### Input files
        ---
        
        `/path/to/fastq/` - Full path to input fastq files
        
        `parameters.csv` - CSV file containing the information for this run (please see the example
        [here](https://docs.google.com/spreadsheets/d/1tIblmIFgOApPNzWN2KUwj8BKzBiJ1pOL7R4AOUGrqvE/edit?usp=sharing)).
        This file must be comma-separated and saved in csv format.
        
        
        ### Output files
        ---
        
        One output folder is created for each run. Each output folder is named `name_time-stamp`.
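        
        As an illustration of the naming scheme, the sketch below builds a folder name from a run name and a time stamp. The function name `make_output_dir_name` is hypothetical, and the `YYYY-MM-DD-HH-MM-SS` format is an assumption inferred from the example paths in this document (for instance `MTHFR_test_2020-03-19-11-28-12`); the package itself may build the name differently.
        
        ```python
        from datetime import datetime
        
        
        def make_output_dir_name(run_name, now=None):
            """Return '<run_name>_<YYYY-MM-DD-HH-MM-SS>' (format assumed from the examples)."""
            now = now or datetime.now()
            return f"{run_name}_{now.strftime('%Y-%m-%d-%H-%M-%S')}"
        
        
        if __name__ == "__main__":
            print(make_output_dir_name("MTHFR_test", datetime(2020, 3, 19, 11, 28, 12)))
            # MTHFR_test_2020-03-19-11-28-12
        ```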
        
        Within each output folder, the following files and folders will be generated:
        
        `./main.log` - main logging file for alignment
        
        `./args.log` - arguments for this run
        
        `./ref/` - Reference fasta file and bowtie2 index
        
        `./env_aln_sh/` - Bash scripts for submitting the alignment jobs
        
        `./sam_files/` - Alignment output and log files for the raw fastq files
        
        `./name_time-stamped_mut_count/` - Mutation counts in each sample are saved in csv files
        
            - `./main.log` - Main log file for mutation calling
        
            - `./args.log` - command line arguments
            
            - `./info.csv` - Meta information for each sample: sequencing depth, tile starts/ends and # of reads mapped outside of the targeted tile
        
            - `./count_sample_*.csv` - Raw mutation counts for each sample, with metadata in the header. Variants are represented in HGVS format
        
            - `./env_mut_sh/` - Bash scripts for submitting the mutation count jobs
        
            - `./sample_id.log` - Log file for each sample
        
        The `count_sample_*.csv` files are passed to tileseqMave for further analysis.
        
        ### Alignment
        ---
        
        The pipeline takes the sequence specified by the user in the parameter file as the reference and aligns
        the fastq files to this whole reference sequence.
        
        For each pair of fastq files (R1 and R2), the pipeline submits one alignment job to the cluster. In the folder `env_aln_sh` you can find all the scripts that were submitted to the cluster when you ran `main.py`.
        
        Alignments are done using `Bowtie2` with the following parameters:
        
        ```
        ~/bowtie2 --no-head --norc --no-sq --local -x {ref} -U {r1} -S {r1_sam_file}
        ~/bowtie2 --no-head --nofw --no-sq --local -x {ref} -U {r2} -S {r2_sam_file}
        ```
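        
        In these commands R1 and R2 are aligned separately as unpaired reads (`-U`): `--norc` stops Bowtie2 from aligning R1 against the reverse-complement strand, and `--nofw` stops it from aligning R2 against the forward strand. The sketch below assembles the same two calls as argument lists. It is an illustration only: the function names are hypothetical, and it assumes `bowtie2` is on `$PATH` rather than at `~/bowtie2`.
        
        ```python
        import shlex
        
        
        def bowtie2_command(ref, fastq, sam_out, strand_flag, bowtie2="bowtie2"):
            """Build one single-end Bowtie2 call as an argument list."""
            return [
                bowtie2, "--no-head", strand_flag, "--no-sq", "--local",
                "-x", ref, "-U", fastq, "-S", sam_out,
            ]
        
        
        def paired_commands(ref, r1, r2, r1_sam, r2_sam):
            """R1 skips the reverse-complement strand, R2 skips the forward strand."""
            return (
                bowtie2_command(ref, r1, r1_sam, "--norc"),
                bowtie2_command(ref, r2, r2_sam, "--nofw"),
            )
        
        
        if __name__ == "__main__":
            # Hypothetical file names, for illustration only.
            for cmd in paired_commands("ref/MTHFR", "s_R1.fastq", "s_R2.fastq",
                                       "s_R1.sam", "s_R2.sam"):
                print(" ".join(shlex.quote(part) for part in cmd))
        ```
        
        Keeping each command as a list makes it straightforward to pass to `subprocess.run` or to write into a cluster submission script.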
        
        ### Mutation Calls
        ---
        
        From each pair of sam files we count mutations for each sample.
        
        We first filter out reads that did not map to the reference or that fall outside of the tile, then pass the remaining reads to `count_mut.py`. Please read the wiki page for details on how mutations are called using the CIGAR string and the MD:Z tag.
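        
        The filtering step and the role of the CIGAR string and MD:Z tag can be sketched as follows. This is not the code in `count_mut.py`: the function names are hypothetical, and "outside of the tile" is interpreted here as "the aligned reference span does not overlap the tile", which is an assumption. The SAM conventions used (flag bit `0x4` for unmapped reads, 1-based `POS`, CIGAR operations `M`, `D`, `N`, `=`, `X` consuming reference bases) come from the SAM specification.
        
        ```python
        import re
        
        CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
        MD_RE = re.compile(r"(\d+)|\^([A-Za-z]+)|([A-Za-z])")
        REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference
        
        
        def reference_span(pos, cigar):
            """Return (start, end), 1-based inclusive, of the reference bases covered."""
            ref_len = sum(int(n) for n, op in CIGAR_RE.findall(cigar) if op in REF_CONSUMING)
            return pos, pos + ref_len - 1
        
        
        def keep_read(sam_line, tile_start, tile_end):
            """Keep a read only if it is mapped and overlaps the tile."""
            fields = sam_line.rstrip("\n").split("\t")
            flag, pos, cigar = int(fields[1]), int(fields[3]), fields[5]
            if flag & 0x4 or cigar == "*":  # unmapped read
                return False
            start, end = reference_span(pos, cigar)
            return start <= tile_end and end >= tile_start
        
        
        def md_mismatches(pos, md):
            """Return [(reference_position, reference_base)] for each mismatch in an MD:Z value."""
            ref_pos, found = pos, []
            for matched, deleted, mismatch in MD_RE.findall(md):
                if matched:
                    ref_pos += int(matched)      # run of matching bases
                elif deleted:
                    ref_pos += len(deleted)      # deleted reference bases (^ACG)
                else:
                    found.append((ref_pos, mismatch))
                    ref_pos += 1
            return found
        ```
        
        For example, `md_mismatches(100, "10A5^AC6")` reports a single mismatch at reference position 110, where the reference base is `A`; the two deleted bases only advance the position.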
        
        To eliminate sequencing errors, we apply a posterior probability cut-off. The posterior probability of a mutation is calculated using the Phred scores provided in the SAM files.
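        
        To make the idea concrete, here is a small sketch of how Phred scores can feed a posterior probability. The conversion from a SAM quality character to an error probability (`Q = ord(char) - 33`, `p_error = 10 ** (-Q / 10)`) is standard. The Bayesian model that follows is only an assumption for illustration and is not necessarily the formula used by this package: it assumes a prior mutation rate `prior`, that a sequencing error produces one specific wrong base with probability `e / 3`, and that reads covering the position are independent. All function names are hypothetical.
        
        ```python
        def phred_char_to_error(char, offset=33):
            """Convert one SAM QUAL character (Phred+33) to an error probability."""
            quality = ord(char) - offset
            return 10 ** (-quality / 10)
        
        
        def mutation_posterior(error_probs, prior=1e-3):
            """Posterior that a variant seen in every read is real (illustrative model).
        
            error_probs: per-read error probabilities at the variant position.
            prior: assumed prior probability of a true mutation at this position.
            """
            like_mut = 1.0  # P(observations | true mutation)
            like_wt = 1.0   # P(observations | wild type, i.e. sequencing errors)
            for e in error_probs:
                like_mut *= 1.0 - e
                like_wt *= e / 3.0
            numerator = prior * like_mut
            return numerator / (numerator + (1.0 - prior) * like_wt)
        
        
        if __name__ == "__main__":
            both_reads = [phred_char_to_error("I"), phred_char_to_error("I")]  # Q40 in R1 and R2
            one_poor_read = [phred_char_to_error("+")]                         # Q10 in one read
            print(mutation_posterior(both_reads))
            print(mutation_posterior(one_poor_read))
        ```
        
        Under these assumptions, a variant supported by two high-quality reads gets a posterior close to 1, while a variant seen in a single low-quality read falls well below a typical cut-off.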
        
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Topic :: Software Development :: Build Tools
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Requires-Python: >=3.7
Description-Content-Type: text/markdown
