Usage (advanced)
============

## General Note

If you are working with extraordinarily large VCF files (>= 1Mb), the following processing steps are likely to be very slow. 
Before starting, consider trimming your VCF files down to only the region or chromosome of interest, using a tool such as [bcftools](http://samtools.github.io/bcftools/bcftools-man.html#view). 
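
As a rough sketch of that trimming step, the snippet below extracts a single chromosome with `bcftools view`. The file names and the region `chr7` are placeholders, and the input is assumed to be bgzip-compressed, because region queries with `-r` need an index. The commands only run if `bcftools` and the input file are actually present.

```shell
# Placeholder names: substitute your own VCF and region of interest.
in_vcf="sample1.vcf.gz"      # assumed to be bgzip-compressed
region="chr7"                # must match the contig naming used in your VCF
out_vcf="${in_vcf%.vcf.gz}.${region}.vcf.gz"

if command -v bcftools >/dev/null 2>&1 && [ -f "$in_vcf" ]; then
    # -r needs an index, so build one first (-f overwrites an existing index).
    bcftools index -f "$in_vcf"
    # Keep only records in the chosen region; -O z writes a compressed VCF.
    bcftools view -r "$region" -O z -o "$out_vcf" "$in_vcf"
else
    echo "bcftools or $in_vcf not found; nothing trimmed" >&2
fi
```

The trimmed file can then be used in place of the original in the steps that follow.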

## Filtering germline variants

If you have access to control tissue, then `germline-filter` is a good place to start. 
All you have to do is provide a simple metadata file that links each experimental sample to its corresponding control sample. 
For example: 
```
experimental_sample_id,germline_sample_id
sample1,gl_sample1
sample2,gl_sample1
sample3,gl_sample2
sample4,gl_sample2
sample5,gl_sample2
```
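
Before running the filter, it can help to sanity-check the metadata file. The sketch below writes the example above to a hypothetical `metadata.csv` and uses `awk` to confirm that the header matches the example and that every row has exactly two non-empty fields. The file name and the checks themselves are illustrative assumptions, not something `cerebra` asks you to run.

```shell
# Write the example metadata to a hypothetical file name.
cat > metadata.csv <<'EOF'
experimental_sample_id,germline_sample_id
sample1,gl_sample1
sample2,gl_sample1
sample3,gl_sample2
sample4,gl_sample2
sample5,gl_sample2
EOF

# awk exits non-zero if the header differs from the example
# or any data row lacks exactly two non-empty fields.
if awk -F',' '
    BEGIN { bad = 0 }
    NR == 1 { if ($0 != "experimental_sample_id,germline_sample_id") bad = 1; next }
    NF != 2 || $1 == "" || $2 == "" { bad = 1 }
    END { exit bad }
' metadata.csv; then
    metadata_ok=yes
else
    metadata_ok=no
fi
echo "metadata valid: $metadata_ok"
```
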
Once you have made this metadata file, you're ready to run `germline-filter`. 
An example command line:   
```
cerebra germline-filter --processes 2 --control_path /path/to/control/vcfs --experimental_path /path/to/experimental/vcfs --metadata /path/to/metadata/file --outdir /path/to/filtered/vcfs
```

This will create a new directory (`/path/to/filtered/vcfs/`) that contains a set of entirely new VCFs. 

## Counting variants 

The `count-variants` module can be run after `germline-filter`, on the new VCFs contained in `/path/to/filtered/vcfs/`.
However, `germline-filter` is entirely optional: if you don't have access to germline or control samples, `count-variants` is the place to start. 
An example command line:   
```
cerebra count-variants --processes 2 --cosmicdb /optional/path/to/cosmic/database --refgenome /path/to/genome/annotation --outfile /path/to/output/file /path/to/filtered/vcfs/*
```

Note that the COSMIC database is also optional. You can download one of the database files from [here](https://cancer.sanger.ac.uk/cosmic/download), but `count-variants` also runs without the `--cosmicdb` option. 
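
To illustrate that `--cosmicdb` can simply be left out, here is a small sketch that assembles the `count-variants` command with or without that flag. The `cosmic_db` variable is a hypothetical helper, the paths are the same placeholders used above, and the command is printed for inspection rather than executed.

```shell
# Hypothetical variable: leave empty to run without a COSMIC database,
# or set it to the path of a downloaded database file.
cosmic_db=""

# ${cosmic_db:+...} expands to the flag only when cosmic_db is non-empty.
cmd="cerebra count-variants --processes 2${cosmic_db:+ --cosmicdb $cosmic_db} --refgenome /path/to/genome/annotation --outfile /path/to/output/file"

# Print the assembled command with the placeholder VCF glob appended.
echo "$cmd /path/to/filtered/vcfs/*"
```
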

## Finding peptide variants

Like `count-variants`, `find-peptide-variants` is a standalone module. You can run it on the VCFs generated by `germline-filter` or on unfiltered VCFs. Also like `count-variants`, this module gives you the option of filtering through a COSMIC database. 
An example command line:   
```
cerebra find-peptide-variants --processes 2 --cosmicdb /optional/path/to/cosmic/database --annotation /path/to/genome/annotation --genomefa /path/to/genome/fasta --report_coverage 1 --output /path/to/output/file /path/to/filtered/vcfs/*
```

If specified, the `--report_coverage` option reports counts for both variant and wildtype reads at all variant loci. We reasoned that variants with a high degree of read support are less likely to be false positives, so this option is designed to give the user more confidence in individual variant calls.

## Testing

First install the packages specified in [test_requirements.txt](https://github.com/czbiohub/cerebra/blob/messing-w-docs/test_requirements.txt). 
Now you should be able to run:

` $ make test `

If you've installed `cerebra` in a virtual environment, make sure the environment is active. 
Confirm that all tests have passed.
If any fail, feel free to submit an [issue report](https://github.com/czbiohub/cerebra/blob/messing-w-docs/docs/CONTRIBUTING.md). 
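
The steps above can be sketched as a short shell session. This assumes you are in the root of a `cerebra` checkout (where `test_requirements.txt` and the `Makefile` live) and that `pip` points at the environment in which `cerebra` is installed. The `tests_ran` variable is only an illustrative marker.

```shell
# Run from the root of a cerebra checkout, with your virtual environment active.
if [ -f test_requirements.txt ] && [ -f Makefile ]; then
    pip install -r test_requirements.txt   # install the test dependencies
    make test                              # run the test suite
    tests_ran=yes
else
    echo "Not in a cerebra checkout; skipping tests" >&2
    tests_ran=no
fi
```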
