# Project Links
This repository is part of the [CI-SpliceAI](https://ci-spliceai.com) software package published in [PLOS One](https://doi.org/10.1371/journal.pone.0269159).

This is the project for offline variant annotation using the final models. You may also be interested in the [code to train CI-SpliceAI](https://github.com/YStrauch/CI-SpliceAI__Train), [code comparing different tools on variant data](https://github.com/YStrauch/CI-SpliceAI__Comparison), and the [website providing online annotation of variants](https://ci-spliceai.com).

# Install
We strongly advise you to install [conda](https://docs.conda.io/en/latest/) and create a new conda environment. This keeps your projects and their dependencies separate. So go ahead and install conda first.

CI-SpliceAI can be run on your CPU or GPU. CPU is much easier to set up, but inference is much slower than on a GPU. If you only need to annotate a few variants (fewer than 50 in total), don't bother setting up GPU support.

If you don't need GPU support, you can skip the next section.

## Setting up your GPU
For experienced users only. GPU is only supported for CUDA devices (i.e. NVIDIA graphics cards).

We suggest you install [conda](https://docs.conda.io/en/latest/), CUDA and tensorflow-gpu first and ensure that your GPU is being used by tensorflow (by running python and importing tensorflow). You can use a command similar to this:

```sh
conda create -n cis_use python=3.8 tensorflow-gpu=2.2.0 cudnn=7.6.5.32=hc0a50b0_1 keras=2.4.3 -c conda-forge
```

We used this command with CUDA/11.0.

Please activate the new environment and make sure that your GPU is being recognised:
```sh
conda activate cis_use
python -c "import keras; from tensorflow.python.client import device_lib; print([x.name for x in device_lib.list_local_devices()])"
```
If you set up your GPU correctly, the last line generated by this command should list all your GPUs. If it does not, do not proceed with the installation until this is fixed.
Once the GPU shows up, you can proceed with the next section and omit the [cpu] suffix.

## Installing the python module
If you did not set up your GPU, you need to create the conda environment first:
```sh
conda create -n cis_use python=3.8
conda activate cis_use
```

Then, install cispliceai like this (remove [cpu] suffix if you set up a GPU in the previous step):
```sh
pip install cispliceai[cpu]
```

# Usage

```
cis-vcf [-h] [--annotation ANNOTATION] [--input INPUT] [--output OUTPUT] [--distance DISTANCE] [--batch BATCH] [--all] [--outside] [--mask] reference
```
<pre>
positional arguments:
  reference             path to the reference fasta file (*.fa, *.fa.gz)

optional arguments:
  -h, --help            show this help message and exit

  --annotation ANNOTATION, -a ANNOTATION
                        annotation table with gene names, defaults to "grch38" (table included). You can specify "grch37" or a path to your own table of the same format.

  --input INPUT, -i INPUT
                        input VCF; defaults to stdin

  --output OUTPUT, -o OUTPUT
                        output VCF; defaults to stdout

  --distance DISTANCE, -d DISTANCE
                        maximum distance from the variant; defaults to 1000

  --batch BATCH, -b BATCH
                        maximum input batch size in MB. Be careful to leave enough space for the model and inference process. We recommend to increase this only for GPU processing. Defaults to 10

  --all                 annotate all affected genes/regions, not only the most significant

  --outside             keep nucleotides outside of annotated transcript areas (defined in annotation table); by default outside nucleotides are encoded as N

  --mask, -m            mask events to disregard gains of splice sites and losses of non-splice sites
</pre>

## Example command
Annotates an `input.vcf` file into an `output.vcf` file. Uses GRCh37 coordinates, outputs all affected genes, and masks scores so that only losses of existing splice sites and gains of novel splice sites are reported.

```sh
cis-vcf -i input.vcf -o output.vcf -a grch37 --all --mask <reference>
```
## Parameter `reference`
Please download a human reference genome file (e.g. the latest primary assembly used by GENCODE: [GRCh38](https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/latest_release/GRCh38.primary_assembly.genome.fa.gz), [GRCh37](https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/latest_release/GRCh37_mapping/GRCh37.primary_assembly.genome.fa.gz)). You need to unzip it before use.

If you want to save space (and decrease performance!), you can use a bgzipped version of the reference genome. To do so, install samtools and biopython (`conda install samtools biopython -c bioconda`), then run `bgzip -c <file>.fa > <file>.fa.gz`.

## Annotation table (`--annotation`)
By default, the module uses an annotation table of collapsed isoforms of GENCODE on GRCh38. You can change this to GRCh37 by specifying "-a grch37".

You can also provide your own annotation table. Please use a table provided within this module, e.g. [the GRCh38 one](cispliceai/data/grch38.csv), as a template. It's a CSV file with a (gene) ID, the chromosome, strand, start and end of the transcript, and comma-separated junction start and end positions. On the forward strand, junction start/end is equivalent to donor/acceptor sites; on the reverse strand it's the other way round.
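
As a rough illustration of the layout described above, the sketch below parses one row with Python's `csv` module and maps junction starts/ends to donors/acceptors depending on the strand. The column names (`id`, `chrom`, `strand`, `tx_start`, `tx_end`, `jn_start`, `jn_end`) and the sample row are hypothetical; check the bundled table for the real header before adapting this.

```python
import csv
import io

# Hypothetical header and row; the bundled grch38.csv may use different names.
SAMPLE = (
    "id,chrom,strand,tx_start,tx_end,jn_start,jn_end\n"
    'GENE_A,chr1,-,100,900,"200,500","300,700"\n'
)


def parse_row(row):
    """Turn one CSV row into donor/acceptor position lists."""
    starts = [int(p) for p in row["jn_start"].split(",") if p]
    ends = [int(p) for p in row["jn_end"].split(",") if p]
    # Forward strand: junction start = donor, junction end = acceptor.
    # Reverse strand: the roles are swapped.
    if row["strand"] == "+":
        donors, acceptors = starts, ends
    else:
        donors, acceptors = ends, starts
    return {"id": row["id"], "donors": donors, "acceptors": acceptors}


rows = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE))]
print(rows[0])
```

The junction lists are quoted in the sample because they contain commas themselves; `csv.DictReader` handles that quoting.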

## In/Output (`--input` and `--output`)
You can specify the paths to VCF files here. Alternatively, omit both parameters and pipe the data through stdin and stdout: `cat variants.vcf | cis-vcf <reference> > output.vcf`

## Maximum distance from the variant (`--distance`)
The algorithm predicts on your REF allele plus some nucleotides around it. This parameter sets the number of nucleotides predicted to the left and right. (Note: an additional 5,000 nucleotides in each direction are extracted for context but not used for annotation.)
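
To make the arithmetic concrete, here is a minimal sketch of the window sizes implied by the description above. It assumes the predicted window is simply the REF length plus `--distance` on each side, and that the 5,000 context nucleotides are added on top of that; the function name `window_sizes` is made up for illustration and is not part of the module.

```python
CONTEXT = 5000  # context nucleotides per side, as stated in the note above


def window_sizes(ref_len, distance=1000):
    """Return (predicted, extracted) window lengths in nucleotides."""
    predicted = ref_len + 2 * distance   # positions that can be annotated
    extracted = predicted + 2 * CONTEXT  # sequence pulled from the reference
    return predicted, extracted


print(window_sizes(1))      # single-nucleotide REF, default distance
print(window_sizes(3, 50))  # 3-nt REF, --distance 50
```

Under these assumptions, lowering `--distance` shrinks the annotated window but barely changes the extracted sequence length, which is dominated by the context.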

## Batch size (`--batch`)
This parameter is only useful when using a GPU. We don't recommend changing it for CPU usage.

Each GPU has its own memory limit, so you need to find the optimal batch size yourself. If it's too small, your GPU won't be used efficiently; if it's too large, the GPU will run out of memory.

We suggest finding the optimal value by measuring memory consumption using `nvidia-smi`, extrapolating usage to 100%, and trying again. The parameter itself is a maximum value in MB, so it will work for different sequence lengths (depending on `--distance` and REF/ALT length).
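
The extrapolation step can be sketched as a simple linear scaling. This is only a starting point under the assumption that memory use grows roughly in proportion to the batch size; in practice the model itself occupies a fixed share, so the result may need adjusting. The function name and the `headroom_percent` safety margin are hypothetical, and the memory readings are whatever `nvidia-smi` reports for your run.

```python
def suggest_batch_mb(current_batch_mb, used_mib, total_mib, headroom_percent=90):
    """Scale the batch size so that memory use approaches the headroom target."""
    if current_batch_mb <= 0 or used_mib <= 0 or total_mib <= 0:
        raise ValueError("batch size and memory readings must be positive")
    # Integer arithmetic: new = current * (total * headroom%) / used
    scaled = current_batch_mb * total_mib * headroom_percent // (used_mib * 100)
    return max(1, scaled)


# Example: --batch 10 used 2000 MiB on a card with 16000 MiB in total.
print(suggest_batch_mb(10, 2000, 16000))
```

Re-run with the suggested value, check `nvidia-smi` again, and repeat until usage is close to the limit without running out of memory.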

## All annotations per allele (`--all`)
By default, only the most significant annotation per ALT allele is written to the VCF (i.e. the one with the highest delta score). Add the `--all` flag to output all annotations instead.

## Retain sequences outside of annotated regions (`--outside`)
_This is an experimental feature_.

By default, all nucleotides outside of the overlapping annotated region (defined by the `--annotation` table) are replaced with N, except when the variant does not intersect with any region. With this flag, the sequences are always retained.
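
The default behaviour can be pictured with the following sketch, which replaces everything outside an annotated region with N. It is an illustration of the idea only, not the module's implementation; the function name and the choice of inclusive genomic coordinates are assumptions.

```python
def mask_outside(seq, seq_start, region_start, region_end):
    """Replace nucleotides outside [region_start, region_end] with N.

    seq_start is the genomic coordinate of seq[0]; the region bounds are
    treated as inclusive (an assumption made for this sketch).
    """
    out = []
    for offset, base in enumerate(seq):
        pos = seq_start + offset
        out.append(base if region_start <= pos <= region_end else "N")
    return "".join(out)


# Sequence covers positions 98..105; the annotated region is 100..103.
print(mask_outside("ACGTACGT", 98, 100, 103))
```

With `--outside`, this replacement step would be skipped and the original sequence passed on unchanged.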

## Mask scores (`--mask`)
This parameter masks the delta score: It will remove (i.e. set to zero) all losses of non-splice sites and all gains of splice sites. Therefore only losses of existing splice sites and gains of novel splice sites are annotated.
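
A minimal sketch of this masking rule is shown below. It assumes per-position gain and loss delta scores plus a flag marking whether each position is an annotated splice site; the function and variable names are hypothetical and the real implementation may be organised differently.

```python
def mask_deltas(gain, loss, is_splice_site):
    """Zero out gains at annotated splice sites and losses at other positions."""
    masked_gain = [0.0 if site else g for g, site in zip(gain, is_splice_site)]
    masked_loss = [l if site else 0.0 for l, site in zip(loss, is_splice_site)]
    return masked_gain, masked_loss


gain = [0.1, 0.8, 0.3]
loss = [0.6, 0.2, 0.0]
site = [True, False, False]  # only the first position is an annotated site

masked_gain, masked_loss = mask_deltas(gain, loss, site)
print(masked_gain, masked_loss)
# Picking the maximum after masking yields the strongest remaining event.
print(max(masked_gain), max(masked_loss))
```

Because the maximum is taken after masking, a masked top score is replaced by the next-highest unmasked one rather than by zero.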

# Output
The VCF output has an INFO column with the following annotations:

|    ID    | Description |
| -------- | ----------- |
|  ALLELE  | The ALT allele from your input |
|  SYMBOL  | Ensembl gene ID, or strand if no gene overlaps |
|  DS_AG   | Delta score (acceptor gain) |
|  DS_AL   | Delta score (acceptor loss) |
|  DS_DG   | Delta score (donor gain) |
|  DS_DL   | Delta score (donor loss) |
|  DP_AG   | Delta position (acceptor gain) |
|  DP_AL   | Delta position (acceptor loss) |
|  DP_DG   | Delta position (donor gain) |
|  DP_DL   | Delta position (donor loss) |
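
For downstream processing, the annotation can be split into the fields listed above. The sketch below assumes the value is a single pipe-delimited string with the fields in table order, similar to SpliceAI's output; that layout and the sample value are assumptions, so compare them against a real output line before relying on this.

```python
# Field order taken from the table above; the pipe-delimited layout is assumed.
FIELDS = ["ALLELE", "SYMBOL", "DS_AG", "DS_AL", "DS_DG", "DS_DL",
          "DP_AG", "DP_AL", "DP_DG", "DP_DL"]


def parse_annotation(value):
    """Split one annotation string into a dict with typed scores/positions."""
    parts = value.split("|")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    record = dict(zip(FIELDS, parts))
    for key in FIELDS[2:6]:   # delta scores
        record[key] = float(record[key])
    for key in FIELDS[6:]:    # delta positions
        record[key] = int(record[key])
    return record


record = parse_annotation("T|GENE_A|0.00|0.91|0.02|0.00|-12|0|35|7")  # made-up value
strongest = max(FIELDS[2:6], key=record.get)
print(strongest, record[strongest])
```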

# Comparison between CI-SpliceAI and similar tools
A [separate repository](https://github.com/YStrauch/splice-variant-comparison) compares CI-SpliceAI with SpliceAI, MMSplice, SQUIRLS, and two MaxEntScan variants.

# Differences to SpliceAI
This tool is very similar to the [SpliceAI](https://github.com/Illumina/SpliceAI) annotation tool. These are the most important differences:

- Models are trained on collapsed GENCODE isoforms
- The final models are trained on all chromosomes; SpliceAI models are only trained on training chromosomes
- Models are optimised for inference and load/predict faster
- All overlapping genes are analysed, not only the first one
- When no genes overlap, the tool is run on both the forward and the reverse strand
- *All* variants are annotated, even multi-nucleotide variants
- Better batch inference on GPU
- Masking (`--mask`) is done prior to annotation, allowing the next-highest annotation to be returned rather than capping the output to zero (see [this github issue](https://github.com/Illumina/SpliceAI/issues/27))
- Potential differences when filtering GENCODE annotations. SpliceAI's data pipeline was not released and their selection process not disclosed (see [this github issue](https://github.com/Illumina/SpliceAI/issues/87))

# Current limitations
The tool uses at most one CPU/GPU. Multi-processing is currently not supported.

# Development
## To install from a local directory
```sh
pip install -e ./[cpu]
```

## Build to release
```sh
pip install build
python -m build
```

## Pull requests / Bug reports
Pull requests are welcome. Please submit bug reports through the repository's issue tracker.

# License
<a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.