# FluViewer
A tool for generating influenza A virus genome sequences from FASTQ data

## Installation
1. FluViewer requires the following dependencies. Installing them in a dedicated FluViewer virtual environment is recommended (the indicated versions were tested, but later versions can likely be substituted):
- python v3.8.5
- pandas v1.3.5
- spades v3.15.3
- blast v2.12.0
- bwa v0.7.17
- samtools v1.14
- bcftools v1.14
- bedtools v2.30.0
- seqtk v1.3

2. Once the dependencies have been installed, install the latest FluViewer release via PyPI:
```
pip3 install FluViewer
```

3. Download and unzip the default FluViewer DB (FluViewer_db.fa.gz) from this repository. Custom DBs can be created and used as well (instructions below).

## Usage
```
FluViewer -f <path_to_fwd_reads> -r <path_to_rev_reads> -d <path_to_db_file> -o <output_name> -m <mode> [-D <min_depth> -q <min_qual> -c <min_cov> -i <min_id>] [-g]
```

<b>Required arguments:</b>

-f : path to FASTQ file containing forward reads

-r : path to FASTQ file containing reverse reads

-d : path to FASTA file containing FluViewer database (details below)

-o : output name (creates directory with this name for output, includes this name in output files, and in consensus sequence headers)

-m : FluViewer run mode (align or assemble)


<b>Optional arguments:</b>

-D : Minimum read depth for base calling (default = 20)

-q : Minimum PHRED score for base quality and mapping quality (default = 30)

-c : Minimum coverage of database reference sequence by contig (percentage, default = 25)

-i : Minimum nucleotide sequence identity between database reference sequence and contig (percentage, default = 95)


<b>Optional flags:</b>

-g : Set this flag to deactivate garbage collection and retain intermediate files
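
For scripted runs, the arguments above can be assembled programmatically. The following is a minimal sketch, not part of FluViewer itself: `build_fluviewer_command` is a hypothetical helper, and the file names in the example are placeholders. It only builds the argument list; the actual call is left commented out so the snippet runs without FluViewer installed.

```python
import shlex


def build_fluviewer_command(fwd_reads, rev_reads, db, output_name, mode,
                            min_depth=None, min_qual=None, min_cov=None,
                            min_id=None, keep_intermediate=False):
    """Hypothetical helper: return a FluViewer argument list for subprocess.run."""
    if mode not in ("align", "assemble"):
        raise ValueError("mode must be 'align' or 'assemble'")
    cmd = ["FluViewer", "-f", fwd_reads, "-r", rev_reads,
           "-d", db, "-o", output_name, "-m", mode]
    # Optional arguments are added only when set, so FluViewer's defaults apply otherwise.
    optional = (("-D", min_depth), ("-q", min_qual), ("-c", min_cov), ("-i", min_id))
    for flag, value in optional:
        if value is not None:
            cmd += [flag, str(value)]
    if keep_intermediate:
        cmd.append("-g")  # retain intermediate files
    return cmd


# Placeholder file names, for illustration only.
cmd = build_fluviewer_command("sample_R1.fastq", "sample_R2.fastq",
                              "FluViewer_db.fa", "sample01", "align",
                              min_depth=10, keep_intermediate=True)
print(shlex.join(cmd))
# To run it: import subprocess; subprocess.run(cmd, check=True)
```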


## FluViewer Database
FluViewer requires a curated FASTA file "database" of influenza A virus reference sequences. Headers for these sequences must be formatted and annotated as follows:
```
>unique_id|strain_name|segment|subtype
```
For example:
```
>MF599463|A/swine/Kansas/A01378028/2017|HA|H3
```
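
When building a custom database, it may help to check headers before running FluViewer. The sketch below is one possible check, not FluViewer's own validation: `parse_db_header` and `DbHeader` are hypothetical names, and the set of allowed segment names is assumed from the segment list given in the report description.

```python
from typing import NamedTuple

# Assumption: segment names match those listed for the report's "segment" column.
SEGMENTS = {"PB2", "PB1", "PA", "HA", "NP", "NA", "M", "NS"}


class DbHeader(NamedTuple):
    unique_id: str
    strain_name: str
    segment: str
    subtype: str


def parse_db_header(line):
    """Split a '>unique_id|strain_name|segment|subtype' header into its fields."""
    if not line.startswith(">"):
        raise ValueError("FASTA header must start with '>'")
    fields = line[1:].strip().split("|")
    if len(fields) != 4:
        raise ValueError(f"expected 4 '|'-separated fields, got {len(fields)}")
    header = DbHeader(*fields)
    if header.segment not in SEGMENTS:
        raise ValueError(f"unknown segment: {header.segment}")
    return header


header = parse_db_header(">MF599463|A/swine/Kansas/A01378028/2017|HA|H3")
print(header.segment, header.subtype)
```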

## FluViewer Output
FluViewer generates three output files:
1. A FASTA file containing consensus sequences for influenza A virus genome segments
2. A sorted BAM file with reads mapped to either the chosen reference sequences (align mode) or the assembled contigs (assemble mode)
3. A report TSV file describing segment, subtype, and sequencing metrics for each consensus sequence

Headers in the FASTA file have the following format:
```
>output_name_unique_sequence_number|segment|subject
```
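
Downstream scripts often need to split these headers back into their parts. The following is a minimal sketch under two assumptions: the sequence number is an integer, and it is the text after the last underscore in the first field (so output names that themselves contain underscores still parse). `parse_consensus_header` is a hypothetical helper, and the example header is made up.

```python
def parse_consensus_header(line):
    """Split '>output_name_unique_sequence_number|segment|subject' into parts."""
    name, segment, subject = line.lstrip(">").strip().split("|")
    # Assumption: the sequence number follows the last underscore of the name.
    output_name, _, seq_number = name.rpartition("_")
    return {
        "output_name": output_name,
        "seq_number": int(seq_number),
        "segment": segment,
        "subject": subject,
    }


# Made-up header for illustration.
parsed = parse_consensus_header(">flu_run_01_3|HA|H3")
print(parsed)
```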
The report TSV file contains the following columns:

<b>consensus_seq</b> : the name of the consensus sequence described by this row

<b>segment</b> : influenza A virus genome segment (PB2, PB1, PA, HA, NP, NA, M, NS)

<b>subtype</b> : HA or NA subtype ("none" for internal segments)

<b>mapped reads</b> : the number of sequencing reads mapped to this segment

<b>seq_length</b> : the length (in nucleotides) of the consensus sequence generated by FluViewer

<b>sequenced_bases</b> : the number of nucleotide positions in the consensus sequence with sufficient depth of coverage (set by -D argument) and a successful base call (e.g. A, T, G, or C)

<b>segment_cov</b> : the number of sequenced bases in the consensus sequence divided by the typical length of this genome segment (as a percentage). The typical segment length is determined by finding the median length of the segment/subject reference sequences whose contig alignments have the highest bitscore.
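
The segment_cov arithmetic can be illustrated as follows. This is a sketch of the calculation as described above, not FluViewer's actual code: `segment_cov` here is a hypothetical function, the reference lengths are invented, and the step that selects which reference sequences have the highest bitscore is not modelled.

```python
from statistics import median


def segment_cov(sequenced_bases, reference_lengths):
    """Sequenced bases as a percentage of the median reference length."""
    typical_length = median(reference_lengths)
    return 100 * sequenced_bases / typical_length


# Invented lengths of the best-matching reference sequences for one segment.
ref_lengths = [1701, 1701, 1704]
coverage = segment_cov(1650, ref_lengths)
print(f"{coverage:.1f}%")
```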