## Classify

* Classifies the newly identified ORFs into groups based on the most similar known ORF.
* Aligns the newly identified ORFs with reference sequences within these groups and builds a phylogenetic tree for each group.
* Finds clusters of newly identified ORFs within these trees.
* Incorporates representative sequences from these clusters into a summary tree for each retroviral gene and genus (based on classification into *gamma*, *beta*, *spuma*, *alpha*, *lenti*, *epsilon* and *delta* retroviruses, as defined by the ICTV (https://talk.ictvonline.org/taxonomy)).

1. [makeGroupFastas](#makeGroupFastas)<br>
2. [makeGroupTrees](#makeGroupTrees)<br>
3. [drawGroupTrees](#drawGroupTrees)<br>
4. [makeSummaryFastas](#makeSummaryFastas)<br>
5. [makeSummaryTrees](#makeSummaryTrees)<br>
6. [drawSummaryTrees](#drawSummaryTrees)<br>
7. [summariseClassify](#summariseClassify)<br>
8. [Classify](#Classify)


#### makeGroupFastas<a name="makeGroupFastas"></a>

**Input Files**<br>
`grouped.dir/GENE_groups.tsv`<br>
`ERVsearch/phylogenies/group_phylogenies/*fasta`<br>
`ERVsearch/phylogenies/summary_phylogenies/*fasta`<br>
`ERVsearch/phylogenies/outgroups.tsv`<br>

**Output Files**<br>
`group_fastas.dir/GENE_(.*_)GENUS.fasta`<br>
`group_fastas.dir/GENE_(.*_)GENUS_A.fasta`<br>

**Parameters**<br>
`[paths] path_to_ERVsearch`<br>

Two sets of reference FASTA files are available, stored in `ERVsearch/phylogenies/group_phylogenies` and `ERVsearch/phylogenies/summary_phylogenies`:

* `group_phylogenies` - groups of closely related ERVs for fine classification of sequences
* `summary_phylogenies` - groups of more distantly related ERVs for broad classification of sequences

Sequences have been assigned to groups according to the most similar sequence in the provided ERV database, ranked by score using the Exonerate ungapped algorithm.
Where the most similar sequence is not part of a well-defined group, the sequence has been assigned to a genus instead.

A FASTA file is generated for each group. Where a group has been assigned, the file contains all members of that group from the `group_phylogenies` file, plus an outgroup. Where only a genus has been assigned, the file contains representative sequences from the same genus, taken from the `summary_phylogenies` file. In both cases, all the newly identified ERVs in the group are added. These files are saved as `GENE_(group_name_)GENUS.fasta`.

A "~" is added to all new sequence names so they can be searched for easily.
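As a minimal sketch of how the "~" marker can be used when inspecting a group FASTA file, the function below separates new sequences from reference sequences by searching the record names. The sequence names and the position of the "~" within the name are hypothetical; the code only assumes that the marker appears somewhere in the name of each new sequence.

```python
def split_new_and_reference(fasta_text):
    """Split FASTA record names into new ("~" in the name) and reference."""
    new, reference = [], []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            name = line[1:].strip()
            # New ORFs are recognised purely by the "~" marker in their name.
            if "~" in name:
                new.append(name)
            else:
                reference.append(name)
    return new, reference


# Hypothetical records, for illustration only.
example = """>MLV_ref
ATGGCC
>gag_chr1-100-400~
ATGGCA
>outgroup_ref
ATGTTT
"""

new_names, reference_names = split_new_and_reference(example)
print(new_names)
print(reference_names)
```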

The files are aligned using the MAFFT fftns algorithm (https://mafft.cbrc.jp/alignment/software/manual/manual.html) to generate the `GENE_(group_name_)GENUS_A.fasta` aligned output files.
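The alignment step could be approximated by hand along the following lines. This is a sketch rather than the pipeline's own code: it assumes the `fftns` wrapper script installed with MAFFT is on the `PATH`, the options the pipeline actually passes are not shown here, and the example file name is hypothetical. The function only builds the command and the `_A.fasta` output name; it does not run MAFFT.

```python
import shlex
from pathlib import PurePosixPath


def fftns_command(fasta_path):
    """Build a shell command aligning fasta_path and the aligned file name."""
    fasta = PurePosixPath(fasta_path)
    # GENE_(group_name_)GENUS.fasta -> GENE_(group_name_)GENUS_A.fasta
    aligned = fasta.with_name(fasta.stem + "_A.fasta")
    # Assumption: MAFFT's "fftns" wrapper writes the alignment to stdout.
    cmd = "fftns {} > {}".format(shlex.quote(str(fasta)),
                                 shlex.quote(str(aligned)))
    return cmd, str(aligned)


# Hypothetical group FASTA file name.
cmd, aligned = fftns_command("group_fastas.dir/pol_MLV-like_gamma.fasta")
print(cmd)
```

The resulting string can be passed to a shell, for example with `subprocess.run(cmd, shell=True, check=True)`.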


#### makeGroupTrees<a name="makeGroupTrees"></a>

**Input Files**<br>
`group_fastas.dir/GENE_(.*_)GENUS_A.fasta`<br>

**Output Files**<br>
`group_trees.dir/GENE_(.*_)GENUS.tre`<br>

**Parameters**<br>
None<br>

Builds a phylogenetic tree, using the FastTree2 algorithm (http://www.microbesonline.org/fasttree) with the default settings plus the GTR model, for the aligned group FASTA files generated by the makeGroupFastas function.
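A roughly equivalent manual call is sketched below. It assumes the executable is named `FastTree` and that the ORFs are nucleotide sequences (hence `-nt`, which the GTR model requires); the pipeline's exact command line may differ, and the example file name is hypothetical. The helper only constructs the argument list and the output path.

```python
from pathlib import PurePosixPath


def fasttree_command(aligned_fasta, tree_dir="group_trees.dir"):
    """Return the FastTree argument list and the .tre path for an alignment."""
    name = PurePosixPath(aligned_fasta).name
    if not name.endswith("_A.fasta"):
        raise ValueError("expected an aligned file ending in _A.fasta")
    # GENE_(group_name_)GENUS_A.fasta -> GENE_(group_name_)GENUS.tre
    stem = name[:-len("_A.fasta")]
    tree_path = str(PurePosixPath(tree_dir) / (stem + ".tre"))
    # Default settings plus the GTR model; FastTree prints the tree to stdout.
    argv = ["FastTree", "-nt", "-gtr", aligned_fasta]
    return argv, tree_path


# Hypothetical aligned group FASTA file name.
argv, tree_path = fasttree_command("group_fastas.dir/pol_MLV-like_gamma_A.fasta")
print(" ".join(argv), ">", tree_path)
```

To run it, open `tree_path` for writing and pass the handle as `stdout` to `subprocess.run(argv, stdout=handle, check=True)`.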

#### drawGroupTrees<a name="drawGroupTrees"></a>

**Input Files**<br>
`group_trees.dir/GENE_(.*_)GENUS.tre`<br>

**Output Files**<br>
`group_trees.dir/GENE_(.*_)GENUS.FMT` (FMT = png, svg, pdf or jpg)<br>

**Parameters**<br>
`[plots] gag_colour`<br>
`[plots] pol_colour`<br>
`[plots] env_colour`<br>
`[trees] use_gene_colour`<br>
`[trees] maincolour`<br>
`[trees] highlightcolour`<br>
`[trees] outgroupcolour`<br>
`[trees] dpi`<br>
`[trees] format`<br>

Generates an image file for each file generated in the makeGroupTrees step, using ete3 (http://etetoolkit.org). Newly identified sequences are marked with "~" and shown in a different colour.

By default, newly identified sequences are shown in the colours specified in `plots_gag_colour`, `plots_pol_colour` and `plots_env_colour`; this requires `trees_use_gene_colour` to be set to True in the `pipeline.ini`. Alternatively, a fixed colour can be used by setting `trees_use_gene_colour` to False and setting `trees_highlightcolour`. The text colour of the reference sequences (default black) can be set using `trees_maincolour` and the colour of the outgroup using `trees_outgroupcolour`.

The output file DPI can be specified using `trees_dpi` and the format (which can be png, svg, pdf or jpg) using `trees_format`.
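The colour rules above can be illustrated with a small, self-contained sketch. The section and option names follow the parameter list for this function, but the values are placeholders, the `label_colour` helper and its `kind` argument are hypothetical, and this is not the pipeline's own implementation.

```python
import configparser

# Placeholder values; only the section and option names come from the docs.
INI_TEXT = """
[plots]
gag_colour = orange
pol_colour = blue
env_colour = green

[trees]
use_gene_colour = True
maincolour = black
highlightcolour = red
outgroupcolour = grey
dpi = 300
format = png
"""


def label_colour(config, gene, kind):
    """Pick a label colour for a 'new', 'outgroup' or 'reference' sequence."""
    if kind == "outgroup":
        return config.get("trees", "outgroupcolour")
    if kind == "reference":
        return config.get("trees", "maincolour")
    # New sequences: per-gene colour if enabled, otherwise the fixed highlight.
    if config.getboolean("trees", "use_gene_colour"):
        return config.get("plots", gene + "_colour")
    return config.get("trees", "highlightcolour")


config = configparser.ConfigParser()
config.read_string(INI_TEXT)
per_gene = label_colour(config, "pol", "new")

config.set("trees", "use_gene_colour", "False")
fixed = label_colour(config, "pol", "new")
print(per_gene, fixed)
```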

#### makeSummaryFastas<a name="makeSummaryFastas"></a>

**Input Files**<br>
`group_fastas.dir/GENE_(.*_)GENUS.fasta`<br>
`group_trees.dir/GENE_(.*_)GENUS.tre`<br>
`ERVsearch/phylogenies/summary_phylogenies/GENE_GENUS.fasta`<br>
`ERVsearch/phylogenies/group_phylogenies/(.*)_GENUS_GENE.fasta`<br>

**Output Files**<br>
`summary_fastas.dir/GENE_GENUS.fasta`<br>
`summary_fastas.dir/GENE_GENUS_A.fasta`<br>

**Parameters**<br>
`[paths] path_to_ERVsearch`<br>

Monophyletic groups of newly identified ERVs are identified in the group phylogenetic trees generated in makeGroupTrees. For each of these groups, a single sequence (the longest) is selected as representative. The representative sequences are combined with the FASTA files in `ERVsearch/phylogenies/summary_phylogenies`, which contain representative sequences for each retroviral gene and genus. These files are then extended to include further reference sequences from the same small group as the newly identified sequences.

For example, if one MLV-like pol and one HERVF-like pol were identified in the gamma genus, the gamma_pol.fasta summary FASTA file would contain:

* The new MLV-like pol sequence
* The new HERVF-like pol sequence
* The reference sequences from `ERVsearch/phylogenies/group_phylogenies/MLV-like_gamma_pol.fasta` - highly related sequences from the MLV-like group
* The reference sequences from `ERVsearch/phylogenies/group_phylogenies/HERVF-like_gamma_pol.fasta` - highly related sequences from the HERVF-like group
* The reference sequences from `ERVsearch/phylogenies/summary_phylogenies/gamma_pol.fasta` - a less detailed but more diverse set of gammaretroviral pol ORFs
* An epsilonretrovirus outgroup

This ensures sufficient detail in the groups of interest while avoiding excessive detail in groups where nothing new has been identified.

These FASTA files are saved as `GENE_GENUS.fasta`.
 
The files are aligned using the MAFFT fftns algorithm (https://mafft.cbrc.jp/alignment/software/manual/manual.html) to generate the `GENE_GENUS_A.fasta` aligned output files.
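The selection of monophyletic clusters and their representatives can be sketched as follows. This illustrates the idea only and is not the pipeline's implementation: the tree is written as nested tuples as a stand-in for a parsed `.tre` file, the sequence names and lengths are hypothetical, and the result depends on how the tree is rooted.

```python
def leaves(node):
    """Return all leaf names under a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return [node]
    names = []
    for child in node:
        names.extend(leaves(child))
    return names


def new_clusters(node):
    """Return the largest clades whose leaves are all new ("~") sequences."""
    names = leaves(node)
    if all("~" in name for name in names):
        return [names]
    if isinstance(node, str):
        return []  # a reference leaf on its own
    clusters = []
    for child in node:
        clusters.extend(new_clusters(child))
    return clusters


def representatives(tree, lengths):
    """Map the longest member of each cluster to the cluster size."""
    reps = {}
    for cluster in new_clusters(tree):
        longest = max(cluster, key=lambda name: lengths[name])
        reps[longest] = len(cluster)
    return reps


# Hypothetical rooted tree and ORF lengths.
tree = ((("orf1~", "orf2~"), "MLV_ref"), ("HERVF_ref", "orf3~"))
lengths = {"orf1~": 900, "orf2~": 1200, "orf3~": 750}
reps = representatives(tree, lengths)
print(reps)
```

Here `orf1~` and `orf2~` form one cluster represented by the longer `orf2~`, while `orf3~` is a cluster of one. The stored cluster sizes correspond to the counts later shown in the summary tree labels.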

#### makeSummaryTrees<a name="makeSummaryTrees"></a>

**Input Files**<br>
`summary_fastas.dir/GENE_GENUS_A.fasta`<br>

**Output Files**<br>
`summary_trees.dir/GENE_GENUS.tre`<br>

**Parameters**<br>
None<br>

Builds a phylogenetic tree, using the FastTree2 algorithm (http://www.microbesonline.org/fasttree) with the default settings plus the GTR model, for the aligned summary FASTA files generated by the makeSummaryFastas function.

#### drawSummaryTrees<a name="drawSummaryTrees"></a>

**Input Files**<br>
`summary_trees.dir/GENE_GENUS.tre`<br>

**Output Files**<br>
`summary_trees.dir/GENE_GENUS.FMT` (FMT = png, svg, pdf or jpg)<br>

**Parameters**<br>
`[plots] gag_colour`<br>
`[plots] pol_colour`<br>
`[plots] env_colour`<br>
`[trees] use_gene_colour`<br>
`[trees] maincolour`<br>
`[trees] highlightcolour`<br>
`[trees] outgroupcolour`<br>
`[trees] dpi`<br>
`[trees] format`<br>

Generates an image file for each file generated in the makeSummaryTrees step, using ete3 (http://etetoolkit.org). Newly identified sequences are marked with "~" and shown in a different colour. Monophyletic groups of newly identified ERVs have been collapsed (by choosing a single representative sequence); the number of sequences in each group is added to the label and represented by the size of the node tip.

By default, newly identified sequences are shown in the colours specified in `plots_gag_colour`, `plots_pol_colour` and `plots_env_colour`; this requires `trees_use_gene_colour` to be set to True in the `pipeline.ini`. Alternatively, a fixed colour can be used by setting `trees_use_gene_colour` to False and setting `trees_highlightcolour`. The text colour of the reference sequences (default black) can be set using `trees_maincolour` and the colour of the outgroup using `trees_outgroupcolour`.

The output file DPI can be specified using `trees_dpi` and the format (which can be png, svg, pdf or jpg) using `trees_format`.

#### summariseClassify<a name="summariseClassify"></a>

**Input Files**<br>

**Output Files**<br>

**Parameters**<br>

#### Classify<a name="Classify"></a>

**Input Files** None<br>

**Output Files** None<br>

**Parameters** None<br>

Helper function to run all screening functions and classification functions (all functions prior to this point).
