# The geNomad pipeline

When you execute the `genomad end-to-end` command, geNomad runs a series of modules sequentially to produce the final output, which contains the identified plasmids and viruses in the input FASTA file.

```{image} _static/figures/pipeline_overview.svg
:width: 550
:class: no-scaled-link
:align: center
```

You can also execute each module separately, which gives you access to some advanced parameters that are not available through the `genomad end-to-end` command:

```bash
genomad download-database .
genomad annotate metagenome.fna genomad_output genomad_db
genomad find-proviruses metagenome.fna genomad_output genomad_db
genomad marker-classification metagenome.fna genomad_output genomad_db
genomad nn-classification metagenome.fna genomad_output
genomad aggregated-classification metagenome.fna genomad_output
genomad score-calibration metagenome.fna genomad_output
genomad summary metagenome.fna genomad_output
```
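If you run the modules one by one often, the commands above can be wrapped in a small script that stops as soon as one module fails. The sketch below is only an illustration: the `run_modules` function and the `DRY_RUN` variable are hypothetical helpers, not part of geNomad, and the database is assumed to be downloaded already. With `DRY_RUN=1`, the commands are printed instead of executed, which makes it easy to review them first.

```bash
set -eu  # abort on the first failing command or unset variable

# Hypothetical wrapper (not part of geNomad): runs the modules in pipeline order.
# With DRY_RUN=1 the commands are printed instead of executed.
run_modules() {
  local input="$1" output="$2" db="$3" module
  # Modules that take the database directory as a third argument
  for module in annotate find-proviruses marker-classification; do
    ${DRY_RUN:+echo} genomad "$module" "$input" "$output" "$db"
  done
  # Modules that only need the input FASTA and the output directory
  # (score-calibration is optional and can be dropped from this list)
  for module in nn-classification aggregated-classification score-calibration summary; do
    ${DRY_RUN:+echo} genomad "$module" "$input" "$output"
  done
}

# Print the seven commands without running them
DRY_RUN=1 run_modules metagenome.fna genomad_output genomad_db
```

Calling `run_modules metagenome.fna genomad_output genomad_db` without `DRY_RUN` executes the modules for real.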

In most cases, the `genomad end-to-end` command is sufficient. Still, it helps to understand what happens when the full pipeline runs. The sections below explain what each module does, so that you can follow how geNomad processes your input sequences to identify plasmids and viruses.

(annotate-module)=
## `annotate`

```{image} _static/figures/annotate.svg
:width: 800
:class: no-scaled-link
:align: center
```

The `annotate` module has two main functions: predicting genes in the input sequences using [`prodigal-gv`](https://github.com/apcamargo/prodigal-gv/) and assigning these predicted genes to marker protein families from a dataset of 227,897 profiles specific to chromosomes, plasmids, or viruses using [`MMseqs2`](https://github.com/soedinglab/MMseqs2/). This marker dataset provides comprehensive metadata that can aid in the downstream interpretation of the results. It includes:

- Functional annotations via [Pfam](https://www.ebi.ac.uk/interpro/entry/pfam/#table), [COG](https://www.ncbi.nlm.nih.gov/research/cog), [TIGRFAM](https://www.ncbi.nlm.nih.gov/genome/annotation_prok/tigrfams/), and [KEGG Orthology](https://www.genome.jp/kegg/ko.html) accessions.
- Hallmark genes, which are involved in key plasmid or virus functions.
- Conjugation genes, through [CONJscan](https://link.springer.com/protocol/10.1007/978-1-4939-9877-7_19) accessions.
- Antimicrobial resistance genes, via [AMRFinder](https://www.ncbi.nlm.nih.gov/pathogens/hmm/) accessions.
- Universal single-copy genes (USCGs) that are typically present in chromosomes and rare in plasmids and viruses, identified using [BUSCO](https://busco.ezlab.org/).
- Virus taxonomy, through the use of [ICTV's VMR number 19](https://talk.ictvonline.org/taxonomy/vmr/m/vmr-file-repository/13426) lineages.

The `annotate` module generates two primary outputs: taxonomic assignments of the input sequences (you can find an explanation of how geNomad assigns sequences to viral taxa [here](taxonomic_assignment.md)), and gene-level annotations (as shown in the [Quickstart](understanding-the-outputs) example). These outputs are utilized by the `find-proviruses`, `marker-classification`, and `summary` modules.

## `find-proviruses`

```{image} _static/figures/find_proviruses.svg
:width: 800
:class: no-scaled-link
:align: center
```

The `find-proviruses` module is designed to identify proviral regions within host sequences. To achieve this, it uses a conditional random field (CRF) model that takes gene annotations generated by the `annotate` module and demarcates regions that are enriched in viral-specific markers, surrounded by host-specific markers. To refine the boundaries of proviruses, geNomad leverages the fact that phages often integrate next to tRNAs and that integrases are typically found at the edges of integrated phages. This is achieved by extending the edges until neighboring tRNAs (identified with [`ARAGORN`](http://www.ansikte.se/ARAGORN/)) and/or integrases (identified with `MMseqs2`) are reached. For a detailed explanation of geNomad's provirus identification algorithm, please refer to our [provirus identification documentation](provirus_identification.md).

## `marker-classification`

<br>

```{image} _static/figures/marker_classification.svg
:width: 640
:class: no-scaled-link
:align: center
```
<br>

The `marker-classification` module classifies sequences as chromosomes, plasmids, or viruses based on their marker content. To do so, it takes the gene annotations and calculates a set of numerical features that describe the gene structure and marker content of each sequence to be classified. These features include gene density, as well as the frequency of chromosome, plasmid, and virus markers.

Below is an example of the features that are computed for five input sequences. You can learn more about how each feature is calculated by visiting our [marker features documentation](marker_features.md).

| seq_name   | strand_switch_rate | coding_density | no_rbs_freq | sd_bacteroidetes_rbs_freq | sd_canonical_rbs_freq | tatata_rbs_freq | cc_marker_freq | cp_marker_freq | cv_marker_freq | pc_marker_freq | pp_marker_freq | pv_marker_freq | vc_marker_freq | vp_marker_freq | vv_marker_freq | c_marker_freq | p_marker_freq | v_marker_freq | median_c_spm | median_p_spm | median_v_spm | v_vs_c_score_logistic | v_vs_p_score_logistic | p_vs_c_score_logistic | gv_marker_freq |
|------------|-------------------:|---------------:|------------:|--------------------------:|----------------------:|----------------:|---------------:|---------------:|---------------:|---------------:|---------------:|---------------:|---------------:|---------------:|---------------:|--------------:|--------------:|--------------:|-------------:|-------------:|-------------:|----------------------:|----------------------:|----------------------:|---------------:|
| sequence_1 | 0.0000             | 0.9049         | 1.0000      | 0.0000                    | 0.0000                | 0.0000          | 0.0000         | 0.0000         | 0.0000         | 0.0000         | 0.0000         | 0.0000         | 0.0000         | 0.0000         | 0.0000         | 0.0000        | 0.0000        | 0.0000        | 0.0000       | 0.0000       | 0.0000       | 0.5000                | 0.5000                | 0.5000                | 0.0000         |
| sequence_2 | 0.0000             | 0.7845         | 0.5000      | 0.0000                    | 0.0000                | 0.0000          | 0.0000         | 0.0000         | 0.0000         | 0.0000         | 0.0000         | 0.0000         | 0.0000         | 0.5000         | 0.0000         | 0.0000        | 0.0000        | 0.5000        | 0.0278       | 0.4961       | 0.8678       | 0.6630                | 0.5914                | 0.5762                | 0.0000         |
| sequence_3 | 1.0000             | 0.8704         | 1.0000      | 0.0000                    | 0.0000                | 0.0000          | 0.0000         | 0.0000         | 0.0000         | 0.0000         | 0.0000         | 0.0000         | 0.0000         | 0.5000         | 0.0000         | 0.0000        | 0.0000        | 0.5000        | 0.0086       | 0.2801       | 0.9599       | 0.6903                | 0.6557                | 0.5392                | 0.0000         |
| sequence_4 | 0.0000             | 0.8087         | 0.0000      | 0.0000                    | 1.0000                | 0.0000          | 0.0000         | 0.0000         | 0.0000         | 0.0000         | 0.0000         | 0.0000         | 0.0000         | 1.0000         | 0.0000         | 0.0000        | 0.0000        | 1.0000        | 0.0027       | 0.2087       | 0.9780       | 0.8398                | 0.8064                | 0.5571                | 0.0000         |
| sequence_5 | 0.0000             | 0.9861         | 0.5000      | 0.0000                    | 0.5000                | 0.0000          | 0.0000         | 0.0000         | 0.0000         | 0.0000         | 0.0000         | 0.0000         | 0.0000         | 0.0000         | 1.0000         | 0.0000        | 0.0000        | 1.0000        | 0.0043       | 0.0000       | 1.0000       | 0.8474                | 0.8479                | 0.4989                | 0.0000         |

`marker-classification` then feeds these features to a tree ensemble classification algorithm, trained with [`XGBoost`](https://xgboost.readthedocs.io/), which produces three scores for each sequence. These scores represent the model's confidence that the sequence represents a chromosome, plasmid, or virus.

| seq_name   | chromosome_score | plasmid_score | virus_score |
|------------|-----------------:|--------------:|------------:|
| sequence_1 | 0.5420           | 0.1397        | 0.3183      |
| sequence_2 | 0.2172           | 0.2148        | 0.5680      |
| sequence_3 | 0.2937           | 0.1957        | 0.5106      |
| sequence_4 | 0.0524           | 0.0718        | 0.8758      |
| sequence_5 | 0.1621           | 0.0168        | 0.8211      |

In the example shown above, the model classified the first sequence as a chromosome and the remaining sequences as viruses. With regard to the model's confidence in its classification, it is more certain that `sequence_4` and `sequence_5` are viruses (with virus scores above 0.8) than it is of `sequence_2` and `sequence_3` (with virus scores around 0.5).
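Reading the table this way amounts to picking the highest of the three scores for each sequence. As a minimal sketch, assuming the scores are available as a whitespace- or tab-separated table with the same four columns and a header row, this can be done with `awk`. The `top_class` function name is hypothetical and the rows are copied from the example above.

```bash
# Print the highest-scoring class and its score for each sequence.
# Reads a whitespace- or tab-separated table with a header row from stdin.
top_class() {
  awk 'NR > 1 {
    label = "chromosome"; best = $2 + 0
    if ($3 + 0 > best) { label = "plasmid"; best = $3 + 0 }
    if ($4 + 0 > best) { label = "virus"; best = $4 + 0 }
    printf "%s\t%s\t%.4f\n", $1, label, best
  }'
}

# Example scores from the table above
top_class <<'EOF'
seq_name chromosome_score plasmid_score virus_score
sequence_1 0.5420 0.1397 0.3183
sequence_2 0.2172 0.2148 0.5680
sequence_3 0.2937 0.1957 0.5106
sequence_4 0.0524 0.0718 0.8758
sequence_5 0.1621 0.0168 0.8211
EOF
```

For these rows, the sketch labels `sequence_1` as a chromosome and the other four sequences as viruses, and the printed score shows how confident each call is.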

(nn-classification-module)=
## `nn-classification`

<br>

```{image} _static/figures/nn_classification.svg
:width: 640
:class: no-scaled-link
:align: center
```
<br>

The `nn-classification` module also classifies input sequences into chromosomes, plasmids, or viruses, similar to the `marker-classification` module. However, unlike the latter, it doesn't rely on marker information. Instead, it directly processes nucleotide sequences using a neural network. The nucleotide sequences are first encoded into a numerical matrix, which is then fed into an [IGLOO](https://arxiv.org/abs/1807.03402) neural network. The network is capable of detecting sequence features that distinguish chromosomes, plasmids, and viruses. Finally, the module produces confidence scores for the classifications.

| seq_name   | chromosome_score | plasmid_score | virus_score |
|------------|-----------------:|--------------:|------------:|
| sequence_1 | 0.3307           | 0.5597        | 0.1096      |
| sequence_2 | 0.0669           | 0.1411        | 0.7920      |
| sequence_3 | 0.6720           | 0.1340        | 0.1940      |
| sequence_4 | 0.2923           | 0.2830        | 0.4247      |
| sequence_5 | 0.0591           | 0.1545        | 0.7864      |

If you're interested in learning more about how the neural network processes and classifies nucleotide sequences, check out the [detailed explanation](nn_classification.md).

## `aggregated-classification`

<br>

```{image} _static/figures/aggregated_classification.svg
:width: 600
:class: no-scaled-link
:align: center
```
<br>

The `aggregated-classification` module combines the outputs of `marker-classification` and `nn-classification` to produce a set of scores that takes advantage of the strengths of both classifiers.

| seq_name   | chromosome_score | plasmid_score | virus_score |
|------------|-----------------:|--------------:|------------:|
| sequence_1 | 0.2169           | 0.5661        | 0.2170      |
| sequence_2 | 0.0513           | 0.1541        | 0.7946      |
| sequence_3 | 0.4592           | 0.2033        | 0.3375      |
| sequence_4 | 0.0402           | 0.0446        | 0.9153      |
| sequence_5 | 0.0233           | 0.0276        | 0.9491      |

To achieve this, it employs an attention mechanism that weights the contribution of each classifier, so that the contribution of `marker-classification` increases in proportion to the fraction of genes assigned to markers. For more details on this process, please refer to the [score aggregation documentation](score_aggregation.md).
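To see where the two classifiers disagree in the example tables on this page, and which call the aggregated scores support, the top-scoring class of each table can be listed side by side. The sketch below does not reproduce geNomad's aggregation; it only compares the example scores. The `compare_calls` function and the five-column input layout (classifier, sequence, chromosome, plasmid, and virus scores) are assumptions made for illustration.

```bash
# For each sequence, print the top-scoring class according to the
# marker, nn, and aggregated example scores (one column per classifier).
compare_calls() {
  awk '
    # Return the label of the largest of the three scores
    function top(c, p, v) {
      if (c >= p && c >= v) return "chromosome"
      return (p >= v) ? "plasmid" : "virus"
    }
    {
      call[$2, $1] = top($3 + 0, $4 + 0, $5 + 0)
      # Remember the order in which sequences first appear
      if (!($2 in seen)) { seen[$2] = 1; order[++n] = $2 }
    }
    END {
      for (i = 1; i <= n; i++) {
        s = order[i]
        printf "%s\t%s\t%s\t%s\n", s, call[s, "marker"], call[s, "nn"], call[s, "aggregated"]
      }
    }'
}

# Rows copied from the three example score tables
compare_calls <<'EOF'
marker sequence_1 0.5420 0.1397 0.3183
marker sequence_2 0.2172 0.2148 0.5680
marker sequence_3 0.2937 0.1957 0.5106
marker sequence_4 0.0524 0.0718 0.8758
marker sequence_5 0.1621 0.0168 0.8211
nn sequence_1 0.3307 0.5597 0.1096
nn sequence_2 0.0669 0.1411 0.7920
nn sequence_3 0.6720 0.1340 0.1940
nn sequence_4 0.2923 0.2830 0.4247
nn sequence_5 0.0591 0.1545 0.7864
aggregated sequence_1 0.2169 0.5661 0.2170
aggregated sequence_2 0.0513 0.1541 0.7946
aggregated sequence_3 0.4592 0.2033 0.3375
aggregated sequence_4 0.0402 0.0446 0.9153
aggregated sequence_5 0.0233 0.0276 0.9491
EOF
```

In this example, the classifiers disagree on `sequence_1` and `sequence_3`, and in both cases the aggregated top class matches the `nn-classification` call. The three classifiers agree on the remaining sequences.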

## `score-calibration`

<br>

```{image} _static/figures/score_calibration.svg
:width: 670
:class: no-scaled-link
:align: center
```
<br>

The scores generated by `marker-classification`, `nn-classification`, and `aggregated-classification` indicate the confidence of these models in their predictions, with higher values reflecting greater confidence. However, these values are not equivalent to actual probabilities. For example, a sequence with an uncalibrated virus score of 0.87 does not have an 87% chance of being a virus.

`score-calibration` is an optional module that transforms the raw scores produced by the previous modules into estimated probabilities. After calibration, a sequence with a virus score of 0.87 has a probability close to 87% of being a virus. If you want to understand how the `score-calibration` module works, refer to [its documentation](score_calibration.md). To enable score calibration when using the `end-to-end` command, use the `--enable-score-calibration` parameter.

## `summary`

```{image} _static/figures/summary.svg
:width: 335
:class: no-scaled-link
:align: center
```
<br>

The `summary` module serves three main functions: (1) filtering sequences based on various criteria to present users with the most reliable predictions (read more about the filtering process [here](post_classification_filtering.md)), (2) summarizing the data generated by all previous modules for identified plasmids and viruses, and (3) writing FASTA files containing nucleotide and protein sequences for the identified plasmids and viruses, accompanied by gene annotation files. For examples of the plasmid and virus summary tables, refer to the [Quickstart](understanding-the-outputs) guide.
