KEGG-Decoder
================================================================
### Description ###
Designed to parse KEGG-KOALA outputs (including blastKOALA, ghostKOALA, and KOFAMSCAN) to determine the completeness of various metabolic pathways.

* This module was constructed using manually curated "canonical" pathways described as part of KEGG Pathway Maps. For information regarding which KOs are used to predict a metabolic pathway, see KOALA_definitions.txt.

* If you are interested in a certain pathway and its genes are listed in KEGG, it is possible to add it to the file (with some Python scripting).
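
As a rough illustration of what such scripting might involve, the sketch below scores one pathway from a genome's set of KOs. It uses the ubiquinol-cytochrome c reductase rule described in the V1.0.8.2 change log entry (K00411 plus either K00410 or both K00412 and K00413). The function name and the weighting of each component are assumptions made for illustration; they are not taken from the KEGG-Decoder source.

```python
def ubiquinol_cytochrome_c_reductase(kos):
    """Return a completeness fraction (0.0 to 1.0) for one genome.

    `kos` is an iterable of KO identifiers assigned to the genome.
    The weights are hypothetical and only illustrate the idea.
    """
    kos = set(kos)
    score = 0.0
    # K00411 counts for half of the complex in this sketch.
    if "K00411" in kos:
        score += 0.5
    # The other half is satisfied by the fused K00410, or by the
    # separate K00412 and K00413 (a quarter each).
    if "K00410" in kos:
        score += 0.5
    else:
        score += 0.25 * len({"K00412", "K00413"} & kos)
    return score


print(ubiquinol_cytochrome_c_reductase({"K00411", "K00412", "K00413"}))
```

A new pathway would follow the same pattern: one function that takes the KOs of a genome and returns a fraction.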

### KEGG-Decoder Demonstration and Hands-on tutorial ###

[YouTube video](https://youtu.be/1v4UzjE7K2g?t=962) on how KEGG-Decoder interfaces with KEGG and how the heatmap is organized.

Hands-on tutorial [![Binder](https://mybinder.org/badge_logo.svg)](https://gesis.mybinder.org/binder/v2/gh/biovcnet/bvcn-binder-kegg-koala/master?urlpath=lab)

**Developed as part of the [BVCN](https://biovcnet.github.io/)**

### Please Cite ###
If you find that using KEGG Decoder to process your data has been useful, please cite this manuscript. If you are using KEGG Decoder to make figures then definitely cite this manuscript!

* [Graham ED, Heidelberg JF, Tully BJ. (2018) Potential for primary productivity in a globally-distributed bacterial phototroph. ISME J 350, 1–6](https://www.nature.com/articles/s41396-018-0091-3)


### Dependencies ###

* [Pandas](http://pandas.pydata.org/pandas-docs/stable/install.html)

* [Seaborn](http://seaborn.pydata.org/installing.html)

* [matplotlib](http://matplotlib.org/users/installing.html)

* [tanglegram](https://github.com/schlegelp/tanglegram)

## Installation ##
<strong>We recommend installing KEGG-Decoder in its own virtual environment (conda or Python). The current pip install will set the various dependencies (matplotlib, seaborn, pandas, etc.) to versions that are known to work with this version of the script. This will likely revert several dependencies on your system to older versions. </strong>

This is partly to avoid a bug in matplotlib=3.0.4 that cuts off the top and bottom rows of the `static` image output.

```
python3 -m pip install KEGGDecoder
```

## Upgrade ##
```
pip install --upgrade KEGGDecoder
```

## Procedure ##
* Start with a protein FASTA file (INPUT_PROTEIN.fasta). This file can contain multiple genomes combined. Be sure the headers in your submitted FASTA file group genomes together: KEGG-decoder.py groups sequences by the name in the FASTA header before the first underscore (_).
```
For example
>NORP9_1
>NORP9_2
>NORP9_3
>NORP10_1
>NORP10_2
>NORP10_3
In the output this produces two rows of output, one for genome NORP9 and one for genome NORP10 in the list and heat map
```
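
The grouping rule above can be sketched in a few lines of Python. This is a minimal illustration of splitting on the first underscore, not the code KEGG-Decoder itself runs; the function name `group_by_genome` is hypothetical.

```python
from collections import defaultdict


def group_by_genome(headers):
    """Group FASTA headers by the text before the first underscore."""
    groups = defaultdict(list)
    for header in headers:
        # Drop the leading ">" and any description after whitespace.
        name = header.lstrip(">").split()[0]
        genome = name.split("_", 1)[0]
        groups[genome].append(name)
    return dict(groups)


headers = [">NORP9_1", ">NORP9_2", ">NORP9_3",
           ">NORP10_1", ">NORP10_2", ">NORP10_3"]
groups = group_by_genome(headers)
print(len(groups))  # 2 genomes: NORP9 and NORP10
```

Note that a genome name containing an underscore would be cut at that underscore, so it is worth checking headers before submitting them.
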
* Process protein sequences through KEGG-KOALA ([GhostKoala](https://www.kegg.jp/ghostkoala/), [BlastKoala](https://www.kegg.jp/blastkoala/), or [KOFAMSCAN](https://www.genome.jp/tools/kofamkoala/)) and download the tab-delimited KO assignment text file (KOALA_OUTPUT.txt)
* The KOALA output text file should look like this:
```
NORP9_1	K00370
NORP9_2	K00371
```
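
For a quick sanity check before running KEGG-decoder, the KO assignment file can be summarised per genome. The sketch below is only an illustration and assumes that genes without an assignment appear with an empty or missing second column; `read_koala` is a hypothetical helper name.

```python
from collections import defaultdict


def read_koala(lines):
    """Collect the set of KOs per genome from tab-delimited KOALA lines."""
    genome_kos = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or not fields[1]:
            continue  # gene without a KO assignment (assumed format)
        genome = fields[0].split("_", 1)[0]
        genome_kos[genome].add(fields[1])
    return dict(genome_kos)


example = ["NORP9_1\tK00370\n", "NORP9_2\tK00371\n",
           "NORP9_3\n", "NORP10_1\tK00370\n"]
for genome, kos in read_koala(example).items():
    print(genome, sorted(kos))
```

With a real file, `read_koala(open("KOALA_OUTPUT.txt"))` would give the same kind of dictionary.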

* Run KEGG-decoder
```
KEGG-decoder --input (-i) <KOALA_OUTPUT.txt> --output (-o) <FUNCTION_OUT.list> --vizoption (-v) <static/interactive/tanglegram>
```

* FUNCTION_OUT.list is a TSV version of the heat map. The first row contains pathway/process names; subsequent rows contain the submitted groups/genomes and the fractional completeness of each pathway/process.

* 'static' figure output is an SVG file, function_heatmap.svg. Each distinct identifier before the underscore in the FASTA file will have a row.

* 'interactive' figure output is an HTML file, function_heatmap.html. Each distinct identifier before the underscore in the FASTA file will have a row. The file can be loaded into a browser, where hovering over a cell with the mouse displays its value. Draw a box to zoom in on specific regions. Designed to allow easier parsing of larger sets of genomes.

* 'tanglegram' -- For slightly more advanced analysis, KEGGDecoder can generate a tanglegram to compare the order of two trees: one generated by clustering the KEGG metabolic outputs and a Newick format (presumably phylogenetic) tree provided by the user. At least 3 input genomes are required, but more are recommended. Genome names must match between the two trees.
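
Before requesting a tanglegram, it can help to confirm that the tree and the KOALA output use the same genome names. The following sketch is an assumption-laden helper, not part of KEGGDecoder: it pulls leaf names out of a Newick string with a simplified regular expression (quoted labels and comments are not handled) and reports mismatches.

```python
import re


def newick_leaf_names(newick):
    """Extract leaf labels: text after "(" or "," up to ":", ",", ")" or ";"."""
    found = re.findall(r"[(,]\s*([^(),:;]+)", newick)
    return {name.strip() for name in found if name.strip()}


def check_tanglegram_inputs(genomes, newick):
    """Return a list of problems; an empty list means the inputs look usable."""
    genomes = set(genomes)
    leaves = newick_leaf_names(newick)
    problems = []
    if len(genomes) < 3:
        problems.append("fewer than 3 genomes in the KEGG-Decoder output")
    if genomes - leaves:
        problems.append("missing from tree: " + ", ".join(sorted(genomes - leaves)))
    if leaves - genomes:
        problems.append("missing from output: " + ", ".join(sorted(leaves - genomes)))
    return problems


# Illustrative tree; names and branch lengths are made up.
tree = "((NORP9:0.1,NORP10:0.2):0.05,NORP11:0.3);"
print(check_tanglegram_inputs(["NORP9", "NORP10", "NORP11"], tree))
```
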

KEGG-Expander
================================================================
### UNDER CONSTRUCTION ###
While KEGG-decoder is now a module, KEGG-expander and Decoder_and_Expand will still require running the Python scripts. Using the FUNCTION_OUT.list file will allow you to still make the intended final figure.

### Description ###
Designed to expand on the output from KEGG-Decoder. Within KEGG there is a lack of information regarding several processes of interest. To overcome these shortcomings, a small targeted HMM database was created (and will be updated) to fill in gaps of information.

HMM models are predominantly from the Pfam database, but when necessary are pulled from TIGRfam and SFam.

### Dependencies ###
* [HMMER3](http://www.hmmer.org/)

### Additional Information ###
* Details as to which HMM models and genes are in each described pathway or process can be found in the supporting document, Pfam_definitions.txt
* In version 0.7, KEGG-Expander targets several transporter subunits to link with metal transporter columns in KEGG-Decoder. Removed the peptidase entries due to ineffective interpretation.
* In version 0.6, KEGG-Expander targets: phototrophy via proteorhodopsin, (some) peptidases, alternative nitrogenases, ammonia transport, DMSP lyase, DMSP synthase, and ferrioxamine biosynthesis
* Accuracy depends on the model used. A bit score cutoff of 75 (approximately an E-value <10E-20) does not always give the best matches. For example, the rhodopsin model does not distinguish between proteorhodopsin and other light-driven rhodopsins (we use a tree to determine the proteorhodopsins). Several of the DMSP lyases will match metalloproteases at low bit scores, so the script has been modified to apply a more stringent bit score (>500) in this instance. The TIGRfam models for the Fe-only and vanadium nitrogenases generally match the same protein.
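
The idea of a general cutoff with stricter per-model cutoffs can be sketched as below. This is not the logic in KEGG-expander.py: the model names are hypothetical, the example hits are made up, and treating the cutoffs as inclusive is an assumption. The column positions follow the HMMER3 `--tblout` layout (target name, accession, query name, accession, full-sequence E-value, full-sequence score).

```python
DEFAULT_CUTOFF = 75.0
# Hypothetical model name; the real database may label it differently.
MODEL_CUTOFFS = {"DMSP_lyase_model": 500.0}


def filter_hits(tbl_lines, default=DEFAULT_CUTOFF, overrides=MODEL_CUTOFFS):
    """Keep (protein, model, score) for hits at or above the model's cutoff."""
    hits = []
    for line in tbl_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comment and blank lines
        fields = line.split()
        protein, model, score = fields[0], fields[2], float(fields[5])
        if score >= overrides.get(model, default):
            hits.append((protein, model, score))
    return hits


# Made-up rows, truncated to the first seven tblout columns.
example = [
    "# target name  accession  query name  accession  E-value  score  bias",
    "NORP9_12 - DMSP_lyase_model - 1.0e-30 110.2 0.1",
    "NORP9_40 - DMSP_lyase_model - 1.0e-160 540.8 0.0",
    "NORP10_7 - rhodopsin_model - 2.0e-25 95.3 0.2",
    "NORP10_9 - rhodopsin_model - 3.0e-10 40.0 0.0",
]
for hit in filter_hits(example):
    print(hit)
```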

## Procedure ##
* Using a protein FASTA file with the same gene name set-up as described above - GENOMEID_Number - run a search against the custom HMM database
```
hmmsearch --tblout <NAME>_expanderv0.7.tbl -T 75 /path/to/BioData/KEGGDecoder/HMM_Models/expander_dbv0.7.hmm <INPUT_PROTEIN.fasta>
```
* The HMM results table is used to construct the heatmap by running KEGG-expander.py
```
python KEGG-expander.py <NAME>_expanderv0.7.tbl <HMM_OUT.list>
```
* HMM_OUT.list is a text version of the heat map. The first row contains pathway/process names; subsequent rows contain the submitted groups/genomes and the fractional completeness of each pathway/process.

* Figure is output as hmm_heatmap.svg. Each distinct identifier before the underscore in the FASTA file will have a row

Decoder and Expand
================================================================
### Description ###
Combines the KEGG and HMM heatmaps into a final heat map.

### Procedure ###
* Run the script Decode_and_Expand.py
```
python Decode_and_Expand.py <FUNCTION_OUT.list> <HMM_OUT.list>
```
* Figure is output as decode-expand_heatmap.svg. Each distinct identifier before the underscore in the FASTA file will have a row
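
Conceptually, combining the two lists means joining them on the genome name so that each row carries both the KEGG and HMM columns. The sketch below shows one way to do this with the standard library. It is not the code in the script itself, and it assumes that the first header cell is a label such as `Function`, that values are plain decimals, and that a genome missing from one table should get 0.0 for that table's columns. The column names and values are made up.

```python
import csv
import io


def read_table(handle):
    """Read a tab-delimited list into {genome: {column: value}}."""
    reader = csv.reader(handle, delimiter="\t")
    columns = next(reader)[1:]
    return {row[0]: dict(zip(columns, map(float, row[1:]))) for row in reader if row}


def combine(kegg, hmm):
    """Join two tables on genome name, filling absent values with 0.0."""
    columns = []
    for table in (kegg, hmm):
        for row in table.values():
            for name in row:
                if name not in columns:
                    columns.append(name)
    combined = {}
    for genome in sorted(set(kegg) | set(hmm)):
        merged = {**kegg.get(genome, {}), **hmm.get(genome, {})}
        combined[genome] = {name: merged.get(name, 0.0) for name in columns}
    return combined


# In-memory stand-ins for FUNCTION_OUT.list and HMM_OUT.list.
kegg = read_table(io.StringIO("Function\tglycolysis\nNORP9\t1.0\nNORP10\t0.5\n"))
hmm = read_table(io.StringIO("Function\tproteorhodopsin\nNORP10\t1.0\n"))
combined = combine(kegg, hmm)
print(combined["NORP9"])
```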

Change Log
================================================================
## V1.2 ##
Added several new pathways including:

* PET degradation
* carbon storage, related to starch/glycogen & polyhydroxybutyrate
* phosphate storage, related to the reversible polyphosphate reaction. 

Part of summer research with Sheyla Aviles.

## V1.1 ##
Corrected typos identified by Chris Neely. Added more complete
pathway components for amino acid biosynthesis, identified by
Dr. Eric Webb.

* phenylalanine added K01713 pheC; cyclohexadienyl dehydratase OR K05359 ADT; arogenate/prephenate dehydratase OR K04518 pheA2; prephenate dehydratase
* tyrosine added K00220 tyrC; cyclohexadienyl/prephenate dehydrogenase OR K24018; cyclohexadienyl/prephenate dehydrogenase OR K15226 tyrAa; arogenate dehydrogenase

## V1.0.10 ##
Added the 20 amino acids. In most instances, only the last step in converting a precursor to the amino acid is assessed (except for valine, isoleucine, leucine, and tryptophan). The following amino acids share detection pathways:

* serine & glycine
* threonine & glycine
* valine & isoleucine
* phenylalanine & tyrosine
* aspartate & glutamate

## V1.0.6-1.0.8 ##
* Updates made as part of the Speeding Up Science Part 2 hackathon. Updates were made by Chris Neely, Jason Fell, and Marisa Lim.
* Changes include reduction of white space in the `static` output, removal of a minimum requirement for the `interactive` output, and increased functioning of `tanglegram` output. Specifically, `tanglegram` now uses complete-linkage Euclidean distance to determine the clusters on the KEGG-Decoder tree. This provides the best resolution for visualizing possible groups with similar functional capacity.
* In V1.0.8.2, corrected how the completeness of ubiquinol-cytochrome c reductase is determined. Previously, the script only checked for the presence of K00411 and K00410. K00410 is a fusion of K00412 and K00413 that is only present in a subset of Proteobacteria. Identified by Grayson Chadwick.
* In V1.0.8.1, a mismatch in the terms used to identify `bifunctional chitinase/lysozyme` would result in a `0` no matter whether K13381 was present. This has been corrected. Identified by Chris Neely.
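
For readers unfamiliar with the clustering terms used for `tanglegram`, the short sketch below shows what a complete-linkage Euclidean distance between two groups of genomes means: the largest pairwise Euclidean distance between their pathway-completeness profiles. It only illustrates the distance, not the full clustering, and the profiles are made-up values.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length completeness profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def complete_linkage(cluster_a, cluster_b):
    """Distance between clusters: the largest distance between any two members."""
    return max(euclidean(a, b) for a in cluster_a for b in cluster_b)


# Hypothetical profiles with two pathways per genome.
norp9, norp10, norp11 = [1.0, 0.0], [1.0, 0.5], [0.0, 1.0]
distance = complete_linkage([norp9, norp10], [norp11])
print(round(distance, 3))
```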

## V1.0.5 ##
Various upgrades to the tanglegram visualization and enhanced naming efficiency.

## V1.0.2 ##
* Fixed an issue with tanglegram support; this should also fix the issue with the pandas dependency.
* Added Na+-transporting NADH:ubiquinone oxidoreductase and several metal transporters. KEGG-Decoder added metal transporters for cobalt (CbiMQ), cobalt (CbtA), cobalt (CorA), nickel ABC-type transporter substrate-binding subunit (NirA), copper (copA), ferrous iron (FeoB), ferric iron ABC-type transporter substrate-binding subunit (AfuA), and Fe/Mn transporter (MntH).
* Additional metal transporter components were added through KEGG-expander: cobalt transporter (CbtB), copper-binding HMA (heavy-metal-associated) protein, and Fe, Zn, Mn permease (ZupT).
* Removed 'peptidases' from KEGG-expander due to the inability to discern intracellular from extracellular activity. We recommend using MetaSanity to identify extracellular peptidases.
* Updated the KEGG-expander HMM set to V0.7.

## V1.0 ##
KEGGDecoder can now be installed via pip install. KEGGDecoder now offers 2 visualization outputs: the classic 'static' version and
the new 'interactive' version, which opens a heatmap that you can zoom and interact with.
Contributions to V1.0 occurred as part of the Moore Foundation funded 'Speeding Up Science' hackathon, with contributions provided by: Taylor Reiter (UCDavis), Roth Conrad (GeorgiaTech), Jay Osvatic (UniVienna), Luiz Irber (UCDavis)

## V0.8 ##
Adds elements regarding arsenic reduction

## V0.7 ##
Clarifies elements of methane oxidation and adds additional methanol/alcohol dehydrogenase
to KEGG function search. Adds the serine pathway for formaldehyde assimilation

## V0.6 ##
Adds Bacterial Secretion Systems as described by KEGG, covering Type I, II, III, IV, Vabc, VI, Sec-SRP and Twin Arginine Targeting systems

## V0.5 ##
Adds parameters to force labels to be printed on heatmap. Includes functions
for sulfolipid biosynthesis (key gene sqdB) and C-P lyase

## V0.4 ##
Adds sections that more accurately represent anoxygenic photosynthesis (type-II and type-I reaction centers), adds NiFe hydrogenase Hyd-1 hyaABC, and corrects a typo that led to a missed assignment to hydrogen:quinone oxidoreductase

## V0.3 ##
Adds checks for: retinal biosynthesis, sulfite dehydrogenase (quinone), hydrazine dehydrogenase, hydrazine synthase, DMSP/DMS/DMSO cycling, cobalamin biosynthesis, competence-related DNA transport, anaplerotic reactions
