# SYNBICT
Synthetic Biology Curation Tools

![SYNBICT architecture diagram](synbict_architecture_diagram.png)

## Installation instructions

This project depends on Python 3.

To install, change directories to SYNBICT and run the command below. SYNBICT requires pySBOL2 version 1.3. At the time of writing there is no public release of this version, so you may need to clone pySBOL2 from GitHub (https://github.com/SynBioDex/pySBOL2) and install it manually.

`python setup.py install`

If you want to visualize circuits, you need to install matplotlib and the fork of dnaplotlib at https://github.com/nroehner/dnaplotlib.

`pip install matplotlib`

## Examples

### Command line annotation of FASTA file

`python sequences_to_features.py -n http://mynamespace.org -t ../test/genetic_nand.fa -s annotated -f ../test/cello_library.xml -l ../test/genetic_nand_log.txt -p -np`

The expected output for this example is an SBOL file named "genetic_nand_annotated.xml" that contains a component with displayId "Strain_4_MG1655_Genomic_NAND_Circuit_comp" that represents the annotated genome for an engineered strain of E. coli. Also included are the sub-components for all the features that SYNBICT annotated in this genome.

The log file specified with the -l argument should report 26 annotations made with SYNBICT (search for "DEBUG ; Annotated"). There is one expected warning due to a feature with the displayId "YFP_reporter" that is missing its sequence in the feature library file named "cello_library.xml".

Note the use of the optional arguments -p (for in-place annotation) and -np (for no pruning step). The -p argument is included because there is no previously existing SBOL component to copy prior to annotation: the SBOL component that is annotated is generated by SYNBICT through conversion of the target FASTA file to SBOL. Similarly, the -np argument is included because there are no previously existing SBOL annotations to reconcile with the annotations added by SYNBICT.

### Command line annotation of SBOL file

`python sequences_to_features.py -n http://mynamespace.org -t ../test/genetic_nand.xml -s annotated -f ../test/cello_library.xml -l ../test/genetic_nand_log.txt -d -ni`

The expected output for this example is an SBOL file named "genetic_nand_annotated.xml" that contains two components with the displayId "Strain_4_MG1655_Genomic_NAND_Circuit_comp", one with an identity in the SD2 SynBioHub namespace and one in this example's namespace. The latter component is derived from the former and represents the annotated genome for an engineered strain of E. coli following curation with SYNBICT. Also included are the sub-components for all the features that SYNBICT annotated in this genome.

The log file specified with the -l argument should report 26 annotations made with SYNBICT (search for "DEBUG ; Annotated") and 4,162 annotations removed with SYNBICT (search for "DEBUG ; Removed"). There is one expected warning due to a feature with the displayId "YFP_reporter" that is missing its sequence in the feature library file named "cello_library.xml".
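
To check these counts without reading the log by hand, a short script can tally the matching lines. This is a minimal sketch: `count_log_entries` is a hypothetical helper (not part of SYNBICT), the sample lines are invented for illustration, and it assumes that each log entry sits on a single line containing one of the search strings quoted above.

```python
def count_log_entries(log_lines, markers=('DEBUG ; Annotated', 'DEBUG ; Removed')):
    """Count the lines that contain each marker string."""
    counts = {marker: 0 for marker in markers}
    for line in log_lines:
        for marker in markers:
            if marker in line:
                counts[marker] += 1
    return counts


# Illustrative lines only; real SYNBICT log messages will differ in detail.
sample_log = [
    '2020-01-01 12:00:00,000 ; DEBUG ; Annotated <hypothetical message>',
    '2020-01-01 12:00:01,000 ; DEBUG ; Removed <hypothetical message>',
    '2020-01-01 12:00:02,000 ; WARNING ; <hypothetical message>',
]
print(count_log_entries(sample_log))
```

With a real log, an open file can be passed in place of the list, for example `with open('../test/genetic_nand_log.txt') as log: print(count_log_entries(log))`.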

Note the use of the optional arguments -d (for deleting flat annotations) and -ni (for non-interactivity). The -d argument is included to remove existing annotations with no sub-components that were copied by SYNBICT from the genome in this example's target SBOL file. This is done primarily to simplify SYNBICT's output for the purpose of this example. The -ni argument is included to indicate that no additional user input should be solicited during the pruning step. This can be useful when the target SBOL file is very large or when SYNBICT is being run by an automated process that may not be capable of providing additional input. If these arguments are not included, then the user may be prompted for input when SYNBICT analyzes the annotations in its output components and attempts to identify redundant or incorrect annotations.
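
When SYNBICT is driven by an automated process, the same command can be assembled and launched from Python. The sketch below uses only flags documented in this README. `build_annotation_command` is a hypothetical helper, and the paths are the ones from the example above. The call to `subprocess.run` is left commented out so that the snippet can be read and tried without SYNBICT present.

```python
import subprocess
import sys


def build_annotation_command(namespace, target, features, suffix, log_file):
    """Assemble the non-interactive annotation command shown above."""
    return [
        sys.executable, 'sequences_to_features.py',
        '-n', namespace,
        '-t', target,
        '-s', suffix,
        '-f', features,
        '-l', log_file,
        '-d',   # delete flat annotations
        '-ni',  # never prompt for user input
    ]


command = build_annotation_command(
    'http://mynamespace.org',
    '../test/genetic_nand.xml',
    '../test/cello_library.xml',
    'annotated',
    '../test/genetic_nand_log.txt',
)
print(' '.join(command[1:]))

# Uncomment to run from the directory that contains sequences_to_features.py:
# subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths, and `check=True` makes a failed run raise an exception instead of passing silently.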

### Python annotation of sequence string

```python
import sbol2
import logging
from sequences_to_features import FeatureLibrary
from sequences_to_features import FeatureAnnotater

# Set pySBOL configuration parameters

sbol2.setHomespace('http://mynamespace.org')
sbol2.Config.setOption('validate', True)
sbol2.Config.setOption('sbol_typed_uris', False)

# Set up log file - can be commented out

logger = logging.getLogger('synbict')
logger.setLevel(logging.DEBUG)
logger.propagate = False

formatter = logging.Formatter('%(asctime)s ; %(levelname)s ; %(message)s')

file_handler = logging.FileHandler('SrpR_annotation_log.txt', "w")
file_handler.setLevel(logging.DEBUG)
file_handler.setFormatter(formatter)

logger.addHandler(file_handler)

# Load Cello genetic circuit feature library

feature_doc = sbol2.Document()
feature_doc.read('../test/cello_library.xml')

feature_library = FeatureLibrary([feature_doc])

# Annotate raw target sequence

min_feature_length = 40

annotater = FeatureAnnotater(feature_library, min_feature_length)    

target_seq = (
    'CTGAAGCGCTCAACGGGTGTGCTTCCCGTTCTGATGAGTCCGTGAGGACGAAAGCGCCTCTA'
    'CAAATAATTTTGTTTAAGAGTCTATGGACTATGTTTTCACAAAGGAAGTACCAGGATGGCAC'
    'GTAAAACCGCAGCAGAAGCAGAAGAAACCCGTCAGCGTATTATTGATGCAGCACTGGAAGTT'
    'TTTGTTGCACAGGGTGTTAGTGATGCAACCCTGGATCAGATTGCACGTAAAGCCGGTGTTAC'
    'CCGTGGTGCAGTTTATTGGCATTTTAATGGTAAACTGGAAGTTCTGCAGGCAGTTCTGGCAA'
    'GCCGTCAGCATCCGCTGGAACTGGATTTTACACCGGATCTGGGTATTGAACGTAGCTGGGAA'
    'GCAGTTGTTGTTGCAATGCTGGATGCAGTTCATAGTCCGCAGAGCAAACAGTTTAGCGAAAT'
    'TCTGATTTATCAGGGTCTGGATGAAAGCGGTCTGATTCATAATCGTATGGTTCAGGCAAGCG'
    'ATCGTTTTCTGCAGTATATTCATCAGGTTCTGCGTCATGCAGTTACCCAGGGTGAACTGCCG'
    'ATTAATCTGGATCTGCAGACCAGCATTGGTGTTTTTAAAGGTCTGATTACCGGTCTGCTGTA'
    'TGAAGGTCTGCGTAGCAAAGATCAGCAGGCACAGATTATCAAAGTTGCACTGGGTAGCTTTT'
    'GGGCACTGCTGCGTGAACCGCCTCGTTTTCTGCTGTGTGAAGAAGCACAGATTAAACAGGTG'
    'AAATCCTTCGAATAATTCAGCCAAAAAACTTAAGACCGCCGGTCTTGTCCACTACCTTGCAG'
    'TAATGCGGTGGACAGGATCGGCGGTTTTCTTTTCTCTTCTCAATCTATGATTGGTCCAGATT'
    'CGTTACCAATTGACAGCTAGCTCAGTCCTAGGTATATACATACATGCTTGTTTGTTTGTAAAC'
)

min_target_length = 0

annotated_comp = annotater.annotate_raw_sequences(target_seq, 'SrpR_RBS_S3_gate', min_target_length)

# Write annotated sequence

annotated_doc = sbol2.Document()
annotated_doc.addComponentDefinition(annotated_comp)

annotated_doc.write('SrpR_RBS_S3_gate.xml')
```

This example assumes that your current working directory is the SYNBICT sub-directory named "example". In this case, the target DNA sequence to be annotated is hard-coded, but it could be loaded from a CSV file or text file. The min_feature_length parameter is set to 40 to prevent annotation with DNA features such as assembly scars that are short enough to occur by chance in a target sequence. The min_target_length parameter is set to 0 since there is a single target sequence and we know that we want to annotate it. You might increase this parameter when you are annotating a large number of target sequences that potentially include smaller sequences you do not wish to annotate.
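
If the target sequences live in a CSV file, they could be read with the standard `csv` module before being handed to the annotater. This is a sketch under assumptions: the column names `name` and `sequence` and the helper `load_targets` are hypothetical, and the length filter only mirrors the idea behind min_target_length rather than reproducing SYNBICT's own check.

```python
import csv
import io


def load_targets(csv_file, min_target_length=0):
    """Return (name, sequence) pairs that are at least min_target_length long."""
    targets = []
    for row in csv.DictReader(csv_file):
        sequence = row['sequence'].strip().upper()
        if len(sequence) >= min_target_length:
            targets.append((row['name'].strip(), sequence))
    return targets


# An in-memory stand-in for a file opened with open('targets.csv', newline='').
example_csv = io.StringIO(
    'name,sequence\n'
    'short_fragment,ctgaagcg\n'
    'longer_fragment,CTGAAGCGCTCAACGGGTGTGCTTCCCG\n'
)
print(load_targets(example_csv, min_target_length=20))
```

Each returned sequence could then be annotated in the same way as `target_seq` in the example above.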

## sequences\_to\_features.py

sequences_to_features.py annotates sequences in target SBOL, GenBank, or FASTA files and can be used to prune existing annotations on these sequences as well.

### Common arguments for sequences\_to\_features.py

Argument | Short Arg | Type | Description | Example
---- | --- | --- | --- | ---
`--namespace` | `-n` | `String` | **Required**. Namespace that you own or that you are reasonably certain is only used by you. | http://mynamespace.org
`--target_files` | `-t` | `String` | **Optional**. List of paths to input files or directories containing components to curate. Accepted file formats include SBOL XML, FASTA, and GenBank. If any paths to directories are provided, then all of their files of accepted formats are appended to the list. Default is an empty list. | targets1.xml targets2.fa target_directory
`--output_files` | `-o` | `String` | **Optional**. List of paths to output files. If its length is less than that of target_files, then the difference is populated with copies of the corresponding target file paths. If an output suffix is provided, then the copied target file paths are postfixed with this suffix. If no output suffix is provided, then the target files located by these paths will be overwritten. Alternatively, if output_files contains a single path to a directory, then the output list is formed by postfixing the target file names to this directory (with an output suffix if provided). | targets1_curated.xml targets2_curated.xml
`--output_suffix` | `-s` | `String` | **Optional**. Suffix for postfixing target file paths and names used to populate output_files. | curated
`--in_place` | `-p` | `Boolean` | **Optional**. If included, do not copy components prior to curation. Default is to curate copies of components. | -p
`--min_target_length` | `-m` | `Integer` | **Optional**. Minimum length that component must be to curate (annotate and/or prune). Default is 2000 bp. | 2000
`--minimal_output` | `-mo` | `Boolean` | **Optional**. If included, only output annotated components and none of their sub-components or sequences. | -mo
`--non_interactive` | `-ni` | `Boolean` | **Optional**. If included, do not ask user for additional input. Default is to ask user. | -ni
`--log_file` | `-l` | `String` | **Optional**. Log file to populate with more verbose curation history. Default is to not generate a log file. | curation.log
`--validate` | `-v` | `Boolean` | **Optional**. If included, output files will be checked against SBOL validation rules. Default is to not validate output files. | -v
`--sbh_URL` | `-U` | `String` | **Optional**. If included, SYNBICT will attempt to log into the specified SynBioHub instance. | https://synbiohub.org
`--username` | `-u` | `String` | **Optional**. If included, SYNBICT will use this value as the username to log into the specified SynBioHub instance. | igemer217
`--password` | `-w` | `String` | **Optional**. If included, SYNBICT will use this value as the password to log into the specified SynBioHub instance. | w3j4d!d5adg6
`--target_URLs` | `-T` | `String` | **Optional**. List of URLs for SBOL objects. If included, SYNBICT will download these objects from the specified SynBioHub instance and curate them. Can be used in combination with the target_files argument. | https://synbiohub.org/public/igem/BBa_K731020/1 https://synbiohub.org/public/iGEMDistributions/iGEM2019Distribution_collection/1
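
The rules for `--output_files` and `--output_suffix` can be hard to follow in prose, so the sketch below restates them as a small function. It is an interpretation of the table, not SYNBICT's code: `resolve_output_paths` is a hypothetical name, and joining the suffix with an underscore and writing a `.xml` extension are assumptions based on the "genetic_nand_annotated.xml" output in the examples above.

```python
import os


def resolve_output_paths(target_files, output_files=(), output_suffix=None):
    """Pair every target file with an output path, following the table above."""

    def derive(path):
        # Assumption: outputs are SBOL XML, named <stem>_<suffix>.xml.
        stem, _ = os.path.splitext(path)
        if output_suffix:
            stem = stem + '_' + output_suffix
        return stem + '.xml'

    # A single output directory: place every derived file name inside it.
    if len(output_files) == 1 and os.path.isdir(output_files[0]):
        return [os.path.join(output_files[0], derive(os.path.basename(target)))
                for target in target_files]

    # Otherwise keep the given outputs and derive the missing ones from the
    # corresponding targets (no suffix means an XML target is overwritten).
    resolved = list(output_files)
    for target in target_files[len(resolved):]:
        resolved.append(derive(target))
    return resolved


print(resolve_output_paths(['../test/genetic_nand.fa'], output_suffix='annotated'))
print(resolve_output_paths(['a.xml', 'b.xml'], ['out_a.xml'], 'curated'))
```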


### Annotation arguments for sequences\_to\_features.py

Argument | Short Arg | Type | Description | Example
---- | --- | --- | --- | ---
`--feature_files` | `-f` | `String` | **Optional**. List of paths to input files or directories containing features to create library for annotating components. Default is an empty list. Accepted file format is SBOL XML. | features1.xml feature_directory
`--feature_URLs` | `-F` | `String` | **Optional**. List of URLs for SBOL objects. If included, SYNBICT will download these objects from the specified SynBioHub instance and use them to create a library for annotating components. Can be used in combination with the feature_files argument. | https://synbiohub.programmingbiology.org/public/Cello_Parts/Cello_Parts_collection/1
`--min_feature_length` | `-M` | `Integer` | **Optional**. Minimum length that feature must be to include in library for annotating components. Default is 40 bp. | 40
`--no_annotation` | `-na` | `Boolean` | **Optional**. If included, do not annotate components. Default is to annotate components. | -na
`--extend_features` | `-e` | `Boolean` | **Optional**. If included, attempt to extend feature library. Derives new features from previously existing annotations on components in target file if their names align to the names of library features and if the fraction mismatch between their aligned sequences is less than the extension threshold. Default is to not extend feature library. | -e
`--extension_suffix` | `-xs` | `String` | **Optional**. Suffix for postfixing extended feature files. If not provided, then the original feature files will be overwritten. | extended
`--extension_threshold` | `-x` | `Float` | **Optional**. Maximum fraction mismatch between sequences permitted to extend feature library with new features based on previously existing annotations on components in target file. | 0.05
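
The 40 bp default for `--min_feature_length` can be motivated with a back-of-the-envelope estimate of how often a feature of a given length would match a target by chance alone. The sketch below is not SYNBICT code: `expected_chance_matches` is a hypothetical helper, it assumes that the four bases are equally likely and independent (real genomes are not), and the 4.6 Mbp target length is only an illustrative genome-scale figure.

```python
def expected_chance_matches(feature_length, target_length, strands=2):
    """Expected number of exact chance matches in a random target sequence."""
    positions = max(target_length - feature_length + 1, 0)
    return strands * positions / 4 ** feature_length


genome_length = 4_600_000  # illustrative, roughly the size of a bacterial genome

for length in (6, 20, 40):
    print(length, expected_chance_matches(length, genome_length))
```

Under these assumptions a 6 bp feature is expected to match on the order of a few thousand times in a genome-sized target, whereas a 40 bp feature is effectively never expected to match by chance.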

### Annotation pruning arguments for sequences\_to\_features.py

Argument | Short Arg | Type | Description | Example
---- | --- | --- | --- | ---
`--cover_offset` | `-c` | `Integer` | **Optional**. Maximum distance between the start of one annotation and the start of another annotation (or the end of one annotation and the end of another annotation) to initiate pruning of overlapping annotations. Default is 14 bp. | 14
`--deletion_roles` | `-r` | `String` | **Optional**. List of URIs for Sequence Ontology roles. All annotations for sub-components with these roles will be removed from components in target file. Default is an empty list. | http://identifiers.org/so/SO:0000167 http://identifiers.org/so/SO:0000316
`--delete_flat` | `-d` | `Boolean` | **Optional**. If included, automatically delete annotations that do not refer to a sub-component. Default is no automatic deletion of these annotations. | -d
`--no_pruning` | `-np` | `Boolean` | **Optional**. If included, do not prune component annotations. Default is to prune newly made and previously existing annotations. | -np
`--auto_swap` | `-a` | `Boolean` | **Optional**. If included, automatically merge any overlapping pair of a flat annotation and a sub-component annotation that do not overlap with any other annotations. Default is to ask user if merger should take place. | -a
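
One way to read the `--cover_offset` description is as a test applied to pairs of overlapping annotations. The following sketch illustrates that reading only. The `Annotation` tuple, the example annotations, and `may_trigger_pruning` are hypothetical, and SYNBICT's actual pruning logic may differ.

```python
from collections import namedtuple

# Hypothetical record with inclusive start and end coordinates.
Annotation = namedtuple('Annotation', ['name', 'start', 'end'])


def may_trigger_pruning(first, second, cover_offset=14):
    """True if two overlapping annotations start or end within cover_offset bp."""
    overlapping = first.start <= second.end and second.start <= first.end
    close_starts = abs(first.start - second.start) <= cover_offset
    close_ends = abs(first.end - second.end) <= cover_offset
    return overlapping and (close_starts or close_ends)


existing = Annotation('existing_promoter', 100, 180)
added = Annotation('library_promoter', 105, 200)
distant = Annotation('downstream_cds', 150, 900)

print(may_trigger_pruning(existing, added))    # starts are 5 bp apart
print(may_trigger_pruning(existing, distant))  # starts and ends both differ by more than 14 bp
```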

## features\_to\_circuits.py

features_to_circuits.py infers genetic circuits from annotated components in target SBOL files (one per file) by comparing their annotated DNA sequence features to DNA features with documented interactions in one or more sub-circuit library files (also SBOL).

### Common arguments for features\_to\_circuits.py

Argument | Short Arg | Type | Description | Example
--- | --- | --- | --- | ---
`--namespace` | `-n` | `String` | **Required**. Namespace that you own or that you are reasonably certain is only used by you. | http://mynamespace.org
`--sub_circuit_files` | `-c` | `String` | **Required**. List of paths to input files or directories containing sub-circuits to create library for inferring composite circuits. | subcircuits1.xml subcircuits2.xml
`--no_build` | `-nb` | `Boolean` | **Optional**. If included, do not infer genetic circuits from annotated components in target files. Default is to infer genetic circuits. | -nb
`--log_file` | `-l` | `String` | **Optional**. Log file to populate with more verbose curation history. Default is to not generate a log file. | curation.log
`--validate` | `-v` | `Boolean` | **Optional**. If included, output files will be checked against SBOL validation rules. Default is to not validate output files. | -v

### Circuit inference arguments for features\_to\_circuits.py

Argument | Short Arg | Type | Description | Example
--- | --- | --- | --- | ---
`--target_files` | `-t` | `String` | **Optional**. List of paths to input files or directories containing annotated components from which to infer genetic circuits. Accepted file format is SBOL XML. If any paths to directories are provided, then all of their XML files are appended to the list. | targets1.xml target_directory
`--circuit_IDs` | `-i` | `String` | **Optional**. List of IDs given to the inferred genetic circuits (one per target file). By default uses the names of the corresponding target files (postfixed with circuit_suffix if provided). | targets1_circuit targets2_circuit
`--circuit_suffix` | `-s` | `String` | **Optional**. Suffix for postfixing IDs of inferred genetic circuits. | circuit
`--circuit_version` | `-cv` | `String` | **Optional**. Version given to inferred genetic circuits. Default is 1. | 1
`--output_files` | `-o` | `String` | **Optional**. List of paths to output files. If its length is less than that of target_files, then the difference is populated with copies of the corresponding target file paths. If an output suffix is provided, then the copied target file paths are postfixed with this suffix. If no output suffix is provided, then the target files located by these paths will be overwritten. Alternatively, if output_files contains a single path to a directory, then the output list is formed by postfixing the target file names to this directory (with an output suffix if provided). | mytargets_1_annotated_circuit.xml mytargets_2_annotated_circuit.xml
`--output_suffix` | `-os` | `String` | **Optional**. Suffix for postfixing target file paths and names used to populate output_files. | curated
`--input_identities` | `-ii` | `String` | **Optional**. List of URIs identifying known input species. These species are labeled as inputs in any inferred circuits. Default is an empty list. | 
`--output_identities` | `-oi` | `String` | **Optional**. List of URIs identifying known output species. These species are labeled as outputs in any inferred circuits. Default is an empty list. | 
`--min_target_length` | `-m` | `Integer` | **Optional**. Minimum length that an annotated component must be to consider its features when inferring a genetic circuit. Default is 2000 bp. | 2000
`--no_sensors` | `-ns` | `Boolean` | **Optional**. If included, do not add library sub-circuits for non-covalent interactions between small molecules and proteins to the inferred composite circuit. Default is to add these sub-circuits and attempt to abstract them by deriving stimulation and inhibition interactions from them in the composite circuit. | -ns
`--tx_threshold` | `-d` | `Integer` | **Optional**. Maximum distance between an annotated promoter feature and an annotated CDS feature that is permitted to infer an interaction between them (an interaction not present in the sub-circuit library). Default is 200 bp. | 200

### Sub-circuit library extension arguments for features\_to\_circuits.py

Argument | Short Arg | Type | Description | Example
--- | --- | --- | --- | ---
`--extend_sub_circuits` | `-e` | `Boolean` | **Optional**. If included, attempt to extend the sub-circuit library. Derives new sub-circuits from DNA features in the sub-circuit files only if their names align to the names of other DNA features in the sub-circuit library and if the fraction mismatch between their aligned sequences is less than the extension threshold. Default is to not extend the sub-circuit library. | -e
`--extension_suffix` | `-xs` | `String` | **Optional**. Suffix for postfixing extended sub-circuit files. If not provided, then the original sub-circuit files will be overwritten. | circuit
`--extension_threshold` | `-x` | `Float` | **Optional**. Maximum fraction mismatch between sequences permitted to extend the sub-circuit library. New sub-circuits are derived from DNA features in the sub-circuit files that are not part of an existing sub-circuit but are similar to a DNA feature that is part of such a sub-circuit. | 0.05
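
The extension threshold compares a fraction of mismatches against a cutoff. As a rough illustration, the sketch below computes that fraction for two sequences that are already aligned and of equal length. `fraction_mismatch` and `within_threshold` are hypothetical helpers, and the alignment step that SYNBICT performs beforehand is not reproduced here. The strict comparison follows the wording "less than the extension threshold" in the table.

```python
def fraction_mismatch(aligned_a, aligned_b):
    """Fraction of positions that differ between two aligned sequences."""
    if not aligned_a or len(aligned_a) != len(aligned_b):
        raise ValueError('sequences must be non-empty and of equal length')
    pairs = zip(aligned_a.upper(), aligned_b.upper())
    mismatches = sum(1 for base_a, base_b in pairs if base_a != base_b)
    return mismatches / len(aligned_a)


def within_threshold(aligned_a, aligned_b, extension_threshold):
    """True if the mismatch fraction is below the extension threshold."""
    return fraction_mismatch(aligned_a, aligned_b) < extension_threshold


reference = 'ACGT' * 10               # 40 bp illustrative feature
one_change = 'T' + reference[1:]      # 1 mismatch in 40 positions
four_changes = 'TTTT' + reference[4:]  # 3 mismatches (the fourth base is already T)

print(fraction_mismatch(reference, one_change), within_threshold(reference, one_change, 0.05))
print(fraction_mismatch(reference, four_changes), within_threshold(reference, four_changes, 0.05))
```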

### Example arguments for features\_to\_circuits.py

In this example, `-i` gives the circuit name (required) and `-c` gives the parts collection (required).

`-n http://foo.bar -i bob -t ~/tmp/cpc/Cello_Parts_collection/Strain_3_MG1655_Genomic_IcaR_Gate_annotated.xml -c ~/tmp/cpc/Cello_Parts_collection/Cello_Parts_collection.xml`

## circuit\_visualization.py

### Arguments

Argument | Short Arg | Value | Description
--- | --- | --- | ---
`--circuit_file` | `-c` | `CIRCUIT_FILE` | File to visualize, which must have at least one ModuleDefinition.
`--feature_files` | `-f` | `[FEATURE_FILES [FEATURE_FILES ...]]` | Zero or more feature files.
`--curation_log` | `-l` | `[CURATION_LOG]` | Curation log.
`--min_features` | `-m` | `[MIN_FEATURES]` | 
`--validate` | `-v` | | Flag that takes no value.
`--help` | `-h` | | Show the help message and exit.
