Metadata-Version: 2.1
Name: pygor3
Version: 0.0.8.dev3
Summary: Python package to manipulate and run IGoR data files
Home-page: https://github.com/alfaceor/pygor3
Author: Carlos Olivares
Author-email: carlos.olivares@phys.ens.fr
License: GNU GPLv3
Platform: UNKNOWN
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE

# Pygor3

Pygor3 is a Python 3 framework to analyze, visualize, generate 
and infer V(D)J recombination models using the [IGoR](https://github.com/statbiophys/IGoR) software. 
Pygor3 provides a Python interface to execute IGoR and to encapsulate 
its inputs and outputs in a sqlite3 database that 
contains the input sequences, alignments, model parameters, 
conditional probabilities of the model's Bayes network, 
best scenarios and generation probabilities in a single db file.
Pygor3 also has command line utilities to import and export 
IGoR generated files in the [AIRR standard format](https://docs.airr-community.org/en/latest/index.html).



## Installation
1. Install [IGoR](https://github.com/statbiophys/IGoR) on your system if you don't have it already.
Pygor uses IGoR's default path to execute its commands.

2. (Optional) Install [conda](https://docs.conda.io/en/latest/) or 
[anaconda](https://www.anaconda.com/) and create (or use) a virtual environment.

    ```console
      $ conda create --name statbiophys python=3.7
      $ conda activate statbiophys
    ```
3. Install pygor3 with the package manager [pip](https://pip.pypa.io/en/stable/).
    
    ```console
    (statbiophys) $ pip install pygor3 
    ```
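
Because pygor calls IGoR through its default path, it can help to confirm that both executables are reachable before going further. The following is a minimal sketch using only the standard library. The executable names `igor` and `pygor` are assumptions based on the commands used in this README, and `find_executable` is a hypothetical helper, not part of pygor3.

```python
import shutil


def find_executable(name):
    """Return the absolute path of `name` if it is on PATH, else None."""
    return shutil.which(name)


# Assumed executable names: `igor` for IGoR and `pygor` for this package's CLI.
for tool in ("igor", "pygor"):
    path = find_executable(tool)
    print(f"{tool}: {path if path else 'not found on PATH'}")
```

If either tool is reported as missing, check that the conda environment is activated and that IGoR was installed to a directory listed in `PATH`.
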
### GitHub installation
For the most recent version of pygor (from GitHub):

```console
(statbiophys) $ git clone https://github.com/statbiophys/pygor3.git
(statbiophys) $ cd pygor3
(statbiophys) $ pip install -e .
```

## Command Line Usage

### Quickstart

#### Get demo sample data
Download a copy of the demo sequences into the current directory:

```console
$ pygor demo-get-data
--------------------------------
Copy data from :  /home/olivares/GitHub/statbiophys/pygor3/pygor3/demo
to:  /home/olivares/testing_pygor/demo
```
This command creates a directory `demo` with the following structure, containing the sequences used to infer
and evaluate a new model.

```console
demo/
└── data
    └── IgL
        ├── IgL_seqs_memory_Functional.txt
        ├── IgL_seqs_memory_Nofunctional.txt
        ├── IgL_seqs_naive_Functional.txt
        └── IgL_seqs_naive_Nofunctional.txt
```
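
A quick way to confirm that the download worked is to check for the four files shown in the tree. This is a small sketch, not a pygor3 feature: `missing_demo_files` is a hypothetical helper and the file names are copied from the listing above.

```python
from pathlib import Path

# File names copied from the directory tree above.
DEMO_FILES = [
    "IgL_seqs_memory_Functional.txt",
    "IgL_seqs_memory_Nofunctional.txt",
    "IgL_seqs_naive_Functional.txt",
    "IgL_seqs_naive_Nofunctional.txt",
]


def missing_demo_files(demo_root="demo"):
    """Return the expected demo files that are absent under demo_root/data/IgL."""
    igl_dir = Path(demo_root) / "data" / "IgL"
    return [name for name in DEMO_FILES if not (igl_dir / name).is_file()]


print(missing_demo_files("demo"))  # an empty list means all files are present
```
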

#### New Model
To create a model from scratch, download gene templates and anchors from the [IMGT](http://www.imgt.org/) website.

1. List the species available for download from IMGT with the `imgt-get-genomes` command and the `--info` option.

    ```console
    $ pygor imgt-get-genomes --info
    --------------------------------
    http://www.imgt.org
    Downloading data from ... 
    List of IMGT available species:
    
    Gallus+gallus
    Cercocebus+atys
    Mustela+putorius+furo
    Macaca+nemestrina
    Vicugna+pacos
    Mus+cookii
    Bos+taurus
    Canis+lupus+familiaris
    Ornithorhynchus+anatinus
    Macaca+mulatta
    Rattus+rattus
    Mus+minutoides
    Danio+rerio
    Oncorhynchus+mykiss
    Tursiops+truncatus
    Felis+catus
    Homo+sapiens
    Salmo+salar
    Macaca+fascicularis
    Mus+musculus
    Mus+saxicola
    Capra+hircus
    Sus+scrofa
    Mus+pahari
    Ovis+aries
    Equus+caballus
    Camelus+dromedarius
    Oryctolagus+cuniculus
    Papio+anubis+anubis
    Mus+spretus
    Rattus+norvegicus
    For more details access:
    http://www.imgt.org/download/GENE-DB/IMGTGENEDB-GeneList
    ```

2. Download the genomic templates, choosing `-t VJ` or `-t VDJ` according to the type of chain.
 
    ```console
    $ pygor imgt-get-genomes --imgt-species Homo+sapiens --imgt-chain IGL -t VJ
    --------------------------------
    http://www.imgt.org
    get_ref_genome
    Homo+sapiens IGLV http://www.imgt.org/genedb/GENElect?query=7.2+IGLV&species=Homo+sapiens
    http://www.imgt.org/genedb/GENElect?query=7.2+IGLV&species=Homo+sapiens
    Homo+sapiens IGLJ http://www.imgt.org/genedb/GENElect?query=7.2+IGLJ&species=Homo+sapiens
    http://www.imgt.org/genedb/GENElect?query=7.2+IGLJ&species=Homo+sapiens
    http://www.imgt.org/genedb/GENElect?query=8.1+IGLV&species=Homo+sapiens&IMGTlabel=2nd-CYS
    No anchor is found for : AC279423|IGLV(I)-11-1*01|Homo sapiens|P|V-REGION|22452..22620|169 nt|1| | | | |169+0=169|partial in 5'| |
    No anchor is found for : D87007|IGLV(I)-20*01|Homo sapiens|P|V-REGION|15573..15858|286 nt|1| | | | |286+0=286| | |
    No anchor is found for : AC279208|IGLV(I)-20*02|Homo sapiens|P|V-REGION|19943..20228|286 nt|1| | | | |286+0=286| | |
    
    ...
    
    Number of features: 0
    Seq('TGCTGTGTTCGGAGGAGGCACCCAGCTGACCGTCCTCG')
    ID: D87017|IGLJ7*02|Homo
    Name: D87017|IGLJ7*02|Homo
    Description: D87017|IGLJ7*02|Homo sapiens|F|J-REGION|18513..18550|38 nt|2| | | | |38+0=38| | |
    Number of features: 0
    Seq('TGCTGTGTTCGGAGGAGGCACCCAGCTGACCGCCCTCG')
    ----------------------
    Genomic VJ templates in files: 
    models/Homo+sapiens/IGL/ref_genome/genomicVs__imgt.fasta models/Homo+sapiens/IGL/ref_genome/genomicJs__imgt.fasta

    ```

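    This command creates a directory **models**. The listing below shows the structure for a VDJ chain (TRB);
    for the IGL command above the directory is `models/Homo+sapiens/IGL` and the `genomicDs*` files are absent.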
    
    ```
    models/
    └── Homo+sapiens
        └── TRB
            ├── models
            └── ref_genome
                ├── genomicDs.fasta
                ├── genomicDs__imgt.fasta
                ├── genomicDs__imgt.fasta_short
                ├── genomicJs.fasta
                ├── genomicJs__imgt.fasta
                ├── genomicJs__imgt.fasta_short
                ├── genomicJs__imgt.fasta_trim
                ├── genomicVs.fasta
                ├── genomicVs__imgt.fasta
                ├── genomicVs__imgt.fasta_short
                ├── genomicVs__imgt.fasta_trim
                ├── J_gene_CDR3_anchors.csv
                ├── J_gene_CDR3_anchors__imgt.csv
                ├── J_gene_CDR3_anchors__imgt.csv_short
                ├── V_gene_CDR3_anchors.csv
                ├── V_gene_CDR3_anchors__imgt.csv
                └── V_gene_CDR3_anchors__imgt.csv_short
    
    ```

    ---
    **Important Note**
    Review your downloaded gene templates carefully. Pygor automatically renames the long
    IMGT descriptions to short ones. For instance
    
    D86996|IGLV(I)-56*01|Homo sapiens|P|V-REGION|12276..12571|296 nt|1| | | | |296+0=296| | |
    
    D86996|IGLV(I)-56*01|Homo sapiens|P|V-REGION|12576..12876|301 nt|1| | | | |301+0=301| | |
    
    are both renamed to:
    
    IGLV(I)-56*01
     
    IGLV(I)-56*01 
    
    In such cases, rename or remove one of the entries manually before creating a new model.
    For simplicity, in this demo we remove the second IGLV(I)-56*01.
    
    ---

3. Create a new initial default model with a uniform distribution for the conditional probabilities
of the Bayes network (the "model_marginals.txt" file). Note that IGoR calls this file marginals,
although it does not hold the marginal probability of each recombination event.
    
    ```console
    $ pygor model-create -M models/Homo+sapiens/IGL/ -t VJ
    --------------------------------
    No D genes were found.
    [Errno 2] No such file or directory: 'models/Homo+sapiens/IGL//ref_genome//genomicDs.fasta'
    No D genes were found.
    [Errno 2] No such file or directory: 'models/Homo+sapiens/IGL//ref_genome//genomicDs.fasta'
    igortask.igor_model_dir_path:  models/Homo+sapiens/IGL/
    Writing model parms in file  models/Homo+sapiens/IGL//models/model_parms.txt
    Writing model marginals in file  models/Homo+sapiens/IGL//models/model_marginals.txt    
    ```
   
   The initial model, with uniform parameters, is written to the files **model_parms.txt** and **model_marginals.txt** in the `models` subdirectory:
    ```console
    models/
    └── Homo+sapiens
        └── IGL
            ├── models
            │   ├── model_marginals.txt
            │   └── model_parms.txt
            └── ref_genome
                ├── genomicJs.fasta
                ├── genomicJs__imgt.fasta
                ├── genomicJs__imgt.fasta_short
                ├── genomicJs__imgt.fasta_trim
                ├── genomicVs.fasta
                ├── genomicVs__imgt.fasta
                ├── genomicVs__imgt.fasta_short
                ├── genomicVs__imgt.fasta_trim
                ├── J_gene_CDR3_anchors.csv
                ├── J_gene_CDR3_anchors__imgt.csv
                ├── J_gene_CDR3_anchors__imgt.csv_short
                ├── V_gene_CDR3_anchors.csv
                ├── V_gene_CDR3_anchors__imgt.csv
                └── V_gene_CDR3_anchors__imgt.csv_short

    
    ```
   
   At this point you can use a set of non-productive sequences to infer a model, either with IGoR directly 
   or with the pygor command (the simpler option).
   
    ```console
    $ pygor igor-infer -M models/Homo+sapiens/IGL/ -i data/IgL/IgL_seqs_naive_Nofunctional.txt -o new_IgL_naive
    --------------------------------
    ===== Running inference =====
    ...
    WARNING: write_model_parms path  [Errno 2] No such file or directory: ''
    Writing model parms in file  new_IgL_naive_parms.txt
    WARNING: IgorModel_Marginals.write_model_marginals path  [Errno 2] No such file or directory: ''
    Writing model marginals in file  new_IgL_naive_marginals.txt
    Database file :  new_IgL_naive

    ```
   
   This will output the following files
   
   ```console
   new_IgL_naive.db
   new_IgL_naive_BN.pdf
   new_IgL_naive_PM.pdf
   new_IgL_naive_marginals.txt
   new_IgL_naive_parms.txt
   ```
   
   where new_IgL_naive.db is a database with the encapsulated information about the new model and 
   the data used by IGoR to infer it, new_IgL_naive_BN.pdf is a plot of the Bayesian network (BN) of the inferred
   model, new_IgL_naive_PM.pdf contains plots of the real marginals of the events in the BN, and 
   new_IgL_naive_parms.txt and new_IgL_naive_marginals.txt hold the inferred model in IGoR's format.
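
The duplicated short names described in the important note of step 2 can be spotted before a model is created. The sketch below assumes that the short name is the second `|`-separated field of the IMGT header, which matches the example given in the note; pygor's actual renaming rule may differ. The helper names are hypothetical and not part of pygor3.

```python
from collections import Counter


def short_name(header):
    """Extract the gene name (second '|'-separated field) from an IMGT FASTA header."""
    return header.lstrip(">").split("|")[1]


def duplicated_short_names(headers):
    """Return the short names that occur more than once, in first-seen order."""
    counts = Counter(short_name(h) for h in headers)
    return [name for name, n in counts.items() if n > 1]


def fasta_headers(path):
    """Yield the header lines (those starting with '>') of a FASTA file."""
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                yield line.strip()


# The two headers quoted in the note above collapse to the same short name.
example = [
    ">D86996|IGLV(I)-56*01|Homo sapiens|P|V-REGION|12276..12571|296 nt|1| | | | |296+0=296| | |",
    ">D86996|IGLV(I)-56*01|Homo sapiens|P|V-REGION|12576..12876|301 nt|1| | | | |301+0=301| | |",
]
print(duplicated_short_names(example))
```

To scan a downloaded file, pass `fasta_headers("models/Homo+sapiens/IGL/ref_genome/genomicVs__imgt.fasta")` to `duplicated_short_names`.
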
   
   
   
#### Model Plots
A model can be plotted from a database file, from a model directory, or by passing the model_parms.txt and 
model_marginals.txt files.


```    

$ pygor model-plot -M models/Homo+sapiens/IGL/ -o IgL_plot

or 

$ pygor model-plot -D new_IgL_naive.db -o IgL_plot
```
This will output two PDF files with the marginal probabilities and the conditional probabilities of the events.

![](BayesNetwork.png)
![](GeneChoice_MP.png)
![](GeneChoice_CP.png)
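
The difference between the two kinds of plots can be illustrated with a small worked example. If an event such as the J gene choice is conditioned on the V gene choice (an assumption made here for illustration), its marginal probability follows from P(J) = Σ_V P(V) P(J | V). The numbers below are toy values, not taken from any inferred model.

```python
def marginal_from_conditional(p_parent, p_child_given_parent):
    """Compute P(child) = sum over parent of P(parent) * P(child | parent)."""
    n_child = len(p_child_given_parent[0])
    return [
        sum(p_parent[i] * p_child_given_parent[i][k] for i in range(len(p_parent)))
        for k in range(n_child)
    ]


# Toy numbers for illustration only: 2 V genes, 3 J genes.
p_v = [0.25, 0.75]
p_j_given_v = [
    [0.5, 0.5, 0.0],  # P(J | V = first gene)
    [0.0, 0.5, 0.5],  # P(J | V = second gene)
]
p_j = marginal_from_conditional(p_v, p_j_given_v)
print(p_j)  # [0.125, 0.5, 0.375]
```

Each row of the conditional table sums to 1, and so does the resulting marginal.
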
 

#### Database files
The .db files can contain all the information of IGoR's standard files in a single sqlite database file,
and can be examined with any sqlite client, such as sqlite3 or sqlitebrowser.

```console
$ sqlite3 new_IgL_naive.db 
SQLite version 3.33.0 2020-08-14 13:23:32
Enter ".help" for usage hints.
sqlite> .tables
IgorDAlignments       IgorIndexedSeq        IgorMM_vj_ins       
IgorDGeneTemplate     IgorJAlignments       IgorMP_Edges        
IgorER_j_5_del        IgorJGeneCDR3Anchors  IgorMP_ErrorRate    
IgorER_j_choice       IgorJGeneTemplate     IgorMP_Event_list   
IgorER_v_3_del        IgorMM_j_5_del        IgorVAlignments     
IgorER_v_choice       IgorMM_j_choice       IgorVGeneCDR3Anchors
IgorER_vj_dinucl      IgorMM_v_3_del        IgorVGeneTemplate   
IgorER_vj_ins         IgorMM_v_choice     
IgorIndexedCDR3       IgorMM_vj_dinucl    

```
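
Because the file is a regular SQLite database, Python's built-in `sqlite3` module can read it as well. The following is a minimal sketch; `table_row_counts` is a hypothetical helper, not part of pygor3, and it relies only on the table names stored in the file itself.

```python
import sqlite3
from contextlib import closing


def table_row_counts(db_path):
    """Return {table_name: row_count} for every table in a SQLite file."""
    with closing(sqlite3.connect(db_path)) as conn:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        # Names come from sqlite_master itself, so double-quoting them is sufficient.
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
```

For example, `table_row_counts("new_IgL_naive.db")` returns a dictionary keyed by the table names listed above.
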

Pygor has its own commands to manipulate the data in a database file. 
For instance, db-ls lists the contents of the database and the number of records in each table.

```console
$ pygor db-ls -D new_IgL_naive.db 
--------------------------------
=== Sequences tables igor-reads: 
IgorIndexedSeq  :  24985
=== Genomes References tables igor-genomes: 
IgorVGeneTemplate  :  151
IgorJGeneTemplate  :  10
IgorDGeneTemplate  :  0
IgorVGeneCDR3Anchors  :  111
IgorJGeneCDR3Anchors  :  10
=== Alignments tables igor-alignments: 
IgorIndexedCDR3  :  24985
IgorVAlignments  :  846743
IgorJAlignments  :  257400
IgorDAlignments  :  0
=== Model tables igor-model: 
IgorMP_Event_list  :  6
IgorMP_Edges  :  3
IgorMP_ErrorRate  :  1
IgorER_v_choice  :  151
IgorER_j_choice  :  10
IgorER_v_3_del  :  21
IgorER_j_5_del  :  21
IgorER_vj_ins  :  41
IgorER_vj_dinucl  :  4
IgorMM_v_choice  :  151
IgorMM_j_choice  :  1510
IgorMM_v_3_del  :  3171
IgorMM_j_5_del  :  210
IgorMM_vj_ins  :  41
IgorMM_vj_dinucl  :  16
=== Output tables igor-pgen and igor-scenarios: 

```
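
The record counts of the `IgorMM_*` tables are consistent with a Bayes network in which j_choice and v_3_del are conditioned on v_choice, and j_5_del on j_choice: each table then holds one entry per combination of parent and child realizations. That parent assignment is an assumption inferred from the counts (it gives three edges, matching `IgorMP_Edges : 3`), not something read from the database. The sketch reproduces the arithmetic and builds the kind of uniform conditional table an initial model would contain.

```python
# Number of realizations per event, copied from the IgorER_* rows above.
realizations = {"v_choice": 151, "j_choice": 10, "v_3_del": 21, "j_5_del": 21, "vj_ins": 41}

# Assumed parents (inferred from the counts, not read from IgorMP_Edges).
parents = {
    "v_choice": [],
    "j_choice": ["v_choice"],
    "v_3_del": ["v_choice"],
    "j_5_del": ["j_choice"],
    "vj_ins": [],
}


def table_size(event):
    """Entries in the conditional table: child realizations times parent realizations."""
    size = realizations[event]
    for parent in parents[event]:
        size *= realizations[parent]
    return size


def uniform_conditional(n_parent, n_child):
    """One row per parent realization; every row is a uniform distribution."""
    return [[1.0 / n_child] * n_child for _ in range(n_parent)]


sizes = {event: table_size(event) for event in realizations}
print(sizes)
```

The computed sizes (151, 1510, 3171, 210 and 41) match the `IgorMM_v_choice`, `IgorMM_j_choice`, `IgorMM_v_3_del`, `IgorMM_j_5_del` and `IgorMM_vj_ins` rows of the listing.
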

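In a similar way, the commands db-rm, db-cp, db-import and db-export can be used to manipulate
database files. For instance, the following copies only the genome reference and model tables to a new database file: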

```console
$ pygor db-cp -D new_IgL_naive.db -o new_IgL_naive_mdl.db --igor-genomes --igor-model
--------------------------------
**** Tables in source database :  new_IgL_naive.db
=== Sequences tables igor-reads: 
IgorIndexedSeq  :  24985
=== Genomes References tables igor-genomes: 
IgorVGeneTemplate  :  151
IgorJGeneTemplate  :  10
IgorDGeneTemplate  :  0
IgorVGeneCDR3Anchors  :  111
IgorJGeneCDR3Anchors  :  10
=== Alignments tables igor-alignments: 
IgorIndexedCDR3  :  24985
IgorVAlignments  :  846743
IgorJAlignments  :  257400
IgorDAlignments  :  0
=== Model tables igor-model: 
IgorMP_Event_list  :  6
IgorMP_Edges  :  3
IgorMP_ErrorRate  :  1
IgorER_v_choice  :  151
IgorER_j_choice  :  10
IgorER_v_3_del  :  21
IgorER_j_5_del  :  21
IgorER_vj_ins  :  41
IgorER_vj_dinucl  :  4
IgorMM_v_choice  :  151
IgorMM_j_choice  :  1510
IgorMM_v_3_del  :  3171
IgorMM_j_5_del  :  210
IgorMM_vj_ins  :  41
IgorMM_vj_dinucl  :  16
=== Output tables igor-pgen and igor-scenarios: 
**** Tables in destiny database:  new_IgL_naive_mdl.db
=== Sequences tables igor-reads: 
=== Genomes References tables igor-genomes: 
IgorVGeneTemplate  :  151
IgorJGeneTemplate  :  10
IgorDGeneTemplate  :  0
IgorVGeneCDR3Anchors  :  111
IgorJGeneCDR3Anchors  :  10
=== Alignments tables igor-alignments: 
=== Model tables igor-model: 
IgorMP_Event_list  :  6
IgorMP_Edges  :  3
IgorMP_ErrorRate  :  1
IgorER_v_choice  :  151
IgorER_j_choice  :  10
IgorER_v_3_del  :  21
IgorER_j_5_del  :  21
IgorER_vj_ins  :  41
IgorER_vj_dinucl  :  4
IgorMM_v_choice  :  151
IgorMM_j_choice  :  1510
IgorMM_v_3_del  :  3171
IgorMM_j_5_del  :  210
IgorMM_vj_ins  :  41
IgorMM_vj_dinucl  :  16
=== Output tables igor-pgen and igor-scenarios: 

```
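
The `--igor-genomes` and `--igor-model` options select groups of tables. Judging from the `db-ls` output, the groups can be told apart by table name. The sketch below encodes that reading as a hypothetical helper; the naming rules are inferred from the listings above, not taken from pygor3's source.

```python
def table_group(table_name):
    """Map a table name to the group label printed by `pygor db-ls`."""
    if table_name == "IgorIndexedSeq":
        return "igor-reads"
    if table_name == "IgorIndexedCDR3" or table_name.endswith("Alignments"):
        return "igor-alignments"
    if table_name.endswith(("GeneTemplate", "GeneCDR3Anchors")):
        return "igor-genomes"
    if table_name.startswith(("IgorMP_", "IgorER_", "IgorMM_")):
        return "igor-model"
    return "unknown"


# Tables that the copy above is expected to keep.
kept = {"igor-genomes", "igor-model"}
for name in ("IgorIndexedSeq", "IgorVGeneTemplate", "IgorVAlignments", "IgorMM_v_choice"):
    print(name, table_group(name), table_group(name) in kept)
```
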
 


#### Model evaluation
Once we have an inferred model, we can evaluate the probability that a particular sequence is 
generated (pgen), get the most probable scenarios for the recombination of the input sequences, 
or generate synthetic sequences. Sequences can be evaluated using the files model_parms.txt, model_marginals.txt, genomicXs.fasta and X_gene_CDR3_anchors.csv, or using a single database file that holds all this information. For instance, the "new_IgL_naive_mdl.db" file from the example above contains 
only the model and genome information, which is what IGoR needs for the alignment and evaluation.

```console
$ pygor igor-evaluate -D new_IgL_naive_mdl.db -i data/IgL/IgL_seqs_naive_Functional_small.txt  -o IgL_naive_evaluated
```


A TSV file in AIRR standard format is created with the rearrangements. 

```
sequence_id	sequence	rev_comp	productive	v_call	d_call	j_call	sequence_alignment	germline_alignment	junction	junction_aa	v_cigar	d_cigar	j_cigar	v_score	v_identity	v_support	v_sequence_start	v_sequence_end	v_germline_start	v_germline_end	v_alignment_start	v_alignment_end	d_score	d_identity	d_support	d_sequence_start	d_sequence_end	d_germline_start	d_germline_end	d_alignment_start	d_alignment_end	j_score	j_identity	j_support	j_sequence_start	j_sequence_end	j_germline_start	j_germline_end	j_alignment_start	j_alignment_end	sequence_aa	vj_in_frame	stop_codon	complete_vdj	locus	sequence_alignment_aa	n1_length	np1	np1_aa	np1_length	n2_length	np2	np2_aa	np2_length	p3v_length	p5d_length	p3d_length	p5j_length	scenario_rank	scenario_proba_cond_seq	pgen	quality	quality_alignment
0	CAGTCTCCCAGGTACAAAGTCACAAAGAGGGGACAGGATGTAACTCTCAGGTGTGATCCAATTTCGAGTCATGCAACCCTTTATTGGTATCAACAGGCCCTGGGGCAGGGCCCAGAGTTTCTGACTTACTTCAATTATGAAGCTCAACCAGACAAATCAGGGCTGCCCAGTGATCGGTTCTCTGCAGAGAGGCCTGAGGGATCCATCTCCACTCTGACGATTCAGCGCACAGAGCAGCGGGACTCAGCCATGTATCGCTGTGCTAGCAGCATTCCTCGGGCTGTCAGATACGCAGTATTTTGGCCCAGGCACCCGGCTGACAGTGCTCG	F		TRBV7-7*01	TRBD2*02	TRBJ2-3*01	GGTGCTGGAGTCTCCCAGTCTCCCAGGTACAAAGTCACAAAGAGGGGACAGGATGTAACTCTCAGGTGTGATCCAATTTCGAGTCATGCAACCCTTTATTGGTATCAACAGGCCCTGGGGCAGGGCCCAGAGTTTCTGACTTACTTCAATTATGAAGCTCAACCAGACAAATCAGGGCTGCCCAGTGATCGGTTCTCTGCAGAGAGGCCTGAGGGATCCATCTCCACTCTGACGATTCAGCGCACAGAGCAGCGGGACTCAGCCATGTATCGCTGTGCCAGCAGCATTCCTCGGGCTGTCAGATACGCAGTATTTTGGCCCAGGCACCCGGCTGACAGTGCTCG		TGTGCTAGCAGCATTCCTCGGGCTGTCAGATACGCAGTATTTT		285M	4M	45M	1425			2	285	16	283			20			290	292	10	13			225			7	50	6	50		6ATTCCT		6	4	CTGT		4	0	0	0	0	1	0.02729091.34834e-19		
0	CAGTCTCCCAGGTACAAAGTCACAAAGAGGGGACAGGATGTAACTCTCAGGTGTGATCCAATTTCGAGTCATGCAACCCTTTATTGGTATCAACAGGCCCTGGGGCAGGGCCCAGAGTTTCTGACTTACTTCAATTATGAAGCTCAACCAGACAAATCAGGGCTGCCCAGTGATCGGTTCTCTGCAGAGAGGCCTGAGGGATCCATCTCCACTCTGACGATTCAGCGCACAGAGCAGCGGGACTCAGCCATGTATCGCTGTGCTAGCAGCATTCCTCGGGCTGTCAGATACGCAGTATTTTGGCCCAGGCACCCGGCTGACAGTGCTCG	F		TRBV7-7*01	TRBD2*01	TRBJ2-3*01	GGTGCTGGAGTCTCCCAGTCTCCCAGGTACAAAGTCACAAAGAGGGGACAGGATGTAACTCTCAGGTGTGATCCAATTTCGAGTCATGCAACCCTTTATTGGTATCAACAGGCCCTGGGGCAGGGCCCAGAGTTTCTGACTTACTTCAATTATGAAGCTCAACCAGACAAATCAGGGCTGCCCAGTGATCGGTTCTCTGCAGAGAGGCCTGAGGGATCCATCTCCACTCTGACGATTCAGCGCACAGAGCAGCGGGACTCAGCCATGTATCGCTGTGCCAGCAGCATTCCTCGGGCTGTCAGATACGCAGTATTTTGGCCCAGGCACCCGGCTGACAGTGCTCG		TGTGCTAGCAGCATTCCTCGGGCTGTCAGATACGCAGTATTTT		285M	4M	45M	1425			2	285	16	283			20			290	292	10	13			225			7	50	6	50		6ATTCCT		6	4	CTGT		4	0	0	0	0	2	0.02729091.34834e-19		
...
```
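
Since the output is tab-separated, it can be read with Python's `csv` module. The sketch below keeps one row per `sequence_id`, choosing the lowest `scenario_rank`; it assumes that rank 1 denotes the most probable scenario and that the column is filled on every row. `best_scenarios` is a hypothetical helper and the column names are taken from the header shown above.

```python
import csv
import io


def best_scenarios(tsv_text):
    """Return {sequence_id: row}, keeping the row with the lowest scenario_rank."""
    best = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        seq_id = row["sequence_id"]
        rank = int(row["scenario_rank"])
        if seq_id not in best or rank < int(best[seq_id]["scenario_rank"]):
            best[seq_id] = row
    return best
```

To use it, read the file written by `igor-evaluate` into a string and pass it to `best_scenarios`; fields such as `v_call`, `j_call` and `pgen` are then available on each returned row.
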


## Documentation

All the command line interface commands can be used in a Python environment, such as a Jupyter notebook, by 
importing the pygor3 package:

```python
import pygor3 as p3
mdl = p3.IgorModel(model_parms_file="model_parms.txt", model_marginals_file="model_marginals.txt")
```
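
In scripts where only the command line behaviour is needed, the Quickstart commands can also be launched through `subprocess`. This is a sketch under that assumption: the helper names are hypothetical, and the flags `-M`, `-i` and `-o` are the ones used with `pygor igor-infer` earlier in this README.

```python
import subprocess


def pygor_infer_command(model_dir, sequences, output_prefix):
    """Build the argument list for `pygor igor-infer` as used in the Quickstart."""
    return ["pygor", "igor-infer", "-M", model_dir, "-i", sequences, "-o", output_prefix]


def run(command):
    """Run a command, raising CalledProcessError on a non-zero exit status."""
    return subprocess.run(command, check=True, capture_output=True, text=True)


cmd = pygor_infer_command(
    "models/Homo+sapiens/IGL/",
    "data/IgL/IgL_seqs_naive_Nofunctional.txt",
    "new_IgL_naive",
)
print(" ".join(cmd))
# run(cmd) would execute it; this requires pygor and IGoR to be installed.
```

Passing the arguments as a list avoids shell quoting problems with paths such as `Homo+sapiens`.
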

For further details, check out the [documentation](https://pygor3.readthedocs.io/en/latest/) and the notebooks directory.















