Genetic distance calculation¶
Fast pairwise distance estimation¶
For a limited number of evolutionary models a fast implementation is available.
from cogent3 import available_distances
available_distances()
| Abbreviation | Suitable for moltype |
|---|---|
| paralinear | dna, rna, protein |
| logdet | dna, rna, protein |
| jc69 | dna, rna |
| tn93 | dna, rna |
| hamming | dna, rna, protein, text, bytes |
| percent | dna, rna, protein, text, bytes |
6 rows x 2 columns
Computing genetic distances using the Alignment object¶
Abbreviations listed from available_distances() can be used as values for the distance_matrix(calc=<abbreviation>).
from cogent3 import load_aligned_seqs
aln = load_aligned_seqs("data/primate_brca1.fasta", moltype="dna")
dists = aln.distance_matrix(calc="tn93", show_progress=False)
dists
| names | Chimpanzee | Galago | Gorilla | HowlerMon | Human | Orangutan | Rhesus |
|---|---|---|---|---|---|---|---|
| Chimpanzee | 0.0000 | 0.1921 | 0.0054 | 0.0704 | 0.0089 | 0.0140 | 0.0396 |
| Galago | 0.1921 | 0.0000 | 0.1923 | 0.2157 | 0.1965 | 0.1944 | 0.1962 |
| Gorilla | 0.0054 | 0.1923 | 0.0000 | 0.0700 | 0.0086 | 0.0137 | 0.0393 |
| HowlerMon | 0.0704 | 0.2157 | 0.0700 | 0.0000 | 0.0736 | 0.0719 | 0.0736 |
| Human | 0.0089 | 0.1965 | 0.0086 | 0.0736 | 0.0000 | 0.0173 | 0.0423 |
| Orangutan | 0.0140 | 0.1944 | 0.0137 | 0.0719 | 0.0173 | 0.0000 | 0.0411 |
| Rhesus | 0.0396 | 0.1962 | 0.0393 | 0.0736 | 0.0423 | 0.0411 | 0.0000 |
Using the distance calculator directly¶
from cogent3 import load_aligned_seqs, get_distance_calculator
aln = load_aligned_seqs("data/primate_brca1.fasta")
dist_calc = get_distance_calculator("tn93", alignment=aln)
dist_calc
<cogent3.evolve.fast_distance.TN93Pair at 0x7fd6009c8640>
dist_calc.run(show_progress=False)
dists = dist_calc.get_pairwise_distances()
dists
| names | Chimpanzee | Galago | Gorilla | HowlerMon | Human | Orangutan | Rhesus |
|---|---|---|---|---|---|---|---|
| Chimpanzee | 0.0000 | 0.1921 | 0.0054 | 0.0704 | 0.0089 | 0.0140 | 0.0396 |
| Galago | 0.1921 | 0.0000 | 0.1923 | 0.2157 | 0.1965 | 0.1944 | 0.1962 |
| Gorilla | 0.0054 | 0.1923 | 0.0000 | 0.0700 | 0.0086 | 0.0137 | 0.0393 |
| HowlerMon | 0.0704 | 0.2157 | 0.0700 | 0.0000 | 0.0736 | 0.0719 | 0.0736 |
| Human | 0.0089 | 0.1965 | 0.0086 | 0.0736 | 0.0000 | 0.0173 | 0.0423 |
| Orangutan | 0.0140 | 0.1944 | 0.0137 | 0.0719 | 0.0173 | 0.0000 | 0.0411 |
| Rhesus | 0.0396 | 0.1962 | 0.0393 | 0.0736 | 0.0423 | 0.0411 | 0.0000 |
The distance calculation object can provide more information. For instance, the standard errors.
dist_calc.stderr
| Seq1 \ Seq2 | Galago | HowlerMon | Rhesus | Orangutan | Gorilla | Human | Chimpanzee |
|---|---|---|---|---|---|---|---|
| Galago | 0 | 0.010274827058395896 | 0.009616307832648562 | 0.009535646532276787 | 0.009491382249540176 | 0.009615033091864917 | 0.009469268026590141 |
| HowlerMon | 0.010274827058395896 | 0 | 0.005411811712554772 | 0.0053348584951611175 | 0.005265612474694246 | 0.005406760238748984 | 0.005273572620183854 |
| Rhesus | 0.009616307832648562 | 0.005411811712554772 | 0 | 0.0039408549417865755 | 0.003852798161903095 | 0.004005045920100125 | 0.0038665597157698894 |
| Orangutan | 0.009535646532276787 | 0.0053348584951611175 | 0.0039408549417865755 | 0 | 0.0022291124743011375 | 0.0025151838791803655 | 0.0022606571679022955 |
| Gorilla | 0.009491382249540176 | 0.005265612474694246 | 0.003852798161903095 | 0.0022291124743011375 | 0 | 0.0017596919902326876 | 0.0013848543487237903 |
| Human | 0.009615033091864917 | 0.005406760238748984 | 0.004005045920100125 | 0.0025151838791803655 | 0.0017596919902326876 | 0 | 0.0017949285088691988 |
| Chimpanzee | 0.009469268026590141 | 0.005273572620183854 | 0.0038665597157698894 | 0.0022606571679022955 | 0.0013848543487237903 | 0.0017949285088691988 | 0 |
7 rows x 8 columns
Likelihood based pairwise distance estimation¶
The standard cogent3 likelihood function can also be used to estimate distances. Because these require numerical optimisation they can be significantly slower than the fast estimation approach above.
The following will use the F81 nucleotide substitution model and perform numerical optimisation.
from cogent3 import load_aligned_seqs, get_model
from cogent3.evolve import distance
aln = load_aligned_seqs("data/primate_brca1.fasta", moltype="dna")
d = distance.EstimateDistances(aln, submodel=get_model("F81"))
d.run(show_progress=False)
dists = d.get_pairwise_distances()
dists
| names | Chimpanzee | Galago | Gorilla | HowlerMon | Human | Orangutan | Rhesus |
|---|---|---|---|---|---|---|---|
| Chimpanzee | 0.0000 | 0.1892 | 0.0054 | 0.0697 | 0.0089 | 0.0140 | 0.0395 |
| Galago | 0.1892 | 0.0000 | 0.1891 | 0.2112 | 0.1934 | 0.1915 | 0.1930 |
| Gorilla | 0.0054 | 0.1891 | 0.0000 | 0.0693 | 0.0086 | 0.0136 | 0.0391 |
| HowlerMon | 0.0697 | 0.2112 | 0.0693 | 0.0000 | 0.0729 | 0.0713 | 0.0729 |
| Human | 0.0089 | 0.1934 | 0.0086 | 0.0729 | 0.0000 | 0.0173 | 0.0421 |
| Orangutan | 0.0140 | 0.1915 | 0.0136 | 0.0713 | 0.0173 | 0.0000 | 0.0410 |
| Rhesus | 0.0395 | 0.1930 | 0.0391 | 0.0729 | 0.0421 | 0.0410 | 0.0000 |