Annotating alignments and sequences
===================================

.. sectionauthor:: Peter Maxwell, Gavin Huttley

A Sequence with a couple of exons on it.

.. jupyter-execute::

    from cogent3 import DNA
    from cogent3.core.annotation import Feature

    s = DNA.make_seq("AAGAAGAAGACCCCCAAAAAAAAAATTTTTTTTTTAAAAAAAAAAAAA", name="Orig")
    exon1 = s.add_annotation(Feature, "exon", "fred", [(10, 15)])
    exon2 = s.add_annotation(Feature, "exon", "trev", [(30, 40)])

The corresponding sequence can be extracted either with slice notation or by asking the feature to do it, since the feature knows what sequence it belongs to.

.. jupyter-execute::

    s[exon1]
    exon1.get_slice()

Usually the only way to get a ``Feature`` object like ``exon1`` is to ask the sequence for it. There is one method for querying annotations by type and optionally by name:

.. jupyter-execute::

    exons = s.get_annotations_matching("exon")
    print(exons)

If the sequence does not have a matching feature you get back an empty list, and slicing the sequence with that returns a sequence of length 0.

.. jupyter-execute::

    dont_exist = s.get_annotations_matching("dont_exist")
    dont_exist
    s[dont_exist]

To construct a pseudo-feature covering (or excluding) multiple features, use ``get_region_covering_all``:

.. jupyter-execute::

    print(s.get_region_covering_all(exons))
    print(s.get_region_covering_all(exons).get_shadow())

eg: all the exon sequence:

.. jupyter-execute::

    s.get_region_covering_all(exons).get_slice()

or with slice notation:
    
.. jupyter-execute::

    s[exon1, exon2]

Though ``.get_region_covering_all`` also guarantees no overlaps within the result, slicing does not:

.. jupyter-execute::
    :raises: ValueError

    print(s.get_region_covering_all(exons + exons))
    s[exon1, exon1, exon1, exon1, exon1]

You can use features, maps, slices or integers, but non-monotonic slices are not allowed:

.. jupyter-execute::
    :raises: ValueError

    s[15:20, 5:16]

Features are themselves sliceable:

.. jupyter-execute::

    exon1[0:3].get_slice()

When sequences are concatenated they keep their (non-overlapping) annotations:
    
.. jupyter-execute::

    c = s[exon1[4:]] + s
    print(len(c))
    for feat in c.annotations:
        print(feat)

Since features know their parents you can't use a feature from one sequence to slice another:
    
.. jupyter-execute::
    :raises: ValueError

    print(c[exon1])

Features are generally attached to the thing they annotate, but in those cases where a free-floating feature is created it can later be attached:

.. jupyter-execute::

    len(s.annotations)
    region = s.get_region_covering_all(exons)
    len(s.annotations)
    region.attach()
    len(s.annotations)
    region.detach()
    len(s.annotations)

When dealing with sequences that can be reverse complemented (e.g. ``DnaSequence``) features are **not** reversed. Features are considered to have strand specific meaning (.e.g CDS, exons) and so stay on their original strands. We create a sequence with a CDS that spans multiple exons, and show that after getting the reverse complement we have exactly the same result from getting the CDS annotation.

.. jupyter-execute::

    plus = DNA.make_seq("AAGGGGAAAACCCCCAAAAAAAAAATTTTTTTTTTAAA", name="plus")
    plus_cds = plus.add_annotation(Feature, "CDS", "gene", [(2, 6), (10, 15), (25, 35)])
    print(plus_cds.get_slice())
    minus = plus.rc()
    minus_cds = minus.get_annotations_matching("CDS")[0]
    print(minus_cds.get_slice())

Sequence features can be accessed via a containing ``Alignment``:

.. jupyter-execute::

    from cogent3 import make_aligned_seqs

    aln = make_aligned_seqs(
        [["x", "-AAAAAAAAA"], ["y", "TTTT--TTTT"]], array_align=False
    )
    print(aln)
    exon = aln.get_seq("x").add_annotation(Feature, "exon", "fred", [(3, 8)])
    aln_exons = aln.get_annotations_from_seq("x", "exon")
    aln_exons = aln.get_annotations_from_any_seq("exon")

But these will be returned as **alignment** features with locations in alignment coordinates.

.. jupyter-execute::

    print(exon)
    print(aln_exons[0])
    print(aln_exons[0].get_slice())
    aln_exons[0].attach()
    len(aln.annotations)

Similarly alignment features can be projected onto the aligned sequences, where they may end up falling across gaps:

.. jupyter-execute::

    exons = aln.get_projected_annotations("y", "exon")
    print(exons)
    print(aln.get_seq("y")[exons[0].map.without_gaps()])

We copy the annotations from another sequence,

.. jupyter-execute::

    aln = make_aligned_seqs(
        [["x", "-AAAAAAAAA"], ["y", "TTTT--CCCC"]], array_align=False
    )
    s = DNA.make_seq("AAAAAAAAA", name="x")
    exon = s.add_annotation(Feature, "exon", "fred", [(3, 8)])
    exon = aln.get_seq("x").copy_annotations(s)
    aln_exons = list(aln.get_annotations_from_seq("x", "exon"))
    print(aln_exons)

even if the name is different.

.. jupyter-execute::

    exon = aln.get_seq("y").copy_annotations(s)
    aln_exons = list(aln.get_annotations_from_seq("y", "exon"))
    print(aln_exons)
    print(aln[aln_exons])

If the feature lies outside the sequence being copied to, you get a lost span

.. jupyter-execute::

    aln = make_aligned_seqs([["x", "-AAAA"], ["y", "TTTTT"]], array_align=False)
    seq = DNA.make_seq("CCCCCCCCCCCCCCCCCCCC", "x")
    exon = seq.add_feature("exon", "A", [(5, 8)])
    aln.get_seq("x").copy_annotations(seq)
    copied = list(aln.get_annotations_from_seq("x", "exon"))
    copied
    copied[0].get_slice()

You can copy to a sequence with a different name, in a different alignment if the feature lies within the length

.. jupyter-execute::

    aln = make_aligned_seqs(
        [["x", "-AAAAAAAAA"], ["y", "TTTT--TTTT"]], array_align=False
    )
    seq = DNA.make_seq("CCCCCCCCCCCCCCCCCCCC", "x")
    match_exon = seq.add_feature("exon", "A", [(5, 8)])
    aln.get_seq("y").copy_annotations(seq)
    copied = list(aln.get_annotations_from_seq("y", "exon"))
    copied

If the sequence is shorter, again you get a lost span.

.. jupyter-execute::

    aln = make_aligned_seqs(
        [["x", "-AAAAAAAAA"], ["y", "TTTT--TTTT"]], array_align=False
    )
    diff_len_seq = DNA.make_seq("CCCCCCCCCCCCCCCCCCCCCCCCCCCC", "x")
    nonmatch = diff_len_seq.add_feature("repeat", "A", [(12, 14)])
    aln.get_seq("y").copy_annotations(diff_len_seq)
    copied = list(aln.get_annotations_from_seq("y", "repeat"))
    copied

We consider cases where there are terminal gaps.

.. jupyter-execute::

    aln = make_aligned_seqs(
        [["x", "-AAAAAAAAA"], ["y", "------TTTT"]], array_align=False
    )
    exon = aln.get_seq("x").add_feature("exon", "fred", [(3, 8)])
    aln_exons = list(aln.get_annotations_from_seq("x", "exon"))
    print(aln_exons)
    print(aln_exons[0].get_slice())
    aln = make_aligned_seqs(
        [["x", "-AAAAAAAAA"], ["y", "TTTT--T---"]], array_align=False
    )
    exon = aln.get_seq("x").add_feature("exon", "fred", [(3, 8)])
    aln_exons = list(aln.get_annotations_from_seq("x", "exon"))
    print(aln_exons[0].get_slice())

In this case, only those residues included within the feature are covered - note the omission of the T in ``y`` opposite the gap in ``x``.

.. jupyter-execute::

    aln = make_aligned_seqs(
        [["x", "C-CCCAAAAA"], ["y", "-T----TTTT"]], moltype="dna", array_align=False
    )
    print(aln)
    exon = aln.get_seq("x").add_feature("exon", "ex1", [(0, 4)])
    print(exon)
    print(exon.get_slice())
    aln_exons = list(aln.get_annotations_from_seq("x", "exon"))
    print(aln_exons)
    print(aln_exons[0].get_slice())

``Feature.as_one_span()``, is applied to the exon that straddles the gap in ``x``. The result is we preserve that feature.

.. jupyter-execute::

    print(aln_exons[0].as_one_span().get_slice())

These properties also are consistently replicated with reverse complemented sequences.

.. jupyter-execute::

    aln_rc = aln.rc()
    rc_exons = list(aln_rc.get_annotations_from_any_seq("exon"))
    print(aln_rc[rc_exons])  # not using as_one_span, so gap removed from x
    print(aln_rc[rc_exons[0].as_one_span()])

Features can provide their coordinates, useful for custom analyses.
    
.. jupyter-execute::

    all_exons = aln.get_region_covering_all(aln_exons)
    coords = all_exons.get_coordinates()
    assert coords == [(0, 1), (2, 5)]

Annotated regions can be masked (observed sequence characters replaced by another), either through the sequence on which they reside or by projection from the alignment. Note that ``mask_char`` must be a valid character for the sequence ``MolType``. Either the features (multiple can be named), or their shadow, can be masked.

We create an alignment with a sequence that has two different annotation types.

.. jupyter-execute::

    aln = make_aligned_seqs(
        [["x", "C-CCCAAAAAGGGAA"], ["y", "-T----TTTTG-GTT"]], array_align=False
    )
    print(aln)
    exon = aln.get_seq("x").add_feature("exon", "norwegian", [(0, 4)])
    print(exon.get_slice())
    repeat = aln.get_seq("x").add_feature("repeat", "blue", [(9, 12)])
    print(repeat.get_slice())
    repeat = aln.get_seq("y").add_feature("repeat", "frog", [(5, 7)])
    print(repeat.get_slice())

Each sequence should correctly mask either the single feature, it's shadow, or the multiple features, or shadow.

.. jupyter-execute::

    print(aln.get_seq("x").with_masked_annotations("exon", mask_char="?"))
    print(aln.get_seq("x").with_masked_annotations("exon", mask_char="?", shadow=True))
    print(aln.get_seq("x").with_masked_annotations(["exon", "repeat"], mask_char="?"))
    print(
        aln.get_seq("x").with_masked_annotations(
            ["exon", "repeat"], mask_char="?", shadow=True
        )
    )
    print(aln.get_seq("y").with_masked_annotations("exon", mask_char="?"))
    print(aln.get_seq("y").with_masked_annotations("repeat", mask_char="?"))
    print(
        aln.get_seq("y").with_masked_annotations("repeat", mask_char="?", shadow=True)
    )

The same methods can be applied to annotated Alignment's.

.. jupyter-execute::

    print(aln.with_masked_annotations("exon", mask_char="?"))
    print(aln.with_masked_annotations("exon", mask_char="?", shadow=True))
    print(aln.with_masked_annotations("repeat", mask_char="?"))
    print(aln.with_masked_annotations("repeat", mask_char="?", shadow=True))
    print(aln.with_masked_annotations(["repeat", "exon"], mask_char="?"))
    print(aln.with_masked_annotations(["repeat", "exon"], shadow=True))

It shouldn't matter whether annotated coordinates are entered separately, or as a series.

.. jupyter-execute::

    data = [["human", "CGAAACGTTT"], ["mouse", "CTAAACGTCG"]]
    as_series = make_aligned_seqs(data, array_align=False)
    as_items = make_aligned_seqs(data, array_align=False)

We add annotations to the sequences as a series.

.. jupyter-execute::

    as_series.get_seq("human").add_feature("cpgsite", "cpg", [(0, 2), (5, 7)])
    as_series.get_seq("mouse").add_feature("cpgsite", "cpg", [(5, 7), (8, 10)])

We add the annotations to the sequences one segment at a time.

.. jupyter-execute::

    as_items.get_seq("human").add_feature("cpgsite", "cpg", [(0, 2)])
    as_items.get_seq("human").add_feature("cpgsite", "cpg", [(5, 7)])
    as_items.get_seq("mouse").add_feature("cpgsite", "cpg", [(5, 7)])
    as_items.get_seq("mouse").add_feature("cpgsite", "cpg", [(8, 10)])

These different constructions should generate the same output.

.. jupyter-execute::

    serial = as_series.with_masked_annotations(["cpgsite"])
    print(serial)
    itemwise = as_items.with_masked_annotations(["cpgsite"])
    print(itemwise)

Annotations should be correctly masked, whether the sequence has been reverse complemented or not. We use the plus/minus strand CDS containing sequences created above.

.. jupyter-execute::

    print(plus.with_masked_annotations("CDS"))
    print(minus.with_masked_annotations("CDS"))