Metadata-Version: 2.1
Name: text-alignment-tool
Version: 0.2.15
Summary: A tool for performing complex text alignment processes.
Home-page: https://gitlab.com/sofer_mahir/text_alignment_tool
License: MIT
Keywords: alignment,needleman,wunsch,pipeline
Author: Bronson Brown-deVost
Author-email: bronsonbdevost@gmail.com
Requires-Python: >=3.7.1,<3.11
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Requires-Dist: bidict (>=0.21.4,<0.22.0)
Requires-Dist: cursive-re (>=0.0.4,<0.0.5)
Requires-Dist: dotmap (>=1.3.26,<2.0.0)
Requires-Dist: edlib (>=1.3.9,<2.0.0)
Requires-Dist: fuzzywuzzy (>=0.18.0,<0.19.0)
Requires-Dist: lxml (>=4.7.1,<5.0.0)
Requires-Dist: minineedle (>=3.0.0,<4.0.0)
Requires-Dist: numpy (>=1.21.4,<2.0.0)
Requires-Dist: pandas (>=1.3.5,<2.0.0)
Requires-Dist: python-Levenshtein (>=0.12.2,<0.13.0)
Requires-Dist: swalign (>=0.3.6,<0.4.0)
Requires-Dist: terminaltables (>=3.1.10,<4.0.0)
Project-URL: Repository, https://gitlab.com/sofer_mahir/text_alignment_tool
Project-URL: issues, https://gitlab.com/sofer_mahir/text_alignment_tool/issues
Description-Content-Type: text/markdown

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

# The Text Alignment Tool

This Python text alignment tool is intended to be a general purpose tool for aligning texts in a robust and easily extensible way. It tracks any changes to the original text so that there is an end-to-end mapping of the alignment data.

## Architecture

![Diagram of Alignment Tool Pipeline Structure](./aligner_pipeline.svg "Alignment Tool Pipeline Structure")
1.  The alignment tool consists of a main class `TextAlignmentTool`, which coordinates the alignment process. 
2.  The alignment tool receives a single `TextLoader` for the query text and a single `TextLoader` for the target text (you must keep track of the mapping from the original input text(s) in the `TextLoader` and its output to the rest of the pipeline).
3.  The alignment tool is then fed *n* `TextTransformer`s for each text and *n* `AlignmentAlgorithm`s. These can be used in any combination and order. For example, the query text could pass through 3 `TextTransformer`s while the target text passes through 1 `TextTransformer`; both then go through a single `AlignmentAlgorithm`, after which the target text passes through 2 more `TextTransformer`s before a final `AlignmentAlgorithm` is run on the pair of texts.
4. `find_alignment_to_query` and `find_alignment_to_target` will backtrack through the text mappings and provide a key for mapping either the query to the target or the target to the query.

A somewhat basic alignment process could look something like this:

```python
# Create text loaders for query and target
query_loader = PgpAltoXMLTextLoader(list(QUERY_TEMP_FOLDER.glob("**/*.xml")))
target_loader = PgpXmlTeiTextLoader(list(TARGET_TEMP_FOLDER.glob("**/*.xml")))

# Create the alignment tool
aligner = TextAlignmentTool(query_loader, target_loader)

# Perform three transformation operations on the target
normalize_target_sigla = PgpTeiNormalizeSiglaTransformer()
remove_target_extras = PgpTeiRemoveExtrasTransformer()
relocate_insertions = PgpTeiRelocateInsertionsTransformer()

aligner.target_text_transforms(
    [normalize_target_sigla, remove_target_extras, relocate_insertions]
)

# Create and run one alignment process
first_alignment_algorithm = LineAlignmentAlgorithm()
aligner.align_text(first_alignment_algorithm)

# Get the mapping information for the alignment
alignment_mappings = aligner.latest_alignment
```

## Functionality

Tracking of text changes and mappings to aligned text uses a system of index maps. The `TextLoader` ingests the input text and outputs a 1-dimensional numpy uint32 array containing one number for each character of the input text, in the order it occurs within the text (the number is simply the Unicode code point of the character, as returned by Python's [`ord`](https://docs.python.org/3/library/functions.html#ord) function).

### Text Loader

For example, let's imagine we have our initial text in a simple text file, and we will assume the line breaks are significant for the alignment process:

```
Lorem ipsum dolor sit amet 
consectetur adipiscing elit
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua
```
We would write a simple loader for this to ingest the text and preserve a record of the line breaks:
```python
from typing import List, Tuple

import numpy as np
from text_alignment_tool import TextChunk

text = """Lorem ipsum dolor sit amet 
consectetur adipiscing elit
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua"""


def parse_text(text: str) -> Tuple[List[Tuple[int, int]], List[TextChunk], np.ndarray]:
    input_output_map: List[Tuple[int, int]] = []
    text_chunk_indices: List[TextChunk] = []
    output_text: List[int] = []

    text_chunk_start_idx = 0
    for input_idx, char in enumerate(text):
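        output_idx = len(output_text)
        # Every input character, line breaks included, gets an entry in the map
        input_output_map.append((input_idx, output_idx))
        if char == "\n":
            # A line break closes the current chunk and is not copied to the output
            # (the last line has no trailing line break, so it never reaches this branch)
            text_chunk_indices.append(TextChunk(text_chunk_start_idx, output_idx))
            text_chunk_start_idx = output_idx + 1
            continue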
        output_text.append(ord(char))

    return input_output_map, text_chunk_indices, np.array(output_text, dtype=np.uint32) 

input_output_map, text_chunk_indices, output_text = parse_text(text)

# Inspect the results
print(input_output_map[25:30])
print(text_chunk_indices)
print(output_text[0:5]) 

# Deserialize text
print(''.join([chr(x) for x in output_text[0:5]]))

```
Output:
```
[(25, 25), (26, 26), (27, 27), (28, 27), (29, 28)]
[TextChunk(start_idx=0, end_idx=27), TextChunk(start_idx=28, end_idx=54)]
[ 76 111 114 101 109]
Lorem
```

When creating a custom text loader, subclass `TextLoader` and make sure to calculate `self._output`, `self._input_output_map`, and `self._text_chunk_indices`. You can modify the `__init__()` method to take whatever arguments you need, and adapt the class in any way required to perform the parsing operation. It is also helpful to include a method in the custom `TextLoader` that rebuilds text in the input format from the data produced by the alignment operation.

### Text Transformer

The output of the `TextLoader` may be exactly what is needed for the alignment process, but often it will be necessary to perform other alterations, such as stripping out unnecessary characters, performing rule-based character conversions, or refining the text chunks. Any number of `TextTransformer`s can be used in series to accomplish this. Narrowly focused `TextTransformer`s are easier to debug and easier to mix and match as needed to achieve the desired alignment.
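As a rough illustration of running narrowly focused steps in series, the sketch below chains two plain functions, each returning its output together with an index map of `(input_idx, output_idx)` pairs. The names `drop_punctuation`, `lowercase`, and `run_chain` are hypothetical and use plain lists; they are not part of the library's `TextTransformer` API.

```python
from typing import Callable, List, Tuple

IndexMap = List[Tuple[int, int]]
Transform = Callable[[List[int]], Tuple[List[int], IndexMap]]


def drop_punctuation(codes: List[int]) -> Tuple[List[int], IndexMap]:
    """Hypothetical step: remove a few punctuation marks."""
    output: List[int] = []
    index_map: IndexMap = []
    for input_idx, code in enumerate(codes):
        if chr(code) in ".,;:!?":
            continue  # dropped characters get no map entry
        index_map.append((input_idx, len(output)))
        output.append(code)
    return output, index_map


def lowercase(codes: List[int]) -> Tuple[List[int], IndexMap]:
    """Hypothetical step: lowercase each character; indices are unchanged."""
    output: List[int] = []
    for code in codes:
        lowered = chr(code).lower()
        # Keep the original code if lowercasing would yield several characters
        output.append(ord(lowered) if len(lowered) == 1 else code)
    return output, [(i, i) for i in range(len(codes))]


def run_chain(codes: List[int], transforms: List[Transform]) -> Tuple[List[int], List[IndexMap]]:
    """Apply each step in order and keep every step's index map."""
    maps: List[IndexMap] = []
    for transform in transforms:
        codes, index_map = transform(codes)
        maps.append(index_map)
    return codes, maps


text = "Hello, World!"
final, maps = run_chain([ord(c) for c in text], [drop_punctuation, lowercase])
print("".join(chr(c) for c in final))
print(maps[0][5:7])
```

Keeping one map per step means a problem in the final text can be traced to the step that introduced it.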

When passing a text through a `TextTransformer`, the transformer must use its `_input_output_map` to track how it has changed the input. For instance, if we wanted to create a transformer to remove the word "the", we might start with a text input like "the quick brown dog jumped over the lazy fox.", which in the alignment tool is:
```[116, 104, 101, 32, 113, 117, 105, 99, 107, 32, 98, 114, 111, 119, 110, 32, 100, 111, 103, 32, 106, 117, 109, 112, 101, 100, 32, 111, 118, 101, 114, 32, 116, 104, 101, 32, 108, 97, 122, 121, 32, 102, 111, 120, 46]```

The output from the `TextTransformer` would be:
```[113, 117, 105, 99, 107, 32, 98, 114, 111, 119, 110, 32, 100, 111, 103, 32, 106, 117, 109, 112, 101, 100, 32, 111, 118, 101, 114, 32, 108, 97, 122, 121, 32, 102, 111, 120, 46]```

And the `_input_output_map` would show the mappings from the index of the input array to the index of the output array: 
```[(4,0),(5,1),(6,2),(7,3), ...]```

| input val | map input idx to output idx | output val |
| :-------: | :-------------------------: | :--------: |
|    113    |            (4,0)            |    113     |
|    117    |            (5,1)            |    117     |
|    105    |            (6,2)            |    105     |
|    99     |            (7,3)            |     99     |
|    107    |            (8,4)            |    107     |
|    32     |            (9,5)            |     32     |
|    ...    |             ...             |    ...     |
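
The mapping in the table can be reproduced with a few lines of plain Python. This is only a sketch of the bookkeeping: `remove_word` is a hypothetical helper that works on plain lists and stands in for what a real `TextTransformer` subclass would store in `_input_output_map`.

```python
from typing import List, Tuple


def remove_word(codes: List[int], word: str) -> Tuple[List[int], List[Tuple[int, int]]]:
    """Drop each occurrence of `word` plus its following space and record
    an (input_idx, output_idx) pair for every surviving character."""
    pattern = [ord(c) for c in word] + [ord(" ")]
    output: List[int] = []
    index_map: List[Tuple[int, int]] = []
    i = 0
    while i < len(codes):
        at_word_start = i == 0 or codes[i - 1] == ord(" ")
        if at_word_start and codes[i : i + len(pattern)] == pattern:
            i += len(pattern)  # skip the word and its trailing space
            continue
        index_map.append((i, len(output)))
        output.append(codes[i])
        i += 1
    return output, index_map


sentence = "the quick brown dog jumped over the lazy fox."
codes = [ord(c) for c in sentence]
output, index_map = remove_word(codes, "the")

print("".join(chr(c) for c in output))
print(index_map[:4])
```

Characters that are removed simply have no entry in the map, which is how a later step can tell that they have no counterpart in the output.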

Changing the order of individual elements in the list is also possible, for instance for the same input above we could instead have the output: 
```[98, 114, 111, 119, 110, 32, 113, 117, 105, 99, 107, 32, 116, 104, 101, 32, 100, 111, 103, 32, 106, 117, 109, 112, 101, 100, 32, 111, 118, 101, 114, 32, 116, 104, 101, 32, 108, 97, 122, 121, 32, 102, 111, 120, 46]```

The words "the" and "brown" have been transposed, and the resulting `_input_output_map` would be:
```[(10,0),(11,1),(12,2),(13,3),(14,4),(3,5),(4,6),(5,7),(6,8),(7,9),(8,10),(9,11),(0,12),(1,13),(2,14),(15,15), ...]```
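
One way to sanity-check a reordering map like this is to apply it to the input and read the result. The sketch below uses a hypothetical helper, `apply_index_map`, on the first four words of the sentence and assumes that every output index appears exactly once in the map.

```python
from typing import List, Tuple


def apply_index_map(codes: List[int], index_map: List[Tuple[int, int]]) -> List[int]:
    """Build the output by copying codes[input_idx] to position output_idx."""
    output = [0] * len(index_map)
    for input_idx, output_idx in index_map:
        output[output_idx] = codes[input_idx]
    return output


codes = [ord(c) for c in "the quick brown dog"]
index_map = [
    (10, 0), (11, 1), (12, 2), (13, 3), (14, 4),          # "brown"
    (3, 5),                                               # space
    (4, 6), (5, 7), (6, 8), (7, 9), (8, 10),              # "quick"
    (9, 11),                                              # space
    (0, 12), (1, 13), (2, 14),                            # "the"
    (15, 15), (16, 16), (17, 17), (18, 18),               # " dog"
]

reordered = apply_index_map(codes, index_map)
print("".join(chr(c) for c in reordered))
```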

The `TextTransformer` may also redefine text chunks with the `_text_chunk_indices` property, which is a simple ordered list of start and end indices that define *n* sections of the output text (you may use overlapping sections if desired), e.g., `[(0,20),(21,35),(30,91)]` defines three chunks of the text using indices 0–20, 21–35, and 30–91.

### Alignment Algorithm

The `AlignmentAlgorithm` class can be subclassed to perform analysis of both the query and target text at the same time. Any number of such classes may be used at any place within the alignment pipeline. The `AlignmentAlgorithm` will always receive a `self._query` and a `self._target` property, both of which are provided automatically to it by the `TextAlignmentTool` from the output of the latest transformation of the query and target texts. It will also automatically receive the latest `_text_chunk_indices` for the query and for the target as `self._input_query_text_chunk_indices` and `self._input_target_text_chunk_indices`.

An `AlignmentAlgorithm` produces a mapping in the `_alignment` property. As a simplified example, query = `['h','e','l','l','o',' ','w','o','r','l','d']` and target = `['h','e','l','l','o',' ','w','a','d','d']` (in our system these would be arrays of uint32 values, not strings) could be aligned as `[(0,0),(1,1),(2,2),(3,3),(4,4),(5,5),(6,6),(10,9)]` (for Wadd, see https://en.wikipedia.org/wiki/Wadd).
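
To show the shape of such a mapping, the sketch below produces `(query_idx, target_idx)` pairs with the standard library's `difflib.SequenceMatcher`. This is a stand-in chosen for brevity, not one of the tool's own algorithms, and the function name `align` is hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def align(query: List[int], target: List[int]) -> List[Tuple[int, int]]:
    """Return a (query_idx, target_idx) pair for every matched character."""
    matcher = SequenceMatcher(None, query, target, autojunk=False)
    pairs: List[Tuple[int, int]] = []
    for block in matcher.get_matching_blocks():
        # Each block is a run of `size` equal elements starting at a and b
        for offset in range(block.size):
            pairs.append((block.a + offset, block.b + offset))
    return pairs


query = [ord(c) for c in "hello world"]
target = [ord(c) for c in "hello wadd"]
alignment = align(query, target)
print(alignment)
```

With this matcher the final `d` of the query is paired with the first `d` of the target (index 8) rather than the second (index 9) as in the example above. Both are valid alignments, and which one is returned depends on the algorithm's tie-breaking.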

An `AlignmentAlgorithm` can also be used to redefine text chunks based on mutual analysis of the query and target texts. That is, the `AlignmentAlgorithm` may be used both for gross alignments, which define possibly corresponding text chunks through the properties `_output_query_text_chunk_indices` and `_output_target_text_chunk_indices`, and for fine-grained alignment through the `_alignment` property, which is simply a list of the corresponding character indices in the query and target texts.

### Alignment Operation Tracking

The `TextAlignmentTool` automatically keeps track of the order of operations and the transforms that have been performed in the `__operation_list` property, which contains a list of `AlignmentOperation`s. This makes it simple to inspect any part of the alignment process for debugging purposes and also enables custom mappings between query and target.

The convenience methods `find_alignment_to_query` and `find_alignment_to_target` enable you to walk the alignments and transforms back to the initial input provided by the `TextLoader`. You will need to provide your own function within the `TextLoader` to transform the aligned text into your desired format.
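
The idea of walking back through the maps can be sketched independently of the library. In the example below, `backtrack` is a hypothetical helper that inverts each map in reverse order of application. It assumes that each output index has a single source index and says nothing about how `find_alignment_to_query` is actually implemented.

```python
from typing import List, Optional, Tuple

IndexMap = List[Tuple[int, int]]


def backtrack(output_idx: int, transform_maps: List[IndexMap]) -> Optional[int]:
    """Walk an index back through maps listed in the order they were applied."""
    idx = output_idx
    for index_map in reversed(transform_maps):
        reverse = {out_idx: in_idx for in_idx, out_idx in index_map}
        if idx not in reverse:
            return None  # this position has no source in the earlier text
        idx = reverse[idx]
    return idx


# Query "the cat sat" -> drop "the " -> "cat sat" -> drop spaces -> "catsat"
drop_the: IndexMap = [(i + 4, i) for i in range(7)]
drop_spaces: IndexMap = [(0, 0), (1, 1), (2, 2), (4, 3), (5, 4), (6, 5)]
query_maps = [drop_the, drop_spaces]

# The target "catsat" is untransformed, so the alignment pairs equal indices
alignment = [(i, i) for i in range(6)]

original_pairs = [
    (backtrack(query_idx, query_maps), target_idx)
    for query_idx, target_idx in alignment
]
print(original_pairs)
```

Each pair now refers to a character position in the original query string, so index 8 (the `s` of "sat") lines up with target index 3.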

### Debugging Help

When you use the `TextAlignmentTool` in a debugging context, it will inject an instance of the `DebugHelper` class into the global context as `dbg`. This helper provides four convenience methods to inspect your alignment pipeline: `dbg.display_text`, `dbg.display_text_chunk`, `dbg.display_text_chunks`, and `dbg.display_text_region`. These methods output the human-readable text for the internal uint32 numpy array representation of the text and can also extract specified ranges and text chunks.
