Metadata-Version: 2.1
Name: Quid
Version: 2.1.0
Summary: Quid is a tool for quotation detection in texts and can deal with common properties of quotations, for example, ellipses or inaccurate quotations.
Home-page: https://hu.berlin/quid
Author: Frederik Arnold
Author-email: frederik.arnold@hu-berlin.de
Project-URL: Source, https://hu.berlin/quid
Keywords: quotation detection,quotation identification,literal citation extraction,key passages,natural language processing,nlp,text reuse
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE

# Readme

Quid is a tool for quotation detection in texts and can deal with common properties of quotations, for example, ellipses or inaccurate quotations.

## Overview
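Quid finds quotations in two texts, called source and target. If known, the source text should be the one that is quoted by the target text. This allows the algorithm to handle features such as ellipses in quotations. In the following example, each line shows the start and end character positions of a span followed by its text, first for the source and then for the target: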
~~~
0	52	This is a long Text and the long test goes on and on
0	45	This is a long Text [...] test goes on and on
~~~

## Installation
~~~
pip install Quid
~~~

## Usage
The algorithm can be used in two ways: in code and from the command line. The following two sections describe each.

### In code
The algorithm can be found in the package `quid`. To use it, create a `Quid` object, which accepts the following arguments:
- The length of the shortest match (default: 5)
- The number of tokens to skip when looking backwards (default: 10)
- The number of tokens to skip when looking ahead (default: 3)
- The maximum distance in tokens between two matches considered for merging (default: 2)
- The maximum distance in tokens between two matches considered for merging where the target text contains an ellipsis between the matches (default: 10)

Then call the `compare` method on the object, passing the two texts to be compared.
The method returns a list of matches: `List[Match]`. Each `Match` stores two `MatchSpan` objects, one for the source text and one for the target text. Each `MatchSpan` stores the `start` and `end` character positions of the matching span in its text.

~~~
from quid.core.Quid import Quid

quid = Quid()
matches = quid.compare('file 1 content', 'file 2 content')
~~~
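
The returned positions can be used to slice the original texts. The following is a minimal sketch, not part of Quid itself. It assumes that each `Match` exposes its spans as `source_span` and `target_span` attributes (names borrowed from the JSON output of the command line tool, not confirmed for the Python API) and that `end` is exclusive, as the example in the overview suggests. The helper `matched_texts` and the stand-in objects are hypothetical; with Quid installed, the result of `quid.compare(...)` would take their place.

```python
from types import SimpleNamespace


def matched_texts(matches, source_text, target_text):
    """Return (source_excerpt, target_excerpt) pairs for a list of matches.

    Assumes each match has `source_span` and `target_span` attributes,
    each with `start` and `end` character positions (end exclusive).
    """
    pairs = []
    for match in matches:
        src, tgt = match.source_span, match.target_span
        pairs.append((source_text[src.start:src.end],
                      target_text[tgt.start:tgt.end]))
    return pairs


source_text = "This is a long Text and the long test goes on and on"
target_text = "This is a long Text [...] test goes on and on"

# Hypothetical stand-in for the result of quid.compare(source_text, target_text)
example_matches = [
    SimpleNamespace(
        source_span=SimpleNamespace(start=0, end=52),
        target_span=SimpleNamespace(start=0, end=45),
    )
]

for source_excerpt, target_excerpt in matched_texts(
        example_matches, source_text, target_text):
    print(source_excerpt)
    print(target_excerpt)
```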

### Command line
The `quid compare` command provides a command line interface to the algorithm.

~~~
usage: QuidCLI.py compare [-h] [--text] [--no-text]
                           [--output-type {json,text, csv}]
                           [--csv-sep CSV_SEP]
                           [--output-folder-path OUTPUT_FOLDER_PATH]
                           [--min-match-length MIN_MATCH_LENGTH]
                           [--look-back-limit LOOK_BACK_LIMIT]
                           [--look-ahead-limit LOOK_AHEAD_LIMIT]
                           [--max-merge-distance MAX_MERGE_DISTANCE]
                           [--max-merge-ellipsis-distance MAX_MERGE_ELLIPSIS_DISTANCE]
                           [--create-dated-subfolder]
                           [--no-create-dated-subfolder]
                           [--max-num-processes MAX_NUM_PROCESSES]
                           [--keep-ambiguous-matches]
                           [--no-keep-ambiguous-matches]
                           source-file-path target-path

Quid compare allows the user to find quotations in two texts, a source text
and a target text. If known, the source text should be the one that is quoted
by the target text. This allows the algorithm to handle things like ellipsis
in quotations.

positional arguments:
  source-file-path      Path to the source text file
  target-path           Path to the target text file or folder

optional arguments:
  -h, --help            show this help message and exit
  --text                Include matched text in the returned data structure
  --no-text             Don't include matched text in the returned data
                        structure
  --output-type {json,text, csv}
                        The output type
  --csv-sep CSV_SEP     output separator for csv (default: '\t')
  --output-folder-path OUTPUT_FOLDER_PATH
                        The output folder path. If this option is set the
                        output will be saved to a file created in the
                        specified folder
  --min-match-length MIN_MATCH_LENGTH
                        The length of the shortest match (>= 1, default: 5)
  --look-back-limit LOOK_BACK_LIMIT
                        The number of tokens to skip when looking backwards
                        (>= 0, default: 10), (Very rarely needed)
  --look-ahead-limit LOOK_AHEAD_LIMIT
                        The number of tokens to skip when looking ahead (>= 0,
                        default: 3)
  --max-merge-distance MAX_MERGE_DISTANCE
                        The maximum distance in tokens between two matches
                        considered for merging (>= 0, default: 2)
  --max-merge-ellipsis-distance MAX_MERGE_ELLIPSIS_DISTANCE
                        The maximum distance in tokens between two matches
                        considered for merging where the target text contains
                        an ellipsis between the matches (>= 0, default: 10)
  --create-dated-subfolder
                        Create a subfolder named with the current date to
                        store the results
  --no-create-dated-subfolder
                        Don't create a subfolder named with the current date
                        to store the results
  --max-num-processes MAX_NUM_PROCESSES
                        Maximum number of processes to use for parallel
                        processing
  --keep-ambiguous-matches
                        Keep ambiguous matches
  --no-keep-ambiguous-matches
                        Don't keep ambiguous matches
~~~
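
When the command is called from a script, the argument list can be assembled programmatically. The sketch below only uses flags listed in the help text above. The helper name `build_compare_command`, the file names, and the assumption that the tool is installed as an executable called `quid` are illustrative and may need adjusting.

```python
import shlex


def build_compare_command(source_file, target_path, output_type="json",
                          include_text=True, output_folder=None,
                          min_match_length=5):
    """Assemble the argument list for `quid compare` (hypothetical helper)."""
    command = [
        "quid", "compare", source_file, target_path,
        "--output-type", output_type,
        "--min-match-length", str(min_match_length),
    ]
    # --text and --no-text are mutually exclusive switches
    command.append("--text" if include_text else "--no-text")
    if output_folder is not None:
        command += ["--output-folder-path", output_folder]
    return command


command = build_compare_command("source.txt", "targets/",
                                include_text=False,
                                output_folder="matches/")
# Print a shell-safe version of the command for inspection
print(" ".join(shlex.quote(part) for part in command))
# To execute it: subprocess.run(command, check=True)
```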

By default, the result is returned as a JSON structure: `List[Match]`. Each `Match` stores two `MatchSpan` objects, one for the source text and one for the target text. Each `MatchSpan` stores the `start` and `end` character positions of the matching span in its text.
For example:

~~~
[
  {
    "source_span": {
      "start": 0,
      "end": 52,
      "text": "This is a long Text and the long test goes on and on"
    },
    "target_span": {
      "start": 0,
      "end": 45,
      "text": "This is a long Text [...] test goes on and on"
    }
  }
]
~~~
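
The JSON output can be loaded with Python's standard `json` module. The sketch below parses the example above and checks that each span's `text` equals the slice of the corresponding input text. The function name `check_spans` is hypothetical, and treating `end` as exclusive is an assumption based on the positions in the example.

```python
import json

source_text = "This is a long Text and the long test goes on and on"
target_text = "This is a long Text [...] test goes on and on"

result_json = """
[
  {
    "source_span": {"start": 0, "end": 52,
                    "text": "This is a long Text and the long test goes on and on"},
    "target_span": {"start": 0, "end": 45,
                    "text": "This is a long Text [...] test goes on and on"}
  }
]
"""


def check_spans(matches, source_text, target_text):
    """Return True if every span's text equals the slice at its positions."""
    texts = (("source_span", source_text), ("target_span", target_text))
    for match in matches:
        for key, full_text in texts:
            span = match[key]
            # The "text" field is absent when --no-text is used
            if "text" not in span:
                continue
            if full_text[span["start"]:span["end"]] != span["text"]:
                return False
    return True


matches = json.loads(result_json)
print(check_spans(matches, source_text, target_text))
```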

Alternatively, the result can be printed in a human-readable text format, e.g.:

~~~
0	52	This is a long Text and the long test goes on and on
0	45	This is a long Text [...] test goes on and on 
~~~
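
The text format can also be read back into a structure. The following sketch assumes that fields are separated by tabs and that lines alternate between a source span and its target span, as in the example above; the layout for multiple matches is not documented here, so treat this as an assumption. The function name `parse_text_output` is hypothetical.

```python
def parse_text_output(output):
    """Parse tab-separated 'start, end[, text]' lines into span pairs."""
    spans = []
    for line in output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        parts = line.split("\t", 2)
        spans.append({
            "start": int(parts[0]),
            "end": int(parts[1]),
            # The text column is missing when --no-text is used
            "text": parts[2] if len(parts) > 2 else None,
        })
    # Assumption: lines alternate between source span and target span
    return list(zip(spans[0::2], spans[1::2]))


example_output = (
    "0\t52\tThis is a long Text and the long test goes on and on\n"
    "0\t45\tThis is a long Text [...] test goes on and on\n"
)

for source_span, target_span in parse_text_output(example_output):
    print(source_span["start"], source_span["end"], "->",
          target_span["start"], target_span["end"])
```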

If the matching text is not needed, the `--no-text` option excludes it from the output.

## Passager
The package `passager` contains code to extract key passages from the found matches. The `passage` command produces several JSON files.
The resulting data structure is documented in the [data structure readme](DATA_STRUCTURE_README.md).

### Usage
~~~
usage: QuidCLI.py passage [-h]
                              source-file-path target-folder-path
                              matches-folder-path output-folder-path

Quid passage allows the user to extract key passages from the found
matches.

positional arguments:
  source-file-path     Path to the source text file
  target-folder-path   Path to the target texts folder path
  matches-folder-path  Path to the folder with the match files
  output-folder-path   Path to the output folder
~~~

## Visualization
The package `visualization` contains code to create the content for a web page to visualize the key passages.
For a white label version of the website, see [QuidEx-wh](https://scm.cms.hu-berlin.de/schluesselstellen/quidex-wh).

### Usage
~~~
usage: QuidCLI.py visualize [-h] [--title TITLE] [--author AUTHOR]
                             [--year YEAR] [--censor]
                             source-file-path target-folder-path
                             passages-folder-path output-folder-path

Quid visualize allows the user to create the files needed for a website that
visualizes the Quid algorithm results.

positional arguments:
  source-file-path      Path to the source text file
  target-folder-path    Path to the target texts folder path
  passages-folder-path
                        Path to the folder with the key passages files, i.e.
                        the resulting files from Quid passage
  output-folder-path    Path to the output folder

optional arguments:
  -h, --help            show this help message and exit
  --title TITLE         Title of the work
  --author AUTHOR       Author of the work
  --year YEAR           Year of the work
~~~
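
The three commands can be chained, with the output folder of one step serving as input to the next. The sketch below only builds the argument lists and does not run anything. The executable name `quid`, the folder names, and the assumption that the files written by `compare` are directly usable as the match files expected by `passage` are not confirmed by this document. `--no-create-dated-subfolder` is passed on the assumption that the results then land directly in the given folder.

```python
import os


def build_pipeline(source_file, targets_folder, work_folder):
    """Return the compare, passage and visualize commands as argument lists."""
    matches_folder = os.path.join(work_folder, "matches")
    passages_folder = os.path.join(work_folder, "passages")
    site_folder = os.path.join(work_folder, "site")
    return [
        # Step 1: find matches and write them to matches_folder
        ["quid", "compare", source_file, targets_folder,
         "--output-folder-path", matches_folder,
         "--no-create-dated-subfolder"],
        # Step 2: extract key passages from the matches
        ["quid", "passage", source_file, targets_folder,
         matches_folder, passages_folder],
        # Step 3: create the files for the visualization website
        ["quid", "visualize", source_file, targets_folder,
         passages_folder, site_folder],
    ]


for step in build_pipeline("source.txt", "targets", "work"):
    print(" ".join(step))
    # To execute a step: subprocess.run(step, check=True)
```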

## History
Quid was formerly known as Lotte and later renamed. Earlier publications use the name Lotte.

## Citation
If you use Quid or base your work on our code, please cite our paper:
~~~
@inproceedings{arnold2021lotte,
  title = {{L}otte and {A}nnette: {A} {F}ramework for {F}inding and {E}xploring {K}ey {P}assages in {L}iterary {W}orks},
  author = {Arnold, Frederik and Jäschke, Robert},
  booktitle = {Proceedings of the Workshop on Natural Language Processing for Digital Humanities at ICON 2021},
  year = {2021}
}
~~~
For a preprint, see [Lotte and Annette: A Framework for Finding and Exploring Key Passages in Literary Works](https://amor.cms.hu-berlin.de/~arnolfre/paper/NLP4DH_2021_arnold_lotte_preprint.pdf).

## Acknowledgements
The algorithm is inspired by _sim_text_ by Dick Grune [^1]
and _Similarity texter: A text-comparison web tool based on the “sim_text” algorithm_ by Sofia Kalaidopoulou (2016) [^2].

[^1]: https://dickgrune.com/Programs/similarity_tester/ (accessed: 12.04.2021)

[^2]: https://people.f4.htw-berlin.de/~weberwu/simtexter/522789_Sofia-Kalaidopoulou_bachelor-thesis.pdf (accessed: 12.04.2021)
