Metadata-Version: 2.1
Name: averell
Version: 1.2.2
Summary: Corpora downloader and reader for Spanish sources
Home-page: https://github.com/linhd-postdata/averell
Author: LINHD POSTDATA Project
Author-email: info@linhd.uned.es
License: Apache-2.0
Project-URL: Documentation, https://averell.readthedocs.io/
Project-URL: Changelog, https://averell.readthedocs.io/en/latest/changelog.html
Project-URL: Issue Tracker, https://github.com/linhd-postdata/averell/issues
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: Unix
Classifier: Operating System :: POSIX
Classifier: Operating System :: Microsoft :: Windows
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Topic :: Utilities
Requires-Python: >3.6.*
License-File: LICENSE
License-File: AUTHORS.rst

=======
Averell
=======



Averell is a Python library and command line interface that facilitates working
with existing repositories of annotated poetry.
Averell can download an annotated corpus and reconcile different
TEI entities to provide a unified JSON output at the desired granularity.
That is, for their investigations some researchers
might need the entire poem, poems split line by line,
or even word by word if that is available. Averell lets you specify the
granularity of the final generated dataset, which is a combined JSON with all
the entities in the selected corpora.
Each corpus in the catalog must specify the parser that produces the expected data format.

* Free software: Apache Software License 2.0


Available corpora (version 1.1.0)
=================================

====  ===================  ======  ======  ======  ========  =============  ===========
  id  name                 lang    size      docs     words  granularity    license
====  ===================  ======  ======  ======  ========  =============  ===========
   1  Disco V2.1           es      22M       4088    381539  stanza         CC-BY
      (disco2_1)                                             line
   2  Disco V3             es      28M       4080    377978  stanza         CC-BY
      (disco3)                                               line
   3  Sonetos Siglo        es      6.8M      5078    466012  stanza         CC-BY-NC
      de Oro                                                 line           4.0
      (adso)
   4  ADSO 100             es      128K       100      9208  stanza         CC-BY-NC
      poems corpus                                           line           4.0
      (adso100)
   5  Poesía Lírica        es      3.8M       475    299402  stanza         CC-BY-NC
      Castellana Siglo                                       line           4.0
      de Oro                                                 word
      (plc)                                                  syllable
   6  Gongocorpus (gongo)  es      9.2M       481     99079  stanza         CC-BY-NC-ND
                                                             line           3.0
                                                             word           FR
                                                             syllable
   7  Eighteenth Century   en      2400M     3084   2063668  stanza         CC
      Poetry Archive                                         line           BY-SA
      (ecpa)                                                 word           4.0
   8  For Better           en      39.5M      103     41749  stanza         Unknown
      For Verse                                              line
      (4b4v)
   9  Métrique en          fr      183M      5081   1850222  stanza         Unknown
      Ligne (mel)                                            line
  10  Biblioteca Italiana  it      242M     25341   7121246  stanza         Unknown
      (bibit)                                                line
                                                             word
  11  Corpus of            cs      4100M    66428  12636867  stanza         CC-BY-SA
      Czech Verse                                            line
      (czverse)                                              word
  12  Stichotheque         pt      11.8M     1702    168411  stanza         Unknown
      (stichopt)                                             line
====  ===================  ======  ======  ======  ========  =============  ===========


Documentation
=============

https://averell.readthedocs.io/

Installation
============

To install averell, run this command in your terminal::

    pip install averell

This is the preferred method to install averell, as it will always install
the most recent stable release.

If you don't have `pip`_ installed, this `Python installation guide`_ can guide
you through the process.

.. _pip: https://pip.pypa.io
.. _Python installation guide: http://docs.python-guide.org/en/latest/starting/installation/


Usage
=====


To show averell help::

    averell --help

To list all available corpora::

    averell list

Example output for one of the available corpora:

.. code-block:: text

      id  name                 lang    size      docs    words  granularity    license
    ----  -------------------  ------  ------  ------  -------  -------------  -----------
       1  Disco V2.1           es      22M       4088   381539  stanza         CC-BY
                                                                line

download
--------

Download the desired corpora into the ``my_corpora`` folder::

    averell download 2 3 --corpora-folder my_corpora

Example of poem in TEI format obtained from one of the corpora:

.. code-block:: XML

    <TEI xmlns="http://www.tei-c.org/ns/1.0">
        <teiHeader>
            <fileDesc>
                <titleStmt>
                    <title> Spanish Metrical Patterns Bank: Golden Age Sonnets.</title>
                    <principal>Borja Navarro Colorado</principal>
                    <respStmt>
                        <name>María Ribes Lafoz</name>
                        <name>Noelia Sánchez López</name>
                        <name>Borja Navarro Colorado</name>
                        <resp>Metrical patterns annotation</resp>
                    </respStmt>
                </titleStmt>
                <publicationStmt>
                    <publisher>Natural Language Processing Group. Department of Software and Computing Systems. University of Alicante (Spain)</publisher>
                </publicationStmt>
                <sourceDesc>
                    <bibl><title>Sonetos</title> de <author>Garcilaso de La Vega</author>. <publisher>Biblioteca Virtual Miguel de Cervantes</publisher>, edición de <editor role="editor">Ramón García González</editor>.</bibl>
                </sourceDesc>
            </fileDesc>
            <encodingDesc>
                <metDecl xml:id="bncolorado" type="met" pattern="((\+|\-)+)*">
                    <metSym value="+">stressed syllable</metSym>
                    <metSym value="-">unstressed syllable</metSym>
                </metDecl>
                <metDecl>
                    <p>All metrical patterns have been manually checked.</p>
                </metDecl>
            </encodingDesc>
        </teiHeader>
        <text>
            <body>
                <head>
                    <title>-XX-</title>
                </head>
                <lg type="cuarteto">
                    <l n="1" met="-++--++--+-">Con tal fuerza y vigor son concertados</l>
                    <l n="2" met="-----+-+-+-">para mi perdición los duros vientos,</l>
                    <l n="3" met="--+--+---+-">que cortaron mis tiernos pensamientos</l>
                    <l n="4" met="+----++--+-">luego que sobre mí fueron mostrados.</l>
                </lg>
                <lg type="terceto">
                    <l n="5" met="-++--+---+-">El mal es que me quedan los cuidados</l>
                    <l n="6" met="---+-----+-">en salvo de estos acontecimientos,</l>
                    <l n="7" met="-++--+---+-">que son duros, y tienen fundamentos</l>
                </lg>
            </body>
        </text>
    </TEI>
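
As a rough illustration of how a TEI file like this can be read, the following
sketch uses only the Python standard library to collect stanzas and lines.
It is not averell's actual parser: the ``read_stanzas`` helper and the shortened
``SAMPLE_TEI`` string are assumptions made for this example.

.. code-block:: python

    import xml.etree.ElementTree as ET

    # Namespace declared on the root element of the TEI document shown above.
    TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

    # Shortened, hypothetical sample with the same shape as the poem above.
    SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
      <text>
        <body>
          <lg type="cuarteto">
            <l n="1" met="-++--++--+-">Con tal fuerza y vigor son concertados</l>
            <l n="2" met="-----+-+-+-">para mi perdición los duros vientos,</l>
          </lg>
          <lg type="terceto">
            <l n="5" met="-++--+---+-">El mal es que me quedan los cuidados</l>
          </lg>
        </body>
      </text>
    </TEI>"""


    def read_stanzas(tei_source):
        """Return one dict per <lg> element, each holding its <l> lines."""
        root = ET.fromstring(tei_source)
        stanzas = []
        for number, lg in enumerate(root.iterfind(".//tei:lg", TEI_NS), start=1):
            lines = [
                {
                    "line_number": line.get("n"),
                    "line_text": (line.text or "").strip(),
                    "metrical_pattern": line.get("met"),
                }
                for line in lg.iterfind("tei:l", TEI_NS)
            ]
            stanzas.append({
                "stanza_number": str(number),
                "stanza_type": lg.get("type"),
                "lines": lines,
                "stanza_text": "\n".join(item["line_text"] for item in lines),
            })
        return stanzas


    stanzas = read_stanzas(SAMPLE_TEI)
    print(len(stanzas), stanzas[0]["stanza_type"])

The namespace mapping matters here: without it, ``iterfind`` would not match
the ``lg`` and ``l`` elements, because they live in the TEI namespace.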

Example JSON file generated from the input XML/TEI poem and written to
``my_corpora/{corpus}/averell/parser/{author_name}/{poem_name}.json``:

.. code-block:: JSON

    {
        "manually_checked": true,
        "poem_title": "-XX-",
        "author": "Garcilaso de La Vega",
        "stanzas": [
            {
                "stanza_number": "1",
                "stanza_type": "cuarteto",
                "lines": [
                    {
                        "line_number": "1",
                        "line_text": "Con tal fuerza y vigor son concertados",
                        "metrical_pattern": "-++--++--+-"
                    },
                    {
                        "line_number": "2",
                        "line_text": "para mi perdición los duros vientos,",
                        "metrical_pattern": "-----+-+-+-"
                    },
                    {
                        "line_number": "3",
                        "line_text": "que cortaron mis tiernos pensamientos",
                        "metrical_pattern": "--+--+---+-"
                    },
                    {
                        "line_number": "4",
                        "line_text": "luego que sobre mí fueron mostrados.",
                        "metrical_pattern": "+----++--+-"
                    }
                ],
                "stanza_text": "Con tal fuerza y vigor son concertados\npara mi perdición los duros vientos,\nque cortaron mis tiernos pensamientos\nluego que sobre mí fueron mostrados."
            },
            {
                "stanza_number": "2",
                "stanza_type": "terceto",
                "lines": [
                    {
                        "line_number": "5",
                        "line_text": "El mal es que me quedan los cuidados",
                        "metrical_pattern": "-++--+---+-"
                    },
                    {
                        "line_number": "6",
                        "line_text": "en salvo de estos acontecimientos,",
                        "metrical_pattern": "---+-----+-"
                    },
                    {
                        "line_number": "7",
                        "line_text": "que son duros, y tienen fundamentos",
                        "metrical_pattern": "-++--+---+-"
                    }
                ],
                "stanza_text": "El mal es que me quedan los cuidados\nen salvo de estos acontecimientos,\nque son duros, y tienen fundamentos"
            }
        ]
    }
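
The ``metrical_pattern`` strings use ``+`` for stressed and ``-`` for unstressed
syllables, as declared in the TEI header. A minimal sketch of reading that
field, assuming a poem dictionary with the structure shown above (the
``stressed_positions`` helper and the trimmed ``poem`` literal are illustrative
and not part of averell):

.. code-block:: python

    import json

    # Trimmed, hypothetical copy of the generated JSON shown above.
    poem = json.loads("""
    {
        "poem_title": "-XX-",
        "stanzas": [
            {
                "stanza_number": "1",
                "lines": [
                    {"line_number": "1", "metrical_pattern": "-++--++--+-"},
                    {"line_number": "2", "metrical_pattern": "-----+-+-+-"}
                ]
            }
        ]
    }
    """)


    def stressed_positions(pattern):
        """Return the 1-based syllable positions marked as stressed ('+')."""
        return [i for i, mark in enumerate(pattern, start=1) if mark == "+"]


    for stanza in poem["stanzas"]:
        for line in stanza["lines"]:
            print(line["line_number"], stressed_positions(line["metrical_pattern"]))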

export
------

Now we can combine these corpora into a single file by selecting a granularity::

    averell export 2 3 --granularity line --corpora-folder my_corpora --filename export_1

It produces a single JSON file with information about all the lines in
those corpora. Example of **two** random lines in the file ``my_corpora/export_1.json``:

.. code-block:: JSON

    {
        "line_number": "5",
        "line_text": "¿Has visto que en el mismo lugar donde",
        "metrical_pattern": "++---+--++-",
        "stanza_number": "2",
        "manually_checked": false,
        "poem_title": " - II - ",
        "author": "Mira de Amescua",
        "stanza_text": "¿Has visto que en el mismo lugar donde\nbordado estuvo el cristalino velo\nun bordado terliz de escarcha y hielo\nhace que el campo de verdor se monde?",
        "stanza_type": "cuarteto"
    }
    {
        "line_number": "10",
        "line_text": "el que a lo cierto no a lo incierto mira,",
        "metrical_pattern": "---+-+-+-+-",
        "stanza_number": "3",
        "manually_checked": false,
        "poem_title": "- VIII - Considerando un sepulcro y los que están en él ",
        "author": "Lope de Zarate",
        "stanza_text": "De aquí si que consigue el ser dichoso\nel que a lo cierto no a lo incierto mira,\npues le adorna lo eterno fastuoso;",
        "stanza_type": "terceto"
    }

By default, ``export`` will download corpora if needed. To avoid this behaviour, the flag ``--no-download`` can be passed in.
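
To make the idea of line granularity concrete, the sketch below flattens one
poem dictionary into one record per line, copying the stanza-level and
poem-level fields onto each record in the way the exported lines above suggest.
It is an illustration only: ``poem_to_lines`` is a hypothetical helper, not
averell's export code.

.. code-block:: python

    def poem_to_lines(poem):
        """Yield one flat dict per line, enriched with stanza and poem fields."""
        poem_fields = {k: v for k, v in poem.items() if k != "stanzas"}
        for stanza in poem["stanzas"]:
            stanza_fields = {k: v for k, v in stanza.items() if k != "lines"}
            for line in stanza["lines"]:
                yield {**line, **stanza_fields, **poem_fields}


    # Trimmed copy of the generated poem JSON shown in the download section.
    poem = {
        "manually_checked": True,
        "poem_title": "-XX-",
        "author": "Garcilaso de La Vega",
        "stanzas": [
            {
                "stanza_number": "1",
                "stanza_type": "cuarteto",
                "lines": [
                    {"line_number": "1",
                     "line_text": "Con tal fuerza y vigor son concertados",
                     "metrical_pattern": "-++--++--+-"},
                    {"line_number": "2",
                     "line_text": "para mi perdición los duros vientos,",
                     "metrical_pattern": "-----+-+-+-"},
                ],
                "stanza_text": "Con tal fuerza y vigor son concertados\n"
                               "para mi perdición los duros vientos,",
            }
        ],
    }

    records = list(poem_to_lines(poem))
    print(len(records), sorted(records[0]))

Each record then carries everything needed to interpret the line on its own,
at the cost of repeating ``stanza_text`` and the poem metadata on every line.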

Exported corpora can be easily loaded into Pandas:

.. code-block:: bash

    averell export adso100 --filename adso100.json

.. code-block:: python

    import pandas as pd

    # read_json accepts a file path directly, so no open file handle is left behind
    adso100 = pd.read_json("adso100.json")


A note on IDs
-------------

IDs can be numeric identifiers from the ``averell list`` output, corpus shortcodes (shown in parentheses), the special literal ``all`` to refer to all corpora, or two-letter ISO language codes to refer to the available corpora in a specific language.

For example, the command ``averell export 1 bibit fr`` will export DISCO V2.1, the Biblioteca Italiana poetry corpus, and all corpora tagged with the French language tag into a single file, splitting poems line by line.
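
The way such mixed identifiers could be expanded is sketched below. This is
only an assumption about the general approach, not averell's implementation:
``resolve_ids`` and the four-entry ``CATALOG`` (taken from the table of
available corpora) are hypothetical names introduced for this example.

.. code-block:: python

    # Hypothetical mini catalog: (numeric id, shortcode, language).
    CATALOG = [
        (1, "disco2_1", "es"),
        (2, "disco3", "es"),
        (9, "mel", "fr"),
        (10, "bibit", "it"),
    ]


    def resolve_ids(ids, catalog=CATALOG):
        """Expand user-supplied IDs into a sorted list of numeric corpus ids."""
        selected = set()
        for raw in ids:
            token = str(raw).lower()
            if token == "all":
                selected.update(number for number, _, _ in catalog)
                continue
            # A token may match a numeric id, a shortcode, or a language code.
            matches = [
                number for number, slug, lang in catalog
                if token in (str(number), slug, lang)
            ]
            if not matches:
                raise ValueError(f"Unknown corpus identifier: {raw!r}")
            selected.update(matches)
        return sorted(selected)


    print(resolve_ids(["1", "bibit", "fr"]))

Collecting the matches in a set means a corpus named twice, for instance by
both its shortcode and its language, is only selected once.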



Changelog
=========


1.2.1 (2021-07-14)
------------------

* Added two new readers:

  * `Stichotheque Portuguese corpus <https://gitlab.com/stichotheque/stichotheque-pt>`_

  * `Corpus of Czech Verse <https://github.com/versotym/corpusCzechVerse/>`_

* ``export_filename`` is also returned as an output of ``export_corpora``
* Fix writing function so as not to duplicate information
* Change ``name`` key to ``corpus`` for clarity
* Fix path split on Windows systems
* Add corpus name to averell output files

1.1.0 (2020-09-18)
------------------

* Added **Biblioteca Italiana (bibit)** reader
* Added Archivio Metrico Italiano info to Biblioteca Italiana reader
* Reduced fixtures file size
* Adding a tmp file to git ignore
* Adding languages and some other cosmetic changes
* Fixing an error with the expected output of the ``averell list`` command
* Adding slugs, langs, and 'all' to ``download`` and ``export``
* Fixing coverage
* Adding documentation and fixing a test

1.0.3 (2020-09-03)
------------------

* Added ``export --filename`` option
* Added two new readers:

  * **For better for verse**

  * **Métrique en ligne**

1.0.2 (2020-06-23)
------------------

* Added two new readers:

  * **ECPA corpus**

  * **Gongocorpus**

* Minor bug fixes

1.0.1 (2020-05-18)
------------------

* Setting up bumpversion
* Integration with Zenodo

1.0.0 (2020-04-29)
------------------

* Remove commits-since code block
* Adding automated deployments to PyPI on tag releases
* Added menu
* Remove comments and cleaner code fixes
* Fix sorted output of tests
* Added proper documentation and coverage tests
* Added tests for ``export`` function
* Added ``export`` function
* Added ``TEI_NAMESPACE`` as a constant
* Fixed docs. Fixed loads with ``Path``. Fixed logging errors
* Added tests

0.0.1 (2020-01-08)
------------------

* First release on PyPI.


