Metadata-Version: 2.1
Name: corpy
Version: 0.2.4
Summary: Tools for processing language data.
Home-page: https://github.com/dlukes/corpy
License: GPL-3.0+
Keywords: corpus,linguistics,NLP
Author: David Lukes
Author-email: dafydd.lukes@gmail.com
Requires-Python: >=3.6,<4.0
Classifier: License :: OSI Approved :: GNU General Public License v3 or later (GPLv3+)
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Requires-Dist: click (>=7.0,<8.0)
Requires-Dist: lazy (>=1.4,<2.0)
Requires-Dist: lxml (>=4.6.1,<5.0.0)
Requires-Dist: matplotlib (>=3.1,<4.0)
Requires-Dist: numpy (>=1.16,<2.0)
Requires-Dist: regex
Requires-Dist: ufal.morphodita (>=1.10,<2.0)
Requires-Dist: ufal.udpipe (>=1.2,<2.0)
Requires-Dist: wordcloud (>=1.8.1,<2.0.0)
Project-URL: Repository, https://github.com/dlukes/corpy
Description-Content-Type: text/x-rst

=====
CorPy
=====

.. image:: https://readthedocs.org/projects/corpy/badge/?version=stable
   :target: https://corpy.readthedocs.io/en/stable/?badge=stable
   :alt: Documentation status

.. image:: https://badge.fury.io/py/corpy.svg
   :target: https://badge.fury.io/py/corpy
   :alt: PyPI package

.. image:: https://img.shields.io/badge/code%20style-black-000000.svg
   :target: https://github.com/python/black
   :alt: Code style

Installation
============

.. code:: bash

   $ python3 -m pip install corpy

Only recent versions of Python 3 (3.6+) are supported by design.

Help and feedback
=================

The project is developed on GitHub_. You can ask for help via `GitHub
discussions`_, and report bugs or give other kinds of feedback via `GitHub
issues`_. Support is provided gladly, time and other engagements permitting, but
cannot be guaranteed.

.. _GitHub: https://github.com/dlukes/corpy
.. _GitHub discussions: https://github.com/dlukes/corpy/discussions
.. _GitHub issues: https://github.com/dlukes/corpy/issues

What is CorPy?
==============

A fancy plural for *corpus* ;) Also, a collection of handy but not especially
mutually integrated tools for dealing with linguistic data. It abstracts away
functionality which is often needed in practice for teaching and/or day-to-day
work at the `Czech National Corpus <https://korpus.cz>`__, without aspiring to
be a fully featured or consistent NLP framework.

The short URL to the docs is: https://corpy.rtfd.io/

Here's an idea of what you can do with CorPy:

- add linguistic annotation to raw textual data using either `UDPipe
  <https://corpy.rtfd.io/en/stable/guides/udpipe.html>`__ or `MorphoDiTa
  <https://corpy.rtfd.io/en/stable/guides/morphodita.html>`__

.. note::

   **Should I pick UDPipe or MorphoDiTa?**

   UDPipe_ is the successor to MorphoDiTa_, extending and improving upon the
   original codebase. It has more features at the cost of being somewhat more
   complex: it does both `morphological tagging (including lemmatization) and
   syntactic parsing <https://corpy.rtfd.io/en/stable/guides/udpipe.html>`__,
   and it handles a number of different input and output formats. You can also
   download `pre-trained models <http://ufal.mff.cuni.cz/udpipe/models>`__ for
   many different languages.

   By contrast, MorphoDiTa_ only has `pre-trained models for Czech and English
   <http://ufal.mff.cuni.cz/morphodita/users-manual>`__, and only performs
   `morphological tagging (including lemmatization)
   <https://corpy.rtfd.io/en/stable/guides/morphodita.html>`__. However, its
   output is more straightforward -- it just splits your text into tokens and
   annotates them, whereas UDPipe can (depending on the model) introduce
   additional tokens necessary for a more explicit analysis, add multi-word
   tokens etc. This is because UDPipe is tailored to the type of linguistic
   analysis conducted within the UniversalDependencies_ project, using the
   CoNLL-U_ data format.

   MorphoDiTa can also help you if you just want to tokenize text and don't have
   a language model available.

.. _UDPipe: http://ufal.mff.cuni.cz/udpipe
.. _MorphoDiTa: http://ufal.mff.cuni.cz/morphodita
.. _UniversalDependencies: https://universaldependencies.org
.. _CoNLL-U: https://universaldependencies.org/format.html
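
To make the point about additional tokens more concrete, here is a minimal
sketch of how multi-word tokens show up in CoNLL-U_. It uses only the standard
library, not CorPy or UDPipe, and the sample is a hand-written fragment in the
spirit of the examples in the format documentation, not actual UDPipe output.
The helper names ``row`` and ``split_tokens`` are made up for illustration.

.. code:: python

   def row(id_, form, lemma="_"):
       # ID, FORM, LEMMA, plus the seven remaining columns left unspecified.
       return "\t".join([id_, form, lemma] + ["_"] * 7)

   # Hand-written sample: "vámonos al mar" with two multi-word tokens.
   SAMPLE = "\n".join([
       "# text = vámonos al mar",
       row("1-2", "vámonos"),
       row("1", "vamos", "ir"),
       row("2", "nos", "nosotros"),
       row("3-4", "al"),
       row("3", "a", "a"),
       row("4", "el", "el"),
       row("5", "mar", "mar"),
   ])

   def split_tokens(conllu):
       """Return (surface tokens, syntactic words) for a CoNLL-U string."""
       surface, words = [], []
       covered_until = 0  # last word ID covered by a multi-word token
       for line in conllu.splitlines():
           if not line:
               covered_until = 0  # a blank line ends the sentence
               continue
           if line.startswith("#"):
               continue  # sentence-level comment
           id_, form = line.split("\t")[:2]
           if "-" in id_:
               # A range ID such as 1-2 marks a multi-word token.
               surface.append(form)
               covered_until = int(id_.split("-")[1])
           elif "." in id_:
               continue  # decimal IDs mark empty nodes, skipped here
           else:
               words.append(form)
               if int(id_) > covered_until:
                   surface.append(form)
       return surface, words

   surface, words = split_tokens(SAMPLE)
   print(surface)  # ['vámonos', 'al', 'mar']
   print(words)    # ['vamos', 'nos', 'a', 'el', 'mar']

Three tokens in the running text correspond to five annotated words, which is
the kind of mismatch you won't encounter in MorphoDiTa output.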

- `easily generate word clouds
  <https://corpy.rtfd.io/en/stable/guides/vis.html>`__
- `generate phonetic transcripts of Czech texts
  <https://corpy.rtfd.io/en/stable/guides/phonetics_cs.html>`__
- `wrangle corpora in the vertical format
  <https://corpy.rtfd.io/en/stable/guides/vertical.html>`__ devised originally
  for `CWB <http://cwb.sourceforge.net/>`__, used also by `(No)SketchEngine
  <https://nlp.fi.muni.cz/trac/noske/>`__
- plus some `command line utilities
  <https://corpy.rtfd.io/en/stable/guides/cli.html>`__
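
In case you've never seen the vertical format mentioned above: it puts one
token per line, with tab-separated attributes, and XML-like structure tags on
lines of their own. The following is a rough standard-library sketch of reading
it, not CorPy's own API (see the guide linked above for that). The sample data,
the attribute order (word, lemma, tag) and the ``read_vertical`` helper are all
assumptions made for illustration.

.. code:: python

   # Hypothetical sample: one document containing one sentence.
   SAMPLE = """\
   <doc id="example">
   <s>
   The\tthe\tDET
   cat\tcat\tNOUN
   sleeps\tsleep\tVERB
   .\t.\tPUNCT
   </s>
   </doc>
   """

   def read_vertical(text):
       """Group token lines into sentences delimited by <s> ... </s>."""
       sentences, current = [], None
       for line in text.splitlines():
           if line == "<s>":
               current = []
           elif line == "</s>":
               sentences.append(current)
               current = None
           elif line.startswith("<") and line.endswith(">"):
               # Simplification: treat any other tag-like line as structure.
               continue
           elif line and current is not None:
               current.append(tuple(line.split("\t")))
       return sentences

   sentences = read_vertical(SAMPLE)
   for word, lemma, tag in sentences[0]:
       print(word, lemma, tag)

Real corpora typically have attributes on the sentence tags too and more
levels of structure, so treat this as a starting point only.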

.. development-marker

Development
===========

Dependencies and building the docs
----------------------------------

``corpy`` needs to be installed in the ReadTheDocs virtualenv for ``autodoc`` to
work. That's configured in ``.readthedocs.yml``. However, ``pip`` doesn't
install ``[tool.poetry.dev-dependencies]``, which contain the Sphinx version and
theme we're using. Maybe there's a way of forcing that, but we probably don't
want to anyway -- it's a waste of time to install linters, testing frameworks
etc. that won't be used. So instead, we have a ``docs/requirements.txt`` file
managed by ``check.sh`` which only contains Sphinx + the theme, and which we
specify via ``.readthedocs.yml``.

.. license-marker

License
=======

Copyright © 2016--present `ÚČNK <http://korpus.cz>`__/David Lukeš

Distributed under the `GNU General Public License v3
<http://www.gnu.org/licenses/gpl-3.0.en.html>`__.

