Metadata-Version: 2.1
Name: pycldf
Version: 1.24.0
Summary: A python library to read and write CLDF datasets
Home-page: https://github.com/cldf/pycldf
Author: Robert Forkel
Author-email: robert_forkel@eva.mpg.de
License: Apache 2.0
Description: # pycldf
        
        A python package to read and write [CLDF](http://cldf.clld.org) datasets.
        
        [![Build Status](https://github.com/cldf/pycldf/workflows/tests/badge.svg)](https://github.com/cldf/pycldf/actions?query=workflow%3Atests)
        [![codecov](https://codecov.io/gh/cldf/pycldf/branch/master/graph/badge.svg)](https://codecov.io/gh/cldf/pycldf)
        [![Requirements Status](https://requires.io/github/cldf/pycldf/requirements.svg?branch=master)](https://requires.io/github/cldf/pycldf/requirements/?branch=master)
        [![Documentation Status](https://readthedocs.org/projects/pycldf/badge/?version=latest)](https://pycldf.readthedocs.io/en/latest/?badge=latest)
        [![PyPI](https://img.shields.io/pypi/v/pycldf.svg)](https://pypi.org/project/pycldf)
        
        
        ## Install
        
        Install `pycldf` from [PyPI](https://pypi.org/project/pycldf):
        ```shell
        pip install pycldf
        ```
        
        
        ## Command line usage
        
        Installing the `pycldf` package will also install a command line interface `cldf`, which provides some sub-commands to manage CLDF datasets.
        
        
        ### Summary statistics
        
        ```shell
        $ cldf stats mydataset/Wordlist-metadata.json 
        <cldf:v1.0:Wordlist at mydataset>
        
        Path                   Type          Rows
        ---------------------  ----------  ------
        forms.csv              Form Table       1
        mydataset/sources.bib  Sources          1
        ```
        
        
        ### Validation
        
        Arguably the most important functionality of `pycldf` is validating CLDF datasets.
        
        By default, data files are read in strict-mode, i.e. invalid rows will result in an exception
        being raised. To validate a data file, it can be read in validating-mode.
        
        For example, the following output is generated
        
        ```sh
        $ cldf validate mydataset/forms.csv
        WARNING forms.csv: duplicate primary key: (u'1',)
        WARNING forms.csv:4:Source missing source key: Mei2005
        ```
        
        when reading the file
        
        ```
        ID,Language_ID,Parameter_ID,Value,Segments,Comment,Source
        1,abcd1234,1277,word,,,Meier2005[3-7]
        1,stan1295,1277,hand,,,Meier2005[3-7]
        2,stan1295,1277,hand,,,Mei2005[3-7]
        ```
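        
        The two warnings above correspond to simple checks, which can be sketched with the standard library alone.
        This is an illustration of the idea, not `pycldf`'s implementation: the helper `check_forms` is hypothetical,
        and the set of known source keys is passed in by hand instead of being read from `sources.bib`.
        
        ```python
        import csv
        import re
        
        # The example file from above, one string per line.
        FORM_LINES = [
            'ID,Language_ID,Parameter_ID,Value,Segments,Comment,Source',
            '1,abcd1234,1277,word,,,Meier2005[3-7]',
            '1,stan1295,1277,hand,,,Meier2005[3-7]',
            '2,stan1295,1277,hand,,,Mei2005[3-7]',
        ]
        
        
        def check_forms(lines, known_sources):
            """Return warnings for duplicate IDs and unknown source keys (hypothetical helper)."""
            warnings, seen = [], set()
            # The header is line 1, so the first data row is line 2.
            for lineno, row in enumerate(csv.DictReader(lines), start=2):
                if row['ID'] in seen:
                    warnings.append(f"{lineno}: duplicate primary key: {row['ID']}")
                seen.add(row['ID'])
                # Assumption: the column holds a single reference of the form "key[pages]".
                key = re.sub(r'\[.*\]$', '', row['Source'])
                if key and key not in known_sources:
                    warnings.append(f"{lineno}: missing source key: {key}")
            return warnings
        
        
        for warning in check_forms(FORM_LINES, known_sources={'Meier2005'}):
            print(warning)
        ```
        
        For the file shown above, this reports the duplicate ID `1` on line 3 and the unknown key `Mei2005` on line 4.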
        
        
        ### Extracting human readable metadata
        
        The information in a CLDF metadata file can be converted to [markdown](https://en.wikipedia.org/wiki/Markdown)
        (a human-readable markup language) by running
        ```shell
        cldf markdown PATH/TO/metadata.json
        ```
        A typical use of this feature is to create a `README.md` for your dataset
        (which, when uploaded to e.g. GitHub, will be rendered nicely in the browser).
        
        
        ### Converting a CLDF dataset to an SQLite database
        
        A very useful feature of CSVW in general and CLDF in particular is that it
        provides enough metadata for a set of CSV files to load them into a relational
        database - including relations between tables. This can be done by running the
        `cldf createdb` command:
        
        ```shell
        $ cldf createdb -h
        usage: cldf createdb [-h] [--infer-primary-keys] DATASET SQLITE_DB_PATH
        
        Load a CLDF dataset into a SQLite DB
        
        positional arguments:
          DATASET               Dataset specification (i.e. path to a CLDF metadata
                                file or to the data file)
          SQLITE_DB_PATH        Path to the SQLite db file
        ```
        
        For a specification of the resulting database schema refer to the documentation in
        [`src/pycldf/db.py`](src/pycldf/db.py).
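        
        Once the database file exists, it can be inspected with Python's built-in `sqlite3` module. The sketch below
        deliberately makes no assumption about the schema and only lists each table with its row count. The helper
        name `table_row_counts` and the file name `mydataset.sqlite` are hypothetical.
        
        ```python
        import sqlite3
        
        
        def table_row_counts(conn):
            """Map each table name in an SQLite database to its number of rows."""
            names = [row[0] for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
            # The names are read from the database itself, so quoting them is sufficient here.
            return {
                name: conn.execute(f'SELECT count(*) FROM "{name}"').fetchone()[0]
                for name in names}
        
        
        # Stand-in database so that the sketch runs on its own. For a real dataset, connect to
        # the file passed as SQLITE_DB_PATH instead, e.g. sqlite3.connect('mydataset.sqlite').
        demo = sqlite3.connect(':memory:')
        demo.execute('CREATE TABLE example (id TEXT PRIMARY KEY)')
        demo.execute("INSERT INTO example VALUES ('1')")
        print(table_row_counts(demo))
        ```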
        
        
        ## Python API
        
        For detailed documentation of the Python API, refer to the
        [docs on ReadTheDocs](https://pycldf.readthedocs.io/en/latest/index.html).
        
        
        ### Reading CLDF
        
        As an example, we'll read data from [WALS Online, v2020](https://github.com/cldf-datasets/wals/tree/v2020):
        
        ```python
        >>> from pycldf import Dataset
        >>> wals2020 = Dataset.from_metadata('https://raw.githubusercontent.com/cldf-datasets/wals/v2020/cldf/StructureDataset-metadata.json')
        ```
        
        For exploratory purposes, accessing a remote dataset over HTTP is fine. But for real analysis, you'd want to download
        the datasets first and then access them locally, passing a local file path to `Dataset.from_metadata`.
        
        Let's look at what we got:
        ```python
        >>> print(wals2020)
        <cldf:v1.0:StructureDataset at https://raw.githubusercontent.com/cldf-datasets/wals/v2020/cldf/StructureDataset-metadata.json>
        >>> for c in wals2020.components:
        ...     print(c)
        ...
        ValueTable
        ParameterTable
        CodeTable
        LanguageTable
        ExampleTable
        ```
        As expected, we got a [StructureDataset](https://github.com/cldf/cldf/tree/master/modules/StructureDataset), and in
        addition to the required `ValueTable`, we also have a couple more [components](https://github.com/cldf/cldf#cldf-components).
        
        We can investigate the values using [`pycldf`'s ORM](src/pycldf/orm.py) functionality, i.e. mapping rows in the CLDF
        data files to convenient Python objects. (Take note of the limitations described in [orm.py](src/pycldf/orm.py), though.)
        
        ```python
        >>> for value in wals2020.objects('ValueTable'):
        ...     break
        ...
        >>> value
        <pycldf.orm.Value id="81A-aab">
        >>> value.language
        <pycldf.orm.Language id="aab">
        >>> value.language.cldf
        Namespace(glottocode=None, id='aab', iso639P3code=None, latitude=Decimal('-3.45'), longitude=Decimal('142.95'), macroarea=None, name='Arapesh (Abu)')
        >>> value.parameter
        <pycldf.orm.Parameter id="81A">
        >>> value.parameter.cldf
        Namespace(description=None, id='81A', name='Order of Subject, Object and Verb')
        >>> value.references
        (<Reference Nekitel-1985[94]>,)
        >>> value.references[0]
        <Reference Nekitel-1985[94]>
        >>> print(value.references[0].source.bibtex())
        @misc{Nekitel-1985,
            olac_field = {syntax; general_linguistics; typology},
            school     = {Australian National University},
            title      = {Sociolinguistic Aspects of Abu', a Papuan Language of the Sepik Area, Papua New Guinea},
            wals_code  = {aab},
            year       = {1985},
            author     = {Nekitel, Otto I. M. S.}
        }
        ```
        
        If performance is important, you can just read rows of data as python `dict`s, in which case the references between
        tables must be resolved "by hand":
        
        ```python
        >>> params = {r['id']: r for r in wals2020.iter_rows('ParameterTable', 'id', 'name')}
        >>> for v in wals2020.iter_rows('ValueTable', 'parameterReference'):
        ...     print(params[v['parameterReference']]['name'])
        ...     break
        ...
        Order of Subject, Object and Verb
        ```
        
        Note that we passed the names of CLDF terms (e.g. `id`) to `Dataset.iter_rows`. This specifies which columns we want
        to access by CLDF term, rather than by the column names they are mapped to in the dataset.
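        
        The same lookup pattern extends to several foreign keys at once. The following sketch uses plain dictionaries
        as stand-ins for rows keyed by CLDF term names, so it runs without a dataset. The helper name `resolve` is
        hypothetical, and the sample rows only reuse identifiers and names that appear in the output above.
        
        ```python
        # Stand-ins for rows keyed by CLDF term names.
        languages = [{'id': 'aab', 'name': 'Arapesh (Abu)'}]
        parameters = [{'id': '81A', 'name': 'Order of Subject, Object and Verb'}]
        values = [{'id': '81A-aab', 'languageReference': 'aab', 'parameterReference': '81A'}]
        
        
        def resolve(values, languages, parameters):
            """Yield (value id, language name, parameter name), resolving both foreign keys."""
            # Build each lookup table once, so that resolving a reference is a dict access.
            lang_by_id = {row['id']: row for row in languages}
            param_by_id = {row['id']: row for row in parameters}
            for value in values:
                yield (
                    value['id'],
                    lang_by_id[value['languageReference']]['name'],
                    param_by_id[value['parameterReference']]['name'],
                )
        
        
        for triple in resolve(values, languages, parameters):
            print(triple)
        ```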
        
        
        ## Writing CLDF
        
        **Warning:** Writing CLDF with `pycldf` does not automatically result in valid CLDF!
        It does result in data that can be checked via `cldf validate` (see [above](#validation)),
        though, so you should always validate after writing.
        
        ```python
        from pycldf import Wordlist, Source
        
        dataset = Wordlist.in_dir('mydataset')
        dataset.add_sources(Source('book', 'Meier2005', author='Hans Meier', year='2005', title='The Book'))
        dataset.write(FormTable=[
            {
                'ID': '1', 
                'Form': 'word', 
                'Language_ID': 'abcd1234', 
                'Parameter_ID': '1277', 
                'Source': ['Meier2005[3-7]'],
            }])
        ```
        
        results in
        ```
        $ ls -1 mydataset/
        forms.csv
        sources.bib
        Wordlist-metadata.json
        ```
        
        - `mydataset/forms.csv`
        ```
        ID,Language_ID,Parameter_ID,Value,Segments,Comment,Source
        1,abcd1234,1277,word,,,Meier2005[3-7]
        ```
        - `mydataset/sources.bib`
        ```bibtex
        @book{Meier2005,
            author = {Meier, Hans},
            year = {2005},
            title = {The Book}
        }
        
        ```
        - `mydataset/Wordlist-metadata.json`
        
        
        ### Advanced writing
        
        To add predefined CLDF components to a dataset, use the `add_component` method:
        ```python
        from pycldf import StructureDataset, term_uri
        
        dataset = StructureDataset.in_dir('mydataset')
        dataset.add_component('ParameterTable')
        dataset.write(
            ValueTable=[{'ID': '1', 'Language_ID': 'abc', 'Parameter_ID': '1', 'Value': 'x'}],
            ParameterTable=[{'ID': '1', 'Name': 'Grammatical Feature'}])
        ```
        
        It is also possible to add generic tables:
        ```python
        dataset.add_table('contributors.csv', term_uri('id'), term_uri('name'))
        ```
        which can also be linked to other tables:
        ```python
        dataset.add_columns('ParameterTable', 'Contributor_ID')
        dataset.add_foreign_key('ParameterTable', 'Contributor_ID', 'contributors.csv', 'ID')
        ```
        
        ### Addressing tables and columns
        
        Tables in a dataset can be referenced using a `Dataset`'s `__getitem__` method,
        passing one of
        - a full CLDF Ontology URI for the corresponding component,
        - the local name of the component in the CLDF Ontology, or
        - the `url` of the table.
        
        Columns in a dataset can be referenced using a `Dataset`'s `__getitem__` method,
        passing a tuple `(<TABLE>, <COLUMN>)` where `<TABLE>` specifies a table as explained
        above and `<COLUMN>` is
        - a full CLDF Ontology URI used as `propertyUrl` of the column,
        - the `name` property of the column.
        
        See also https://pycldf.readthedocs.io/en/latest/dataset.html#accessing-schema-objects-components-tables-columns-etc
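        
        The lookup rules listed above can be modelled in a few lines of plain Python. This is a toy sketch of the
        rules only, not `pycldf`'s code: the `SCHEMA` structure and the helpers `find_table` and `find_column` are
        hypothetical, and the schema holds a single `ValueTable` with two columns.
        
        ```python
        TERMS = 'http://cldf.clld.org/v1.0/terms.rdf#'
        
        # A toy schema: each table has a component URI, a url and a list of columns.
        SCHEMA = [
            {
                'component': TERMS + 'ValueTable',
                'url': 'values.csv',
                'columns': [
                    {'name': 'ID', 'propertyUrl': TERMS + 'id'},
                    {'name': 'Value', 'propertyUrl': TERMS + 'value'},
                ],
            },
        ]
        
        
        def find_table(schema, key):
            """Find a table by full component URI, local component name or url."""
            for table in schema:
                local_name = table['component'].split('#')[-1]
                if key in (table['component'], local_name, table['url']):
                    return table
            raise KeyError(key)
        
        
        def find_column(schema, table_key, column_key):
            """Find a column by the full URI used as `propertyUrl` or by its `name`."""
            for column in find_table(schema, table_key)['columns']:
                if column_key in (column['propertyUrl'], column['name']):
                    return column
            raise KeyError(column_key)
        
        
        # All three ways of addressing the table lead to the same object.
        print(find_table(SCHEMA, 'ValueTable') is find_table(SCHEMA, 'values.csv'))
        print(find_column(SCHEMA, TERMS + 'ValueTable', TERMS + 'id')['name'])
        ```
        
        With an actual `Dataset`, the corresponding lookups are written as `dataset[<TABLE>]` and
        `dataset[<TABLE>, <COLUMN>]`.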
        
        
        ## Object oriented access to CLDF data
        
        The [`pycldf.orm`](src/pycldf/orm.py) module implements functionality
        to access CLDF data via an [ORM](https://en.wikipedia.org/wiki/Object%E2%80%93relational_mapping).
        See https://pycldf.readthedocs.io/en/latest/orm.html for
        details.
        
        
        ## Accessing CLDF data via SQL
        
        The [`pycldf.db`](src/pycldf/db.py) module implements functionality
        to load CLDF data into a [SQLite](https://sqlite.org) database. See https://pycldf.readthedocs.io/en/latest/db.html
        for details.
        
        
        ## See also
        - https://github.com/frictionlessdata/datapackage-py
        
Platform: any
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Provides-Extra: catalogs
Provides-Extra: dev
Provides-Extra: docs
Provides-Extra: test
