Metadata-Version: 1.2
Name: py3langid
Version: 0.2.0
Summary: Fork of the language identification tool langid.py, featuring a modernized codebase and faster execution times.
Home-page: https://github.com/adbar/py3langid
Author: Adrien Barbaresi
Author-email: barbaresi@bbaw.de
License: BSD
Description: =============
        ``py3langid``
        =============
        
        
        ``py3langid`` is a fork of the standalone language identification tool ``langid.py`` by Marco Lui.
        
        Original license: BSD-2-Clause. Fork license: BSD-3-Clause.
        
        
        
        Changes in this fork
        --------------------
        
        Execution speed has been improved and the code base has been optimized for Python 3.6+:
        
        - Loading the module with ``import`` is now about 10x faster
        - Language detection with ``langid.classify`` is now about 5x faster
        
        
        Usage
        -----
        
        Drop-in replacement
        ~~~~~~~~~~~~~~~~~~~
        
        
        1. Install the package:
        
           * ``pip3 install py3langid`` (or ``pip`` where applicable)
        
        2. Use it:
        
           * with Python: ``import py3langid as langid``
           * on the command-line: ``langid``
        
        
        With Python
        ~~~~~~~~~~~
        
        Basics:
        
        .. code-block:: python
        
            >>> import py3langid as langid
            
            >>> text = 'This text is in English.'
            # identified language and probability
            >>> langid.classify(text)
            ('en', -56.77428913116455)
            # unpack the result tuple in variables
            >>> lang, prob = langid.classify(text)
            # all potential languages
            >>> langid.rank(text)
        
        
        More options:
        
        .. code-block:: python
        
            >>> from py3langid.langid import LanguageIdentifier, MODEL_FILE
        
            # subset of target languages
            >>> identifier = LanguageIdentifier.from_pickled_model(MODEL_FILE)
            >>> identifier.set_languages(['de', 'en', 'fr'])
            # this won't work well...
            >>> identifier.classify('这样不好')
            ('en', -81.83166265487671)
        
            # normalization of probabilities to an interval between 0 and 1
            >>> identifier = LanguageIdentifier.from_pickled_model(MODEL_FILE, norm_probs=True)
            >>> identifier.classify('This should be enough text.')
            ('en', 1.0)
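        
        With normalized probabilities, a caller may want to discard low-confidence guesses. The following is a minimal sketch and not part of ``py3langid``: the helper name ``classify_or_none`` and the ``0.9`` threshold are illustrative assumptions, and ``identifier`` can be any object whose ``classify`` method returns a ``(language, probability)`` tuple, such as a ``LanguageIdentifier`` created with ``norm_probs=True``.
        
        .. code-block:: python
        
            def classify_or_none(identifier, text, min_prob=0.9):
                """Return the predicted language, or None if the score is below min_prob.
        
                Only meaningful with normalized probabilities (values between 0 and 1).
                """
                lang, prob = identifier.classify(text)
                return lang if prob >= min_prob else None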
        
        
        Note: the NumPy data type for the feature vector has been changed to optimize for speed. If results are inconsistent, try restoring the original setting:
        
        .. code-block:: python
        
            >>> langid.classify(text, datatype='uint32')
        
        
        On the command-line
        ~~~~~~~~~~~~~~~~~~~
        
        .. code-block:: bash
        
            # basic usage with probability normalization
            $ echo "This should be enough text." | langid -n
            ('en', 1.0)
        
            # define a subset of target languages
            $ echo "This won't be recognized properly." | langid -n -l fr,it,tr
            ('it', 0.9703832808613264)
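        
        The command-line output shown above is formatted like a Python tuple, so a calling script could parse it with ``ast.literal_eval``. This is a minimal sketch, assuming that the ``langid`` command is on the ``PATH`` and that it prints exactly one tuple of this form for the piped text; the helper names ``parse_cli_output`` and ``classify_with_cli`` are hypothetical.
        
        .. code-block:: python
        
            import ast
            import subprocess
        
            def parse_cli_output(line):
                """Turn a line such as "('en', 1.0)" into a (language, score) tuple."""
                lang, score = ast.literal_eval(line.strip())
                return lang, float(score)
        
            def classify_with_cli(text, langs=None):
                """Pipe text to the langid command with -n and parse its answer."""
                cmd = ['langid', '-n']
                if langs:
                    # restrict the candidate languages, e.g. ['fr', 'it', 'tr']
                    cmd += ['-l', ','.join(langs)]
                proc = subprocess.run(cmd, input=text, stdout=subprocess.PIPE,
                                      universal_newlines=True, check=True)
                return parse_cli_output(proc.stdout)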
        
        
        
        Legacy documentation
        --------------------
        
        
        **The docs below are provided for reference; only some of the functions are currently tested and maintained.**
        
        
        Introduction
        ------------
        
        ``langid.py`` is a standalone Language Identification (LangID) tool.
        
        The design principles are as follows:
        
        1. Fast
        2. Pre-trained over a large number of languages (currently 97)
        3. Not sensitive to domain-specific features (e.g. HTML/XML markup)
        4. Single .py file with minimal dependencies
        5. Deployable as a web service
        
        All that is required to run ``langid.py`` is Python >= 3.6 and numpy. 
        
        The accompanying training tools are still Python2-only.
        
        ``langid.py`` is WSGI-compliant.  ``langid.py`` will use ``fapws3`` as a web server if 
        available, and default to ``wsgiref.simple_server`` otherwise.
        
        ``langid.py`` comes pre-trained on 97 languages (ISO 639-1 codes given):
        
            af, am, an, ar, as, az, be, bg, bn, br, 
            bs, ca, cs, cy, da, de, dz, el, en, eo, 
            es, et, eu, fa, fi, fo, fr, ga, gl, gu, 
            he, hi, hr, ht, hu, hy, id, is, it, ja, 
            jv, ka, kk, km, kn, ko, ku, ky, la, lb, 
            lo, lt, lv, mg, mk, ml, mn, mr, ms, mt, 
            nb, ne, nl, nn, no, oc, or, pa, pl, ps, 
            pt, qu, ro, ru, rw, se, si, sk, sl, sq, 
            sr, sv, sw, ta, te, th, tl, tr, ug, uk, 
            ur, vi, vo, wa, xh, zh, zu
        
        The training data was drawn from 5 different sources:
        
        * JRC-Acquis 
        * ClueWeb 09
        * Wikipedia
        * Reuters RCV2
        * Debian i18n
        
        
        Usage
        -----
        
            langid [options]
        
        optional arguments:
          -h, --help            show this help message and exit
          -s, --serve           launch web service
          --host=HOST           host/ip to bind to
          --port=PORT           port to listen on
          -v                    increase verbosity (repeat for greater effect)
          -m MODEL              load model from file
          -l LANGS, --langs=LANGS
                                comma-separated set of target ISO639 language codes
                                (e.g en,de)
          -r, --remote          auto-detect IP address for remote access
          -b, --batch           specify a list of files on the command line
          --demo                launch an in-browser demo application
          -d, --dist            show full distribution over languages
          -u URL, --url=URL     langid of URL
          --line                process pipes line-by-line rather than as a document
          -n, --normalize       normalize confidence scores to probability values
        
        
        The simplest way to use ``langid.py`` is as a command-line tool, which you can 
        invoke using ``python langid.py``. If you installed ``langid.py`` as a Python 
        module (e.g. via ``pip install langid``), you can invoke ``langid`` instead of 
        ``python langid.py -n`` (the two are equivalent).  This will cause a prompt to 
        display. Enter text to identify, and hit enter::
        
          >>> This is a test
          ('en', -54.41310358047485)
          >>> Questa e una prova
          ('it', -35.41771221160889)
        
        
        ``langid.py`` can also detect when the input is redirected (only tested under Linux), and in this
        case will process until EOF rather than until newline like in interactive mode::
        
          python langid.py < README.rst 
          ('en', -22552.496054649353)
        
        
        The value returned is the unnormalized probability estimate for the language. Calculating 
        the exact probability estimate is disabled by default, but can be enabled through a flag::
        
          python langid.py -n < README.rst 
          ('en', 1.0)
        
        More details are provided in this README in the section on `Probability Normalization`_.
        
        You can also use ``langid.py`` as a Python library::
        
          # python
          Python 2.7.2+ (default, Oct  4 2011, 20:06:09) 
          [GCC 4.6.1] on linux2
          Type "help", "copyright", "credits" or "license" for more information.
          >>> import langid
          >>> langid.classify("This is a test")
          ('en', -54.41310358047485)
          
        Finally, ``langid.py`` can use Python's built-in ``wsgiref.simple_server`` (or ``fapws3`` if available) to
        provide language identification as a web service. To do this, launch ``python langid.py -s``, and
        access http://localhost:9008/detect . The web service supports GET, POST and PUT. If GET is performed
        with no data, a simple HTML forms interface is displayed.
        
        The response is generated in JSON, here is an example::
        
          {"responseData": {"confidence": -54.41310358047485, "language": "en"}, "responseDetails": null, "responseStatus": 200}
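        
        A client can decode such a response with the standard ``json`` module. The following is a minimal sketch based only on the sample above; the function name ``parse_detect_response`` is hypothetical.
        
        .. code-block:: python
        
            import json
        
            def parse_detect_response(payload):
                """Extract (language, confidence) from a JSON response of the web service."""
                response = json.loads(payload)
                if response.get('responseStatus') != 200:
                    raise ValueError('unexpected status: %r' % response.get('responseStatus'))
                data = response['responseData']
                return data['language'], data['confidence']
        
            # the sample response shown above
            sample = ('{"responseData": {"confidence": -54.41310358047485, "language": "en"}, '
                      '"responseDetails": null, "responseStatus": 200}')
            lang, confidence = parse_detect_response(sample)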
        
        A utility such as curl can be used to access the web service::
        
          # curl -d "q=This is a test" localhost:9008/detect
          {"responseData": {"confidence": -54.41310358047485, "language": "en"}, "responseDetails": null, "responseStatus": 200}
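        
        The same POST request could be prepared from Python with the standard library. This is a sketch assuming the service runs locally on port 9008 as in the examples above; ``build_detect_request`` and ``detect`` are illustrative names, not part of ``langid.py``.
        
        .. code-block:: python
        
            import json
            import urllib.parse
            import urllib.request
        
            def build_detect_request(text, host='localhost', port=9008):
                """Prepare a POST request carrying the text as a form-encoded 'q' field."""
                url = 'http://%s:%d/detect' % (host, port)
                data = urllib.parse.urlencode({'q': text}).encode('utf-8')
                return urllib.request.Request(url, data=data, method='POST')
        
            def detect(text, **kwargs):
                """Send the request and return the decoded JSON (needs a running server)."""
                request = build_detect_request(text, **kwargs)
                with urllib.request.urlopen(request) as response:
                    return json.loads(response.read().decode('utf-8'))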
        
        You can also use HTTP PUT::
        
          # curl -T readme.rst localhost:9008/detect
            % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                         Dload  Upload   Total   Spent    Left  Speed
          100  2871  100   119  100  2752    117   2723  0:00:01  0:00:01 --:--:--  2727
          {"responseData": {"confidence": -22552.496054649353, "language": "en"}, "responseDetails": null, "responseStatus": 200}
        
        If no "q=XXX" key-value pair is present in the HTTP POST payload, ``langid.py`` will interpret the entire
        file as a single query. This allows for redirection via curl::
        
          # echo "This is a test" | curl -d @- localhost:9008/detect
          {"responseData": {"confidence": -54.41310358047485, "language": "en"}, "responseDetails": null, "responseStatus": 200}
        
        ``langid.py`` will attempt to discover the host IP address automatically. Often, this is set to localhost (127.0.1.1), even 
        though the machine has a different external IP address. ``langid.py`` can attempt to automatically discover the external
        IP address. To enable this functionality, start ``langid.py`` with the ``-r`` flag.
        
        ``langid.py`` supports constraining of the output language set using the ``-l`` flag and a comma-separated list of ISO639-1 
        language codes (the ``-n`` flag enables probability normalization)::
        
          # python langid.py -n -l it,fr
          >>> Io non parlo italiano
          ('it', 0.99999999988965627)
          >>> Je ne parle pas français
          ('fr', 1.0)
          >>> I don't speak english
          ('it', 0.92210605672341062)
        
        When using ``langid.py`` as a library, the ``set_languages`` method can be used to constrain the language set::
        
          python                      
          Python 2.7.2+ (default, Oct  4 2011, 20:06:09) 
          [GCC 4.6.1] on linux2
          Type "help", "copyright", "credits" or "license" for more information.
          >>> import langid
          >>> langid.classify("I do not speak english")
          ('en', 0.57133487679900674)
          >>> langid.set_languages(['de','fr','it'])
          >>> langid.classify("I do not speak english")
          ('it', 0.99999835791478453)
          >>> langid.set_languages(['en','it'])
          >>> langid.classify("I do not speak english")
          ('en', 0.99176190378750373)
        
        
        Batch Mode
        ----------
        
        ``langid.py`` supports batch mode processing, which can be invoked with the ``-b`` flag.
        In this mode, ``langid.py`` reads a list of paths to files to classify as arguments.
        If no arguments are supplied, ``langid.py`` reads the list of paths from ``stdin``,
        which is useful for combining ``langid.py`` with UNIX utilities such as ``find``.
        
        In batch mode, ``langid.py`` uses ``multiprocessing`` to invoke multiple instances of
        the classifier, utilizing all available CPUs to classify documents in parallel. 
        
        
        Probability Normalization
        -------------------------
        
        The probabilistic model implemented by ``langid.py`` involves the multiplication of a
        large number of probabilities. For computational reasons, the actual calculations are
        implemented in the log-probability space (a common numerical technique for dealing with
        vanishingly small probabilities). One side-effect of this is that it is not necessary to
        compute a full probability in order to determine the most probable language in a set
        of candidate languages. However, users sometimes find it helpful to have a "confidence"
        score for the probability prediction. Thus, ``langid.py`` implements a re-normalization
        that produces an output in the 0-1 range.
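        
        To illustrate the idea, the sketch below turns a set of log-space scores into values that sum to 1 by exponentiating and re-normalizing them (a softmax). It is a generic illustration using only the standard library, not the code used by ``langid.py``; the function name ``normalize_log_scores`` is hypothetical and the scores are made-up values.
        
        .. code-block:: python
        
            import math
        
            def normalize_log_scores(log_scores):
                """Map {language: log-score} to {language: probability} with a stable softmax."""
                highest = max(log_scores.values())
                # subtracting the maximum first avoids underflow when exponentiating
                weights = {lang: math.exp(score - highest)
                           for lang, score in log_scores.items()}
                total = sum(weights.values())
                return {lang: weight / total for lang, weight in weights.items()}
        
            # made-up scores for three candidate languages
            probs = normalize_log_scores({'en': -54.4, 'it': -70.2, 'fr': -75.9})
            # the top-ranked language is the same before and after normalization
            best = max(probs, key=probs.get)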
        
        ``langid.py`` disables probability normalization by default. For
        command-line usages of ``langid.py``, it can be enabled by passing the ``-n`` flag. For
        probability normalization in library use, the user must instantiate their own 
        ``LanguageIdentifier``. An example of such usage is as follows::
          
          >>> from py3langid.langid import LanguageIdentifier, MODEL_FILE
          >>> identifier = LanguageIdentifier.from_pickled_model(MODEL_FILE, norm_probs=True)
          >>> identifier.classify("This is a test")
          ('en', 0.9999999909903544)
        
        
        Training a model
        ----------------
        
        Training is so far supported on Python 2.7 only; see the `original instructions <https://github.com/saffsd/langid.py#training-a-model>`_.
        
        
        Read more
        ---------
        
        ``langid.py`` is based on published research. [1] describes the LD feature selection technique in detail,
        and [2] provides more detail about the module ``langid.py`` itself.
        
        [1] Lui, Marco and Timothy Baldwin (2011) Cross-domain Feature Selection for Language Identification, 
        In Proceedings of the Fifth International Joint Conference on Natural Language Processing (IJCNLP 2011), 
        Chiang Mai, Thailand, pp. 553–561. Available from http://www.aclweb.org/anthology/I11-1062
        
        [2] Lui, Marco and Timothy Baldwin (2012) langid.py: An Off-the-shelf Language Identification Tool, 
        In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL 2012), 
        Demo Session, Jeju, Republic of Korea. Available from http://www.aclweb.org/anthology/P12-3005
        
Keywords: language detection,language identification,langid
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: BSD License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.6
