Metadata-Version: 2.1
Name: the-nordic-pile
Version: 0.0.2
Summary: The Nordic Pile
Home-page: https://gits-15.sys.kth.se/bmoell/tmh
Author: Ariel Ekgren, Birger Moell
Author-email: <bmoell@kth.se>
License: UNKNOWN
Project-URL: Bug Tracker, https://gits-15.sys.kth.se/bmoell/tmh/issues
Keywords: python,pile,dataset
Platform: UNKNOWN
Classifier: Development Status :: 1 - Planning
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: Unix
Classifier: Operating System :: MacOS :: MacOS X
Classifier: Operating System :: Microsoft :: Windows
Description-Content-Type: text/markdown
License-File: LICENSE


# The Nordic Pile replication code

The Nordic Pile is a repository that provides tools and code to download and
replicate a large Nordic language dataset. The dataset combines many smaller datasets,
with the objective of covering a broad set of language modalities.

## Workflow

To propose that a new dataset be added to the Nordic Pile, [open an issue](https://github.com/AI-Nordics/the-nordic-pile/issues/new).
Your issue should include a description of the dataset, its size, what language(s) it is in, 
a link to the data, and any other relevant information. If a project manager approves your proposal, 
they will change its label to [![Datasets](https://img.shields.io/github/labels/EleutherAI/The-Pile/Dataset)](https://github.com/AI-Nordics/the-nordic-pile/labels/Dataset) and add it to [![Project: Datasets](https://img.shields.io/badge/Project-Datasets-lightgrey)](https://github.com/AI-Nordics/the-nordic-pile/projects/1). Datasets that we elect to not include in the current version of the Pile will receive a [![Deferred](https://img.shields.io/github/labels/EleutherAI/The-Pile/Deferred%20to%20v2)](https://github.com/AI-Nordics/the-nordic-pile/labels/Deferred%20to%20v2) or [![Declined](https://img.shields.io/github/labels/EleutherAI/The-Pile/Declined)](https://github.com/AI-Nordics/the-nordic-pile/labels/Declined) 
label. For now, we focus on datasets in the Nordic languages: Swedish, Danish, Norwegian, and Finnish.

To claim responsibility for implementing an unclaimed dataset, 
leave a comment on one of our unassigned issues. Once a dataset 
has been assigned to you, make the necessary changes to `datasets.py` and `pile.py` 
in a fork and submit a pull request. If needed, you can also 
submit a script for processing the data, as shown [here](https://github.com/EleutherAI/pile_enron_emails).

To raise an issue that is not proposing a new dataset, 
open an issue with the tag [![Feature Request](https://img.shields.io/github/labels/EleutherAI/The-Pile/Feature%20Request)](https://github.com/EleutherAI/The-Pile/labels/Feature%20Request) or [![Bug](https://img.shields.io/github/labels/EleutherAI/The-Pile/Bug)](https://github.com/ekgren/the-nordic-pile/labels/Bug) as appropriate.

Data ready for final implementation should meet the following criteria:

- The data must be in [lm_dataformat](https://github.com/leogao2/lm_dataformat/) format.
- The data must be shuffled.
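
As a rough illustration of the two criteria above, the sketch below shuffles a small list of documents reproducibly and serializes them as JSON lines. The function names `shuffle_documents` and `to_jsonl`, the sample documents, and the `text`/`meta` record layout are assumptions made for this example only. An actual contribution should write its archive with the lm_dataformat library itself rather than with this stand-in.

```python
import json
import random


def shuffle_documents(documents, seed=0):
    """Return a shuffled copy of the documents, reproducible for a given seed."""
    shuffled = list(documents)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    return shuffled


def to_jsonl(documents, source):
    """Serialize documents as JSON lines, one record per document.

    The `text`/`meta` layout is an assumption for illustration; it is not
    produced by lm_dataformat here.
    """
    return "\n".join(
        json.dumps({"text": text, "meta": {"source": source}}, ensure_ascii=False)
        for text in documents
    )


# Hypothetical sample documents, one per target language.
docs = [
    "Ett svenskt dokument.",
    "Et dansk dokument.",
    "Et norsk dokument.",
    "Suomalainen asiakirja.",
]
shuffled = shuffle_documents(docs, seed=42)
jsonl = to_jsonl(shuffled, source="example")
```

Using a fixed seed keeps the shuffle repeatable, which can make it easier to check that two runs of a processing script produce the same output.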

## Attribution
This initiative is heavily inspired by EleutherAI's The Pile project.  
[https://www.eleuther.ai/](https://www.eleuther.ai/)  
[https://pile.eleuther.ai/](https://pile.eleuther.ai/)  

## Datasets
| Dataset      | Status |
| ----------- | ----------- |
| Wikipedia-Swedish      | 🙋‍♀️ Waiting for contributor      |
| Wikipedia-Danish      | 🙋‍♀️ Waiting for contributor      |
| Wikipedia-Norwegian      | 🙋‍♀️ Waiting for contributor     |
| Wikipedia-Finnish      | 🙋‍♀️ Waiting for contributor       |
| Swedish Parliament   | 🙋‍♀️ Waiting for contributor        |

