Metadata-Version: 2.1
Name: rdsa-utils
Version: 0.2.0
Summary: A suite of PySpark, Pandas, and general pipeline utils for Reproducible Data Science and Analysis (RDSA) projects.
Home-page: https://github.com/ONSdigital/rdsa-utils
Author: Reproducible Data Science & Analysis, ONS
Author-email: Diego.Lara.De.Andres@ons.gov.uk, Meg.Scammell@ons.gov.uk, Dominic.Bean@ons.gov.uk
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: <3.12,>=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: cloudpathlib[gs]>=0.15.1
Requires-Dist: humanfriendly>=9.1
Requires-Dist: more-itertools>=9.0.0
Requires-Dist: pandas==1.5.3
Requires-Dist: pydantic>=2.6.2
Requires-Dist: pyyaml>=6.0.1
Requires-Dist: tomli>=2.0.1
Requires-Dist: google-cloud-bigquery>=3.17.2
Requires-Dist: google-cloud-storage>=2.14.0
Provides-Extra: dev
Requires-Dist: bump2version>=1.0.0; extra == "dev"
Requires-Dist: pre-commit>=2.6.0; extra == "dev"
Requires-Dist: ruff>=0.0.270; extra == "dev"
Requires-Dist: chispa>=0.9.2; extra == "dev"
Requires-Dist: coverage[toml]>=7.1.0; extra == "dev"
Requires-Dist: pytest<8.0.0,>=7.1.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: pytest-lazy-fixture>=0.6.0; extra == "dev"
Requires-Dist: pytest-mock>=3.8.0; extra == "dev"
Requires-Dist: pyspark==3.5.1; extra == "dev"
Provides-Extra: doc
Requires-Dist: mkdocs>=1.4.2; extra == "doc"
Requires-Dist: mkdocs-tech-docs-template>=0.1.2; extra == "doc"
Requires-Dist: mkdocstrings[python]>=0.22.0; extra == "doc"
Requires-Dist: mkdocs-git-revision-date-localized-plugin>=1.2.1; extra == "doc"
Requires-Dist: mkdocs-jupyter>=0.24.3; extra == "doc"
Requires-Dist: mkdocs-mermaid2-plugin>=1.1.1; extra == "doc"

# 🧰 rdsa-utils 

[![Deploy to PyPI](https://github.com/ONSdigital/rdsa-utils/actions/workflows/deploy_pypi.yaml/badge.svg?branch=main)](https://github.com/ONSdigital/rdsa-utils/actions/workflows/deploy_pypi.yaml)
[![Deploy MkDocs](https://github.com/ONSdigital/rdsa-utils/actions/workflows/deploy_mkdocs.yaml/badge.svg?branch=main)](https://github.com/ONSdigital/rdsa-utils/actions/workflows/deploy_mkdocs.yaml)
[![PyPI version](https://badge.fury.io/py/rdsa-utils.svg)](https://pypi.org/project/rdsa-utils/)

A suite of PySpark, Pandas, and general pipeline utils for **Reproducible Data Science and Analysis (RDSA)** projects.

The RDSA team sits within the Economic Statistics Change Directorate, and uses cutting-edge data science and engineering skills to produce the next generation of economic statistics. Current priorities include overhauling legacy systems and developing new systems for key statistics. More information about work at RDSA can be found here: [Using Data Science for Next-Gen Statistics](https://dataingovernment.blog.gov.uk/2023/02/14/using-data-science-for-next-gen-statistics/).

`rdsa-utils` is a Python codebase built with Python 3.8 and higher, and uses `setup.py`, `setup.cfg`, and `pyproject.toml` for dependency management and packaging.

## 📋 Prerequisites 

- Python 3.8 or higher

## 💾 Installation 

`rdsa-utils` is available for installation via [PyPI](https://pypi.org/project/rdsa-utils/) and can also be found on [GitHub Releases](https://github.com/ONSdigital/rdsa-utils/releases) for direct downloads and version history.

To install via `pip`, simply run:

```bash
pip install rdsa-utils
```

## 📖 Documentation and Further Information 

Our documentation is automatically generated using **GitHub Actions** and **MkDocs**. For an in-depth understanding of `rdsa-utils`, how to contribute to `rdsa-utils`, and more, please refer to our [MkDocs-generated documentation](https://onsdigital.github.io/rdsa-utils/).

## 📘 Further Reading on Reproducible Analytical Pipelines

While `rdsa-utils` provides essential tools for data processing, it represents just one aspect of the broader development process necessary for building and maintaining robust, high-quality codebases. Adhering to best practices and utilising the right tools are crucial for success.

We highly recommend exploring the following resources to gain deeper insights into the creation of Reproducible Analytical Pipelines (RAP), which delve into key topics such as version control, modular code development, unit testing, and peer review -- all vital for developing these pipelines:

- [Reproducible Analytical Pipelines (RAP) Resource](https://analysisfunction.civilservice.gov.uk/support/reproducible-analytical-pipelines/) - This resource offers an overview of Reproducible Analytical Pipelines, covering benefits, case studies, and guidelines on building a RAP. It discusses minimising manual steps, using open source software like R or Python, enhancing quality assurance through peer review, and ensuring auditability with version control. It also addresses challenges and considerations for implementing RAPs, such as data access restrictions or confidentiality, and underscores the importance of collaborative development.

- [Quality Assurance of Code for Analysis and Research](https://best-practice-and-impact.github.io/qa-of-code-guidance/intro.html) - This book details methods and practices for ensuring high-quality coding in research and analysis, including unit testing and peer reviews.

- [PySpark Introduction and Training Book](https://best-practice-and-impact.github.io/ons-spark/intro.html) - An introduction to using PySpark for large-scale data processing.

## 🛡️ Licence

Unless stated otherwise, the codebase is released under the [MIT License][mit].
This covers both the codebase and any sample code in the documentation.

The documentation is [© Crown copyright][copyright] and available under the terms of the [Open Government 3.0][ogl] licence.

[mit]: LICENSE
[copyright]: http://www.nationalarchives.gov.uk/information-management/re-using-public-sector-information/uk-government-licensing-framework/crown-copyright/
[ogl]: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
