The mini-kedro Kedro starter¶
Introduction¶
Mini-Kedro makes it possible to use Kedro's DataCatalog functionality on its own.
Use Kedro to configure and explore data sources in a Jupyter notebook with the DataCatalog feature.
The DataCatalog allows you to specify the data sources that you load and save using a YAML API. See an example:
```yaml
# conf/base/catalog.yml
example_dataset_1:
  type: pandas.CSVDataSet
  filepath: folder/filepath.csv

example_dataset_2:
  type: spark.SparkDataSet
  filepath: s3a://your_bucket/data/01_raw/example_dataset_2*
  credentials: dev_s3
  file_format: csv
  save_args:
    if_exists: replace
```
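A catalog file like the one above is ordinary YAML: each top-level key names a dataset, and the `type` key selects the dataset class that handles loading and saving. A minimal sketch (assuming PyYAML is installed) showing the structure Kedro reads from this file:

```python
import yaml

# The same catalog definition as above, inlined as a string for illustration.
catalog_yaml = """
example_dataset_1:
  type: pandas.CSVDataSet
  filepath: folder/filepath.csv

example_dataset_2:
  type: spark.SparkDataSet
  filepath: s3a://your_bucket/data/01_raw/example_dataset_2*
  credentials: dev_s3
  file_format: csv
  save_args:
    if_exists: replace
"""

# Each top-level key is a dataset name; "type" selects the dataset class.
config = yaml.safe_load(catalog_yaml)
for name, spec in config.items():
    print(name, "->", spec["type"])
```

In a real project Kedro parses `conf/base/catalog.yml` for you; this snippet only demonstrates the shape of the configuration.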
Later on, you can build a full pipeline using the same configuration, with the following steps:
Create a new empty Kedro project in a new directory
```bash
kedro new
```
Let’s assume that the new project is created at /path/to/your/project.
Copy the conf/ and data/ directories over to the new project

```bash
cp -fR {conf,data} /path/to/your/project
```
This makes it possible to use commands like df = catalog.load("example_dataset_1") and catalog.save("example_dataset_2", df) to interact with data in a Jupyter notebook.
The advantage of this approach is that you never need to specify file paths for loading or saving data in your Jupyter notebook.
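To illustrate that load/save pattern without a Kedro installation, here is a toy in-memory stand-in for the catalog. `MiniCatalog` is an invented name for illustration only; the real DataCatalog is configured from `catalog.yml` and backed by dataset classes such as `pandas.CSVDataSet`:

```python
# A toy stand-in that mimics the DataCatalog's load/save interface.
# In Kedro, each name maps to a dataset entry in catalog.yml and data is
# persisted to the configured filepath; here it is just kept in a dict.
class MiniCatalog:
    def __init__(self):
        self._store = {}

    def save(self, name, data):
        # Mirrors catalog.save(name, data): persist data under a dataset name.
        self._store[name] = data

    def load(self, name):
        # Mirrors catalog.load(name): read the dataset back by name.
        return self._store[name]

catalog = MiniCatalog()
catalog.save("example_dataset_1", [1, 2, 3])
df = catalog.load("example_dataset_1")
print(df)  # → [1, 2, 3]
```

The point of the real DataCatalog is the same interface with file paths, credentials, and formats resolved from configuration rather than hard-coded in the notebook.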
Content¶
The starter contains:
- A conf/ directory, which contains an example DataCatalog configuration (catalog.yml)
- A data/ directory, which contains an example dataset identical to the one used by the pandas-iris starter
- An example notebook showing how to instantiate the DataCatalog and interact with the example dataset
- A README.md, which explains how to use the project created by the starter and how to transition to a full Kedro project afterwards