Kedro IO¶
In this tutorial, we cover advanced uses of the Kedro IO module to understand the underlying implementation. The relevant API documentation is kedro.io.AbstractDataSet and kedro.io.DataSetError.
Error handling¶
We have custom exceptions for the main classes of errors that you can handle to deal with failures.
from kedro.io import DataCatalog, DataSetError

io = DataCatalog(data_sets=dict())  # empty catalog

try:
    cars_df = io.load("cars")
except DataSetError:
    print("Error raised.")
AbstractDataSet¶
To understand what is going on behind the scenes, you should study the AbstractDataSet interface. AbstractDataSet is the underlying interface that all datasets extend. It requires subclasses to override the _load and _save methods, and it provides load and save methods that enrich the corresponding private methods with uniform error handling. It also requires subclasses to override _describe, which is used to log internal information about instances of your custom AbstractDataSet implementation.
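The relationship between the public and private methods described above can be illustrated with a small template-method sketch. It does not import Kedro: SketchDataSet, SketchDataSetError and InMemoryText are hypothetical stand-ins that only mimic the shape of the interface, and the real AbstractDataSet does more than this.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class SketchDataSetError(Exception):
    """Hypothetical stand-in for kedro.io.DataSetError."""


class SketchDataSet(ABC):
    """Hypothetical stand-in that mimics the shape of AbstractDataSet."""

    def load(self) -> Any:
        # Public method: delegate to _load and wrap any failure uniformly.
        try:
            return self._load()
        except Exception as exc:
            raise SketchDataSetError(
                f"Failed while loading data from {self._describe()}"
            ) from exc

    def save(self, data: Any) -> None:
        # Public method: delegate to _save and wrap any failure uniformly.
        try:
            self._save(data)
        except Exception as exc:
            raise SketchDataSetError(
                f"Failed while saving data to {self._describe()}"
            ) from exc

    @abstractmethod
    def _load(self) -> Any: ...

    @abstractmethod
    def _save(self, data: Any) -> None: ...

    @abstractmethod
    def _describe(self) -> Dict[str, Any]: ...


class InMemoryText(SketchDataSet):
    """Toy subclass that keeps a single string in memory."""

    def __init__(self) -> None:
        self._value = None

    def _load(self) -> str:
        if self._value is None:
            raise ValueError("nothing has been saved yet")
        return self._value

    def _save(self, data: str) -> None:
        self._value = data

    def _describe(self) -> Dict[str, Any]:
        return {"type": "InMemoryText"}
```

Callers of such a class only ever need to catch one exception type, whatever goes wrong inside _load or _save.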
If you have a dataset called parts, you can make direct calls to it like so:
parts_df = parts.load()
However, we recommend using a DataCatalog instead (for more details, see this section in the User Guide) as it has been designed to make all datasets available to project members.
For contributors, if you would like to submit a new dataset, you will have to extend AbstractDataSet. For a complete guide, please read Creating a new dataset.
Versioning¶
In order to enable versioning, you need to update the catalog.yml config file and set the versioned attribute to true for the given dataset. If this is a custom dataset, the implementation must also:
- extend kedro.io.core.AbstractVersionedDataSet AND
- add version namedtuple as an argument to its __init__ method AND
- call super().__init__() with positional arguments filepath, version, and, optionally, with glob and exists functions if it uses a non-local filesystem (see kedro.extras.datasets.pandas.CSVDataSet as an example) AND
- modify its _describe, _load and _save methods respectively to support versioning (see kedro.extras.datasets.pandas.CSVDataSet for an example implementation)
Note
If a new version of a dataset is created mid-run, for instance by an external system adding new files, it will not interfere in the current run, i.e. the load version stays the same throughout subsequent loads.
An example dataset could look similar to the below:
from pathlib import Path, PurePosixPath

import pandas as pd

from kedro.io import AbstractVersionedDataSet


class MyOwnDataSet(AbstractVersionedDataSet):
    def __init__(self, filepath, version, param1, param2=True):
        super().__init__(PurePosixPath(filepath), version)
        self._param1 = param1
        self._param2 = param2

    def _load(self) -> pd.DataFrame:
        load_path = self._get_load_path()
        return pd.read_csv(load_path)

    def _save(self, df: pd.DataFrame) -> None:
        save_path = self._get_save_path()
        df.to_csv(save_path)

    def _exists(self) -> bool:
        path = self._get_load_path()
        return Path(path).exists()

    def _describe(self):
        return dict(version=self._version, param1=self._param1, param2=self._param2)
With catalog.yml specifying:
my_dataset:
  type: <path-to-my-own-dataset>.MyOwnDataSet
  filepath: data/01_raw/my_data.csv
  versioned: true
  param1: <param1-value> # param1 is a required argument
  # param2 will be True by default
version namedtuple¶
The __init__ method of a versioned dataset must have an optional argument called version with a default value of None. If provided, this argument must be an instance of kedro.io.core.Version. Its load and save attributes must either be None or contain string values representing exact load and save versions:
- If version is None then the dataset is considered not versioned.
- If version.load is None then the latest available version will be used to load the dataset, otherwise a string representing exact load version must be provided.
- If version.save is None then a new save version string will be generated by calling kedro.io.core.generate_timestamp(), otherwise a string representing exact save version must be provided.
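The three rules above can be read as a small resolution function. The sketch below is illustrative only and does not import Kedro: SketchVersion, sketch_timestamp and resolve_versions are hypothetical names, the timestamp layout is assumed to follow the YYYY-MM-DDThh.mm.ss.sssZ shape, and choosing the lexicographically largest string as "latest" is an assumption that holds for timestamps of that shape.

```python
from collections import namedtuple
from datetime import datetime, timezone
from typing import List, Optional, Tuple

# Hypothetical stand-in with the same two fields as kedro.io.core.Version.
SketchVersion = namedtuple("SketchVersion", ["load", "save"])


def sketch_timestamp() -> str:
    """Build a string shaped like YYYY-MM-DDThh.mm.ss.sssZ (assumed format)."""
    now = datetime.now(tz=timezone.utc)
    return now.strftime("%Y-%m-%dT%H.%M.%S.") + f"{now.microsecond // 1000:03d}Z"


def resolve_versions(
    version: Optional[SketchVersion], available: List[str]
) -> Optional[Tuple[str, str]]:
    """Return (load_version, save_version), or None when not versioned."""
    if version is None:
        return None  # the dataset is considered not versioned
    # Timestamp-shaped strings sort chronologically, so max() picks the latest.
    # max() raises ValueError if no version exists yet.
    load = version.load if version.load is not None else max(available)
    save = version.save if version.save is not None else sketch_timestamp()
    return load, save
```

Exact versions pass through untouched, so resolve_versions(SketchVersion("a", "b"), []) simply returns ("a", "b").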
Versioning using the YAML API¶
The easiest way to version a specific dataset is to change the corresponding entry in the catalog.yml. For example, if the following dataset was defined in the catalog.yml:
cars:
  type: pandas.CSVDataSet
  filepath: data/01_raw/company/car_data.csv
  versioned: true
The DataCatalog will create a versioned CSVDataSet called cars. The actual csv file location will look like data/01_raw/company/car_data.csv/<version>/car_data.csv, where <version> corresponds to a global save version string formatted as YYYY-MM-DDThh.mm.ss.sssZ. Every time the DataCatalog is instantiated, it generates a new global save version, which is propagated to all versioned datasets it contains.
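The path layout described above, with the original filepath acting as a directory that holds one sub-directory per version, can be sketched with the standard library alone. versioned_path is a hypothetical helper written for illustration and is not part of Kedro.

```python
from pathlib import PurePosixPath


def versioned_path(filepath: str, version: str) -> str:
    """Return <filepath>/<version>/<filename> for a versioned dataset file."""
    path = PurePosixPath(filepath)
    # The configured filepath becomes a folder; the file name is repeated inside it.
    return str(path / version / path.name)


example = versioned_path(
    "data/01_raw/company/car_data.csv", "2019-02-13T14.35.36.518Z"
)
```

Here example is the path of the CSV file for that single version of cars.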
catalog.yml only allows you to version your datasets; it does not allow you to choose which version to load or save. This is deliberate because we have chosen to separate the data catalog from any runtime configuration. If you need to pin a dataset version, you can either specify the versions in a separate yml file and load it at runtime, or instantiate your versioned datasets using the Code API and define a version parameter explicitly.
By default, the DataCatalog will load the latest version of the dataset. However, it is also possible to specify an exact load version. In order to do that, you can pass a dictionary with exact load versions to DataCatalog.from_config:
load_versions = {"cars": "2019-02-13T14.35.36.518Z"}
io = DataCatalog.from_config(catalog_config, credentials, load_versions=load_versions)
cars = io.load("cars")
The last row in the example above would attempt to load a CSV file from data/01_raw/company/car_data.csv/2019-02-13T14.35.36.518Z/car_data.csv:
- load_versions configuration has an effect only if dataset versioning has been enabled in the catalog config file - see the example above.
- We recommend that you do not override the save_version argument in DataCatalog.from_config unless strongly required to do so, since it may lead to inconsistencies between loaded and saved versions of the versioned datasets.
Attention
The DataCatalog does not re-generate save versions between instantiations. Therefore, if you call catalog.save('cars', some_data) twice, then the second call will fail, since it tries to overwrite a versioned dataset using the same save version. To mitigate this, reload your data catalog by calling %reload_kedro line magic. This limitation does not apply to load operation.
Versioning using the Code API¶
Although we recommend enabling versioning using the catalog.yml config file as described in the section above, you may require more control over load and save versions of a specific dataset. To achieve this you can instantiate Version and pass it as a parameter to the dataset initialisation:
from kedro.io import DataCatalog, Version
from kedro.extras.datasets.pandas import CSVDataSet
import pandas as pd

data1 = pd.DataFrame({"col1": [1, 2], "col2": [4, 5], "col3": [5, 6]})
data2 = pd.DataFrame({"col1": [7], "col2": [8], "col3": [9]})
version = Version(
    load=None,  # load the latest available version
    save=None,  # generate save version automatically on each save operation
)

test_data_set = CSVDataSet(
    filepath="data/01_raw/test.csv", save_args={"index": False}, version=version
)
io = DataCatalog({"test_data_set": test_data_set})

# save the dataset to data/01_raw/test.csv/<version>/test.csv
io.save("test_data_set", data1)
# save the dataset into a new file data/01_raw/test.csv/<version>/test.csv
io.save("test_data_set", data2)

# load the latest version from data/01_raw/test.csv/*/test.csv
reloaded = io.load("test_data_set")
assert data2.equals(reloaded)
Note
In the example above we did not fix any versions. If we do, then the behaviour of load and save operations becomes slightly different:
version = Version(
    load="my_exact_version",  # load exact version
    save="my_exact_version",  # save to exact version
)

test_data_set = CSVDataSet(
    filepath="data/01_raw/test.csv", save_args={"index": False}, version=version
)
io = DataCatalog({"test_data_set": test_data_set})

# save the dataset to data/01_raw/test.csv/my_exact_version/test.csv
io.save("test_data_set", data1)
# load from data/01_raw/test.csv/my_exact_version/test.csv
reloaded = io.load("test_data_set")
assert data1.equals(reloaded)

# raises DataSetError since the path
# data/01_raw/test.csv/my_exact_version/test.csv already exists
io.save("test_data_set", data2)
Attention
Passing exact load and/or save versions to the dataset instantiation is not recommended, since it may lead to inconsistencies between operations. For example, if the versions for load and save operations do not match, the save operation will emit a UserWarning indicating that save and load versions do not match. A load after a save may also raise an error if the corresponding load version is not found:
version = Version(
    load="exact_load_version",  # load exact version
    save="exact_save_version",  # save to exact version
)

test_data_set = CSVDataSet(
    filepath="data/01_raw/test.csv", save_args={"index": False}, version=version
)
io = DataCatalog({"test_data_set": test_data_set})

io.save("test_data_set", data1)  # emits a UserWarning due to version inconsistency

# raises DataSetError since the data/01_raw/test.csv/exact_load_version/test.csv
# file does not exist
reloaded = io.load("test_data_set")
Supported datasets¶
Currently the following datasets support versioning:
- kedro.extras.datasets.matplotlib.MatplotlibWriter
- kedro.extras.datasets.holoviews.HoloviewsWriter
- kedro.extras.datasets.networkx.NetworkXDataSet
- kedro.extras.datasets.pandas.CSVDataSet
- kedro.extras.datasets.pandas.ExcelDataSet
- kedro.extras.datasets.pandas.FeatherDataSet
- kedro.extras.datasets.pandas.HDFDataSet
- kedro.extras.datasets.pandas.JSONDataSet
- kedro.extras.datasets.pandas.ParquetDataSet
- kedro.extras.datasets.pickle.PickleDataSet
- kedro.extras.datasets.pillow.ImageDataSet
- kedro.extras.datasets.text.TextDataSet
- kedro.extras.datasets.spark.SparkDataSet
- kedro.extras.datasets.yaml.YAMLDataSet
- kedro.extras.datasets.api.APIDataSet
- kedro.extras.datasets.tensorflow.TensorFlowModelDataset
- kedro.extras.datasets.json.JSONDataSet
Note
Although HTTP(S) is a supported file system in the dataset implementations, it does not support versioning.
Partitioned dataset¶
These days distributed systems play an increasingly important role in ETL data pipelines. They significantly increase the processing throughput, enabling us to work with much larger volumes of input data. However, these benefits sometimes come at a cost. When dealing with the input data generated by such distributed systems, you may encounter a situation where your Kedro node needs to read the data from a directory full of uniform files of the same type (e.g. JSON, CSV, Parquet, etc.) rather than from a single file. Tools like PySpark and the corresponding SparkDataSet cater for such use cases, but the use of Spark is not always feasible.
This is the reason why Kedro provides a built-in PartitionedDataSet, which has the following features:
- PartitionedDataSet can recursively load all or specific files from a given location.
- Is platform agnostic and can work with any filesystem implementation supported by fsspec including local, S3, GCS, and many more.
- Implements a lazy loading approach and does not attempt to load any partition data until a processing node explicitly requests it.
Note
In this section each individual file inside a given location is called a partition.
Partitioned dataset definition¶
PartitionedDataSet definition can be put in your catalog.yml like any other regular dataset definition; the definition represents the following structure:
# conf/base/catalog.yml

my_partitioned_dataset:
  type: PartitionedDataSet
  path: s3://my-bucket-name/path/to/folder  # path to the location of partitions
  dataset: pandas.CSVDataSet  # shorthand notation for the dataset which will handle individual partitions
  credentials: my_credentials
  load_args:
    load_arg1: value1
    load_arg2: value2
Note
Like any other dataset, PartitionedDataSet can also be instantiated programmatically in Python:
from kedro.extras.datasets.pandas import CSVDataSet
from kedro.io import PartitionedDataSet

my_credentials = {...}  # credentials dictionary

my_partitioned_dataset = PartitionedDataSet(
    path="s3://my-bucket-name/path/to/folder",
    dataset=CSVDataSet,
    credentials=my_credentials,
    load_args={"load_arg1": "value1", "load_arg2": "value2"},
)
Alternatively, if you need more granular configuration of the underlying dataset, its definition can be provided in full:
# conf/base/catalog.yml

my_partitioned_dataset:
  type: PartitionedDataSet
  path: s3://my-bucket-name/path/to/folder
  dataset:  # full dataset config notation
    type: pandas.CSVDataSet
    load_args:
      delimiter: ","
    save_args:
      index: false
  credentials: my_credentials
  load_args:
    load_arg1: value1
    load_arg2: value2
  filepath_arg: filepath  # the argument of the dataset to pass the filepath to
  filename_suffix: ".csv"
Here is an exhaustive list of the arguments supported by PartitionedDataSet:
| Argument | Required | Supported types | Description |
|---|---|---|---|
| path | Yes | | Path to the folder containing partitioned data. If path starts with the protocol (e.g., |
| dataset | Yes | | Underlying dataset definition, for more details see the section below |
| credentials | No | | Protocol-specific options that will be passed to |
| load_args | No | | Keyword arguments to be passed into |
| filepath_arg | No (defaults to filepath) | | Argument name of the underlying dataset initializer that will contain a path to an individual partition |
| filename_suffix | No (defaults to an empty string) | | If specified, partitions that don’t end with this string will be ignored |
Dataset definition¶
Dataset definition should be passed into the dataset argument of the PartitionedDataSet. The dataset definition is used to instantiate a new dataset object for each individual partition, and use that dataset object for load and save operations. Dataset definition supports shorthand and full notations.
Shorthand notation¶
Shorthand notation requires you to specify only the class of the underlying dataset, either as a string (e.g. pandas.CSVDataSet or a fully qualified class path like kedro.extras.datasets.pandas.CSVDataSet) or as a class object that is a subclass of AbstractDataSet.
Full notation¶
Full notation allows you to specify a dictionary with the full underlying dataset definition except the following arguments:
- The argument that receives the partition path (filepath by default) - if specified, a UserWarning will be emitted stating that this value will be overridden by individual partition paths
- credentials key - specifying it will result in DataSetError being raised; dataset credentials should be passed into credentials argument of the PartitionedDataSet rather than underlying dataset definition - see the section below for details
- versioned flag - specifying it will result in DataSetError being raised; versioning cannot be enabled for the underlying datasets
Partitioned dataset credentials¶
Note
Support for dataset_credentials key in the credentials for PartitionedDataSet is now deprecated. The dataset credentials should be specified explicitly inside the dataset config.
Credentials management for PartitionedDataSet is somewhat special in the sense that it may involve credentials for both PartitionedDataSet itself and the underlying dataset that is used for partition load and save. Top-level credentials are passed to the underlying dataset config (unless such config already has credentials configured), but not the other way around - dataset credentials are never propagated to the filesystem.
Here is the full list of possible scenarios:
| Top-level credentials | Underlying dataset credentials | Example | Description |
|---|---|---|---|
| Undefined | Undefined | | Credentials are not passed to the underlying dataset or the filesystem |
| Undefined | Specified | | Underlying dataset credentials are passed to the |
| Specified | Undefined | | Top-level credentials are passed to the underlying |
| Specified | | | Top-level credentials are passed to the filesystem, |
| Specified | Specified | | Top-level credentials are passed to the filesystem, underlying dataset credentials are passed to the |
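The propagation rule stated before the table can be written down as a small function. This is a sketch of that rule only, not Kedro's implementation: split_credentials is a hypothetical helper, and it treats the presence of a credentials key in the dataset config as "already configured".

```python
from typing import Any, Dict, Optional, Tuple


def split_credentials(
    top_level: Optional[Dict[str, Any]], dataset_config: Dict[str, Any]
) -> Tuple[Optional[Dict[str, Any]], Optional[Dict[str, Any]]]:
    """Return (filesystem_credentials, dataset_credentials)."""
    # Top-level credentials are what the filesystem receives.
    filesystem_credentials = top_level
    if "credentials" in dataset_config:
        # The dataset config already has credentials configured, so keep them.
        dataset_credentials = dataset_config["credentials"]
    else:
        # Otherwise the top-level credentials are passed down to the dataset.
        dataset_credentials = top_level
    # Dataset credentials never flow back up to the filesystem.
    return filesystem_credentials, dataset_credentials
```

For instance, with top-level credentials {"key": "k"} and a dataset config that carries its own {"secret": True}, the filesystem gets the former and the dataset gets the latter.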
Partitioned dataset load¶
Let’s assume that the Kedro pipeline that you are working with contains the node defined as follows:
from kedro.pipeline import node
node(concat_partitions, inputs="my_partitioned_dataset", outputs="concatenated_result")
The underlying node function concat_partitions may look like this:
from typing import Any, Callable, Dict

import pandas as pd


def concat_partitions(partitioned_input: Dict[str, Callable[[], Any]]) -> pd.DataFrame:
    """Concatenate input partitions into one pandas DataFrame.

    Args:
        partitioned_input: A dictionary with partition ids as keys and load functions as values.

    Returns:
        Pandas DataFrame representing a concatenation of all loaded partitions.
    """
    result = pd.DataFrame()

    for partition_key, partition_load_func in sorted(partitioned_input.items()):
        partition_data = partition_load_func()  # load the actual partition data
        # concat with existing result
        result = pd.concat([result, partition_data], ignore_index=True, sort=True)

    return result
As you can see from the example above, on load PartitionedDataSet does not automatically load the data from the located partitions. Instead, PartitionedDataSet returns a dictionary with partition IDs as keys and the corresponding load functions as values. It allows the node that consumes the PartitionedDataSet to implement the logic that defines what partitions need to be loaded and how this data is going to be processed.
Partition ID does not represent the whole partition path, but only a part of it that is unique for a given partition and filename suffix:
- Example 1: if path=s3://my-bucket-name/folder and partition is stored in s3://my-bucket-name/folder/2019-12-04/data.csv then its Partition ID is 2019-12-04/data.csv.
- Example 2: if path=s3://my-bucket-name/folder and filename_suffix=".csv" and partition is stored in s3://my-bucket-name/folder/2019-12-04/data.csv then its Partition ID is 2019-12-04/data.
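The two examples above amount to stripping the dataset path from the front of the partition path and the filename suffix from the end. A minimal sketch of that derivation follows; partition_id is a hypothetical helper written for illustration and is not Kedro's own code.

```python
def partition_id(path: str, partition_path: str, filename_suffix: str = "") -> str:
    """Derive a partition ID from the dataset path and a partition's full path."""
    prefix = path.rstrip("/") + "/"
    if not partition_path.startswith(prefix):
        raise ValueError(f"{partition_path!r} is not located under {path!r}")
    # Drop the common dataset path from the front.
    relative = partition_path[len(prefix):]
    # Drop the filename suffix from the end, if one is configured.
    if filename_suffix and relative.endswith(filename_suffix):
        relative = relative[: -len(filename_suffix)]
    return relative
```

The same relationship runs in reverse on save: the partition ID plus the suffix, appended to the path, gives the file location.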
PartitionedDataSet implements caching on load operation, which means that if multiple nodes consume the same PartitionedDataSet, they will all receive the same partition dictionary even if some new partitions were added to the folder after the first load has been completed. This is done deliberately to guarantee the consistency of load operations between the nodes and avoid race conditions. You can reset the cache by calling release() method of the partitioned dataset object.
Partitioned dataset save¶
PartitionedDataSet also supports a save operation. Let’s assume the following configuration:
# conf/base/catalog.yml

new_partitioned_dataset:
  type: PartitionedDataSet
  path: s3://my-bucket-name
  dataset: pandas.CSVDataSet
  filename_suffix: ".csv"
node definition:
from kedro.pipeline import node
node(create_partitions, inputs=None, outputs="new_partitioned_dataset")
and underlying node function create_partitions:
from typing import Any, Dict

import pandas as pd


def create_partitions() -> Dict[str, Any]:
    """Create new partitions and save using PartitionedDataSet.

    Returns:
        Dictionary with the partitions to create.
    """
    return {
        # create a file "s3://my-bucket-name/part/foo.csv"
        "part/foo": pd.DataFrame({"data": [1, 2]}),
        # create a file "s3://my-bucket-name/part/bar.csv.csv"
        "part/bar.csv": pd.DataFrame({"data": [3, 4]}),
    }
Note
Writing to an existing partition may result in its data being overwritten, if this case is not specifically handled by the underlying dataset implementation. You should implement your own checks to ensure that no existing data is lost when writing to a PartitionedDataSet. The simplest safety mechanism could be to use partition IDs that have a high chance of uniqueness: for example, the current timestamp.
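One way to follow the timestamp suggestion above is to prefix every partition ID with the time at which it was produced. The helper below is a hypothetical sketch; the name timestamped_partition_id and the chosen timestamp layout are assumptions, and any format that is unique enough for your pipeline would do.

```python
from datetime import datetime, timezone
from typing import Optional


def timestamped_partition_id(name: str, now: Optional[datetime] = None) -> str:
    """Prefix a partition name with a UTC timestamp to make collisions unlikely."""
    # Allow the caller to inject a fixed time, which keeps the function testable.
    now = now or datetime.now(tz=timezone.utc)
    return f"{now.strftime('%Y-%m-%dT%H.%M.%S.%f')}/{name}"
```

A node could then return {timestamped_partition_id("foo"): some_dataframe} so that successive runs write to different partitions instead of overwriting one another.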
PartitionedDataSet also supports lazy saving, where the partition’s data is not materialized until it’s time to write.
To use this, simply return Callable types in the dictionary:
from typing import Any, Callable, Dict

import pandas as pd


def create_partitions() -> Dict[str, Callable[[], Any]]:
    """Create new partitions and save using PartitionedDataSet.

    Returns:
        Dictionary of the partitions to create to a function that creates them.
    """
    return {
        # create a file "s3://my-bucket-name/part/foo.csv"
        "part/foo": lambda: pd.DataFrame({"data": [1, 2]}),
        # create a file "s3://my-bucket-name/part/bar.csv"
        "part/bar": lambda: pd.DataFrame({"data": [3, 4]}),
    }
Note: When using lazy saving, the dataset will be written after the after_node_run hook.
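Conceptually, lazy saving means that a callable value is only invoked at the moment its partition is written. The sketch below illustrates that idea with a hypothetical helper, materialize_partitions; it is an assumption about the general approach and not a copy of what PartitionedDataSet does internally.

```python
from typing import Any, Dict


def materialize_partitions(partitions: Dict[str, Any]) -> Dict[str, Any]:
    """Call any callable values so that every partition holds concrete data."""
    materialized = {}
    for partition_id, value in sorted(partitions.items()):
        # A callable is treated as a lazy producer and is only invoked here.
        materialized[partition_id] = value() if callable(value) else value
    return materialized
```

Because each producer runs only when its turn comes, a node can describe many large partitions without holding all of them in memory at once.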
Incremental loads with IncrementalDataSet¶
IncrementalDataSet is a subclass of PartitionedDataSet, which stores the information about the last processed partition in the so-called checkpoint. IncrementalDataSet addresses the use case when partitions have to be processed incrementally, i.e. each subsequent pipeline run should only process the partitions which were not processed by the previous runs.
This checkpoint, by default, is persisted to the location of the data partitions. For example, for IncrementalDataSet instantiated with path s3://my-bucket-name/path/to/folder the checkpoint will be saved to s3://my-bucket-name/path/to/folder/CHECKPOINT, unless the checkpoint configuration is explicitly overwritten.
The checkpoint file is only created after the partitioned dataset is explicitly confirmed.
Incremental dataset load¶
Loading IncrementalDataSet works similarly to PartitionedDataSet with several exceptions:
- IncrementalDataSet loads the data eagerly, so the values in the returned dictionary represent the actual data stored in the corresponding partition, rather than a pointer to the load function.
- IncrementalDataSet considers a partition relevant for processing if its ID satisfies the comparison function, given the checkpoint value.
- IncrementalDataSet does not raise a DataSetError if load finds no partitions to return - an empty dictionary is returned instead. An empty list of available partitions is part of a normal workflow for IncrementalDataSet.
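The checkpoint filtering described above can be sketched as a plain dictionary filter. select_partitions is a hypothetical helper, not Kedro code; it assumes that a missing or empty checkpoint means every partition is relevant, and it uses operator.gt as the comparison so that IDs are compared as strings.

```python
import operator
from typing import Any, Callable, Dict, Optional


def select_partitions(
    partitions: Dict[str, Any],
    checkpoint: Optional[str],
    comparison_func: Callable[[str, str], bool] = operator.gt,
) -> Dict[str, Any]:
    """Keep only the partitions whose ID is past the checkpoint."""
    if not checkpoint:
        # No checkpoint yet (or an empty one): every partition is relevant.
        return dict(partitions)
    return {
        partition_id: data
        for partition_id, data in sorted(partitions.items())
        if comparison_func(partition_id, checkpoint)
    }
```

When nothing is past the checkpoint the result is simply an empty dictionary, which mirrors the "no error on empty load" behaviour in the list above.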
Incremental dataset save¶
IncrementalDataSet save operation is identical to the save operation of the PartitionedDataSet.
Incremental dataset confirm¶
Note
The checkpoint value is not automatically updated by the fact that a new set of partitions was successfully loaded or saved.
The checkpoint update for a partitioned dataset is triggered by an explicit confirms instruction in one of the downstream nodes. This can be the same node that processes the partitioned dataset:
from kedro.pipeline import node

# process and then confirm `IncrementalDataSet` within the same node
node(
    process_partitions,
    inputs="my_partitioned_dataset",
    outputs="my_processed_dataset",
    confirms="my_partitioned_dataset",
)
Alternatively, confirmation can be deferred to one of the nodes downstream, allowing you to implement extra validations before the loaded partitions are considered successfully processed:
from kedro.pipeline import node, pipeline

pipeline(
    [
        node(
            func=process_partitions,
            inputs="my_partitioned_dataset",
            outputs="my_processed_dataset",
        ),
        # do something else
        node(
            func=confirm_partitions,
            # note that the node may not require 'my_partitioned_dataset' as an input
            inputs="my_processed_dataset",
            outputs=None,
            confirms="my_partitioned_dataset",
        ),
        # ...
        node(
            func=do_something_else_with_partitions,
            # will return the same partitions even though they were already confirmed
            inputs=["my_partitioned_dataset", "my_processed_dataset"],
            outputs=None,
        ),
    ]
)
Important notes about the confirmation operation:
Confirming a partitioned dataset does not affect any subsequent loads within the same run. All downstream nodes that take the same partitioned dataset as input will receive the same partitions. Partitions that are created externally during the run will also not affect the dataset loads and won’t appear in the list of loaded partitions until the next run or until the release() method is called on the dataset object.
A pipeline cannot contain more than one node confirming the same dataset.
Checkpoint configuration¶
IncrementalDataSet does not require explicit configuration of the checkpoint unless there is a need to deviate from the defaults. To update the checkpoint configuration, add a checkpoint key containing a valid dataset configuration. This may be required if, say, the pipeline has read-only permissions to the location of partitions (or write operations are undesirable for any other reason). In such cases, IncrementalDataSet can be configured to save the checkpoint elsewhere. The checkpoint key also supports partial config updates, where only some checkpoint attributes are overwritten while the defaults are kept for the rest:
my_partitioned_dataset:
  type: IncrementalDataSet
  path: s3://my-bucket-name/path/to/folder
  dataset: pandas.CSVDataSet
  checkpoint:
    # update the filepath and load_args, but keep the dataset type unchanged
    filepath: gcs://other-bucket/CHECKPOINT
    load_args:
      k1: v1
Special checkpoint config keys¶
Along with the standard dataset attributes, checkpoint config also accepts 2 special optional keys:
- comparison_func (defaults to operator.gt) - fully qualified import path to the function that will be used to compare a partition ID with the checkpoint value, to determine if a partition should be processed. Such a function must accept 2 positional string arguments - partition ID and checkpoint value - and return True if the partition is considered to be past the checkpoint. Specifying your own comparison_func may be useful if you need to customise the checkpoint filtration mechanism - for example, you may want to implement windowed loading, where you always want to load the partitions representing the last calendar month. See the example config specifying a custom comparison function:
my_partitioned_dataset:
  type: IncrementalDataSet
  path: s3://my-bucket-name/path/to/folder
  dataset: pandas.CSVDataSet
  checkpoint:
    comparison_func: my_module.path.to.custom_comparison_function  # the path must be importable
- force_checkpoint - if set, the partitioned dataset will use this value as the checkpoint instead of loading the corresponding checkpoint file. This might be useful if you need to roll back the processing steps and reprocess some (or all) of the available partitions. See the example config forcing the checkpoint value:
my_partitioned_dataset:
  type: IncrementalDataSet
  path: s3://my-bucket-name/path/to/folder
  dataset: pandas.CSVDataSet
  checkpoint:
    force_checkpoint: 2020-01-01/data.csv
Note
Specification of force_checkpoint is also supported via the shorthand notation as follows:
my_partitioned_dataset:
  type: IncrementalDataSet
  path: s3://my-bucket-name/path/to/folder
  dataset: pandas.CSVDataSet
  checkpoint: 2020-01-01/data.csv
Note
If you need to force the partitioned dataset to load all available partitions, set checkpoint to an empty string:
my_partitioned_dataset:
  type: IncrementalDataSet
  path: s3://my-bucket-name/path/to/folder
  dataset: pandas.CSVDataSet
  checkpoint: ""
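As an illustration of the comparison_func key described earlier in this section, here is a sketch of a custom comparison function that satisfies the stated contract: two positional string arguments in, a boolean out. The name compare_by_date_prefix is hypothetical, and the function assumes partition IDs shaped like 2020-01-01/data.csv, where the leading folder is an ISO date.

```python
def compare_by_date_prefix(partition_id: str, checkpoint: str) -> bool:
    """Return True if the partition's date folder is after the checkpoint's."""
    # ISO dates compare correctly as plain strings, so no parsing is needed.
    partition_date = partition_id.split("/", 1)[0]
    checkpoint_date = checkpoint.split("/", 1)[0]
    return partition_date > checkpoint_date
```

Unlike the default operator.gt, this ignores the file name, so two partitions from the same day are treated as equally old. To use such a function, its importable path would go into the comparison_func key of the checkpoint config.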