kedro.io.DataCatalogWithDefault
-
class kedro.io.DataCatalogWithDefault(data_sets=None, default=None, remember=False)
A DataCatalog with a default DataSet implementation for any data set which is not registered in the catalog.
Methods
add(data_set_name, data_set[, replace]) – Adds a new AbstractDataSet object to the DataCatalog.
add_all(data_sets[, replace]) – Adds a group of new data sets to the DataCatalog.
add_feed_dict(feed_dict[, replace]) – Adds instances of MemoryDataSet, containing the data provided through feed_dict.
add_transformer(transformer[, data_set_names]) – Add a DataSet Transformer to the DataCatalog.
confirm(name) – Confirm a dataset by its name.
exists(name) – Checks whether registered data set exists by calling its exists() method.
from_config(catalog[, credentials, …]) – To create a DataCatalogWithDefault from configuration, please use from_data_catalog together with DataCatalog.from_config.
from_data_catalog(data_catalog, default) – Convenience factory method to create a DataCatalogWithDefault from a DataCatalog.
list([regex_search]) – List of all DataSet names registered in the catalog.
load(name[, version]) – Loads a registered data set.
release(name) – Release any cached data associated with a data set.
save(name, data) – Save data to a registered data set.
shallow_copy() – Returns a shallow copy of the current object.
-
__init__(data_sets=None, default=None, remember=False)
DataCatalogWithDefault is deprecated and will be removed in Kedro 0.18.0. A DataCatalog with a default DataSet implementation for any data set which is not registered in the catalog.
- Parameters
data_sets (Optional[Dict[str, AbstractDataSet]]) – A dictionary of data set names and data set instances.
default (Optional[Callable[[str], AbstractDataSet]]) – A callable which accepts a single argument of type string, the key of the data set, and returns an AbstractDataSet. load and save calls on data sets which are not registered to the catalog will be delegated to this AbstractDataSet.
remember (bool) – If True, then store in the catalog any AbstractDataSets provided by the default callable argument. Useful when one wants to transition from a DataCatalogWithDefault to a DataCatalog: just call DataCatalogWithDefault.to_yaml after all required data sets have been saved/loaded, and use the generated YAML file with a new DataCatalog.
- Raises
TypeError – If default is not a callable.
Example:
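    from kedro.extras.datasets.pandas import CSVDataSet
    from kedro.io import DataCatalogWithDefault

    def default_data_set(name):
        return CSVDataSet(filepath='data/01_raw/' + name)

    # names not registered in the catalog are resolved through default_data_set
    io = DataCatalogWithDefault(data_sets={}, default=default_data_set)

    # load the file in data/01_raw/cars.csv
    df = io.load("cars.csv")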
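The delegation described above can be illustrated without Kedro installed. The following is a minimal sketch, not Kedro's implementation: `SketchCatalogWithDefault` and `InMemoryDataSet` are hypothetical stand-ins, and the sketch assumes that an unregistered name is passed to `default` and that the resulting data set is stored only when `remember` is true.

```python
from typing import Any, Callable, Dict, List, Optional


class InMemoryDataSet:
    """Hypothetical stand-in for an AbstractDataSet; keeps data in memory."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._data: Any = None

    def load(self) -> Any:
        return self._data

    def save(self, data: Any) -> None:
        self._data = data


class SketchCatalogWithDefault:
    """Illustrative only: falls back to `default` for unregistered names."""

    def __init__(
        self,
        data_sets: Optional[Dict[str, InMemoryDataSet]] = None,
        default: Optional[Callable[[str], InMemoryDataSet]] = None,
        remember: bool = False,
    ) -> None:
        if not callable(default):
            # Mirrors the documented TypeError for a non-callable default.
            raise TypeError("default must be a callable")
        self._data_sets = dict(data_sets or {})
        self._default = default
        self._remember = remember

    def _get(self, name: str) -> InMemoryDataSet:
        if name in self._data_sets:
            return self._data_sets[name]
        data_set = self._default(name)  # build a data set on demand
        if self._remember:
            self._data_sets[name] = data_set  # register it for later calls
        return data_set

    def save(self, name: str, data: Any) -> None:
        self._get(name).save(data)

    def load(self, name: str) -> Any:
        return self._get(name).load()

    def names(self) -> List[str]:
        return list(self._data_sets)


catalog = SketchCatalogWithDefault(default=InMemoryDataSet, remember=True)
catalog.save("cars", [1, 2, 3])
loaded = catalog.load("cars")
```

In this sketch, `remember=True` means the data set created during `save` is the one reused by `load`. With `remember=False`, every call builds a fresh data set from `default`, which works for file-backed data sets such as the `CSVDataSet` above but not for purely in-memory ones.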
-
add(data_set_name, data_set, replace=False)
Adds a new AbstractDataSet object to the DataCatalog.
- Parameters
data_set_name (str) – A unique data set name which has not been registered yet.
data_set (AbstractDataSet) – A data set object to be associated with the given data set name.
replace (bool) – Specifies whether replacing an existing DataSet with the same name is allowed.
- Raises
DataSetAlreadyExistsError – When a data set with the same name has already been registered.
Example:
    from kedro.extras.datasets.pandas import CSVDataSet
    from kedro.io import DataCatalog

    io = DataCatalog(data_sets={
        'cars': CSVDataSet(filepath="cars.csv")
    })

    io.add("boats", CSVDataSet(filepath="boats.csv"))
- Return type
None
-
add_all(data_sets, replace=False)
Adds a group of new data sets to the DataCatalog.
- Parameters
data_sets (Dict[str, AbstractDataSet]) – A dictionary of DataSet names and data set instances.
replace (bool) – Specifies whether replacing an existing DataSet with the same name is allowed.
- Raises
DataSetAlreadyExistsError – When a data set with the same name has already been registered.
Example:
    from kedro.extras.datasets.pandas import CSVDataSet, ParquetDataSet
    from kedro.io import DataCatalog

    io = DataCatalog(data_sets={
        "cars": CSVDataSet(filepath="cars.csv")
    })
    additional = {
        "planes": ParquetDataSet("planes.parq"),
        "boats": CSVDataSet(filepath="boats.csv")
    }

    io.add_all(additional)

    assert io.list() == ["cars", "planes", "boats"]
- Return type
None
-
add_feed_dict(feed_dict, replace=False)
Adds instances of MemoryDataSet, containing the data provided through feed_dict.
- Parameters
feed_dict (Dict[str, Any]) – A feed dict with data to be added in memory.
replace (bool) – Specifies whether replacing an existing DataSet with the same name is allowed.
Example:
    import pandas as pd
    from kedro.io import DataCatalog

    df = pd.DataFrame({'col1': [1, 2],
                       'col2': [4, 5],
                       'col3': [5, 6]})

    io = DataCatalog()
    io.add_feed_dict({
        'data': df
    }, replace=True)

    assert io.load("data").equals(df)
- Return type
None
-
add_transformer(transformer, data_set_names=None)
Add a DataSet Transformer to the DataCatalog. Transformers can modify the way Data Sets are loaded and saved.
- Parameters
transformer (AbstractTransformer) – The transformer instance to add.
data_set_names (Union[str, Iterable[str], None]) – The Data Sets to add the transformer to, or None to add the transformer to all Data Sets.
- Raises
DataSetNotFoundError – When a transformer is being added to a non-existent data set.
TypeError – When transformer isn’t an instance of AbstractTransformer.
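To make "modify the way Data Sets are loaded and saved" more concrete, here is a minimal sketch of a transformer-style wrapper. `RecordingTransformer` is a hypothetical class that does not subclass Kedro's `AbstractTransformer`; the `load(data_set_name, load)` and `save(data_set_name, save, data)` hook shapes are an assumption modelled on that interface and should be checked against the Kedro version in use.

```python
from typing import Any, Callable, Dict, List


class RecordingTransformer:
    """Hypothetical transformer: records each load/save call it wraps."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def load(self, data_set_name: str, load: Callable[[], Any]) -> Any:
        self.calls.append(f"load:{data_set_name}")
        return load()  # delegate to the wrapped load

    def save(
        self, data_set_name: str, save: Callable[[Any], None], data: Any
    ) -> None:
        self.calls.append(f"save:{data_set_name}")
        save(data)  # delegate to the wrapped save


# A plain dict stands in for the underlying storage of a data set.
storage: Dict[str, Any] = {}
transformer = RecordingTransformer()

transformer.save("cars", lambda data: storage.__setitem__("cars", data), [1, 2])
result = transformer.load("cars", lambda: storage["cars"])
```

The wrapper receives the original load or save callable and decides when to invoke it, which is what lets a transformer add behaviour such as timing, logging, or validation around the I/O call.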
-
confirm(name)
Confirm a dataset by its name.
- Parameters
name (str) – Name of the dataset.
- Raises
DataSetError – When the dataset does not have a confirm method.
- Return type
None
-
exists(name)
Checks whether registered data set exists by calling its exists() method. Raises a warning and returns False if exists() is not implemented.
- Parameters
name (str) – A data set to be checked.
- Return type
bool
- Returns
Whether the data set output exists.
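The warn-and-return-False behaviour can be sketched in plain Python. This is an illustration under stated assumptions rather than Kedro's code: `sketch_exists`, `NoExistsDataSet`, and `FileLikeDataSet` are hypothetical names, and the sketch works on a data set object directly instead of looking it up by name in a catalog.

```python
import warnings


class NoExistsDataSet:
    """Hypothetical data set that does not implement exists()."""


class FileLikeDataSet:
    """Hypothetical data set that reports whether its output is present."""

    def __init__(self, present: bool) -> None:
        self._present = present

    def exists(self) -> bool:
        return self._present


def sketch_exists(data_set: object) -> bool:
    """Return data_set.exists(), or warn and return False if it is missing."""
    check = getattr(data_set, "exists", None)
    if not callable(check):
        warnings.warn(
            f"exists() not implemented for {type(data_set).__name__}"
        )
        return False
    return bool(check())
```

Returning False instead of raising lets callers treat "cannot tell" the same way as "not there", at the cost of hiding the difference unless warnings are inspected.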
-
classmethod from_config(catalog, credentials=None, load_versions=None, save_version=None, journal=None)
To create a DataCatalogWithDefault from configuration, please use:
    DataCatalogWithDefault.from_data_catalog(
        DataCatalog.from_config(catalog, credentials))
- Parameters
catalog (Optional[Dict[str, Dict[str, Any]]]) – See DataCatalog.from_config
credentials (Optional[Dict[str, Dict[str, Any]]]) – See DataCatalog.from_config
load_versions (Optional[Dict[str, str]]) – See DataCatalog.from_config
save_version (Optional[str]) – See DataCatalog.from_config
journal (Optional[Journal]) – See DataCatalog.from_config
- Raises
ValueError – If you try to instantiate a DataCatalogWithDefault directly with this method.
-
classmethod from_data_catalog(data_catalog, default)
Convenience factory method to create a DataCatalogWithDefault from a DataCatalog. A DataCatalog with a default DataSet implementation for any data set which is not registered in the catalog.
- Parameters
data_catalog (DataCatalog) – The DataCatalog to convert to a DataCatalogWithDefault.
default (Callable[[str], AbstractDataSet]) – A callable which accepts a single argument of type string, the key of the data set, and returns an AbstractDataSet. load and save calls on data sets which are not registered to the catalog will be delegated to this AbstractDataSet.
- Return type
DataCatalogWithDefault
- Returns
A new DataCatalogWithDefault which contains all the AbstractDataSets from the provided data catalog.
-
list(regex_search=None)
List of all DataSet names registered in the catalog. This can be filtered by providing an optional regular expression which will only return matching keys.
- Parameters
regex_search (Optional[str]) – An optional regular expression which can be provided to limit the data sets returned by a particular pattern.
- Return type
List[str]
- Returns
A list of DataSet names available which match the regex_search criteria (if provided). All data set names are returned by default.
- Raises
SyntaxError – When an invalid regex filter is provided.
Example:
    from kedro.io import DataCatalog

    io = DataCatalog()
    # get data sets where the substring 'raw' is present
    raw_data = io.list(regex_search='raw')
    # get data sets which start with 'prm' or 'feat'
    feat_eng_data = io.list(regex_search='^(prm|feat)')
    # get data sets which end with 'time_series'
    models = io.list(regex_search='.+time_series$')
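Because the example above runs against an empty catalog, it returns empty lists. The sketch below applies the same three patterns to a hard-coded list of names using only the standard `re` module. `filter_names` is a hypothetical helper, not part of Kedro, and it assumes substring-style matching (`re.search`), which is consistent with the "substring 'raw' is present" comment above.

```python
import re
from typing import List, Optional


def filter_names(names: List[str], regex_search: Optional[str] = None) -> List[str]:
    """Return the names matching regex_search, or all names if it is None."""
    if regex_search is None:
        return list(names)
    try:
        pattern = re.compile(regex_search)
    except re.error as exc:
        # The documented method raises SyntaxError for an invalid filter.
        raise SyntaxError(f"Invalid regular expression: {regex_search!r}") from exc
    return [name for name in names if pattern.search(name)]


# Illustrative data set names; these are not taken from a real project.
names = ["raw_cars", "prm_cars", "feat_speed", "sales_time_series"]

raw_data = filter_names(names, "raw")
feat_eng_data = filter_names(names, "^(prm|feat)")
models = filter_names(names, ".+time_series$")
```

Anchors matter here: `'raw'` matches anywhere in the name, while `^` and `$` pin the pattern to the start or end of the name.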
-
load(name, version=None)
Loads a registered data set.
- Parameters
name (str) – A data set to be loaded.
version (Optional[str]) – Optional version to be loaded.
- Return type
Any
- Returns
The loaded data as configured.
- Raises
DataSetNotFoundError – When a data set with the given name has not yet been registered.
-
release(name)
Release any cached data associated with a data set.
- Parameters
name (str) – A data set to be checked.
- Raises
DataSetNotFoundError – When a data set with the given name has not yet been registered.
-
save(name, data)
Save data to a registered data set.
- Parameters
name (str) – A data set to be saved to.
data (Any) – A data object to be saved as configured in the registered data set.
- Raises
DataSetNotFoundError – When a data set with the given name has not yet been registered.
-