kedro catalog create ==================== I use to boost my productivity by automatically generating yaml catalog entries for me. It will create new yaml files for each pipeline, fill in missiing... Date: November 15, 2021 I use `kedro catalog create` to boost my productivity by automatically generating yaml catalog entries for me. It will create new yaml files for each pipeline, fill in missiing catalog entries, and respect already existing catalog entries. It will reformat the file, and sort it based on catalog key. [https://youtu.be/_22ELT4kja4](https://youtu.be/_22ELT4kja4){.youtube-embed} What is Kedro [1] > 👆 Unsure what kedro is? Check out this post. ## Running Kedro Catalog Create The command to ensure there are catalog entries for every dataset in the passed in pipeline. ``` bash kedro catalog create --pipeline history_nodes ``` * Create's new yaml file, if needed * Fills in new dataset entries with the default dataset * Keeps existing datasets untouched * it will reformat your yaml file a bit * default sorting will be applied * empty newlines will be removed ## CONF_ROOT Kedro will respect your `CONF_ROOT` settings when it creates a new catalog file, or looks for existing catalog files. You can change the location of your configuration files by editing your `CONF_ROOT` variable in your projects. `settings.py`. ``` python # settings.py # default settings CONF_ROOT = "conf" # I like to package my configuration CONF_ROOT = str(Path(__file__).parent / "conf") ``` > I prefer to keep my configuration packaged inside of my project. This is > partly due to how my team operates and deploys pipelines. ## File Location The `kedro catalog create` command will look for a `yaml` file based on the name of the pipeline (`CONF_ROOT/catalog/.yml`). If it does not find one it will create one and make entries for each dataset in the pipeline. It will not look in all of your existing catalog files for entries, only the one in the exact file for your pipeline. If you are going to use this command its important that you follow this pattern or copy what it generates into your own catalog file of choice. > ⚠️ It will not look in all of your existing catalog files for entries, only the one in the exact file for your pipeline. ## MemoryDataSet's When you run `kedro catalog create` you get `MemoryDataSet`, that's it. As of `0.17.4` its hard coded into the library and not configurable. ``` yaml range12: type: MemoryDataSet ``` ## Your free to use what you want though Let's switch this dataset over to a `pandas.CSVDataSet` so that the file gets stored and we can pick up and read the file without re-running the whole pipeline. ``` yaml range12: type: pandas.CSVDataSet filepath: data/range12.csv ``` ## Continue adding nodes As we work we will keep adding nodes to our kedro pipeline, in this case we added another node that created a dataset called `range13`. ``` bash kedro catalog create --pipeline history_nodes ``` After telling kedro to create new catalog entries for us we will see that it left our `range12` entry alone and created `range13` for us. ``` yaml range12: type: pandas.CSVDataSet filepath: data/range12.csv range13: type: MemoryDataSet ``` ## Formatting is not worthwhile If we decide this is too cramped for us we could add some space between datasets. The next time we run `kedro catalog create` empty lines will be removed. ``` yaml range12: type: pandas.CSVDataSet range13: type: MemoryDataSet ``` ## Continuing to work If we coninue adding new nodes, and tell kedro to create catalog entries again, all of our effort given to formatting will be lost. I wouldn't worry about it unless you have an autoformatter that you can run on your yaml files. The productivity gains in an semi-automated catalog are worth it. ``` yaml range12: type: pandas.CSVDataSet filepath: data/range12.csv range121: type: MemoryDataSet range13: type: MemoryDataSet ``` ## Sorting Order Notice the sorting order in the last entry, `range121` comes before `range13`. This is all based on how pythons `yaml.safe_dump` works, kedro has set the `default_flow_style` to `False`. You can see where they write your file in the source code currently [here](https://github.com/kedro-org/kedro/blob/master/kedro/framework/cli/catalog.py#L202) References: [1]: /what-is-kedro/