reasons-to-kedro ================ Date: November 1, 2020 There are many reasons that you should be using kedro. If you are on a team of Data Scientists/Data Engineers processing DataFrames from many data sources should be considering a pipeline framework. Kedro is a great option that provides many benefits for teams to collaborate, develop, and deploy data pipelines What is Kedro [1] ## Starter Template Kedro makes it super easy to get started with their cli that utilizes cookiecutter under the hood. ``` bash conda create -n my-new-project -y python=3.8 kedro new kedro install kedro run ``` Create New Kedro Project [2] > read more about how to start your first kedro project here ## Collaboration Kedro provides many tools that help teams collaborate on a single codebase. While writing monolithic scripts it can be easy to pin yourself in a corner where it is difficult to have multiple people making changes to the notebook/script at the same time. Kedro helps guide your team to break your project down into small pieces that different members of the team can work on in parallel. ### sharable catalog Kedro makes it easy to collaborate with members who aren't even working on the pipeline. I often see team members who want to investigate datasets from different points in the pipeline. Kedro makes it really easy for them to load it into python. **for python users** Share catalog entries with folks doing EDA. ``` python catalog.load('main_table') ``` **for non-python users** For those who may not be using python, we can easily kick out a CSV version of that `main_table` that they can get from s3 or your cloud storage solution of choice. ``` yaml master_table: type: pandas.CSVDataSet filepath: s3://bucket/data/03_primary/master_table.csv layer: primary ``` **for the SQL folks** We aren't even constrained to those who only use python or excel, we can kick out any kind of dataset that python can output. Kedro even comes with many DataSet types out of the box so that we don't have to write any read/write code. ``` yaml master_table: type: SQLTableDataSet table_name: master_table credentials: postgres ``` ### small nodes over monolithic scripts As I said before single notebooks/scripts are really hard to collaborate on. I have seen Data Engineers sitting idle waiting to get their changes manually added into the master notebook. When you find yourself in this situation, find a better solution. It's time to break things down into individual modules and utilize a version control system that can automatically merge changes in. Kedro encourages the use of git version control and storing all node functions inside of modules while still making it really easy to load data into a notebook/shell and start trying out new things. ## No More read and write code As I said earlier kedro comes with datasets for the most popular output formats. It is also backed by a really amazing library called `fsspec`, this library makes the filesystem that you are storing agnostic to how you write to it. This means that the kedro library utilizes `fsspec` under the hood and writes to the file as if it was to disk, but based on the prefix to the file it may actually be writing to the local filesystem, gcp, azure blob, or s3. **custom DataSets** If kedro does not have a `DataSet` for the format that you need to read or write you can easily create your own custom `DataSet` all you need to do is inherit from `kedro.io.AbstractDataSet` and create methods for `__init__`, `_load`, `_save`, `_exists`, and `_describe`. Check out this example from their docs. I removed the docstrings for brevity, you can see the entire `DataSet` in their [docs](https://kedro.readthedocs.io/en/0.15.2/03_tutorial/03_set_up_data.html?highlight=custom%20dataset#creating-custom-datasets). > The complete example all in one was only available in an older version, more up to date [docs](https://kedro.readthedocs.io/en/0.16.6/07_extend_kedro/01_custom_datasets.html?highlight=custom%20dataset) have a good writeup that walks through everything separately. ``` python from os.path import isfile from typing import Any, Union, Dict import pandas as pd from kedro.io import AbstractDataSet class ExcelLocalDataSet(AbstractDataSet): def _describe(self) -> Dict[str, Any]: return dict(filepath=self._filepath, engine=self._engine, load_args=self._load_args, save_args=self._save_args) def __init__( self, filepath: str, engine: str = "xlsxwriter", load_args: Dict[str, Any] = None, save_args: Dict[str, Any] = None, ) -> None: self._filepath = filepath default_save_args = {} default_load_args = {"engine": "xlrd"} self._load_args = {**default_load_args, **load_args} \ if load_args is not None else default_load_args self._save_args = {**default_save_args, **save_args} \ if save_args is not None else default_save_args self._engine = engine def _load(self) -> Union[pd.DataFrame, Dict[str, pd.DataFrame]]: return pd.read_excel(self._filepath, **self._load_args) def _save(self, data: pd.DataFrame) -> None: writer = pd.ExcelWriter(self._filepath, engine=self._engine) data.to_excel(writer, **self._save_args) writer.save() def _exists(self) -> bool: return isfile(self._filepath) ``` ## Execution order is taken care of As you build up complex pipelines containing 10's or 100's of nodes it becomes difficult to splice in new nodes/steps without messing up or a framework to help. Kedro simply needs a set of nodes that each takes in catalog entries as input and output to catalog entries and it will figure out the order for you. These nodes can be made for one-off purposes, take in functions from reusable libraries, and even be dynamically generated from a configuration. There is no need to worry about hand curating the execution order, that's all taken care of. ## Easily slice up a pipeline Since kedro is a DAG that takes in a pile of nodes and figures out all of the dependencies for you it knows a lot about your pipeline. You can slice it up to only the specific pieces that you need. ``` python # single nodes pipeline.only_nodes("node1") # single nodes and all of thier dependencies pipeline.to_nodes("node1", "node2") # from a dataset to all of its dependants pipeline.from_inputs("dataset1", "dataset2") # to a an outputs with all of its dependencies pipeline.to_outputs("dataset6", "dataset7") ``` ## plugins/hooks Creating your own modifications to how kedro behaves is made really simple through the use of hooks. There are several hooks that happen at different points in the kedro lifecycle. For instance, you can hook in before pipeline run or after pipeline run to do whatever your project needs. creating the kedro-preflight hook [3] ### pip install plugin There is a growing list of plugins available from pypi that is only a `pip install` away. Most of them are on [GitHub](https://github.com/topics/kedro-plugin) and tagged as a [kedro-plugin](https://github.com/topics/kedro-plugin) topic. ## flexible cli In the end, you have a cli for your project that can run your pipeline in all sorts of cool ways since it knows about each node's dependencies. This makes running and scheduling production a breeze. ``` bash # single nodes kedro run --node node1 # single nodes and all of their dependencies kedro run --to-nodes node1,node2 # from a dataset to all of its dependents kedro run --from-inputs dataset1,dataset2 # to outputs with all of their dependencies kedro run --to-outputs dataset6,dataset7 ``` ## Try it out Hopefully this post gave you the inspiration to get started today, if it did `pip install kedro` and run `kedro new` to try it out. References: [1]: /what-is-kedro/ [2]: /create-new-kedro-project/ [3]: /creating-the-kedro-preflight-hook/