Kedro Basics
Learn Kedro in 5 days
All posts with the tag "kedro"
Looking at the release notes, I see one major feature improvement on the list: auto-discovery of hooks.
## Major features and improvements
* Enabled auto-discovery of hooks implementations coming from installed plugins.
This one comes as a bit of a surprise, as it was only casually mentioned in #435.
...
I am exploring a kedro catalog metadata hook; these are some notes about what I am thinking.
try pandas method -> try spark -> try dict/list -> none
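The fallback chain above could be sketched roughly like this. This is only a guess at the shape of the idea, not code from the hook itself: the function name `describe` and the keys of the returned dict are made up for illustration, and the pandas and spark checks are done by duck typing so that neither library needs to be installed.

```python
def describe(obj):
    """Return a small metadata dict for obj, or None if nothing applies.

    Hypothetical helper: tries pandas-like, then spark-like, then dict/list.
    """
    # 1. pandas-like objects expose a 2-tuple .shape and a .columns attribute
    try:
        n_rows, _n_cols = obj.shape
        return {"kind": "pandas", "rows": n_rows, "columns": list(obj.columns)}
    except (AttributeError, TypeError, ValueError):
        pass

    # 2. spark-like objects expose a zero-argument .count() and .columns
    try:
        return {"kind": "spark", "rows": obj.count(), "columns": list(obj.columns)}
    except (AttributeError, TypeError):
        pass

    # 3. plain python containers
    if isinstance(obj, (dict, list)):
        return {"kind": type(obj).__name__, "length": len(obj)}

    # 4. nothing matched
    return None
```

Each step only runs if the previous one failed, so adding another data type is just another `try` block before the final `return None`.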
Is there an easy way to create an in-memory NoSQL database from a list of dictionaries?
While using the catalog alone will not reap all of the benefits of the framework, it does get you and your project ready for the full framework eventually. For me, the full benefit of the catalog comes when you combine it with the pipeline and don’t touch read/write steps at all.
Taking a step into kedro by adopting the catalog first will give you a way to organize all of your data loads in one place, and stop manually writing read/write code, which can be different for each data and storage type. You just don’t need to think about it.
“can be dropped into kedro later” Let’s talk a bit more about that
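To make the "all of your data loads in one place" idea concrete, here is a toy stand-in written with only the standard library. It is not kedro's DataCatalog and does not use kedro's API; `TinyCatalog`, `LOADERS`, and the `type`/`filepath` entry keys are hypothetical names chosen to mimic the feel of a catalog entry.

```python
import csv
import json
from pathlib import Path


def _load_csv(path):
    # Read a csv file into a list of row dictionaries
    with path.open(newline="") as f:
        return list(csv.DictReader(f))


def _load_json(path):
    # Read a json file into python objects
    return json.loads(path.read_text())


# Hypothetical registry mapping a type name to a loader function
LOADERS = {"csv": _load_csv, "json": _load_json}


class TinyCatalog:
    """Toy catalog: one dictionary describes every dataset in the project."""

    def __init__(self, entries):
        self._entries = entries

    def load(self, name):
        # The caller only knows the dataset name, never the file format
        entry = self._entries[name]
        loader = LOADERS[entry["type"]]
        return loader(Path(entry["filepath"]))
```

With something like this, calling code says `catalog.load("cars")` and never touches the read logic, which is the habit that makes moving to the full framework later less painful.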
...
kedro 0.16.2 just dropped last week with a long-awaited feature… catalog search! I went as far as monkey patching this into each of my projects. I jump between a few really big projects that have tons of datasets. Being able to quickly search for what I need is so useful.
The kedro data catalog is a key component of the kedro framework. It handles all data loading and saving for you. It is configurable and hackable. Having all your data connections listed in one place makes it so easy to pick your project up and move it to a completely new environment. That sweet imperative loading style saves so much read/write overhead. I can load all my data with a single command whether it’s in Amazon S3, Google Cloud Platform, or a local file.
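For a feel of what searching a catalog amounts to, here is a plain python sketch that filters a list of dataset names with a regular expression. It does not call kedro; `search_names` is a hypothetical helper. As far as I know, the released feature is exposed as a `regex_search` argument on the catalog's `list` method, but treat that as an assumption and check the docs for your version.

```python
import re


def search_names(names, pattern):
    """Return the dataset names that match a case-insensitive regex."""
    regex = re.compile(pattern, flags=re.IGNORECASE)
    return [name for name in names if regex.search(name)]
```

For example, `search_names(["raw_cars", "int_cars", "raw_sales"], "^raw")` keeps only the two names that start with `raw`.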
Just like with most of these articles, I am going to create a conda environment so that I don’t break any existing projects and scaffold up a toy project to learn from.
...
Passing inputs into kedro is a key concept. How it accepts a single catalog key as input is fairly trivial and easy to make sense of, but passing a list or dictionary of catalog entries can be a bit confusing.
Check out this post for a review of how *args **kwargs work in python.
understanding python *args and **kwargs
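The three input shapes map onto ordinary python calling conventions: a single key becomes one argument, a list becomes `*args`, and a dictionary becomes `**kwargs`. The sketch below is not kedro's internals; `run_node` is a hypothetical helper, and the catalog is just a plain dictionary of already-loaded data.

```python
def run_node(func, inputs, catalog):
    """Call func with data looked up from catalog, based on the shape of inputs."""
    if isinstance(inputs, str):
        # single catalog key -> one positional argument
        return func(catalog[inputs])
    if isinstance(inputs, list):
        # list of catalog keys -> positional arguments, in order (*args)
        return func(*[catalog[key] for key in inputs])
    if isinstance(inputs, dict):
        # {argument_name: catalog_key} -> keyword arguments (**kwargs)
        return func(**{arg: catalog[key] for arg, key in inputs.items()})
    raise TypeError("inputs must be a str, list, or dict")
```

With a list, order matters. With a dictionary, the keys must match the function's argument names and the order no longer matters.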
...
🔥 #kedrotips use find-kedro to assemble your pipelines
🔥 #kedrotips hooks can be created using modules
** 0.3.0 just launched with _ support 🎉
kedro-static-viz is out with support for the newly released hooks feature. This means that you can have kedro-static-viz automatically deploy a full gatsby site on before_pipeline_run, keeping your visualization always up to date.
Even though it is a static site, there is no functionality lost. The only thing that’s missing is the flask server. With kedro-static-viz you can deploy your visualization to a number of static hosting providers, such as GitHub Pages, free of charge with wicked fast performance.
Even though it’s built on gatsbyjs the full site builds in under 2s even on slower hardware. This is because the site is already pre-rendered and stripped of any excess. It’s zipped up right into the python package and is typically used with the cli, but now can be used with python, or as a hook as well.
...
** building pipelines with _ today
This post is a 🧠 brainstorming work in progress. I will likely use it as a storage location/brain dump of hook ideas.
If you are completely unsure what kedro is, be sure to check out my what is kedro post.
Hooks are executed in reverse order of the hooks list.
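A tiny sketch of what "reverse order of the hooks list" means in practice. This is not kedro's hook manager; `call_hooks` is a hypothetical dispatcher that simply walks the list backwards, so the last hook in the list runs first.

```python
def call_hooks(hooks, method_name, **kwargs):
    """Call method_name on every hook object, last in the list first."""
    results = []
    for hook in reversed(hooks):
        method = getattr(hook, method_name, None)
        if method is not None:  # hooks may implement only some methods
            results.append(method(**kwargs))
    return results
```

If one hook depends on another having already run, the ordering of the list matters, so the dependent hook would need to come earlier in the list.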
...
Kedro provides an efficient way to build out data catalogs with their yaml api. It allows you to be very declarative about loading and saving your data. For the most part, you just need to tell Kedro which connector to use and its filepath. When running, Kedro takes care of all of the read/write; you just reference the catalog key.
Under the hood there is an AbstractDataSet that each connector inherits from. It sets up a lot of the behind-the-scenes structure for us so that we don’t have to. For the most part, kedro has connectors for just about anything that you want to load (csv, parquet, sql, json) from just about anywhere; http, s3, and the local file system are just some examples.
Here is a barebones DataSet implementation straight from their docs. Parameters from the yaml catalog will get passed in.
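The docs example itself is not reproduced in this excerpt, so here is a stand-in sketch of the general shape instead. It is written without importing kedro so it runs anywhere: a real connector would subclass AbstractDataSet, which supplies the public `load` and `save` for you, while here they are written out by hand. `JSONLinesDataSet` and its `filepath` parameter are hypothetical names for illustration.

```python
import json
from pathlib import Path


class JSONLinesDataSet:
    """Shape-only sketch of a custom dataset: _load, _save, and _describe."""

    def __init__(self, filepath):
        # Values from the catalog entry (for example filepath) arrive here
        self._filepath = Path(filepath)

    def _load(self):
        # One json document per non-empty line
        lines = self._filepath.read_text().splitlines()
        return [json.loads(line) for line in lines if line.strip()]

    def _save(self, data):
        self._filepath.write_text("\n".join(json.dumps(row) for row in data))

    def _describe(self):
        # Used for a readable representation of the dataset
        return {"filepath": str(self._filepath)}

    # In kedro the base class provides these public wrappers
    def load(self):
        return self._load()

    def save(self, data):
        self._save(data)
```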
Kedro Hooks Intro - kedro hooks are an exciting upcoming feature of kedro 0.16.0. They allow you to hook into catalog_created, pipeline_run, and node_run (nouns), with a before or after (adjective). This really reminds me of React’s lifecycle hooks, which let you hook into various states of React web components. This is going to make kedro so extendable by the community. I am super pumped to see what the community is able to do with this ability.
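To show the before/after pattern without needing kedro installed, here is a simplified simulation. The method names `before_node_run` and `after_node_run` follow the noun and adjective naming described above, but the signatures are cut down to a single `node_name` argument, and `TimingHooks` and `run_nodes` are hypothetical. In real kedro, as far as I know, hook methods are marked with the `hook_impl` decorator and receive more arguments.

```python
import time


class TimingHooks:
    """Records how long each node takes, using a before and an after hook."""

    def __init__(self):
        self.durations = {}
        self._started = {}

    def before_node_run(self, node_name):
        self._started[node_name] = time.perf_counter()

    def after_node_run(self, node_name):
        start = self._started.pop(node_name)
        self.durations[node_name] = time.perf_counter() - start


def run_nodes(nodes, hooks):
    """Toy runner: nodes is a dict of name -> zero-argument callable."""
    outputs = {}
    for name, func in nodes.items():
        hooks.before_node_run(node_name=name)
        outputs[name] = func()
        hooks.after_node_run(node_name=name)
    return outputs
```

The node functions know nothing about timing; all of the extra behavior lives in the hook object, which is what makes this style so easy to extend.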
...
This is a very rough idea for a kedro package to prevent the time lost when you get partway through a pipeline run only to realize that you don’t have access to data or resources.
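One very small version of this idea, for local files only, is to check every input path before the run starts. This is a guess at a starting point rather than the package itself; `unreadable_paths` is a hypothetical helper, and remote resources such as s3 or databases would need their own checks.

```python
import os


def unreadable_paths(paths):
    """Return the paths that are missing or not readable by the current user."""
    return [path for path in paths if not os.access(path, os.R_OK)]
```

Running a check like this first and failing when the list is non-empty surfaces access problems in seconds instead of partway through a long run.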
find-kedro is a small library to enhance your kedro experience. It looks through your modules to find kedro pipelines, nodes, and iterables (lists, sets, tuples) of nodes. It then assembles them into a dictionary of pipelines: each module creates a separate pipeline, with __default__ being a combination of all pipelines. This format is compatible with the kedro _create_pipelines format.
kedro is a ✨ fantastic project that allows for super-fast prototyping of data pipelines, while yielding production-ready pipelines. find-kedro enhances this experience by adding pytest-like node/pipeline discovery, eliminating the need to bubble up pipelines through modules.
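The discovery idea can be sketched in a few lines of plain python. This is not find-kedro's implementation: `Node` is a stand-in class used only here, `find_nodes` and `find_pipelines` are hypothetical names, and each "pipeline" is just a list of nodes rather than a kedro Pipeline object.

```python
from types import ModuleType


class Node:
    """Stand-in for a kedro node, used only for this sketch."""

    def __init__(self, name):
        self.name = name


def find_nodes(module):
    """Collect Node objects, and Nodes inside lists/sets/tuples, from a module."""
    found = []
    for value in vars(module).values():
        if isinstance(value, Node):
            found.append(value)
        elif isinstance(value, (list, set, tuple)):
            found.extend(item for item in value if isinstance(item, Node))
    return found


def find_pipelines(modules):
    """Build {module_name: nodes} plus a __default__ entry combining them all."""
    pipelines = {module.__name__: find_nodes(module) for module in modules}
    pipelines["__default__"] = [
        node for nodes in pipelines.values() for node in nodes
    ]
    return pipelines
```

Nothing has to be imported upward or registered by hand; defining a node in a module is enough for it to be picked up.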
...
This is a quickstart to getting a new kedro pipeline up and running. After this article, you should understand how to get started with kedro. You can learn more about this Hello World Example in the docs.
🧹 Install
...
This is my original what-is-kedro article. There is a brand new one.
Kedro is an open source data pipeline framework. It provides guardrails to set your project up right from the start without needing to know deeply how to set up your own python library for data pipelining. It includes really great ways to manipulate catalogs and pipelines. This article will cover the 10K view of kedro; future articles will dive deeper into each one.
...
See all of my kedro related posts in [[ tag/kedro ]].
I am tweeting out most of these snippets as I add them, you can find them all here #kedrotips.
Below are some quick snippets/notes for when using kedro to build data pipelines. So far I am just compiling snippets. Eventually I will create several posts on kedro. These are mostly things that I use in my everyday work with kedro. Some are a bit more esoteric. Some are helpful when writing production code, some are more useful for exploration.
...