I keep my nodes short and sweet. They do one thing and do it well. I
turn almost every DataFrame transformation into its own node. It makes
it must easier to pull catalog entries, than firing up the pipeline,
running it, and starting a debugger. For this reason many of my nodes
can be built from inline lambdas.
Examples # [1]
Here are two examples, the first one lambda x: x is sometimes referred
to as an identity function. This is super common to use in the early
phases of a project. It lets you follow standard layering conventions,
without skipping a layer, overthinking if you should have the layer or
not, and leaves a good placholder to fill in later when you need it.
Many times I just want to get the data in as fast as possible, learn
about it, then go back and tidy it up.
from kedro.pipeline import node
my_first_node = node(
func=lambda x: x,
inputs='raw_cars',
output='int_cars',
tags=['int',]
)
my_first_node = node(
func=lambda cars: cars[['mpg', 'cyl', 'disp',]].query('disp>200'),
inputs='raw_cars',
output='int_cars',
tags=['pri',]
)
Note: try not to take the idea of a one liner too far. If your
one line function wraps several lines down it probably deserv...