Posts tagged: data

All posts with the tag "data"

Zev Averbach Interview

Zev Averbach, Frustrated spreadsheet jockey to software developer at 36

Q: Tell me about your journey as a spreadsheet jockey into Data Engineering?

A: First of all, it’s hilarious that I accidentally found your questions for this interview by Googling myself. 😊

...

Minimal Kedro Pipeline

How small can a minimum kedro pipeline ready to package be? I made one within 4 files that you can pip install. It’s only a total of 35 lines of python, 8 in setup.py and 27 in mini_kedro_pipeline.py.

šŸ“ Note this is only a composable pipeline, not a full project, it does not contain a catalog or runner.

I have everything for this post hosted in this gihub repo, you can fork it, clone it, or just follow along.

...

Kedro - My Data Is Not A Table

In python data science/engineering most of our data is in the form of some sort of table, typically a DataFrame from a library like pandas, spark, or dask.

These containers for data contain many convenient methods to manipulate table like data structures. Sometimes we leverage other data types, namely vanilla types like lists and dicts, or even numpy data types.

What is Kedro

...

Gracefully adopt kedro, the catalog

While using the catalog alone will not reap all of the benefits of the framework, it does get you and your project ready for the full framework eventually. For me the full benefit of the catalog comes when you combine it with the pipeline and dont even touch read/write steps at all.

Taking a step into kedro by adopting the catalog first will give you a way to organize all of your data loads in one place, and stop manually writing read/write code, which can be different for each data and storage type. You just don’t need to think about it.

ā€œcan be dropped into kedro laterā€ Let’s talk a bit more about that

...

How to find things in your kedro catalog

kedro 0.16.2 just dropped last week with a long-awaited feature… catalog search! I went as far as monkey patching this into each of my projects. I work jump between a few really big projects that have tons of datasets. Being able to quickly search for what I need is so useful.

The kedro data catalog is a key component to the kedro framework. It handles all data loading and saving for you. It is configurable and hackable. Having all your data connections listed in one place make it so easy to pick your project up and move it to a completely new environment. That sweet imperative loading style saves so much read/write overhead. I can load all my data with a single command whether it’s in amazon s3, google cloud platform, or a local file.

Just like with most of these articles, I am going to create a conda environment so that I don’t break any existing projects and scaffold up a toy project to learn from.

...

011

Load _ from database into **

1 min

010

load remote _ with **

1 min

009

Combine a directory of _ with **

1 min

Create Custom Kedro Dataset

Kedro provides an efficient way to build out data catalogs with their yaml api. It allows you to be very declaritive about loading and saving your data. For the most part you just need to tell Kedro what connector to use and its filepath. When running Kedro takes care of all of the read/write, you just reference the catalog key.

Under the hood there is an AbstractDataSet that each connector inherits from. It sets up a lot of the behind the scenes structure for us so that we dont have to. For the most part kedro has connectors for about anything that you want to load, csv, parquet, sql, json, from about anywhere, http, s3, localfile system are just some of the examples.

Here is a DataSet implementation from their docs. Here you can see the barebones example straight from the docs. Parameters from the yaml catalog will get passed in

What is YOUR Advice for New Data Scientists

You dont have to start out as a git wizard with the cleanest possible commit history. At first dont let yourself get too wrapped up in it, the most important part is that you make commits. You will find needs for more advanced stuff later.

git add . git commit -m "FEAT added new function to calculate revenue by product family" git push

Get comfortable with this, then learn how to branch, rebase, stash, etc…

Get the job done. Keep it in small bite size pieces. Make readable function definitions and variable names. You will thank yourself for naming things well later. Readability counts more than performance in most cases of data science. If it gets the job done try not to over worry about things like performance. A few extra seconds to clean a dataset or build a model is not worth hours of your time. As you go you will have cases that performance is more critical and you will learn what to do from the start to avoid them.

1 min read

Filtering Pandas

Good for method chaining, i.e. adding more methods or filters without assigning a new variable.

# is skus.query('AVAILABILITY == " AVAILABLE"') # is not skus.query('AVAILABILITY != " AVAILABLE"')

masking ¶ #

general purpose, this is probably the most common method you see in training/examples

# is skus[skus['AVAILABILITY'] == 'AVAILABLE'] # is not skus[~skus['AVAILABILITY'] == 'AVAILABLE']

isin ¶ #

capable of including multiple strings to include

...

Clean up Your Data Science with Named Tuples

If you are a regular listener of TalkPython or PythonBytes you have hear Michael Kennedy talk about Named Tuples many times, but what are they and how do they fit into my data science workflow.

As you graduate your scripts into modules and libraries you might start to notice that you need to pass a lot of data around to all of the functions that you have created. For example if you are running some analysis utilizing sales, inventory, and pricing data. You may need to calculate total revenue, inventory on hand. You may need to pass these data sets into various models to drive production or pricing based on predicted volumes.

Here we setup functions that can load data from the sales database. Assume that we also have similar functions to get_inventory and get_pricing.

...