Posts tagged: python

Blog Data With Python

Generating an api for a blog is much simpler than one might expect with python. Markdown # [1] Frontmatter # [2] Fill in the blanks # [3] fix missing data Fast # [4] References: [1]: #markdown [2]: #frontmatter [3]: #fill-in-the-blanks [4]: #fast

Kedro - My Data Is Not A Table

In python data science/engineering most of our data is in the form of some sort of table, typically a DataFrame from a library like pandas, spark, or dask. DataFrames are the heart of most pipelines # [1] These containers for data contain many convenient methods to manipulate table-like data structures. Sometimes we leverage other data types, namely vanilla types like lists and dicts, or even numpy data types. What is Kedro [2] If you are unfamiliar with kedro, check out this post. Sometimes datasets are not tables # [3] There are times when our data doesn’t fit nicely into a DataFrame. Lucky for us Kedro has pickle support out of the box. Pickle is a way to store almost any python object to disk. Beware that pickle files coming from an unknown source can run malicious code and are considered unsafe. For the most part though when you read and write your own pickle files they are a good tool to consider. See more about pickle [4] from python.org. Cataloging Pickle # [5] I may have a dictionary ...
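
As a rough illustration of the idea above, here is a minimal sketch of writing a dictionary to disk with pickle and reading it back, using only the standard library. The dictionary contents and the file name `data.pkl` are made up for the example, and this is plain `pickle`, not Kedro's own catalog entry.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical non-tabular data: a plain dictionary (contents are made up).
data = {"model_name": "baseline", "thresholds": [0.1, 0.5, 0.9]}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data.pkl"  # assumed file name, for illustration only

    # Write the object to disk.
    with path.open("wb") as f:
        pickle.dump(data, f)

    # Read it back. Only do this with pickle files you created yourself.
    with path.open("rb") as f:
        restored = pickle.load(f)

print(restored == data)
```

The round trip returns an equal object, which is what makes pickle handy for data that is not a table.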

Quickly Change Conda Env With Fzf

Changing conda environments is a bit verbose, so I use a function with fzf that both lists environments and selects the one I want in one go. Conda # [1] I have used conda as a virtual environment [2] tool for years now. I started using conda for its simplicity to install packages on Windows, but now that has gotten so much better and it’s been years since I have run a conda install command. I’m sure that I could use a different environment manager, but it works for me and makes sense. What environment manager do you use for python? Conda environments are stored in a central location such as ~/miniconda3/envs/ and not with the project. They contain both the python interpreter and packages for that env. Conda create # [3] Conda environments are created with the conda create command. At this point, you will need to name your env and select the python version. conda create -n my_env python=3.8 After running this command you will have a directory ~/miniconda3/envs/my_env with a base...

Minimal Python Package

What does it take to create an installable python package that can be hosted on pypi? What is the minimal python package # [1] - setup.py - my_module.py This post is somewhat inspired by the bottle framework, which is famously created as a single python module. Yes, a whole web framework is written in one file. Directory structure # [2] . ├── setup.py └── my_pipeline.py setup.py # [3] from setuptools import setup setup( name="", version="0.1.0", py_modules=["my_pipeline", ], install_requires=["kedro"], ) name # [4] The name of the package can contain any letters, numbers, “_”, or “-”. Even if it’s for internal/personal consumption only I usually check for discrepancy with pypi so that you don’t run into conflicts. Note that pypi treats “-” and “_” as the same thing, so beware of name clashes. version # [5] This is the version number of your package. Most packages follow semver [6]. At a high level it’s three numbers separated by a . that follow the format major.minor.patc...
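
To make the note about “-” and “_” concrete, here is a small sketch of the name normalization described in PEP 503, which is how package indexes compare project names. The function name `normalize` and the example names are my own choices for illustration.

```python
import re


def normalize(name: str) -> str:
    """Normalize a project name as described in PEP 503.

    Runs of "-", "_" and "." collapse to a single "-", and the
    result is lowercased, so these spellings all compare equal.
    """
    return re.sub(r"[-_.]+", "-", name).lower()


# Hypothetical package names, used only for illustration.
print(normalize("my_pipeline"))
print(normalize("My-Pipeline"))
```

Both calls give the same normalized name, so two packages spelled this way would clash on the index.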

Ipython-Config

I use my ipython terminal daily. It’s my go-to way of running python most of the time. After you use it for a little bit you will probably want to set up a bit of your own configuration. install ipython # [1] Activate your virtual environment [2] of choice and pip install it. Any time you are running your project in a virtual environment, you will need to install ipython inside it to access those packages from ipython. pip install ipython You are using a virtual environment right? Virtual environments like venv or conda can save you a ton of pain down the road. profile_default # [3] When you install ipython you start out with no config at all. Running ipython profile create will start a new profile called profile_default that contains all of the default configuration. ipython profile create This command will create a directory ~/.ipython/profile_default multiple configurations # [4] You can run multiple configurations by naming them with ipython profile create [profile_name...

Custom Ipython Prompt

I’ve grown tired of the standard ipython prompt as it doesn’t do much to give me any useful information. The default one gives out a line number that only seems to add anxiety as I am working on a simple problem and see that number grow to several hundred. I start to question my ability 🤦‍♂️. Configuration # [1] If you already have an ipython config you can move on otherwise check out this post on creating an ipython config. Ipython-Config [2] The Dream Prompt # [3] I want something similar to the starship prompt I am using in the shell. I want to be able to quickly see my python version, environment name, and git [4] branch. - python version - active environment - git branch [5] This is my zsh prompt I am using for inspiration Basic Prompt # [6] This is mostly boilerplate that I found from various google searches, but this gets me a basic green chevron as my prompt. from IPython.terminal.prompts import Prompts, Token class MyPrompt(Prompts): def in_prompt_tokens(self...
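
The three pieces of information wanted in the prompt can be gathered with the standard library alone. The sketch below is not the post's prompt class; `prompt_parts` is a hypothetical helper, and it assumes the environment name is exposed through the `CONDA_DEFAULT_ENV` or `VIRTUAL_ENV` environment variables and that `git` may or may not be installed.

```python
import os
import platform
import subprocess


def prompt_parts() -> dict:
    """Collect python version, active environment name, and git branch."""
    try:
        # Ask git for the current branch; fails outside a repository.
        branch = subprocess.run(
            ["git", "rev-parse", "--abbrev-ref", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        ).stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        branch = ""  # git missing, or not inside a repository

    # conda sets CONDA_DEFAULT_ENV; venv/virtualenv set VIRTUAL_ENV to a path.
    env = os.environ.get("CONDA_DEFAULT_ENV") or os.path.basename(
        os.environ.get("VIRTUAL_ENV", "")
    )
    return {
        "python": platform.python_version(),
        "env": env,
        "branch": branch,
    }


print(prompt_parts())
```

A prompt class could then turn these values into tokens; how they are styled is left out here.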

Ipython Ninjitsu

- ?docstring - ??sourcecode - %run - %debug - %autoreload - %history - autoformat - %reset - !shell commands ?docstring # [1] Stop going to google every time you’re stuck and stay in your workflow. The ipython ? is a superhero for productivity and staying on task. from kedro.pipeline import Pipeline Pipeline? Init signature: Pipeline( nodes: Iterable[Union[kedro.pipeline.node.Node, ForwardRef('Pipeline')]], *, tags: Union[str, Iterable[str]] = None, ) Docstring: A ``Pipeline`` defined as a collection of ``Node`` objects. This class treats nodes as part of a graph representation and provides inputs, outputs and execution order. Init docstring: Initialise ``Pipeline`` with a list of ``Node`` instances. Args: nodes: The iterable of nodes the ``Pipeline`` will be made of. If you provide pipelines among the list of nodes, those pipelines will be expanded and all their nodes will become part of this new pipeline. tags: Optional set of tags to be applied to all the pipeli...

Automating my Post Starter

One thing we all dread is the mundane work of getting started, and all the hoops it takes to get going. This year I want to post more often and I am taking some steps towards making it easier for myself to just get started. When I start a new post I need to cd into my blog directory, start neovim in a markdown file with a clever name, copy some frontmatter boilerplate, update the post date, add tags, a description, and a cover. Todo List for starting a post # [1] - frontmatter template - Title - slug - tags - date - cover - description - create markdown file - open in neovim Let’s Automate this # [2] This aint no proper cli # [3] hot and fast As with many things running behind the scenes on this site, I am the one and only user, I have limited time, so this is going to be a bit hot and fast. Let’s create a file called new-post. start the script new-post #!python # new-post 👆 Works on my machine If this were something that had more users than me I would probably use some...
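
The todo list above could be sketched roughly like this. It is not the author's new-post script: the helper names `slugify` and `frontmatter`, the YAML layout, and the example title and tags are all assumptions made for illustration.

```python
import re
from datetime import date


def slugify(title: str) -> str:
    """Lowercase the title and join runs of letters/digits with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def frontmatter(title, tags, description="", cover="", today=None):
    """Build a frontmatter block with the fields from the todo list."""
    today = today or date.today()
    lines = [
        "---",
        f"title: {title}",
        f"slug: {slugify(title)}",
        f"tags: [{', '.join(tags)}]",
        f"date: {today.isoformat()}",
        f"cover: {cover}",
        f"description: {description}",
        "---",
        "",
    ]
    return "\n".join(lines)


# Hypothetical post details, for illustration only.
print(frontmatter("Automating my Post Starter", ["python", "blog"]))
```

Writing the returned text to `<slug>.md` and opening that file in an editor would cover the remaining two items on the list.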

Windowing Python Lists

In python data science we often will reach for pandas a bit more than necessary. While pandas can save us so much there are times where there are alternatives that are much simpler. The `itertools` and `more-itertools` libraries are full of cases of this. This post is a walkthrough of me solving a problem with more-itertools rather than reaching for a for loop, or pandas. I am working on a one-line-link expander for my blog. I ended up doing it just by modifying the markdown with python. I first split the post into lines with content.split('\n'), then look to see if the line appears to be just a link. One more safety net that I wanted to add was to check if there was whitespace around the line; this could not simply be done in a list comprehension by itself. I need just a bit of knowledge of the surrounding lines, enter more-itertools. simplified rendering function # [1] I have a function that will check to see if the line should be expanded, then render the correct template. First step is to ...
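
The "knowledge of the surrounding lines" idea can be sketched without any dependency. The post uses more-itertools for this; the sketch below swaps in plain `zip` over a padded list so it runs on the standard library alone. The helper names, the `http` check, and the sample content are assumptions for illustration.

```python
def lines_with_neighbors(lines):
    """Return (previous, current, next) triples, padding both ends with ""."""
    padded = [""] + list(lines) + [""]
    return list(zip(padded, padded[1:], padded[2:]))


def is_standalone_link(prev, line, nxt):
    """True when the line looks like a bare link with blank lines around it."""
    return (
        prev.strip() == ""
        and nxt.strip() == ""
        and line.strip().startswith("http")
    )


# Made-up post content, for illustration only.
content = "intro\n\nhttps://example.com/post\n\nsee https://example.com inline"
lines = content.split("\n")
flags = [is_standalone_link(p, l, n) for p, l, n in lines_with_neighbors(lines)]
print(flags)
```

Only the bare link on its own line is flagged; the link inside a sentence is left alone.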

Testing Data Pipelines

Lint/Format/Doc - black - flake8 - interrogate - mypy Pipeline Assertions - pipeline constructs - pipeline as expected nodes - pipeline has minimum nodes - test minimum tags - test alternate tags Catalog Assertions - test catalog follows naming structure - Node Tests - test function does the correct operations on test data Great Expectations

reasons-to-kedro

There are many reasons that you should be using kedro. If you are on a team of Data Scientists/Data Engineers processing DataFrames from many data sources, you should be considering a pipeline framework. Kedro is a great option that provides many benefits for teams to collaborate, develop, and deploy data pipelines. What is Kedro [1] Starter Template # [2] Kedro makes it super easy to get started with their cli that utilizes cookiecutter under the hood. conda create -n my-new-project -y python=3.8 kedro new kedro install kedro run Create New Kedro Project [3] read more about how to start your first kedro project here Collaboration # [4] Kedro provides many tools that help teams collaborate on a single codebase. While writing monolithic scripts it can be easy to pin yourself in a corner where it is difficult to have multiple people making changes to the notebook/script at the same time. Kedro helps guide your team to break your project down into small pieces that different members o...

Reasons to Kedro

Reasons to Kedro # [1] - collaboration - Sharable catalog - small nodes over monolithic notebooks - catalog - easily load anything without needing to run - No need to write read/write code - pipeline - No need to keep execution order in your head - easily run a slice of a pipeline - plugins - pip install - make your own - hooks - flexible expandable cli Reasons Not to Kedro # [2] - Already utilizing another DAG framework - Data is not in a widely supported format - Micro short-lived project - Large Project / Deadline - Use a lower profile project to learn first - Team not willing to change - Need minimal dependencies - God Project - kedro owns everything?? References: [1]: #reasons-to-kedro [2]: #reasons-not-to-kedro

What's New in Kedro 0.16.6

Kedro 0.16.6 [1] is out! Let’s take a look through the release notes. Deployment Docs # [2] This is really exciting to see more deployment options coming from the kedro team. It really shows the power of the framework. The power of some of these orchestration options is incredible. - Argo [3] - Prefect [4] - Kubeflow [5] - Batch [6] - SageMaker [7] Most of them hinge on a sweet combination of the kedro cli, docker image, and the pipeline knowing your nodes’ dependencies. Argo, Prefect, and Kubeflow have an interesting technique where they translate the pipeline and its dependencies from kedro to their language. Batch uses the aws cli to submit jobs, one node per job, and listen for them to complete. It will submit all nodes with completed dependencies at once, meaning that we can get some massive parallelization. I did a quick and dirty test of one of these by simulating the technique in a bash script and saw a 40 hr pipeline finish in about 1 hour. I am excited to get thi...
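
The "submit every node whose dependencies are complete" idea can be sketched with the standard library's `graphlib` (Python 3.9+). This is not the kedro or AWS Batch code; the node names and the dependency graph are made up, and "submitting" is reduced to collecting each wave of ready nodes.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each node maps to the nodes it depends on.
dependencies = {
    "clean": {"raw"},
    "features": {"clean"},
    "report": {"clean"},
    "model": {"features"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()

waves = []
while sorter.is_active():
    # Every node whose dependencies are all done can run in parallel.
    ready = sorted(sorter.get_ready())
    waves.append(ready)
    # Pretend the whole wave finished, which unlocks the next one.
    sorter.done(*ready)

print(waves)
```

Each inner list is a group of nodes that could be submitted at once, which is where the parallel speedup comes from.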

A brain dump of stories

I started making stories as kind of a brain dump a few times per day and posting them to [LinkedIn](https://www.linkedin.com/in/waylonwalker/). Here are the last 11 days of stories. I store all the stories on my website with the hopes of doing something with them on my own platform eventually. For now it makes it easy to make these posts. cd static/stories ls | xargs -I {} echo '![](https://waylonwalker.com/stories/{})' Stories 10-10-2020 - 10-21-2020 # [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] References: [1]: #stories-10-10-2020---10-21-2020 [2]: https://waylonwalker.com/stories/TIL-kedro-sorts-nodes.png [3]: https://waylonwalker.com/stories/disable-base-pip.png [4]: https://waylonwalker.com/stories/discovered-social-cards.png [5]: https://waylonwalker.com/stories/find-kedro-de1-contributor.png [6]: https://waylonwalker.com/stories/hacktoberfest-2020-kedro-538-tests-pass.png [7]: https://waylonwalk...

Designing a "Router" for kedro

nodes_global # [1] I released a router-like plugin for kedro back in April 2020. This was not the first design, the idea actually came from one of the QB folks who taught me kedro nearly a year before. We were assembling our pipelines with something called nodes_global. It worked fairly well but did have some issues around being set as a global variable. But… One thing in particular that it did not lend itself well to was being able to create a packageable pipeline that I could pip install and append into any of my existing pipelines. Something I am still trying to work out, maybe I don’t need this. I think I have it working for our internal pipelines and it seems like the way to go, but we don’t necessarily end up using it. Also… With this pattern all of the nodes needed to be importable by the module containing nodes_global. I find that this becomes a big hurdle for new pipelines coming from jupyter to overcome and can be most infuriating when their nodes aren’t getting run af...

Reclaim memory usage in Jupyter

Today I ran into an issue where we had a one-off script that just needed to work, but it was just chewing through memory like nothing. It started with a colleague asking me “How do I clear the memory in a Jupyter notebook?” These are the steps we took to debug the issue and free up some memory in their notebook. How do I clear the memory in a Jupyter notebook? Pre check the status of memory # [1] There are a number of ways that you can check the amount of memory on your system. The easiest, though not necessarily my first go-to, is free… literally free. check for free space $ free -h total used free shared buffers cached Mem: 15G 15G 150M 0B 59M 8.7G Generally my first go-to is a bit more graphical, and not available on a stock system, but far more useful…. htop. htop [2] is a terminal process explorer that shows cpu usage, mem usage, and running processes. htop sudo apt-get install htop # install it from your package repo htop [3] First step throw more swap at it # [4] Often be...

Strip Trailing Whitespace from Git projects

A common linting error thrown by various linters is for trailing whitespace. I most often use flake8. I generally have [pre-commit](https://waylonwalker.com/pre-commit-is-awesome) hooks set up to strip this, but sometimes I run into situations where I jump into a project without it, and my editor lights up with errors. A simple fix is to run this one-liner. One-Liner to strip whitespace # [1] git grep -I --name-only -z -e '' | xargs -0 sed -i -e 's/[ \t]\+\(\r\?\)$/\1/' pre-commit is awesome I recently discovered the ✨ awesomeness that is pre-commit. I steered away from it for so long because it seemed like a big daunting thing to set up, but... Jun 5, 2020 [2] References: [1]: #one-liner-to-strip-whitespace [2]: /pre-commit-is-awesome/
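
For readers who want to see what the substitution does, here is a rough Python equivalent applied to a string rather than to files. It assumes the intent of the sed expression is to drop trailing spaces and tabs on every line while keeping an optional carriage return; `strip_trailing_whitespace` is a name chosen for this sketch.

```python
import re

# Trailing spaces/tabs, then an optional carriage return, at end of line.
TRAILING = re.compile(r"[ \t]+(\r?)$", flags=re.MULTILINE)


def strip_trailing_whitespace(text: str) -> str:
    """Remove trailing spaces and tabs on each line, keeping any "\r"."""
    return TRAILING.sub(r"\1", text)


# Made-up sample with Windows and Unix line endings.
sample = "a  \r\nb\t\nc  "
print(repr(strip_trailing_whitespace(sample)))
```

Keeping the captured carriage return means files with Windows line endings are cleaned without their line endings being rewritten.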

Three things to Automate with Python using Pandas

Here are three things that I see my non-programming counterparts doing every single day. These really sum up so much of what folks do within an office. So many of us dabble in or become power users of spreadsheets without knowing there is an alternative out there that can save us time, automate boring things, and allow us to open up our minds for the part where we add value: thinking about the data. Focus on Value Add Operations # [1] Let’s face it, stitching together spreadsheets is zero value add by itself, but if you can see something in the data and take action on it, this can be huge value add to your company. Learning just a bit of python will help focus more of your attention on “value add operations” and leave the mundane stuff to your computer. Merge a directory full of spreadsheets into one # [2] I see this one all the time. One team gets a spreadsheet from another team once per month and they need to stitch all the pieces together. Excel really opens the door for some na...

How to Install miniconda on linux (from the command line only)

miniconda is a python distribution from Anaconda, Inc. (formerly Continuum Analytics). It’s a slimmed-down version of their very popular anaconda distribution. It comes with its own environment manager and has eased the install process for many that do not have a way to compile c-extensions. It made it much easier to install the data science stack on Windows a few years ago. These days Windows is much better than it was back then at compiling c-extensions. I still like its environment manager, which installs to a global directory rather than a local directory for your project. Installing miniconda on Linux # [1] Installing miniconda on Linux can be a bit tricky the first time you do it completely from the terminal. The following snippet will create a directory to install miniconda into, download the latest python 3 based install script for Linux 64 bit, run the install script, delete the install script, then add a conda initialize to your bash or zsh shell. After doing this you can restart your shell and conda will...

Kedro Basics

Learn Kedro in 5 days Day 0 Setup # [1] - vm - install - python - editor Day 1 # [2] - kedro new - kedro viz Day 2 # [3] - catalog - filter catalog - load data - fsspec Day 3 # [4] - pipeline - nodes Day 4 # [5] - filter pipeline - run partial pipeline Day 5 # [6] - kedro docker - GitHub Actions Advanced Kedro # [7] - hooks - custom datasets - modular pipelines References: [1]: #day-0-setup [2]: #day-1 [3]: #day-2 [4]: #day-3 [5]: #day-4 [6]: #day-5 [7]: #advanced-kedro
