Markdown Cli
This is a post that may be a work in progress for a while. It’s a collection of thoughts on managing my blog, but it could be translated to anything that is just a collection of markdown.
All posts with the tag "python"
Generating an api for a blog is much simpler than one might expect with python.
fix missing data
In python data science/engineering most of our data is in the form of some sort of table, typically a DataFrame from a library like pandas, spark, or dask.
These containers for data contain many convenient methods to manipulate table like data structures. Sometimes we leverage other data types, namely vanilla types like lists and dicts, or even numpy data types.
...
Changing conda environments is a bit verbose, so I use a function with fzf that both lists environments and selects the one I want in one go.
I have used conda as a virtual environment tool for years now. I started using conda for its simplicity to install packages on Windows, but that has gotten so much better, and it’s been years since I have run a conda install command. I’m sure that I could use a different environment manager, but it works for me and makes sense.
What environment manager do you use for python?
...
What does it take to create an installable python package that can be hosted on pypi?
This post is somewhat inspired by the bottle framework, which is famously created as a single python module. Yes, a whole web framework is written in one file.
    .
    ├── setup.py
    └── my_pipeline.py

    from setuptools import setup

    setup(
        name="",
        version="0.1.0",
        py_modules=["my_pipeline"],
        install_requires=["kedro"],
    )

The name of the package can contain any letters, numbers, “_”, or “-”. Even if it’s for internal/personal consumption only, I usually check that the name does not collide with an existing package on pypi so that you don’t run into conflicts.
...
I use my ipython terminal daily. It’s my go-to way of running python most of the time. After you use it for a little bit you will probably want to set up a bit of your own configuration.
Activate your virtual environment of choice and pip install it. Any time you are running your project in a virtual environment, you will need to install ipython inside it to access those packages from ipython.
pip install ipython
You are using a virtual environment, right? Virtual environments like venv or conda can save you a ton of pain down the road.
...
I’ve grown tired of the standard ipython prompt as it doesn’t do much to give me any useful information. The default one gives out a line number that only seems to add anxiety as I am working on a simple problem and see that number grow to several hundred. I start to question my ability 🤦♂️.
If you already have an ipython config, you can move on; otherwise, check out this post on creating an ipython config.
...
Stop going to google every time you’re stuck and stay in your workflow. The ipython ? is a superhero for productivity and staying on task.
    from kedro.pipeline import Pipeline
    Pipeline?

    Init signature:
    Pipeline(
        nodes: Iterable[Union[kedro.pipeline.node.Node, ForwardRef('Pipeline')]],
        *,
        tags: Union[str, Iterable[str]] = None,
    )
    Docstring:
    A ``Pipeline`` defined as a collection of ``Node`` objects. This class
    treats nodes as part of a graph representation and provides inputs,
    outputs and execution order.
    Init docstring:
    Initialise ``Pipeline`` with a list of ``Node`` instances.

    Args:
        nodes: The iterable of nodes the ``Pipeline`` will be made of. If you
            provide pipelines among the list of nodes, those pipelines will
            be expanded and all their nodes will become part of this new
            pipeline.
        tags: Optional set of tags to be applied to all the pipeline nodes.

    Raises:
        ValueError: When an empty list of nodes is provided, or when not all
            nodes have unique names....
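Outside of ipython, roughly the same information can be pulled together with the standard library `inspect` module. The sketch below is only an approximation of what `?` prints, not how ipython implements it; `describe` and `add` are hypothetical names used for illustration.

```python
import inspect


def describe(obj) -> str:
    """Return the signature and docstring of ``obj`` as one string."""
    name = getattr(obj, "__name__", repr(obj))
    try:
        signature = str(inspect.signature(obj))
    except (TypeError, ValueError):
        # Some builtins do not expose a signature.
        signature = "(...)"
    doc = inspect.getdoc(obj) or "<no docstring>"
    return f"Signature: {name}{signature}\nDocstring:\n{doc}"


# Hypothetical function, only here to have something to inspect.
def add(a, b=1):
    """Add two numbers."""
    return a + b


print(describe(add))
```

Inside ipython, `add?` stays the quicker option; a helper like this is mostly useful in a plain python session or a script.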
...
One thing we all dread is the mundane work of getting started, and all the hoops it takes to get going. This year I want to post more often, and I am taking some steps towards making it easier for myself to just get started.
When I start a new post I need to cd into my blog directory, start neovim in a markdown file with a clever name, copy some frontmatter boilerplate, update the post date, add tags, a description, and a cover.
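The manual steps above could be scripted along these lines. This is a minimal sketch, not the tool the post goes on to build: the `new_post` and `slugify` names, the `title` field, and the exact YAML layout of the frontmatter are all assumptions.

```python
from datetime import date


def slugify(title: str) -> str:
    """Turn a post title into a lowercase, dash-separated file stem."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in title.lower())
    return "-".join(cleaned.split())


def new_post(title, tags, description="", cover="", today=None):
    """Return (filename, frontmatter) for a new markdown post.

    The field names are assumed for illustration; adjust them to whatever
    the blog's frontmatter actually expects.
    """
    today = today or date.today()
    frontmatter = "\n".join(
        [
            "---",
            f"title: {title}",
            f"date: {today.isoformat()}",
            f"tags: [{', '.join(tags)}]",
            f"description: {description}",
            f"cover: {cover}",
            "---",
            "",
        ]
    )
    return f"{slugify(title)}.md", frontmatter


filename, text = new_post("Markdown Cli", ["python", "blog"])
print(filename)
print(text)
```

Writing `text` to `filename` inside the blog directory and opening it in an editor would cover the remaining steps.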
hot and fast
...
In python data science we often will reach for pandas a bit more than necessary. While pandas can save us so much, there are times where there are alternatives that are much simpler. The itertools and more-itertools libraries are full of cases of this.
This post is a walkthrough of me solving a problem with more-itertools rather than reaching for a for loop, or pandas.
I am working on a one-line-link expander for my blog. I ended up doing it just by modifying the markdown with python. I first split the post into lines with content.split('\n'), then look to see if the line appears to be just a link. One more safety net that I wanted to add was to check if there was whitespace around the line. This could not simply be done in a list comprehension by itself, since I need just a bit of knowledge of the surrounding lines. Enter more-itertools.
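Here is a minimal sketch of the idea: look at every line together with its neighbours and keep only links that sit alone between blank lines. To stay runnable on the standard library alone, it builds the three-line sliding window with `zip` rather than with `more_itertools.windowed`, which provides the same kind of window. The `find_lone_links` name and the "starts with http" test for a link are assumptions, not the post's actual code.

```python
def find_lone_links(content: str) -> list:
    """Return indexes of lines that are just a link with blank lines around them."""
    lines = content.split("\n")
    # Pad both ends so the first and last lines also get two neighbours.
    padded = [""] + lines + [""]
    lone = []
    windows = zip(padded, padded[1:], padded[2:])
    for index, (before, line, after) in enumerate(windows):
        stripped = line.strip()
        is_link = stripped.startswith(("http://", "https://")) and " " not in stripped
        if is_link and not before.strip() and not after.strip():
            lone.append(index)
    return lone


post = "\n".join(
    [
        "intro",
        "",
        "https://example.com/a",
        "",
        "text with https://example.com/b inline",
        "https://example.com/c",
        "more text",
    ]
)
print(find_lone_links(post))
```

Only the link on its own line with blank lines on both sides is reported; the inline link and the link squeezed between two lines of text are left alone.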
...
There are many reasons that you should be using kedro. If you are on a team of Data Scientists/Data Engineers processing DataFrames from many data sources, you should be considering a pipeline framework. Kedro is a great option that provides many benefits for teams to collaborate, develop, and deploy data pipelines.
Kedro makes it super easy to get started with their cli that utilizes cookiecutter under the hood.
...
Kedro 0.16.6 is out! Let’s take a look through the release notes.
It is really exciting to see more deployment options coming from the kedro team. It really shows the power of the framework. The power of some of these orchestration options is incredible.
Most of them hinge on a sweet combination of the kedro cli, a docker image, and the pipeline knowing your nodes’ dependencies.
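To illustrate why knowing each node's dependencies is enough to schedule the work anywhere, here is a minimal sketch using the standard library `graphlib` module (Python 3.9+). It is not kedro's implementation: the node names, the dict layout, and the `execution_order` function are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical nodes, deliberately listed out of order.
nodes = {
    "train": {"inputs": {"features"}, "outputs": {"model"}},
    "clean": {"inputs": {"raw"}, "outputs": {"cleaned"}},
    "features": {"inputs": {"cleaned"}, "outputs": {"features"}},
}


def execution_order(nodes):
    """Order nodes so each one runs after the nodes that produce its inputs."""
    # Map every dataset to the node that produces it.
    producers = {
        dataset: name for name, spec in nodes.items() for dataset in spec["outputs"]
    }
    # For each node, collect the nodes it depends on. Inputs with no
    # producer (like "raw") are treated as already available.
    graph = {
        name: {producers[d] for d in spec["inputs"] if d in producers}
        for name, spec in nodes.items()
    }
    return list(TopologicalSorter(graph).static_order())


print(execution_order(nodes))
```

Once the order is derived from inputs and outputs, an orchestrator only has to run each node in its own container when its upstream nodes have finished.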
...
I released a router-like plugin for kedro back in April 2020. This was not the first design; the idea actually came from one of the QB folks who taught me kedro nearly a year before. We were assembling our pipelines with something called nodes_global. It worked fairly well but did have some issues around being set as a global variable.
But…
One thing in particular that it did not lend itself well to was creating a packageable pipeline that I could pip install and append into any of my existing pipelines. It is something I am still trying to work out, and maybe I don’t need it. I think I have it working for our internal pipelines and it seems like the way to go, but we don’t necessarily end up using it.
...
Today I ran into an issue where we had a one-off script that just needed to work, but it was chewing through memory like nothing.
It started with a colleague asking me, “How do I clear the memory in a Jupyter notebook?” These are the steps we took to debug the issue and free up some memory in their notebook.
How do I clear the memory in a Jupyter notebook?
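One common first step, sketched here under the assumption that the memory is held by a single large object that nothing else references: delete the name, then ask the garbage collector to run. This is a general illustration and not necessarily the steps taken in this post; `BigTable` is a hypothetical stand-in for a large DataFrame.

```python
import gc
import weakref


class BigTable:
    """Hypothetical stand-in for a large in-memory table."""

    def __init__(self, rows):
        self.data = list(range(rows))


table = BigTable(1_000_000)
# A weak reference lets us check whether the object is really gone
# without keeping it alive ourselves.
tracker = weakref.ref(table)

del table  # drop the only strong reference
gc.collect()  # also clean up anything stuck in reference cycles

freed = tracker() is None
print(freed)
```

In a notebook the object can also be kept alive by the output history if it was ever displayed as a cell result, so `del` alone may not be enough; ipython's `%xdel` and `%reset` magics are meant for that case.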
...
A common linting error thrown by various linters is for trailing whitespace. I most often use flake8. I generally have [pre-commit](https://waylonwalker.com/pre-commit-is-awesome) hooks set up to strip this, but sometimes I run into situations where I jump into a project without it, and my editor lights up with errors. A simple fix is to run this one-liner.
git grep -I --name-only -z -e '' | xargs -0 sed -i -e 's/[ \t]\+\(\r\?\)$/\1/'
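For anyone curious what the sed expression is doing, here is a rough python equivalent applied to a single string. It is a sketch for understanding the pattern, not a replacement for the one-liner; `strip_trailing_whitespace` is a hypothetical name.

```python
import re

# Spaces or tabs at the end of a line, optionally followed by a carriage
# return. The carriage return is captured so it can be put back, which
# keeps Windows line endings intact.
TRAILING = re.compile(r"[ \t]+(\r?)$", flags=re.MULTILINE)


def strip_trailing_whitespace(text: str) -> str:
    """Remove trailing spaces and tabs from every line of ``text``."""
    return TRAILING.sub(r"\1", text)


sample = "a  \nb\t\r\nx  y\nc"
print(repr(strip_trailing_whitespace(sample)))
```

The `git grep -I --name-only -z -e ''` half of the one-liner just lists every tracked, non-binary file, so that sed only touches files that belong to the repo.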
Here are three things that I see my non-programming counterparts doing every single day. These really sum up so much of what folks do within an office. So many of us dabble in or become power users of spreadsheets without knowing there is an alternative out there that can save us time, automate boring things, and free up our minds for the part where we add value: thinking about the data.
Let’s face it, stitching together spreadsheets is zero value add by itself, but if you can see something in the data and take action on it, this can be a huge value add to your company. Learning just a bit of python will help focus more of your attention on “value add operations” and leave the mundane stuff to your computer.
I see this one all the time. One team gets a spreadsheet from another team once per month and they need to stitch all the pieces together. Excel really opens the door for some nasty hidden bugs in your manually stitched together data. It also takes time out of your day...
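As a small taste of what that looks like in python, here is a sketch that stitches several CSV exports together using only the standard library. The `stitch` name, the sample data, and the choice to reject files whose headers do not match are assumptions for illustration.

```python
import csv
import io


def stitch(csv_texts):
    """Combine the rows of several CSV documents that share one header."""
    rows = []
    expected = None
    for text in csv_texts:
        reader = csv.DictReader(io.StringIO(text))
        if expected is None:
            expected = reader.fieldnames
        elif reader.fieldnames != expected:
            # Fail loudly instead of silently misaligning columns.
            raise ValueError(f"expected columns {expected}, got {reader.fieldnames}")
        rows.extend(reader)
    return rows


# Hypothetical monthly exports from another team.
january = "team,units\nA,10\nB,7\n"
february = "team,units\nA,12\n"

combined = stitch([january, february])
print(len(combined))
```

The header check is the part a manual copy and paste does not give you: a column that moved or was renamed stops the script instead of quietly ending up in the wrong place.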
...
miniconda is a python distribution from Continuum. It’s a slimmed-down version of their very popular anaconda distribution. It comes with its own environment manager and has eased the install process for many that do not have a way to compile c-extensions. It made it much easier to install the data science stack on Windows a few years ago. These days Windows is much better than it was back then at compiling c-extensions. I still like its environment manager, which installs to a global directory rather than a local directory for your project.
Installing miniconda on Linux can be a bit tricky the first time you do it completely from the terminal. The following snippet will create a directory to install miniconda into, download the latest python 3 based install script for Linux 64 bit, run the install script, delete the install script, then add a conda initialize to your bash or zsh shell. After doing this you can restart your shell and conda will be ready to go.
...
If we take a look at the release notes I see one major feature improvement on the list, auto-discovery of hooks.
## Major features and improvements
* Enabled auto-discovery of hooks implementations coming from installed plugins.
This one comes as a bit of a surprise, as it was just casually mentioned in #435
As I continue to build out waylonwalker.com I sometimes run into some errors that are not caught because I do not have good testing implemented. I want to explore some integration testing options using GitHub Actions.
Running integration tests will not prevent bugs from happening completely, but it will allow me to quickly spot them and roll back.
The very first thing that comes to my mind is anything that is loaded or run client-side. Two things quickly came to mind here. I run gatsby so most of my content is statically rendered, and it yells at me if something isn’t as expected. For performance reasons I lazy load cards on my blogroll, since loading all of the header images gets heavy and kills lighthouse (if anyone actually cares). I am also loading some...
...