Posts tagged: data

Background Tasks in Python for Data Science

This post is intended as an extension of, and update to, the earlier post background tasks in python. I started using the background library the week that Kenneth Reitz released it. It takes away so much boilerplate from running background tasks that I use it in more places than I probably should. After taking a look at that post today, I wanted to put a better data science example in here to help folks get started.

...

Generating Readme Tables From Pandas

I commonly need to paste the first few lines of a dataset into a markdown file. I use two handy packages to do this: tabulate and pyperclip. Let's say I already have a Pandas DataFrame in memory as df. All I need to do to convert the first 5 rows to markdown and copy the result to the clipboard is the following.

    from tabulate import tabulate
    import pyperclip

    # Render the first 5 rows as a pipe-style (markdown) table, then copy it.
    md = tabulate(df.head(), df.columns, tablefmt='pipe')
    pyperclip.copy(md)

This is a super handy snippet that I use a lot. Folks really appreciate it when they can see a sample of the data without opening the entire file.

Stepping Up My SQL Game

In 2018 I transitioned from a Product Engineering (Mechanical) role to a Data Scientist role. I entered this space with strong subject matter expertise in our products and our data, and with experience munging data and building visualizations in Python. My SQL skills were lacking, to say the least. I had learned what I needed to know to get data out of our relational databases, then used pandas for any further analysis. Just run something like the following and you have data.

    SELECT *
    FROM Table
    WHERE col_1 = 'col_1_filter'

This technique works great for small datasets that you only need to query once. There is no shame in pulling in a big dataset and munging it in pandas to get some results and make decisions. The problem comes when your dataset becomes too big or you need to run the query on a frequent basis. Doing the aggregations on the server runs much quicker, as it reduces the time spent in io. My longest running steps are currently io related, and reducing them has improved my workflow. Whenever I hit server timeout errors, or found myself using the same long-running query in many places, I would search for examples online, because I did not have experience with many other techniques. I decided it was time to put away the cheat sheets, step away from Stack Overflow, and improve my speed.
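
To make the io point concrete, here is a minimal sketch using the standard library's sqlite3 module. The sales table, its columns, and its rows are made up for illustration, and because SQLite runs in-process nothing actually crosses a network here. The row counts only stand in for what a real database server would have to send back.

```python
import sqlite3

# Hypothetical in-memory table standing in for a real server-side database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10.0), ("east", 5.0), ("west", 7.5), ("west", 2.5), ("north", 1.0)],
)

# Approach 1: pull every row back, then aggregate on the client.
all_rows = conn.execute("SELECT region, amount FROM sales").fetchall()
client_totals = {}
for region, amount in all_rows:
    client_totals[region] = client_totals.get(region, 0.0) + amount

# Approach 2: let the database aggregate, so only one row per region comes back.
agg_rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"
).fetchall()
server_totals = dict(agg_rows)

print(len(all_rows), "rows returned without aggregation")
print(len(agg_rows), "rows returned with aggregation")
```

Both approaches give the same totals, but the second returns one row per region instead of one row per record. On a large table that difference is where the io savings would come from.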

SQL is far from the hot topic in 2018; AI, Deep Learning, BIG data, Machine Learning, and Natural Language Processing take the win here....

...

background tasks in python

I have tried most of the different methods in the past and found copying and pasting the ThreadPoolExecutor example or the ProcessPoolExecutor example from the standard library documentation to be the most reliable. Since this is often something that I stuff in the back of a utility module of a library, it is not something that I write often enough to be familiar with, which makes it hard to write, read, and debug. If you are looking for a good overview of concurrency, Raymond Hettinger has a great talk about the differences between the various methods, when to use them, and why.
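
For reference, the kind of boilerplate described above looks roughly like this. It is a minimal sketch in the spirit of the ThreadPoolExecutor example in the standard library documentation, not a copy of it, and slow_square is a hypothetical stand-in for an io-bound task such as a query or a file read.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def slow_square(n):
    """Hypothetical stand-in for an io-bound task."""
    time.sleep(0.05)
    return n * n


inputs = [1, 2, 3, 4]
results = {}

# Submit every task to a small pool of worker threads, then collect
# results as each one finishes (completion order is not guaranteed).
with ThreadPoolExecutor(max_workers=4) as executor:
    futures = {executor.submit(slow_square, n): n for n in inputs}
    for future in as_completed(futures):
        results[futures[future]] = future.result()

print(sorted(results.items()))
```

Mapping each future back to its input through a dict is what makes it possible to match results to inputs even though the tasks finish in arbitrary order. It is also the part that is easy to forget when you only write this a few times a year.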

Recently a new python library was released to make running tasks in the background very simple. The

...
