Posts tagged: python
All posts with the tag "python"
Just Use Pathlib
Pathlib is an amazing cross-platform path tool.
from pathlib import Path
Create path object
Current Directory
cwd = Path('.').absolute()
Users Home Directory
...
Filtering Pandas
Good for method chaining, i.e. adding more methods or filters without assigning a new variable.
# is
skus.query('AVAILABILITY == " AVAILABLE"')
# is not
skus.query('AVAILABILITY != " AVAILABLE"')
masking
General purpose. This is probably the most common method you see in training material and examples.
# is
skus[skus['AVAILABILITY'] == 'AVAILABLE']
# is not (parentheses are required because ~ binds tighter than ==)
skus[~(skus['AVAILABILITY'] == 'AVAILABLE')]
isin
Can match against several strings at once.
...
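The three filtering styles above can be compared side by side. This is only a minimal sketch on a made-up `skus` frame: the `SKU` column and the availability values are assumptions for illustration, not data from the post.

```python
import pandas as pd

# Hypothetical data: the SKU ids and availability values are invented for this sketch.
skus = pd.DataFrame(
    {
        "SKU": ["a1", "b2", "c3", "d4"],
        "AVAILABILITY": ["AVAILABLE", "BACKORDER", "AVAILABLE", "DISCONTINUED"],
    }
)

# query: reads well in the middle of a method chain
available_query = skus.query('AVAILABILITY == "AVAILABLE"')

# masking: build a boolean Series, then index the frame with it
available_mask = skus[skus["AVAILABILITY"] == "AVAILABLE"]

# isin: match any of several values; ~ negates the resulting mask
unavailable = ["BACKORDER", "DISCONTINUED"]
not_sellable = skus[skus["AVAILABILITY"].isin(unavailable)]
sellable = skus[~skus["AVAILABILITY"].isin(unavailable)]
```

On this toy frame, `query` and masking select the same rows; which one to reach for is mostly a matter of whether you are in the middle of a chain.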
Pyspark
I have been using PySpark since March 2019; here are my thoughts.
Making good documentation in python
I just started using portray and it is amazingly simple to use!
Quick Progress Bars in python using TQDM
tqdm is one of my favorite general purpose utility libraries in python. It allows me to see progress of multipart processes as they happen. I really like this for when I am developing something that takes some amount of time and I am unsure of performance. It allows me to be patient when the process is going well and will finish in sufficient time, and allows me to 💥 kill it and find a way to make it perform better if it will not finish in sufficient time.
for more gifs like these follow me on twitter @waylonwalker
Add a simple Progress bar!
...
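As a rough illustration of the idea, wrapping an iterable in `tqdm()` is usually all it takes. This is a sketch, not the post's own example: `slow_square` and the sleep are hypothetical stand-ins for a slow processing step.

```python
import time

from tqdm import tqdm


def slow_square(n):
    """Hypothetical slow step; the sleep simulates real work."""
    time.sleep(0.01)
    return n * n


# Wrapping any iterable in tqdm() prints a live progress bar (to stderr by default).
results = [slow_square(n) for n in tqdm(range(20), desc="squaring")]
```

The bar shows elapsed time and an estimate of time remaining, which is what makes the wait-or-kill decision easy.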
Clean up Your Data Science with Named Tuples
If you are a regular listener of TalkPython or PythonBytes you have heard Michael Kennedy talk about Named Tuples many times, but what are they and how do they fit into my data science workflow?
As you graduate your scripts into modules and libraries you might start to notice that you need to pass a lot of data around to all of the functions that you have created. For example, you might be running an analysis that uses sales, inventory, and pricing data. You may need to calculate total revenue or inventory on hand, or pass these data sets into various models to drive production or pricing based on predicted volumes.
Here we set up functions that can load data from the sales database. Assume that we also have similar...
...
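One way to picture the idea is to bundle the related data sets into a single named tuple and pass that around. The sketch below is an assumption about how this could look, not the post's code: `RawData`, `load_data`, and the sample records are all hypothetical, and plain lists stand in for real query results.

```python
from typing import NamedTuple


class RawData(NamedTuple):
    """Bundle of related data sets (all field names are hypothetical)."""

    sales: list
    inventory: list
    pricing: dict


def load_data() -> RawData:
    # Stand-in for real database queries: each record is (sku, quantity).
    sales = [("a1", 3), ("b2", 5)]
    inventory = [("a1", 10), ("b2", 0)]
    pricing = {"a1": 2.50, "b2": 4.00}
    return RawData(sales=sales, inventory=inventory, pricing=pricing)


def total_revenue(data: RawData) -> float:
    # Fields are accessed by name, so the function signature stays small.
    return sum(qty * data.pricing[sku] for sku, qty in data.sales)


def inventory_on_hand(data: RawData) -> int:
    return sum(qty for _, qty in data.inventory)


data = load_data()
revenue = total_revenue(data)
on_hand = inventory_on_hand(data)
```

Each function takes one argument instead of three, and adding another data set later means adding a field rather than changing every signature.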
Background Tasks in Python for Data Science
This post is intended as an extension/update from background tasks in python. I started using background the week that Kenneth Reitz released it. It takes away so much boilerplate from running background tasks that I use it in more places than I probably should. After taking a look at that post today, I wanted to put a better data science example in here to help folks get started.
...
📝 Bash Notes
Bash is super powerful.
Show Remaining Space on Drives
df -h
show largest files in current directory
...
Autoreload in Ipython
I have used %autoreload for several years now with great success and 🔥 rapid reloads. It allows me to move super fast when developing libraries and modules. They have made some great updates this year that allow class modules to be updated automatically.
🔥 Blazing Fast
💥 Keeps me in the comfort of my text editor
...
Python Tips
Generating Readme Tables From Pandas
I commonly need to paste the first few lines of a dataset into a markdown file. I use two handy packages to do this, tabulate and pyperclip. Let's say I already have a Pandas DataFrame in memory as df. All I need to do to convert the first 5 rows to markdown and copy it to the clipboard is the following.
from tabulate import tabulate
import pyperclip

# tabulate is imported as a function here, so call it directly
md = tabulate(df.head(), df.columns, tablefmt='pipe')
pyperclip.copy(md)
This is a super handy snippet that I use a lot. Folks really appreciate it when they can see a sample of the data without opening the entire file.
Pycon 2018 Roundup
These are my notes from PyCon 2018 videos. I love the Python community and especially the conference talks. This year I am going to take some notes from my favorite talks and post them here.
This is an incomplete, working post.
https://www.youtube.com/watch?v=zQeYx87mfyw
...
My favorite pandas pattern
I work with a lot of transactional timeseries data that includes categories. I often want to create timeseries plots with each category as its own line. This is the method that I use almost daily to achieve this result. Typically the data that I am working with changes very slowly, and trends happen over years, not days or weeks. Plotting daily/weekly data tends to be noisy and hides the trend. I use this pattern because it works well with my data and is easy to explain to my stakeholders.
import pandas as pd
import numpy as np
%matplotlib inline
Let's fake some data
Here I am trying to simulate a subset of a large transactional data set. This could be something like sales data, production data, hourly billing, anything that has a date, category, and value. Since we generated this data we know that it is clean. I am still going to assume that it contains some nulls, and an...
...
background tasks in python
I have tried most of the different methods in the past and found that copying and pasting the ThreadPoolExecutor example or the ProcessPoolExecutor example from the standard library documentation is the most reliable. Since this is often something that I stuff in the back of a utility module of a library, it is not something that I write often enough to be familiar with, which makes it hard to write, read, and debug. If you are looking for a good overview of concurrency, Raymond Hettinger has a great talk about the differences between the various methods, when to use them, and why.
...
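For reference, the thread pool pattern mentioned above looks roughly like this. It is a minimal sketch in the spirit of the standard library example, not a copy of it: `fetch` and the `names` list are hypothetical, and the sleep stands in for real I/O.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch(name):
    """Hypothetical slow task; the sleep stands in for I/O such as a query."""
    time.sleep(0.05)
    return name, len(name)


names = ["sales", "inventory", "pricing"]

with ThreadPoolExecutor(max_workers=3) as executor:
    # submit() returns a Future immediately; the work runs in the pool.
    futures = [executor.submit(fetch, name) for name in names]
    # as_completed() yields futures as they finish, in no guaranteed order.
    lengths = dict(future.result() for future in as_completed(futures))
```

Returning the name alongside the result keeps the mapping correct even though the futures complete in arbitrary order.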
Pycon 2017 Roundup
Good afternoon fellow Data Geeks. Last week Pycon released 141 videos of greatness. Here are my top picks from the event.
https://www.youtube.com/watch?v=u_iAXzy3xBA&t=1795s
https://www.youtube.com/watch?v=abrcJ9MpF60
...