This Python Pandas Cheat Sheet is a quick reference for data wrangling with Pandas, complete with code samples.
By now, you’ll already know the Pandas library is one of the most preferred tools for data manipulation and analysis, and you’ll have explored the fast, flexible, and expressive Pandas data structures, maybe with the help of myTechMint’s Pandas Basics cheat sheet.
Yet there is still a lot of functionality built into this package left to explore, especially once you get hands-on with the data: you’ll need to reshape or rearrange your data, iterate over DataFrames, visualize your data, and much more. And this can be even harder than “just” mastering the basics.
That’s why today’s post introduces a new, more advanced Pandas cheat sheet.
It’s a quick guide through the functionalities that Pandas can offer you when you get into more advanced data wrangling with Python.
The Pandas cheat sheet will guide you through some more advanced indexing techniques, DataFrame iteration, handling missing values or duplicate data, grouping and combining data, data functionality, and data visualization.
In short, everything that you need to complete your data manipulation with Python!
Don’t miss out on our other cheat sheets for data science that cover Numpy, the Pandas basics, and myTechMint’s detailed Pandas Tutorials.
Reshape Data
Pivot
df3 = df2.pivot(index='Date',    #Spread rows into columns
                columns='Type',
                values='Value')
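The df2 referenced here is not defined in this cheat sheet; a minimal sketch with a hypothetical long-format stand-in (Date, Type, Value columns, as the call above assumes) shows what pivot produces:
import pandas as pd

#Hypothetical stand-in for df2: one row per (Date, Type) observation
df2 = pd.DataFrame({'Date': ['2016-03-01', '2016-03-02', '2016-03-01',
                             '2016-03-03', '2016-03-02', '2016-03-03'],
                    'Type': ['a', 'b', 'c', 'a', 'a', 'c'],
                    'Value': [11.4, 13.0, 20.8, 99.9, 1.3, 20.8]})
#Each (Date, Type) pair must be unique for pivot to work
df3 = df2.pivot(index='Date', columns='Type', values='Value')
print(df3)    #One row per Date, one column per Type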
Stack/ Unstack
stacked = df5.stack() #Pivot a level of column labels
stacked.unstack() #Pivot a level of index labels
Melt
pd.melt(df2,                            #Gather columns into rows
        id_vars=['Date'],
        value_vars=['Type', 'Value'],
        value_name='Observations')
Iteration
df.iteritems() #(Column label, Series) pairs; removed in pandas 2.0, use df.items()
df.iterrows() #(Row index, Series) pairs
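A minimal sketch of both loops, assuming a small hypothetical DataFrame:
import pandas as pd

df = pd.DataFrame({'Country': ['Belgium', 'India', 'Brazil'],
                   'Capital': ['Brussels', 'New Delhi', 'Brasilia'],
                   'Population': [11190846, 1303171035, 207847528]})

for label, column in df.items():      #Column label and the column as a Series
    print(label, column.dtype)

for idx, row in df.iterrows():        #Row index and the row as a Series
    print(idx, row['Country'], row['Population'])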
Missing Data
df.dropna() #Drop NaN values
df3.fillna(df3.mean()) #Fill NaN values with a predetermined value (here, each column's mean)
df2.replace("a", "f") #Replace values with others
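A minimal sketch, assuming a small hypothetical numeric frame with gaps, to make the three calls above concrete:
import pandas as pd
import numpy as np

df3 = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                    'b': [np.nan, 5.0, 6.0]})

df3.dropna()              #Drops every row that contains a NaN
df3.fillna(df3.mean())    #Fills each NaN with its column's mean
df3.replace(1.0, 9.0)     #Replaces one value with another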
Advanced Indexing
Selecting
df3.loc[:,(df3>1).any()] #Select cols with any value > 1
df3.loc[:,(df3>1).all()] #Select cols where all values > 1
df3.loc[:,df3.isnull().any()] #Select cols with any NaN
df3.loc[:,df3.notnull().all()] #Select cols without NaN
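These selections assume a numeric df3; a minimal sketch with a hypothetical frame containing a NaN shows how the column masks behave:
import pandas as pd
import numpy as np

df3 = pd.DataFrame({'a': [0.5, 2.0, np.nan],
                    'b': [3.0, 4.0, 5.0]})

df3.loc[:, (df3 > 1).any()]        #Both columns: each has at least one value > 1
df3.loc[:, (df3 > 1).all()]        #Only 'b': every value in it is > 1
df3.loc[:, df3.isnull().any()]     #Only 'a': it contains a NaN
df3.loc[:, df3.notnull().all()]    #Only 'b': it has no NaN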
Indexing With isin()
df[df.Country.isin(df2.Type)] #Rows whose Country appears in df2.Type
df3.filter(items=["a", "b"]) #Filter on column labels
df.select(lambda x: not x%5) #Select specific elements (removed from pandas; see the loc-based sketch below)
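df.select() has been removed from pandas, so the last line only works on old versions; selecting by a condition on the index labels with df.loc covers the same use case. A minimal sketch, assuming a small hypothetical frame:
import pandas as pd

df = pd.DataFrame({'Country': ['Belgium', 'India', 'Brazil'],
                   'Capital': ['Brussels', 'New Delhi', 'Brasilia']})

df[df.Country.isin(['India', 'Brazil'])]       #Rows whose Country is in the given list
df.filter(items=['Country'])                   #Keep only the listed columns
df.loc[[i for i in df.index if not i % 5]]     #loc-based replacement for df.select(lambda x: not x % 5)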
Where
s.where(s > 0) #Subset the data
Query
df6.query('second > first') #Query DataFrame
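query() evaluates a boolean expression written in terms of column (or named index level) names. A minimal sketch with hypothetical 'first' and 'second' columns:
import pandas as pd

df6 = pd.DataFrame({'first': [1, 2, 3],
                    'second': [3, 2, 1]})
df6.query('second > first')    #Rows where column 'second' exceeds column 'first'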
Setting/Resetting Index
df.set_index('Country') #Set the index
df4 = df.reset_index() #Reset the index
df = df.rename(index=str,    #Rename DataFrame columns
               columns={"Country": "cntry",
                        "Capital": "cptl",
                        "Population": "ppltn"})
Reindexing
s2 = s.reindex(['a','c','d','e','b'])
Forward Filling
df.reindex(range(4),
           method='ffill')
   Country    Capital  Population
0  Belgium   Brussels    11190846
1    India  New Delhi  1303171035
2   Brazil   Brasilia   207847528
3   Brazil   Brasilia   207847528
Backward Filling
s3 = s.reindex(range(5),
               method='bfill')
0    3
1    3
2    3
3    3
4    3
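The fills above assume an s and df defined elsewhere (e.g. in the basics sheet); a self-contained sketch with hypothetical data makes both directions reproducible:
import pandas as pd

df = pd.DataFrame({'Country': ['Belgium', 'India', 'Brazil'],
                   'Capital': ['Brussels', 'New Delhi', 'Brasilia'],
                   'Population': [11190846, 1303171035, 207847528]})
df.reindex(range(4), method='ffill')    #Row 3 is copied forward from the last existing row

s2 = pd.Series([10, 20], index=[1, 3])
s2.reindex(range(5), method='bfill')    #Missing labels take the next existing value; label 4 stays NaN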
Multi-Indexing
arrays = [np.array([1,2,3]),
          np.array([5,4,3])]
df5 = pd.DataFrame(np.random.rand(3, 2), index=arrays)
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples,
                                  names=['first','second'])
df6 = pd.DataFrame(np.random.rand(3, 2), index=index)
df2.set_index(["Date", "Type"])
Duplicate Data
s3.unique() #Return unique values
df2.duplicated('Type') #Check duplicates
df2.drop_duplicates('Type', keep='last') #Drop duplicates
df.index.duplicated() #Check index duplicates
Grouping Data
Aggregation
df2.groupby(by=['Date','Type']).mean()
df4.groupby(level=0).sum()
df4.groupby(level=0).agg({'a': lambda x: sum(x)/len(x), 'b': np.sum})
Transformation
customSum = lambda x: (x+x%2)
df4.groupby(level=0).transform(customSum)
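A minimal sketch, assuming a hypothetical df4 whose index has repeated labels, so that groupby(level=0) has groups to work on:
import pandas as pd

df4 = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                    'b': [10, 20, 30, 40]},
                   index=[0, 0, 1, 1])

df4.groupby(level=0).sum()                              #One row per index label, columns summed
df4.groupby(level=0).agg({'a': 'mean', 'b': 'sum'})     #Different aggregation per column
df4.groupby(level=0).transform(lambda x: x + x % 2)     #Same shape as df4, values transformed group-wise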
Combining Data
Merge
pd.merge(data1, data2, how='left', on='X1')     #Left join: keep all rows of data1
pd.merge(data1, data2, how='right', on='X1')    #Right join: keep all rows of data2
pd.merge(data1, data2, how='inner', on='X1')    #Inner join: keep only matching keys
pd.merge(data1, data2, how='outer', on='X1')    #Outer join: keep all keys from both
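The four merges assume two hypothetical frames that share a key column X1; a minimal sketch:
import pandas as pd

data1 = pd.DataFrame({'X1': ['a', 'b', 'c'],
                      'X2': [11.4, 1.3, 99.9]})
data2 = pd.DataFrame({'X1': ['a', 'b', 'd'],
                      'X3': [20.8, 13.0, 20.8]})

pd.merge(data1, data2, how='left', on='X1')     #Keys a, b, c; X3 is NaN for c
pd.merge(data1, data2, how='outer', on='X1')    #Keys a, b, c, d; missing sides become NaN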
Join
data1.join(data2, how='right')
Concatenate
Vertical
s.append(s2) #Append s2 to s (removed in pandas 2.0; use pd.concat([s, s2]))
Horizontal/Vertical
pd.concat([s, s2], axis=1, keys=['One','Two'])
pd.concat([data1, data2], axis=1, join='inner')
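Series.append() was removed in pandas 2.0, so pd.concat covers both the vertical and the horizontal case; a minimal sketch with hypothetical series:
import pandas as pd

s = pd.Series([3, -5, 7], index=['a', 'b', 'c'])
s2 = pd.Series([7, -2, 3], index=['a', 'c', 'd'])

pd.concat([s, s2])                                 #Vertical: modern replacement for s.append(s2)
pd.concat([s, s2], axis=1, keys=['One', 'Two'])    #Horizontal: aligned on the index, labelled columns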
Dates
df2['Date'] = pd.to_datetime(df2['Date'])
df2['Date'] = pd.date_range('2000-1-1',
                            periods=6,
                            freq='M')
dates = [datetime(2012,5,1), datetime(2012,5,2)]
index = pd.DatetimeIndex(dates)
index = pd.date_range(datetime(2012,2,1), end, freq='BM') #'end' is an end date you supply
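In the last line above, end must be a date you define yourself; a self-contained sketch of the same calls with hypothetical dates:
import pandas as pd
from datetime import datetime

df2 = pd.DataFrame({'Date': ['2016-03-01', '2016-03-02', '2016-03-03']})
df2['Date'] = pd.to_datetime(df2['Date'])             #Strings parsed into datetime64

index = pd.DatetimeIndex([datetime(2012, 5, 1), datetime(2012, 5, 2)])
index = pd.date_range(datetime(2012, 2, 1),           #Business-month-end dates
                      end=datetime(2012, 12, 31),     #Hypothetical end date standing in for 'end'
                      freq='BM')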
Visualization
import matplotlib.pyplot as plt
s.plot()
plt.show()
df2.plot()
plt.show()
Download Python Pandas Cheat Sheet