
I’ve already done the same thing with NumPy, so I’ve decided to create a cheat sheet for Pandas too :) Most of what you’ll find below is taken from Khalil El Mahrsi’s course, which you can find on his website smellydatascience.com, published under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License (CC BY-NC-SA 4.0). So all credit goes to him. I’ve added a few tips found here and there…

You’ll also find my Anki decks for NumPy & Pandas on GitHub.

Introduction

  • a Python package for tabular data analysis and manipulation
  • powerful & intuitive: fast, flexible, and easy to use
  • open source

Tabular data definition:
data structured into a table (dataframe), i.e., into rows & columns

  • row: entities, objects, observations, instances
  • column: variables, features, attributes

Variable types

(figure: overview of variable types)

Quantitative (Numerical) Variables

  • A quantitative variable has values that are numeric and that reflect a notion of magnitude
  • Quantitative variables can be
    • Discrete → finite set of countable values (often integers)
      e.g., number of children per family, number of rooms in a house, …
    • Continuous → infinity of possible values
      e.g., age, height, weight, distance, date and time, …
  • Math operations on quantitative variables make sense
    e.g., a person who has 4 children has twice as many children as a person who has 2

Qualitative (Categorical) Variables

  • A qualitative variable’s values represent categories (modalities, levels)
  • They do not represent quantities or orders of magnitude
  • Qualitative variables can be
    • Nominal → modalities are unordered e.g., color,
    • Ordinal → an order exists between modalities e.g., cloth sizes (XS, S, M, L, XL, XXL, …), satisfaction level (very dissatisfied, dissatisfied, neutral, satisfied, very satisfied), …

/!\ Encoding a categorical variable with a numeric data type (e.g., int) does not make it quantitative!
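
A minimal sketch of why (with made-up postal codes, a nominal variable often stored as int):

import pandas as pd

postal_codes = pd.Series([75001, 13002, 75001, 69003], name="postal_code")
postal_codes.mean()  # a meaningless "average postal code"

# recasting to a categorical dtype makes the variable's nature explicit
postal_codes.astype("category")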

Pandas’ Features

  • Based on NumPy → many concepts (indexing, slicing, …) work similarly
  • Two main object types
    • DataFrame → 2-dimensional data structure storing data of different types (strings, integers, floats, …) in columns
    • Series → represents a column (series of values)
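
A minimal sketch (hypothetical toy data) of how the two relate:

import pandas as pd

# a DataFrame stores columns of (possibly) different types
df_demo = pd.DataFrame({"name": ["Alice", "Bob"], "age": [34, 29]})

# selecting a single column returns a Series
ages = df_demo["age"]
type(ages)  # pandas.core.series.Series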

Installing & importing pandas

!conda install pandas
!pip install pandas

import pandas as pd

Basic Functionalities

dataset used for the examples below: the Bank Marketing Data Set

Variables:

  • age: age in years (numeric)
  • job: the customer’s job category (categorical)
  • marital: the customer’s marital status (categorical)
  • education: the customer’s education level (ordinal)
  • default: whether the customer has a loan in default (categorical)
  • housing: whether the customer has a housing loan (categorical)
  • loan: whether the customer has a personal loan (categorical)
  • y: how the customer responded to a marketing campaign (target variable)

Pandas provides reader and writer functions for handling popular formats: CSV, JSON, parquet, Excel, SQL databases…

reader function to load a data frame from a CSV file

# df = pd.read_csv(file_path, sep=separator,...) 
# default sep = ","
df = pd.read_csv("./bank.csv", sep=';')
df.head()
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y
0 30 unemployed married primary no 1787 no no cellular 19 oct 79 1 -1 0 unknown no
1 33 services married secondary no 4789 yes yes cellular 11 may 220 1 339 4 failure no
2 35 management single tertiary no 1350 yes no cellular 16 apr 185 1 330 1 failure no
3 30 management married tertiary no 1476 yes yes unknown 3 jun 199 4 -1 0 unknown no
4 59 blue-collar married secondary no 0 yes no unknown 5 may 226 1 -1 0 unknown no
df.to_csv("bank_copy.csv")
# check the file was written: !dir (Windows) or !ls -alh (Unix)

view the beginning or the end of a series / dataframe

df.head(2)
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y
0 30 unemployed married primary no 1787 no no cellular 19 oct 79 1 -1 0 unknown no
1 33 services married secondary no 4789 yes yes cellular 11 may 220 1 339 4 failure no
df.tail(3)
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y
4518 57 technician married secondary no 295 no no cellular 19 aug 151 11 -1 0 unknown no
4519 28 blue-collar married secondary no 1137 no no cellular 6 feb 129 4 211 3 other no
4520 44 entrepreneur single tertiary no 1136 yes yes cellular 3 apr 345 2 249 7 other no

get the shape of a DataFrame or a Series

  • DataFrame → tuple (row_count, column_count)
  • Series → singleton tuple (length, )
df.shape
(4521, 17)
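
The same attribute works on a Series:

df['age'].shape
(4521,)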

The column names of a DataFrame can be accessed using its columns attribute

df.columns
Index(['age', 'job', 'marital', 'education', 'default', 'balance', 'housing',
       'loan', 'contact', 'day', 'month', 'duration', 'campaign', 'pdays',
       'previous', 'poutcome', 'y'],
      dtype='object')

Use the dtypes attribute to check the data types of a Series or a DataFrame’s columns

  • pandas mostly relies on NumPy arrays and dtypes (bool, int, float, datetime64[ns], …)
  • pandas also extends some NumPy types (CategoricalDtype, DatetimeTZDtype, …)
  • Two ways to represent strings: object dtype (default) or StringDtype (recommended)
df.dtypes
age           int64
job          object
marital      object
education    object
default      object
balance       int64
housing      object
loan         object
contact      object
day           int64
month        object
duration      int64
campaign      int64
pdays         int64
previous      int64
poutcome     object
y            object
dtype: object
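
To opt into the recommended StringDtype instead of the default object dtype, recast explicitly (a sketch; df itself is not modified here):

jobs_as_string = df['job'].astype("string")
jobs_as_string.dtype  # string dtype instead of object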

Technical summary
A technical summary of a DataFrame can be accessed using the info() method. It contains

  • The type of the DataFrame
  • The row index (RangeIndex in the example) and its number of entries
  • The total number of columns
  • For each column
    • The column’s name
    • The count of non-null values
    • The column’s data type
  • Column count per data type
  • Total memory usage
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4521 entries, 0 to 4520
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        4521 non-null   int64 
 1   job        4521 non-null   object
 2   marital    4521 non-null   object
 3   education  4521 non-null   object
 4   default    4521 non-null   object
 5   balance    4521 non-null   int64 
 6   housing    4521 non-null   object
 7   loan       4521 non-null   object
 8   contact    4521 non-null   object
 9   day        4521 non-null   int64 
 10  month      4521 non-null   object
 11  duration   4521 non-null   int64 
 12  campaign   4521 non-null   int64 
 13  pdays      4521 non-null   int64 
 14  previous   4521 non-null   int64 
 15  poutcome   4521 non-null   object
 16  y          4521 non-null   object
dtypes: int64(7), object(10)
memory usage: 600.6+ KB

Statistical Summary of Numerical Columns
Use the describe() method to access a statistical summary (mean, standard deviation, min, max, …) of numerical columns of a DataFrame

# transpose the statistical summary for better readability
df.describe().T
count mean std min 25% 50% 75% max
age 4521.0 41.170095 10.576211 19.0 33.0 39.0 49.0 87.0
balance 4521.0 1422.657819 3009.638142 -3313.0 69.0 444.0 1480.0 71188.0
day 4521.0 15.915284 8.247667 1.0 9.0 16.0 21.0 31.0
duration 4521.0 263.961292 259.856633 4.0 104.0 185.0 329.0 3025.0
campaign 4521.0 2.793630 3.109807 1.0 1.0 2.0 3.0 50.0
pdays 4521.0 39.766645 100.121124 -1.0 -1.0 -1.0 -1.0 871.0
previous 4521.0 0.542579 1.693562 0.0 0.0 0.0 0.0 25.0

Value Counts of Qualitative Columns
Use the value_counts() method to count the number of occurrences of each value in a Series (or DataFrame). Use normalize=True in the method call to get percentages

# count occurrences of each category of marital
# (can also be used with numerical variables)

df['marital'].value_counts()
married     2797
single      1196
divorced     528
Name: marital, dtype: int64
# percentages instead of counts

df['marital'].value_counts(normalize=True)
married     0.618668
single      0.264543
divorced    0.116788
Name: marital, dtype: float64
# for a DF, value_counts() counts occurrences
# of rows (i.e., value combinations)

df[['marital', 'housing']].value_counts()
marital   housing
married   yes        1625
          no         1172
single    yes         636
          no          560
divorced  yes         298
          no          230
dtype: int64

Selecting a Single Column
To select a single column from a DataFrame, specify its name within square brackets → df[col]. The retrieved object is a Series.

jobs = df['job']
type(jobs)
pandas.core.series.Series
jobs.head()
0     unemployed
1       services
2     management
3     management
4    blue-collar
Name: job, dtype: object

Selecting Multiple Columns
To select multiple columns, provide a list of column names within square brackets → df[[col_1, col_2, …]]. The retrieved object is a DataFrame

df[['age', 'education', 'job', 'loan']].head()
age education job loan
0 30 primary unemployed no
1 33 secondary services yes
2 35 tertiary management no
3 30 tertiary management yes
4 59 secondary blue-collar no

Dropping Columns
Instead of selecting columns, you can drop unwanted columns using the drop() method. Be sure to specify axis=1 (otherwise, pandas will attempt to drop rows). To modify the original data frame, use inplace=True in the method call.

df_with_cols_dropped = df.drop(['balance', 'day', 'month', 
                                'duration', 'pdays'], axis=1)
df_with_cols_dropped.tail(3)
age job marital education default housing loan contact campaign previous poutcome y
4518 57 technician married secondary no no no cellular 11 0 unknown no
4519 28 blue-collar married secondary no no no cellular 4 3 other no
4520 44 entrepreneur single tertiary no yes yes cellular 2 7 other no

Why Select Columns?
Two main motivations for selecting or dropping columns

  • Restrict the data to meaningful variables that are useful for the intended data analysis
  • Retain variables that are compatible with some technique you intend to use, e.g., some machine learning algorithms only make sense when applied to numerical variables

Filtering Rows

Rows can be removed using a boolean filter → df[bool_filter]

  • Filter contains True at position i → keep corresponding row
  • Filter contains False at position i → remove corresponding row

Most of the time, the filter involves conditions on the columns

  • e.g., keep married clients only
  • e.g., keep clients who are 30 or older
  • etc.

Conditions can be combined using logical operators

  • & → bit-wise logical and (binary)
  • | → bit-wise logical or (binary)
  • ~ → bit-wise logical negation (unary)

Example: clients who are married or divorced, unemployed, and 40 or older

Each condition produces a pandas Series. The different conditions’ Series are then combined into one Series used to filter the rows.

df[
    ((df['marital'] == 'married') | (df['marital'] == 'divorced')) &
    (df['age'] >= 40) & (df['job'] == 'unemployed')
].head(3)
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y
79 40 unemployed married secondary no 219 yes no cellular 17 nov 204 2 196 1 failure no
108 56 unemployed married primary no 3391 no no cellular 21 apr 243 1 -1 0 unknown yes
152 45 unemployed divorced primary yes -249 yes yes unknown 1 jul 92 1 -1 0 unknown no

Why Filter Rows?
Filtering rows can be motivated by multiple reasons

  • Limiting the analysis to a specific subpopulation of interest
  • Handling outliers and missing values (drop problematic rows)
  • Performance considerations (subsampling a massive data set)

/!\ Never filter rows (or select columns) using for loops!
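
Use vectorized boolean masks instead; helpers like isin() make multi-value conditions concise (a sketch, equivalent to the marital condition above):

# vectorized: keep married or divorced clients
df[df['marital'].isin(['married', 'divorced'])].head(3)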

Sorting Data
Use the sort_values() method to sort a DataFrame or a Series

  • Data frames can be sorted on multiple columns by providing the list of column names
  • Sorting order (ascending or descending) can be controlled with the ascending argument
  • Use inplace=True in the method call to modify the original DataFrame or Series
df.sort_values(by='age').head(10)
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y
503 19 student single primary no 103 no no cellular 10 jul 104 2 -1 0 unknown yes
1900 19 student single unknown no 0 no no cellular 11 feb 123 3 -1 0 unknown no
2780 19 student single secondary no 302 no no cellular 16 jul 205 1 -1 0 unknown yes
3233 19 student single unknown no 1169 no no cellular 6 feb 463 18 -1 0 unknown no
999 20 student single secondary no 291 no no telephone 11 may 172 5 371 5 failure no
1725 20 student single secondary no 1191 no no cellular 12 feb 274 1 -1 0 unknown no
13 20 student single secondary no 502 no no cellular 30 apr 261 1 -1 0 unknown yes
3362 21 student single secondary no 6 no no unknown 9 may 622 1 -1 0 unknown no
2289 21 student single secondary no 681 no no unknown 20 aug 6 1 -1 0 unknown no
110 21 student single secondary no 2488 no no cellular 30 jun 258 6 169 3 success yes

Example: sort the data frame by decreasing alphabetical order of marital status and education, and increasing order of age

/!\ While education is an ordinal variable, pandas sorts it alphabetically since it is encoded as a string!

df.sort_values(by=['marital', 'education', 'age'],
              ascending=[False, False, True]).head(10)
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y
1900 19 student single unknown no 0 no no cellular 11 feb 123 3 -1 0 unknown no
3233 19 student single unknown no 1169 no no cellular 6 feb 463 18 -1 0 unknown no
2703 21 student single unknown no 137 yes no unknown 12 may 198 3 -1 0 unknown no
1241 22 student single unknown no 549 no no cellular 2 sep 154 1 -1 0 unknown no
1543 22 student single unknown no 47 no no cellular 3 jul 69 3 -1 0 unknown no
2565 23 blue-collar single unknown no 817 yes no cellular 18 may 123 1 -1 0 unknown no
2621 24 blue-collar single unknown no 431 yes no unknown 3 jun 108 12 -1 0 unknown no
3200 24 student single unknown no 3298 yes no unknown 28 may 227 1 -1 0 unknown no
1870 25 student single unknown no 10788 no no cellular 23 dec 102 2 210 2 other no
357 27 management single unknown no 3196 no no cellular 9 feb 10 2 -1 0 unknown no

Indexing Rows and Columns

Two main ways for indexing data frames

  • Label-based indexing with .loc
    • A label-based index can be
      • A single label (e.g., “age”)
      • A list or array of labels (e.g., [“age”, “job”, “loan”])
      • A slice with labels (e.g., “age”:”balance”)
      • A boolean array or list
  • Position-based indexing with .iloc (similar to NumPy array indexing)
    • A position-based index can be
      • An integer (e.g., 4)
      • A list or array of integers (e.g., [4, 2, 10])
      • A slice with integers (e.g., 2:10:2)
      • A boolean array or list
  • If you don’t want to index a dimension, leave its index empty or replace it with a colon (:)

label-based indexing with .loc

df.loc[row_lab_idx, col_lab_idx]

position-based indexing with .iloc

df.iloc[row_pos_idx, col_pos_idx]

get the first 10 rows and only columns at positions 2 to 4 (5 is excluded)

df.iloc[:10, 2:5]
marital education default
0 married primary no
1 married secondary no
2 single tertiary no
3 married tertiary no
4 married secondary no
5 single tertiary no
6 married tertiary no
7 married secondary no
8 married tertiary no
9 married primary no

using .loc to get rows and columns by label

df.loc[[1, 3, 5, 10, 20], ["job", "marital", "poutcome"]]
job marital poutcome
1 services married failure
3 management married unknown
5 management single failure
10 services married unknown
20 management divorced unknown

label slices include the end label!

df.loc[:, "balance":"duration"].head(3)
balance housing loan contact day month duration
0 1787 no no cellular 19 oct 79
1 4789 yes yes cellular 11 may 220
2 1350 yes no cellular 16 apr 185

Modifying a DataFrame’s Row Index

A DataFrame’s row index can be changed using the set_index() method

  • Use inplace=True in the method call to modify the original DataFrame
  • The new index can be
    • One or more existing columns (provide the list of names)
    • One or more arrays, serving as the new index (less common)
  • The new index can replace the existing one or expand it
  • The ability to modify the DataFrame’s index enables more interesting label-based indexing of the rows

use the marital column as the DataFrame’s row index (instead of the default RangeIndex)

df_new_idx = df.set_index("marital")
df_new_idx.head(3)
age job education default balance housing loan contact day month duration campaign pdays previous poutcome y
marital
married 30 unemployed primary no 1787 no no cellular 19 oct 79 1 -1 0 unknown no
married 33 services secondary no 4789 yes yes cellular 11 may 220 1 339 4 failure no
single 35 management tertiary no 1350 yes no cellular 16 apr 185 1 330 1 failure no

the new index can be used to filter more easily on marital status

df_new_idx.loc["single"].head()
age job education default balance housing loan contact day month duration campaign pdays previous poutcome y
marital
single 35 management tertiary no 1350 yes no cellular 16 apr 185 1 330 1 failure no
single 35 management tertiary no 747 no no cellular 23 feb 141 2 176 3 failure no
single 20 student secondary no 502 no no cellular 30 apr 261 1 -1 0 unknown yes
single 37 admin. tertiary no 2317 yes no cellular 20 apr 114 1 152 2 failure no
single 25 blue-collar primary no -221 yes no unknown 23 may 250 1 -1 0 unknown no

Multiple columns can be used as the (multi-level) row index

df_new_idx = df.set_index(["marital", "education"])
df_new_idx.head()
age job default balance housing loan contact day month duration campaign pdays previous poutcome y
marital education
married primary 30 unemployed no 1787 no no cellular 19 oct 79 1 -1 0 unknown no
secondary 33 services no 4789 yes yes cellular 11 may 220 1 339 4 failure no
single tertiary 35 management no 1350 yes no cellular 16 apr 185 1 330 1 failure no
married tertiary 30 management no 1476 yes yes unknown 3 jun 199 4 -1 0 unknown no
secondary 59 blue-collar no 0 yes no unknown 5 may 226 1 -1 0 unknown no

Hierarchical Indexing (MultiIndex)

A pandas DataFrame or Series can have a multi-level (hierarchical) index

marital_housing_counts = df[["marital", "housing"]].value_counts()
marital_housing_counts
marital   housing
married   yes        1625
          no         1172
single    yes         636
          no          560
divorced  yes         298
          no          230
dtype: int64

check the index

marital_housing_counts.index
MultiIndex([( 'married', 'yes'),
            ( 'married',  'no'),
            (  'single', 'yes'),
            (  'single',  'no'),
            ('divorced', 'yes'),
            ('divorced',  'no')],
           names=['marital', 'housing'])

label-based indexing on the 1st level

marital_housing_counts["married"]
housing
yes    1625
no     1172
dtype: int64

hierarchical label-based indexing on the 2 levels

marital_housing_counts["married"]["yes"]
1625
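
The same lookup can be done in one step by passing a tuple of labels to .loc (avoids chained indexing):

marital_housing_counts.loc[("married", "yes")]
1625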

Resetting a DataFrame’s Index

You can reset the DataFrame’s index to the default one by using the reset_index() method

  • By default, pandas will re-insert the index as columns in the dataset (use drop=True in the method call to drop it instead)
  • Use inplace=True in the method call to modify the original DataFrame directly
  • For a MultiIndex, you can select which levels to reset (level parameter)
df_new_idx = df.set_index(["marital", "education"])
df_new_idx.head(4)
age job default balance housing loan contact day month duration campaign pdays previous poutcome y
marital education
married primary 30 unemployed no 1787 no no cellular 19 oct 79 1 -1 0 unknown no
secondary 33 services no 4789 yes yes cellular 11 may 220 1 339 4 failure no
single tertiary 35 management no 1350 yes no cellular 16 apr 185 1 330 1 failure no
married tertiary 30 management no 1476 yes yes unknown 3 jun 199 4 -1 0 unknown no

reset the whole index

df_new_idx.reset_index().head(4)
marital education age job default balance housing loan contact day month duration campaign pdays previous poutcome y
0 married primary 30 unemployed no 1787 no no cellular 19 oct 79 1 -1 0 unknown no
1 married secondary 33 services no 4789 yes yes cellular 11 may 220 1 339 4 failure no
2 single tertiary 35 management no 1350 yes no cellular 16 apr 185 1 330 1 failure no
3 married tertiary 30 management no 1476 yes yes unknown 3 jun 199 4 -1 0 unknown no

only reset the education (2nd) level

df_new_idx.reset_index(level="education").head(4)
education age job default balance housing loan contact day month duration campaign pdays previous poutcome y
marital
married primary 30 unemployed no 1787 no no cellular 19 oct 79 1 -1 0 unknown no
married secondary 33 services no 4789 yes yes cellular 11 may 220 1 339 4 failure no
single tertiary 35 management no 1350 yes no cellular 16 apr 185 1 330 1 failure no
married tertiary 30 management no 1476 yes yes unknown 3 jun 199 4 -1 0 unknown no

Working with Variables

Data Cleaning

  • Toy data sets are clean and tidy
  • Real data sets are messy and dirty
    • Duplicates
    • Missing values
      • e.g., a sensor was offline or broken, a person didn’t answer a question in a survey, …
    • Outliers
      • e.g., extreme amounts, …
    • Value errors
      • e.g., negative ages, birthdates in the future, …
    • Inconsistent category encoding and spelling mistakes
      • e.g., “unemployed”, “Unemployed”, “Unemployd”, …
    • Inconsistent formats
      • e.g., 2020-11-19, 2020/11/12, 2020-19-11, …
  • If nothing is done → garbage in, garbage out!!!

Deduplicating Data

  • Use the duplicated() method to identify duplicated rows in a DataFrame (or values in a Series)
  • Use drop_duplicates() to remove duplicates from a DataFrame or Series
    • Use inplace=True to modify the original DataFrame or Series
    • Use the subset argument to limit the columns on which to search for duplicates
    • Use the keep argument to indicate which of the duplicates must be retained (first, last, or drop all duplicates)
df_persons = pd.read_csv("persons.csv")
df_persons
first last age children
0 John Doe 24 0.0
1 Jane Doe 21 1.0
2 NaN Trevor NaN 4.0
3 Undefined Smith 34 3.0
4 Will Snow Unknown NaN
5 Sarah Sanders 20 0.0
6 James Steward 45 NaN
7 Jane Doe 21 1.0
8 Will Tylor 21 1.0
df_persons.drop_duplicates()
first last age children
0 John Doe 24 0.0
1 Jane Doe 21 1.0
2 NaN Trevor NaN 4.0
3 Undefined Smith 34 3.0
4 Will Snow Unknown NaN
5 Sarah Sanders 20 0.0
6 James Steward 45 NaN
8 Will Tylor 21 1.0
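
A sketch of the subset and keep arguments (an illustration, deduplicating on the first name only and keeping the last occurrence):

df_persons.drop_duplicates(subset=["first"], keep="last")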

Dealing with Missing Values

Two main strategies for dealing with missing values

  • Remove rows (or columns) with missing values → viable when the data set is big (or if impacted columns are not important)
  • Replace the missing values
    • Using basic strategies (e.g., replace with a constant, replace with the column’s median, …)
    • Using advanced strategies (e.g., ML algorithms that infer missing values based on values of other columns)

/!\ The presence of missing values can have serious repercussions on the column data types!

age typed as a string (due to the “Unknown”), children typed as float due to NaN

df_persons.dtypes
first        object
last         object
age          object
children    float64
dtype: object
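
One way to repair the age column (an assumption on the desired handling): coerce it to numeric so that unparsable entries such as “Unknown” become NaN:

# errors="coerce" turns non-numeric entries into NaN
age_numeric = pd.to_numeric(df_persons["age"], errors="coerce")
age_numeric.dtype  # float64 (NaN forces a float dtype)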

Dropping Missing Values

Use the dropna() method to remove rows (or columns) with missing values. Important arguments:

  • axis: axis along which missing values will be removed
  • how: whether to remove a row or column if all values are missing (“all”) or if any value is missing (“any”)
  • subset: labels on other axis to consider when looking for missing values
  • inplace: if True, do the operation on the original object

drop rows with missing values on any of the columns

df_persons.dropna().head(10)
first last age children
0 John Doe 24 0.0
1 Jane Doe 21 1.0
3 Undefined Smith 34 3.0
5 Sarah Sanders 20 0.0
7 Jane Doe 21 1.0
8 Will Tylor 21 1.0

drop rows with missing values on either the first or age column (or both)

df_persons.dropna(subset=["first", "age"]).head(10)
first last age children
0 John Doe 24 0.0
1 Jane Doe 21 1.0
3 Undefined Smith 34 3.0
4 Will Snow Unknown NaN
5 Sarah Sanders 20 0.0
6 James Steward 45 NaN
7 Jane Doe 21 1.0
8 Will Tylor 21 1.0

drop rows with missing values on both first & age columns

df_persons.dropna(subset=["first", "age"], how="all").head(10)
first last age children
0 John Doe 24 0.0
1 Jane Doe 21 1.0
3 Undefined Smith 34 3.0
4 Will Snow Unknown NaN
5 Sarah Sanders 20 0.0
6 James Steward 45 NaN
7 Jane Doe 21 1.0
8 Will Tylor 21 1.0

Replacing Missing Values

Use the fillna() method to replace missing values in a DataFrame. Important arguments:

  • value: replacement value
  • axis: axis along which to fill missing values
  • inplace: if True, do the operation on the original DataFrame

fill missing values with the (constant) value -999

df_persons.fillna(-999).head()
first last age children
0 John Doe 24 0.0
1 Jane Doe 21 1.0
2 -999 Trevor -999 4.0
3 Undefined Smith 34 3.0
4 Will Snow Unknown -999.0
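
Instead of a constant, a basic strategy mentioned above is to fill with a column statistic such as the median (a sketch; the result is not assigned here):

df_persons["children"].fillna(df_persons["children"].median())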

Recasting Variables

  • Variables should be typed with the most appropriate data type
    • Binary variables should be encoded as booleans or 0, 1
    • Discrete quantitative variables should be encoded as integers
    • Depending on the intended goal, categorical features can be dummy-encoded
    • etc.
  • Use the convert_dtypes() method to let pandas attempt to infer the most appropriate data types for a data frame’s columns
  • Use the astype() method to recast columns (Series) to other types

/!\ The most appropriate data type is often task-dependent!

original data types in the bank data frame

df.dtypes
age           int64
job          object
marital      object
education    object
default      object
balance       int64
housing      object
loan         object
contact      object
day           int64
month        object
duration      int64
campaign      int64
pdays         int64
previous      int64
poutcome     object
y            object
dtype: object

data types after using the convert_dtypes() method

df_converted = df.convert_dtypes()
df_converted.dtypes
age           Int64
job          string
marital      string
education    string
default      string
balance       Int64
housing      string
loan         string
contact      string
day           Int64
month        string
duration      Int64
campaign      Int64
pdays         Int64
previous      Int64
poutcome     string
y            string
dtype: object

Converting binary (yes/no) variables to 0, 1

for col in ["default", "housing", "loan", "y"]:
    df[col] = (df[col] == "yes").astype(int)
    
df.head()
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y
0 30 unemployed married primary 0 1787 0 0 cellular 19 oct 79 1 -1 0 unknown 0
1 33 services married secondary 0 4789 1 1 cellular 11 may 220 1 339 4 failure 0
2 35 management single tertiary 0 1350 1 0 cellular 16 apr 185 1 330 1 failure 0
3 30 management married tertiary 0 1476 1 1 unknown 3 jun 199 4 -1 0 unknown 0
4 59 blue-collar married secondary 0 0 1 0 unknown 5 may 226 1 -1 0 unknown 0

Recasting categorical variables from strings to pandas’ CategoricalDtype

for col in ["job", "marital", "contact", "month", "poutcome"]:
    col_type = pd.CategoricalDtype(df[col].drop_duplicates())
    df[col] = df[col].astype(col_type)
    
edu_type = pd.CategoricalDtype(
    categories=["primary", "secondary", "tertiary", "unknown"],
    ordered=True
)
df["education"] = df["education"].astype(edu_type)
df.dtypes
age             int64
job          category
marital      category
education    category
default         int32
balance         int64
housing         int32
loan            int32
contact      category
day             int64
month        category
duration        int64
campaign        int64
pdays           int64
previous        int64
poutcome     category
y               int32
dtype: object

Sorting on education respects the category order now

(i) In practice, categorical variables are often left as strings and not encoded as CategoricalDtype.

df.sort_values(by="education")
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y
0 30 unemployed married primary 0 1787 0 0 cellular 19 oct 79 1 -1 0 unknown 0
1984 56 blue-collar married primary 0 551 0 0 unknown 29 may 27 1 -1 0 unknown 0
992 39 blue-collar married primary 0 879 1 0 unknown 26 may 102 2 -1 0 unknown 0
994 59 housemaid married primary 0 0 0 0 cellular 27 aug 76 11 -1 0 unknown 0
3255 55 blue-collar married primary 0 284 0 1 telephone 14 jul 252 4 -1 0 unknown 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
150 41 entrepreneur married unknown 0 89 1 0 unknown 6 may 333 2 -1 0 unknown 0
149 37 technician single unknown 0 391 1 0 unknown 8 may 103 3 -1 0 unknown 0
4337 73 retired married unknown 0 519 0 0 telephone 16 oct 434 1 57 1 failure 1
4342 50 blue-collar married unknown 0 2284 1 0 telephone 31 jul 1088 17 -1 0 unknown 1
1069 45 technician single unknown 0 7108 1 1 cellular 18 nov 53 1 172 3 failure 0

4521 rows × 17 columns

Creating New Features

  • Data analyses and machine learning often involve feature engineering, i.e., creating new features from existing ones based on domain knowledge (and intuition)
  • Examples of feature engineering
    • Extracting day of week, month, year, etc. from datetime variables
    • Reverse geocoding (i.e., creating country, state, department, etc. fields from geographical coordinates)
    • Binning
    • One-hot encoding
    • Log-transformation
    • etc.

new column with the count of the customer’s loans

df["loan_count"] = df["loan"] + df["housing"] 

is account balance greater than average balance?

df["richer_than_avg"] = (df["balance"] > df["balance"].mean()).astype(int)

is the customer a senior citizen?

df["senior"] = (df["age"] > 60).astype(int)

is the customer a young adult?

df["senior"] = (df["age"] <= 25).astype(int)
df
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y loan_count richer_than_avg senior
0 30 unemployed married primary 0 1787 0 0 cellular 19 oct 79 1 -1 0 unknown 0 0 1 0
1 33 services married secondary 0 4789 1 1 cellular 11 may 220 1 339 4 failure 0 2 1 0
2 35 management single tertiary 0 1350 1 0 cellular 16 apr 185 1 330 1 failure 0 1 0 0
3 30 management married tertiary 0 1476 1 1 unknown 3 jun 199 4 -1 0 unknown 0 2 1 0
4 59 blue-collar married secondary 0 0 1 0 unknown 5 may 226 1 -1 0 unknown 0 1 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4516 33 services married secondary 0 -333 1 0 cellular 30 jul 329 5 -1 0 unknown 0 1 0 0
4517 57 self-employed married tertiary 1 -3313 1 1 unknown 9 may 153 1 -1 0 unknown 0 2 0 0
4518 57 technician married secondary 0 295 0 0 cellular 19 aug 151 11 -1 0 unknown 0 0 0 0
4519 28 blue-collar married secondary 0 1137 0 0 cellular 6 feb 129 4 211 3 other 0 0 0 0
4520 44 entrepreneur single tertiary 0 1136 1 1 cellular 3 apr 345 2 249 7 other 0 2 0 0

4521 rows × 20 columns
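
The feature-engineering list above also mentions binning and log-transformation; a minimal sketch of both on a copy of the data frame (the bin edges and labels are arbitrary choices):

import numpy as np

df_fe = df.copy()
# binning: discretize age into labeled intervals
df_fe["age_group"] = pd.cut(df_fe["age"], bins=[0, 25, 40, 60, 120],
                            labels=["young", "adult", "middle-aged", "senior"])
# log-transformation of a skewed, positive variable
df_fe["log_duration"] = np.log1p(df_fe["duration"])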

One-Hot Encoding

  • Machine learning tasks often require one-hot encoding of categorical variables
    • The variable is transformed into multiple columns
    • Each column represents a category of the variable
    • A 1 on a column indicates which category the original variable had
    • All the other categories’ columns contain 0
  • Use the get_dummies() function to one-hot encode categorical columns

Dummy encode the education variable, join the dummy variables to the data frame (more on joins later), and drop the original column

# join() and drop() return new DataFrames; assign the result to keep it
df_encoded = df.join(pd.get_dummies(df["education"], prefix="education")).drop("education", axis=1)
# df itself is unchanged:
df
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y loan_count richer_than_avg senior
0 30 unemployed married primary 0 1787 0 0 cellular 19 oct 79 1 -1 0 unknown 0 0 1 0
1 33 services married secondary 0 4789 1 1 cellular 11 may 220 1 339 4 failure 0 2 1 0
2 35 management single tertiary 0 1350 1 0 cellular 16 apr 185 1 330 1 failure 0 1 0 0
3 30 management married tertiary 0 1476 1 1 unknown 3 jun 199 4 -1 0 unknown 0 2 1 0
4 59 blue-collar married secondary 0 0 1 0 unknown 5 may 226 1 -1 0 unknown 0 1 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4516 33 services married secondary 0 -333 1 0 cellular 30 jul 329 5 -1 0 unknown 0 1 0 0
4517 57 self-employed married tertiary 1 -3313 1 1 unknown 9 may 153 1 -1 0 unknown 0 2 0 0
4518 57 technician married secondary 0 295 0 0 cellular 19 aug 151 11 -1 0 unknown 0 0 0 0
4519 28 blue-collar married secondary 0 1137 0 0 cellular 6 feb 129 4 211 3 other 0 0 0 0
4520 44 entrepreneur single tertiary 0 1136 1 1 cellular 3 apr 345 2 249 7 other 0 2 0 0

4521 rows × 20 columns

Grouping and Aggregating Data

Group By

Group by refers to a 3-step process

  1. Splitting the data into groups based on some criteria (e.g., by marital status)
  2. Applying a function to each group separately
    • Aggregation (e.g., computing summary statistics)
    • Transformation (e.g., standardization, NA filling, …)
    • Filtering (e.g., remove groups with few rows or based on group aggregates, …)
  3. Combining the results into a data structure (a DataFrame most of the time)

(i) If this sounds familiar to you, it is because this is how value_counts() works.

(figure: the group-by split-apply-combine process)

(i) Refer to the documentation for details on DataFrameGroupBy objects and possible aggregations and transformations.

Example: group by marital status and calculate the min, median, and max balance as well as the median age for each group

df.groupby("marital").aggregate(
    # aggregate based on a dict of col_name : 
    # aggregates_list 
    {"balance" : ["min", "median", "max"],
    "age": "median"}
)
balance age
min median max median
marital
married -3313 452.0 71188 42
single -1313 462.0 27733 32
divorced -1148 367.5 26306 45

Example: group by marital status and education level, then calculate median balance, age mean and standard deviation, and number of rows in each group

df.groupby(["marital", "education"]).aggregate(
    # aggregate based on kwarg = (column, aggregate) 
    balance_median = ("balance", "median"),
    age_mean = ("age", "mean"),
    age_std = ("age", "std"),
    count = ("marital", "count")
).reset_index()
# reset idx to move marital & education back as cols
marital education balance_median age_mean age_std count
0 married primary 400.5 47.511407 10.594826 526
1 married secondary 406.0 42.404345 10.113719 1427
2 married tertiary 593.0 41.777166 9.574839 727
3 married unknown 559.0 48.444444 9.587568 117
4 single primary 538.0 37.013699 9.888957 73
5 single secondary 376.0 33.052545 7.120707 609
6 single tertiary 613.5 34.512821 7.217769 468
7 single unknown 526.5 34.652174 9.175371 46
8 divorced primary 328.0 51.392405 11.339083 79
9 divorced secondary 319.5 43.496296 9.333056 270
10 divorced tertiary 442.0 45.148387 9.340454 155
11 divorced unknown 774.0 50.375000 10.672242 24

Reshaping Dataframes

Pivoting

Pivoting is useful when studying how a given (numeric) variable is conditioned by two or more (discrete) variables
- The conditioning variables’ values are used as dimensions (row and column indexes)
- The cells contain the values of the conditioned variable for the corresponding dimensions

(figure: pivoting illustration)

df_aggs = df.groupby(["marital", "education"]).aggregate(
    balance_median = ("balance", "median")
).reset_index()
df_aggs
marital education balance_median
0 married primary 400.5
1 married secondary 406.0
2 married tertiary 593.0
3 married unknown 559.0
4 single primary 538.0
5 single secondary 376.0
6 single tertiary 613.5
7 single unknown 526.5
8 divorced primary 328.0
9 divorced secondary 319.5
10 divorced tertiary 442.0
11 divorced unknown 774.0
df_aggs_pivoted = df_aggs.pivot(
    index="marital", columns="education", values="balance_median"
)
df_aggs_pivoted
education primary secondary tertiary unknown
marital
married 400.5 406.0 593.0 559.0
single 538.0 376.0 613.5 526.5
divorced 328.0 319.5 442.0 774.0
responses_per_marital = df.groupby(["marital", "y"]).aggregate(
    count = ("y", "count")
).reset_index()
responses_per_marital
marital y count
0 married 0 2520
1 married 1 277
2 single 0 1029
3 single 1 167
4 divorced 0 451
5 divorced 1 77
responses_pivoted = responses_per_marital.pivot(index="marital", columns="y", values="count")
responses_pivoted
y 0 1
marital
married 2520 277
single 1029 167
divorced 451 77
response_pct = 100 * responses_pivoted.divide(responses_pivoted.sum(axis="columns"), axis="index")
response_pct
y 0 1
marital
married 90.096532 9.903468
single 86.036789 13.963211
divorced 85.416667 14.583333

Melting

Melting can be seen as the inverse of pivoting.

(figure: melting illustration)

df_aggs_pivoted
education primary secondary tertiary unknown
marital
married 400.5 406.0 593.0 559.0
single 538.0 376.0 613.5 526.5
divorced 328.0 319.5 442.0 774.0
df_aggs_pivoted.dtypes
education
primary      float64
secondary    float64
tertiary     float64
unknown      float64
dtype: object
# The straightforward melt raises an error here (see the note below):
# df_aggs_pivoted.reset_index().melt(id_vars="marital", value_name="balance_median")
df_aggs_pivoted.convert_dtypes().reset_index().melt(id_vars="marital", value_name="balance_median")
marital variable balance_median
0 married primary 400.5
1 single primary 538.0
2 divorced primary 328.0
3 married secondary 406.0
4 single secondary 376.0
5 divorced secondary 319.5
6 married tertiary 593.0
7 single tertiary 613.5
8 divorced tertiary 442.0
9 married unknown 559.0
10 single unknown 526.5
11 divorced unknown 774.0
# Why the extra convert_dtypes()? Without it, the melt() above doesn't work.
# This could be related to the fact that pandas carries over the categorical
# dtypes created earlier (education was recast to a CategoricalDtype), and
# setting a value that is not among the allowed categories (because it was
# not previously observed) raises an error.
# A simple fix could be to force the problematic columns to be plain string
# columns instead of pandas-categorical ones, e.g. (generic sketch):
#
# for col in problematic_cols:
#     df_copy[col] = df_copy[col].astype(str)

Cross Tabulations

Use the crosstab() function to compute cross tabulations (i.e., co-occurrence counts) of two or more categorical Series

  • Can be normalized on rows, columns, etc. using the normalize argument
  • Can be marginalized by passing margins=True in the function call
# normalize on rows -> yes/no percentages
# within each category of marital

# margins=True adds an "All" row ->
# yes/no percentages over the whole data frame

100 * pd.crosstab(df["marital"], df["y"],
                 normalize="index",
                 margins=True)
y no yes
marital
divorced 85.416667 14.583333
married 90.096532 9.903468
single 86.036789 13.963211
All 88.476001 11.523999

Other Reshaping Operations

Other reshaping operations include

  • Stacking and unstacking
  • Pivot tables (generalization of the simple pivot)
  • Exploding list-like columns
  • etc.
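
For instance, pivot_table() can replace the earlier group-by + pivot combination with a single call (a sketch reproducing the balance_median pivot):

df.pivot_table(values="balance", index="marital",
               columns="education", aggfunc="median")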

Working with Multiple Tables

  • Real data sets are often organized in multiple data tables
  • Each table describes one entity type
    • e.g., a table describes customers, another table describes products, and a third table describes purchases
  • Entities can reference other entities they are related to
  • In order to conduct your analysis, you need to “patch” these tables together

(figure: the customers, purchases, and products tables)

customers = pd.read_csv("retail/customers.csv")
customers.head()
customer_id customer_name
0 1 Clark Kent
1 2 Diana Prince
2 3 Bruce Wayne
3 4 Peter Parker
4 5 Natasha Romanoff
purchases = pd.read_csv("retail/purchases.csv")
purchases.head()
customer_id product_id purchase_date
0 1 1 2013-11-15
1 5 4 2012-12-20
2 2 2 2019-07-05
3 1 2 2019-11-24
4 7 2 2020-01-24
products = pd.read_csv("retail/products.csv")
products.head()
id product_name release_date
0 1 PlayStation 4 2013-11-15
1 2 Nintendo Switch 2017-03-03
2 3 XBox One X 2017-11-07
3 4 Nintendo Wii U 2012-11-18

Merging Data Frames

  • Use the merge() method to merge a DataFrame with another DataFrame (or Series)
  • The merge is done with a database-style (SQL) join
  • Usually based on one or more common columns (e.g., the customer_id column in both customers and purchases)
    • If a row from the left object and a row from the right object have matching values for the join columns → a row combining the two is produced
    • If no match is found → output depends on the join type
      • Inner join → no row is produced
      • Left join → for left rows with no match, produce a row (with NA filled right row)
      • Right join → for right rows with no match, produce a row (with NA filled left row)
      • Outer join → combination of left and right join
  • Joins can also be performed on rows (less common)

Inner Join

customers.merge(purchases, on="customer_id")
customer_id customer_name product_id purchase_date
0 1 Clark Kent 1 2013-11-15
1 1 Clark Kent 2 2019-11-24
2 2 Diana Prince 2 2019-07-05
3 5 Natasha Romanoff 4 2012-12-20

Left join

customers.merge(purchases, on="customer_id", how="left")
customer_id customer_name product_id purchase_date
0 1 Clark Kent 1.0 2013-11-15
1 1 Clark Kent 2.0 2019-11-24
2 2 Diana Prince 2.0 2019-07-05
3 3 Bruce Wayne NaN NaN
4 4 Peter Parker NaN NaN
5 5 Natasha Romanoff 4.0 2012-12-20

Right join

customers.merge(purchases, on="customer_id", how="right")
customer_id customer_name product_id purchase_date
0 1 Clark Kent 1 2013-11-15
1 5 Natasha Romanoff 4 2012-12-20
2 2 Diana Prince 2 2019-07-05
3 1 Clark Kent 2 2019-11-24
4 7 NaN 2 2020-01-24

Outer join

customers.merge(purchases, on="customer_id", how="outer")
customer_id customer_name product_id purchase_date
0 1 Clark Kent 1.0 2013-11-15
1 1 Clark Kent 2.0 2019-11-24
2 2 Diana Prince 2.0 2019-07-05
3 3 Bruce Wayne NaN NaN
4 4 Peter Parker NaN NaN
5 5 Natasha Romanoff 4.0 2012-12-20
6 7 NaN 2.0 2020-01-24
  • Multiple tables can be merged together (consecutively)
  • Sometimes, the merge is on columns that do not have the same names (e.g., the id column in products and the product_id column in purchases)
    • Use the left_on and right_on arguments to specify the column names in the left and right data frames respectively
customers.merge(
    # if the merge is on columns that have the
    # same name -> the on arg is optional
    purchases,
).merge(
    products,
    # name of column on left side
    left_on="product_id",
    # name of column on right side
    right_on="id"
).drop("id", axis="columns")
# drop the duplicated column
customer_id customer_name product_id purchase_date product_name release_date
0 1 Clark Kent 1 2013-11-15 PlayStation 4 2013-11-15
1 1 Clark Kent 2 2019-11-24 Nintendo Switch 2017-03-03
2 2 Diana Prince 2 2019-07-05 Nintendo Switch 2017-03-03
3 5 Natasha Romanoff 4 2012-12-20 Nintendo Wii U 2012-11-18

Other tips & tricks

Configure Options & Settings at Interpreter Startup

Highlight all negative values in a dataframe

def color_negative_red(val):
    color = 'red' if val < 0 else 'black'
    return 'color: %s' % color


df = pd.DataFrame(dict(col_1=[1.53,-2.5,3.53], 
                       col_2=[-4.1,5.9,0])
                 )
df.style.applymap(color_negative_red)
  col_1 col_2
0 1.530000 -4.100000
1 -2.500000 5.900000
2 3.530000 0.000000

Pandas options

pd.options.display.max_columns = 50  # None -> No Restrictions  
pd.options.display.max_rows = 200    # None -> Be careful with this   
pd.options.display.max_colwidth = 100  
pd.options.display.precision = 3  

Add row total and column total to a numerical dataframe

df = pd.DataFrame(dict(A=[2,6,3],
                       B=[2,2,6], 
                       C=[3,2,3]))

df['col_total']     = df.apply(lambda x: x.sum(), axis=1)
df.loc['row_total'] = df.apply(lambda x: x.sum())

df
A B C col_total
0 2 2 3 7
1 6 2 2 10
2 3 6 3 12
row_total 11 10 8 29

Check memory usage

df.memory_usage(deep=True)
Index        170
A             32
B             32
C             32
col_total     32
dtype: int64

Cumulative sum

df = pd.DataFrame(dict(A=[2,6,3],
                       B=[2,2,6], 
                       C=[3,2,3]))

df['cumulative_sum'] = df['A'].cumsum()
df
A B C cumulative_sum
0 2 2 3 2
1 6 2 2 8
2 3 6 3 11

Crosstab
When you need to count the frequencies for groups formed by 3+ features, pd.crosstab() can make your life easier.

df = pd.DataFrame(dict(
    departure=['SFO', 'SFO', 'LAX', 'LAX', 'JFK', 'SFO'],
    arrival=['ORD', 'DFW', 'DFW', 'ATL', 'ATL', 'ORD'],
    airlines=['Delta', 'JetBlue', 'Delta', 'AA', 'Southwest', 'Delta']
))

pd.crosstab(
    index=[df['departure'], df['airlines']],
    columns=[df['arrival']],
    rownames=['departure', 'airlines'],
    colnames=['arrival'],
    margins=True # add subtotal
)
arrival ATL DFW ORD All
departure airlines
JFK Southwest 1 0 0 1
LAX AA 1 0 0 1
Delta 0 1 0 1
SFO Delta 0 0 2 2
JetBlue 0 1 0 1
All 2 2 2 6

credits for this tip: Shiu-Tang Li