Data Cleaning with pandas

About this course

Data cleaning is a critical step for any data science, machine learning, statistical, or analytics project. Before you attempt elaborate data transformations, feature selection, statistical analysis, or visualizations, you first need to decide how to handle missing values, outliers, duplicates, improper formatting, and other problems that come with raw data. Many practitioners skip these basic steps and rush to sophisticated tasks like machine learning, which can be costly to any project.

This course will cover the basics of pruning, cleaning, and formatting data through tasks like dataframe selection, filtering, outlier removal, coalescing blanks, and formatting data types. Afterwards, you will be prepared to handle more advanced areas in pandas like data transformation, feature selection, and machine learning.
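
As a rough illustration of the tasks named above, the following is a minimal sketch rather than the course's own material. The DataFrame, its column names, and its values are invented for the example, and the 1.5 * IQR rule is just one common convention for flagging outliers.

```python
import pandas as pd

# Hypothetical raw data; names and values are made up for illustration.
raw = pd.DataFrame({
    "customer": ["Ann", "Bob", "Bob", "Cy", "Dee", "Eve"],
    "amount": ["10.5", "12.0", "12.0", None, "11.0", "950.0"],
    "region": ["west", "east", "east", "east", None, "west"],
})

# Remove exact duplicate rows (the second "Bob" row).
df = raw.drop_duplicates().copy()

# Format data types: cast the text column to numbers.
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

# Coalesce blanks: replace missing regions with a placeholder label.
df["region"] = df["region"].fillna("unknown")

# Filter out rows that still have no amount.
df = df.dropna(subset=["amount"])

# Remove outliers that fall outside 1.5 * IQR of the middle 50% of values.
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = df["amount"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean = df[in_range].reset_index(drop=True)

print(clean)
```

Whether a value such as 950.0 is truly an error or a legitimate extreme depends on the data, so a rule like this is a starting point for inspection, not an automatic verdict.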

What you'll learn—and how you can apply it

By the end of this course, you’ll understand:

What constitutes data cleaning and why it is necessary
Techniques for dealing with missing values and outliers
When data should be modified versus removed

And you'll be able to:

Take raw inputs and sanitize them for more sophisticated tasks
Strategize how to handle outliers, missing values, and bad data
Cast grungy values into proper data types including freeform text, dates, and times
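
To make the casting and "modify versus remove" ideas above concrete, here is a small sketch under stated assumptions: the data is invented, and filling with the median is only one possible way to modify a missing value.

```python
import pandas as pd

# Hypothetical messy input; the values are invented for illustration.
raw = pd.DataFrame({
    "name": ["  alice ", "BOB", "carol  "],
    "signup": ["2023-01-15", "not a date", "2023-03-02"],
    "visits": ["3", "n/a", "7"],
})

typed = raw.copy()

# Freeform text: trim whitespace and normalize capitalization.
typed["name"] = typed["name"].str.strip().str.title()

# Dates: entries that cannot be parsed become NaT instead of raising.
typed["signup"] = pd.to_datetime(
    typed["signup"], format="%Y-%m-%d", errors="coerce"
)

# Numbers: entries that cannot be parsed become NaN.
typed["visits"] = pd.to_numeric(typed["visits"], errors="coerce")

# Option 1, modify: fill the missing visit count with the column median.
modified = typed.assign(
    visits=typed["visits"].fillna(typed["visits"].median())
)

# Option 2, remove: drop rows whose signup date could not be parsed.
removed = typed.dropna(subset=["signup"])

print(modified)
print(removed)
```

Using `errors="coerce"` turns bad values into explicit missing values, which keeps the decision about filling or dropping them as a separate, visible step.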

This training is for you because:

You’re a spreadsheet user looking for a better way to clean data.
You work with data science professionals seeking more usable data.
You want to become a data professional who can transform raw data into usable formats.

Prerequisites

Basic Python proficiency (variables, loops, collections, operators, etc.) or completion of the Introduction to Python Programming Learning Path
Basic pandas proficiency is recommended, but not required

Recommended preparation:

Introduction to pandas for Data Analysis course

Setup

To follow along using your desktop IDE:

1. Install or update to the latest version of Anaconda

2. Launch your command line tool and configure your conda environment

For macOS and Linux users: Search and launch Terminal in your system

For Windows users: Locate and launch Anaconda Prompt in your system

3. (Optional but recommended) From the command line, run the following commands to create and activate a new environment

conda create --name NEW_ENV_NAME

conda activate NEW_ENV_NAME

4. Install required packages in the command line

conda install matplotlib pandas seaborn

5. Launch JupyterLab from the command line

jupyter lab
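
Once JupyterLab is running, one optional way to confirm that the packages installed above are visible to your environment is a short check like the one below. This is a sketch that is not part of the course materials; it uses only the Python standard library and the package names from the install step.

```python
from importlib.util import find_spec

# Packages installed in the setup step above.
required = ["matplotlib", "pandas", "seaborn"]

# find_spec returns None when a package cannot be found in this environment.
missing = [name for name in required if find_spec(name) is None]

if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

If anything is reported missing, check that the notebook is using the conda environment you activated before installing.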

To open Anaconda Notebooks:

Go to https://anaconda.cloud
Click on 'Notebooks' from the top navigation menu
Create an account or log in if you already have one

The Notebooks for this course are also available at this public GitHub link.

Facilitator Bio

Thomas Nield is the founder of Nield Consulting Group and Yawman Flight, as well as an instructor at the University of Southern California. He enjoys making technical content relatable and relevant to those unfamiliar with or intimidated by it. Thomas regularly teaches classes on data analysis, machine learning, mathematical optimization, and practical artificial intelligence. At USC he teaches AI System Safety, developing systematic approaches for identifying AI-related hazards in aviation and ground vehicles. He's authored three books, including Essential Math for Data Science (O’Reilly) and Getting Started with SQL (O'Reilly).

He is also the inventor behind Yawman Flight, a company developing universal handheld flight controls for flight simulation and unmanned aerial vehicles. You can find him at:

Nield Consulting Group

Questions? Issues? Join our Community page to get help.

Curriculum 03:05:42

Getting started with Anaconda Notebooks 00:01:02
pandas basics
pandas overview 00:05:53
pandas DataFrames 00:06:39
Importing data in pandas + exercise 00:07:07
Selecting rows and columns
Selecting rows and columns 00:08:36
Dropping rows by condition 00:09:13
Updating data + exercise 00:08:00
Sorting, mapping, and categories
Sorting, casting, and categories 00:07:58
Categories + exercise 00:07:30
Removing duplicative and sparse data
Removing duplicative and sparse data 00:08:08
Remove columns with one value 00:08:15
Remove columns with low variance + exercise 00:07:37
Handling missing data
Handling missing data 00:07:52
Removing rows with missing values 00:07:02
Fill in missing values with nearest neighbor + exercise 00:07:37
Outliers
Outliers 00:12:07
Using local outlier factor (LOF) + exercise 00:06:00
Dates and times
Dates and times 00:11:36
Filtering on datetimes 00:11:04
Dates and times exercise 00:01:59
Wrangling text
Wrangling text 00:05:17
Regular expression (RegEx) basics 00:09:31
Partial and full string matches 00:08:31
Finding all matches + exercise 00:07:13
Conclusion
Conclusion 00:03:55
End of course survey
