Data Cleaning with pandas

Prepare data for analysis with Python.

About this course

Data cleaning is a critical step for any data science, machine learning, statistical, or analytics project. Before you attempt elaborate data transformations, feature selection, statistical analysis, or visualizations, you first need to consider how to handle missing values, outliers, duplicates, improper formatting, and other problems that come with raw data. Many practitioners skip these basic steps and rush to sophisticated tasks like machine learning, which can be costly to any project.

This course will cover the basics of pruning, cleaning, and formatting data through tasks like dataframe selection, filtering, outlier removal, coalescing blanks, and formatting data types. Afterwards, you will be prepared to handle more advanced areas in pandas like data transformation, feature selection, and machine learning. 
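To give a feel for these tasks, here is a minimal sketch in pandas. It is not taken from the course notebooks: the table, the column names, and the 0 to 120 age range are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical raw data: a duplicated row, blanks, numbers stored as text,
# and one implausible age
raw = pd.DataFrame({
    "customer": ["Ann", "Bob", "Bob", "Cy", "Dee"],
    "age": ["34", "41", "41", None, "250"],
    "city": ["Austin", "Boston", "Boston", None, "Plano"],
})

cleaned = (
    raw.drop_duplicates()  # remove the repeated "Bob" row
       .assign(
           # cast text to numbers; values that cannot be parsed become NaN
           age=lambda df: pd.to_numeric(df["age"], errors="coerce"),
           # coalesce blank cities to a default label
           city=lambda df: df["city"].fillna("Unknown"),
       )
)

# Keep rows whose age is missing or plausible (the 0-120 range is an assumed rule)
cleaned = cleaned[cleaned["age"].isna() | cleaned["age"].between(0, 120)]
cleaned = cleaned.reset_index(drop=True)

print(cleaned)
```

Whether an implausible value like 250 should be dropped, capped, or corrected depends on the data set; the filter above simply shows one option.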

What you'll learn—and how you can apply it

By the end of this course, you’ll understand:

  • What constitutes data cleaning and why it is necessary
  • Techniques for dealing with missing values and outliers
  • When data should be modified versus removed

And you'll be able to:

  • Take raw inputs and sanitize them for more sophisticated tasks
  • Strategize how to handle outliers, missing values, and bad data 
  • Cast grungy values into proper data types including freeform text, dates, and times
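As a rough illustration of the last item, the sketch below casts messy text and date strings into proper pandas types. The sample values and column names are hypothetical, and the course may approach this differently.

```python
import pandas as pd

# Hypothetical messy input: one valid date, one impossible date, one non-date,
# and plan names with inconsistent case and stray whitespace
raw = pd.DataFrame({
    "signup": ["2023-01-15", "2023-02-30", "not a date"],
    "plan": [" Basic", "PREMIUM ", "basic"],
})

typed = raw.assign(
    # parse dates; anything that does not match the format becomes NaT
    signup=pd.to_datetime(raw["signup"], format="%Y-%m-%d", errors="coerce"),
    # normalize the text, then store it as a categorical column
    plan=raw["plan"].str.strip().str.lower().astype("category"),
)

print(typed.dtypes)
print(typed)
```

Using `errors="coerce"` turns unparseable entries into missing values instead of raising an error, so they can be handled together with other missing data.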

This training is for you because:

  • You’re a spreadsheet user looking for a better way to clean data.
  • You work with data science professionals seeking more usable data.
  • You want to become a data professional who can transform raw data into usable formats.

Prerequisites

Recommended preparation:

Setup 

To follow along using your desktop IDE:

  1. Install or update to the latest version of Anaconda
  2. Launch your command line tool and configure your conda environment

     For macOS and Linux users: search for and launch Terminal on your system.

     For Windows users: locate and launch Anaconda Prompt on your system.

  3. (Optional but recommended) From the command line, run the following commands to create and activate a new environment:

     conda create --name NEW_ENV_NAME

     conda activate NEW_ENV_NAME

  4. Install the required packages from the command line:

     conda install matplotlib pandas seaborn

  5. Launch JupyterLab from the command line:

     jupyter lab
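Once JupyterLab is running, a quick check like the following, pasted into a notebook cell, can confirm that the packages from step 4 are visible to the active environment. The helper name `check_packages` is hypothetical and only uses the Python standard library.

```python
from importlib.util import find_spec

# Packages installed in step 4 of the setup instructions
REQUIRED = ["matplotlib", "pandas", "seaborn"]


def check_packages(names):
    """Return a dict mapping each package name to True if it can be located."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_packages(REQUIRED).items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

If a package is reported as missing, the notebook is probably running in a different conda environment from the one where the packages were installed.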

To open Anaconda Notebooks:

  1. Go to https://anaconda.cloud
  2. Click on 'Notebooks' from the top navigation menu
  3. Create an account or log in if you already have one

The Notebooks for this course are also available at this public GitHub link

Facilitator Bio

Thomas Nield is the founder of Nield Consulting Group and Yawman Flight, as well as an instructor at the University of Southern California. He enjoys making technical content relatable and relevant to those unfamiliar with or intimidated by it. Thomas regularly teaches classes on data analysis, machine learning, mathematical optimization, and practical artificial intelligence. At USC he teaches AI System Safety, developing systematic approaches for identifying AI-related hazards in aviation and ground vehicles. He has authored three books, including Essential Math for Data Science (O'Reilly) and Getting Started with SQL (O'Reilly).

He is also the founder and inventor of Yawman Flight, a company developing universal handheld flight controls for flight simulation and unmanned aerial vehicles. You can find him at: 

Nield Consulting Group

Yawman Flight

Twitter

LinkedIn

GitHub 

YouTube

 

Questions? Issues? Join our Community page to get help. 

Curriculum (03:05:42)

  • Getting started with Anaconda Notebooks 00:01:02
  • pandas basics
      • pandas overview 00:05:53
      • pandas DataFrames 00:06:39
      • Importing data in pandas + exercise 00:07:07
  • Selecting rows and columns
      • Selecting rows and columns 00:08:36
      • Dropping rows by condition 00:09:13
      • Updating data + exercise 00:08:00
  • Sorting, mapping, and categories
      • Sorting, casting, and categories 00:07:58
      • Categories + exercise 00:07:30
  • Removing duplicative and sparse data
      • Removing duplicative and sparse data 00:08:08
      • Remove columns with one value 00:08:15
      • Remove columns with low variance + exercise 00:07:37
  • Handling missing data
      • Handling missing data 00:07:52
      • Removing rows with missing values 00:07:02
      • Fill in missing values with nearest neighbor + exercise 00:07:37
  • Outliers
      • Outliers 00:12:07
      • Using local outlier factor (LOF) + exercise 00:06:00
  • Dates and times
      • Dates and times 00:11:36
      • Filtering on datetimes 00:11:04
      • Dates and times exercise 00:01:59
  • Wrangling text
      • Wrangling text 00:05:17
      • Regular expression (RegEx) basics 00:09:31
      • Partial and full string matches 00:08:31
      • Finding all matches + exercise 00:07:13
  • Conclusion
      • Conclusion 00:03:55
  • End of course survey
