Data Preparation for Large Language Models

Transforming and cleaning data for LLMs.

About this course

Large language models (LLMs) have come out of left field and taken everyone by surprise in recent years. From ChatGPT to Google Bard, it is hard to ignore how far machine learning has come in producing human-like text from large corpora of text documents.

However, textual data can come from diverse sources, including books, online articles, social media, and internal documents. Natural language is messy and not readily understood by LLMs. In this hands-on course, you'll learn fundamental techniques for cleaning and vectorizing text data so it can be used by LLMs. We will work through many code examples using Python and scikit-learn, moving from bag-of-words models to word embeddings.
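
As a small taste of the code the course walks through, here is a minimal bag-of-words sketch using scikit-learn's CountVectorizer; the tiny corpus below is an illustrative placeholder, not a course dataset:

    from sklearn.feature_extraction.text import CountVectorizer

    # A tiny illustrative corpus; the course works with real-world datasets.
    corpus = [
        "Natural language is messy.",
        "LLMs need clean, vectorized text.",
        "Messy text should be cleaned before vectorizing.",
    ]

    # CountVectorizer lowercases, tokenizes, and counts each word per document.
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(corpus)

    print(vectorizer.get_feature_names_out())  # the learned vocabulary
    print(X.toarray())                         # the document-term count matrix

Each row of the resulting matrix represents one document, and each column counts one vocabulary word, which is exactly the representation the bag-of-words lessons build up to.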

What you'll learn—and how you can apply it

By the end of this hands-on course, you’ll understand:

  • The significance of curating and processing textual data for LLMs.
  • How to clean and prepare textual data.
  • How different vectorization models apply to different language model problems.

And you’ll be able to:

  • Leverage Python libraries to create bag-of-words and word embedding models (see the Word2Vec sketch following this list).
  • Get hands-on insights into how LLMs work.
  • Turn words and documents into mathematical representations appropriate for machine learning.
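
Here is that word embedding sketch, using Gensim's Word2Vec and assuming Gensim 4.x; the tokenized sentences and parameter values below are illustrative placeholders:

    from gensim.models import Word2Vec

    # Pre-tokenized sentences; a stand-in for a real, cleaned corpus.
    sentences = [
        ["data", "preparation", "matters", "for", "llms"],
        ["clean", "text", "before", "training"],
        ["word", "embeddings", "capture", "meaning"],
    ]

    # Train a small Word2Vec model; vector_size, window, and epochs are toy settings.
    model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=20)

    print(model.wv["embeddings"][:5])             # first five dimensions of one word vector
    print(model.wv.most_similar("text", topn=2))  # nearest neighbors by cosine similarity

In practice, word vectors only become meaningful with far more training data than this; the Gensim lessons in the course work through that at a realistic scale.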

This training is for you because…

  • You are a Pythonista wanting to understand how LLM data preparation works, including document ingestion.
  • You are a budding data science or machine learning practitioner wanting enhanced mastery over language modeling.
  • You find chatbots fascinating and want to take the first steps toward building your own.

Prerequisites

  • Basic Python proficiency (variables, loops, functions, library usage).
  • Basic NumPy proficiency (array declaration and array manipulation).
  • Experience with scikit-learn is helpful, but not required.

Setup

To open Anaconda Notebooks:

  1. Go to https://anaconda.cloud
  2. Click on 'Notebooks' from the top navigation menu
  3. Create an account, or log in if you already have one

Facilitator bio

Thomas Nield is the founder of Nield Consulting Group and Yawman Flight, as well as an instructor at the University of Southern California. He enjoys making technical content relatable and relevant to those unfamiliar with or intimidated by it. Thomas regularly teaches classes on data analysis, machine learning, mathematical optimization, and practical artificial intelligence. At USC, he teaches AI System Safety, developing systematic approaches for identifying AI-related hazards in aviation and ground vehicles. He has authored three books, including Essential Math for Data Science (O'Reilly) and Getting Started with SQL (O'Reilly).

At Yawman Flight, he invents and develops universal handheld flight controls for flight simulation and unmanned aerial vehicles. You can find him on Twitter | LinkedIn | GitHub | YouTube.

Questions? Issues? Contact [email protected].

Curriculum (total: 02:43:08)

Getting Started
  • How to Use Anaconda Notebooks 00:01:02
  • Course Overview and Learning Objectives 00:02:25

Introduction to Language Model Data
  • What is Natural Language Processing? 00:09:53
  • What Are Large Language Models? 00:07:28
  • How Computers See Text 00:09:23
  • Strengths and Limits of Large Language Models 00:04:04
  • Concerns in Procuring LLM Data 00:04:41
  • Exercise: Procuring LLM Data 00:02:37

Cleaning Data for Language Models
  • Text Cleaning 00:09:12
  • Manual Tokenization 00:05:40
  • Using the Natural Language Toolkit (NLTK) 00:05:07
  • Stemming 00:02:19
  • Using spaCy 00:06:18
  • Exercise: Tokenize Text 00:01:58

Vectorizing and Encoding
  • Converting Text to Numbers 00:05:48
  • Word Counts 00:09:06
  • Word Frequencies 00:05:50
  • Word Hashing 00:04:43
  • Binary and Other Parameters 00:01:27
  • Exercise: Vectorize Text 00:03:12

Bag of Words Project
  • Data Preparation with a Real-world Dataset 00:04:54
  • Cleaning and Tokenizing the Data 00:03:55
  • Vectorizing the Data 00:09:43
  • Exercise: Data Quality 00:05:51

Word Embeddings
  • What are Word Embeddings? 00:08:28
  • Word2Vec and GloVe 00:02:40
  • Word Embedding with Gensim 00:06:55
  • Word Embedding with Gensim Continued 00:12:55
  • Exercise: Word Embedding 00:02:39

Conclusion
  • Summary 00:02:55
  • End of Course Survey
  • Certificate Info
