Data Preparation for Large Language Models

Transforming and cleaning data for LLMs.

About this course

Large language models (LLMs) have come out of left field and taken everyone by surprise in recent years. From ChatGPT to Google Bard, it is hard to ignore how far machine learning has come in producing human-like text from large corpora of text documents.

However, textual data can come from diverse sources, including books, online articles, social media, and internal documents. Natural language is messy and not readily understood by LLMs. In this hands-on course, you'll learn fundamental techniques for cleaning and vectorizing text data so it can be used by LLMs. We will work through many code examples using Python and scikit-learn, moving from bag-of-words models to word embeddings.
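
As a small taste of the code the course walks through, here is a minimal bag-of-words sketch using scikit-learn's CountVectorizer; the tiny corpus below is an illustrative placeholder, not a course dataset:

    from sklearn.feature_extraction.text import CountVectorizer

    # A tiny illustrative corpus; the course works with real-world datasets.
    corpus = [
        "Natural language is messy.",
        "LLMs need clean, vectorized text.",
        "Messy text should be cleaned before vectorizing.",
    ]

    # CountVectorizer lowercases, tokenizes, and counts each word per document.
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(corpus)

    print(vectorizer.get_feature_names_out())  # the learned vocabulary
    print(X.toarray())                         # the document-term count matrix

Each row of the resulting matrix represents one document, and each column counts one vocabulary word, which is exactly the representation the bag-of-words lessons build up to.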

What you'll learn—and how you can apply it

By the end of this hands-on course, you’ll understand:

  • The significance of curating and processing textual data for LLMs.
  • How to clean and prepare textual data.
  • How different vectorization models apply to different language model problems.

And you’ll be able to:

  • Leverage Python libraries to create bag-of-words and word embedding models (see the Word2Vec sketch following this list).
  • Get hands-on insights into how LLMs work.
  • Turn words and documents into mathematical representations appropriate for machine learning.
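
Here is that word embedding sketch, using Gensim's Word2Vec and assuming Gensim 4.x; the tokenized sentences and parameter values below are illustrative placeholders:

    from gensim.models import Word2Vec

    # Pre-tokenized sentences; a stand-in for a real, cleaned corpus.
    sentences = [
        ["data", "preparation", "matters", "for", "llms"],
        ["clean", "text", "before", "training"],
        ["word", "embeddings", "capture", "meaning"],
    ]

    # Train a small Word2Vec model; vector_size, window, and epochs are toy settings.
    model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=20)

    print(model.wv["embeddings"][:5])             # first five dimensions of one word vector
    print(model.wv.most_similar("text", topn=2))  # nearest neighbors by cosine similarity

In practice, word vectors only become meaningful with far more training data than this; the Gensim lessons in the course work through that at a realistic scale.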

This training is for you because…

  • You are a Pythonista wanting to understand how LLM data preparation works, including document ingestion.
  • You are a budding data science or machine learning practitioner wanting enhanced mastery over language modeling.
  • You find chatbots fascinating and want to take the first steps toward building your own.

Prerequisites

  • Basic Python proficiency (variables, loops, functions, library usage).
  • Basic NumPy proficiency (array declaration and array manipulation).
  • Experience with scikit-learn is helpful, but not required.

Setup

To open Anaconda Notebooks:

  1. Go to https://anaconda.cloud
  2. Click on 'Notebooks' from the top navigation menu
  3. Create an account, or log in if you already have one

Facilitator bio

Thomas Nield is the founder of Nield Consulting Group and Yawman Flight, as well as an instructor at the University of Southern California. He enjoys making technical content relatable and relevant to those unfamiliar with or intimidated by it. Thomas regularly teaches classes on data analysis, machine learning, mathematical optimization, and practical artificial intelligence. At USC, he teaches AI System Safety, developing systematic approaches for identifying AI-related hazards in aviation and ground vehicles. He has authored three books, including Essential Math for Data Science (O'Reilly) and Getting Started with SQL (O'Reilly).

At Yawman Flight, he invents and develops universal handheld flight controls for flight simulation and unmanned aerial vehicles. You can find him on Twitter | LinkedIn | GitHub | YouTube.

Questions? Issues? Contact [email protected].

Curriculum (total: 02:43:08)

Getting Started
  • How to Use Anaconda Notebooks 00:01:02
  • Course Overview and Learning Objectives 00:02:25

Introduction to Language Model Data
  • What is Natural Language Processing? 00:09:53
  • What Are Large Language Models? 00:07:28
  • How Computers See Text 00:09:23
  • Strengths and Limits of Large Language Models 00:04:04
  • Concerns in Procuring LLM Data 00:04:41
  • Exercise: Procuring LLM Data 00:02:37

Cleaning Data for Language Models
  • Text Cleaning 00:09:12
  • Manual Tokenization 00:05:40
  • Using the Natural Language Toolkit (NLTK) 00:05:07
  • Stemming 00:02:19
  • Using spaCy 00:06:18
  • Exercise: Tokenize Text 00:01:58

Vectorizing and Encoding
  • Converting Text to Numbers 00:05:48
  • Word Counts 00:09:06
  • Word Frequencies 00:05:50
  • Word Hashing 00:04:43
  • Binary and Other Parameters 00:01:27
  • Exercise: Vectorize Text 00:03:12

Bag of Words Project
  • Data Preparation with a Real-world Dataset 00:04:54
  • Cleaning and Tokenizing the Data 00:03:55
  • Vectorizing the Data 00:09:43
  • Exercise: Data Quality 00:05:51

Word Embeddings
  • What are Word Embeddings? 00:08:28
  • Word2Vec and GloVe 00:02:40
  • Word Embedding with Gensim 00:06:55
  • Word Embedding with Gensim Continued 00:12:55
  • Exercise: Word Embedding 00:02:39

Conclusion
  • Summary 00:02:55
  • End of Course Survey
  • Certificate Info
