Data Preparation for Large Language Models

Transforming and cleaning data for LLMs.

2.5 Hours 31 Lessons

What you'll learn—and how you can apply it

By the end of this hands-on course, you’ll understand:

The significance of curating and processing textual data for LLMs.
How to clean and prepare textual data.
How different vectorization models apply to different language modeling problems.

And you’ll be able to:

Leverage Python libraries to create bag-of-words and word embedding models.
Get hands-on insights into how LLMs work.
Turn words and documents into mathematical representations appropriate for machine learning.
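
To give a flavor of what "turning documents into mathematical representations" means, here is a minimal hand-rolled bag-of-words sketch in plain Python. The sample documents are made up for illustration; the course builds up to library-based versions of this idea.

```python
# Hand-rolled bag-of-words: represent each document as a vector of
# word counts over a shared vocabulary. Sample documents are illustrative.

docs = [
    "natural language is messy",
    "language models learn from text",
]

# Build a vocabulary: every unique word across all documents, sorted
# so that each word maps to a stable column position.
vocab = sorted({word for doc in docs for word in doc.split()})

# Represent each document as a vector of word counts over the vocabulary.
vectors = [[doc.split().count(word) for word in vocab] for doc in docs]

print(vocab)    # the shared vocabulary
print(vectors)  # one count vector per document
```

Each row of `vectors` is now a fixed-length numeric representation of a document, which is exactly the form machine learning models expect.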

Description

Large language models (LLMs) have taken the world by surprise in recent years. From ChatGPT to Google Bard, it is hard to ignore how far machine learning has come in producing human-like text learned from large corpora of text documents.

However, textual data can come from diverse sources, including books, online articles, social media, or internal documents. Natural language is messy and not readily understood by LLMs. In this hands-on course, you'll learn fundamental techniques for cleaning and vectorizing text data so it can be used by LLMs. We will cover many code examples using Python and scikit-learn, and work our way from bag-of-words models to word embeddings.

This training is for you because...

  • You are a Pythonista wanting to understand how LLM data preparation works, including document ingestion.
  • You are a budding data science or machine learning practitioner wanting enhanced mastery over language modeling.
  • You find chatbots fascinating and want to take first steps in building your own.

Prerequisites

• Basic Python proficiency (variables, loops, functions, library usage).
• Basic NumPy proficiency (array declaration and array manipulation).
• Experience with scikit-learn is helpful, but not required.

Instructor

Thomas is the Founder of Nield Consulting Group and Yawman Flight, and an instructor at the University of Southern California. He has authored bestselling books, including Essential Math for Data Science (O’Reilly).

Curriculum

31 Lessons
Getting started with Anaconda Notebooks
Course Overview and Learning Objectives
What is Natural Language Processing?
What Are Large Language Models?
How Computers See Text
Strengths and Limits of Large Language Models
Concerns in Procuring LLM Data
Exercise: Procuring LLM Data
Text Cleaning
Manual Tokenization
Using the Natural Language Toolkit (NLTK)
Stemming
Using spaCy
Exercise: Tokenize Text
Converting Text to Numbers
Word Counts
Word Frequencies
Word Hashing
Binary and Other Parameters
Exercise: Vectorize Text
Data Preparation with a Real-world Dataset
Cleaning and Tokenizing the Data
Vectorizing the Data
Exercise: Data Quality
What are Word Embeddings?
Word2Vec and GloVe
Word Embedding with Gensim
Word Embedding with Gensim Continued
Exercise: Word Embedding
Summary
End of course survey