Mining Training Data for LLMs
Learn how to generate training data for large language models using Codegen
This guide demonstrates how to use Codegen to generate high-quality training data for large language models (LLMs) by extracting function implementations along with their dependencies and usages. The approach is analogous to word2vec or node2vec: given the context in which a function appears, learn to predict the function's implementation.
Overview
The process involves three main steps:
- Finding all functions in the codebase
- Extracting their implementations, dependencies, and usages
- Generating structured training data
Let’s walk through each step using Codegen.
Step 1: Finding Functions and Their Context
First, we perform a "graph expansion" for each function: we grab the function's source, plus the full source of all of its dependencies and all of its usages.
Let's start by importing the types we need from Codegen:
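A minimal sketch of the imports used throughout this walkthrough; the module paths below assume the layout of the Codegen SDK and may differ across versions:

```python
from codegen import Codebase
from codegen.sdk.core.external_module import ExternalModule
from codegen.sdk.core.import_resolution import Import
from codegen.sdk.core.symbol import Symbol
```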
Here’s how we get the full context for each function:
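A sketch of what this can look like, assuming the SDK exposes `source`, `filepath`, `dependencies`, and `usages` on function symbols (the `usage_symbol` attribute on usages is likewise an assumption here). It relies on a `hop_through_imports` helper, explained and defined just below:

```python
def get_function_context(function) -> dict:
    """Collect a function's implementation plus the source of its
    dependencies and usages."""
    context = {
        "implementation": {"source": function.source, "filepath": function.filepath},
        "dependencies": [],
        "usages": [],
    }

    # Add dependencies, resolving re-exported imports to their root symbol
    for dep in function.dependencies:
        if isinstance(dep, Import):
            dep = hop_through_imports(dep)
        context["dependencies"].append({"source": dep.source, "filepath": dep.filepath})

    # Add every place the function is used
    for usage in function.usages:
        context["usages"].append({
            "source": usage.usage_symbol.source,
            "filepath": usage.usage_symbol.filepath,
        })

    return context
```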
Notice how we use `hop_through_imports` to resolve dependencies. When working with imports, symbols can be re-exported multiple times. For example, a helper function might be imported and re-exported through several files before being used. We need to follow this chain to find the actual implementation:
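A minimal recursive implementation, assuming each `Import` exposes an `imported_symbol` that is either another `Import` (a re-export) or the underlying symbol:

```python
def hop_through_imports(imp: Import) -> Symbol | ExternalModule:
    """Follow a chain of re-exports until we reach the root symbol."""
    if isinstance(imp.imported_symbol, Import):
        return hop_through_imports(imp.imported_symbol)
    return imp.imported_symbol
```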
This creates a structured representation of each function’s context:
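For illustration, a single function's entry might look like the following (the file paths and code snippets here are hypothetical):

```python
{
    "implementation": {
        "source": "def process_data(input: str) -> dict: ...",
        "filepath": "src/data_processor.py",
    },
    "dependencies": [
        {"source": "def validate_input(data: str) -> bool: ...", "filepath": "src/validators.py"},
    ],
    "usages": [
        {"source": "result = process_data(user_input)", "filepath": "src/api.py"},
    ],
}
```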
Step 2: Processing the Codebase
Next, we process all functions in the codebase to generate our training data:
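One way to sketch this step, assuming `codebase.functions` iterates over every function symbol in the codebase; the minimum-size and minimum-context filters are illustrative choices, not requirements:

```python
def run(codebase: Codebase) -> dict:
    """Generate training data for every function in the codebase."""
    training_data = {"functions": [], "metadata": {}}

    for function in codebase.functions:
        # Skip trivial one-line functions
        if len(function.source.split("\n")) < 2:
            continue

        context = get_function_context(function)

        # Keep only functions that have at least some surrounding context
        if context["dependencies"] or context["usages"]:
            training_data["functions"].append(context)

    training_data["metadata"] = {
        "total_functions": len(codebase.functions),
        "total_processed": len(training_data["functions"]),
    }
    return training_data
```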
Step 3: Running the Generator
Finally, we can run our training data generator on any codebase.
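A sketch of the entry point; `fastapi/fastapi` is just an example target, and the output filename is an arbitrary choice:

```python
import json

from codegen import Codebase

if __name__ == "__main__":
    print("Initializing codebase...")
    codebase = Codebase.from_repo("fastapi/fastapi")

    print("Generating training data...")
    training_data = run(codebase)

    print("Saving training data...")
    with open("training_data.json", "w") as f:
        json.dump(training_data, f, indent=2)
    print("Training data saved to training_data.json")
```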
This will:
- Load the target codebase
- Process all functions
- Save the structured training data to a JSON file
You can use any Git repository as your source codebase by passing the repo URL to `Codebase.from_repo(...)`.
Using the Training Data
The generated data can be used to train LLMs in several ways:
- Masked Function Prediction: Hide a function’s implementation and predict it from dependencies and usages
- Code Embeddings: Generate embeddings that capture semantic relationships between functions
- Dependency Prediction: Learn to predict which functions are likely to be dependencies
- Usage Pattern Learning: Train models to understand common usage patterns
For example, to create a masked prediction task:
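A minimal sketch that pairs each function's context with its implementation as the prediction target; the field names follow the structure generated above, and the function name is hypothetical:

```python
def create_training_example(function_data: dict) -> dict:
    """Create a masked-prediction example: given a function's
    dependencies and usages, predict its implementation."""
    return {
        "context": {
            "dependencies": function_data["dependencies"],
            "usages": function_data["usages"],
        },
        "target": function_data["implementation"],
    }

examples = [create_training_example(f) for f in training_data["functions"]]
```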