Most SOTA models are already trained to condense long documents into short summaries. However, not every summary is created equal. In this tutorial, we’re going to train a summarizer that excels at filtering useful information from a document and cutting out the fluff. To skip ahead and see the results of a prior training run, check out the blog post. Otherwise, please enjoy this tutorial!
Reading time: 45 min | Training time: 4 hours | Total cost: $22

1. Clone the starter repo and install dependencies

To get started, clone Summary-RL, which contains the following pieces of our RL pipeline:
  • The agent’s environment
  • The reward function
  • Some training examples
Once the repository is cloned, install uv (if you haven’t already) by following the instructions here, then install the project dependencies by running uv sync.

2. Install backend dependencies and provision a GPU

You’ll be using LocalBackend to manage the GPU that your model will be trained on. Install ART with the backend dependencies:
pip install openpipe-art[backend]
Make sure you have access to a machine with a modern NVIDIA GPU. This can be your local workstation or a cloud VM. If you’re using a cloud provider (e.g. RunPod, Lambda, or GCP), launch the GPU instance and run the rest of this tutorial on that machine.
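Before continuing, it can help to confirm that the machine actually exposes an NVIDIA GPU. The following is a minimal sketch, assuming the NVIDIA driver's `nvidia-smi` utility is installed and on the PATH whenever a GPU is usable. The helper name `gpu_visible` is hypothetical and is not part of ART.

```python
import shutil
import subprocess


def gpu_visible(smi_path=None):
    """Return True if `nvidia-smi -L` runs successfully and lists at least one GPU.

    `smi_path` is an optional explicit path to the nvidia-smi binary;
    by default the binary is looked up on the PATH.
    """
    smi = smi_path or shutil.which("nvidia-smi")
    if smi is None:
        # No driver utilities found, so assume no usable GPU.
        return False
    try:
        # `-L` prints one line per device, e.g. "GPU 0: <name> (UUID: ...)".
        result = subprocess.run(
            [smi, "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU " in result.stdout


if __name__ == "__main__":
    print("NVIDIA GPU detected" if gpu_visible() else "No NVIDIA GPU detected")
```

If this reports that no GPU was detected on a cloud instance, the driver may not be installed or the instance type may not include a GPU.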

3. Set up the optional environment variables found in .env.example

In a new .env file at the root of the repository, set the following optional environment variables:
  • WANDB_API_KEY - Enables metric logging to Weights & Biases.
  • OPENPIPE_API_KEY - Enables chat completion logging to OpenPipe.
  • OPENAI_API_KEY - Will be necessary for later comparison benchmarks, but not used for training.
To enable model and logging backup to S3, you’ll also need to provide AWS credentials. These are necessary for generating the benchmarks found in the benchmarks directory, but not for training itself. If you don’t already have AWS credentials with create/read/write permissions for S3 buckets, follow the instructions here.
  • AWS_ACCESS_KEY_ID - Your AWS access key ID, which should have create/read/write permissions for S3 buckets.
  • AWS_SECRET_ACCESS_KEY - Your matching secret access key.
  • AWS_REGION - The region of the S3 bucket.
  • BACKUP_BUCKET - The name of the S3 bucket in which to store model checkpoints and logging data. Can be a new bucket or an existing one.
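Since every variable above is optional, it is easy to lose track of which features are actually enabled. The sketch below reports which variables are still missing for each feature. It assumes the values from .env have already been loaded into the process environment (for example, by exporting them in your shell). The variable names come from the lists above, but the feature grouping and the function name `missing_variables` are illustrative and not part of the repository.

```python
import os

# Variable names are taken from the lists above; the grouping is an
# illustrative assumption, not something defined by the repository.
OPTIONAL_FEATURES = {
    "Weights & Biases metric logging": ["WANDB_API_KEY"],
    "OpenPipe chat completion logging": ["OPENPIPE_API_KEY"],
    "Comparison benchmarks": ["OPENAI_API_KEY"],
    "S3 backup": [
        "AWS_ACCESS_KEY_ID",
        "AWS_SECRET_ACCESS_KEY",
        "AWS_REGION",
        "BACKUP_BUCKET",
    ],
}


def missing_variables(env=None):
    """Map each feature to the variables it still needs (empty list = ready)."""
    env = os.environ if env is None else env
    return {
        feature: [name for name in names if not env.get(name)]
        for feature, names in OPTIONAL_FEATURES.items()
    }


if __name__ == "__main__":
    for feature, missing in missing_variables().items():
        status = "enabled" if not missing else "missing " + ", ".join(missing)
        print(f"{feature}: {status}")
```

Training can proceed even if every feature is reported as missing; the check only shows which optional integrations will be skipped.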

4. Run the training script

uv run python src/summarizer/train.py
The first training run will:
  • Register the model with ART.
  • Download the model checkpoint from S3 (if configured).
  • Start vLLM and the training service on your GPU.
  • Train the model for a specified number of steps.
  • Upload the final model checkpoint to S3 (if configured).

5. Shut down your GPU instance

When you’re done training and running benchmarks, shut down your GPU instance through your cloud provider’s console or CLI. If you’re running locally, you can simply stop the training process.

Running Benchmarks

The benchmark_models.py script compares the performance of the trained model to gpt-4o, gpt-4.1, o4-mini, and gemini-2.5-pro-preview. Before running the benchmark script, make sure you’ve provided a valid OPENROUTER_API_KEY and the AWS credentials detailed in step 3. The AWS credentials are required so that the script can upload the benchmark results to S3.
uv run python benchmarks/benchmark_models.py
This script will:
  • Run each benchmarked model through each document in the validation set.
  • Record the percentage of questions that each model’s summary allowed Gemini 2.5 Flash to answer correctly.
  • Upload the results to S3.
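The metric recorded in the steps above is a plain percentage. As a rough illustration, the sketch below computes it from per-question judgements. How the questions are posed and graded by Gemini 2.5 Flash is not shown here: the boolean inputs stand in for those judgements, and the function names and the sample data are hypothetical rather than taken from benchmark_models.py.

```python
def percent_answered(judgements):
    """Percentage of questions judged as answered correctly.

    `judgements` holds one boolean per question: True if the summary
    allowed the question to be answered correctly, False otherwise.
    """
    if not judgements:
        raise ValueError("at least one judgement is required")
    return 100.0 * sum(judgements) / len(judgements)


def score_models(results):
    """Apply `percent_answered` to each model's list of judgements."""
    return {model: percent_answered(j) for model, j in results.items()}


if __name__ == "__main__":
    # Made-up judgements for illustration only, not real benchmark data.
    example = {
        "model-a": [True, False, True, True],
        "model-b": [False, False, True, True],
    }
    print(score_models(example))
```

Pooling all questions before dividing, as done here, weights every question equally; averaging per-document percentages instead would weight every document equally, so the two approaches can give different numbers when documents have different question counts.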
Once the benchmark generation script has finished running, you can view the results and generate visual charts by navigating to benchmarks/display_benchmarks.ipynb and running the cells. After running all the cells, you should see something like the following:

[Figure: Benchmark Percentage of Questions Answered Comparison]
The percentage of questions that each model’s summary allowed Gemini 2.5 Flash to answer correctly at each training step. By step 40 of this training run, the trained model outperforms every other model.

[Figure: Benchmark Percentage of Questions Answered Comparison]
A side-by-side comparison of the percentage of questions that each model’s summary allowed Gemini 2.5 Flash to answer correctly. The trained model began with a percentage of 37%, but by the final step, it was able to generate summaries that allowed Gemini 2.5 Flash to answer 70% of the questions correctly.