Train a deep research agent to exceed SOTA performance using GRPO and SFT.
Accuracy of a Qwen 2.5 14B Instruct model (the same model you will be training) as it learns to perform deep research, eventually exceeding the performance of GPT-4.1 and Sonnet 4.
Install `uv` by following the instructions here. Then install the project dependencies by running `uv sync`.
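For reference, a typical setup looks like the following. The standalone installer script is one of the options from the uv docs; use whichever install method the instructions recommend for your platform.

```bash
# Install uv (standalone installer; other methods are described in the uv docs)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install the project's dependencies
uv sync
```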
This project uses `LocalBackend` to manage the GPU that your model will be trained on. To provision a GPU for your training run, you’ll need SkyPilot installed on your machine and configured with credentials for at least one infra provider it can use to spin up machines.
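If `uv sync` didn’t already pull SkyPilot into the project environment, you can install it yourself and check which providers it can authenticate with. The `runpod` extra shown here is an assumption that matches the recommendation below; swap in the extra for whichever provider you plan to use.

```bash
# Install SkyPilot with support for your chosen provider (RunPod shown here)
uv pip install "skypilot[runpod]"

# List which infra providers SkyPilot can currently authenticate with
uv run sky check
```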
We recommend RunPod because of its ease of use, but any infra provider that SkyPilot supports will work.
Follow RunPod’s Getting Started guide here. You’ll have to provide a credit card to use RunPod, but you’ll only pay for the time your GPUs are running.
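Once you have a RunPod API key, wiring it up for SkyPilot boils down to something like the sketch below. The command names come from the `runpod` SDK and the SkyPilot CLI; double-check them against SkyPilot’s current RunPod setup docs, since the exact flow may differ between versions.

```bash
# Register your RunPod API key (from the RunPod console) so SkyPilot can use it
uv run runpod config

# Confirm RunPod now shows up as an enabled provider
uv run sky check runpod
```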
Copy `.env.example` to `.env` at the root of the repository, and fill in the values for the environment variables. If you’re unsure about any of the values, refer to ENV_INSTRUCTIONS.md.
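For example, assuming a POSIX shell:

```bash
# Create your local env file from the template at the repo root
cp .env.example .env

# Then open .env and fill in each variable; see ENV_INSTRUCTIONS.md if unsure
```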
Run the `display_benchmarks.ipynb` notebook to visualize the results and generate comparison charts.
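One way to open it, assuming Jupyter isn’t already a project dependency, is to pull it in just for this run with uv’s `--with` flag:

```bash
# Launch the benchmark notebook in JupyterLab from the project environment
uv run --with jupyterlab jupyter lab display_benchmarks.ipynb
```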