Training jobs can run for thousands of steps, and each step generates a new model checkpoint. For most training runs, these checkpoints are LoRAs that take up 80-150MB of disk space each. To reduce storage overhead while still keeping the checkpoints that matter, you can set up automatic deletion of all but your best-performing and most recent checkpoints.
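
For a rough sense of scale, here is a back-of-the-envelope estimate. The ~120MB-per-checkpoint figure (midway through that range) and the 1,000-step run length are assumptions for illustration only:
# illustrative storage estimate, assuming ~120MB per LoRA checkpoint
CHECKPOINT_MB = 120
TRAINING_STEPS = 1_000

print(f"keep every checkpoint:   ~{TRAINING_STEPS * CHECKPOINT_MB // 1000}GB")  # ~120GB
print(f"keep best + most recent: ~{2 * CHECKPOINT_MB}MB")  # ~240MB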

Deleting low-performing checkpoints

To delete all but the most recent and best-performing checkpoints of a model, call the delete_checkpoints method as shown below.
import art
# also works with LocalBackend and SkyPilotBackend
from art.serverless.backend import ServerlessBackend

model = art.TrainableModel(
    name="agent-001",
    project="checkpoint-deletion-demo",
    base_model="Qwen/Qwen2.5-14B-Instruct",
)
backend = ServerlessBackend()
# in order for the model to know where to look for its existing checkpoints,
# we have to point it to the correct backend
await model.register(backend)

# deletes all but the most recent checkpoint
# and the checkpoint with the highest val/reward
await model.delete_checkpoints()
By default, delete_checkpoints ranks existing checkpoints by their val/reward score and erases all but the highest-performing and most recent. However, you can configure delete_checkpoints to rank checkpoints by any metric you pass it:
await model.delete_checkpoints(best_checkpoint_metric="train/eval_1_score")
Keep in mind that once checkpoints are deleted, they generally cannot be recovered, so use this method with caution.
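
If you want a simple safeguard against pruning too early in a run, one option is to gate the call on the model's current step, using the same get_step method that appears in the training loop below. This is just a sketch; the threshold of 10 is arbitrary:
# only start pruning once the run has accumulated a meaningful number of
# checkpoints; the threshold of 10 is arbitrary and purely illustrative
if await model.get_step() >= 10:
    await model.delete_checkpoints()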

Deleting within a training loop

Below is a simple example of a training loop that trains a model for 50 steps before exiting. By default, the LoRA checkpoint generated by each step will automatically be saved in the storage mechanism your backend uses (in this case W&B Artifacts).

import art
from art.serverless.backend import ServerlessBackend

from .rollout import rollout
from .scenarios import load_train_scenarios

TRAINING_STEPS = 50

model = art.TrainableModel(
    name="agent-001",
    project="checkpoint-deletion-demo",
    base_model="Qwen/Qwen2.5-14B-Instruct",
)
backend = ServerlessBackend()
await model.register(backend)


train_scenarios = load_train_scenarios()

# training loop
for step in range(await model.get_step(), TRAINING_STEPS):
    train_groups = await art.gather_trajectory_groups(
        (
            art.TrajectoryGroup(rollout(model, scenario, step) for _ in range(8))
            for scenario in train_scenarios
        ),
        pbar_desc=f"gather(train:{step})",
    )
    # trains model and automatically persists each LoRA as a W&B Artifact
    # ~120MB per step
    await model.train(
        train_groups,
        config=art.TrainConfig(learning_rate=5e-5),
    )

# ~6GB of storage used by checkpoints
However, since each of the 50 LoRA checkpoints generated by this training run is ~120MB, the run will require ~6GB of storage for model checkpoints alone. To reduce that storage overhead, let's implement checkpoint deletion on each step.
...
# training loop
for step in range(await model.get_step(), TRAINING_STEPS):
    train_groups = await art.gather_trajectory_groups(
        (
            art.TrajectoryGroup(rollout(model, scenario, step) for _ in range(8))
            for scenario in train_scenarios
        ),
        pbar_desc=f"gather(train:{step})",
    )
    # trains model and automatically persists each LoRA as a W&B Artifact
    # ~120MB per step
    await model.train(
        train_groups,
        config=art.TrainConfig(learning_rate=5e-5),
    )
    # clear all but the most recent and best-performing checkpoint on the train/reward metric
    await model.delete_checkpoints(best_checkpoint_metric="train/reward")

# ~240MB of storage used by checkpoints
With this change, we've reduced the total storage used by checkpoints from ~6GB to ~240MB: only two checkpoints remain, the most recent one and the one that performed best on train/reward.
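
If you'd rather keep every intermediate checkpoint available while the run is in progress (for example, so you can roll back a few steps if training goes off the rails), a variation is to prune once after the loop finishes instead of on every step. This sketch just rearranges the same calls; peak storage climbs back to ~6GB during the run, but the final footprint is the same ~240MB:
...
# training loop, unchanged from the first example above
for step in range(await model.get_step(), TRAINING_STEPS):
    train_groups = await art.gather_trajectory_groups(
        (
            art.TrajectoryGroup(rollout(model, scenario, step) for _ in range(8))
            for scenario in train_scenarios
        ),
        pbar_desc=f"gather(train:{step})",
    )
    await model.train(
        train_groups,
        config=art.TrainConfig(learning_rate=5e-5),
    )

# prune once at the end: keeps only the most recent checkpoint
# and the best-performing checkpoint on train/reward
await model.delete_checkpoints(best_checkpoint_metric="train/reward")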