Checkpoint Forking

[Figure: checkpoint forking example. Run 206 had a catastrophic failure; it was fixed by forking into run 230 from a checkpoint before the point of collapse.]

Checkpoint forking allows you to create a new training run that starts from an existing model’s checkpoint. This is particularly useful when:
  • Training has gone off track and you want to restart from a known good checkpoint
  • You want to experiment with different hyperparameters from a specific point
  • You need to branch off multiple experiments from the same checkpoint

This feature is marked as experimental because we're still refining the API shape; however, the core functionality will remain stable.

Basic Usage

The simplest way to fork a checkpoint is to specify it when creating your model:
import art
from art.local import LocalBackend

async def train():
    with LocalBackend() as backend:
        # Create a new model that will fork from an existing checkpoint
        model = art.TrainableModel(
            name="my-model-v2",
            project="my-project",
            base_model="Qwen/Qwen2.5-14B-Instruct",
        )

        # Copy the checkpoint from another model
        await backend._experimental_fork_checkpoint(
            model,
            from_model="my-model-v1",
            not_after_step=500,  # Use checkpoint at or before step 500
            verbose=True,
        )
        
        # Register and continue training
        await model.register(backend)
        # ... rest of training code

Forking from S3

If your checkpoints are stored in S3, you can fork directly from there:
await backend._experimental_fork_checkpoint(
    model,
    from_model="my-model-v1",
    from_s3_bucket="my-backup-bucket",
    not_after_step=500,
    verbose=True,
)

Parameters

from_model (required)

The name of the model to fork from.

from_project (optional)

The project containing the model to fork from. Defaults to the current model’s project.

from_s3_bucket (optional)

S3 bucket to pull the checkpoint from. If not provided, the function looks for the checkpoint locally.

not_after_step (optional)

The maximum step number to use. The function will use the latest checkpoint that is less than or equal to this step. If not provided, uses the latest available checkpoint.

verbose (optional)

Whether to print detailed progress information during the forking process.
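
Putting the parameters together, a fully specified call might look like the following sketch; the model, project, and bucket names here are placeholders:
await backend._experimental_fork_checkpoint(
    model,
    from_model="my-model-v1",
    from_project="other-project",  # defaults to the current model's project
    from_s3_bucket="my-backup-bucket",  # omit to use local checkpoints
    not_after_step=500,  # latest checkpoint at or before step 500
    verbose=True,
)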

How It Works

  1. Checkpoint Selection: The system finds the latest checkpoint at or before your not_after_step parameter, or the latest available checkpoint if none is given (see the sketch after this list)
  2. S3 Pull (if needed): If forking from S3, only the specific checkpoint is downloaded, not the entire model history
  3. Checkpoint Copy: The checkpoint is copied to your new model’s directory at the same step number
  4. Training Continuation: Your model can now continue training from this checkpoint
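
To make step 1 concrete, here is a minimal sketch of the selection rule. This is not ART's internal code; it assumes, purely for illustration, that each checkpoint lives in a subdirectory named after its step number:
from pathlib import Path
from typing import Optional

def select_checkpoint_step(checkpoint_dir: Path, not_after_step: Optional[int] = None) -> int:
    # Assumes one subdirectory per checkpoint, named by step number (e.g. "0500")
    steps = sorted(
        int(p.name) for p in checkpoint_dir.iterdir() if p.is_dir() and p.name.isdigit()
    )
    if not steps:
        raise FileNotFoundError(f"no checkpoints found in {checkpoint_dir}")
    if not_after_step is None:
        return steps[-1]  # no bound given: use the latest available checkpoint
    eligible = [s for s in steps if s <= not_after_step]  # <= comparison (see Notes)
    if not eligible:
        raise ValueError(f"no checkpoint at or before step {not_after_step}")
    return eligible[-1]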

Example: Lowering the Learning Rate

Here’s a practical example of using checkpoint forking to test a lower learning rate:
# Original model trained with lr=1e-5
base_model = art.TrainableModel(
    name="summarizer-base",
    project="experiments",
    base_model="Qwen/Qwen2.5-14B-Instruct",
)

# Fork at step 1000 to try lower learning rate
low_lr_model = art.TrainableModel(
    name="summarizer-low-lr",
    project="experiments",
    base_model="Qwen/Qwen2.5-14B-Instruct",
)

async def experiment():
    with LocalBackend() as backend:
        # Fork from summarizer-base's checkpoint at or before step 1000
        await backend._experimental_fork_checkpoint(
            low_lr_model,
            from_model="summarizer-base",
            not_after_step=1000,
            verbose=True,
        )
        await low_lr_model.register(backend)
        
        # Now train with a lower learning rate
        # ... training code with different configs
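
One way to apply the lower learning rate is per train() call. The sketch below assumes the learning rate can be passed via art.TrainConfig, and training_batches is a hypothetical iterable of trajectory groups; check the API of your ART version:
# Hypothetical continuation of experiment() above
for batch in training_batches:
    await low_lr_model.train(
        batch,
        config=art.TrainConfig(learning_rate=1e-6),  # lower than the base run's 1e-5
    )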

Notes

  • Checkpoints are forked at the same step number they had in the source model
  • The not_after_step parameter uses <= comparison, so specifying 500 will include step 500 if it exists
  • Only checkpoint files are copied; training logs and trajectories are not included in the fork