NextStep-1: Autoregressive Image Generation with Continuous Tokens

NextStep-1 is a 14B-parameter autoregressive text-to-image model that pairs continuous image tokens with a lightweight flow matching head and reports state-of-the-art results among autoregressive models.

Image credit: https://huggingface.co/stepfun-ai/NextStep-1-Large

What is NextStep-1?

NextStep-1 is an autoregressive image generation system that combines a 14-billion-parameter transformer with a lightweight 157-million-parameter flow matching head to create images from text descriptions. Many earlier approaches either hand image synthesis to a computationally intensive diffusion model or discretize images with vector quantization and accept the resulting quantization loss. NextStep-1 instead operates directly on continuous image tokens.

The model is trained with next-token prediction objectives on both discrete text tokens and continuous image tokens, so a single model handles both image synthesis and editing. This lets NextStep-1 report state-of-the-art performance among autoregressive models while keeping the flexibility and control that make autoregressive generation appealing.

What sets NextStep-1 apart is its ability to handle complex visual scenarios with remarkable fidelity. The system demonstrates exceptional capability in generating detailed portraits, realistic objects, animals, and intricate scenes while maintaining consistency and quality throughout the generation process. The integration of continuous tokens with flow matching enables the model to produce images that rival those created by traditional diffusion-based systems, but with the added benefits of autoregressive control and editing capabilities.

NextStep-1 Overview

Feature            Description
AI Model           NextStep-1
Category           Autoregressive Text-to-Image Generation
Function           Image Synthesis and Editing
Model Size         14B Parameters + 157M Flow Matching Head
Token Type         Continuous Image Tokens
Research Paper     arXiv:2508.10711
Code Repository    github.com/stepfun-ai/NextStep-1
Model Hub          huggingface.co/stepfun-ai/NextStep-1-Large

Technical Architecture

NextStep-1 employs a sophisticated technical architecture that bridges the gap between autoregressive language modeling and continuous image generation. The system consists of two primary components working in harmony: a causal transformer that processes mixed text and image tokens, and a specialized flow matching head that guides continuous image patch generation.

The causal transformer reads sequences containing both discrete text tokens and continuous image tokens, predicting the next element in the sequence. This unified approach lets the model learn the relationship between textual descriptions and visual content within a single sequence model. The language modeling head handles discrete text tokens with a standard cross-entropy loss, while the flow matching head handles continuous image patches by predicting a velocity, trained with mean squared error.
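
As a rough illustration of the paragraph above, the sketch below models a mixed sequence and shows which head and loss each predicted position would use. The names `TextToken`, `ImagePatch`, and `route` are hypothetical, chosen only for this example; they are not taken from the NextStep-1 codebase.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class TextToken:
    token_id: int  # discrete vocabulary index


@dataclass
class ImagePatch:
    latent: List[float]  # continuous latent vector for one image patch


Element = Union[TextToken, ImagePatch]


def route(sequence: List[Element]) -> List[Tuple[int, str, str]]:
    """For each position t >= 1, name the head and loss used to predict element t.

    Under causal next-token prediction, the hidden state at position t - 1
    is what predicts the element at position t.
    """
    plan = []
    for t in range(1, len(sequence)):
        if isinstance(sequence[t], TextToken):
            plan.append((t, "lm_head", "cross_entropy"))
        else:
            plan.append((t, "flow_matching_head", "mse_on_velocity"))
    return plan


# A caption of two text tokens followed by two image patches (made-up values).
seq: List[Element] = [
    TextToken(101),
    TextToken(2057),
    ImagePatch([0.1, -0.3]),
    ImagePatch([0.7, 0.2]),
]
plan = route(seq)
for entry in plan:
    print(entry)
```

The point of the sketch is only that one causal sequence feeds two heads: the target type at each position decides which head and which loss apply.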

The flow matching component represents a significant innovation in the field. Rather than relying on heavy diffusion decoders, NextStep-1 uses a lightweight flow matching head that learns a velocity field to guide noisy latent samples toward target image patches. This approach significantly reduces computational overhead while maintaining high-quality output. The hidden state from the transformer conditions the velocity prediction, creating a seamless integration between text understanding and image generation.
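
The idea of a velocity field that carries noise toward a target patch can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the NextStep-1 implementation: it assumes a straight-line path with noise at t=0 and the target at t=1 (the actual convention may differ), and it replaces the learned head with a hypothetical `oracle_velocity` that already knows the target. A real head would be a small network that also receives the transformer's hidden state as conditioning.

```python
import random
from typing import Callable, List

Vector = List[float]


def interpolate(noise: Vector, target: Vector, t: float) -> Vector:
    """Point on the straight path from noise (t=0) to target (t=1)."""
    return [(1.0 - t) * n + t * x for n, x in zip(noise, target)]


def velocity_target(noise: Vector, target: Vector) -> Vector:
    """Regression target for the head: the constant velocity of that path."""
    return [x - n for n, x in zip(noise, target)]


def mse(pred: Vector, true: Vector) -> float:
    """Mean squared error between a predicted and a true velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, true)) / len(true)


def euler_sample(velocity_fn: Callable[[Vector, float], Vector],
                 noise: Vector, num_steps: int) -> Vector:
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x = list(noise)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
target = [0.5, -1.0, 2.0]  # stand-in for one image patch latent
noise = [rng.gauss(0.0, 1.0) for _ in target]


def oracle_velocity(x: Vector, t: float) -> Vector:
    """Stand-in for a perfectly trained head (assumption for the demo)."""
    return [(xt - xi) / (1.0 - t) for xt, xi in zip(target, x)]


sample = euler_sample(oracle_velocity, noise, num_steps=28)
print("max abs error:", max(abs(s - x) for s, x in zip(sample, target)))
```

During training, the head's prediction at `interpolate(noise, target, t)` would be compared against `velocity_target(noise, target)` using `mse`; at inference, `euler_sample` plays the role of the sampling loop whose length is set by the number of sampling steps.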

Key Features of NextStep-1

Continuous Token Processing

Operates directly on continuous image tokens without quantization loss, maintaining fine-grained detail and quality throughout the generation process.

Flow Matching Technology

Employs a lightweight flow matching head that learns velocity fields to guide image generation, offering computational efficiency over traditional diffusion methods.

Unified Training Approach

Uses next-token prediction objectives for both text and image tokens, creating a cohesive learning framework that excels in both generation and editing tasks.

High-Fidelity Image Synthesis

Generates detailed portraits, realistic objects, animals, and complex scenes with exceptional quality and consistency across diverse visual scenarios.

Advanced Image Editing

Demonstrates strong performance in instruction-based editing, including object addition, background changes, material modifications, and style transfers.

Scalable Architecture

Built with a 14B parameter foundation that can be scaled and fine-tuned for specific applications while maintaining computational efficiency.

Training Process and Performance

NextStep-1 follows a comprehensive five-stage training recipe that progressively builds capabilities from basic language understanding to sophisticated image generation and editing. The training process begins with foundation stages focusing on text-only learning, then gradually introduces image-text pairs and advanced editing capabilities.

The training hyperparameters are tuned per phase. Learning rate schedules transition from constant to cosine decay, with weight decay held at 0.1 throughout training. The cross-entropy and mean squared error losses are weighted at a 0.01 to 1 ratio to keep learning stable across the text and image modalities. The number of training steps decreases in later stages, while image resolution increases from 256 pixels to a mix of 256 and 512 pixels.
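
The loss weighting and schedule shapes described above can be written down as a small sketch. The 0.01 to 1 weights and the 0.1 weight decay come from the text; the function names, the peak learning rate, and the step counts used in the example calls are hypothetical placeholders, since the source does not state them.

```python
import math

CE_WEIGHT = 0.01    # weight on the text cross-entropy term
MSE_WEIGHT = 1.0    # weight on the image flow matching MSE term
WEIGHT_DECAY = 0.1  # held constant across stages according to the text


def total_loss(ce_loss: float, mse_loss: float) -> float:
    """Combine the two per-modality losses with the 0.01 : 1 weighting."""
    return CE_WEIGHT * ce_loss + MSE_WEIGHT * mse_loss


def learning_rate(step: int, total_steps: int, peak_lr: float,
                  schedule: str, min_lr: float = 0.0) -> float:
    """Return the learning rate for a constant or cosine-decay schedule."""
    if schedule == "constant":
        return peak_lr
    if schedule == "cosine":
        progress = min(step / total_steps, 1.0)
        return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
    raise ValueError(f"unknown schedule: {schedule!r}")


# Example values only: peak_lr and total_steps are assumptions.
print(total_loss(ce_loss=2.0, mse_loss=0.5))
print(learning_rate(step=500, total_steps=1000, peak_lr=1e-4, schedule="cosine"))
```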

Reported evaluations place NextStep-1 competitively against established baselines. On GenEval the model scores 0.63 overall; on GenAI-Bench it scores 0.88 on basic prompts and 0.67 on advanced prompts; on DPG-Bench it scores 85.28. With self-chain-of-thought prompting, the GenEval score rises to 0.73 and the basic and advanced prompt scores rise to 0.90 and 0.74 respectively. These results position NextStep-1 among the top-performing autoregressive image generation models.

Installation Guide

Follow these steps to set up NextStep-1 in your environment:

Environment Setup

# Create and activate conda environment
conda create -n nextstep python=3.11 -y
conda activate nextstep

# Install uv package manager (optional but recommended)
pip install uv

Download Model

# Clone repository without downloading large files
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/stepfun-ai/NextStep-1-Large
cd NextStep-1-Large

# Install requirements
uv pip install -r requirements.txt

# Download VAE checkpoint
hf download stepfun-ai/NextStep-1-Large "vae/checkpoint.pt" --local-dir ./

Basic Usage

import torch
from transformers import AutoTokenizer, AutoModel
from models.gen_pipeline import NextStepPipeline

HF_HUB = "stepfun-ai/NextStep-1-Large"

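# Load model and tokenizer. local_files_only=True disables downloads, so the
# files must already be available locally (for example in the Hugging Face cache).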
tokenizer = AutoTokenizer.from_pretrained(HF_HUB, local_files_only=True, trust_remote_code=True)
model = AutoModel.from_pretrained(HF_HUB, local_files_only=True, trust_remote_code=True)
pipeline = NextStepPipeline(tokenizer=tokenizer, model=model).to(device="cuda", dtype=torch.bfloat16)

# Generate image
IMG_SIZE = 512
image = pipeline.generate_image(
    "A realistic photograph of a wall with 'NextStep-1.1 is coming' prominently displayed",
    hw=(IMG_SIZE, IMG_SIZE),
    num_images_per_caption=1,
    cfg=7.5,
    num_sampling_steps=28,
    seed=3407,
)[0]
image.save("./output.jpg")
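
Because the seed is an explicit argument, comparing several samples for one prompt is a small loop around the call shown above. The helper below is a hypothetical convenience wrapper, not part of the repository; it uses only the `generate_image` keyword arguments that appear in the snippet above and assumes the call returns a list of images.

```python
from typing import Any, Iterable, List


def generate_variants(pipeline: Any, prompt: str, seeds: Iterable[int],
                      img_size: int = 512) -> List[Any]:
    """Generate one image per seed, keeping every other setting fixed."""
    images = []
    for seed in seeds:
        image = pipeline.generate_image(
            prompt,
            hw=(img_size, img_size),
            num_images_per_caption=1,
            cfg=7.5,
            num_sampling_steps=28,
            seed=seed,
        )[0]
        images.append(image)
    return images
```

With the `pipeline` object from the snippet above, each result of `generate_variants(pipeline, prompt, [1, 2, 3])` could then be saved with `image.save(...)` under a per-seed filename.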

Applications and Use Cases

Creative Content Generation

Generate high-quality artwork, illustrations, and creative visuals for marketing, advertising, and artistic projects with precise control over style and content.

Product Visualization

Create product mockups, concept designs, and visual prototypes for e-commerce, manufacturing, and design workflows.

Educational Materials

Develop educational illustrations, diagrams, and visual aids for textbooks, online courses, and training materials.

Media Production

Generate concept art, storyboards, and visual references for film, television, gaming, and digital media production.

Image Editing and Enhancement

Perform sophisticated image editing tasks including object addition, background replacement, style transfer, and material modification.

Research and Development

Support computer vision research, AI development, and academic studies requiring high-quality synthetic image datasets.

Performance Benchmarks

NextStep-1 demonstrates competitive performance across multiple evaluation benchmarks, establishing itself as a leading autoregressive image generation model. The system excels particularly in scenarios requiring both generation quality and editing precision.

In editing benchmarks, NextStep-1 achieves strong results on GEdit-Bench, scoring 7.15 for semantic consistency and 7.01 for perceptual quality in the English evaluation. The model holds up across languages, scoring 6.88 for semantic consistency and 7.02 for perceptual quality in the Chinese evaluation. These scores position NextStep-1 among the strongest open-source editing models, trailing only select proprietary systems.

The unified training approach enables NextStep-1 to excel in both image generation and editing tasks using the same model architecture. This versatility represents a significant advantage over specialized systems that require separate models for different tasks. The continuous token approach paired with flow matching achieves quality levels that rival traditional diffusion systems while maintaining the control and interpretability benefits of autoregressive generation.

Advantages and Limitations

Advantages

  • State-of-the-art autoregressive image generation quality
  • Unified approach for both generation and editing tasks
  • Continuous tokens eliminate quantization artifacts
  • Efficient flow matching reduces computational overhead
  • Strong performance in complex scene generation
  • Excellent instruction-based editing capabilities
  • Open-source availability for research and development

Limitations

  • Requires significant computational resources for training
  • 14B parameter model demands substantial memory
  • Limited resolution capabilities compared to latest diffusion models
  • Complex setup and installation process
  • Training data biases may affect generation quality
  • Still developing compared to mature diffusion alternatives
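
To put the memory limitation in numbers, the weights-only footprint follows directly from the parameter counts given earlier. The sketch below is back-of-the-envelope arithmetic; it ignores activations, attention caches, the VAE, and framework overhead, so real usage is higher.

```python
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}

TRANSFORMER_PARAMS = 14_000_000_000  # 14B backbone
FLOW_HEAD_PARAMS = 157_000_000       # 157M flow matching head


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


total_params = TRANSFORMER_PARAMS + FLOW_HEAD_PARAMS
for dtype in ("float32", "bfloat16"):
    print(f"{dtype}: {weight_memory_gib(total_params, dtype):.1f} GiB")
```

In bfloat16 this comes to roughly 26 GiB (about 28 GB) before any activations are allocated.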

How to Use NextStep-1

Step 1: Environment Preparation

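Set up your Python environment with the required dependencies, including transformers, torch, and the other packages listed in the requirements file. Make sure you have sufficient GPU memory. The stated recommendation is 16GB or more, but the 14B parameters alone occupy roughly 28GB in bfloat16, so a single 16GB GPU would need offloading or quantization to hold the model.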

Step 2: Model Download and Setup

Clone the NextStep-1 repository and download the pre-trained model weights. The VAE checkpoint is essential for proper image encoding and decoding operations.
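
When a repository is cloned with `GIT_LFS_SKIP_SMUDGE=1`, large files are left as small Git LFS pointer text files until the real content is downloaded. A quick check like the one below can catch a checkpoint that is still a pointer before loading fails in a less obvious way. The function names are hypothetical; the only facts relied on are the `vae/checkpoint.pt` path from the download step and the standard first line of a Git LFS pointer file.

```python
from pathlib import Path

# Every Git LFS pointer file begins with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path: Path) -> bool:
    """Return True if the file starts like a Git LFS pointer, not real data."""
    with path.open("rb") as f:
        return f.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def check_checkpoint(path: Path) -> str:
    """Classify a checkpoint path as 'missing', 'lfs-pointer', or 'ok'."""
    if not path.exists():
        return "missing"
    if is_lfs_pointer(path):
        return "lfs-pointer"
    return "ok"


print(check_checkpoint(Path("vae/checkpoint.pt")))
```

A result of `ok` only means the file exists and is not a pointer; it does not verify that the checkpoint is complete or uncorrupted.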

Step 3: Initialize Pipeline

Load the tokenizer and model components, then initialize the NextStepPipeline with appropriate device settings. Configure the model for your specific hardware setup, typically using CUDA with bfloat16 precision for optimal performance.

Step 4: Configure Generation Parameters

Set up your generation parameters including image size, number of sampling steps, classifier-free guidance scale, and random seed for reproducible results. Adjust these parameters based on your quality and speed requirements.
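
One way to keep these settings together is a small configuration object that validates values before they reach the pipeline. The `GenerationConfig` class below is a hypothetical helper, not part of NextStep-1; its field names and defaults mirror the keyword arguments shown in the Basic Usage example, and the "fast" preset values are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class GenerationConfig:
    hw: Tuple[int, int] = (512, 512)  # output height and width in pixels
    num_images_per_caption: int = 1
    cfg: float = 7.5                  # classifier-free guidance scale
    num_sampling_steps: int = 28      # more steps cost more time per image
    seed: int = 3407                  # fixed seed for reproducible results

    def __post_init__(self) -> None:
        if min(self.hw) <= 0:
            raise ValueError("hw must contain positive sizes")
        if self.num_images_per_caption < 1:
            raise ValueError("num_images_per_caption must be at least 1")
        if self.num_sampling_steps < 1:
            raise ValueError("num_sampling_steps must be at least 1")
        if self.cfg < 0:
            raise ValueError("cfg must be non-negative")

    def as_kwargs(self) -> Dict[str, Any]:
        """Keyword arguments in the form used by the Basic Usage example."""
        return {
            "hw": self.hw,
            "num_images_per_caption": self.num_images_per_caption,
            "cfg": self.cfg,
            "num_sampling_steps": self.num_sampling_steps,
            "seed": self.seed,
        }


default = GenerationConfig()
fast = GenerationConfig(hw=(256, 256), num_sampling_steps=16)  # illustrative preset
print(default.as_kwargs())
print(fast.as_kwargs())
```

With a loaded pipeline, the settings could then be passed as `pipeline.generate_image(prompt, **default.as_kwargs())`.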

Step 5: Generate and Save Images

Provide your text prompt and execute the generation process. The model will produce high-quality images based on your description, which can then be saved to your desired location for further use or editing.

Research and Future Development

NextStep-1 represents a significant step forward in autoregressive image generation research, demonstrating that continuous tokens combined with flow matching can achieve results competitive with traditional diffusion approaches. The research opens new avenues for exploration in unified text-image modeling and efficient generation architectures.

Future development directions include scaling to higher resolutions, improving generation speed, and expanding the model's capabilities to handle video generation and more complex editing scenarios. The continuous token approach shows promise for integration with other modalities and could enable more sophisticated multimodal AI systems.

The open-source nature of NextStep-1 encourages community contribution and collaborative research. Researchers and developers can build upon the foundation to explore novel applications, optimization techniques, and architectural improvements that advance the field of AI-generated content.