HiDream-O1-Image

HiDream-O1-Image is a natively unified image generation foundation model built on a Pixel-level Unified Transformer (UiT). It dispenses with external VAEs and disjoint text encoders, encoding raw pixels, text, and task-specific conditions in a single shared token space, and supports text-to-image generation, image editing, and subject-driven personalization at resolutions up to 2,048 × 2,048.

Project Updates

  • 🤗 May 10, 2026: Try HiDream-O1-Image online on Hugging Face Spaces — 🤗 HiDream-O1-Image and 🤗 HiDream-O1-Image-Dev.
  • 📕 May 10, 2026: Our technical report is now available — 📑 HiDream-O1-Image.pdf.
  • 🚀 May 8, 2026: We've open-sourced HiDream-O1-Image (8B), including both the undistilled and distilled Dev variants, together with the Reasoning-Driven Prompt Agent.

HiDream-O1-Image (codename: Peanut) debuts at #8 in the Artificial Analysis Text to Image Arena, positioning it as the new leading open-weights text-to-image model (May 5, 2026).

Artificial Analysis Text to Image Arena

General text-to-image generation
General text-to-image generation at up to 2,048 × 2,048.

Long-text rendering and layout
Long-text rendering & layout control — accurate, multi-region, multilingual text.

Subject-driven personalization
Subject-driven personalization — preserve identity / IP across new scenes.

Key Features

  • 🧬 Pixel-Level Unified Transformer — One end-to-end model on raw pixels, no VAE, no disjoint text encoder.
  • 🎨 One Model, Many Tasks — Text-to-image, long-text rendering, instruction editing, subject-driven personalization, and storyboard generation in a single architecture.
  • 🧠 Reasoning-Driven Prompt Agent — Built-in "thinking" agent that resolves implicit knowledge, layout, and text rendering before generation.
  • 🖼️ Native High Resolution — Direct synthesis up to 2,048 × 2,048 with sharp fine-grained detail.
  • ⚡ Exceptional Efficiency and Versatility at 8B Scale — With only 8B parameters, matches or surpasses much larger open-source DiTs and leading closed-source models.

Models

| Name | Script | Inference Steps | HuggingFace Repo |
|---|---|---|---|
| HiDream-O1-Image | inference.py | 50 | 🤗 HiDream-O1-Image |
| HiDream-O1-Image-Dev | inference.py | 28 | 🤗 HiDream-O1-Image-Dev |
| Prompt Agent | prompt_agent.py |  | 🤗 google/gemma-4-31B-it |
| Web Demo | app.py |  |  |

Evaluation

We benchmark HiDream-O1-Image against state-of-the-art open-source and proprietary models on five widely used evaluation suites covering compositional generation, dense prompt alignment, human preference, complex visual text generation, and long-text rendering.

GenEval — compositional generation

| Model | #Params | Single-Obj | Two-Obj | Count | Color | Position | Attr | Overall |
|---|---|---|---|---|---|---|---|---|
| Nano Banana 2.0 |  | 1.00 | 0.96 | 0.71 | 0.84 | 0.86 | 0.65 | 0.83 |
| Seedream-4.0 |  | 1.00 | 0.92 | 0.71 | 0.93 | 0.78 | 0.68 | 0.84 |
| GPT Image 1 [High] |  | 0.99 | 0.92 | 0.85 | 0.92 | 0.75 | 0.61 | 0.84 |
| GPT Image 2 |  | 0.99 | 0.98 | 0.85 | 0.93 | 0.85 | 0.77 | 0.89 |
| PixArt | 4.3B + 0.6B | 0.98 | 0.50 | 0.44 | 0.80 | 0.08 | 0.07 | 0.48 |
| Show-o | 1.3B | 0.95 | 0.52 | 0.49 | 0.82 | 0.11 | 0.28 | 0.53 |
| Emu3-Gen | 8B | 0.98 | 0.71 | 0.34 | 0.81 | 0.17 | 0.21 | 0.54 |
| SD3-Medium | 5.5B + 2B | 0.98 | 0.74 | 0.63 | 0.67 | 0.34 | 0.36 | 0.62 |
| JanusFlow | 1.3B | 0.97 | 0.59 | 0.45 | 0.83 | 0.53 | 0.42 | 0.63 |
| FLUX.1 [Dev] | 4.8B + 12B | 0.98 | 0.81 | 0.74 | 0.79 | 0.22 | 0.45 | 0.66 |
| SD3.5 Large | 5.5B + 8.1B | 0.98 | 0.89 | 0.73 | 0.83 | 0.34 | 0.47 | 0.71 |
| Janus-Pro-7B | 7B | 0.99 | 0.89 | 0.59 | 0.90 | 0.79 | 0.66 | 0.80 |
| Z-Image-Turbo | 4B + 6B | 1.00 | 0.95 | 0.77 | 0.89 | 0.65 | 0.68 | 0.82 |
| FLUX.2 [Dev] | 24B + 32B | 1.00 | 0.99 | 0.79 | 0.93 | 0.73 | 0.78 | 0.87 |
| Qwen-Image | 7B + 20B | 0.99 | 0.92 | 0.89 | 0.88 | 0.76 | 0.77 | 0.87 |
| HiDream-O1-Image | 8B | 1.00 | 0.99 | 0.79 | 0.89 | 0.93 | 0.78 | 0.90 |
| HiDream-O1-Image-Pro | 200B+ | 1.00 | 0.99 | 0.85 | 0.94 | 0.94 | 0.79 | 0.92 |
DPG-Bench — dense prompt alignment

| Model | #Params | Global | Entity | Attribute | Relation | Other | Overall |
|---|---|---|---|---|---|---|---|
| GPT Image 1 [High] |  | 88.89 | 88.94 | 89.84 | 92.63 | 90.96 | 85.15 |
| GPT Image 2 |  | 87.27 | 91.91 | 90.85 | 91.59 | 91.58 | 85.98 |
| Nano Banana 2.0 |  | 85.17 | 92.55 | 91.16 | 90.45 | 91.08 | 86.90 |
| Seedream-4.0 |  | 87.17 | 92.41 | 92.29 | 93.33 | 95.48 | 88.63 |
| SD v1.5 | 0.12B + 0.86B | 74.63 | 74.23 | 75.39 | 73.49 | 67.81 | 63.18 |
| PixArt | 4.3B + 0.6B | 74.97 | 79.32 | 78.60 | 82.57 | 76.96 | 71.11 |
| Lumina-Next | 2B + 2B | 82.82 | 88.65 | 86.44 | 80.53 | 81.82 | 74.63 |
| SDXL | 0.81B + 2.6B | 83.27 | 82.43 | 80.91 | 86.76 | 80.41 | 74.65 |
| Hunyuan-DiT | 4.8B + 1.5B | 84.59 | 80.59 | 88.01 | 74.36 | 86.41 | 78.87 |
| Emu3-Gen | 8B | 85.21 | 86.68 | 86.84 | 90.22 | 83.15 | 80.60 |
| DALL-E 3 |  | 90.97 | 89.61 | 88.39 | 90.58 | 89.83 | 83.50 |
| FLUX.1 [Dev] | 4.8B + 12B | 74.35 | 90.00 | 88.96 | 90.87 | 88.33 | 83.84 |
| SD3 Medium | 5.5B + 2B | 87.90 | 91.01 | 88.83 | 80.70 | 88.68 | 84.08 |
| Janus-Pro-7B | 7B | 86.90 | 88.90 | 89.40 | 89.32 | 89.48 | 84.19 |
| Z-Image-Turbo | 4B + 6B | 91.29 | 89.59 | 90.14 | 92.16 | 88.68 | 84.86 |
| HiDream-I1-Full | 13.5B + 17B | 76.44 | 90.22 | 89.48 | 93.74 | 91.83 | 85.89 |
| FLUX.2 [Dev] | 24B + 32B | 92.20 | 91.36 | 93.28 | 93.52 | 89.72 | 87.57 |
| Qwen-Image | 7B + 20B | 91.32 | 91.56 | 92.02 | 94.31 | 92.73 | 88.32 |
| HiDream-O1-Image | 8B | 95.15 | 92.32 | 93.74 | 92.88 | 90.25 | 89.83 |
| HiDream-O1-Image-Pro | 200B+ | 94.97 | 95.42 | 92.59 | 90.82 | 89.50 | 90.30 |
HPSv3 — human preference across 12 categories

| Model | #Params | All | Characters | Arts | Design | Architecture | Animals | Natural Scenery | Transportation | Products | Plants | Food | Science | Others |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Seedream-4.0 |  | 9.32 | 9.83 | 9.20 | 8.83 | 9.95 | 8.99 | 9.40 | 9.58 | 9.12 | 9.26 | 9.75 | 9.11 | 9.51 |
| Nano Banana 2.0 |  | 10.01 | 10.18 | 9.18 | 9.58 | 10.96 | 9.71 | 10.04 | 10.38 | 10.36 | 10.14 | 10.61 | 9.14 | 9.89 |
| GPT Image 2 |  | 10.21 | 10.75 | 9.91 | 10.15 | 10.59 | 10.05 | 10.29 | 10.17 | 10.26 | 10.07 | 10.75 | 10.05 | 10.00 |
| Z-Image-Turbo | 4B + 6B | 8.35 | 8.98 | 8.29 | 7.65 | 9.26 | 8.51 | 8.33 | 8.81 | 7.83 | 8.46 | 8.64 | 7.93 | 8.57 |
| FLUX.2 [Dev] | 24B + 32B | 9.28 | 10.23 | 9.56 | 8.80 | 9.73 | 9.43 | 9.21 | 9.44 | 8.93 | 9.23 | 9.82 | 8.67 | 9.11 |
| Qwen-Image | 7B + 20B | 9.94 | 10.91 | 10.47 | 9.56 | 10.22 | 10.61 | 9.87 | 10.10 | 9.15 | 9.99 | 10.08 | 9.19 | 9.83 |
| HiDream-O1-Image | 8B | 10.37 | 10.59 | 10.44 | 10.29 | 11.02 | 10.34 | 10.37 | 10.54 | 10.50 | 10.38 | 10.85 | 9.68 | 10.09 |
| HiDream-O1-Image-Pro | 200B+ | 10.47 | 10.63 | 10.51 | 10.33 | 11.11 | 10.08 | 10.45 | 10.37 | 10.75 | 10.29 | 11.13 | 10.09 | 10.39 |
CVTG-2K — complex visual text generation

| Model | #Params | 2 regions | 3 regions | 4 regions | 5 regions | Average | NED | CLIP Score |
|---|---|---|---|---|---|---|---|---|
| Nano Banana 2.0 |  | 0.7465 | 0.7720 | 0.8067 | 0.7980 | 0.7875 | 0.8945 | 0.7212 |
| GPT Image 1 [High] |  | 0.8779 | 0.8659 | 0.8731 | 0.8218 | 0.8569 | 0.9478 | 0.7982 |
| Seedream-4.0 |  | 0.8980 | 0.8949 | 0.9044 | 0.9015 | 0.9003 | 0.9511 | 0.8033 |
| GPT Image 2 |  | 0.8904 | 0.8887 | 0.9101 | 0.9044 | 0.9003 | 0.9515 | 0.7798 |
| TextDiffuser-2 | 0.12B + 0.9B | 0.5322 | 0.3255 | 0.1787 | 0.0809 | 0.2326 | 0.4353 | 0.6765 |
| RAG-Diffusion | 4.8B + 12B | 0.4388 | 0.3316 | 0.2116 | 0.1910 | 0.2648 | 0.4498 | 0.7797 |
| AnyText | 0.123B + 1.2B | 0.0513 | 0.1739 | 0.1948 | 0.2249 | 0.1804 | 0.4675 | 0.7432 |
| 3DIS | 0.81B + 2.6B | 0.4495 | 0.3959 | 0.3880 | 0.3303 | 0.3813 | 0.6505 | 0.7767 |
| FLUX.1 [Dev] | 4.8B + 12B | 0.6089 | 0.5531 | 0.4661 | 0.4316 | 0.4965 | 0.6879 | 0.7401 |
| SD3.5 Large | 5.5B + 8.1B | 0.7293 | 0.6825 | 0.6574 | 0.5940 | 0.6548 | 0.8470 | 0.7797 |
| TextCrafter | 7B + 20B | 0.7628 | 0.7628 | 0.7406 | 0.6977 | 0.7370 | 0.8679 | 0.7868 |
| Qwen-Image | 7B + 20B | 0.8370 | 0.8364 | 0.8313 | 0.8158 | 0.8288 | 0.9116 | 0.8017 |
| Z-Image-Turbo | 4B + 6B | 0.8872 | 0.8662 | 0.8628 | 0.8347 | 0.8585 | 0.9281 | 0.8048 |
| FLUX.2 [Dev] | 24B + 32B | 0.9261 | 0.8897 | 0.8995 | 0.8732 | 0.8926 | 0.9475 | 0.8104 |
| HiDream-O1-Image | 8B | 0.9085 | 0.9159 | 0.9216 | 0.9015 | 0.9128 | 0.9561 | 0.8076 |
| HiDream-O1-Image-Pro | 200B+ | 0.9133 | 0.9221 | 0.9365 | 0.9175 | 0.9222 | 0.9628 | 0.8349 |
LongText-Bench — long-text rendering, EN & ZH

| Model | #Params | LongText-Bench-EN | LongText-Bench-ZH |
|---|---|---|---|
| Seedream-4.0 |  | 0.936 | 0.946 |
| GPT Image 1 [High] |  | 0.956 | 0.619 |
| GPT Image 2 |  | 0.960 | 0.961 |
| Nano Banana 2.0 |  | 0.980 | 0.965 |
| Janus-Pro-7B | 7B | 0.019 | 0.006 |
| BLIP3-o | 7B + 1.4B | 0.021 | 0.018 |
| Kolors 2.0 |  | 0.258 | 0.329 |
| BAGEL | 7B + 7B | 0.373 | 0.310 |
| OmniGen2 | 3B + 4B | 0.561 | 0.059 |
| X-Omni | 7B | 0.900 | 0.814 |
| HiDream-I1-Full | 13.5B + 17B | 0.543 | 0.024 |
| FLUX.1 [Dev] | 4.8B + 12B | 0.607 | 0.005 |
| Z-Image-Turbo | 4B + 6B | 0.917 | 0.926 |
| FLUX.2 [Dev] | 24B + 32B | 0.963 | 0.757 |
| Qwen-Image | 7B + 20B | 0.943 | 0.946 |
| HiDream-O1-Image | 8B | 0.979 | 0.978 |
| HiDream-O1-Image-Pro | 200B+ | 0.982 | 0.980 |

Installation

  1. Clone this repository:
git clone https://github.com/HiDream-ai/HiDream-O1-Image.git
cd HiDream-O1-Image
  2. Install the required dependencies:
pip install -r requirements.txt

Note on flash-attn. We highly recommend installing flash-attn for optimized attention computation. If you do not (or cannot) install flash-attn, you must edit models/pipeline.py line 291 and change "use_flash_attn": True to "use_flash_attn": False — otherwise inference will fail to import the kernel.
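As an illustrative pattern (not the repository's actual code), the toggle the note describes amounts to detecting at startup whether flash-attn can be imported and falling back to standard attention otherwise:

```python
import importlib.util

def module_available(name: str) -> bool:
    """Return True if the named package can be imported in this environment."""
    return importlib.util.find_spec(name) is not None

# A sketch of the flag the note above refers to: the real switch lives in
# models/pipeline.py ("use_flash_attn"), which must be set to False by hand
# when flash-attn is not installed.
USE_FLASH_ATTN = module_available("flash_attn")
```

A check like this avoids the import failure entirely, at the cost of silently running the slower attention path when flash-attn is absent.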

Reasoning-Driven Prompt Agent

HiDream-O1-Image ships with a Reasoning-Driven Prompt Agent (prompt_agent.py) that explicitly reasons through layout, subject attributes, physical logic, and text-rendering details, then rewrites a raw user instruction into a self-contained English prompt. It supports two backends — pick one with --backend.

The agent prints a JSON object with three fields: prompt (rewritten English prompt), reasoning, and resolved_knowledge. Feed the prompt field into inference.py for best results on intricate, reasoning-heavy requests.
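Since the agent writes its result to stdout as a JSON object, chaining it into inference.py takes only a few lines of glue code. A minimal sketch (the field names follow the description above; the example string is hypothetical):

```python
import json

def extract_prompt(agent_stdout: str) -> str:
    """Parse the agent's JSON output and return the rewritten English prompt.

    The agent emits an object with `prompt`, `reasoning`, and
    `resolved_knowledge` fields; only `prompt` is fed to inference.py.
    """
    return json.loads(agent_stdout)["prompt"]

# Hypothetical agent output, captured e.g. via
# subprocess.run([...], capture_output=True, text=True).stdout:
raw = '{"prompt": "An ancient stone wall with a classical poem...", "reasoning": "...", "resolved_knowledge": "..."}'
refined = extract_prompt(raw)
```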

Option A — Local Backend (Gemma-4-31B-it)

  1. Download the Gemma weights (requires accepting the Gemma license on HuggingFace):
huggingface-cli download google/gemma-4-31B-it --local-dir /path/to/gemma-4-31B-it
  2. Run the agent locally:
python prompt_agent.py \
    --backend local \
    --model_id /path/to/gemma-4-31B-it \
    --prompt "李白的静夜思写在古墙上"

Option B — External OpenAI-Compatible API

Use any OpenAI-compatible endpoint (OpenAI, Azure, vLLM, SGLang, DeepSeek, etc.) by providing --base_url, --api_key, and --model_name:

python prompt_agent.py \
    --backend api \
    --base_url https://api.openai.com/v1 \
    --api_key $OPENAI_API_KEY \
    --model_name deepseek-v4-pro \
    --prompt "李白的静夜思写在古墙上"

Usage

A CUDA-capable GPU is required for inference. The examples below use the undistilled model (--model_type full); see the last subsection for running the same tasks with the distilled model (--model_type dev).

1. Text-to-Image Generation

Generate an image from a text prompt:

python inference.py \
    --model_path /path/to/HiDream-O1-Image \
    --prompt "medium shot, eye-level, front view. A woman is seated in an ornate bedroom, illuminated by candlelight, with a calm and composed expression. The subject is a young woman with fair skin, light brown hair styled in an updo with loose tendrils framing her face, and blue eyes. She wears a cream-colored satin robe with delicate floral embroidery and lace trim along the neckline. Her ears are adorned with pearl drop earrings. She is seated on a bed with a dark, intricately carved wooden headboard. To her left, a wooden nightstand holds three lit white candles and a candelabra with multiple lit candles in the background. The bed is covered with patterned pillows and a dark, textured blanket. The walls are paneled with dark wood and feature a large, ornate tapestry with muted earth tones. The lighting creates soft highlights on her face and robe, with warm shadows cast across the room." \
    --output_image results/t2i.png \
    --height 2048 \
    --width 2048

2. Instruction-Based Image Editing

Provide a single reference image and an editing instruction:

python inference.py \
    --model_path /path/to/HiDream-O1-Image \
    --prompt "remove the earphones" \
    --ref_images assets/edit/test.jpg \
    --output_image results/edit.png \
    --keep_original_aspect

3. Multi-Reference Subject-Driven Personalization

Provide two or more reference images that define the subject(s), and a prompt that places them in a new scene:

python inference.py \
    --model_path /path/to/HiDream-O1-Image \
    --prompt "A young boy with blonde hair stands on steps wearing light blue jeans, a white t-shirt with logo, and blue and white sneakers. He wears a brown cord necklace with beads, a black wristwatch with digital display, and carries a yellow fanny pack with white zipper. In his hand is a red boxing glove with white top, a teal plastic toy car, and a plastic toy figure of Captain America. He wears a straw hat with cream band. Natural light illuminates the scene." \
    --ref_images assets/IP/1.jpg assets/IP/2.jpg assets/IP/3.jpg assets/IP/4.jpg assets/IP/5.jpg assets/IP/6.jpg assets/IP/7.jpg assets/IP/8.jpg assets/IP/9.jpg assets/IP/10.jpg \
    --output_image results/subject.png

4. Running with the Dev Model

All three tasks above can be run with the Dev model by switching --model_path to the Dev checkpoint and setting --model_type dev. For example:

python inference.py \
    --model_path /path/to/HiDream-O1-Image-Dev \
    --prompt "A dog holds a sign that says \"HiDream-O1-Image release.\"" \
    --output_image results/t2i_dev.png \
    --model_type dev

Command Line Arguments

  • --model_path: Path to the complete HuggingFace model directory (undistilled or distilled).
  • --prompt: Text prompt for the generation or editing task.
  • --ref_images: Paths to one or more reference images (optional; space-separated).
  • --output_image: Path to save the generated image (default: output.png).
  • --height / --width: Output image dimensions (default: 2048 × 2048; values snap to valid resolutions internally).
  • --model_type: full or dev (default: full). Selects the inference recipe:
    • full: 50 steps, guidance scale 5.0, shift 3.0, default scheduler.
    • dev: 28 steps, guidance scale 0.0, shift 1.0, flash scheduler with predefined timesteps.
  • --seed: Random seed (default: 32).
  • --guidance_scale: Guidance scale (default: 5.0). Only effective when --model_type is full.
  • --noise_scale_start, --noise_scale_end: Control the scale of the noise injected by the scheduler at each denoising step; the per-step scale linearly interpolates from noise_scale_start (first step) to noise_scale_end (last step). See models/pipeline.py:262 and models/pipeline.py:273. Defaults: 7.5, 7.5.
  • --noise_clip_std: Per-step clipping threshold (in units of the injected noise's standard deviation) applied to the noise added during scheduler stepping. See models/flash_scheduler.py:348-350. Default: 2.5.
  • --keep_original_aspect: When exactly one reference image is provided, resize it with max_size=2048 and use its dimensions for the target image, preserving the reference's aspect ratio.
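To make the noise-scale flags concrete, here is an illustrative sketch of the schedule they describe (not the repository's implementation; see models/pipeline.py and models/flash_scheduler.py for the real code):

```python
def noise_scale_at(step: int, num_steps: int, start: float = 7.5, end: float = 7.5) -> float:
    """Per-step noise scale: linear interpolation from `start` at the first
    denoising step to `end` at the last (--noise_scale_start/--noise_scale_end)."""
    if num_steps <= 1:
        return start
    return start + (end - start) * step / (num_steps - 1)

def clip_noise(noise, std, clip_std=2.5):
    """Clamp injected noise to within clip_std standard deviations on either
    side (--noise_clip_std) before it is added during scheduler stepping."""
    bound = clip_std * std
    return [max(-bound, min(bound, x)) for x in noise]
```

With the defaults (7.5, 7.5) the scale is constant across all steps; different start and end values trade off exploration early in denoising against stability late.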

Web Demo

app.py is a self-contained Flask web application that exposes all generation modes. It also integrates the Reasoning-Driven Prompt Agent.

Starting the server

python app.py \
    --model_path /path/to/HiDream-O1-Image \
    --host 0.0.0.0 \
    --port 7860

Then open http://localhost:7860 in your browser.

Command-line arguments

| Argument | Default | Description |
|---|---|---|
| --model_path | $HIDREAM_MODEL_PATH | Path to the checkpoint directory (HiDream-O1-Image or HiDream-O1-Image-Dev). |
| --model_type | full | full (50-step) or dev (28-step). |
| --host | 0.0.0.0 | Bind address for the Flask server. |
| --port | 7860 | Port for the Flask server. |

All four arguments can also be set via environment variables (see .env.example): HIDREAM_MODEL_PATH, HIDREAM_MODEL_TYPE, HIDREAM_HOST, and HIDREAM_PORT.
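The resulting precedence (explicit flag, then environment variable, then built-in default) can be sketched with an illustrative helper (not app.py's actual code):

```python
import os

def resolve_setting(cli_value, env_name, default):
    """Resolve a server setting: an explicit CLI flag wins, then the
    environment variable, then the built-in default."""
    if cli_value is not None:
        return cli_value
    return os.environ.get(env_name, default)

# e.g. resolve_setting(args.port, "HIDREAM_PORT", "7860")
```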

Prompt Agent in the UI

The sidebar contains a Prompt Agent panel that calls the same Reasoning-Driven Prompt Agent used by prompt_agent.py. Select either the OpenAI-compatible API backend (any endpoint, key, and model name) or the Local · Gemma backend (set HIDREAM_AGENT_MODEL in .env or the environment to point to your local Gemma-4-31B-it weights).

License

The code in this repository and the HiDream-O1-Image models are licensed under the MIT License.