Step-Audio-EditX

Edit Audio Like You Edit a Document

Step-Audio-EditX is an AI audio editing model that makes editing voice with natural language instructions as intuitive and efficient as revising a text document. By typing short text commands into Step-Audio-EditX, users can precisely control the emotion, speaking style, and fine-grained paralinguistic effects of their audio.


🎧 Immersive Model Demos

Experience text-driven audio editing and instantly generate your desired effects.

For example, suppose a user has a monotonous narration and wants to make it livelier. With Step-Audio-EditX, they just add an emotion tag (such as [Happy]) to the text instruction, and the model adjusts the voice to sound joyful. Adding paralinguistic effects such as laughter makes the voice even more realistic. Users can also iterate, applying edits multiple times to gradually strengthen the effect until the final audio meets their expectations. This is the power of Step-Audio-EditX.
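
To make this concrete, here is a minimal Python sketch of how such a tag-annotated edit could be assembled. It is an illustration only: the EditRequest structure, the make_livelier helper, and the exact tag syntax are assumptions, not the model's published interface.

```python
# Illustrative only: a hypothetical request structure for a tag-based edit.
from dataclasses import dataclass


@dataclass
class EditRequest:
    audio_path: str   # source recording to edit
    text: str         # transcript, annotated with inline tags such as [Happy]
    instruction: str  # natural-language editing instruction


def make_livelier(audio_path: str, transcript: str) -> EditRequest:
    """Build a request that adds a joyful emotion tag and a laughter tag to a flat narration."""
    tagged = f"[Happy] {transcript} [Laughter]"  # inline tags mark emotion and paralinguistics
    return EditRequest(
        audio_path=audio_path,
        text=tagged,
        instruction="Make the narration sound livelier and more joyful.",
    )


request = make_livelier("narration.wav", "Welcome to today's episode.")
print(request.text)  # "[Happy] Welcome to today's episode. [Laughter]"
```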

Emotion Editing

Source Audio:

“我总觉得,有人在跟着我,我能听到奇怪的脚步声。” ("I always feel like someone is following me; I can hear strange footsteps.")

User prompt:

Fear Emotion

Edit Output:

“我总觉得,有人在跟着我,我能听到奇怪的脚步声。” (Fear)

Speaking Style Editing

Source Audio:

“你到底想怎么着,上学的时候懒得学,工作的时候没时间学。” ("What exactly do you want? When you were in school you were too lazy to study, and now that you work you have no time to study.")

User prompt:

Roar Speaking Style

Edit Output:

“你到底想怎么着,上学的时候懒得学,工作的时候没时间学。” (Roar)

Paralinguistics Editing

Source Audio:

“Wait, you're telling me you finished the entire book in one day? That's incredible!”

User prompt:

Add breathing-sound paralinguistic features

Edit Output:

“Wait, you're telling me you finished the entire book in one day? [Surprise-oh] That's incredible!”

Extension

Source Audio:

“就是说你比如说我一共在这次看病我一共花了一百块钱,其中呢医生的这个劳动价值占了三十块钱。” ("That is to say, for example, this visit to the doctor cost me one hundred yuan in total, of which the doctor's labor accounted for thirty yuan.")

User prompt:

Silence Trimming

Edit Output:

“就是说你比如说我一共在这次看病我一共花了一百块钱,其中呢医生的这个劳动价值占了三十块钱。”

✨ Model Capabilities

Step-Audio-EditX boasts a rich set of audio editing capabilities, meeting various voice processing needs:

🔁

Zero-Shot Voice Cloning & Multi-Language Support

Imitates any speaker's voice to read new text without extra training; supports Mandarin and English, as well as dialects such as Sichuanese and Cantonese, enabling cross-lingual audio creation.

🎭

Emotion and Style Editing

Allows precise adjustment of voice emotion (e.g., anger, joy) and speaking style (e.g., serious, whisper) for the same sentence to deliver different expressive effects.

🫧

Paralinguistic Effects

Step-Audio-EditX can insert subtle details like breathing, laughter, and sighs, making the synthesized voice closer to human conversation and more expressive.

🔄

Iterative Editing

Supports multi-step continuous editing, allowing users to gradually enhance the target emotion or effect, maintaining meticulous control over the final audio output.
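
As a small illustration of this iterative loop, each editing pass feeds the previous output back in as the new source. The edit_audio function below is a hypothetical placeholder, not the model's real API.

```python
# Illustrative iteration loop; edit_audio() is a hypothetical stand-in for a real model call.
def edit_audio(audio_path: str, instruction: str) -> str:
    """Pretend to apply one edit and return the path of the resulting audio file."""
    print(f"editing {audio_path!r}: {instruction}")
    return audio_path.replace(".wav", "_edited.wav")


audio = "narration.wav"
steps = [
    "Add a happy emotion",           # step 1: shift the overall emotion
    "Strengthen the happy emotion",  # step 2: push the same edit further
    "Insert light laughter",         # step 3: add a paralinguistic detail
]
for instruction in steps:
    audio = edit_audio(audio, instruction)  # each pass starts from the previous result
```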

🌍 Use Cases

Step-Audio-EditX's capabilities apply across a wide range of scenarios, making it highly versatile:

🎬

Media Content Creation

Short-video creators can switch a character's voice (e.g., to a lively young girl) with one click; audiobook authors can voice multi-character dialogues on their own, making works more engaging and dynamic.

🤖

Smart Voice Assistants

With Step-Audio-EditX, virtual assistants can dynamically adjust tone (e.g., friendly, concerned) and add subtle details like breathing, making the voice sound natural and lively—no more robotic responses!

🌐

Dubbing and Localization

Film and game voice actors can use Step-Audio-EditX for zero-shot cloning of their voices and subtle emotion adjustments to match scenes; dialogue can also be instantly converted to different languages or dialects, dramatically cutting localization costs.

🧠 Model Architecture and Performance

Step-Audio-EditX is built on a large language model, achieving unified cross-modal editing of audio through text. It is lightweight and efficient: at a compact 3 billion parameters it can run on a single GPU, making it easy to integrate into a variety of applications.
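
As a rough back-of-the-envelope estimate (not an official requirement), weights for roughly 3 billion parameters stored in 16-bit precision take about 6 GB, which is why a single GPU is sufficient to hold the model:

```python
# Rough memory estimate for the model weights alone (activations and caches excluded).
params = 3e9          # ~3 billion parameters
bytes_per_param = 2   # fp16 / bf16 storage
weight_gib = params * bytes_per_param / 1024**3
print(f"~{weight_gib:.1f} GiB of GPU memory for the weights")  # ~5.6 GiB
```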

Core Parameter Count: 3 Billion

Training Data Type: Multi-lingual Emotional Voice

Key Metric Improvement: MOS Score 4.3 / BLEU Increase 17%

🚀 Quick Deployment & Experience

Get started with Step-Audio-EditX now.
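
For example, one plausible way to fetch the released weights is through the Hugging Face Hub; the repository id below is an assumption and should be verified against the official Step-Audio-EditX release notes.

```python
# Download the model weights from the Hugging Face Hub.
# NOTE: the repo id is an assumption; confirm it against the official release.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="stepfun-ai/Step-Audio-EditX")
print(f"Model files downloaded to: {local_dir}")
```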