Armox Academy 📚


Audio Models

Audio models in Armox generate music, speech, and sound effects from text descriptions or reference inputs.

Overview

Audio models support the following tasks:

Music generation — Create original music from descriptions
Text-to-speech — Generate natural voice from text
Sound effects — Create ambient sounds and effects
Voice cloning — Generate speech in specific voices
Audio continuation — Extend existing audio

Available Audio Models

Model	Provider	Cost	Duration	Best For
MusicGen	Meta	100 credits	8-30s	Music generation
Ace Step	Various	100 credits	60-300s	Long-form music
Dia TTS	Nari Labs	50 credits	Variable	Text-to-speech
Kokoro TTS	Kokoro	50 credits	Variable	Fast TTS
Chatterbox	Various	50 credits	Variable	Voice cloning

Connection Colors

In the Armox Canvas, audio connections use color-coded handles and edges:

Input Handle: Red circle on the left side of nodes
Output Handle: Red circle on the right side of nodes
Connection Edge: Red line connecting nodes

Common Settings

Duration

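Sets the length of the generated audio. The supported range depends on the model: MusicGen produces clips of 8-30 seconds, Ace Step produces 60-300 seconds, and the speech models have variable duration.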

Sample Rate

44.1kHz — CD quality
48kHz — Professional audio

Format

MP3 — Compressed, smaller files
WAV — Uncompressed, higher quality
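To give a rough sense of the size trade-off between the two formats, the sketch below estimates file sizes from the settings above. It is not part of Armox: the function names are hypothetical, and the 16-bit stereo WAV and 192 kbps constant-bitrate MP3 are assumptions, since this page does not state the bit depth, channel count, or bitrate Armox uses.

```python
def wav_size_bytes(duration_s: float, sample_rate_hz: int,
                   bit_depth: int = 16, channels: int = 2) -> int:
    """Approximate size of uncompressed PCM audio data (file header excluded).

    bit_depth and channels are assumed defaults, not documented Armox values.
    """
    bytes_per_sample = bit_depth // 8
    return int(duration_s * sample_rate_hz * bytes_per_sample * channels)


def mp3_size_bytes(duration_s: float, bitrate_kbps: int = 192) -> int:
    """Approximate size of a constant-bitrate MP3 (assumed bitrate)."""
    return int(duration_s * bitrate_kbps * 1000 / 8)


if __name__ == "__main__":
    for rate in (44_100, 48_000):
        size_mb = wav_size_bytes(30, rate) / 1_000_000
        print(f"30 s WAV at {rate} Hz: about {size_mb:.2f} MB")
    print(f"30 s MP3 at 192 kbps: about {mp3_size_bytes(30) / 1_000_000:.2f} MB")
```

Under these assumptions a 30-second WAV is several times larger than the MP3, which is the practical meaning of "compressed, smaller files" versus "uncompressed, higher quality".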

Choosing the Right Model

For Music

MusicGen (100 credits) — Short music clips
Ace Step (100 credits) — Long-form music

For Speech

Dia TTS (50 credits) — Natural dialogue
Kokoro TTS (50 credits) — Fast generation
Chatterbox (50 credits) — Voice cloning
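The selection logic above can be sketched as a small lookup over the model table. This is an illustration only, not an Armox API: the `AudioModel` class, the `candidates` function, and the task labels (`"music"`, `"tts"`, `"voice_cloning"`) are hypothetical names, and the costs and durations are copied from the table on this page.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AudioModel:
    name: str
    task: str                     # hypothetical label, not an Armox identifier
    credits: int
    min_s: Optional[int] = None   # None means "Variable" in the table
    max_s: Optional[int] = None


# Values taken from the "Available Audio Models" table.
MODELS = [
    AudioModel("MusicGen", "music", 100, 8, 30),
    AudioModel("Ace Step", "music", 100, 60, 300),
    AudioModel("Dia TTS", "tts", 50),
    AudioModel("Kokoro TTS", "tts", 50),
    AudioModel("Chatterbox", "voice_cloning", 50),
]


def candidates(task: str, duration_s: Optional[int] = None) -> List[AudioModel]:
    """Return models for a task, filtered by duration where a range is listed."""
    result = []
    for model in MODELS:
        if model.task != task:
            continue
        has_range = model.min_s is not None and model.max_s is not None
        if duration_s is not None and has_range:
            if not (model.min_s <= duration_s <= model.max_s):
                continue
        result.append(model)
    return result


if __name__ == "__main__":
    for model in candidates("music", duration_s=120):
        print(f"{model.name}: {model.credits} credits")
```

Note that, going by the table, no listed music model covers durations between 30 and 60 seconds, so `candidates("music", 45)` returns an empty list in this sketch.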

Best Practices

Be specific about genre — "jazz", "electronic", "orchestral"
Describe mood — "upbeat", "melancholic", "energetic"
Include instruments — "piano", "guitar", "synthesizer"
Specify tempo — "slow", "moderate", "fast"
For speech, use natural text — Write as you'd speak
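As a minimal sketch of the music prompt tips above, the helper below assembles genre, mood, instruments, and tempo into one description. The function name `build_music_prompt` and the output phrasing are assumptions made for illustration; Armox does not prescribe a prompt format on this page.

```python
from typing import Optional, Sequence


def build_music_prompt(genre: str,
                       mood: Optional[str] = None,
                       instruments: Sequence[str] = (),
                       tempo: Optional[str] = None) -> str:
    """Combine the recommended prompt elements into a single description.

    Hypothetical helper: the wording is one reasonable choice, not a
    required format.
    """
    # Mood reads naturally as an adjective in front of the genre.
    segments = [f"{mood} {genre}" if mood else genre]
    if tempo:
        segments.append(f"{tempo} tempo")
    if instruments:
        segments.append("featuring " + " and ".join(instruments))
    return ", ".join(segments)


if __name__ == "__main__":
    print(build_music_prompt("jazz", mood="melancholic",
                             instruments=["piano", "guitar"], tempo="slow"))
    # prints: melancholic jazz, slow tempo, featuring piano and guitar
```

Keeping each element as a separate argument makes it easy to vary one attribute at a time, for example the tempo, and compare the generated results.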

Next Steps

Explore individual model documentation for detailed settings and use cases.
