Audio Models
Audio models in Armox generate music, speech, and sound effects from text descriptions or reference inputs.
Overview
Audio models support the following tasks:
- Music generation — Create original music from descriptions
- Text-to-speech — Generate natural voice from text
- Sound effects — Create ambient sounds and effects
- Voice cloning — Generate speech in specific voices
- Audio continuation — Extend existing audio
Available Audio Models
| Model | Provider | Cost | Duration | Best For |
|---|---|---|---|---|
| MusicGen | Meta | 100 credits | 8-30s | Music generation |
| Ace Step | Various | 100 credits | 60-300s | Long-form music |
| Dia TTS | Nari Labs | 50 credits | Variable | Text-to-speech |
| Kokoro TTS | Kokoro | 50 credits | Variable | Fast TTS |
| Chatterbox | Various | 50 credits | Variable | Voice cloning |
Connection Colors
In the Armox Canvas, audio connections use red handles and edges:
- Input Handle: Red circle on the left side of nodes
- Output Handle: Red circle on the right side of nodes
- Connection Edge: Red line connecting nodes
Common Settings
Duration
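Sets the length of the generated audio. The supported range depends on the model: 8-30s for MusicGen and 60-300s for Ace Step, while the speech models produce variable-length output.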
Sample Rate
- 44.1kHz — CD quality
- 48kHz — Professional audio
Format
- MP3 — Compressed, smaller files
- WAV — Uncompressed, higher quality
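To make the trade-off between these settings concrete, the following is a minimal sketch that validates a settings combination against the values listed above and estimates the size of an uncompressed WAV result. The `AudioSettings` class and `wav_size_bytes` function are hypothetical names used for illustration, not part of any Armox API, and the 16-bit stereo defaults are assumptions, since this page does not state a bit depth or channel count.

```python
from dataclasses import dataclass

# Values listed in this section. All names here are illustrative only.
SAMPLE_RATES_HZ = (44_100, 48_000)
FORMATS = ("mp3", "wav")


@dataclass(frozen=True)
class AudioSettings:
    duration_s: float
    sample_rate_hz: int = 44_100
    fmt: str = "mp3"

    def __post_init__(self):
        # Reject combinations that this page does not list as options.
        if self.duration_s <= 0:
            raise ValueError("duration must be positive")
        if self.sample_rate_hz not in SAMPLE_RATES_HZ:
            raise ValueError(f"unsupported sample rate: {self.sample_rate_hz}")
        if self.fmt not in FORMATS:
            raise ValueError(f"unsupported format: {self.fmt}")


def wav_size_bytes(settings, bit_depth=16, channels=2):
    """Approximate size of uncompressed PCM audio data, header excluded.

    bit_depth and channels are assumed defaults, not documented values.
    """
    bytes_per_sample = bit_depth // 8
    return int(settings.duration_s * settings.sample_rate_hz
               * bytes_per_sample * channels)


settings = AudioSettings(duration_s=30, sample_rate_hz=44_100, fmt="wav")
print(wav_size_bytes(settings))  # 5292000 bytes, roughly 5.3 MB
```

Under these assumptions, a 30-second clip at 44.1 kHz is about 5.3 MB as WAV, which is why MP3 is usually the more practical choice when file size matters more than fidelity.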
Choosing the Right Model
For Music
- MusicGen (100 credits) — Short music clips
- Ace Step (100 credits) — Long-form music
For Speech
- Dia TTS (50 credits) — Natural dialogue
- Kokoro TTS (50 credits) — Fast generation
- Chatterbox (50 credits) — Voice cloning
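The selection guidance above can be sketched as a small lookup. This is an illustrative helper only: the task labels (`"music"`, `"dialogue"`, `"fast-tts"`, `"voice-cloning"`) and the `pick_model` function are hypothetical, and the credit costs and duration ranges are copied from the table on this page.

```python
from typing import NamedTuple, Optional, Tuple


class ModelInfo(NamedTuple):
    name: str
    credits: int
    task: str  # hypothetical task label, not an Armox identifier
    duration_s: Optional[Tuple[int, int]]  # None stands for "Variable"


# Data taken from the "Available Audio Models" table.
MODELS = [
    ModelInfo("MusicGen", 100, "music", (8, 30)),
    ModelInfo("Ace Step", 100, "music", (60, 300)),
    ModelInfo("Dia TTS", 50, "dialogue", None),
    ModelInfo("Kokoro TTS", 50, "fast-tts", None),
    ModelInfo("Chatterbox", 50, "voice-cloning", None),
]


def pick_model(task, duration_s=None):
    """Return the first listed model for the task that covers the duration.

    Returns None when no listed model matches.
    """
    for model in MODELS:
        if model.task != task:
            continue
        if model.duration_s is None or duration_s is None:
            return model
        shortest, longest = model.duration_s
        if shortest <= duration_s <= longest:
            return model
    return None


print(pick_model("music", 20).name)    # MusicGen
print(pick_model("music", 120).name)   # Ace Step
print(pick_model("music", 45))         # None
```

The last call returns `None` because the listed ranges leave a gap between 30 and 60 seconds, so a request in that range needs a shorter or longer target duration.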
Best Practices
- Be specific about genre — "jazz", "electronic", "orchestral"
- Describe mood — "upbeat", "melancholic", "energetic"
- Include instruments — "piano", "guitar", "synthesizer"
- Specify tempo — "slow", "moderate", "fast"
- For speech, use natural text — Write as you'd speak
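As a rough illustration of the music-related tips above, the sketch below assembles a prompt from genre, mood, tempo, and instruments. The `build_music_prompt` helper and its output wording are assumptions made for this example; the models accept free-form text, so treat this as one possible way to keep prompts consistent rather than a required format.

```python
def build_music_prompt(genre, mood=None, tempo=None, instruments=()):
    """Combine the recommended descriptors into one comma-separated prompt."""
    if not genre:
        raise ValueError("genre is required")
    parts = [genre]
    if mood:
        parts.append(f"{mood} mood")
    if tempo:
        parts.append(f"{tempo} tempo")
    if instruments:
        # Instruments go last so their commas do not split other descriptors.
        parts.append("featuring " + ", ".join(instruments))
    return ", ".join(parts)


prompt = build_music_prompt(
    "jazz",
    mood="melancholic",
    tempo="slow",
    instruments=["piano", "guitar"],
)
print(prompt)  # jazz, melancholic mood, slow tempo, featuring piano, guitar
```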
Next Steps
Explore individual model documentation for detailed settings and use cases.