audiocraft-audio-generation

PyTorch library for audio generation including text-to-music (MusicGen) and text-to-sound (AudioGen). Use when you need to generate music from text…

INSTALLATION
npx skills add https://github.com/davila7/claude-code-templates --skill audiocraft-audio-generation
Run in your project or agent environment. Adjust flags if your CLI version differs.

SKILL.md

AudioCraft: Audio Generation

Comprehensive guide to using Meta's AudioCraft for text-to-music and text-to-audio generation with MusicGen, AudioGen, and EnCodec.

When to use AudioCraft

Use AudioCraft when:

  • Generating music from text descriptions
  • Creating sound effects and environmental audio
  • Building music generation applications
  • Generating music conditioned on an existing melody
  • Producing stereo audio output
  • Controlling generation with a reference style (style transfer)

Key features:

  • MusicGen: Text-to-music generation with melody conditioning
  • AudioGen: Text-to-sound effects generation
  • EnCodec: High-fidelity neural audio codec
  • Multiple model sizes: Small (300M) to Large (3.3B)
  • Stereo support: Full stereo audio generation
  • Style conditioning: MusicGen-Style for reference-based generation

Use alternatives instead:

  • Stable Audio: For longer commercial music generation
  • Bark: For text-to-speech with music/sound effects
  • Riffusion: For spectrogram-based music generation
  • OpenAI Jukebox: For raw audio generation with lyrics

Quick start

Installation

# From PyPI
pip install audiocraft

# From GitHub (latest)
pip install git+https://github.com/facebookresearch/audiocraft.git

# Or use HuggingFace Transformers
pip install transformers torch torchaudio

Basic text-to-music (AudioCraft)

import torchaudio
from audiocraft.models import MusicGen

# Load model
model = MusicGen.get_pretrained('facebook/musicgen-small')

# Set generation parameters
model.set_generation_params(
    duration=8,  # seconds
    top_k=250,
    temperature=1.0
)

# Generate from text
descriptions = ["happy upbeat electronic dance music with synths"]
wav = model.generate(descriptions)

# Save audio
torchaudio.save("output.wav", wav[0].cpu(), sample_rate=32000)

Using HuggingFace Transformers

import scipy.io.wavfile  # import the submodule explicitly so wavfile.write is available
import torch
from transformers import AutoProcessor, MusicgenForConditionalGeneration

# Use the GPU when available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load model and processor
processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")
model.to(device)

# Generate music
inputs = processor(
    text=["80s pop track with bassy drums and synth"],
    padding=True,
    return_tensors="pt"
).to(device)

audio_values = model.generate(
    **inputs,
    do_sample=True,
    guidance_scale=3,
    max_new_tokens=256
)

# Save
sampling_rate = model.config.audio_encoder.sampling_rate
scipy.io.wavfile.write("output.wav", rate=sampling_rate, data=audio_values[0, 0].cpu().numpy())
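In the Transformers API, output length is set through max_new_tokens rather than a duration in seconds. The sketch below converts between the two, assuming the 32 kHz MusicGen tokenizer produces 50 token frames per second (the value is also exposed as model.config.audio_encoder.frame_rate). The helper names are hypothetical and not part of either library.

```python
import math

# Assumption: MusicGen's 32 kHz EnCodec tokenizer emits 50 frames per second.
FRAME_RATE_HZ = 50


def tokens_for_seconds(seconds: float, frame_rate: int = FRAME_RATE_HZ) -> int:
    """Return a max_new_tokens value that covers roughly `seconds` of audio."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return math.ceil(seconds * frame_rate)


def seconds_for_tokens(tokens: int, frame_rate: int = FRAME_RATE_HZ) -> float:
    """Return the approximate audio length produced by `tokens` new tokens."""
    return tokens / frame_rate


print(seconds_for_tokens(256))  # 5.12
print(tokens_for_seconds(8))    # 400
```

Under this assumption, the max_new_tokens=256 used above yields about 5 seconds of audio.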

Text-to-sound with AudioGen

import torchaudio
from audiocraft.models import AudioGen

# Load AudioGen
model = AudioGen.get_pretrained('facebook/audiogen-medium')
model.set_generation_params(duration=5)

# Generate sound effects
descriptions = ["dog barking in a park with birds chirping"]
wav = model.generate(descriptions)

# AudioGen outputs 16 kHz audio
torchaudio.save("sound.wav", wav[0].cpu(), sample_rate=16000)

Core concepts

Architecture overview

AudioCraft Architecture:
┌──────────────────────────────────────────────────────────────┐
│                    Text Encoder (T5)                          │
│                         │                                     │
│                    Text Embeddings                            │
└────────────────────────┬─────────────────────────────────────┘
                         │
┌────────────────────────▼─────────────────────────────────────┐
│              Transformer Decoder (LM)                         │
│     Auto-regressively generates audio tokens                  │
│     Using efficient token interleaving patterns               │
└────────────────────────┬─────────────────────────────────────┘
                         │
┌────────────────────────▼─────────────────────────────────────┐
│                EnCodec Audio Decoder                          │
│        Converts tokens back to audio waveform                 │
└──────────────────────────────────────────────────────────────┘
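The "token interleaving" step in the diagram can be illustrated with a toy layout. The sketch below assumes the delay pattern described for MusicGen, in which codebook k is shifted right by k decoding steps so that all codebooks are predicted in one pass per step. The function name and the choice of four codebooks are illustrative assumptions, not AudioCraft's internal API.

```python
from typing import List, Optional


def delay_pattern(num_frames: int, num_codebooks: int = 4) -> List[List[Optional[int]]]:
    """For each decoding step, list which frame index each codebook emits.

    Codebook k is delayed by k steps; None marks a padding slot.
    """
    steps = num_frames + num_codebooks - 1
    layout = []
    for step in range(steps):
        row = []
        for k in range(num_codebooks):
            frame = step - k
            row.append(frame if 0 <= frame < num_frames else None)
        layout.append(row)
    return layout


# Three audio frames and four codebooks need 3 + 4 - 1 = 6 decoding steps.
for step, row in enumerate(delay_pattern(num_frames=3)):
    print(step, row)
```

Each row is one transformer step. A frame is complete only once its last codebook has been emitted, which is why the sequence runs a few steps longer than the number of frames.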

Model variants

| Model | Size | Description | Use Case |
|-------|------|-------------|----------|
| musicgen-small | 300M | Text-to-music | Quick generation |
| musicgen-medium | 1.5B | Text-to-music | Balanced |
| musicgen-large | 3.3B | Text-to-music | Best quality |
| musicgen-melody | 1.5B | Text + melody | Melody conditioning |
| musicgen-melody-large | 3.3B | Text + melody | Best melody |
| musicgen-stereo-* | Varies | Stereo output | Stereo generation |
| musicgen-style | 1.5B | Style transfer | Reference-based |
| audiogen-medium | 1.5B | Text-to-sound | Sound effects |

Generation parameters

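| Parameter | Default | Description |
|-----------|---------|-------------|
| duration | 30.0 (MusicGen), 10.0 (AudioGen) | Length in seconds (1-120) |
| top_k | 250 | Top-k sampling |
| top_p | 0.0 | Nucleus sampling (0 = disabled) |
| temperature | 1.0 | Sampling temperature |
| cfg_coef | 3.0 | Classifier-free guidance |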
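To make the interaction between cfg_coef, temperature, and top_k concrete, here is a minimal pure-Python sketch of one sampling step over a toy vocabulary of four tokens. It assumes the usual classifier-free guidance form, uncond + cfg_coef * (cond - uncond), followed by temperature scaling and top-k filtering. The function names are hypothetical and the real implementation operates on tensors.

```python
import math
import random
from typing import List, Optional


def apply_cfg(cond: List[float], uncond: List[float], cfg_coef: float) -> List[float]:
    """Classifier-free guidance: move logits away from the unconditional branch."""
    return [u + cfg_coef * (c - u) for c, u in zip(cond, uncond)]


def top_k_probs(logits: List[float], top_k: int, temperature: float) -> List[float]:
    """Softmax over the top_k largest logits after temperature scaling."""
    scaled = [x / temperature for x in logits]
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    keep = set(order[:top_k])
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(x - peak) if i in keep else 0.0 for i, x in enumerate(scaled)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_token(cond: List[float], uncond: List[float], cfg_coef: float = 3.0,
                 top_k: int = 250, temperature: float = 1.0,
                 rng: Optional[random.Random] = None) -> int:
    """Draw one token index from the guided, filtered distribution."""
    rng = rng or random.Random()
    probs = top_k_probs(apply_cfg(cond, uncond, cfg_coef), top_k, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


cond = [2.0, 1.0, 0.5, -1.0]    # logits given the text prompt
uncond = [1.0, 1.0, 1.0, 1.0]   # logits with the prompt dropped
print(apply_cfg(cond, uncond, 3.0))  # [4.0, 1.0, -0.5, -5.0]
print(sample_token(cond, uncond, top_k=2, rng=random.Random(0)))
```

A higher cfg_coef widens the gap between tokens the prompt favors and those it does not. A higher temperature flattens the distribution, and a smaller top_k restricts sampling to the most likely tokens.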

MusicGen usage

Text-to-music generation

from audiocraft.models import MusicGen
import torchaudio

model = MusicGen.get_pretrained('facebook/musicgen-medium')

# Configure generation
model.set_generation_params(
    duration=30,          # Up to 30 seconds
    top_k=250,            # Sampling diversity
    top_p=0.0,            # 0 = use top_k only
    temperature=1.0,      # Creativity (higher = more varied)
    cfg_coef=3.0          # Text adherence (higher = stricter)
)

# Generate multiple samples
descriptions = [
    "epic orchestral soundtrack with strings and brass",
    "chill lo-fi hip hop beat with jazzy piano",
    "energetic rock song with electric guitar"
]

# Generate (returns [batch, channels, samples])
wav = model.generate(descriptions)

# Save each
for i, audio in enumerate(wav):
    torchaudio.save(f"music_{i}.wav", audio.cpu(), sample_rate=32000)

Melody-conditioned generation

from audiocraft.models import MusicGen
import torchaudio

# Load melody model
model = MusicGen.get_pretrained('facebook/musicgen-melody')
model.set_generation_params(duration=30)

# Load melody audio
melody, sr = torchaudio.load("melody.wav")

# Generate with melody conditioning
descriptions = ["acoustic guitar folk song"]
wav = model.generate_with_chroma(descriptions, melody, sr)

torchaudio.save("melody_conditioned.wav", wav[0].cpu(), sample_rate=32000)

Stereo generation

import torchaudio
from audiocraft.models import MusicGen

# Load stereo model
model = MusicGen.get_pretrained('facebook/musicgen-stereo-medium')
model.set_generation_params(duration=15)

descriptions = ["ambient electronic music with wide stereo panning"]
wav = model.generate(descriptions)

# wav shape: [batch, 2, samples] for stereo
print(f"Stereo shape: {wav.shape}")  # [1, 2, 480000]

torchaudio.save("stereo.wav", wav[0].cpu(), sample_rate=32000)
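The shape printed above follows from simple arithmetic: 15 seconds at 32 kHz is 480000 samples per channel. The hypothetical helper below sketches that calculation, assuming the model generates whole token frames at 50 frames per second and each frame decodes to sample_rate / frame_rate samples.

```python
from typing import Tuple


def expected_output_shape(batch: int, channels: int, duration_s: float,
                          sample_rate: int = 32000,
                          frame_rate: int = 50) -> Tuple[int, int, int]:
    """Estimate the [batch, channels, samples] shape of a generated waveform.

    Assumes generation runs in whole frames (50 per second is an assumption
    for the 32 kHz MusicGen models), each decoding to a fixed number of samples.
    """
    frames = int(duration_s * frame_rate)
    samples_per_frame = sample_rate // frame_rate
    return (batch, channels, frames * samples_per_frame)


print(expected_output_shape(1, 2, 15))  # (1, 2, 480000)
print(expected_output_shape(3, 1, 8))   # (3, 1, 256000)
```

A check like this is a cheap assertion to run before saving, for example to catch a mono checkpoint loaded where stereo output was expected.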

Audio continuation

import torchaudio
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-medium")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-medium")

# Load audio to continue
audio, sr = torchaudio.load("intro.wav")

# The processor expects mono audio at the model's sampling rate (32 kHz)
target_sr = model.config.audio_encoder.sampling_rate
audio = audio.mean(dim=0)  # down-mix [channels, samples] to [samples]
if sr != target_sr:
    audio = torchaudio.functional.resample(audio, sr, target_sr)

# Process with text and audio
inputs = processor(
    audio=audio.numpy(),
    sampling_rate=target_sr,
    text=["continue with an epic chorus"],
    padding=True,
    return_tensors="pt"
)

# Generate continuation
audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=512)
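When continuing from an audio prompt, the prompt itself uses up part of the model's context. The sketch below estimates how many new tokens remain, assuming the limits stated in the Transformers MusicGen documentation (about 30 seconds, or 1503 tokens, in total) and a rate of 50 tokens per second. Both constants and the helper name are assumptions for illustration.

```python
import math

MAX_TOTAL_TOKENS = 1503  # assumption: about 30 s of total context in Transformers
FRAME_RATE_HZ = 50       # assumption: token frames per second of audio


def continuation_budget(prompt_seconds: float,
                        max_total_tokens: int = MAX_TOTAL_TOKENS,
                        frame_rate: int = FRAME_RATE_HZ) -> int:
    """Return how many new tokens fit after an audio prompt of the given length."""
    prompt_tokens = math.ceil(prompt_seconds * frame_rate)
    return max(0, max_total_tokens - prompt_tokens)


print(continuation_budget(20))  # 503, roughly 10 more seconds
print(continuation_budget(40))  # 0, the prompt alone exceeds the budget
```

If the budget is smaller than the max_new_tokens you intend to pass, trimming the prompt to its last few seconds is one way to leave room for the continuation.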

MusicGen-Style usage

Style-conditioned generation

import torchaudio
from audiocraft.models import MusicGen

# Load style model
model = MusicGen.get_pretrained('facebook/musicgen-style')

# Configure generation with style
model.set_generation_params(
    duration=30,
    cfg_coef=3.0,
    cfg_coef_beta=5.0  # Double-CFG beta: higher values push the text prompt harder
)

# Configure style conditioner
model.set_style_conditioner_params(
    eval_q=3,          # RVQ quantizers (1-6)
    excerpt_length=3.0  # Style excerpt length in seconds
)

# Load style reference
style_audio, sr = torchaudio.load("reference_style.wav")

# Generate with text + style
# (MusicGen-Style takes its audio reference through generate_with_chroma)
descriptions = ["upbeat dance track"]
wav = model.generate_with_chroma(descriptions, style_audio, sr)

Style-only generation (no text)

# Generate matching style without text prompt
model.set_generation_params(
    duration=30,
    cfg_coef=3.0,
    cfg_coef_beta=None  # Disable double CFG for style-only
)

# Pass None as the description; the audio reference goes through generate_with_chroma
wav = model.generate_with_chroma([None], style_audio, sr)

AudioGen usage

Sound effect generation

from audiocraft.models import AudioGen
import torchaudio

model = AudioGen.get_pretrained('facebook/audiogen-medium')
model.set_generation_params(duration=10)

# Generate various sounds
descriptions = [
    "thunderstorm with heavy rain and lightning",
    "busy city traffic with car horns",
    "ocean waves crashing on rocks",
    "crackling campfire in forest"
]

wav = model.generate(descriptions)

for i, audio in enumerate(wav):
    torchaudio.save(f"sound_{i}.wav", audio.cpu(), sample_rate=16000)

EnCodec usage

Audio compression

from audiocraft.models import CompressionModel
import torch
import torchaudio

# Load EnCodec
model = CompressionModel.get_pretrained('facebook/encodec_32khz')

# Load audio
wav, sr = torchaudio.load("audio.wav")

# This checkpoint is mono: down-mix multi-channel input to one channel
wav = wav.mean(dim=0, keepdim=True)

# Ensure correct sample rate
if sr != 32000:
    resampler = torchaudio.transforms.Resample(sr, 32000)
    wav = resampler(wav)

# Encode to tokens (input shape: [batch, channels, samples])
with torch.no_grad():
    codes, scale = model.encode(wav.unsqueeze(0))  # scale may be None

# Decode back to audio
with torch.no_grad():
    decoded = model.decode(codes, scale)

torchaudio.save("reconstructed.wav", decoded[0].cpu(), sample_rate=32000)
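To get a feel for how strongly the codec compresses audio, the token stream's bitrate can be compared with uncompressed PCM. The figures below (50 frames per second, 4 codebooks, 2048 entries per codebook) are assumptions based on the published MusicGen setup, so check the loaded model's own attributes before relying on them.

```python
import math


def codec_bitrate_bps(frame_rate_hz: int, num_codebooks: int, codebook_size: int) -> float:
    """Bits per second needed to store the discrete codes."""
    bits_per_code = math.log2(codebook_size)
    return frame_rate_hz * num_codebooks * bits_per_code


def pcm_bitrate_bps(sample_rate: int, bits_per_sample: int = 16, channels: int = 1) -> int:
    """Bits per second of uncompressed PCM audio."""
    return sample_rate * bits_per_sample * channels


codec = codec_bitrate_bps(frame_rate_hz=50, num_codebooks=4, codebook_size=2048)
pcm = pcm_bitrate_bps(sample_rate=32000)
print(codec)  # 2200.0 bits/s, i.e. about 2.2 kbps
print(pcm)    # 512000 bits/s for 16-bit mono PCM
print(f"compression ratio: about {pcm / codec:.0f}x")
```

The language model only has to predict this short token stream, which is why generation operates on EnCodec codes instead of raw samples.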

Common workflows

Workflow 1: Music generation pipeline

import torch
import torchaudio
from audiocraft.models import MusicGen

class MusicGenerator:
    def __init__(self, model_name="facebook/musicgen-medium"):
        self.model = MusicGen.get_pretrained(model_name)
        self.sample_rate = 32000

    def generate(self, prompt, duration=30, temperature=1.0, cfg=3.0):
        self.model.set_generation_params(
            duration=duration,
            top_k=250,
            temperature=temperature,
            cfg_coef=cfg
        )
        with torch.no_grad():
            wav = self.model.generate([prompt])
        return wav[0].cpu()

    def generate_batch(self, prompts, duration=30):
        self.model.set_generation_params(duration=duration)
        with torch.no_grad():
            wav = self.model.generate(prompts)
        return wav.cpu()

    def save(self, audio, path):
        torchaudio.save(path, audio, sample_rate=self.sample_rate)

# Usage
generator = MusicGenerator()
audio = generator.generate(
    "epic cinematic orchestral music",
    duration=30,
    temperature=1.0
)
generator.save(audio, "epic_music.wav")

Workflow 2: Sound design batch processing

from pathlib import Path

import torchaudio
from audiocraft.models import AudioGen

def batch_generate_sounds(sound_specs, output_dir):
    """
    Generate multiple sounds from specifications.

    Args:
        sound_specs: list of {"name": str, "description": str, "duration": float}
        output_dir: output directory path
    """
    model = AudioGen.get_pretrained('facebook/audiogen-medium')
    output_dir = Path(output_dir)
    # Create the directory, including missing parents
    output_dir.mkdir(parents=True, exist_ok=True)

    results = []
    for spec in sound_specs:
        # Duration is set per sound, defaulting to 5 seconds
        model.set_generation_params(duration=spec.get("duration", 5))
        wav = model.generate([spec["description"]])

        output_path = output_dir / f"{spec['name']}.wav"
        torchaudio.save(str(output_path), wav[0].cpu(), sample_rate=16000)

        results.append({
            "name": spec["name"],
            "path": str(output_path),
            "description": spec["description"]
        })
    return results

# Usage
sounds = [
    {"name": "explosion", "description": "massive explosion with debris", "duration": 3},
    {"name": "footsteps", "description": "footsteps on wooden floor", "duration": 5},
    {"name": "door", "description": "wooden door creaking and closing", "duration": 2}
]
results = batch_generate_sounds(sounds, "sound_effects/")
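Because each spec's name becomes a file name, it can help to validate the specs before loading the model. The helper below is a hypothetical pre-processing step, not part of AudioCraft. It sanitizes names, rejects duplicates and empty descriptions, and caps the duration at an assumed limit of 10 seconds.

```python
import re
from typing import Dict, List


def normalize_specs(specs: List[Dict], default_duration: float = 5.0,
                    max_duration: float = 10.0) -> List[Dict]:
    """Return cleaned copies of sound specs that are safe to use as file names.

    max_duration is an assumed cap for illustration; adjust it to your model.
    """
    seen = set()
    cleaned = []
    for spec in specs:
        # Replace anything that is not a letter, digit, underscore, or hyphen
        name = re.sub(r"[^A-Za-z0-9_-]+", "_", str(spec.get("name", ""))).strip("_")
        description = str(spec.get("description", "")).strip()
        if not name or not description:
            raise ValueError(f"spec needs a name and a description: {spec!r}")
        if name in seen:
            raise ValueError(f"duplicate sound name: {name}")
        seen.add(name)
        duration = min(float(spec.get("duration", default_duration)), max_duration)
        cleaned.append({"name": name, "description": description, "duration": duration})
    return cleaned


specs = normalize_specs([
    {"name": "door slam!", "description": "wooden door slamming", "duration": 2},
    {"name": "rain", "description": "light rain on a window", "duration": 60},
])
print(specs[0]["name"])      # door_slam
print(specs[1]["duration"])  # 10.0
```

The cleaned list has the same keys as the specs used above, so it can be passed straight to batch_generate_sounds.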

Workflow 3: Gradio demo

import gradio as gr
import torch
import torchaudio
from audiocraft.models import MusicGen

model = MusicGen.get_pretrained('facebook/musicgen-small')

def generate_music(prompt, duration, temperature, cfg_coef):
    model.set_generation_params(
        duration=duration,
        temperature=temperature,
        cfg_coef=cfg_coef
    )
    with torch.no_grad():
        wav = model.generate([prompt])

    # Save to temp file
    path = "temp_output.wav"
    torchaudio.save(path, wav[0].cpu(), sample_rate=32000)
    return path

demo = gr.Interface(
    fn=generate_music,
    inputs=[
        gr.Textbox(label="Music Description", placeholder="upbeat electronic dance music"),
        gr.Slider(1, 30, value=8, label="Duration (seconds)"),
        gr.Slider(0.5, 2.0, value=1.0, label="Temperature"),
        gr.Slider(1.0, 10.0, value=3.0, label="CFG Coefficient")
    ],
    outputs=gr.Audio(label="Generated Music"),
    title="MusicGen Demo"
)

demo.launch()
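The demo writes every result to the same temp_output.wav, so two requests arriving close together could overwrite each other's audio. One possible remedy, sketched here with only the standard library, is to create a unique file per request. The helper name and the prefix are hypothetical.

```python
import os
import tempfile


def new_wav_path(prefix: str = "musicgen_") -> str:
    """Create an empty, uniquely named .wav file and return its path."""
    fd, path = tempfile.mkstemp(prefix=prefix, suffix=".wav")
    os.close(fd)  # the audio writer will reopen the file by path
    return path


first = new_wav_path()
second = new_wav_path()
print(first != second)  # True

# Clean up the placeholder files created for this demonstration
os.remove(first)
os.remove(second)
```

In generate_music, the fixed path could be replaced with path = new_wav_path(). Files created this way are not deleted automatically, so a long-running demo would need its own cleanup.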

Performance optimization

Memory optimization

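import torch
from audiocraft.models import MusicGen

# Use smaller model
model = MusicGen.get_pretrained('facebook/musicgen-small')

# Clear cache between generations
torch.cuda.empty_cache()

# Generate shorter durations
model.set_generation_params(duration=10)  # Instead of 30

# Half precision: the MusicGen wrapper is not an nn.Module, so it has no .half().
# On CUDA, AudioCraft already runs generation under float16 autocast.
# With HuggingFace Transformers, pass torch_dtype=torch.float16 to from_pretrained().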

Batch processing efficiency

# Process multiple prompts at once (more efficient)
descriptions = ["prompt1", "prompt2", "prompt3", "prompt4"]
wav = model.generate(descriptions)  # Single batch

# Instead of
for desc in descriptions:
    wav = model.generate([desc])  # Multiple batches (slower)
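A single batch is the fastest option only while it fits in GPU memory. For long prompt lists, a middle ground is to split them into fixed-size batches. The sketch below shows the splitting step in plain Python. The batch size of 3 is an arbitrary assumption to tune for your GPU.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


prompts = [f"prompt{i}" for i in range(1, 8)]
print([len(batch) for batch in chunked(prompts, 3)])  # [3, 3, 1]
```

Each batch can then be passed to model.generate(batch) in turn, with the resulting tensors moved to the CPU before the next batch starts.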

GPU memory requirements

| Model | FP32 VRAM | FP16 VRAM |
|-------|-----------|-----------|
| musicgen-small | ~4GB | ~2GB |
| musicgen-medium | ~8GB | ~4GB |
| musicgen-large | ~16GB | ~8GB |
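The table can be turned into a small lookup for choosing a checkpoint at startup. The sketch below hard-codes the approximate figures from the table. The function name is hypothetical, and real usage also depends on duration and batch size, so treat the result as a starting point.

```python
from typing import Optional

# Approximate requirements from the table above: (FP32 GB, FP16 GB), smallest first
VRAM_GB = {
    "musicgen-small": (4, 2),
    "musicgen-medium": (8, 4),
    "musicgen-large": (16, 8),
}


def largest_model_that_fits(available_gb: float, fp16: bool = False) -> Optional[str]:
    """Return the largest listed model whose estimate fits, or None if none do."""
    column = 1 if fp16 else 0
    fitting = [name for name, req in VRAM_GB.items() if req[column] <= available_gb]
    return fitting[-1] if fitting else None


print(largest_model_that_fits(6, fp16=True))   # musicgen-medium
print(largest_model_that_fits(6, fp16=False))  # musicgen-small
print(largest_model_that_fits(1))              # None
```

On a CUDA machine, the available memory could be read with torch.cuda.mem_get_info() and converted to gigabytes before calling the helper.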

Common issues

| Issue | Solution |
|-------|----------|
| CUDA OOM | Use smaller model, reduce duration |
| Poor quality | Increase cfg_coef, better prompts |
| Generation too short | Check max duration setting |
| Audio artifacts | Try different temperature |
| Stereo not working | Use stereo model variant |
