arXiv cs.SD (Sound)

75 items · Generative Audio & Music Models

Task-Vector Arithmetic for Emotional Expressivity Control in Language-Model-Based Text-to-Speech

arXiv cs.SD (Sound) 8h

nnAudio 2: Overcoming Dynamic Compilation Barriers and Transform Inconsistencies

arXiv cs.SD (Sound) 8h

Exploring LLMs for South Asian Music Understanding and Generation

arXiv cs.SD (Sound) 8h

Probing Spatial Structure in Pretrained Audio Representations

arXiv cs.SD (Sound) 8h

Sound Effects Dataset Unification With the Universal Category System

arXiv cs.SD (Sound) 8h

SB-RF: Schrödinger Bridge Rectified Flow for One-Step Robust Speech Enhancement

arXiv cs.SD (Sound) 8h

Beyond Waveform Robustness: Robust Feature-Vocoder Adversarial Attacks on Automatic Speech Recognition

arXiv cs.SD (Sound) 8h

Do speech foundation models perceive speaker similarity as humans do?

arXiv cs.SD (Sound) 8h

Sagnac-Assisted Enhanced OTDR for Distributed Acoustic Sensing: A Standardized Benchmark and Engineering Evaluation Framework

arXiv cs.SD (Sound) 8h

UniVoice: A Unified Model for Speech and Singing Voice Generation

arXiv cs.SD (Sound) 8h

GLASS: GRPO-Trained LoRA for Acoustic Style Steering in Zero-Shot Text-to-Speech

arXiv cs.SD (Sound) 8h

Beyond WER: A Paired Acoustic Stress Test for Ambient Clinical Scribes

arXiv cs.SD (Sound) 8h

DBHN-Net: Dual-Branch Hybrid Neural Network For Low-Complexity Monaural Speech Enhancement

arXiv cs.SD (Sound) 8h

SpeechJBB: Probing Safety Alignment and Comprehension in Large Audio Language Models under Code-Switched Speech

arXiv cs.SD (Sound) 8h

Learning Emotion-discriminative Representations for Zero-Shot Cross-lingual Speech Emotion Recognition

arXiv cs.SD (Sound) 8h

Channel-Oriented Design for EEG-to-Music Reconstruction

arXiv cs.SD (Sound) yest

The Differentiable Auditory Loop (DAL): An ML Framework for Hyper-Personalized Hearing Aids

arXiv cs.SD (Sound) yest

Feasibility of Time-Domain DNN-Based Speech Enhancement on Embedded FPGA for Hearing Aid

arXiv cs.SD (Sound) yest

Gauss Circle Lattices with Geometric Convolutions for Synthesizing High Dimensional Image-Source Room Impulse Responses

arXiv cs.SD (Sound) yest

CleanCodec: Efficient and Robust Speech Tokenization via Perceptually Guided Encoding

arXiv cs.SD (Sound) yest

A Second-Order Cepstral Signature of Contact-Vibration Sounds Reproduced by Laptop Loudspeakers: A Synthetic Case Study

arXiv cs.SD (Sound) yest

Flow-HOA: Generative Joint Optimization for Ambisonics Encoding via Flow Matching

arXiv cs.SD (Sound) yest

SHB-AE: Spherical harmonic beamforming based Ambisonics encoding and upscaling method for smartphone microphone array

arXiv cs.SD (Sound) yest

Drift-Augmented Scoring: Text-Derived Noise Robustness for Zero-Shot Audio-Language Classification

arXiv cs.SD (Sound) yest

SURF: Separation via Unsupervised Remixing Flow

arXiv cs.SD (Sound) yest

FoeGlass: Simple In-Context Learning Is Enough for Red Teaming Audio Deepfake Detectors

arXiv cs.SD (Sound) yest

Audio Interaction Model

arXiv cs.SD (Sound) yest

Beyond Text Following: Repairable Arbitration Reversals in Audio-Language Models

arXiv cs.SD (Sound) yest

DetectZoo: A Unified Toolkit for AI-Generated Content Detection Across Text, Audio, and Image Modalities

arXiv cs.SD (Sound) yest

Representation Matters in Randomized Smoothing for Audio Classification

arXiv cs.SD (Sound) yest

SegTune: Structured and Fine-Grained Control for Song Generation

arXiv cs.SD (Sound) Jun 3

EntangleCodec: A Unified Discrete Audio Tokenizer via Semantic-Acoustic Entanglement

arXiv cs.SD (Sound) Jun 3

A Training-Efficient Transformer-Based Anti-Spoofing Network for Logical Access in ASVspoof 5

arXiv cs.SD (Sound) Jun 3

Audio Spotforming via Post-Filtering Using Cross-Array Non-target Estimates

arXiv cs.SD (Sound) Jun 3

SketchSong: Hierarchical Song Generation with Sketch Planning and Fine-Grained Multi-Track Modeling

arXiv cs.SD (Sound) Jun 3

Speech Emotion Recognition using Attention-based LSTM-Network with Residual Connection

arXiv cs.SD (Sound) Jun 3

Tonal parsimony in chord-sequence analysis: combining modulation cost and tonal vocabulary

arXiv cs.SD (Sound) Jun 3

Foley-Omni: A Unified Multimodal Generation Model from Task-Level Audio Synthesis to Complete Video Soundtrack Generation

arXiv cs.SD (Sound) Jun 3

LiveBand: Live Accompaniment Generation in the Audio Domain

arXiv cs.SD (Sound) Jun 3

FSA-GRPO: Teaching Auditory LLMs to Use Few-shot Demonstrations

arXiv cs.SD (Sound) Jun 3

Wavelet as Tokenizer: Preliminary Results on a Shared Wavelet Token Schema for Natural Signals

arXiv cs.SD (Sound) Jun 3

SVHalluc: Benchmarking Speech-Vision Hallucination in Audio-Visual Large Language Models

arXiv cs.SD (Sound) Jun 3

Before Fusion, Ask What to Keep: Contextual Calibration of Multimodal Signals

arXiv cs.SD (Sound) Jun 3

A Comparison of Generative and Discriminative Methods for Speech Enhancement: Robustness, Complexity, and Hallucination

arXiv cs.SD (Sound) Jun 3

AnyAudio-Judge: A Dynamic Rubric-Based Benchmark and Evaluator for Audio Instruction Following

arXiv cs.SD (Sound) Jun 3

DUET: Unified Dual-Space Emotion Control for Diffusion and Flow-Matching Driven Text-to-Speech

arXiv cs.SD (Sound) Jun 2

Quality Audio Prototyping: a prototype system for unified sound retrieval and procedural generation

arXiv cs.SD (Sound) Jun 2

Beyond the Mouth: Upper-Face Affective Cues in Audiovisual Sentence Recognition under Acoustic Uncertainty

arXiv cs.SD (Sound) Jun 2

Sympatheia: Emotionally Adaptive Voice Assistant with Continuous Affect Conditioning

arXiv cs.SD (Sound) Jun 2

MelT: GEMM-Native NDFT for Efficient Single-Stage Audio Frontends on Modern Accelerators

arXiv cs.SD (Sound) Jun 2

A Lightweight Slot-Attention Framework for Multi-Instrument Multi-Pitch Estimation

arXiv cs.SD (Sound) Jun 2

UniVocal: Unified Speech-Singing Code-Switching Synthesis

arXiv cs.SD (Sound) Jun 2

HAIM: Human-AI Music Datasets for AI Music Production Tracking Benchmark

arXiv cs.SD (Sound) Jun 2

JenBridge: Adaptive Long-Form Video Soundtracking across Scene Transitions

arXiv cs.SD (Sound) Jun 2

MOSS-Audio Technical Report

arXiv cs.SD (Sound) Jun 2

Echo: A Joint-Embedding Predictive Architecture for Speaker Diarization and Speech Recognition in a Shared Latent Space

arXiv cs.SD (Sound) Jun 2

C2GA: A Class-Controllable Generative Augmentation Framework for Respiratory Sound Classification

arXiv cs.SD (Sound) Jun 2

Parameter-efficient Dual-encoder Architecture with Differentiable Choquet Integral Fusion for Underwater Acoustic Classification

arXiv cs.SD (Sound) Jun 2

DAStatFormer: A Hybrid Multibranch Transformer with Statistical Feature Integration for DAS-Based Pattern Recognitions

arXiv cs.SD (Sound) Jun 2

Local Diagnostics of Continuous Normalizing Flow for Out-of-Distribution Detection

arXiv cs.SD (Sound) Jun 2

Mental Damage: Caption Poisoning Attacks on Retrieval-Augmented Text-to-Music Generation

arXiv cs.SD (Sound) Jun 1

3DAE: Binaural Quality Assessment for Audio Novel View Synthesis with Spatial Maps and Benchmark

arXiv cs.SD (Sound) Jun 1

Chatterbox-Flash: Prior-Calibrated Block Diffusion for Streaming Zero-Shot TTS

arXiv cs.SD (Sound) Jun 1

AnchorSteer: Self-Discovered Concept Injection for Structure-Preserving Music Editing

arXiv cs.SD (Sound) Jun 1

Sound effects in media: A comparative analysis of recorded and synthetic samples in live-action and animation

arXiv cs.SD (Sound) Jun 1

MindVoice: Reconstructing Intelligible Speech from Non-invasive Neural Signals with Pretrained Priors

arXiv cs.SD (Sound) Jun 1

Latent Space Disentanglement via Activation Steering for Interpretable Attribute Control in Symbolic Music Generation

arXiv cs.SD (Sound) Jun 1

Escaping the Linearity Trap: Manifold Detours for Black-Box Adversarial Attacks on Singing Audio Deepfake Detection

arXiv cs.SD (Sound) Jun 1

Audio Pirates: Black-box Audio Watermark Removal via Diffusion Priors

arXiv cs.SD (Sound) Jun 1

GaMi: Geometry-Agnostic Material Identification via Cross-Modal Subtractive Disentanglement

arXiv cs.SD (Sound) Jun 1

A Unified and Reproducible Experimentation Framework for Speech Understanding

arXiv cs.SD (Sound) Jun 1

Towards Streaming Synchronized Spatial Audio Generation via Autoregressive Diffusion Transformer

arXiv cs.SD (Sound) Jun 1

DOA: Training-Free Decoder-Only Attention Policy for Long-Form Simultaneous Translation with SpeechLLMs

arXiv cs.SD (Sound) Jun 1

Scaling Conversational Hungarian ASR: The BEA-Dialogue+ Corpus

arXiv cs.SD (Sound) Jun 1

UniAudio-Token: Empowering Semantic Speech Tokenizers with General Audio Perception

arXiv cs.SD (Sound) Jun 1
