arXiv cs.SD (Sound)

75 items · Generative Audio & Music Models

Task-Vector Arithmetic for Emotional Expressivity Control in Language-Model-Based Text-to-Speech arXiv cs.SD (Sound) 8h
nnAudio 2: Overcoming Dynamic Compilation Barriers and Transform Inconsistencies arXiv cs.SD (Sound) 8h
Exploring LLMs for South Asian Music Understanding and Generation arXiv cs.SD (Sound) 8h
Probing Spatial Structure in Pretrained Audio Representations arXiv cs.SD (Sound) 8h
Sound Effects Dataset Unification With the Universal Category System arXiv cs.SD (Sound) 8h
SB-RF: Schrödinger Bridge Rectified Flow for One-Step Robust Speech Enhancement arXiv cs.SD (Sound) 8h
Beyond Waveform Robustness: Robust Feature-Vocoder Adversarial Attacks on Automatic Speech Recognition arXiv cs.SD (Sound) 8h
Do speech foundation models perceive speaker similarity as humans do? arXiv cs.SD (Sound) 8h
Sagnac-Assisted Enhanced OTDR for Distributed Acoustic Sensing: A Standardized Benchmark and Engineering Evaluation Framework arXiv cs.SD (Sound) 8h
UniVoice: A Unified Model for Speech and Singing Voice Generation arXiv cs.SD (Sound) 8h
GLASS: GRPO-Trained LoRA for Acoustic Style Steering in Zero-Shot Text-to-Speech arXiv cs.SD (Sound) 8h
Beyond WER: A Paired Acoustic Stress Test for Ambient Clinical Scribes arXiv cs.SD (Sound) 8h
DBHN-Net: Dual-Branch Hybrid Neural Network For Low-Complexity Monaural Speech Enhancement arXiv cs.SD (Sound) 8h
SpeechJBB: Probing Safety Alignment and Comprehension in Large Audio Language Models under Code-Switched Speech arXiv cs.SD (Sound) 8h
Learning Emotion-discriminative Representations for Zero-Shot Cross-lingual Speech Emotion Recognition arXiv cs.SD (Sound) 8h
Channel-Oriented Design for EEG-to-Music Reconstruction arXiv cs.SD (Sound) yest
The Differentiable Auditory Loop (DAL): An ML Framework for Hyper-Personalized Hearing Aids arXiv cs.SD (Sound) yest
Feasibility of Time-Domain DNN-Based Speech Enhancement on Embedded FPGA for Hearing Aid arXiv cs.SD (Sound) yest
Gauss Circle Lattices with Geometric Convolutions for Synthesizing High Dimensional Image-Source Room Impulse Responses arXiv cs.SD (Sound) yest
CleanCodec: Efficient and Robust Speech Tokenization via Perceptually Guided Encoding arXiv cs.SD (Sound) yest
A Second-Order Cepstral Signature of Contact-Vibration Sounds Reproduced by Laptop Loudspeakers: A Synthetic Case Study arXiv cs.SD (Sound) yest
Flow-HOA: Generative Joint Optimization for Ambisonics Encoding via Flow Matching arXiv cs.SD (Sound) yest
SHB-AE: Spherical harmonic beamforming based Ambisonics encoding and upscaling method for smartphone microphone array arXiv cs.SD (Sound) yest
Drift-Augmented Scoring: Text-Derived Noise Robustness for Zero-Shot Audio-Language Classification arXiv cs.SD (Sound) yest
SURF: Separation via Unsupervised Remixing Flow arXiv cs.SD (Sound) yest
FoeGlass: Simple In-Context Learning Is Enough for Red Teaming Audio Deepfake Detectors arXiv cs.SD (Sound) yest
Audio Interaction Model arXiv cs.SD (Sound) yest
Beyond Text Following: Repairable Arbitration Reversals in Audio-Language Models arXiv cs.SD (Sound) yest
DetectZoo: A Unified Toolkit for AI-Generated Content Detection Across Text, Audio, and Image Modalities arXiv cs.SD (Sound) yest
Representation Matters in Randomized Smoothing for Audio Classification arXiv cs.SD (Sound) yest
SegTune: Structured and Fine-Grained Control for Song Generation arXiv cs.SD (Sound) Jun 3
EntangleCodec: A Unified Discrete Audio Tokenizer via Semantic-Acoustic Entanglement arXiv cs.SD (Sound) Jun 3
A Training-Efficient Transformer-Based Anti-Spoofing Network for Logical Access in ASVspoof 5 arXiv cs.SD (Sound) Jun 3
Audio Spotforming via Post-Filtering Using Cross-Array Non-target Estimates arXiv cs.SD (Sound) Jun 3
SketchSong: Hierarchical Song Generation with Sketch Planning and Fine-Grained Multi-Track Modeling arXiv cs.SD (Sound) Jun 3
Speech Emotion Recognition using Attention-based LSTM-Network with Residual Connection arXiv cs.SD (Sound) Jun 3
Tonal parsimony in chord-sequence analysis: combining modulation cost and tonal vocabulary arXiv cs.SD (Sound) Jun 3
Foley-Omni: A Unified Multimodal Generation Model from Task-Level Audio Synthesis to Complete Video Soundtrack Generation arXiv cs.SD (Sound) Jun 3
LiveBand: Live Accompaniment Generation in the Audio Domain arXiv cs.SD (Sound) Jun 3
FSA-GRPO: Teaching Auditory LLMs to Use Few-shot Demonstrations arXiv cs.SD (Sound) Jun 3
Wavelet as Tokenizer: Preliminary Results on a Shared Wavelet Token Schema for Natural Signals arXiv cs.SD (Sound) Jun 3
SVHalluc: Benchmarking Speech-Vision Hallucination in Audio-Visual Large Language Models arXiv cs.SD (Sound) Jun 3
Before Fusion, Ask What to Keep: Contextual Calibration of Multimodal Signals arXiv cs.SD (Sound) Jun 3
A Comparison of Generative and Discriminative Methods for Speech Enhancement: Robustness, Complexity, and Hallucination arXiv cs.SD (Sound) Jun 3
AnyAudio-Judge: A Dynamic Rubric-Based Benchmark and Evaluator for Audio Instruction Following arXiv cs.SD (Sound) Jun 3
DUET: Unified Dual-Space Emotion Control for Diffusion and Flow-Matching Driven Text-to-Speech arXiv cs.SD (Sound) Jun 2
Quality Audio Prototyping: a prototype system for unified sound retrieval and procedural generation arXiv cs.SD (Sound) Jun 2
Beyond the Mouth: Upper-Face Affective Cues in Audiovisual Sentence Recognition under Acoustic Uncertainty arXiv cs.SD (Sound) Jun 2
Sympatheia: Emotionally Adaptive Voice Assistant with Continuous Affect Conditioning arXiv cs.SD (Sound) Jun 2
MelT: GEMM-Native NDFT for Efficient Single-Stage Audio Frontends on Modern Accelerators arXiv cs.SD (Sound) Jun 2
A Lightweight Slot-Attention Framework for Multi-Instrument Multi-Pitch Estimation arXiv cs.SD (Sound) Jun 2
UniVocal: Unified Speech-Singing Code-Switching Synthesis arXiv cs.SD (Sound) Jun 2
HAIM: Human-AI Music Datasets for AI Music Production Tracking Benchmark arXiv cs.SD (Sound) Jun 2
JenBridge: Adaptive Long-Form Video Soundtracking across Scene Transitions arXiv cs.SD (Sound) Jun 2
MOSS-Audio Technical Report arXiv cs.SD (Sound) Jun 2
Echo: A Joint-Embedding Predictive Architecture for Speaker Diarization and Speech Recognition in a Shared Latent Space arXiv cs.SD (Sound) Jun 2
C2GA: A Class-Controllable Generative Augmentation Framework for Respiratory Sound Classification arXiv cs.SD (Sound) Jun 2
Parameter-efficient Dual-encoder Architecture with Differentiable Choquet Integral Fusion for Underwater Acoustic Classification arXiv cs.SD (Sound) Jun 2
DAStatFormer: A Hybrid Multibranch Transformer with Statistical Feature Integration for DAS-Based Pattern Recognitions arXiv cs.SD (Sound) Jun 2
Local Diagnostics of Continuous Normalizing Flow for Out-of-Distribution Detection arXiv cs.SD (Sound) Jun 2
Mental Damage: Caption Poisoning Attacks on Retrieval-Augmented Text-to-Music Generation arXiv cs.SD (Sound) Jun 1
3DAE: Binaural Quality Assessment for Audio Novel View Synthesis with Spatial Maps and Benchmark arXiv cs.SD (Sound) Jun 1
Chatterbox-Flash: Prior-Calibrated Block Diffusion for Streaming Zero-Shot TTS arXiv cs.SD (Sound) Jun 1
AnchorSteer: Self-Discovered Concept Injection for Structure-Preserving Music Editing arXiv cs.SD (Sound) Jun 1
Sound effects in media: A comparative analysis of recorded and synthetic samples in live-action and animation arXiv cs.SD (Sound) Jun 1
MindVoice: Reconstructing Intelligible Speech from Non-invasive Neural Signals with Pretrained Priors arXiv cs.SD (Sound) Jun 1
Latent Space Disentanglement via Activation Steering for Interpretable Attribute Control in Symbolic Music Generation arXiv cs.SD (Sound) Jun 1
Escaping the Linearity Trap: Manifold Detours for Black-Box Adversarial Attacks on Singing Audio Deepfake Detection arXiv cs.SD (Sound) Jun 1
Audio Pirates: Black-box Audio Watermark Removal via Diffusion Priors arXiv cs.SD (Sound) Jun 1
GaMi: Geometry-Agnostic Material Identification via Cross-Modal Subtractive Disentanglement arXiv cs.SD (Sound) Jun 1
A Unified and Reproducible Experimentation Framework for Speech Understanding arXiv cs.SD (Sound) Jun 1
Towards Streaming Synchronized Spatial Audio Generation via Autoregressive Diffusion Transformer arXiv cs.SD (Sound) Jun 1
DOA: Training-Free Decoder-Only Attention Policy for Long-Form Simultaneous Translation with SpeechLLMs arXiv cs.SD (Sound) Jun 1
Scaling Conversational Hungarian ASR: The BEA-Dialogue+ Corpus arXiv cs.SD (Sound) Jun 1
UniAudio-Token: Empowering Semantic Speech Tokenizers with General Audio Perception arXiv cs.SD (Sound) Jun 1
