New Apple-backed AI model can generate sound and speech from silent videos
SMRTR summary
Apple researchers developed VSSFlow, an AI model that can generate both sound effects and speech from silent videos using a single unified system. Earlier models tended to struggle with either speech or non-speech sounds; VSSFlow instead uses a 10-layer architecture that processes video frames and text transcripts simultaneously to create realistic audio. The model achieved competitive results against specialized single-task models and, after additional fine-tuning on synthetic examples, can produce mixed audio containing both environmental sounds and dialogue.
SMRTR provides this summary for quick context. The original article was published by 9to5Mac.