Automatically tag non-speech audio
under consideration
J D
Perhaps these can be handled with a script layer?
I'm thinking that a Composition could be processed through several AI filters to find interesting information.
We already see how well https://notebooklm.google.com handles audio, but imagine if you could run three filters on each composition:
1) Sound and music - creating 1 script that can be searched for coughs, laughter, singing, etc.
2) Transcript - as usual but faster, like the speed of notebooklm.com for mp3 files
3) Video - looking for people, objects, images, animation, clocks and timers, text, etc. and creating a script with those objects named in the timeline.
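A minimal sketch of how the three filter outputs above could be combined into one searchable script layer. Everything here is hypothetical: the Event fields, the layer names, and the hand-written sample events are assumptions for illustration, not anything Descript provides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    start: float  # seconds from the start of the composition
    end: float
    layer: str    # assumed layer names: "sound", "transcript", "video"
    label: str


def merge_layers(*layers):
    """Combine per-filter event lists into one timeline sorted by time."""
    return sorted(
        (event for layer in layers for event in layer),
        key=lambda e: (e.start, e.end),
    )


def search(timeline, term):
    """Return every event whose label contains the search term."""
    term = term.lower()
    return [e for e in timeline if term in e.label.lower()]


# Hand-written sample output standing in for the three filters.
sound = [Event(3.0, 4.2, "sound", "laughter"), Event(10.0, 10.5, "sound", "cough")]
transcript = [Event(0.0, 3.0, "transcript", "Welcome to the show")]
video = [Event(0.0, 12.0, "video", "person at desk"), Event(5.0, 6.0, "video", "clock on wall")]

timeline = merge_layers(sound, transcript, video)
hits = search(timeline, "laugh")
```

With every layer stored as timestamped text, a search for "laugh" finds the clip the same way a search for a spoken word would.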
J D
Another advantage to doing it that way is that we can imagine adding 3 fancy captions to the result that would allow a blind person to listen to the visual activity or a deaf person to see the audio activity in addition to the transcript.
So consider this an accessibility function as well as a way to make search and AI generation of the content more interesting: clips will be easier to find if the details are in text and accessible to tools that use LLMs.
Support Team
Merged in a post:
breath
Miko
Would be so great to have a big breath/gasp detection to add to the fillers. Thanks.
Gabrielle M.
This would be AMAZING to see in the software! So frustrating when it counts breaths as blank audio clips. >:(
David Nadelberg
This is great. Ideally, this feature would also be able to label recordings of crowd/audience laughter and other noises. If you are editing content that was recorded in front of a live audience, group laughter, shouts, and gasps are key elements that need to be listed in the script. So I am hoping that this tool will be able to recognize group laughter, not just the type of laughter heard in a quiet, 2-person conversation that lacks crowd noise.
Shannon Post
Supporting this feature request
Cody Crabb
This is a big one! I think it would be a great opportunity to make this a feature for robust captions.
Things like:
🎵 music playing 🎵
[laughter]
As another user said, I'd be happy to go through and label these myself if Descript would just prompt me, like "We think this is music, is it?" or "select the caption that best describes this sound" with options for laughter, crying, coughing, or even a custom field.
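A rough sketch of the prompting idea, assuming a detector that returns a guessed label plus a confidence score. The thresholds, the option list, and both function names are made up for illustration.

```python
# Assumed caption choices, taken from the examples in this comment.
OPTIONS = ["laughter", "crying", "coughing", "music"]


def review_prompt(guess, confidence, auto_threshold=0.9):
    """Return None when the guess can be captioned automatically,
    otherwise the question to show the user."""
    if confidence >= auto_threshold:
        return None
    if confidence >= 0.5:
        return f"We think this is {guess}, is it?"
    return (
        "Select the caption that best describes this sound: "
        + ", ".join(OPTIONS)
        + ", or enter a custom label"
    )


def caption(label):
    """Format a confirmed label the way the captions above are written."""
    return "🎵 music playing 🎵" if label == "music" else f"[{label}]"
```

For example, review_prompt("music", 0.7) asks the user to confirm, while a very confident guess skips the prompt and goes straight to caption("music").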
Mohamed
Some of the audio files I use in Descript occasionally have the speaker talking in another language, since they are lectures teaching Arabic. So, as expected, there are some places where Descript doesn't recognize what's being said. Which is fine because it's not in English.
However, why does it treat those places as silences?
This is a major problem because that means that my file is full of places where Descript is saying there is silence when there isn't.
This means that the shorten word gap feature becomes useless, and even worse, on many occasions when I ignore a certain portion using the text on screen, it also ignores some of the Arabic portions and causes very unnatural jumps which I can't do anything about.
In order for us to be able to "trust" the transcript, any sound in the timeline needs some sort of representation in the transcript. Even if Descript doesn't recognize it. I mean, it wrongly transcribes most of the Arabic anyway. So I don't mind what's on screen, as long as there is something I can work with.
So I don't believe that any waveform in the timeline should ever be treated as "silence". Silence means there is nothing there!
So even if it says something like "not recognized", or Lorem Ipsum for that matter, or some random symbols like ***. Whatever it is. Something has to represent that audio!
Not correctly transcribing a word is one thing, that can easily be corrected.
But treating a word as if it hasn't even been uttered because it couldn't be recognized is a disaster!
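To make the request concrete, here is a minimal sketch of the behavior described above: every gap between recognized words is checked against the audio level, and a gap that contains sound gets a placeholder instead of being treated as silence. The function names, the threshold, and the placeholder text are assumptions; this is not how Descript works internally.

```python
import math


def rms(samples):
    """Root-mean-square level of a list of samples in the range -1..1."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def fill_gaps(words, samples, rate, threshold=0.02, placeholder="[unrecognized]"):
    """words: list of (start_sec, end_sec, text), sorted by start time.
    Returns a script that covers the whole recording, with no unlabeled gaps."""
    script = []
    cursor = 0.0
    total = len(samples) / rate
    # A sentinel entry at the end makes the trailing gap get checked too.
    for start, end, text in list(words) + [(total, total, None)]:
        if start > cursor:
            chunk = samples[int(cursor * rate):int(start * rate)]
            label = placeholder if rms(chunk) >= threshold else "[silence]"
            script.append((cursor, start, label))
        if text is not None:
            script.append((start, end, text))
        cursor = max(cursor, end)
    return script


# Synthetic audio at 10 samples per second: word, unrecognized speech, word, true silence.
rate = 10
samples = [0.5] * 10 + [0.3] * 10 + [0.5] * 10 + [0.0] * 10
words = [(0.0, 1.0, "hello"), (2.0, 3.0, "world")]
script = fill_gaps(words, samples, rate)
```

Only the last second comes back as "[silence]"; the untranscribed second in the middle is kept as "[unrecognized]", so it can be edited or ignored deliberately.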
Weldon Johnson
When removing "word gaps", the vast, vast majority of people who edit audio actually want to remove silence. It needs to go off of silence, not words. If there is laughter or something and it is cut off, that is often problematic. But the bigger issue is that transcription is often not accurate. I have a podcast with foreign names. Descript often just ignores these and thinks they are silence. So the "Remove all" word gap feature can NEVER be used reliably.
I assume in the real world that is the case for everyone.
So for every podcast I spend 10 minutes to an hour going through each gap I want to shorten and manually inspecting it. I can remove all the silence in Audacity with the click of one button. It may actually save me time to export from Descript, reimport into Audacity, trim silence, then carry on. Not sure why I don't do that until this issue is resolved.
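A small sketch of what "going off of silence, not words" could look like: scan the audio level in short frames and report only stretches that stay quiet for a minimum duration. This is a generic energy-threshold approach under assumed parameter values, not Audacity's or Descript's actual algorithm.

```python
import math


def find_silences(samples, rate, frame_sec=0.05, threshold=0.01, min_sec=0.5):
    """Return (start_sec, end_sec) spans where the level stays below threshold
    for at least min_sec. samples are floats in the range -1..1."""
    frame = max(1, round(rate * frame_sec))
    silences = []
    run_start = None
    for i in range(math.ceil(len(samples) / frame)):
        chunk = samples[i * frame:(i + 1) * frame]
        level = math.sqrt(sum(s * s for s in chunk) / len(chunk))
        t = i * frame / rate
        if level < threshold:
            if run_start is None:
                run_start = t  # a quiet run begins here
        else:
            if run_start is not None and t - run_start >= min_sec:
                silences.append((run_start, t))
            run_start = None
    end = len(samples) / rate
    if run_start is not None and end - run_start >= min_sec:
        silences.append((run_start, end))
    return silences


# Synthetic audio at 100 samples per second: speech, 1 s of silence,
# quiet laughter, a 0.2 s pause, then speech again.
rate = 100
samples = [0.5] * 100 + [0.0] * 100 + [0.2] * 20 + [0.0] * 20 + [0.5] * 60
silences = find_silences(samples, rate)
```

The quiet laughter and the short pause are left alone; only the full second of real silence is reported, regardless of whether any words were transcribed there.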
Mercedes Rothwell
It would probably be difficult to automatically identify something as laughter or applause, but a solution might be how your "detect speakers" interface works, where you can listen to the clip and identify who it is. It would give users the opportunity to identify it for themselves. It would be great to be able to tell Descript to ignore those sounds the way we can highlight words and ignore them.
Charlie Harding
This is a persistent challenge in editing as well as when generating captions.