Where is an editing program with Automated Speech Recognition?
Posted by Brett Sherman on December 29, 2015 at 8:27 pm
When I think about the feature that would really save me a ton of time editing, it's obvious: Automated Speech Recognition. It would quite frankly be revolutionary for my editing process. The thing is, where is it? The technology is there: Siri, Dragon for Mac, Google Voice, etc. Just about every YouTube clip uploaded is run through ASR that can be accessed with the YouTube API. If every cat video in the world gets ASR, why can't all my clips?
Yes, I know it won't be even 90% reliable at this point. But that doesn't mean it wouldn't be useful at 80%. And that percentage is only going to get better with time. Apple would seem to have some advantage in this with Siri technology.
So how about it Apple? Be revolutionary. Here’s your chance.
Andrew Kimery replied 8 years, 10 months ago 12 Members · 28 Replies -
-
Oliver Peters
December 29, 2015 at 10:05 pm
Do you mean editing based on operator voice commands? If so, CMX tried that decades ago and it never took off.
If you mean speech analysis to generate speech-to-text, then Adobe has had that for several versions. If you properly “teach” the module the type of speech to recognize, it does a passable job, sometimes. However, user response wasn’t great and as far as I can tell, Adobe has largely abandoned it. If it’s still there, it’s somewhere in Audition.
If you mean the opposite – lining up speech to existing text – then Avid and Boris have had that technology courtesy of Nexidia. Because this involves licensing, it becomes a sticky issue, which is why Avid and Nexidia remain apart for any recent MC versions.
But what is it that you are really trying to accomplish? In theory, if you want audio to generate a transcript that shows up in the browser for example and is linked back to audio points within clips, I suspect Apple would run afoul of several patents. Not to say they couldn’t do it, but just that there’s more to it than the technology. Remember, too, that Siri works because of a cloud-based assist from Apple.
– Oliver
Oliver Peters Post Production Services, LLC
Orlando, FL
http://www.oliverpeters.com -
Andy Field
December 29, 2015 at 10:58 pm
I think he meant what Premiere used to do (but they killed it recently): voice recognition that would log and transcribe interviews. When Premiere's worked, it was great (for standard, clear, slowly spoken English); when it didn't, it was worthless. Adobe gave up on it, but if you are a CC subscriber, you can download an earlier version of Premiere Pro CS6 (and even some early CC versions) and it still works.
What I don't get is that Dragon's voice recognition app (free on iPhone) works great and is about 90 percent accurate for most speakers. Why can't they incorporate that into some of the NLEs and take away hours of logging headaches for documentaries and news?
Boris's works phonetically and only for searching something you know is already there, not for transcription.
Andy Field
FieldVision Productions
N. Bethesda, Maryland 20852 -
Brett Sherman
December 30, 2015 at 1:25 am
[Andy Field] “I think he meant what Premiere used to do (but they killed it recently) Voice recognition that would log and transcribe interviews”
That’s it.
[Andy Field] “What i don’t get is that Dragon’s Voice Recognition app (free on IPhone) works great and is about 90 percent accurate for most speakers — why can’t they incorporate that in some of the NLE’s and take away hours of logging headaches for documentaries and news?”
Yep. That’s my point. I have to limit my use of transcription services for cost reasons. If I had textual access to every single clip in my library, it would be a game changer.
As far as patents go, I have no doubt there are B.S. patents. But if Oliver is suggesting it would be against a patent to tie specific speech, via text, to a specific time, then all closed captioning would run afoul of the same patent. That is in fact what an .SRT file is, which is what YouTube's ASR creates.
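For anyone unfamiliar with the format: an SRT file is just plain text with a sequential cue number, a start/end timecode pair, and the caption text, e.g. (illustrative content, not from an actual YouTube transcript):

```
1
00:00:01,000 --> 00:00:04,200
If every cat video in the world gets ASR,

2
00:00:04,200 --> 00:00:07,000
why can't all my clips?
```

So the "tie text to time" part is nothing exotic; it's a few lines of timestamped plain text per cue.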
-
Oliver Peters
December 30, 2015 at 1:42 am
[Brett Sherman] “But if Oliver is suggesting it would be against a patent to tie specific speech via text to a specific time, then all closed captioning would run afoul of the same patent”
The patent I was referring to was Nexidia’s, which relates to Avid and Boris. I’m sure Adobe has a licensing deal with Autonomy for their technology. In the case of Nexidia the patent relates to waveform analysis based on sounds and the image they create in a waveform pattern. Compare the waveform against a known library and derive a link between the audio and existing text. That’s the opposite of what you are asking for, of course. It’s more detailed than that, but that’s the short explanation.
I have no idea what YouTube uses. Maybe the same thing live broadcasters use, which is hardware-based. YouTube also uses music analysis to find music licensing infringement. Out of curiosity, how long does it take YouTube to generate captioning after upload?
– Oliver
Oliver Peters Post Production Services, LLC
Orlando, FL
http://www.oliverpeters.com -
Oliver Peters
December 30, 2015 at 1:55 am
FWIW – YouTube actually talks about manually adding your own subtitles.
https://www.youtube.com/watch?v=LCZ-cxfxzvk
For automatic subtitle creation, they recommend that you edit these after the fact for accuracy. Sounds a lot like what Adobe was doing with Autonomy.
– Oliver
Oliver Peters Post Production Services, LLC
Orlando, FL
http://www.oliverpeters.com
-
Brett Sherman
December 30, 2015 at 2:31 am
Unbeknownst to most users, YouTube is actually analyzing the audio and creating its own SRT file, separate and apart from whatever the user uploads or creates. This is not intended for captioning and can be accessed only with the YouTube API. At this point YouTube searches do not search these files. No doubt Google has intentions of mining this trove of textual data at some point.
I'm not sure how long it takes YouTube to generate their transcript. We hired a company to catalogue our YouTube video library this way. For that, they actually edit the automated transcript to get 100% accuracy. But it is all timecoded for them by YouTube.
But what we're thinking about is setting up a workflow to upload interviews to YouTube as private videos, grab the SRT file (unedited, and thus about 80% accurate), and use it for text searches that would at least point to how many minutes and seconds into the clip to look. It seems kind of crazy, as it would be much easier if such capabilities were just built into the video editing program.
As far as Siri requiring the cloud, I think that has more to do with the limited storage space and processing power of phones. Apple has speech-to-text built into the OS; I think it requires a couple of GB of data to download for that capability. Not sure how it compares to Siri exactly, but it is reasonably capable.
-
Joe Marler
December 30, 2015 at 3:55 pm
[Brett Sherman] “…would really save me a ton of time editing, it’s obvious. Automated Speech Recognition. It would quite frankly be revolutionary for my editing process. The thing is, where is it? The technology is there….Yes I know it won’t be even 90% reliable at this point.”
I agree. The technology is apparently available for non-trained, speaker-independent automated transcription at a sufficient quality level for various purposes. My Comcast phone service has automated transcription of voice mail, and it works fairly well. https://www.comcast.com/readablevoicemail. The underlying patent for this technology is apparently stated here: https://www.google.com/patents/US6775651
Nuance also has voicemail-to-text; I don’t know if the underlying technology is shared with Comcast or not: https://www.nuance.com/for-business/mobile-solutions/voice-to-text-services/index.htm
Voicemail to text is speaker-independent automated transcription. It can obviously be done on longer input files.
At one time the web service https://www.voicebase.com/ did free automated transcription but it’s no longer free.
The previous Premiere Speech Analysis yielded generally poor accuracy for pure transcription. It was mainly designed to sync a written script with narrated audio, not stand-alone speech to text.
I tested Dragon Dictate 4 for Mac, which can transcribe pre-recorded audio files. It had better accuracy than VoiceBase (back when VoiceBase was free) but is designed for a single speaker and requires a brief training per voice. It also does not do auto-punctuation.
It would seem there’s a significant market for a stand-alone speaker-independent transcription product which works with pre-recorded audio and does not require training. For documentaries the conversion accuracy could be quite modest, yet still be useful. The fact that Comcast and Nuance already use this shows it is technically feasible.
I don’t know what the impediment is to implementing this more widely on personal computer platforms.
-
Mark Suszko
December 30, 2015 at 5:54 pm
[Brett Sherman] “But what we’re thinking about is setting up a workflow to upload interviews to Youtube as a private video. Grab their SRT file (unedited and thus about 80% accurate) and use it for text searches that would at least point to how many minutes and seconds into the clip to find it. Seems kind of crazy as it would be much easier if such capabilities were just built into the video editing program.”
I've actually been doing exactly this the past couple of weeks. There are freeware apps that will translate the .SRT files from YouTube into the other popular captioning formats. I'm still having trouble at the final steps of generating 601 closed captions in my timeline from these, but I continue to experiment. YouTube is pretty fast at the transcription, and I find editing their transcripts right on their site is pretty simple. You can get better, faster transcripts on the fly if you use a headset and Dragon to listen to the program audio and “re-speak” everything in one clear voice to the machine.
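The simplest of those format translations is SRT to WebVTT, since WebVTT is essentially SRT with a header and period-separated milliseconds. A minimal sketch of that conversion (my own illustration; real caption converters handle styling and edge cases this ignores):

```python
import re

def srt_to_vtt(srt_text):
    """Convert SRT caption text to basic WebVTT: prepend the WEBVTT header
    and change the millisecond separator in timestamps from comma to period."""
    body = re.sub(r"(\d{2}:\d{2}:\d{2}),(\d{3})", r"\1.\2", srt_text.strip())
    return "WEBVTT\n\n" + body
```

Going the other way (to broadcast formats like CEA-608/708 for a 601 timeline) is much hairier, which is presumably why dedicated tools exist for it.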
Transcripts for clips makes them keyword-searchable, not just in the edit, but later, once in distribution on the net. So it’s not just for the deaf, but for everybody.
A lot of institutional users that get federal funding, schools, state agencies, hospitals, etc., are required to caption a lot of their programming, so an NLE that makes this easy is going to lock in a lot of users over the long haul.
Auto-transcription and one-step broadcast-legal captioning: it's my number one request to Apple for improving FCPX.
-
Joseph W. Bourke
December 30, 2015 at 6:04 pm
Interestingly enough, I have a Windows phone, and the voice transcription on it is probably in the 95 to 100 percent range. Of course, this is me in front of the phone, speaking directly into the mic, but I can rattle off a text message or email with very little correction necessary.
There’s also voice transcription available in MS Word – I don’t know how you’d utilize it in your workflow, but it’s there.
Joe Bourke
Owner/Creative Director
Bourke Media
http://www.bourkemedia.com -
Oliver Peters
December 30, 2015 at 6:14 pm
[Joseph W. Bourke] “Of course, this is me in front of the phone, speaking directly into the mic, but I can rattle off a text message or email with very little correction necessary. “
That's the thing with any of this: there's a certain amount of “training the device” required, which calibrates it to your pronunciation. That's why the accuracy is better than with random audio from a raw field tape.
Oliver
Oliver Peters Post Production Services, LLC
Orlando, FL
http://www.oliverpeters.com