-
Where is an editing program with Automated Speech Recognition?
Andrew Kimery replied 8 years, 10 months ago 12 Members · 28 Replies
-
Michael Phillips
December 30, 2015 at 7:26 pm
As Oliver points out, speech to text relies on some amount of training (although that's getting better), as well as on the quality and speed of the speech being dictated. If there are a lot of accents, environment noise, and such, accuracy goes down quite a bit. There are two ways to go about this that have been mentioned:
1. Nuance (Dragon, Siri, Cortana, etc.). These are dictionary-based systems: they can only be as good as the dictionary that powers them, and dictionaries are typically not that up to date with people's names and places. Combine that with training for voice, accent, and microphone type, plus the effect of the environment, not to mention any emotion or volume changes you might find in your footage. Your mileage may vary. The BBC and other services use Dragon-type applications, trained to a single voice, by having that person listen to the playback via headphones while re-dictating into Nuance. This removes a lot of the issues stated and lets humans interact with the output.
2. Phonetic-based solutions like Nexidia. This is not a dictionary-based system; it is based on phonemes. Every language is made up of some number of phonemes, on average around 30 to 36 depending on the language (I'm unsure about Asian languages that also rely on tonality). The phonemes can be indexed at up to a dozen times faster than real time per processor core, producing an index file that is slightly under 5MB for every hour of audio indexed. That file also includes other metadata such as the timing offset into the file, the file name, etc. This also allows the search to be lightning fast: I have done searches on libraries containing tens of thousands of hours with results in less than 2 seconds. The search is done by entering a text string, which in turn gets transformed to its phoneme representation and then compared against all the indexed files for the results. But there is no correlation of phonemes to dictionary-based solutions, which is why Nexidia has not developed a speech-to-text solution. Phoneme-based solutions provide a whole other benefit, different from speech to text.
3. Sync to text. This is an extension of the Nexidia technology, combined with a text file containing the dialogue. In this case, not only is the audio essence indexed into PAT files, but the text string is also "phoneme'd" as above, and the results are stored with the text file for later offset indexing into the media file. This same concept is what allows Nexidia to offer a closed-captioning and subtitle sync check for QC, as it also does language identification, sync, completeness, etc.
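The "text query → phonemes → index scan" flow described above can be sketched very roughly in a few lines of Python. This is a toy illustration, not Nexidia's actual algorithm: the phoneme dictionary, file names, and index layout here are all invented for the example (a real phonetic indexer derives phonemes acoustically rather than via word lookup).

```python
# Toy sketch of phoneme-based search. The phoneme spellings (ARPAbet-style),
# media file names, and offsets below are made up for illustration.
PHONEMES = {
    "speech": "S P IY CH",
    "search": "S ER CH",
    "editing": "EH D IH T IH NG",
}

# A hypothetical index: (media file, offset in seconds, phonemes heard there).
index = [
    ("interview_01.wav", 12.4, "S P IY CH"),
    ("interview_01.wav", 97.0, "EH D IH T IH NG"),
    ("broll_02.wav", 3.1, "S ER CH"),
]

def find(query):
    """Turn a text query into its phoneme string, then match it
    against every indexed entry and return (file, offset) hits."""
    target = PHONEMES.get(query.lower())
    return [(f, t) for f, t, p in index if p == target]

print(find("speech"))  # → [('interview_01.wav', 12.4)]
```

Because the comparison happens on a small precomputed index rather than on the audio itself, scanning even huge libraries stays fast, which matches the sub-2-second searches described above.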
Adobe’s technology was licensed from Autonomy. Overall the results were not that great because of the above issues, and it probably came down to the cost of the license relative to the actual success of the feature. In later versions it was recommended that a transcript be provided to help with the “sync” process, but that sort of defeated the whole purpose of speech to text to begin with.
As far as patents go, there are plenty. Nexidia has several surrounding their technology, Autonomy has theirs, and Avid has some surrounding the use of speech technology as it relates to scripts and transcripts as part of an editorial process (one of which I created and now wish I owned…). The original Script Based Editing patent, which covered the manual process of syncing media to scripts, was owned by Ediflex and traded to Avid for systems when they went out of business. That patent has since expired, meaning any NLE could offer a script-based editing interface. The ScriptSync side of the patents will expire in about 25 months.
I believe there is a whole lot more that can be done with script interfaces, and that ScriptSync only scratched the surface of how content creators and editors can engage with their footage throughout the production process. As with all businesses, it comes down to whether it's worth the ROI and whether users will pay for it.
Michael
-
Michael Phillips
December 30, 2015 at 11:38 pm
And then again, breakthroughs are being made by a new company, Voxil:
https://www.theonion.com/article/new-speech-recognition-software-factors-in-users-m-38257
😉
Michael
-
Joseph W. Bourke
December 31, 2015 at 5:36 pm
And I believe the Voxil software comes with a sneeze-and-food-chunk ballistic screen protector, which keeps pieces of bologna, chapati, rice, or french fries (depending on your location/diet) from damaging the screen.
Joe Bourke
Owner/Creative Director
Bourke Media
http://www.bourkemedia.com
-
Brett Sherman
January 3, 2016 at 10:37 pm
[Michael Phillips] “1. Nuance (Dragon, Siri, Cortana, etc.) These are dictionary based systems in order to provide the text. They can only be as good as the dictionary that powers them and typically dictionaries are not that up to date with people names and places. Combine that with training for voice, accent, microphone type, and the effect of the environment, not to mention any emotion or volume changes you might find in your footage. Your mileage may vary. “
No doubt Dragon for Mac is better when trained, but you don't have to train it. It now has a transcription mode where you can give it an .aiff file, select a profile, and have it transcribe away. I just have profiles I call “generic female” and “generic male”. Of course it will never replace human transcription for 100% accuracy, but that's not what I'm after. A 90%-accurate transcript can easily help me find the needle in a haystack that I'm looking for in an interview: I can find interviews where they talk about a particular subject area, then zero in on the part of the interview where they talk about it.
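That "needle in a haystack" workflow is essentially a keyword search over a timestamped rough transcript. A minimal sketch, with invented transcript segments and timestamps purely for illustration:

```python
# Even a ~90%-accurate machine transcript lets you jump to where a topic
# comes up. Each entry pairs a start time (seconds) with rough text.
transcript = [
    (0.0,   "thanks for sitting down with us today"),
    (42.5,  "the budget for the documentary was very tight"),
    (118.0, "we shot the interviews over two weekends"),
]

def locate(keyword, segments):
    """Return the timestamps of all segments whose text contains the
    keyword, ignoring case. Recognition errors elsewhere don't matter
    as long as the keyword itself came through."""
    kw = keyword.lower()
    return [t for t, text in segments if kw in text.lower()]

print(locate("budget", transcript))  # → [42.5]
```

You'd then cue the source clip to the returned timestamp; a transcription error three sentences away never gets in the way of finding the moment you need.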
-
Rich Rubasch
January 4, 2016 at 7:23 pm
One of the cable companies added voice recognition to its remote control. I think it is called Xfinity, and the TV spot shows people holding the remote up to their mouths saying things like “record” or “go to channel…”
Coming soon?
Rich Rubasch
Tilt Media Inc.
Video Production, Post, Studio Sound Stage
Founder/President/Editor/Designer/Animator
https://www.tiltmedia.com
-
Michael Phillips
January 4, 2016 at 7:37 pm
I forgot to add that UI commands for certain functions are also more accurate, as they come from a library of predictive, expected terms: record, play, change, etc. It also works well for show names, actors, and directors, as those proper names are also in the library.
I'll have to give Nuance another round of tests with its AIFF-file-based speech to text on my test media.
Speech to text is getting better; there are just different expectations for different applications.
Michael
-
Robin S. Kurz
January 7, 2016 at 6:57 pm
[Brett Sherman] “The thing is, where is it? The technology is there. Siri… “
Ironically, and I think I mentioned it here before, I have in fact used Siri (i.e. OS X's built-in dictation) to transcribe clips. And yes, it's something I've been calling for ever since, as a “unified feature” for FCP, since even though it had its caveats then, and even more now, it basically worked really well as far as the recognition went. You just needed to be willing to work around the PITA parts. 😀
I would open a clip with QuickTime Player and play it (with the needed pre-roll). While it played, I'd jump into FCP with the cursor already in the “Notes” field of a previously marked favorite (the segment I wanted to transcribe), and, with an external microphone, simply hit the shortcut for dictation, hold the mic up to the speaker, and away I went. It actually worked amazingly well in terms of recognition, and I had the favorites “tagged” with what was in fact said. So in principle it worked! After that I could search words and have it filter out *just that* section/favorite.
This was when dictation was first introduced (10.8?). So the primary caveat (the PITA part) was that it wouldn't take more than 30 seconds at a time and had to be sent to the Siri cloud and back. Plus the hectic switching back and forth to not miss the mark.
Since then, two major things have happened. 1. Dictation has been moved locally (if you download the needed files) and has become near-realtime, which is very cool… but… 2. for whatever reason, the speakers are now MUTED the moment you hit the dictation shortcut! Whaa? So the whole idea pretty much imploded with that, aside from the fact that major finger acrobatics were involved (switching back and forth and hitting everything at the right moment) to even get it to work halfway with the first version. So I never really got past an experimentation phase, i.e. “proof of concept”.
So yes, ever since I have thought pretty much the exact same thing! The software is there… why couldn’t Apple simply open a “channel” from FCP to the OS’s own dictation?? It’s all there and it all works superbly by itself… why not simply connect the dots?
We’ll see if it ever comes to fruition.
– RK
____________________________________________________
German? There's a comprehensive FCP X training for you here!
-
Michael Phillips
January 7, 2016 at 7:40 pm
How do you download the components needed to make it local? Would love to run these tests again without the 30-second limit.
I tried the OS X speech to text via a stream playing from another computer, and it had the same 30-second limit. For a test I used the recently released Conan video with Kevin Hart and Ice-T, streaming from YouTube through a nice-quality speaker aimed at the MacBook Pro microphone:
https://www.youtube.com/watch?v=1Za8BtLgKv8
So for the first 30 seconds I got via OS X:
One of my staff members Diana Chang hey Diana hey you getting your drivers license I thought help my staff members I take you out for a long time I got my lessons when I was 16 and I was like 5050 something with you are right my staff members I take you ou
And my manual transcription got:
Hey everybody, meet one of my staff members, Dianna Chang. Hey Dianna. Hey. You’re getting your driver’s license. Hmm hmm. I thought, cause I like to help my staff members, I’d take you out, give you some of my pointers ‘cause I’ve been driving for a long time. I got my license when I was sixteen, and that was … well.. How old you think I am? Like fifty… fifty something? [laughter] Which you are, right?
Now of course transcribing it by hand allows for additional notes like audience laughter, and notation of hesitation, which in itself is interesting to know as it hints at delivery or performance. Also, human transcription can further differentiate by speaker, which is useful. I did not do that here, to keep the results comparable (other than punctuation and such).
Word count shows 54 matched out of the original 76, which from a word-for-word view is 71%. Comprehension or context is up to the user, who may be familiar with the footage already; the results become more keyword-like, but do have value. What you don't get is the additional sync to source that ScriptSync, or even Adobe's text feature, allowed once a transcript was available.
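A rough word-for-word figure like the 71% above can be computed as a bag-of-words overlap between the machine output and the human transcript. This sketch (my own, far cruder than a proper word-error-rate calculation, since it ignores word order) shows the idea on a short made-up pair of strings:

```python
from collections import Counter

def rough_word_accuracy(machine, human):
    """Fraction of the human transcript's words that also appear in
    the machine output, counting duplicates. Ignores word order, so
    it's only a rough keyword-level score, not true WER."""
    m = Counter(machine.lower().split())
    h = Counter(human.lower().split())
    matched = sum(min(m[w], h[w]) for w in h)
    return matched / sum(h.values())

# 5 of the 6 reference words are recovered (one "the" is missing).
acc = rough_word_accuracy("the cat sat on mat", "the cat sat on the mat")
print(round(acc, 2))  # → 0.83
```

Run against the two transcripts above, a score in this style lands near the 71% quoted; the orderless counting is exactly why the result is useful as keywords but not as a readable transcript.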
The other interesting thing to note is consistency. I went back and ran the OS X version again with the same set-up and got the following result:
But everybody needs one of my staff members sign you’re getting your drivers license I thought going to help my stock numbers I take you out on time I got my was 16 and I was 12 I will be 50 something
Michael
-
Robin S. kurz
January 7, 2016 at 8:41 pm
[Michael Phillips] “How do you download the components needed to make it local? Would love to run these tests again without the 30 second limit. “
Simply go to the dictation system setting and check “Use Enhanced Dictation”.
And by no means am I saying that it is some sort of replacement for human transcription. Testing on a video like that (which I bizarrely had just finished watching about two minutes before I saw your post… what are the odds) may not be the perfect choice. Something along the lines of an interview or a dialog scene would probably yield better results, as in my tests. But then I personally don't expect 100% accuracy (yet) either, as long as certain keywords are recognized, in which case around 75% would be great. Either way it would be better than nothing. And if it's e.g. a note on a favorite (as described), small edits/corrections aren't a big deal if needed.
As always, YMMV.
– RK
____________________________________________________
German? There's a comprehensive FCP X training for you here!
-
Robin S. kurz
January 8, 2016 at 5:50 pm
And by the way, as Philip recently pointed out, think of the possibilities if Apple in fact connected all the dots. Altogether they have a truly impressive collection of metadata-deriving technologies. Combined, they could have speech translated to text, keyword ranges extracted, people detected (which of course already exists), people identified and named, emotions detected (with their most recent acquisition), and the content of b-roll labelled. All with just a simple import. And probably even more.
It’s not automated editing, but it would be a brilliant start on organization. Add some basic string-out-building algorithms and there’s a starting point for e.g. non-scripted shows, with EVEN LESS organizational time required than there already is with X.
Of all the NLE makers, only Apple seems to be in that unique position. We’ll see if they carry through on any part of it any time soon.
– RK
____________________________________________________
German? There's a comprehensive FCP X training for you here!