Speech analysis concept
-
Oliver Peters
July 28, 2014 at 1:22 am
I think they started charging with version 5. First for each separately, and then later as a bundle.
Oliver
Oliver Peters Post Production Services, LLC
Orlando, FL
http://www.oliverpeters.com
-
Timothy Auld
July 28, 2014 at 1:31 am
That sounds right. I still have a machine running version 3 that I use all the time, and it has the feature; I know it wasn’t extra then. One of the more bizarre and inexplicable moves Avid ever made was to begin charging thousands of dollars for a function that had always been included before. I don’t know what the legal ramifications are, but if going forward Avid will not have this functionality, then the only thing that will set them apart is having markers that stay where you put them, no matter what. (And that is not a small thing to me.)
Tim
-
Oliver Peters
July 28, 2014 at 1:51 am
You’ll remember that when Avid included these features, the retail price for Media Composer was higher, so Avid was absorbing the licensing fees. When Avid had to drop the price, they decided to make these functions paid options to cover the expense.
Oliver
Oliver Peters Post Production Services, LLC
Orlando, FL
http://www.oliverpeters.com
-
Marcus Moore
July 28, 2014 at 1:53 am
Exactly. And unless I’m wrong, loads of Siri APIs are coming in iOS 8 and Yosemite, so perhaps even if Apple doesn’t incorporate it themselves, a 3rd party might be able to write a plugin or extension against the Siri API. I dunno, but it would be another big feather in X’s metadata cap.
-
Michael Phillips
July 28, 2014 at 2:20 am
One of the challenges facing a lot of these technologies is the quality of the recording and the quality of the speech. Volume, emotion, accents, whispering, etc. all affect the success of a speech to text engine, not to mention that the words have to be in a dictionary. Dictionaries do not represent entire languages; names and the like are usually missing.
You’ll find that, on average, speech that is perfectly recorded and delivered at a steady pace comes out at about 80% accuracy. And you’ll be amazed at how much you miss that other 20% when you need the results to be truly useful. Premiere Pro’s speech to text is a good example of that.
This is the state of speech to text today. Who knows where it will be over the next 5-10 years, but many companies have been working on this for the past 50 years.
Speech to text and phonetic-based technologies such as the Nexidia solution are very different technologies, each with its own distinct advantages and disadvantages – just don’t expect the same functionality from both of them for all things.
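To make that 80% figure concrete, here is a minimal sketch – assuming the Python SpeechRecognition package, a hypothetical well-recorded clip.wav, and a placeholder reference transcript – that runs a clip through a speech to text engine and measures the word error rate, i.e. the 20% you end up missing:

```python
# Minimal sketch: run a clip through a speech to text engine and measure
# the word error rate (WER). Assumes the Python "SpeechRecognition" package;
# "clip.wav" and the reference transcript below are hypothetical placeholders.
import speech_recognition as sr

def wer(reference, hypothesis):
    """Word error rate via word-level Levenshtein (edit) distance."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # word deleted
                           dp[i][j - 1] + 1,         # word inserted
                           dp[i - 1][j - 1] + cost)  # word substituted
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

recognizer = sr.Recognizer()
with sr.AudioFile("clip.wav") as source:             # hypothetical, cleanly recorded clip
    audio = recognizer.record(source)

hypothesis = recognizer.recognize_google(audio)       # free web recognizer; any engine works here
reference = "what the speaker actually said, word for word"  # placeholder reference transcript

print("Transcript:", hypothesis)
print("Word error rate: {:.0%}".format(wer(reference, hypothesis)))
```

Even at 80% word accuracy, one word in five is wrong or missing, which is exactly where searching footage by transcript starts to fall apart.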
Michael
-
Jeremy Garchow
July 28, 2014 at 3:27 am
https://m.youtube.com/watch?v=Av0ysFyXAlY
-
Timothy Auld
July 28, 2014 at 10:48 am
[Michael Phillips] “just don’t expect the same functionality from both of them for all things.”
I can expect it; I’m just not going to get it. But even in its present state the technology is pretty amazing. And useful. Like you, I am very interested in what the state will be in 5-10 years.
Tim
-
Richard Herd
July 28, 2014 at 4:36 pm
[Michael Phillips] “This is the state of speech to text today. Who knows where it will be over the next 5-10 years, but many companies have been working on this for the past 50 years.”
The nature of allophonic representation is such that computers just can’t do it, ever; they cannot compute a meaning. They are not good at inductive reasoning. Humans are great at inductive reasoning.
If speech recognition were solved via computation, then video editing would also be solved via computation. A bit of reflection shows why: phonemes represented as allophones occur in sequences, just like images.
Speech to text is really training the computer to map the allophonic representation of a single user (because we all have different shaped mouths and tongues) onto the prescribed phonology that corresponds to the prescribed spelling.
Here is a better example: https://en.wikipedia.org/wiki/Ghoti
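As a rough sketch of what that sound-to-spelling mapping problem looks like – the grapheme-to-phoneme pairs below are hand-picked from the classic “ghoti” argument, not pulled from a real pronunciation lexicon – the same handful of sounds can be spelled dozens of ways:

```python
# Rough sketch of the "ghoti" point: each English phoneme has many possible
# spellings, so going from sound back to prescribed spelling is one-to-many.
# These correspondences are hand-picked from the classic example, not a real lexicon.
from itertools import product

spellings_of_phoneme = {
    "f":  ["f", "ff", "ph", "gh"],   # fish, off, photo, enough
    "ih": ["i", "o", "y"],           # fish, women, gym
    "sh": ["sh", "ti", "ci", "ch"],  # fish, nation, special, machine
}

# Every spelling of the phoneme sequence /f ih sh/ that these correspondences allow:
candidates = {
    "".join(combo)
    for combo in product(*(spellings_of_phoneme[p] for p in ("f", "ih", "sh")))
}

print(len(candidates), "candidate spellings for the sounds of 'fish'")
print("'fish'  is one of them:", "fish" in candidates)
print("'ghoti' is one of them:", "ghoti" in candidates)
```

That one-to-many mapping is why these engines lean so heavily on a pronunciation dictionary and surrounding context, rather than on the audio alone.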
-
Neil Sadwelkar
July 28, 2014 at 5:42 pm
And all of you are ‘speaking of’ only English as it’s spoken in the US. Then there’s English as spoken in Scotland, Ireland, or Australia, or heavily accented English spoken by non-native English speakers.
And just after they finish figuring out English, there’s the rest of the world, with its myriad languages. In a documentary shot in India, for instance, it’s not uncommon to have 3-6 Indian languages, with English words sprinkled into sentences of another language. It would take a supercomputer just to identify which language is being spoken. I think we are a long way off from this speech-text interactivity in editing. At least on a global scale.
———————————–
Neil Sadwelkar
neilsadwelkar.blogspot.com
twitter: fcpguru
FCP Editor, Edit systems consultant
Mumbai India