[Resolved] Performance on Mobile Device


    #9613
    mert
    Participant

    Recently I have been working on Pocketsphinx and OpenEars. I have successfully created an acoustic model for Turkish and tested it on a Linux machine, where it turns out to work fine for small dictionaries. However, the performance on the mobile device (even in the simulator) is not even close to the performance on the PC. Therefore, I would like to take this chance to discuss what could be going wrong on the mobile device or in the iOS environment.

    The first thing that comes to my mind is whether the format of the recorded speech data in Objective-C affects the result. I think this is done continuously in cont_ad_read. Or perhaps Objective-C has nothing to do with the speech data, since ContinuousADModule.mm seems to be written by CMU.

    #9615
    Halle Winkler
    Politepix

    Hmm, ARPA mobile performance on my end is usually fast. My test language model of approximately 400 words returns responses for long sentence dictation on an iPhone 4 in about 1-4 seconds, which is fast considering how constrained the CPU and RAM are on the device versus any desktop.

    All the audio and processing code is in C whether written by me or as part of one of the dependencies. I’m pretty sure after a year and a half that there is nothing “wrong” per se with the framework performance, but it could be that you are encountering an issue with your application.

    I would say that I’d need some timing examples with information about the following items:

    ARPA or JSGF?
    Is the acoustic model 8-bit or 16-bit?
    What is the size of the language model or grammar?
    How far away from the phone are you when speaking? (this has a huge effect for dictation applications)
    What are you trying to recognize? Words, sentences, phrases?
    Please show some representative time results along with the device they are taken from (not the simulator). Feel free to show logs that have both OPENEARSLOGGING and VERBOSEPOCKETSPHINX turned on, so it is possible to see the exact processing time from the moment complete speech has been detected until the hypothesis is returned. I remember that the last time there was a similar question about ARPA, it turned out that something in the app was blocking recognition (although JSGF performance is unfortunately poor on the device, so if it is actually a JSGF grammar, that will be slow).
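
    For reference, turning both on only takes a couple of calls. Here is a minimal sketch assuming the usual setup with a pocketsphinxController property; check the current docs for the exact header and property names if anything differs in your version:

        #import <OpenEars/OpenEarsLogging.h>
        #import <OpenEars/PocketsphinxController.h>

        // Somewhere early, e.g. in viewDidLoad, before listening starts:
        - (void)viewDidLoad {
            [super viewDidLoad];
            [OpenEarsLogging startOpenEarsLogging];                 // prints the OPENEARSLOGGING lines
            self.pocketsphinxController.verbosePocketSphinx = TRUE; // prints the VERBOSEPOCKETSPHINX lines
        }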

    My other question is whether you have tested similar-sized models in English on a device, so you can rule out that it is connected to your acoustic or language models.

    I don’t really think device vs. desktop is a fruitful discussion, since it is the expected result to see really dramatic performance differences there (or all the other implementations of speech recognition for the iPhone would not be server-based). But I’d be interested to know what kind of device performance you are seeing and also how it compares to similar-sized models in English, in case there is something specific to the Turkish model or implementation that is slowing things down.

    #9616
    Halle Winkler
    Politepix

    Can I ask you for some clarification about this question?

    The first thing that comes to my mind is whether the format of the recorded speech data in Objective-C affects the result.

    Are you asking about the format of your acoustic model recordings or the input format on the iPhone? The input format of speech is always PCM, and it is only ever stored in memory in a ring buffer, so there is no format question there, nor any reading/writing to or from disk; it's just the raw samples streamed from the mic being read out of a C buffer. It's about as low-latency as it can be without requiring so many callbacks per second that it would hurt performance.
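
    If it helps to picture it, the read path is conceptually something like the sketch below. This is purely illustrative, not the actual framework code, and all of the names and sizes are made up:

        #include <stddef.h>
        #include <stdint.h>

        #define RING_SIZE 32768  /* hypothetical capacity in samples */

        typedef struct {
            int16_t samples[RING_SIZE]; /* raw 16-bit PCM straight from the mic */
            size_t writePos;
            size_t readPos;
        } RingBuffer;

        /* Called from the audio callback with each new buffer of mic samples. */
        static void ring_write(RingBuffer *rb, const int16_t *in, size_t count) {
            for (size_t i = 0; i < count; i++) {
                rb->samples[rb->writePos % RING_SIZE] = in[i];
                rb->writePos++;
            }
        }

        /* Called from the recognition loop; copies out whatever has arrived so far.
           There is no format conversion and no disk I/O anywhere on this path. */
        static size_t ring_read(RingBuffer *rb, int16_t *out, size_t maxCount) {
            size_t available = rb->writePos - rb->readPos;
            size_t count = available < maxCount ? available : maxCount;
            for (size_t i = 0; i < count; i++) {
                out[i] = rb->samples[rb->readPos % RING_SIZE];
                rb->readPos++;
            }
            return count;
        }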

    #9618
    mert
    Participant

    ARPA or JSGF?
    ARPA. I have checked the file content; it is supposed to be alphabetically sorted. However, due to the special characters in Turkish (ç, ğ, ş, etc.), it is not actually sorted in Turkish order. This may have some effect, but the ARPA model I use on the Linux machine and in the iPhone project is the same.

    Is the acoustic model 8-bit or 16-bit?
    The acoustic model is created from WAV files that are 8 kHz, 16-bit, mono.

    What is the size of the language model or grammar?
    In the acoustic model, the number of words in the dictionary is 400+. To get good accuracy, I created a language model and dictionary with 7 words, which are the days of the week.

    How far away from the phone are you when speaking? (this has a huge effect for dictation applications)
    I speak directly into the mic.

    What are you trying to recognize? Words, sentences, phrases?
    As I mentioned before, it is just the days of the week, 7 words.

    Can I ask you for some clarification about this question?
    I thought that maybe the format in which iOS takes speech and feeds it to Pocketsphinx has an effect on performance.

    Here is a sample log: http://ug.bcc.bilkent.edu.tr/~m_kalender/Turkish_log.txt

    I see your point, but the big performance difference still makes me look for something in iOS to find the reason.

    #9621
    Halle Winkler
    Politepix

    The big performance difference is 100% to be expected; it's the wrong expectation to see anything comparable. Practically speaking, you're comparing the machine on your desktop with something closer to a desktop from the 1990s, so there's no discovery to make there other than that there is much less horsepower and RAM available for your iOS task.

    I will check out your logs. The alphabetization issue could be important because it is part of the specification for the dictionary and the ARPA model that they are alphabetized. It is something that could be a speed issue on the device, because the desktop performance will be basically realtime, which means that you won't notice anything that negatively impacts performance there. What could be happening is that the search method is supposed to short-circuit once it goes beyond the alphabet range, but it can't, because there is no predictable alphabet range for your model, so the whole thing has to be searched for every search path. Again, you might not see any issue with this on the desktop, because it is all happening so quickly that de-optimizations don't show up in real usage.
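
    To make the short-circuit idea concrete, here is a rough sketch of the principle; it is not the actual Pocketsphinx lookup code, just an illustration of why the expected ordering matters:

        #include <string.h>

        /* With an alphabetized list, the scan can stop as soon as it passes the
           point where the word would have to be. */
        static int find_word_sorted(const char *word, const char *dict[], int n) {
            for (int i = 0; i < n; i++) {
                int cmp = strcmp(dict[i], word);
                if (cmp == 0) return i;  /* found it */
                if (cmp > 0) return -1;  /* past where it would be: stop early */
            }
            return -1;
        }

        /* With an unpredictably ordered list, every entry has to be checked on
           every lookup; no early exit is possible. */
        static int find_word_unsorted(const char *word, const char *dict[], int n) {
            for (int i = 0; i < n; i++) {
                if (strcmp(dict[i], word) == 0) return i;
            }
            return -1;
        }

    If your files ended up in Turkish collation order rather than plain byte order, that kind of early exit can fire too soon or never, so it might be worth re-sorting the dictionary byte-wise (for example with LC_ALL=C sort) and testing again.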

    I again suggest that you check out a similar English model on the device so that you can see what an average speed response is. For the days of the week in English, I would expect a near-realtime response from OpenEars after the obligatory silence detection period.

    #9622
    Halle Winkler
    Politepix

    OK, so here is the first recognition that I looked at in your log:

    2012-05-07 21:14:14.005 2Sign-test[3860:13c07] OPENEARSLOGGING: Processing speech, please wait…
    2012-05-07 21:14:14.037 2Sign-test[3860:13c07] OPENEARSLOGGING: Pocketsphinx heard “SALI SALI” with a score of (-12604) and an utterance ID of 000000001.

    This is 32 milliseconds. Is it possible that you are looking at the wrong metrics for processing time?

    By the way, I would really appreciate your troubleshooting this with me on a device instead of the simulator, since it is not the same audio driver and it’s not useful information about how the library performs in real use. What devices have you used the library with when you’ve seen the performance issue you’ve described?

    #9623
    Halle Winkler
    Politepix

    Hang on, when you say simulator performance, are you talking about performance or word accuracy rates? You will see terrible accuracy if you only test on the simulator, there are warnings about it everywhere.

    #9628
    mert
    Participant

    My mistake, I must have missed that point. I have been talking about simulator performance. (I applied to the iOS Developer University Program but was rejected, and I am now looking for an account or will enroll.)

    #9629
    Halle Winkler
    Politepix

    OK, just for future reference there is a FAQ here where some similar questions are answered: https://www.politepix.com/openears/support

    The intention of the simulator driver is basically to let you debug everything else in your app in the simulator without Pocketsphinx breaking; that is pretty much the full extent of its ambitions :).

    The device driver is tuned for Pocketsphinx and the iOS audio unit so that it is fast and reliable under high load with a small memory footprint (and so that it won't decrease the lifetime of the device's flash storage by reading and writing continuously while Pocketsphinx is running). It can't be translated into a simulator driver that works identically on all the different desktop machines that might host it without that becoming its own project, and I'd rather put the time into the device code. I also suspect that if the simulator driver were really good, some developers might test only on it and be really surprised by the real-world performance of their app.
