Well, the biggest issue you’re going to encounter is the one I linked to earlier — the letters A, B, and C all rhyme or nearly rhyme, each has only one or two phonemes, and there’s no surrounding context, so most speech recognition engines will get the utterance wrong much of the time.
OpenEars does continuous recognition, so it isn’t really designed for an approach like pressing a button to start recognition and pressing another to stop it. That’s also a tricky user experience in general: selecting the answer by touch takes one tap, while the interaction you’ve described takes two taps plus speaking. I think this workaround stems from the underlying difficulty of recognizing letters, so I’d recommend against trying to recognize alphabet letters with speech recognition at all, since it generally leads to difficult UX.
To answer your question: with stock OpenEars you can’t limit recognition in that way, because it has to hear a half-second of silence before it processes the speech, so it will attempt to recognize any speech that slips in before that half-second of silence has been detected. The half-second is how it knows it isn’t interrupting the user in the middle of an utterance.
You can try Rejecto to reject speech that isn’t in your vocabulary, if the primary issue is extraneous sounds provoking recognition, or you could try RapidEars, which can return a recognition hypothesis immediately instead of waiting a half-second (you would then stop your listening behavior as soon as the answer has been perceived). Alternatively, you can just isolate the first word of the whole hypothesis and respond only to that, so in your example “B A B A C B” you’d ignore everything after the first B.
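To sketch that last idea: since an OpenEars hypothesis arrives as a single space-separated string, taking the first word is just string splitting. This is a minimal Swift sketch; the helper name `firstAnswer(in:)` is mine, not part of the OpenEars API — you would call something like it from the hypothesis delegate callback with the hypothesis string OpenEars hands you.

```swift
// Sketch: respond only to the first word of a hypothesis string.
// `hypothesis` is assumed to be the space-separated string that the
// OpenEars hypothesis callback delivers, e.g. "B A B A C B".
func firstAnswer(in hypothesis: String) -> String? {
    return hypothesis
        .split(separator: " ")  // split on the spaces between words
        .first                  // keep only the first word, if any
        .map(String.init)       // convert Substring back to String
}

// firstAnswer(in: "B A B A C B") yields "B"
```

Returning an optional keeps the empty-hypothesis case explicit, so a nil result can simply be ignored rather than treated as an answer.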
But it’s all going to work a lot better if you pick something as the target speech besides alphabet letters.