In general, voice-only error correction is a two-step process. In the first step, the user speaks a portion of the recognized text to select the target position to correct. Next, the user speaks the replacement text. Together, these two steps perform one correction. However, as McNair and Waibel suggest, the correction can instead be performed in a single step. In one-step correction, users speak only the replacement text, and the system recognizes it and automatically locates the error region to replace. Seamless error correction proceeds like one-step error correction, but without any explicit command to enter a correction mode. Our interface automatically infers the purpose of an utterance, that is, whether the intention is to type a new sentence or to correct a misrecognized one. The system then detects the error region and corrects it. To complement this understanding of user intention, our interface provides a confirmation process.
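The overall seamless-correction loop can be summarized as a small dispatch routine. The sketch below is only an illustration of the flow; `classify_intention`, `find_error_region`, and `confirm` are hypothetical placeholders standing in for the components described in the remainder of this section.

```python
# Minimal sketch of the seamless, command-free correction loop.
# The three helpers below are placeholders, not the actual components.

def classify_intention(document: str, utterance: str) -> str:
    """Stand-in for the intention classifier: 'correction' or 'non-correction'."""
    return "non-correction"  # placeholder decision

def find_error_region(document: str, utterance: str) -> tuple[int, int]:
    """Stand-in for the error-region detector: (start, end) span to replace."""
    return (0, 0)  # placeholder span

def confirm(candidate: str) -> bool:
    """Stand-in for the user confirmation step."""
    return True

def handle_utterance(document: str, utterance: str) -> str:
    """Replace a detected error region, or append the utterance as new text."""
    if classify_intention(document, utterance) == "correction":
        start, end = find_error_region(document, utterance)
        candidate = document[:start] + utterance + document[end:]
        return candidate if confirm(candidate) else document
    return (document + " " + utterance).strip()
```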
The figure shows the word-processing workflow using our interface. After the user utters a sentence to type or to correct, our system detects analysis regions for accurate intention understanding. In this process, the system finds the region of the previously typed text most similar to the current utterance by local alignment of the pronunciation sequences. Given the characteristics of ASR, even a misrecognized sentence has a pronunciation sequence similar to the sentence the user actually intended to type. Furthermore, when correcting, users tend to speak the correctly recognized words surrounding an error region even without explicit instruction. The better the performance of the ASR system, the more similar the pronunciation sequences are. After that, user intention understanding proceeds, that is, the utterance is classified as correction or non-correction. When the current utterance is intended as a correction, the detected error region is replaced automatically and confirmation proceeds. Otherwise, the current utterance is inserted at the end of the document.
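The analysis-region detection step can be sketched as a standard Smith-Waterman-style local alignment over the two pronunciation sequences. The sketch below assumes the recognizer's phoneme (or character) sequences are available as token lists; the scoring values (+2 match, -1 mismatch or gap) are illustrative choices of ours, not parameters of the system.

```python
def best_matching_region(typed, utterance, match=2, mismatch=-1, gap=-1):
    """Return the half-open (start, end) span of `typed` that aligns best
    locally with `utterance` (Smith-Waterman local alignment)."""
    n, m = len(typed), len(utterance)
    # score[i][j]: best local alignment score ending at typed[i-1], utterance[j-1]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_cell = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if typed[i - 1] == utterance[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_cell = score[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero to recover
    # where the aligned region starts in the typed sequence.
    i, j = best_cell
    end = i
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if typed[i - 1] == utterance[j - 1] else mismatch)
        if score[i][j] == diag:
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return i, end

# Example with characters as stand-ins for phoneme symbols:
typed = list("the weather is nice todya")   # recognizer output containing an error
spoken = list("nice today")                 # the user repeats the corrected phrase
start, end = best_matching_region(typed, spoken)
print("".join(typed[start:end]))            # region of the typed text most similar to the utterance
```

At the phoneme level, a misrecognized word still shares most of its symbols with the intended word, so the alignment tends to cover the error region together with its correctly recognized context.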
The key novel process in our interface is user intention understanding, which can be accomplished by observing clear speech. User utterances directed at an ASR system usually have the characteristics of clear speech, a speaking style adopted by a speaker to increase intelligibility for a listener. To make their speech more intelligible, users make on-line adjustments: typically, they speak slowly and loudly, and they articulate in a more exaggerated manner. Furthermore, utterances intended for correction display these characteristics more conspicuously than utterances for non-correction. We therefore approach the task as a classification problem: we collect data from users, label the data with intentions, and extract and refine features from those data.
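As an illustration only, the classification could be set up as follows. The prosodic features (speaking rate, mean energy, mean word duration), the toy training values, and the use of logistic regression are assumptions made for this sketch, not details of the interface described here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [speaking_rate (words/s), mean_energy (dB), mean_word_duration (s)]
# Toy values chosen to reflect clear-speech behaviour: corrections are slower,
# louder, and more drawn out than ordinary dictation.
X_train = np.array([
    [2.9, 62.0, 0.31],   # fluent dictation  -> non-correction
    [3.1, 60.5, 0.29],
    [1.7, 68.0, 0.52],   # slow, loud, exaggerated -> correction
    [1.5, 70.5, 0.58],
])
y_train = np.array([0, 0, 1, 1])  # 0 = non-correction, 1 = correction

clf = LogisticRegression().fit(X_train, y_train)

def classify_intention(speaking_rate, mean_energy, mean_duration):
    """Classify one utterance as 'correction' or 'non-correction'."""
    label = clf.predict([[speaking_rate, mean_energy, mean_duration]])[0]
    return "correction" if label == 1 else "non-correction"

print(classify_intention(1.6, 69.0, 0.55))  # likely "correction"
```

Any binary classifier over such utterance-level features would fill this role; the essential design choice is that intention is inferred from how the utterance is spoken, not only from what is said.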