Saturday, January 26, 2019

Alexa! OK Google! Hey Siri! - How Do These Voice Recognition Systems Work?

Suddenly an electronic voice comes from Amazon Alexa:
“India is celebrating its 70th Republic Day.”
“India beat New Zealand in the second ODI.”

A very Happy Republic Day to all my fellow Indians. Jai Hind!



Language technology has matured to the point where it has a mass impact on users of English and many other languages. Audio has become a key instrument in today's technology and is driving it towards the future at a faster pace. Recent apps have shown that audio matters for far more than playing a playlist of mp3 files. The speech engine has earned a unique place in the IT world and is helping the industry grow in new ways.
Today computers can read information aloud to blind or illiterate users through text-to-speech systems, remote data can be accessed through telephone interfaces, and sophisticated search is available on the internet. A broad framework for voice recognition is already established. Acoustic algorithms analyze the pitch and other properties of each chunk of audio in order to convert it into text.



How is the processing done?

There are two parts to this technology:

  1. Text to Speech (TTS)
  2. Speech to Text (STT)


Text to Speech technology allows the computer to read a given text file aloud. TTS can make a text file accessible to a blind or illiterate person. It also allows interaction over the telephone, where the user cannot see the text. The user provides some written text, which is broken down into individual characters. Once the characters are recognized and a pitch is assigned to each of them, the electronic voice box generates the speech for that text.
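As a toy illustration of the flow described above (text split into characters, a pitch chosen for each, sound generated), here is a minimal sketch that uses only the Python standard library. It is not real speech synthesis: the character-to-pitch mapping `pitch_for` and all function names are invented for this example, whereas real TTS engines model phonemes and intonation.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 16000   # samples per second
TONE_SECONDS = 0.08   # length of the tone produced for each character

def pitch_for(char):
    """Hypothetical mapping: letters a-z get rising pitches, the rest is silence."""
    lower = char.lower()
    if "a" <= lower <= "z":
        return 200.0 + 15.0 * (ord(lower) - ord("a"))
    return 0.0  # spaces, digits and punctuation become a short pause

def synthesize(text, target):
    """Write one short tone per character to `target` (a path or a file object)."""
    samples_per_char = int(SAMPLE_RATE * TONE_SECONDS)
    frames = bytearray()
    for char in text:
        freq = pitch_for(char)
        for n in range(samples_per_char):
            value = math.sin(2 * math.pi * freq * n / SAMPLE_RATE) if freq else 0.0
            frames += struct.pack("<h", int(value * 0.5 * 32767))  # 16-bit sample
    with wave.open(target, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 2 bytes per sample
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(bytes(frames))
    return len(text) * samples_per_char  # number of audio frames written

def synthesize_bytes(text):
    """Same as synthesize(), but return the WAV data as bytes."""
    buffer = io.BytesIO()
    synthesize(text, buffer)
    return buffer.getvalue()
```

Calling `synthesize("Jai Hind", "demo.wav")` would write a short melody, one tone per letter, that any audio player can open.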

Speech to Text allows the computer to listen to spoken language and convert it into text. Automatic Speech Recognition (ASR) matters wherever the computer needs to understand a spoken command in some language and act on it for the user. Here the pitch is identified first, and the characters are generated from it. When the speech ends, the complete set of characters is available as text.

ASR is highly sensitive to noise, which can distort the pitch and lead to characters being recognized wrongly. ASR engines are still in their incubation stage and not completely mature. Where does the noise come from? It enters at the first stage, when the voice is recorded: the surrounding air and nearby devices and instruments can add noise that hampers recognition.
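To make the effect of noise on pitch concrete, here is a small standard-library sketch. It assumes a very simple pitch estimator (counting zero crossings) and a synthetic 200 Hz tone in place of a recorded voice. Both are simplifications chosen for illustration, and real ASR front ends use far more robust features.

```python
import math
import random

SAMPLE_RATE = 8000  # samples per second

def make_tone(freq_hz, seconds=1.0, noise_level=0.0, seed=0):
    """Sine wave with optional Gaussian noise added to every sample."""
    rng = random.Random(seed)          # fixed seed keeps the demo repeatable
    total = int(SAMPLE_RATE * seconds)
    return [
        math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE + 0.1)
        + rng.gauss(0.0, noise_level)
        for n in range(total)
    ]

def estimate_pitch(samples):
    """Estimate pitch from zero crossings: a pure tone crosses zero twice per cycle."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )
    seconds = len(samples) / SAMPLE_RATE
    return crossings / (2 * seconds)

clean = estimate_pitch(make_tone(200.0))
noisy = estimate_pitch(make_tone(200.0, noise_level=0.5))
print(f"clean recording: {clean:.1f} Hz")
print(f"noisy recording: {noisy:.1f} Hz")
```

The clean estimate lands close to the true 200 Hz, while the noisy one comes out much higher because the noise adds spurious zero crossings. A recognizer that trusted this pitch would pick the wrong characters.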


Practical Implementation of Text to Speech:

Programming languages such as Python can be used at both ends. Our computers and laptops already have an audio driver, and the pre-installed audio and video codecs let us play mp3, mp4, and other supported audio and video formats.

These languages have good libraries whose functions can trigger the voice-recognition process in very little time. Examples include PyAudio, Watson, and sr (the usual import alias of the SpeechRecognition package), some of which are provided by tech giants such as IBM and Google.

  • Once the library receives the text, it wraps it and compiles an audio file in the format the user chooses.
  • When the user or an automated script gives the instruction to play, that speech is played.
  • Another approach is to connect audio codecs, drivers, and microphones to boards such as the Raspberry Pi or Arduino, where embedded libraries handle the data for whichever direction (TTS or STT) is needed.
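The first two steps above (compile the text into an audio file, then play it on request) might look roughly like this in Python. The sketch assumes the third-party pyttsx3 package, which the article does not name. It has to be installed separately and needs a working speech driver on the machine. The function names are hypothetical.

```python
from pathlib import Path

def load_text(path):
    """Read a text file and collapse line wraps into single spaces."""
    return " ".join(Path(path).read_text(encoding="utf-8").split())

def save_speech(text, out_path):
    """Compile the text into an audio file (assumes pyttsx3 is installed)."""
    import pyttsx3                     # imported lazily: third-party dependency
    engine = pyttsx3.init()
    engine.save_to_file(text, str(out_path))
    engine.runAndWait()                # blocks until the file has been written

def play_speech(text):
    """Speak the text through the default audio device (assumes pyttsx3)."""
    import pyttsx3
    engine = pyttsx3.init()
    engine.say(text)
    engine.runAndWait()                # blocks until speaking has finished

# Example usage, with hypothetical file names:
#   text = load_text("notes.txt")
#   save_speech(text, "notes.mp3")
#   play_speech(text)
```

Which audio formats `save_speech` can actually write depends on the speech driver installed on the system.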

Applications of Speech to Text (Voice Recognition):

Truth be told, it has immense potential and can bring automation to systems everywhere.

Sundar Pichai, Google's CEO, demonstrated the power of this audio system when Google's assistant, invoked with Ok Google, was given the task of booking a haircut appointment. Once speech has been turned into text, that text is free to travel anywhere in the web search world, wherever the user wants to take it. You just say “Call Near sweet shop”, and the pain of typing those 17 letters disappears the moment the same request is given by voice. The result is returned in the same format as if the request had been typed.
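Once the recognizer has produced text, the application still has to decide what to do with it. A minimal sketch of that step, assuming a made-up rule that phrases starting with "call" trigger a phone call and everything else becomes a web search (the function name and the action labels are illustrative only):

```python
def route_command(recognized_text):
    """Map recognized speech to an (action, argument) pair."""
    words = recognized_text.strip().split()
    if not words:
        return ("none", "")            # nothing was heard
    verb = words[0].lower()
    rest = " ".join(words[1:])
    if verb == "call" and rest:
        return ("call", rest)          # e.g. look up the place and dial it
    return ("search", " ".join(words)) # fall back to an ordinary web search

print(route_command("Call Near sweet shop"))
print(route_command("India New Zealand second ODI score"))
```

The spoken request ends up as the same kind of structured query that typed text would produce, which is why the rest of the search pipeline does not need to change.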

  • Google Home and Alexa have become key instruments for home automation; they can put every electrically powered device in the home at the command of your voice.
  • Automatic Machine Translation (MT) instantly translates a given text from one language to another. Although the quality of the translation varies with the distance between the language pair and the technology used, it gives the user instant access to text in another language.
  • Apps like Kilkari are remarkable in themselves: Kilkari assists pregnant women by providing 72 lullabies for their babies, which can be played as needed, along with the guidance a woman needs during pregnancy and after delivery.

Localization and e-Content:

  • Localization in our context means that an electronic device supports the local language according to established standards. E.g., when one buys a phone, it should already have the language of the region built in alongside English, for keyboards, display, etc.
  • Use of standards is most important. It ensures that data created on one device is usable on any other electronic device, which improves portability. There is an acute need for e-content in Indian languages. While e-content is not a replacement for books, the younger generation increasingly relies on content available over the internet.
  • It was observed in Germany not so long ago that German youth were accessing English-language content much more than German content. It was realized that this situation had arisen because there was not enough content in German on the internet.

This conversion can be less tedious if the audio system works in both directions, i.e. converting text to speech and speech to text. The text in between can be translated from one language to another, which reduces the time, effort, and energy required and so improves productivity and efficiency.
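The chain described here (speech to text, text translated, text back to speech) can be sketched as three pluggable stages. Everything below is a stand-in: the fake recognizer, the two-word English-to-German glossary, and the fake synthesizer are invented so that the sketch runs without audio hardware, and a real system would plug an ASR engine, an MT service, and a TTS engine into the same three slots.

```python
def speech_to_speech(audio, recognize, translate, synthesize):
    """Chain the three stages; each stage is any callable supplied by the caller."""
    source_text = recognize(audio)         # Speech to Text
    target_text = translate(source_text)   # Machine Translation on the text
    return synthesize(target_text)         # Text to Speech

# Stand-in stages for demonstration only.
GLOSSARY = {"hello": "hallo", "world": "welt"}   # toy English-to-German word list

def fake_recognize(audio):
    return audio["transcript"]             # pretend the recognizer heard this

def word_by_word(text):
    return " ".join(GLOSSARY.get(word, word) for word in text.lower().split())

def fake_synthesize(text):
    return {"spoken": text}                # pretend this is generated audio

result = speech_to_speech(
    {"transcript": "Hello world"}, fake_recognize, word_by_word, fake_synthesize
)
print(result)
```

Keeping the text in the middle is the design point: any stage can be swapped for a better one without touching the other two.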

Supply Chain, Logistics, Manufacturing, and Security:

The industries mentioned above can further automate their processes by adopting voice-instruction-based control. In manufacturing, for example, products running on conveyor belts are typically controlled by operators pressing buttons from remote locations through PLC and SCADA systems. If these PLC-based systems accept their input through voice instructions instead, they can drive the same processes while the rest of the process stays unchanged.

Supply chain and logistics apps can provide day-to-day or real-time status through spoken notifications, so the user does not have to unlock the phone and open the app. Apps that provide voice-based notifications will reduce the time people spend on their phones.

Security, which always remains the biggest concern, can be strengthened by adding the voice system as a security layer: nothing can be accessed from outside unless the concerned user approves it with his or her voice. However, the stored voice should be updated from time to time, because voices change with age.
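A rough sketch of such a voice-based security layer follows. It assumes each voice has already been reduced to a numeric "voiceprint" vector. The vectors, the 0.9 threshold, and the function names are hypothetical, and real systems derive voiceprints from audio with a trained speaker model.

```python
import math

def cosine_similarity(a, b):
    """Similarity of two voiceprint vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def is_authorized(enrolled, sample, threshold=0.9):
    """Grant access only if the sample is close enough to the enrolled voice."""
    return cosine_similarity(enrolled, sample) >= threshold

def refresh_voiceprint(enrolled, sample, weight=0.2):
    """Blend a newly verified sample into the stored voiceprint,
    so the template follows gradual changes such as aging."""
    return [(1 - weight) * e + weight * s for e, s in zip(enrolled, sample)]

enrolled = [1.0, 0.0, 0.0]                         # stored at enrollment time
print(is_authorized(enrolled, [0.9, 0.1, 0.0]))    # same speaker, slight drift
print(is_authorized(enrolled, [0.0, 1.0, 0.0]))    # a different voice
```

Calling `refresh_voiceprint` only after a successful check is one simple way to implement the periodic update mentioned above.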

