Speech, the ability that characterizes human beings and enables communication, is no longer exclusive to people: it has extended to the world of computers. Nowadays it is possible to tell a mobile phone to call somebody, or to hear a device read out the balance of a bank account. However, researchers are still working to improve voice recognition systems.
The first attempts to create machines capable of imitating human communication emerged in the second half of the eighteenth century, with the aim of achieving successful interaction between people and machines. Later on, scientists realized that before a machine could understand speech, it first had to recognize it.
Jorge Gurlekian, a CONICET principal researcher at the Laboratorio de Investigaciones Sensoriales del Instituto de Inmunología, Genética y Metabolismo (INIGEM, CONICET-UBA) [Sensory Research Laboratory of the Institute of Immunology, Genetics, and Metabolism], and an interdisciplinary research team are studying the development of a voice recognition system. According to the researcher, much work remains to be done, because speech seems simple but is in fact a significant challenge for machines to understand.
Automatic speech recognition (ASR), also called automatic voice recognition, is a discipline of artificial intelligence whose objective is to enable spoken communication between humans and computers. “We usually use oral language without noticing the quantity and complexity of the processes involved in a conversation. Nonetheless, many of those processes pose considerable difficulties for computer systems”, the researcher explains. He adds that there is a major limitation to overcome, because speech involves not only a “what” but also a “how”: silences, pauses and intonation are key to effective communication.
Gurlekian therefore stresses that information is not transmitted through words alone: the way a sentence is expressed, its intonation and other factors enrich the discourse, but they tend to confuse the computer. “When we talk, the main challenge is to identify what is voice and what is noise. For a machine, it is not easy to know which sounds it should concentrate on. When we meet somebody, we adapt to the person’s pitch, tone and volume automatically, without asking him or her to speak for some time. Apart from that, it is very hard for a computer to distinguish similar phrases”.
Considering all those factors, researchers face a great challenge: to create a system that recognizes the speech of any person, a task at which humans are incredibly good.
The computer knows what I say
Gurlekian uses radio and television recordings to train an automatic system to learn words in the actual conditions in which they are spoken. This learning process takes place through the formation of an acoustic model for each phoneme, conditioned on the preceding and following phonemes. “The structure of language is represented by the most probable sequence of words in the discourse. This information, together with a dictionary of possible pronunciations for each word (for instance, the word ciudad (city) can be pronounced ciudad or ciudá), is part of the language model”, the researcher explains. He adds that “the database that was created includes allophonic variations as well as dialectal prosodic variations produced in each region of the country, intonation, musicality, and rhythm”.
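As a rough illustration of the two ingredients mentioned in the quote, the sketch below pairs a pronunciation dictionary with a simple count-based model of which word is most likely to follow another. It is not the team's system: the bigram approach, the function names and the three training sentences are all assumptions made for this example.

```python
from collections import Counter, defaultdict

# Hypothetical pronunciation dictionary: each word lists its accepted variants.
PRONUNCIATIONS = {
    "ciudad": ["ciudad", "ciudá"],
}


def word_for_pronunciation(heard):
    """Map a pronunciation variant back to its dictionary word, if any."""
    for word, variants in PRONUNCIATIONS.items():
        if heard in variants:
            return word
    return None


def train_bigrams(sentences):
    """Count how often each word follows another in the training text."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = ["<s>"] + sentence.split()  # "<s>" marks the sentence start
        for prev, cur in zip(words, words[1:]):
            counts[prev][cur] += 1
    return counts


def bigram_probability(counts, prev, cur):
    """Estimate P(cur | prev) from raw counts (no smoothing)."""
    following = counts.get(prev, Counter())
    total = sum(following.values())
    return following[cur] / total if total else 0.0


def most_probable_next(counts, prev):
    """Return the word seen most often after `prev`, or None if unseen."""
    following = counts.get(prev)
    return following.most_common(1)[0][0] if following else None


# Toy training text, invented for the example.
counts = train_bigrams([
    "la ciudad es grande",
    "la ciudad es linda",
    "la casa es grande",
])
print(word_for_pronunciation("ciudá"))    # ciudad
print(most_probable_next(counts, "la"))   # ciudad
```

In this toy text, "ciudad" follows "la" in two of three sentences, so the model prefers it over "casa". A real system would estimate such probabilities from a far larger corpus and apply smoothing for unseen word pairs.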
The voice recognition system classifies speech by matching it against patterns that it stores in dictionaries. During dictation, if a word is not in its vocabulary, the software looks for phonetically similar words that are available. This results in errors and highlights the need to train the programme to achieve more accurate recognition.
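The fallback described above, in which an unknown word is replaced by the closest one the system does know, can be sketched as follows. The article does not say how similarity is measured, so this example assumes plain edit distance over spelling as a stand-in for phonetic similarity. The vocabulary and function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def closest_in_vocabulary(heard, vocabulary):
    """Return the known word that differs least from what was heard."""
    return min(vocabulary, key=lambda word: edit_distance(heard, word))


# Hypothetical vocabulary that lacks the word actually spoken ("caza").
vocabulary = ["casa", "cosa", "mesa"]
print(closest_in_vocabulary("caza", vocabulary))  # casa
```

The example also shows where the errors come from: "caza" (hunt) is silently turned into "casa" (house) because it is the nearest entry the system has.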
“These systems are based on the development of probabilistic models for each acoustic unit of language, statistical models for the words that the user will be able to use, and pronunciation models that indicate how the acoustic units are connected to form words. The performance of these recognition systems depends on the quality of the recordings used for the task, the kind of speech and the characteristics of each speaker. With professional broadcasters and a dedicated recording environment, the recognition rates measured in the laboratory exceed 97 per cent”, Gurlekian states.
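A minimal sketch of how the three kinds of model in the quote might be combined: each candidate word is scored by how well its phonemes match the sound, and that score is weighted by how likely the word is in the first place. All numbers, names and the two-word vocabulary are invented for illustration, and the audio is assumed to be already split into one segment per phoneme, which real recognizers do not require.

```python
import math

# Hypothetical pronunciation model: the acoustic units that make up each word.
PRONUNCIATION = {
    "casa": ["k", "a", "s", "a"],
    "cosa": ["k", "o", "s", "a"],
}

# Hypothetical word model: how likely the user is to say each word.
WORD_PRIOR = {"casa": 0.6, "cosa": 0.4}


def acoustic_log_score(segments, phonemes):
    """Sum the log-probabilities the acoustic model gives to each phoneme."""
    if len(segments) != len(phonemes):
        return float("-inf")
    # A tiny floor stands in for phonemes the acoustic model did not propose.
    return sum(math.log(seg.get(ph, 1e-6)) for seg, ph in zip(segments, phonemes))


def recognize(segments):
    """Pick the word that maximizes acoustic score plus log prior."""
    scores = {
        word: acoustic_log_score(segments, phonemes) + math.log(WORD_PRIOR[word])
        for word, phonemes in PRONUNCIATION.items()
    }
    return max(scores, key=scores.get)


# One dict per segment: phoneme -> probability from a toy acoustic model.
segments = [{"k": 0.9}, {"a": 0.45, "o": 0.55}, {"s": 0.9}, {"a": 0.9}]
print(recognize(segments))  # casa
```

Here the sound alone slightly favours "cosa" (0.55 against 0.45 for the second phoneme), but the word model tips the decision towards "casa". This is the kind of interplay between models that the quote describes.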
Speech recognition technology has many possible applications, such as device control, speech-to-text dictation and searching within a sound archive. It can also facilitate communication for people with disabilities and support the development of voice-based security measures, among many other possibilities.
- By Jimena Naser
- About the research:
- Jorge Gurlekian. Principal researcher. INIGEM.
- Diego Evin. Associate researcher. INIGEM.
- Humberto Torres. Assistant researcher. INIGEM.
- Christian Cossio. Fellow. INIGEM.
- Miguel Martínez. UBA.
- Pedro Univaso. UBA.