Oh dear, oh dear… At 4BIS, we love technology, we live technology, so much so that we totally forgot about the other thing that unites crowds and connects humans: football. On Tuesday April 16, while the Ajax football club was handing its many fans an unexpected victory, we and the Haarlem Tech community were having another fun evening meet-up about Artificial Intelligence, wondering where everyone else was.

Not to worry, a few people still showed up, and there was leftover pizza which, let’s face it, is worth missing football for any day. Anyway. This isn’t an article to declare my undying love for pizzas; I’ve already written too many of those in my younger days. This is for those of you who couldn’t make it to the meet-up: a short summary of what was said, what we learnt, and how we somehow ended up discussing the eventual robot uprising with pints of delicious Dutch beer in our hands…

Complete science fiction just a couple of decades ago, speech recognition is today one of the most remarkable advancements in the field of Artificial Intelligence. Thanks to Voice Command Devices, humans are now able to have live conversations with virtual assistants on smartphones and smart speakers, without even pressing a single button. The ultimate goal in the quest to humanise technology is to teach computers to mimic human interactions in the most natural way possible, with emotions and the whole shebang. Can you imagine if your home assistant device could order you to do the dishes or complain about your smelly feet, just as realistically as a real human girlfriend would? Well, this might be happening sooner than we think…

At the meet-up, we focused on one particular virtual assistant, a pioneer that has been very successful on the market: Alexa, developed by Amazon. Alexa is a superstar as far as Voice User Interfaces (VUI) go: it is capable of voice interaction, music playback, making lists, setting alarms and scheduled reminders, streaming podcasts, playing audiobooks, and providing real-time information such as news, traffic, sports or weather. It can even control several smart devices inside your house and be used as a home automation system, managing lighting, home security, temperature, etc. Although Alexa is currently “only” programmed to understand English, German, French, Italian, Spanish and Japanese, Amazon announced in January 2019 that over 100 million Alexa-enabled devices had already been sold.

And it’s not going to stop there. The possibilities are endless. Alexa is so versatile that users can extend its capabilities themselves by installing “skills”, that is, additional functionality. Literally anyone can go ahead and build an Alexa Skill. Amazon has made available to the public an extensive collection of tools, tutorials, and APIs to help users take their first steps in the fascinating world of machine learning. It could be you, check it out here.

 

We invited Alexa experts to give presentations at the meet-up and with them we discussed the different levels of software infrastructure for the Alexa Voice User Interface, and how user-friendly it really is.

Our first speaker, who kindly came all the way from Madrid just for us, was Alexa technical evangelist German Viscuso. An experienced developer, technical writer and consultant, he is dedicated to making technology approachable to others and to developing communities around it.

Our second speaker was Jieke Pan, who is passionate about team leadership and extreme programming and is director of engineering at Mobiquity Inc, a digital consulting company that partners with the world’s leading brands to design and deliver compelling digital services for their customers. Jieke told us specifically about the GoodNes skill that Mobiquity developed for the giant Nestlé using the Alexa VUI: GoodNes combines voice and visuals to create a smart cooking assistant and provide the ultimate cooking experience to users.

 

Understanding how Alexa Skills work

 

Scenario: Mrs. Jones is 56 years old; she has three children and two cats. She loves pizzas, Stephen King’s books and quiet walks in the forest. She uses Alexa every day and built a “New York Pizza” skill on her device.

1/ Automatic speech recognition (ASR)

With ASR technology, Alexa can detect spoken words and convert them into text that a computer can process.

ASR can also be used for authenticating users via their voice: Mrs. Jones can register her voice when she first configures her device, which trains it to recognise her speech patterns and vocabulary and to respond to them. That is to prevent her cats from trying to take over and destroy the world, obviously. This aspect of the technology, however, is sensitive under privacy laws: can a private company like Amazon really record and retain extremely personal customer information, such as a voice? For this reason, individual voice detection is only available in the US and is still in beta (not in production).

2/ Natural language understanding (NLU)

NLU is the post-processing of the text: it identifies words and uses context to discern the meaning of a voice command. After recognising the information, NLU can derive the intent (Mrs. Jones wants pizza) and send the correct output message (a structured representation of Mrs. Jones’ request) to the relevant service (the pizza skill) for it to execute the action.
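To make that “structured representation” a bit more concrete, here is a minimal sketch of what a skill backend might do with it. The JSON shape follows the request and response format used by the Alexa Skills Kit, but the intent name OrderPizzaIntent, the slot name topping and the handle_request function are hypothetical, made up for Mrs. Jones’ pizza skill, and were not shown at the meet-up.

```python
def handle_request(event):
    """Turn an Alexa-style request envelope into a spoken response.

    'OrderPizzaIntent' and the 'topping' slot are hypothetical names
    invented for this example.
    """
    request = event["request"]

    if (request["type"] == "IntentRequest"
            and request["intent"]["name"] == "OrderPizzaIntent"):
        # Slots carry the variable parts of the sentence picked out by NLU.
        slots = request["intent"].get("slots", {})
        topping = slots.get("topping", {}).get("value", "margherita")
        speech = f"Okay, ordering one {topping} pizza."
    else:
        speech = "Sorry, I can only order pizza."

    # Response envelope: the text below is what gets handed to text-to-speech.
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": speech},
            "shouldEndSession": True,
        },
    }


# What NLU might send after Mrs. Jones asks for a pepperoni pizza.
sample_event = {
    "version": "1.0",
    "request": {
        "type": "IntentRequest",
        "intent": {
            "name": "OrderPizzaIntent",
            "slots": {"topping": {"name": "topping", "value": "pepperoni"}},
        },
    },
}

reply = handle_request(sample_event)
print(reply["response"]["outputSpeech"]["text"])
```

The point of the sketch is the division of labour: the skill never sees audio, only the intent and its slots, and it answers with plain text that the next step turns back into speech.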

3/ Text-to-speech (TTS)

Alexa includes a text-to-speech (TTS) system that assigns phonetic transcriptions to written words and then sends them to a synthesiser, which converts the linguistic representation into sound. That’s how Alexa can speak!

Stuff to look forward to

 

Above is the basic cycle of events when Alexa and a human are interacting, but of course Alexa development teams are constantly looking to further improve the technology and make Alexa’s spoken delivery sound as natural as possible. Throughout the meet-up our speakers explored with us some of the exciting things that may well be in Alexa’s near future.

Many Artificial Intelligence systems are built on artificial neural networks, which are either standard feedforward networks (information moves in one direction only) or recurrent networks (information moves in a cycle, so earlier outputs feed back in as inputs). The latter totally revolutionised deep learning: a long short-term memory (LSTM) recurrent neural network makes the machine capable of not only processing but also remembering information. A common LSTM unit is composed of a cell, an input gate, an output gate and a forget gate. The cell remembers values over arbitrary time intervals and the three gates regulate the flow of information into and out of the cell. To put it simply, an LSTM can predict what’s coming based on past “experience”. This is a fantastic little piece of magic for an interface like Alexa, which thanks to LSTM can learn over time and produce increasingly human-like sentences the more it interacts with humans.
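For the curious, the cell and its three gates can be sketched in a few lines. This is a toy, single-number version of the standard LSTM update, not anything taken from Alexa itself; the lstm_step function and the weight values are assumptions chosen purely for illustration (real networks use large vectors and learn their weights from data).

```python
import math


def sigmoid(x):
    """Squash a number into (0, 1); used by the gates as a 'how much' dial."""
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w):
    """One time step of a scalar LSTM unit.

    w maps each name to (input weight, recurrent weight, bias).
    Returns the new output h and the new cell state c.
    """
    def pre(name):
        wx, wh, b = w[name]
        return wx * x + wh * h_prev + b

    f = sigmoid(pre("forget"))       # forget gate: how much old memory to keep
    i = sigmoid(pre("input"))        # input gate: how much new information to store
    o = sigmoid(pre("output"))       # output gate: how much memory to reveal
    g = math.tanh(pre("candidate"))  # candidate value that could be stored

    c = f * c_prev + i * g           # the cell: kept memory plus admitted new info
    h = o * math.tanh(c)             # the unit's output for this step
    return h, c


# Hypothetical weights, picked by hand for the demo.
weights = {
    "forget": (0.5, 0.1, 1.0),
    "input": (0.6, 0.2, 0.0),
    "output": (0.4, 0.3, 0.0),
    "candidate": (0.9, 0.5, 0.0),
}

h, c = 0.0, 0.0
for x in [1.0, 0.5, -0.3]:
    h, c = lstm_step(x, h, c, weights)
    print(f"input={x:+.1f}  output={h:.4f}  cell={c:.4f}")
```

Because the cell state c is carried from one step to the next, what the unit outputs for the third input still depends on the first two, which is the “remembering” described above.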

Beyond voice context, LSTM also has applications for many more fun things, which are still in the experimental phase. For instance, acoustic event detection could be a strong asset for home security systems: imagine if Alexa were able to detect the sound of glass breaking, or a smoke alarm…

Although so far virtual assistants continue to speak in a relatively robotic, monotonous, emotionless way, it won’t be too long until the programming wizards of planet Earth manage to develop the technology to give the perfect illusion of a natural, spontaneous conversation, hence creating the ultimate user experience. All the big boys in the market are working hard on this and Amazon Alexa is no exception. Its latest text-to-speech system, which uses a generative neural network, can learn to employ a newscaster style from just a few hours of training data. The new whisper mode has been available since 2018, and is quite handy when you want to speak to Alexa without waking up your wife next to you. These are just two examples of how fast Alexa is changing, and it’s just the beginning!
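Skill builders can already play with both of these. Alexa accepts responses written in SSML (Speech Synthesis Markup Language), and the sketch below wraps a sentence in the whispered effect tag and in the news domain tag. The helper names whisper, newscaster and ssml_output are my own, and whether the newscaster style is available depends on the locale and voice, so treat this as an illustration rather than a recipe.

```python
from xml.sax.saxutils import escape


def whisper(text):
    """Wrap text in Alexa's whispered-effect SSML tag."""
    return (
        '<speak><amazon:effect name="whispered">'
        f"{escape(text)}"
        "</amazon:effect></speak>"
    )


def newscaster(text):
    """Wrap text in the news-domain SSML tag (availability varies by locale)."""
    return (
        '<speak><amazon:domain name="news">'
        f"{escape(text)}"
        "</amazon:domain></speak>"
    )


def ssml_output(ssml):
    """Build the outputSpeech part of a skill response for SSML content."""
    return {"outputSpeech": {"type": "SSML", "ssml": ssml}}


print(whisper("The pizza is on its way."))
print(newscaster("Ajax wins, and nobody came to the meet-up."))
```

escape() is there because SSML is XML: a stray & or < in the text would otherwise produce an invalid document.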

While enjoying our well-deserved post-meet-up drinks, we had lively debates and very interesting discussions about the future of Artificial Intelligence and how far humans are going to take it. Overall, it was a great evening.

We are happy to read your thoughts in the comments section below, and we hope you can join us at the next meet-up!

Have a nice day,

The 4BIS team.