Agora Inc.

18/07/2024 | News release | Distributed by Public on 19/07/2024 02:55

The Impact of Latency in Speech-Driven Conversational AI Applications

As conversational AI tech continues to advance, major hurdles remain in rolling out real-time voice and video communication with LLMs. In a previous blog we highlighted the importance of overcoming the challenging conditions of the last mile. Latency (delay) is another challenge that must be overcome when enabling speech-driven conversational AI in an application.

In this blog we focus on the impact of latency in speech-driven conversational AI applications, including:

  • Research studies and industry standards which specify the level of latency typical and acceptable for a natural and fluent conversation between humans.
  • The components which contribute to higher latency when a human interacts with a Large Language Model (LLM) using speech over the internet.
  • How latency can be minimized to deliver the best possible human-machine conversational experience.

Latency in natural human conversation

On the OpenAI GPT-4o announcement page, the company highlights that GPT-4o "can respond to audio inputs in as little as 232 milliseconds (ms), with an average of 320 ms which is like human response time in a conversation." The study referenced there is titled "Universals and cultural variation in turn-taking in conversation." It covers 10 representative languages and found that the mean response offset of turn transitions was about 208 ms. The conversations analyzed were videotaped interactions between participants in the same location. For in-person conversations, the mouth-to-ear delay (the time between one person speaking and the other person hearing) is quite low: with speakers about 2 meters apart, it is about 6 ms. See Figure 1 below as an illustrative example.

Figure 1: Latency with in-person conversation
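
The 6 ms figure for in-person conversation follows from the speed of sound. The short sketch below reproduces it; the value of 343 m/s (dry air at roughly 20 °C) and the function name are assumptions made for illustration and do not come from the study.

```python
# Assumption: speed of sound in dry air at about 20 degrees Celsius.
SPEED_OF_SOUND_M_PER_S = 343.0


def acoustic_delay_ms(distance_m: float,
                      speed_m_per_s: float = SPEED_OF_SOUND_M_PER_S) -> float:
    """Time for sound to travel distance_m through air, in milliseconds."""
    return distance_m / speed_m_per_s * 1000.0


# Two speakers standing about 2 meters apart.
print(f"{acoustic_delay_ms(2.0):.1f} ms")  # prints "5.8 ms", i.e. about 6 ms
```

Any application-level mouth-to-ear delay is added on top of this acoustic baseline, which is why device and network delays dominate in the remote cases discussed next.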

For an application where the intent is to support 'Conversational AI,' it is important to emulate natural conversation. To emulate natural conversation, it is crucial to consider the latency for both mouth-to-ear delay and the latency in turn taking in conversation. Further, since Conversational AI applications today require interaction between a user leveraging a device and infrastructure in the cloud, all elements which contribute to increasing latency must be understood and minimized for the best experience.

Latency in human conversations via RTC applications

Let us look at a case (see Figure 2 below) where two people are in separate locations and are using an application on their mobile devices to communicate with each other on an audio call.

Figure 2: Mouth-to-Ear Delay in Conversation Between 2 People Using a Mobile Phone Application to Talk

If the user of mobile phone 1 is holding the phone microphone directly to their mouth and the user of mobile phone 2 is holding the phone speaker directly to their ear, the mouth-to-ear delay in this case would be the sum of the individual delays in all the boxes shown above. The delay contribution for both mobile devices is shown in Table 1. For comparison, we show the typical delay as well as the reduced delay which Agora has achieved with device and operating system optimizations.

| Mobile Device Delay Contributor | Typical Delay for iOS (ms) | Agora Optimized iOS (ms) | Typical Delay for Android* (ms) | Agora Optimized Android (ms) |
|---|---|---|---|---|
| Mic input delay | 25 | 15 | 60-80 | 25 |
| Pre-processing delay (HW/SW) | 60 | 10 | 60-100 | 10 |
| Codec encoding delay | 10 | ~0 | 10 | ~0 |
| Packetization delay | ~0-40 | ~0 | ~0-40 | ~0 |
| Jitter buffer delay | > 60 | 20 | > 60 | 20 |
| Codec decoding delay | 1 | ~0 | 1 | ~0 |
| Speaker output (playout) delay | 25 | 15 | 160-250 | 45 |
| Total for Device | > 181 | 60 | > 350 | 100 |
Table 1: Sum of all delays contributed by mobile devices

*By default, the Java ADM (Audio Device Module) is typically used because of its broader compatibility across devices; however, its playout delay is often remarkably high.

The Network Stacks & Transit delay, defined as the total time it takes for speech packets to transit the network edge to edge, can vary significantly depending on whether the users are in the same city or in different cities, states, or countries. In our testing, we compared the one-way latency over the public internet and over Agora's proprietary Software Defined Real-Time Network (SD-RTN™). This one-way latency is measured from network edge to network edge, not including the last mile hop on each end. We compared data both within a continent (intra-region) and between continents (inter-region). The results are shown in Figure 3 below.

Figure 3: One-way Latency over the public internet vs. Agora SD-RTN™

For simplicity, the key takeaway is that 95% of users, whether communicating within the same region or between geographic regions, experience a greater than 50% improvement (reduction) in latency on Agora's SD-RTN™ compared with the public internet.

Let us assume that the two mobile phone users are both located within North America. In this case, 95% of users on the public internet would have no more than ~94 ms of latency, while users on Agora's SD-RTN™ would have ~33 ms. The best possible latency for the mobile last mile hop is approximately 10 ms between servers on the public internet and the mobile device, and 10 ms between Agora's SD-RTN™ and the mobile device using Agora's SDK. This 10 ms number assumes that the last hop is in the same city as the user on the mobile device and that the last mile connection is excellent. With one such hop at each end of the call, the last mile adds 20 ms in total. Using these numbers, the total mouth-to-ear delay can be estimated as shown in Table 2.

| Case | Total for Device (ms) | Network Stacks & Transit Delay (ms) | Total mouth-to-ear delay (ms) |
|---|---|---|---|
| Two iOS Devices on Public Internet | >181 | 94+20 = 114 | >295 |
| Two Agora Optimized iOS Devices on SD-RTN™ | 60 | 33+20 = 53 | 113 |
| Two Android Devices on Public Internet | >350 | 94+20 = 114 | >464 |
| Two Agora Optimized Android Devices on SD-RTN™ | 100 | 33+20 = 53 | 153 |
Table 2: Total mouth-to-ear delay estimation
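
The arithmetic behind Table 2 can be sketched in a few lines. This is only an illustration of the estimate described above: the function and variable names are hypothetical, the device totals for the unoptimized cases are lower bounds (the table marks them with ">"), and the latency figures are the North America values quoted in the text.

```python
LAST_MILE_MS = 10  # best-case last mile hop, one at each end of the call


def network_delay_ms(edge_to_edge_ms: int, last_mile_ms: int = LAST_MILE_MS) -> int:
    """Edge-to-edge transit plus one last mile hop per phone."""
    return edge_to_edge_ms + 2 * last_mile_ms


def mouth_to_ear_ms(device_total_ms: int, edge_to_edge_ms: int) -> int:
    """Device delay plus network delay for one direction of the call."""
    return device_total_ms + network_delay_ms(edge_to_edge_ms)


# (device total in ms, edge-to-edge latency in ms) for each row of Table 2.
CASES = {
    "Two iOS devices, public internet": (181, 94),
    "Two Agora optimized iOS devices, SD-RTN": (60, 33),
    "Two Android devices, public internet": (350, 94),
    "Two Agora optimized Android devices, SD-RTN": (100, 33),
}

for name, (device_ms, transit_ms) in CASES.items():
    print(f"{name}: {mouth_to_ear_ms(device_ms, transit_ms)} ms")

# Relative reduction in edge-to-edge latency for the North America example.
reduction = (94 - 33) / 94
print(f"Edge-to-edge reduction: {reduction:.0%}")
```

For the two public internet rows the printed totals should be read as "greater than" values, since the underlying device totals are lower bounds.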

Now that we have these estimations, how do we know whether the mouth-to-ear delay levels are acceptable to users or not? Fortunately, the International Telecommunication Union published a standard called G.114 which answers this question.

The figure below, extracted from the ITU G.114 standard, depicts the telecommunication industry's study on voice latency vs. user satisfaction quality.

Figure 4: From ITU G.114 standard on mouth-to-ear delay vs. user satisfaction

Referring to Figure 4, users are satisfied with a mouth-to-ear delay of up to 275 ms. Between 275 ms and 385 ms, some users are dissatisfied. Beyond this, the experience is poor.

| Case | Total mouth-to-ear delay (ms) | ITU G.114 User Satisfaction Rating |
|---|---|---|
| Two iOS Devices on Public Internet | >295 | Some Users Dissatisfied |
| Two Agora Optimized iOS Devices on SD-RTN™ | 113 | Users Very Satisfied |
| Two Android Devices on Public Internet | >464 | Many Users Dissatisfied |
| Two Agora Optimized Android Devices on SD-RTN™ | 153 | Users Very Satisfied |
Table 3: ITU G.114 user satisfaction rating for mobile-to-mobile audio call from Table 2.
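
The mapping from delay to rating can be expressed as a simple threshold lookup. The sketch below uses the 275 ms and 385 ms boundaries quoted above; the 200 ms boundary between "Users Very Satisfied" and "Users Satisfied" is an assumption based on an approximate reading of the G.114 figure and is not stated in this blog. The function name is hypothetical, and everything above 385 ms is grouped into a single band for simplicity.

```python
def g114_rating(delay_ms: float) -> str:
    """Approximate ITU G.114 satisfaction band for a mouth-to-ear delay."""
    if delay_ms <= 200:  # assumption: boundary not quoted in the text
        return "Users Very Satisfied"
    if delay_ms <= 275:
        return "Users Satisfied"
    if delay_ms <= 385:
        return "Some Users Dissatisfied"
    return "Many Users Dissatisfied"


# Totals from Table 2; the public internet values are lower bounds.
for delay_ms in (295, 113, 464, 153):
    print(delay_ms, "->", g114_rating(delay_ms))
```

Because the public internet totals are lower bounds, the real rating for those cases can only be the same or worse than the band printed here.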

Referring to Table 3, the device and operating system level optimizations, together with the network latency optimization that Agora supports, result in far lower overall latency and higher G.114 user satisfaction ratings.

Latency in human-to-AI conversations

With this background and context, let us now look at an example of speech-driven conversational AI where the AI agent is at the network edge, as shown in Figure 5. For simplicity, we will assume that the AI workflow and inference take place at the edge of the network. In this example, we assume the LLM supports a direct speech interface (Audio LLM), which means there is no Speech-to-Text conversion required. The acronym TTS TTFB refers to the Text-To-Speech Time-To-First-Byte: the duration between the LLM's request to generate the Text-To-Speech response and the generation of the first byte of that response.

Figure 5: Latency in a Conversational AI Example

Using this example, let us estimate three delays: the mouth-to-ear delay from a human using a conversational AI app on their mobile phone to the Audio LLM based AI, the turn-taking delay of the Audio LLM based AI, and the mouth-to-ear delay from the Audio LLM based AI back to the human using the conversational AI app on their mobile phone. In this example, we will assume the human user is using an Android phone.

| Delay Contributor | Typical Delay (ms) | Agora Optimized (ms) |
|---|---|---|
| Mic input delay | 60-80 | 25 |
| Pre-processing delay (HW/SW) | 60-100 | 10 |
| Codec encoding delay | 10 | ~0 |
| Packetization delay | ~0-40 | ~0 |
| Network Stacks & Transit Delay | 10 | 10 |
| Audio LLM Jitter buffer delay | 40 | 40 |
| Audio LLM Codec decoding delay | 1 | 1 |
| Total | > 181 | 86 |
Table 4: Estimated mouth-to-ear delay from Android phone user to Audio LLM based AI

| Delay Contributor | Optimized Delay (ms) |
|---|---|
| Audio LLM Delay | 100 |
| Sentence Aggregation Delay | 100 |
| TTS TTFB Delay | 80 |
| Total | 280 |
Table 5: Estimated turn-taking delay of the Audio LLM based AI

| Delay Contributor | Typical Delay (ms) | Agora Optimized (ms) |
|---|---|---|
| Audio LLM Codec encoding delay | 21 | 21 |
| Audio LLM Packetization delay | 2 | 2 |
| Network Stacks & Transit Delay | 10 | 10 |
| Jitter buffer delay | >60 | 20 |
| Codec decoding delay | 1 | ~0 |
| Speaker output (playout) delay | 160-250 | 45 |
| Total | > 254 | 98 |
Table 6: Estimated mouth-to-ear delay from Audio LLM based AI to Android Phone User
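
The totals in these tables are sums of the lower bound of each contributor, which is why the typical column is reported as "greater than" a value. The sketch below illustrates that reading for Table 6; the helper names are hypothetical and the cell strings are copied from the table.

```python
import re


def lower_bound_ms(cell: str) -> int:
    """Smallest delay a table cell allows: '160-250' -> 160, '>60' -> 60, '~0' -> 0."""
    match = re.search(r"\d+", cell)
    if match is None:
        raise ValueError(f"no number in cell: {cell!r}")
    return int(match.group())


def total_label(cells: list[str]) -> str:
    """Sum the lower bounds; prefix '>' when any cell is an open-ended bound."""
    total = sum(lower_bound_ms(cell) for cell in cells)
    open_ended = any(">" in cell for cell in cells)
    return f"> {total}" if open_ended else str(total)


# Delay contributors from Table 6, in row order.
TABLE_6_TYPICAL = ["21", "2", "10", ">60", "1", "160-250"]
TABLE_6_OPTIMIZED = ["21", "2", "10", "20", "~0", "45"]

print(total_label(TABLE_6_TYPICAL))    # prints "> 254"
print(total_label(TABLE_6_OPTIMIZED))  # prints "98"
```

Taking the upper end of each range instead would give a worst-case estimate, which this blog does not report.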

In this example, the estimated mouth-to-ear delay from the Audio LLM based AI to the Android phone user is near the 'Some Users Dissatisfied' threshold according to ITU G.114. This is a scenario where the network stacks and transit delay are minimal, given that the AI workflow and inference are assumed to be performed at the edge of the network closest to the user. There will be many scenarios where humans will be interacting with other humans and one or more conversational AI agents over distance. Referring to Figure 3, the latency contribution of the network stacks and transit delay, in conjunction with the mobile device delay contribution, can often cause the mouth-to-ear delay to exceed the threshold where users will be dissatisfied with their conversational AI experience.

Latency in human-to-AI conversations with AI agent located intra-region

Finally, let us look at the same scenario where the AI agent is located intra-region rather than right at the network edge. This scenario will become more common as conversational AI solutions scale up and people are able to interact with one or more AI agents during a session.

For simplicity, let us assume that the user and the AI agent are both located within North America. In this case, 95% of users on the public internet would have no more than ~94 ms of latency, while users on Agora's SD-RTN™ would have ~33 ms.

| Delay Contributor | Typical Delay (ms) | Agora Optimized (ms) |
|---|---|---|
| Mic input delay | 60-80 | 25 |
| Pre-processing delay (HW/SW) | 60-100 | 10 |
| Codec encoding delay | 10 | ~0 |
| Packetization delay | ~0-40 | ~0 |
| Network Stacks & Transit Delay | 104 | 43 |
| Audio LLM Jitter buffer delay | 40 | 40 |
| Audio LLM Codec decoding delay | 1 | 1 |
| Total | > 275 | 119 |
Table 7: Estimated mouth-to-ear delay from Android phone user to Audio LLM based AI

| Delay Contributor | Optimized Delay (ms) |
|---|---|
| Audio LLM Delay | 100 |
| Sentence Aggregation Delay | 100 |
| TTS TTFB Delay | 80 |
| Total | 280 |
Table 8: Estimated turn-taking delay of the Audio LLM based AI

| Delay Contributor | Typical Delay (ms) | Agora Optimized (ms) |
|---|---|---|
| Audio LLM Codec encoding delay | 21 | 21 |
| Audio LLM Packetization delay | 2 | 2 |
| Network Stacks & Transit Delay | 104 | 43 |
| Jitter buffer delay | >60 | 20 |
| Codec decoding delay | 1 | ~0 |
| Speaker output (playout) delay | 160-250 | 45 |
| Total | > 348 | 131 |
Table 9: Estimated mouth-to-ear delay from Audio LLM based AI to Android Phone User
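
The intra-region figures differ from the network edge case only in the Network Stacks & Transit row. The values 104 ms and 43 ms appear to be the North America edge-to-edge latency plus a single 10 ms last mile hop on the phone side; that decomposition is an inference from the numbers rather than something stated explicitly, and the names below are hypothetical. The sketch checks the AI-to-phone direction against Table 6 and Table 9.

```python
LAST_MILE_MS = 10  # one best-case last mile hop between the phone and the network

# 95th-percentile edge-to-edge latency within North America, from the text.
EDGE_TO_EDGE_MS = {"public internet": 94, "SD-RTN": 33}

# AI-to-phone totals with the agent at the network edge (Table 6).
# The public internet value is a lower bound.
EDGE_AGENT_TOTAL_MS = {"public internet": 254, "SD-RTN": 98}


def intra_region_transit_ms(path: str) -> int:
    """Network Stacks & Transit delay when the agent is located intra-region."""
    return EDGE_TO_EDGE_MS[path] + LAST_MILE_MS


def intra_region_total_ms(path: str) -> int:
    """Edge-case total plus the extra edge-to-edge transit."""
    return EDGE_AGENT_TOTAL_MS[path] + EDGE_TO_EDGE_MS[path]


for path in EDGE_TO_EDGE_MS:
    print(path, intra_region_transit_ms(path), intra_region_total_ms(path))
```

The same shift applies in the phone-to-AI direction, for example 86 ms + 33 ms = 119 ms for the Agora optimized column of Table 7.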

In this example, the estimated mouth-to-ear delay from Audio LLM based AI to the Android Phone user is well within the 'Some Users Dissatisfied' region according to ITU G.114. For an inter-region case, the experience can easily enter the 'Many Users Dissatisfied' region.

In conclusion, it is essential to minimize latency when implementing speech-driven conversational AI in your application. As discussed in this blog, to emulate natural conversation it is crucial to consider both the mouth-to-ear delay and the latency in turn-taking. To minimize the mouth-to-ear delay, it is essential to partner with a provider that offers a proven solution which optimizes latency at both the device level and the network level to ensure a satisfying conversational AI experience with your application. To minimize the latency in turn-taking, consider an LLM provider and a solution provider who have demonstrated actual performance in this area. Learn more about how Agora helps developers build conversational AI.