Simple ways to find out what AI can do

Apart from drawing photo-realistic images and holding seemingly sentient conversations, AI has failed to deliver on many of its promises. The resulting rise in AI skepticism leaves us with a choice: we can grow too cynical and watch from the sidelines as the winners emerge, or we can find a way to filter through the noise, quickly identify the business breakthroughs, and participate in a historic economic opportunity.

There is a simple framework for differentiating near-term reality from science fiction. We use the most important measure of maturity of any technology: its ability to handle unforeseen events, commonly referred to as edge cases. As a technology hardens, it becomes more adept at handling increasingly rare edge cases and, as a result, gradually unlocks new applications.

Edge case reliability is measured differently for different technologies. For a cloud service, availability could be a way to gauge reliability. For AI, a better measure would be its accuracy. When an AI fails to handle an edge case, it produces a false positive or a false negative. Precision is a metric that captures false positives, and recall captures false negatives.

Here’s an important insight: Today’s AI can achieve very high performance if it focuses on either precision or recall. In other words, it optimizes one at the expense of the other (i.e. fewer false positives in exchange for more false negatives, and vice versa). But when it comes to achieving high performance on both elements simultaneously, AI models struggle. Solving this remains the holy grail of AI.
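To make the trade-off concrete, here is a minimal sketch in Python. The scores and labels are invented for illustration (they are not from any real system), and `precision_recall` is a hypothetical helper. It shows how moving a single decision threshold pushes precision and recall in opposite directions.

```python
# Hypothetical fraud scores (higher = more suspicious) paired with ground truth.
# The data is invented purely to illustrate the trade-off.
scored = [
    (0.95, True), (0.90, True), (0.80, False), (0.70, True),
    (0.60, False), (0.40, True), (0.30, False), (0.10, False),
]

def precision_recall(threshold, samples):
    """Flag every sample whose score is >= threshold, then score the flags."""
    tp = sum(1 for s, fraud in samples if s >= threshold and fraud)
    fp = sum(1 for s, fraud in samples if s >= threshold and not fraud)
    fn = sum(1 for s, fraud in samples if s < threshold and fraud)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

# A strict threshold yields few false positives but misses fraud;
# a lenient one catches all fraud but raises many false alarms.
for t in (0.85, 0.5, 0.2):
    p, r = precision_recall(t, scored)
    print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
```

On this toy data, the strictest threshold gives perfect precision with only half the fraud caught, while the most lenient one catches all of it at the cost of many false alarms.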

Low fidelity or high fidelity AI

Based on the above, we can categorize AI into two classes: high fidelity (hi-fi) and low fidelity (lo-fi). An AI with high precision or high recall, but not both, is lo-fi. One with both high precision and high recall is hi-fi. Today, the AI models used in image recognition, content personalization, and spam filtering are lo-fi. The models required by robo-taxis, however, must be hi-fi.

There are a few important tidbits about lo-fi and hi-fi AI worth noting:

  • Lo-fi works: Most algorithms today are designed to maximize precision at the expense of recall, or vice versa. For example, to avoid missing fraudulent credit card charges (minimizing false negatives), a model can be designed to aggressively flag charges with the slightest indication of fraud, thereby increasing false positives.
  • Hi-fi = science fiction: Today, there is no commercial application based on hi-fi AI. In fact, hi-fi AI may be decades away, as discussed below.
  • Hi-fi is rarely needed: In many areas, smart product and business decisions can shift AI needs from hi-fi to lo-fi with minimal or acceptable business impact. To do this, product managers need to understand the limitations of AI and account for them in their design process.
  • Urgent safety decisions require hi-fi: Urgent safety decisions are an area where hi-fi AI is often needed. This is where many self-driving car use cases tend to fall.
  • Lo-fi + humans = hi-fi: Safety use cases aside, it is often possible to achieve hi-fi performance by combining artificial and human intelligence. Products can be designed to incorporate human assistance at appropriate times, either from the user or from support personnel, to achieve the desired levels of precision and recall.
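One way to picture the "lo-fi + humans" idea is a confidence band: the AI acts on its own only when it is confident, and hands everything else to a person. The sketch below is an illustration under assumptions, not a description of any particular product. The `decide` function, the thresholds, and the scores are all hypothetical.

```python
def decide(score, low=0.2, high=0.9):
    """Route one prediction: act automatically only when the model is confident.

    `low` and `high` are hypothetical cut-offs; a real product would tune them
    against the cost of errors and the cost of human attention.
    """
    if score >= high:
        return "auto_positive"
    if score <= low:
        return "auto_negative"
    return "ask_human"

# Hypothetical model confidence scores for five cases.
scores = [0.97, 0.55, 0.05, 0.88, 0.15]
decisions = [decide(s) for s in scores]

# Fraction of cases escalated to a person.
escalation_rate = decisions.count("ask_human") / len(decisions)
print(decisions, escalation_rate)
```

Widening the band between `low` and `high` lowers the AI's share of mistakes at the price of more human work, which is the product decision the bullet above describes.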

Quantifying AI fidelity

A popular metric for assessing the reliability of AI is the F1 score, which is the harmonic mean of precision and recall and thus accounts for both false positives and false negatives. An F1 of 100% represents a perfectly error-free AI that handles all edge cases. By our estimate, some of the best AIs today score around 99%, although a score above 90% is generally considered high.

Let’s calculate the F1 score for two applications:

  • If Spotify plays songs you like 95% of the time (precision), but only surfaces half of the songs you would like (50% recall), its F1 would be about 65%. This is an adequate score because high precision leads to a great user experience and low churn, while low recall goes unnoticed by users.
  • When a robo-taxi decides whether to stop at a traffic light, it is making an urgent safety decision. Running a red light (false negative) and braking unexpectedly at a green one (false positive) both pose a high risk of collision. We devised a method to estimate the level of AI accuracy needed for autonomous vehicles to reach parity with human drivers, taking into account current crash rates at intersections and other factors. We estimate that a robo-taxi must achieve over 99.9999% precision and 99.9999% recall in red light detection to be on par with humans. That is an F1 of 99.9999%, or six nines.
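The two scores above can be reproduced with the standard F1 formula, F1 = 2PR / (P + R), where P is precision and R is recall. The short Python sketch below does the arithmetic; the `nines` helper is a hypothetical convenience for expressing a score as a count of nines.

```python
import math

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def nines(score):
    """Express a score below 1.0 as a count of nines, e.g. 0.999 -> about 3."""
    return -math.log10(1 - score)

# Music recommendation example: 95% precision, 50% recall.
music = f1(0.95, 0.50)
print(f"music F1 = {music:.3f}")  # roughly 0.655

# Red light detection example: six nines on both metrics.
taxi = f1(0.999999, 0.999999)
print(f"robo-taxi F1 = {taxi:.6f}, about {nines(taxi):.1f} nines")
```

Because F1 is a harmonic mean, it is pulled toward the weaker of the two metrics, which is why the music example lands much closer to 50% than a simple average of 95% and 50% would.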

It’s clear from the examples above that a 65% F1 is easily achievable by today’s AI, but how far are we from a six-nines F1?

A roadmap to hi-fi

As stated earlier, the market maturity and readiness for any technology is tied to how well it handles edge cases. For AI, the F1 score can be a useful proxy for maturity. Similarly, for previous waves of digital innovation such as web and cloud, we can use their availability as a signal of maturity.

As a 30-year-old technology, the web is one of the most trusted digital experiences. The most mature sites, like Google and Gmail, aim for 99.999% availability (five nines), which means the service should be unavailable for no more than about five minutes per year. This target is sometimes missed by a wide margin, as with YouTube’s 62-minute disruption in 2018 or Gmail’s six-hour outage in 2020.

At roughly half the age of the web, the cloud is less reliable. Most services offered by Amazon AWS have an availability SLA of 99.99%, or four nines. That allows ten times as much downtime as Gmail’s target, but it is still very high.
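For a sense of scale, an availability figure converts to allowed downtime with simple arithmetic. The Python sketch below assumes a 365-day year, and `downtime_minutes_per_year` is a hypothetical helper name.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years

def downtime_minutes_per_year(availability):
    """Minutes per year a service may be down at a given availability."""
    return (1 - availability) * MINUTES_PER_YEAR

for label, availability in [("five nines", 0.99999), ("four nines", 0.9999)]:
    minutes = downtime_minutes_per_year(availability)
    print(f"{label}: about {minutes:.1f} minutes of downtime per year")
```

Five nines works out to a little over five minutes per year and four nines to a little under an hour, so each additional nine cuts the allowed downtime by a factor of ten.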

A few observations:

  • It takes decades: The examples above show that it often takes decades to move up the edge case maturity ladder.
  • Some use cases are particularly difficult: The extremely high level of edge case performance required by robo-taxis (six nines) exceeds even that of Gmail. Keep in mind that self-driving software also runs on computers, much like cloud services. Yet the operational reliability required by robo-taxis must exceed what current web and cloud services can achieve!
  • Narrow applications beat general purpose: Web applications are narrowly defined use cases for cloud services. As such, web services can achieve higher uptimes than cloud services, because the more general-purpose a technology is, the harder it is to harden.

Case study: Not all autonomy is created equal

The Google engineers who left the company's self-driving car team to start their own companies shared a common thesis: narrowly defined applications of autonomy will be easier to bring to market than general autonomous driving. In 2017, Aurora was founded to transport goods via long-haul trucks on highways. Around the same time, Nuro was founded to transport goods in small cars at slower speeds.

Our team also shared this thesis when we started inside Postmates (also in 2017). We too focused on transporting goods but, unlike the others, we chose to leave cars behind and focus instead on smaller robots that operate off the street: autonomous mobile robots (AMRs). These are already widely adopted in controlled environments such as factory floors and warehouses.

Consider red light detection for delivery robots. While they should never cross on red given the risk of collision with vehicles, stopping cautiously on green does not introduce any safety risk. Therefore, a recall similar to that of robo-taxis (99.9999%) combined with modest precision (80%) would be adequate for this AI use case. This results in an F1 of roughly 90% (one nine), which is easy to achieve. By moving from the street to the sidewalk and from a full-size car to a small robot, the required AI accuracy drops from six nines to one.
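The delivery robot figures can be checked the same way, and the F1 formula can also be inverted to ask how much precision a target score demands once recall is fixed. The Python sketch below is an illustration; `precision_needed` is a hypothetical helper obtained by solving F1 = 2PR / (P + R) for P.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def precision_needed(target_f1, recall):
    """Solve F1 = 2PR / (P + R) for P. Requires 2 * recall > target_f1."""
    return target_f1 * recall / (2 * recall - target_f1)

RECALL = 0.999999  # never cross on red

# 80% precision (cautious stops on green are tolerated) with six-nines recall.
robot = f1(0.80, RECALL)
print(f"delivery robot F1 = {robot:.4f}")  # a little under 0.89

# Precision that would be required to reach exactly 90% F1 at this recall.
print(f"precision for 90% F1 = {precision_needed(0.90, RECALL):.4f}")
```

With recall pinned near 100%, an F1 of about 90% requires precision of only around 80%, which illustrates how tolerating false positives (extra stops on green) relaxes the overall requirement.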

The robots are here

Delivery AMRs are the first urban autonomy app to hit the market, while robo-taxis are still awaiting unattainable hi-fi AI performance. The pace of progress in this industry, along with our experience over the past five years, has reinforced our view that the best way to commercialize AI is to focus on narrower applications enabled by lo-fi AI and use human intervention to achieve hi-fi performance where needed. In this model, lo-fi AI leads to early commercialization, and incremental improvements thereafter help drive business KPIs.

By targeting more forgiving use cases, companies can use lo-fi AI to quickly achieve commercial success, while maintaining a realistic view of the multi-year timeline to achieve hi-fi capabilities. After all, science fiction has no place in business planning.

Ali Kashani is the co-founder and CEO of Serve Robotics.
