An Apple research paper describes how the company has been developing Ferret-UI, a generative AI system designed specifically to make sense of app screens.
The paper is somewhat vague about the potential applications of this – likely deliberately so – but the most exciting possibility would be to power a much more advanced Siri …
The challenges in going beyond ChatGPT
Large Language Models (LLMs) are what power systems like ChatGPT. The training material for these is text, mostly taken from websites.
MLLMs – or Multimodal Large Language Models – aim to extend that ability to non-textual information as well: images, video, and audio.
MLLMs aren’t currently very good at understanding the output of mobile apps. There are several reasons for this, starting with the mundane one that smartphone screen aspect ratios differ from those used by most training images.
More specifically, many of the elements they need to recognize, like icons and buttons, are very small.
Additionally, rather than comprehending everything in a single pass, as they would when interpreting a static image, they need to be able to interact with the app.
Apple’s Ferret-UI
These are the problems Apple researchers believe they have solved with the MLLM system they call Ferret-UI (the UI standing for user interface). In the paper's own words:
Given that UI screens typically exhibit a more elongated aspect ratio and contain smaller objects of interest (e.g., icons, texts) than natural images, we incorporate “any resolution” on top of Ferret to magnify details and leverage enhanced visual features […]
We meticulously gather training samples from an extensive range of elementary UI tasks, such as icon recognition, find text, and widget listing. These samples are formatted for instruction-following with region annotations to facilitate precise referring and grounding. To augment the model’s reasoning ability, we further compile a dataset for advanced tasks, including detailed description, perception/interaction conversations, and function inference.
The result, they say, is better than both GPT-4V and other existing UI-focused MLLMs.
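The quoted passage does not spell out how “any resolution” works in practice. As a rough illustration only, one plausible reading is that an elongated screenshot is cut into sub-images along its longer side, so that each piece is closer to the shape of ordinary training images and small icons take up more of the frame. The sketch below assumes a simple two-way split. The function names `split_screen` and `aspect_ratio` are hypothetical and are not taken from Apple's code.

```python
from typing import List, Tuple

# A crop box in pixels: (left, top, right, bottom).
Box = Tuple[int, int, int, int]


def split_screen(width: int, height: int) -> List[Box]:
    """Split a screenshot into two crop boxes along its longer side.

    Assumption for illustration: portrait screens are cut into a top and
    bottom half, landscape screens into a left and right half.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if height >= width:
        # Portrait (or square): top half, then bottom half.
        mid = height // 2
        return [(0, 0, width, mid), (0, mid, width, height)]
    # Landscape: left half, then right half.
    mid = width // 2
    return [(0, 0, mid, height), (mid, 0, width, height)]


def aspect_ratio(box: Box) -> float:
    """Return width divided by height for a crop box."""
    left, top, right, bottom = box
    return (right - left) / (bottom - top)


if __name__ == "__main__":
    # Example: a 1170 x 2532 portrait screenshot.
    for box in split_screen(1170, 2532):
        print(box, round(aspect_ratio(box), 3))
```

For a 1170 x 2532 portrait screenshot, each half is 1170 x 1266, which is much closer to square than the full screen. A real system would presumably also keep a view of the whole screen for context, but the paper excerpt quoted here does not say so.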
From UI development, to a highly advanced Siri
The paper describes what they have achieved, rather than how it might be used. That is typical of many research papers, and there can be a couple of reasons for this.
First, the researchers themselves may not know how their work might end up being used. They are focused on solving a technical problem, not on the potential applications. It may take a product person to see potential ways to make use of it.
Second, especially where Apple is concerned, they may be instructed not to disclose the intended use, or to be deliberately vague about it.
But we could see three potential ways this ability might be used …
One, it could be a useful tool for evaluating the effectiveness of a UI. A developer could create a draft version of an app, then let Ferret-UI determine how easy or difficult it is to understand, and to use. This could be both quicker and cheaper than human usability testing.
Two, it could have accessibility applications. Rather than a simple screen-reader reading everything on an iPhone screen to a blind person, for example, it could summarize what the screen shows, and list the options available. The user could then tell iOS what they want to do, and let the system do it for them.
Apple provides an example of this, where Ferret-UI is presented with a screen containing podcast shows. The system’s output is: “The screen is for a podcast application where users can browse and play new and notable podcasts, with options to play, download, and search for specific podcasts.”
Three – and most exciting of all – it could be used to power a very advanced form of Siri, where a user could give Siri an instruction like “Check flights from JFK to Boston tomorrow, and book a seat on a flight that will get me there by 10am with a total fare below $200.” Siri would then interact with the airline app to carry out the task.
Thanks, AK. 9to5Mac composite image from Solen Feyissa on Unsplash and Apple.