A new AI agent has emerged from TikTok's parent company ByteDance that can take control of your computer and perform complex, multi-step tasks.
Much like Anthropic's Computer Use, ByteDance's new UI-TARS understands graphical user interfaces (GUIs), applies reasoning and takes autonomous, step-by-step action.
Trained on roughly 50B tokens and offered in 7B and 72B parameter versions, the PC/MacOS agents achieve state-of-the-art (SOTA) performance on more than 10 GUI benchmarks covering perception, grounding and general agent capabilities, consistently beating out OpenAI's GPT-4o, Claude and Google's Gemini.
"Through iterative training and reflection, UI-TARS continuously learns from its mistakes and adapts to unforeseen situations with minimal human intervention," ByteDance and Tsinghua University researchers wrote in a new research paper.
UI-TARS works across desktop, mobile and web, using multimodal inputs (text, images, interactions) to understand visual environments.
Its UI consists of two tabs – one on the left showing its step-by-step "thinking" in detail, and a larger one on the right where it pulls up files, websites and apps and automatically takes action.
For example, in a demo video released today, the model is prompted to "Find round-trip flights from SEA to NYC on the 5th and back on the 10th of next month and filter by the highest price."
In response, UI-TARS goes to the Delta Airlines website, fills in the "from" and "to" fields, clicks the relevant dates, then sorts and filters by price, explaining each step in its thought box before taking action.
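To make the pattern concrete, here is a minimal Python sketch of the kind of thought-then-action trace the demo depicts. The Step structure, field names and specific actions below are illustrative assumptions, not ByteDance's actual interface.

```python
# Illustrative sketch (not ByteDance's code) of a thought/action trace like the
# flight-search demo: each step pairs a written "thought" with a GUI action.
from dataclasses import dataclass

@dataclass
class Step:
    thought: str   # shown in the left-hand "thinking" pane
    action: str    # GUI action the agent performs
    target: str    # UI element the action applies to
    value: str = ""

# Hypothetical trace for the round-trip flight prompt
trace = [
    Step("Open the airline's site to search flights.", "open_url", "browser", "https://www.delta.com"),
    Step("Fill in the departure airport.", "type", "from_field", "SEA"),
    Step("Fill in the destination airport.", "type", "to_field", "NYC"),
    Step("Select the outbound and return dates.", "click", "date_picker", "5th / 10th of next month"),
    Step("Apply the price filter before reviewing results.", "click", "price_filter"),
]

for step in trace:
    print(f"Thought: {step.thought}\nAction: {step.action} -> {step.target} {step.value}".strip())
```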
In another case, it is prompted to install the autoDocstring extension in VS Code. Here is some of its thinking as it completes that task:
Across a variety of benchmarks, the researchers report that UI-TARS consistently outperformed OpenAI's GPT-4o; Anthropic's Claude-3.5-Sonnet; Gemini-1.5-Pro and Gemini-2.0; four Qwen models; and numerous academic models.
For example, on VisualWebBench – which tests a model's ability to ground web elements, including webpage QA and optical character recognition – UI-TARS 72B scored 82.8%, beating GPT-4o (78.5%) and Claude 3.5 (78.2%).
It also performed well on WebSRC (understanding the semantic content and layout of webpages) and ScreenQA-short (comprehension of complex mobile screen and web layouts). UI-TARS-7B achieved a leading score of 93.6% on WebSRC, while UI-TARS-72B scored 88.6% on ScreenQA-short, beating out Qwen, Gemini, Claude 3.5 and GPT-4o.
"These results demonstrate UI-TARS's superior perception and comprehension of web and mobile environments," the researchers wrote. "This kind of perceptual ability lays the foundation for agent tasks, where an accurate understanding of the environment is essential for task execution and decision-making."
UI-TARS also showed impressive results on ScreenSpot Pro and ScreenSpot v2, which assess a model's ability to understand and localize elements in GUIs. Further, the researchers tested its ability to plan multi-step actions and low-level tasks in mobile settings, and benchmarked it on OSWorld (which assesses open-ended computer tasks) and AndroidWorld (which scores autonomous agents on 116 tasks across 20 mobile apps).
To help it take actions and recognize what it is seeing, UI-TARS was trained on a large-scale dataset of screenshots with parsed metadata, including element descriptions and types, visual descriptions, bounding boxes (position information), element functions and text from various websites, applications and operating systems. This allows the model to provide a comprehensive, detailed description of a screenshot, capturing not only the elements but also their spatial relationships and the overall layout.
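As a rough illustration, the hypothetical Python record below shows what one parsed element's metadata could contain: its type, visual description, bounding box, function and text. The field names are assumptions, not the paper's actual schema.

```python
# Hypothetical per-element metadata record (field names assumed, not from the paper)
element_record = {
    "element_type": "button",
    "visual_description": "blue rounded button with white label",
    "bounding_box": {"x": 412, "y": 630, "width": 120, "height": 36},  # pixel coordinates
    "function": "submits the flight search form",
    "text": "Search flights",
    "source": {"platform": "web", "app": "airline booking page"},
}

print(element_record["text"], element_record["bounding_box"])
```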
The model also uses state transition captioning to identify and describe the differences between two consecutive screenshots and determine whether an action – such as a mouse click or keyboard input – has occurred. Meanwhile, set-of-mark (SoM) prompting allows it to overlay distinct marks (letters, numbers) on specific regions of an image.
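A simple sketch of both ideas follows, using plain dictionaries in place of real screenshots; the helper names and data layout are assumptions for illustration only.

```python
# Rough sketch: set-of-mark labeling and state transition captioning over toy UI states.
def set_of_mark(elements):
    """Assign a distinct mark (1, 2, 3, ...) to each region so it can be referred to by mark."""
    return {str(i + 1): el for i, el in enumerate(elements)}

def transition_caption(before, after):
    """Describe what changed between two consecutive UI states."""
    changes = [key for key in after if after[key] != before.get(key)]
    return f"Changed elements: {changes}" if changes else "No visible change."

frame_a = {"search_button": "idle", "results_panel": "empty"}
frame_b = {"search_button": "clicked", "results_panel": "showing 12 flights"}

print(set_of_mark([{"name": name} for name in frame_a]))  # e.g. {'1': {'name': 'search_button'}, ...}
print(transition_caption(frame_a, frame_b))               # infers that a click occurred
```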
The model has both short-term and long-term memory, handling the task at hand while retaining past interactions to inform future decisions. The researchers trained it to perform both System 1 (fast, automatic and intuitive) and System 2 (slow and deliberate) reasoning. This allows for multi-step decision-making, "reflection" thinking, recognizing key milestones and correcting errors.
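The sketch below shows one way such a design could be wired together: a memory object with short- and long-term stores, a fast path for routine steps and a deliberate path that consults memory before acting. All names and logic are assumptions, not the model's actual implementation.

```python
# Illustrative sketch of short-/long-term memory plus fast vs. deliberate decision paths
class AgentMemory:
    def __init__(self):
        self.short_term = []   # context for the task at hand
        self.long_term = []    # past interactions kept for future decisions

    def remember(self, event, durable=False):
        self.short_term.append(event)
        if durable:
            self.long_term.append(event)

def decide(step_description, memory, routine=True):
    if routine:
        # "System 1": act immediately on a familiar step
        return f"act: {step_description}"
    # "System 2": reason explicitly over recent history, then act
    context = memory.long_term[-3:]
    return f"reflect on {context} -> act: {step_description}"

mem = AgentMemory()
mem.remember("opened booking site", durable=True)
print(decide("click the date picker", mem))                            # fast path
print(decide("recover from an unexpected popup", mem, routine=False))  # deliberate path
```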
The researchers emphasized that it is critical for the model to maintain consistent goals and engage in trial and error, hypothesizing, testing and evaluating potential actions before completing a task. They introduced two types of data to support this: error-correction data and post-reflection data. For error correction, they identified mistakes and labeled corrective actions; for post-reflection, they labeled recovery steps.
“This approach ensures that the agent not only learns to avoid errors but also dynamically adapts when they occur,” the researchers write.
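As a hypothetical illustration of what such training records could look like, the sketch below pairs a mistake with its corrective action and a reflection with its recovery step; the schemas are assumed, not taken from the paper.

```python
# Hypothetical error-correction and post-reflection training samples (schemas assumed)
error_correction_sample = {
    "state": "date picker open",
    "mistaken_action": "clicked the wrong month",
    "label": "error",
    "corrective_action": "navigate forward one month, then select the 5th",
}

post_reflection_sample = {
    "trajectory": ["typed SEA", "typed NYC", "popup blocked the form"],
    "reflection": "the popup must be dismissed before continuing",
    "recovery_action": "close the popup, then resume filling the dates",
}

for sample in (error_correction_sample, post_reflection_sample):
    print(sample)
```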
Clearly, UI-TARS shows impressive potential, and it will be interesting to see its use cases unfold in the increasingly competitive AI agent space. As the researchers put it: "Looking to a future where agents are at the forefront, the path lies in the integration of active and lifelong learning, where agents drive their own continuous learning through real-world interactions."
The researchers point out that Claude Computer Use "performs strongly on the web but struggles significantly in mobile scenarios, indicating that Claude's GUI capabilities have not transferred well to the mobile domain."
In contrast, “UI-TARS shows good performance on both websites and mobile devices.”