Vision Exploration (Experimental)

Warning: Vision Exploration is experimental and currently supported only on Windows.

What is Vision Exploration?

Vision Exploration is an assistant that operates desktop applications for you. It reads screenshots of your screen much as a person would, then plans and executes mouse clicks, keyboard input, and other actions to complete a task.

Requirements

Your client device must grant the Agent the following permissions:

Screen capture: lets the Agent see your screen
Computer control: lets the Agent simulate mouse and keyboard actions

How to Use It

Describe the complete goal you want to achieve in natural language. The Agent plans and executes the steps automatically.

Good examples:

"Open Tencent Meeting and create a meeting"
"Open DingTalk and send a message to xxx"

Bad examples:

"Click the lower-left corner" - do not provide step-by-step instructions; let the Agent plan
"Operate the computer" - too vague; the goal needs to be explicit

What the Agent Can Do

Action	Description	Example
Single click	Click buttons, links, and other on-screen elements	Click the `Confirm` button
Double click	Open files or select text	Double-click a document on the desktop
Right click	Open context menus	Right-click a file to view properties
Type text	Enter content into an input box	Enter keywords into a search box
Open app	Launch a desktop application	Open a browser or Notepad
Keyboard input	Simulate keys or shortcuts	`Ctrl+S` to save, `Enter` to confirm
Hover	Trigger menus or tooltips by moving the mouse over an element	Hover to view a tooltip
Drag	Drag from one location to another	Drag a file into a folder
Scroll	Scroll vertically or horizontally	Scroll down to see more
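
To make the table concrete, here is a minimal sketch of how such an action vocabulary could be modeled in code. It is not the product's actual data model: `ActionType`, `Action`, and `describe` are hypothetical names chosen for illustration, and the fields are assumptions about what each action would need.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class ActionType(Enum):
    """One member per row of the action table (names are illustrative)."""
    SINGLE_CLICK = "single_click"
    DOUBLE_CLICK = "double_click"
    RIGHT_CLICK = "right_click"
    TYPE_TEXT = "type_text"
    OPEN_APP = "open_app"
    KEYBOARD_INPUT = "keyboard_input"
    HOVER = "hover"
    DRAG = "drag"
    SCROLL = "scroll"


@dataclass(frozen=True)
class Action:
    """A single step. Which fields are set depends on the action kind."""
    kind: ActionType
    target: Optional[Tuple[int, int]] = None  # assumed: screen coordinates (x, y)
    end: Optional[Tuple[int, int]] = None     # assumed: drag destination (x, y)
    text: Optional[str] = None                # text to type, app name, or key combo


def describe(action: Action) -> str:
    """Render an action as a short human-readable line for a step list."""
    parts = [action.kind.value]
    if action.target is not None:
        parts.append(f"at {action.target}")
    if action.end is not None:
        parts.append(f"to {action.end}")
    if action.text is not None:
        parts.append(f"with {action.text!r}")
    return " ".join(parts)


save = Action(ActionType.KEYBOARD_INPUT, text="Ctrl+S")
move = Action(ActionType.DRAG, target=(10, 20), end=(300, 400))
print(describe(save))
print(describe(move))
```

Keeping every action in one small, immutable record is a design choice for the sketch: it makes a plan easy to log and display step by step.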

Execution Process

Observe the screen: the Agent captures the current screen
Analyze and think: the AI examines the screenshot and decides on the next action
Perform the action: the Agent executes a single action, such as clicking a button
Check the result: the Agent compares screenshots taken before and after the action to see whether it succeeded
Repeat: the Agent repeats these steps until the task is complete

Throughout the process, you can see each action in real time in the step list.
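
The observe, analyze, act, check, repeat cycle above can be sketched as a simple loop. This is an illustrative outline under stated assumptions, not the Agent's real implementation: `run_task` and its `capture`, `decide`, and `perform` parameters are hypothetical, and the "screen" here is a simulated string rather than a real screenshot.

```python
from typing import Callable, Dict, List, Optional


def run_task(
    goal: str,
    capture: Callable[[], bytes],
    decide: Callable[[str, bytes], Optional[str]],
    perform: Callable[[str], None],
    max_steps: int = 20,
) -> List[str]:
    """Run the loop until decide() returns None or max_steps is reached."""
    step_list: List[str] = []
    before = capture()                      # observe the screen
    for _ in range(max_steps):
        action = decide(goal, before)       # analyze and pick the next action
        if action is None:                  # nothing left to do
            break
        perform(action)                     # perform exactly one action
        after = capture()                   # observe again
        changed = after != before           # check: did the screen change?
        step_list.append(f"{action}: {'ok' if changed else 'no visible change'}")
        before = after                      # repeat from the new state
    return step_list


# Simulated environment so the sketch runs without touching a real desktop.
screen: Dict[str, str] = {"state": "desktop"}


def capture() -> bytes:
    return screen["state"].encode()


def decide(goal: str, shot: bytes) -> Optional[str]:
    if shot == b"desktop":
        return "open_app Notepad"
    if shot == b"notepad":
        return "type_text hello"
    return None


def perform(action: str) -> None:
    transitions = {"open_app Notepad": "notepad", "type_text hello": "notepad+text"}
    screen["state"] = transitions[action]


log = run_task("Open Notepad and type hello", capture, decide, perform)
print(log)
```

The `max_steps` cap is an assumption added so that a task that never converges still terminates. The returned list plays the role of the step list.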

Notes

Make sure the target application's interface is visible and not covered by other windows. You can use Floating Window Mode to hide the bit-Agent window.
If an action fails, the Agent automatically tries alternative methods.
Complex tasks may take many steps, so please be patient.
You can stop a task at any time if you need to cancel it midway.
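
As a rough illustration of the note about failed actions, the sketch below tries a list of alternative methods in order and stops at the first one that succeeds. How the Agent actually chooses its alternatives is not documented here; `try_alternatives` and the two example strategies are hypothetical.

```python
from typing import Callable, Iterable, Optional, Tuple


def try_alternatives(
    strategies: Iterable[Tuple[str, Callable[[], bool]]],
) -> Optional[str]:
    """Return the name of the first strategy that reports success, else None."""
    for name, attempt in strategies:
        try:
            if attempt():
                return name
        except Exception:
            # An attempt that raises counts as a failure; move on to the next one.
            continue
    return None


def click_save_button() -> bool:
    # Simulated failure, e.g. the button could not be found on screen.
    return False


def press_ctrl_s() -> bool:
    # Simulated success of the keyboard shortcut.
    return True


used = try_alternatives([
    ("click Save button", click_save_button),
    ("press Ctrl+S", press_ctrl_s),
])
print(used)
```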
