TomorrowBye | WAKAJA

前面七篇文章，我们让 Agent 拥有了：

安全的沙箱环境
持久化的任务执行
多模态处理能力
工具调用能力
长期记忆

但 Agent 依然无法独立完成很多任务，比如：

"帮我在购物网站上搜索最便宜的笔记本电脑"
"自动填写这个表单并提交"
"检查网站上是否有新的通知"

这些任务需要 Agent 具备视觉能力 - 理解网页 UI、识别元素、执行操作。

今天我们聊聊：如何让 Agent 看懂网页并自主操作。

为什么 Agent 需要视觉能力？

传统自动化的局限

Selenium/Puppeteer 的问题：

// 传统方式：依赖固定选择器
await page.click("#submit-button")
await page.fill("[name=username]", "john")
 
// 网站改版后立即失效
// button id 从 'submit-button' 改为 'btn-submit' → 脚本崩溃

问题：

脆弱性：依赖 CSS 选择器，网站改版即失效
无上下文理解：不知道"提交按钮"的语义
无容错能力：找不到元素就崩溃

Agent 的优势

// Agent 方式：基于语义理解
await agent.execute("点击提交按钮")
 
// 网站改版后依然有效
// Agent 通过视觉理解找到"提交"按钮（无论 id 是什么）

优势：

鲁棒性：基于视觉和语义，不依赖 DOM 结构
智能决策：理解上下文，选择最合适的操作
自我修复：遇到异常能尝试替代方案

Agent 如何理解 UI？

方案一：视觉 API + 元素标注

原理： 截图 → 标注元素 → Vision API 识别

import Anthropic from "@anthropic-ai/sdk"
import puppeteer from "puppeteer"
 
async function detectUIElements(page: puppeteer.Page) {
  // 1. 截图并标注
  const screenshot = await page.screenshot({
    encoding: "base64",
    fullPage: true,
  })
 
  // 2. Vision API 分析
  const anthropic = new Anthropic()
  const response = await anthropic.messages.create({
    model: "claude-3-5-sonnet-20241022",
    max_tokens: 1024,
    messages: [
      {
        role: "user",
        content: [
          {
            type: "image",
            source: {
              type: "base64",
              media_type: "image/png",
              data: screenshot,
            },
          },
          {
            type: "text",
            text: "识别页面中的所有交互元素（按钮、输入框、链接），并返回它们的位置和功能",
          },
        ],
      },
    ],
  })
 
  return response.content[0].text
}

优点： 简单直接
缺点： 成本高（Vision API token 消耗大）、速度慢

方案二：Accessibility Tree + Vision 混合

核心思路： 用 Accessibility Tree 提供结构信息 + Vision API 理解布局

async function getAccessibilitySnapshot(page: puppeteer.Page) {
  const client = await page.target().createCDPSession()
 
  // 获取无障碍树
  const { nodes } = await client.send("Accessibility.getFullAXTree")
 
  // 过滤交互元素
  const interactiveElements = nodes.filter((node) =>
    ["button", "textbox", "link", "combobox"].includes(node.role),
  )
 
  return interactiveElements.map((node) => ({
    role: node.role,
    name: node.name,
    description: node.description,
    bounds: node.boundsInViewport,
  }))
}
 
// 结合视觉上下文
async function findElement(page: puppeteer.Page, instruction: string) {
  const axTree = await getAccessibilitySnapshot(page)
  const screenshot = await page.screenshot({ encoding: "base64" })
 
  const response = await anthropic.messages.create({
    model: "claude-3-5-sonnet-20241022",
    messages: [
      {
        role: "user",
        content: [
          {
            type: "image",
            source: {
              type: "base64",
              media_type: "image/png",
              data: screenshot,
            },
          },
          {
            type: "text",
            text: `可交互元素：\n${JSON.stringify(axTree, null, 2)}\n\n任务：${instruction}\n\n返回应该操作的元素索引`,
          },
        ],
      },
    ],
  })
 
  return response
}

优点： 成本低、速度快、准确率高
缺点： 需要 CDP (Chrome DevTools Protocol) 支持

实战：自主填写表单 Agent

完整示例：让 Agent 自动完成用户注册

import puppeteer from "puppeteer"
import Anthropic from "@anthropic-ai/sdk"
 
class FormAgent {
  private page: puppeteer.Page
  private anthropic: Anthropic
 
  constructor(page: puppeteer.Page) {
    this.page = page
    this.anthropic = new Anthropic()
  }
 
  async fillForm(userData: Record<string, any>) {
    // 1. 分析表单结构
    const formFields = await this.analyzeForm()
 
    // 2. 智能匹配字段
    const fieldMapping = await this.mapFields(formFields, userData)
 
    // 3. 执行填写
    for (const [fieldId, value] of Object.entries(fieldMapping)) {
      await this.fillField(fieldId, value)
    }
 
    // 4. 提交表单
    await this.submitForm()
  }
 
  private async analyzeForm() {
    const axTree = await this.getAccessibilityTree()
    const screenshot = await this.page.screenshot({ encoding: "base64" })
 
    const response = await this.anthropic.messages.create({
      model: "claude-3-5-sonnet-20241022",
      messages: [
        {
          role: "user",
          content: [
            {
              type: "image",
              source: {
                type: "base64",
                media_type: "image/png",
                data: screenshot,
              },
            },
            {
              type: "text",
              text: `表单元素：\n${JSON.stringify(axTree)}\n\n识别所有输入字段，返回 JSON 格式：[{id, label, type, required}]`,
            },
          ],
        },
      ],
    })
 
    return JSON.parse(response.content[0].text)
  }
 
  private async mapFields(formFields: any[], userData: Record<string, any>) {
    const response = await this.anthropic.messages.create({
      model: "claude-3-5-sonnet-20241022",
      messages: [
        {
          role: "user",
          content: `表单字段：${JSON.stringify(formFields)}\n用户数据：${JSON.stringify(userData)}\n\n匹配字段并返回填写方案 JSON`,
        },
      ],
    })
 
    return JSON.parse(response.content[0].text)
  }
 
  private async fillField(fieldId: string, value: string) {
    const element = await this.page.$(`[id='${fieldId}']`)
    if (element) {
      await element.type(value, { delay: 50 }) // 模拟人类输入速度
    }
  }
 
  private async submitForm() {
    const axTree = await this.getAccessibilityTree()
    const submitButton = axTree.find(
      (node) =>
        node.role === "button" && /submit|register|sign up/i.test(node.name),
    )
 
    if (submitButton) {
      await this.page.click(`[id='${submitButton.id}']`)
    }
  }
 
  private async getAccessibilityTree() {
    const client = await this.page.target().createCDPSession()
    const { nodes } = await client.send("Accessibility.getFullAXTree")
    return nodes.filter((n) => n.role && n.role !== "generic")
  }
}
 
// 使用示例
const browser = await puppeteer.launch()
const page = await browser.newPage()
await page.goto("https://example.com/register")
 
const agent = new FormAgent(page)
await agent.fillForm({
  email: "user@example.com",
  username: "john_doe",
  password: "secure123",
  phoneNumber: "1234567890",
})
 
await browser.close()

最佳实践

1. 成本优化

问题： Vision API 很贵（每张图 ~$0.005）

优化策略：

// ❌ 每次操作都截图
await agent.click("按钮") // 内部截图 → Vision API
 
// ✅ 缓存页面状态
const pageState = await agent.captureState() // 一次截图
await agent.click("按钮", { state: pageState }) // 复用
await agent.fill("输入框", "内容", { state: pageState }) // 复用

2. 容错机制

async function robustClick(page: puppeteer.Page, target: string) {
  const strategies = [
    () => clickByVision(page, target), // 策略 1：视觉识别
    () => clickByAccessibility(page, target), // 策略 2：无障碍树
    () => clickByText(page, target), // 策略 3：文本匹配
    () => clickByXPath(page, target), // 策略 4：XPath
  ]
 
  for (const strategy of strategies) {
    try {
      await strategy()
      return // 成功则退出
    } catch (error) {
      console.log(`策略失败：${strategy.name}，尝试下一个`)
    }
  }
 
  throw new Error(`所有策略失败：无法点击 ${target}`)
}

3. 人类化操作

// ❌ 机器人式操作
await page.click("#button")
 
// ✅ 模拟人类行为
await page.hover("#button") // 先悬停
await new Promise((r) => setTimeout(r, 200 + Math.random() * 300)) // 随机延迟
await page.click("#button") // 再点击
await page.mouse.move(Math.random() * 10, Math.random() * 10) // 微小移动

现成工具推荐

如果不想从零开始，可以用：

1. Anthropic Computer Use (实验性)

让 Claude 直接操作电脑
支持鼠标、键盘、截图

2. GPT-4 Vision + Browser Extension

Chrome 插件捕获页面
GPT-4 Vision 分析操作

3. Playwright + AI

Playwright 提供稳定的浏览器控制
AI 提供智能决策

总结

核心要点：

视觉能力 = 理解 UI + 自主操作
混合方案最优：Accessibility Tree + Vision API
成本优化：缓存状态、减少 API 调用
容错设计：多策略备份、智能重试

下一步行动：

尝试 Puppeteer + Vision API 组合
实现一个简单的表单填写 Agent
探索 Anthropic Computer Use API

系列完结！ 8 篇文章完整覆盖 Agent 基础设施核心能力。接下来的实战看你的了！🚀