Case Study · Nov 2025 to Feb 2026

Pre-Flight Agent

A pre-production review system for the AIGC short-drama industry. Before any video is generated, it fills in missing requirements, estimates cost, screens for compliance, and generates professional prompts.

Workflow preview (pre-flight-agent)
Input: user's natural-language description
Step 0 · Intent Classification (Qwen3-30B)
  • New Request → Reset Memory
  • Supplement → Read and Merge (ERNIE-4.0)
  • Confirm → Done
Step 1 · Smart Interceptor, three-tier gates (DeepSeek-V3.2): PASS / NEED_EDIT / REJECT
Step 2.5 · Cost Estimator (ERNIE-4.0)
Step 2.6 · Decision Router (Python code node)
Step 3 · Prompt Generator + RAG (ERNIE-4.0 + knowledge base)
Step 4 · User Response Generator (ERNIE-Speed)
Project Snapshot

One project.
One methodology.

Type: AIGC Agent System Design
Duration: Nov 2025 to Feb 2026
My Role: Solo Design & Build (Lead)
Platform: Baidu Qianfan AppBuilder
Stage: MVP Validated · Pre-commercial
Versions: v0.1 → v0.2 → v0.2.5 → v0.3 Beta
Tech Tags: Multi-Model Routing · RAG · Multi-turn Memory · Qwen3-30B · DeepSeek-V3.2 · ERNIE-4.0 · Cost Estimation · Prompt Engineering · Workflow Orchestration
Problem Background

The real cost of
blind production

The short-drama industry is adopting AIGC at scale, but nobody is gatekeeping before generation. Every production run is a blind box, and the cost of trial and error is very high.

  • Vague requirements go straight to production
    A creator says "I want a xianxia video" and generation starts, with no duration, resolution, or budget specified. The result often misses expectations entirely.
  • Budget overruns with no warning
    Without upfront cost estimation, users discover only after generation that the cost far exceeds their budget, and regenerating doubles the loss.
  • Severe hallucination on domain terms
    General-purpose LLMs do not know the correct English prompt wording for xianxia terms such as 御剑飞行 (sword-riding flight) or 丹炉 (alchemy furnace), so the output is generic and drifts off style.
  • Script breakdown is very slow
    Writing storyboard prompts by hand requires familiarity with the English keyword vocabulary of AIGC tools. One short-drama episode takes several hours, sometimes a full day.

This is not a problem a "generation tool" can solve. User requirements arrive incrementally, information is scattered, judgment spans several dimensions, and optimization is iterative. Only an agent can handle multi-turn conversation memory, dynamic routing, and cross-node state management.

30%+
Projected reduction in wasted production cost
<15%
Cost estimation error (when parameter extraction succeeds)
42s
Avg response time
System Architecture

Multi-model routing
workflow orchestration

Different nodes use different models. The core principle: put the right model on the right task instead of running the most expensive one end to end.

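Full Workflow · v0.3 Beta
User input: natural-language description of the request
Step 0 · Intent Classification (Qwen3-30B, three-way classification)
  • New request → Reset Memory (clear history)
  • Supplement → Memory Read and Merge (ERNIE-4.0, fuses old and new input)
  • Confirm → output the production instruction
Flow Input Selector (Python code, picks the final text)
Step 1 · Smart Interceptor (DeepSeek-V3.2, three-tier red-line check)
  • REJECT → blocked, flow ends
  • NEED_EDIT → ask for the missing fields
  • PASS → continue
Step 2.5 · Cost Estimator (ERNIE-4.0, parameter extraction + formula calculation)
Step 2.6 · Decision Router (Python code, APPROVE / EDIT)
Step 3 · Enterprise Generator (ERNIE-4.0 + RAG knowledge base covering xianxia and camera language, three prompt tiers)
Step 4 · User Response Generator (ERNIE-Speed, JSON → natural language)
User receives the reply: cost + decision + three prompt tiers
Legend: natural-language model · safety-check model · calculation/generation model · RAG-augmented generation · code node (pure logic) · front-end display layer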
STEP 0 · Qwen3-30B
Why Qwen for intent classification?
Intent classification is a three-class task, so what matters is classification accuracy, not generation quality. Qwen3-30B was more stable on Chinese natural-language understanding and text classification, with a lower misclassification rate than the other candidate models tested.
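The three-way branch that follows Step 0 can be sketched in Python. This is a minimal illustration under stated assumptions, not the Qianfan workflow itself: the label strings, the `Session` class, and `route_intent` are hypothetical names, and the merge step (handled by ERNIE-4.0 in the workflow) is replaced here by plain concatenation.

```python
from dataclasses import dataclass, field

# Hypothetical labels for the three intent classes.
NEW_REQUEST, SUPPLEMENT, CONFIRM = "new_request", "supplement", "confirm"


@dataclass
class Session:
    """Hypothetical per-conversation memory."""
    history: list[str] = field(default_factory=list)


def route_intent(intent: str, text: str, session: Session) -> str:
    """Update memory according to the intent and return the text for later steps."""
    if intent == NEW_REQUEST:
        session.history = [text]        # reset memory and start over
    elif intent == SUPPLEMENT:
        session.history.append(text)    # stand-in for the LLM merge step
    elif intent == CONFIRM:
        pass                            # nothing new to store on confirmation
    else:
        raise ValueError(f"unknown intent: {intent!r}")
    return " ".join(session.history)
```

A new request wipes earlier turns, while a supplement accumulates onto them, which is what lets users supply duration, resolution, and budget over several messages.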
STEP 1 · DeepSeek-V3.2
Why DeepSeek for compliance?
Compliance checking needs a deep reading of semantic risk: "dark-style combat" and "graphic violence" are not the same thing. DeepSeek was more stable than the candidate models on safety alignment and structured JSON output, with a lower misjudgment rate.
STEP 2.5 + STEP 3 · ERNIE-4.0
Why ERNIE for cost & prompts?
ERNIE-4.0 has a clear advantage in Chinese content understanding and generation, and gave the best results in combination with the xianxia RAG knowledge base. It also integrates most tightly with Qianfan workflows on Baidu's platform and had the lowest call latency.
STEP 2.6 · Python Code
Why use code instead of a model for routing?
Decision routing is deterministic logic: missing fields → EDIT, violations → REJECT, over budget → append cost-reduction suggestions. Using an LLM for if-else logic is wasteful and unstable. A code node is fully predictable and adds almost no latency.
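A minimal sketch of such a code node, assuming upstream steps have already produced a violation flag, a list of missing fields, a cost estimate, and a budget. The function name, the argument names, and the choice to keep an over-budget request at APPROVE with a tip attached are illustrative assumptions; the case study does not show the actual node.

```python
from typing import Optional


def route_decision(
    violation: bool,
    missing_fields: list[str],
    estimated_cost: Optional[float],
    budget: Optional[float],
) -> dict:
    """Deterministic routing: the same inputs always produce the same decision."""
    if violation:
        # Hard reject, deliberately without any suggestions.
        return {"decision": "REJECT", "tips": []}
    if missing_fields:
        return {"decision": "EDIT", "ask_for": missing_fields, "tips": []}
    tips = []
    if estimated_cost is not None and budget is not None and estimated_cost > budget:
        # Assumed wording; real cost-reduction suggestions would be more specific.
        tips.append("Estimated cost exceeds budget; consider a shorter duration or lower resolution.")
    return {"decision": "APPROVE", "tips": tips}
```

Because the branches are plain comparisons, each one can be unit-tested exhaustively, which is not possible when a model makes the same decision.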
Workflow Breakdown

Every node,
one clear responsibility


Key Design Decisions

Why this design,
not that?

Risk Control Design
Why three-tier blocking, not one?
Compliance, completeness, and budget are three different kinds of problem, and each needs different handling. Compliance needs a hard reject, with no optimization suggestions so they cannot be used to work around the check. Completeness needs follow-up questions. Budget needs a concrete cost-reduction path. Handling them together muddles the logic and fragments the user experience.
RAG Design
Why build a vertical knowledge base?
General-purpose LLMs do not reliably produce the professional English prompt combinations for xianxia terms such as 御剑飞行 (sword-riding flight). Left to improvise, the model hallucinates phrases like "flying man on sword," and generation quality is very poor. The RAG knowledge base hard-binds each Chinese intent to a professional prompt, which removes this class of hallucination for the terms it covers, and its negative prompts prevent style drift.
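The hard binding can be pictured as a lookup over entries with the Key / Context / Value / Neg schema listed in the tech stack. The sketch below is an assumption-laden stand-in: the two entries and their prompt text are invented for illustration, and substring matching replaces whatever retrieval the Qianfan knowledge base actually performs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KBEntry:
    key: str      # Chinese term as the user writes it
    context: str  # scene type the entry applies to
    value: str    # curated English prompt fragment
    neg: str      # negative prompt guarding against style drift


# Hypothetical entries; the real knowledge base content is not shown in this case study.
KB = [
    KBEntry("御剑飞行", "action",
            "cultivator standing on a flying sword, robes billowing, sea of clouds",
            "anime chibi, modern clothing"),
    KBEntry("丹炉", "prop",
            "ancient bronze alchemy furnace, glowing embers, rising incense smoke",
            "modern kitchen, plastic"),
]


def bind_terms(user_text: str) -> tuple[list[str], list[str]]:
    """Collect curated prompt and negative fragments for every KB key found in the text."""
    hits = [entry for entry in KB if entry.key in user_text]
    return [e.value for e in hits], [e.neg for e in hits]
```

Terms with no entry return nothing, so a caller can tell the difference between curated wording and text the model would have to improvise.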
Cost Estimation Design
Why cost ranges, not exact figures?
AIGC generation cost depends on several runtime factors (server load, model version, queue time), so an exact unit price cannot be guaranteed. A range (lower bound ×0.85, upper bound ×1.15) states the uncertainty honestly and still gives users enough room to make a budget decision, avoiding the loss of trust that a precise but wrong number causes.
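As a worked example of the range, here is a small sketch that assumes both bounds are derived from a single point estimate. The function name is hypothetical, and the formula that produces the point estimate is not shown in this case study.

```python
def cost_range(point_estimate: float,
               lower_factor: float = 0.85,
               upper_factor: float = 1.15) -> tuple[float, float]:
    """Turn a point estimate into a (low, high) range, rounded to two decimals."""
    if point_estimate < 0:
        raise ValueError("cost estimate cannot be negative")
    low = round(point_estimate * lower_factor, 2)
    high = round(point_estimate * upper_factor, 2)
    return low, high


low, high = cost_range(100.0)
print(f"Estimated cost: {low:.2f} to {high:.2f}")
```

Rounding after multiplication keeps floating-point noise out of the figures shown to the user.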
Iteration & Failure Analysis

From v0.1 to v0.3,
every failure had a cost

v0.1
General Pre-Production Review · Baseline
Receive request → basic review → generate a generic image prompt → output a conclusion. It worked, but generation quality was low and professional depth was badly lacking.
  • Produced "big house, flying man" instead of professional xianxia terminology, so results in AIGC tools were very poor
  • Single-turn only: users had to state every parameter in one message, which made for a very poor experience
v0.2
RAG Integration · Drama Specialization
Built the first xianxia RAG knowledge base and upgraded the output to a storyboard-table format. Professional quality improved markedly, but cost calculation and blocking logic surfaced new problems.
  • High cost-calculation failure rate: duration extraction errors caused 70-95% error in several cases (tests 001 and 014)
  • Interceptor too rigid: professional users who pasted storyboard scripts were wrongly blocked for "missing information"
  • Average response time of 70 seconds, a noticeable drop in user experience
Key learning: LLMs are good at understanding tasks, not at precise calculation. Parameter extraction needs hard constraints ("must extract from the input; output null if not found") rather than free model judgment.
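The "null if not found" rule only helps if the code after the model refuses to compute on a null. A minimal sketch of that guard follows; the field names and the duration × unit-price formula are hypothetical placeholders, since the actual extraction schema and cost formula are not given here.

```python
# Hypothetical required fields; the real schema is not shown in this case study.
REQUIRED = ("duration_s", "resolution", "budget")


def missing_params(extracted: dict) -> list[str]:
    """Return required fields that are absent or null; never substitute a default."""
    return [name for name in REQUIRED if extracted.get(name) is None]


def estimate_or_ask(extracted: dict, unit_price_per_s: float) -> dict:
    """Compute a cost only when every required field was actually extracted."""
    missing = missing_params(extracted)
    if missing:
        # Surface the gap to the user instead of calculating silently.
        return {"status": "NEED_EDIT", "ask_for": missing}
    # Placeholder formula for illustration only.
    return {"status": "OK", "cost": extracted["duration_s"] * unit_price_per_s}
```

With this guard, a failed duration extraction turns into a follow-up question rather than a cost figure built on a default value.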
v0.2.5
Director Mode · Visual-Audio Dual Track
Upgraded from "photographer" to "director": the system generates not only visual descriptions but also narration, sound effects, and key dialogue, forming a complete micro-screenplay. Fully localized in Chinese.
Breakthrough: User value moved from "getting a set of image keywords" to "getting a short script that is ready to shoot." Product positioning shifted from tool to director's assistant.
v0.3 Beta
Intelligence Upgrade · Engineering Quality
Targeted fixes for the core problems found in v0.2 testing: a VIP pass-through mechanism, timestamp parsing, complexity keyword mapping, and switching Step 4 to a fast model.
Result: Response time 70s → 42s (−40%), parameter extraction success rate significantly improved, compliance interception accuracy 100% (2/2), prompt key-element coverage 90% (18/20).
Results & Validated Metrics

20 test cases,
real numbers

90%
Requirement understanding accuracy (18/20 cases with complete key prompt elements)
<15%
Cost estimation error when parameter extraction succeeds (median error: 0%)
42s
Avg response time after optimization (was 70s, 40% faster)
100%
Compliance interception accuracy (2/2 violating cases handled correctly)
Before (v0.1) → After (v0.3)
  • "big house, flying man" (generic, unusable output) → "floating islands, ancient pagodas, wuxia style" (professional output grounded in the RAG knowledge base)
  • Single-image description with characters and scenery mixed together → Storyboard-table structure with environment, character, and action broken out separately
  • No cost preview; production is a blind box → Dynamic cost range plus specific cost-reduction suggestions
  • Script breakdown: hours per episode → Minute-scale (avg 42 seconds)
  • Style drifts easily, with Chinese animation turning into Korean or Japanese style → Locked to Chinese animation / ink-wash style, with negative prompts to prevent drift

The test report shows an overall average cost estimation error of 22.7%, pulled up by a few outlier cases where parameter extraction failed (case 001: 70.2%, case 014: 94.8%). Root cause: the prompt did not require null output when extraction failed, so the system silently calculated with default values.

In cases where parameters were extracted correctly, formula-level error was <15% with a median error of 0%. v0.3 includes a targeted fix for this.
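The gap between the mean and the median is what two large outliers do to a small sample. The snippet below illustrates the effect only: the values 70.2 and 94.8 come from the test report (cases 001 and 014), while every other value is made up, so it does not reproduce the reported 22.7%.

```python
from statistics import mean, median

# Per-case cost errors in percent. Only 70.2 and 94.8 are from the test report;
# the remaining values are invented for illustration.
errors = [0.0, 0.0, 0.0, 0.0, 0.0, 8.0, 12.0, 70.2, 94.8]

outlier_free = [e for e in errors if e < 15.0]

print(f"mean with outliers:    {mean(errors):.1f}%")
print(f"median with outliers:  {median(errors):.1f}%")
print(f"mean without outliers: {mean(outlier_free):.1f}%")
```

Reporting the median alongside the mean, as the test report does, keeps a handful of extraction failures from hiding how the formula behaves on correctly extracted inputs.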

Technical Stack

Selection principle:
right fit, not most expensive

Orchestration: Baidu Qianfan AppBuilder (Workflow)
Node types: LLM + Code + Memory
Intent Model: Qwen3-30B-A3B (Classification) · Why: high classification accuracy
Compliance Model: DeepSeek-V3.2 (Safety) · Why: strong safety alignment
Generation Model: ERNIE-4.0-Turbo (Cost + Prompt) · Why: strong Chinese understanding
Response Model: ERNIE-Speed (UX layer) · Why: speed first
Knowledge Base: RAG · Excel → CSV, 200+ entries
Schema: Key / Context / Value / Neg
Reflection & Next Steps

If I continued,
here's what's next

Cost Model Precision
Connect to unit-price APIs of real AIGC tools such as Sora and Runway to replace the current estimation formula, so that cost is calculated dynamically from actual model pricing and the savings percentage for each suggestion is computed automatically.
Cross-session Memory Persistence
Store historical preferences across sessions, keyed by user ID, so the agent remembers a user's usual resolution, budget range, and style preferences and goes from stranger to a creative assistant that knows the user.
Observability & Evaluation
Testing is currently a manual, case-by-case review. An automated evaluation pipeline is needed, covering parameter extraction accuracy, cost error distribution, and prompt quality scoring (an LLM-as-judge mechanism could be introduced).
Production Execution API
Call the AIGC generation API directly after "confirm production," closing the loop from pre-review to generation and turning Pre-Flight from a pre-production assistant into the entry point of an end-to-end production platform.
Knowledge Base Expansion
The knowledge base currently covers only xianxia scenes (200+ entries). The plan is to expand to thriller, urban, historical, and sci-fi genres, and to add a recommended-dialogue field (Script Hook) to each scene so that Director Mode can pull signature lines directly.
A/B Testing & Regression
Build a regression test set that automatically runs every case after each iteration and compares changes in key metrics (cost error, response time, prompt quality), so that a fix in one place does not break another.
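One way such a comparison could look, as a sketch only: the metric names, the direction table, and the 5% tolerance are all assumptions, and the example figures below are placeholders rather than measured results.

```python
# Hypothetical metric names; True means "higher is better".
HIGHER_IS_BETTER = {
    "cost_error_pct": False,
    "response_time_s": False,
    "prompt_coverage": True,
}


def find_regressions(baseline: dict, candidate: dict, tolerance: float = 0.05) -> list[str]:
    """Return metrics that got worse than the baseline by more than `tolerance` (relative)."""
    regressions = []
    for name, higher_better in HIGHER_IS_BETTER.items():
        old, new = baseline[name], candidate[name]
        worse_by = (old - new) if higher_better else (new - old)
        # A zero baseline has no meaningful relative change, so it is skipped.
        if old != 0 and worse_by / abs(old) > tolerance:
            regressions.append(name)
    return regressions


# Placeholder numbers for illustration.
baseline = {"cost_error_pct": 15.0, "response_time_s": 42.0, "prompt_coverage": 0.90}
candidate = {"cost_error_pct": 14.0, "response_time_s": 50.0, "prompt_coverage": 0.90}
print(find_regressions(baseline, candidate))
```

Running this after every iteration would flag the slower response time even though the cost error improved.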

Have a project?
Let's talk.

If you're working on AIGC agent system design, solution architecture, or product retrospectives, let's connect.