Entity Extraction
Key Design
A constrained prompt drives Gemini 3 Pro to extract structured entities from product text.
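Extraction accuracy: 91.2%, measured on 100 human-annotated samples. This figure matches the macro-averaged F1 over the seven entity types; the F1 reported for the pooled counts is 0.90.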
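Seven product entity types
The constrained prompt extracts the following seven types of product entities:
| Entity type | Examples | Purpose |
|---|---|---|
| Product (the product itself) | Stainless Kitchen Rack | Root node of the graph |
| Material | 304 stainless steel, ABS, bamboo/wood | Influences main-image style and A+ descriptions |
| Specification | Dimensions / weight / load capacity / volume | Key facts for Listing bullets |
| UseCase (usage scenario) | Kitchen storage, outdoor camping, bathroom | Prompts for generating Lifestyle images |
| FeatureSellingPoint (selling point) | Anti-rust / Foldable | Determines A+ infographic headings |
| Compliance | FDA-Free Claim / non-medical | Filtered against the prohibited-terms list |
| Audience | Single Apartment / Family | Determines copy tone and scenarios |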
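The seven types can be pinned down in code so that anything outside the set is rejected early. The following is a minimal sketch, not the project's actual data model: the `EntityType` and `Entity` names are assumptions made for illustration, and the `type` / `name` / `attrs` fields mirror the JSON output format used elsewhere in this document.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any


class EntityType(str, Enum):
    """The seven entity types from the table above."""
    PRODUCT = "Product"
    MATERIAL = "Material"
    SPECIFICATION = "Specification"
    USE_CASE = "UseCase"
    FEATURE_SELLING_POINT = "FeatureSellingPoint"
    COMPLIANCE = "Compliance"
    AUDIENCE = "Audience"


@dataclass
class Entity:
    """One extracted entity (hypothetical container, not the project's class)."""
    type: EntityType
    name: str
    attrs: dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_dict(cls, raw: dict[str, Any]) -> "Entity":
        # EntityType(...) raises ValueError for a type outside the seven.
        return cls(EntityType(raw["type"]), raw["name"], dict(raw.get("attrs") or {}))


rack_material = Entity.from_dict(
    {"type": "Material", "name": "304 Stainless Steel", "attrs": {"grade": "304"}}
)
```

Using an enum here means a misspelled or invented type fails at parse time rather than surfacing later in the graph.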
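Constrained prompt
text
You are an ontology expert for Amazon cross-border e-commerce products. Extract seven types of entities from the product text below:
Product / Material / Specification / UseCase / FeatureSellingPoint /
Compliance / Audience.
Return them as a JSON array in which each entity contains type, name, and attrs; if no entity of a given type exists,
return an empty array.
[Hard constraints]
1. Do not invent facts that do not appear in the source text
2. Specification must keep the original values and units (e.g. "60×30×80cm")
3. If a grade such as 304 / 316 is given, keep the specific grade
4. UseCase and Audience may be reasonably inferred from the product category (1-3 each)
[Output format]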
{
"entities": [
{ "type": "Product", "name": "...", "attrs": {} },
{ "type": "Material", "name": "...", "attrs": {} },
...
],
"evidence": {
"<entity_id>": "evidence snippet from the source text"
}
}
text
Product title: Stainless Steel Kitchen Rack Foldable
Product parameters:
- Material: 304 SUS Stainless Steel
- Size: 60×30×80cm
- Folded: 60×30×8cm
- Load: 8kg per tier, 24kg total
- Tiers: 3
- Anti-rust coating
- Lead-free certified
- 1-year warranty
Please extract the entities.
Example extraction result
json
{
"entities": [
{
"type": "Product",
"name": "Stainless Steel Kitchen Rack",
"attrs": { "category": "Kitchen Storage" }
},
{
"type": "Material",
"name": "304 Stainless Steel",
"attrs": { "grade": "304", "alias": ["304 SUS", "Grade 304"] }
},
{
"type": "Specification",
"name": "Size 60x30x80cm",
"attrs": { "length_cm": 60, "width_cm": 30, "height_cm": 80 }
},
{
"type": "Specification",
"name": "Load 8kg per tier",
"attrs": { "load_kg": 8, "tiers": 3, "total_load_kg": 24 }
},
{
"type": "Specification",
"name": "Folded Size 60x30x8cm",
"attrs": { "folded_height_cm": 8 }
},
{
"type": "FeatureSellingPoint",
"name": "Foldable",
"attrs": { "weight": 0.95 }
},
{
"type": "FeatureSellingPoint",
"name": "Anti-rust",
"attrs": { "weight": 0.80 }
},
{
"type": "FeatureSellingPoint",
"name": "Multi-tier (3-Tier)",
"attrs": { "weight": 0.85 }
},
{
"type": "Compliance",
"name": "Lead-Free",
"attrs": { "certified": true }
},
{
"type": "Compliance",
"name": "Food Contact Safe",
"attrs": { "certified": true }
},
{
"type": "UseCase",
"name": "Small Kitchen Apartment",
"attrs": { "weight": 0.90 }
},
{
"type": "UseCase",
"name": "Bathroom Storage",
"attrs": { "weight": 0.65 }
},
{
"type": "UseCase",
"name": "Outdoor Camping",
"attrs": { "weight": 0.55 }
},
{
"type": "Audience",
"name": "Single Apartment Renter",
"attrs": {}
},
{
"type": "Audience",
"name": "Family with Kids",
"attrs": {}
}
],
"evidence": {
"ent_002": "Material: 304 SUS Stainless Steel",
"ent_003": "Size: 60×30×80cm",
"ent_006": "Folded: 60×30×8cm → implies foldable",
"ent_009": "Lead-free certified"
}
}
Engineering optimizations for the extraction process
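Before any of the optimizations below, a response in the format shown in the example above can be sanity-checked in code. This is a minimal sketch under stated assumptions: `validate_response` is a hypothetical helper, and the `ent_NNN` evidence keys are assumed to be 1-based positions in the `entities` array, which is inferred from the example rather than stated in the prompt.

```python
import json
import re

ALLOWED_TYPES = {"Product", "Material", "Specification", "UseCase",
                 "FeatureSellingPoint", "Compliance", "Audience"}
# Assumed id format, inferred from the example output ("ent_002", "ent_009", ...).
EVIDENCE_KEY = re.compile(r"^ent_(\d{3})$")


def validate_response(raw_json: str) -> list[str]:
    """Return a list of problems found in a model response; empty means it passed."""
    problems = []
    data = json.loads(raw_json)
    entities = data.get("entities", [])
    for i, ent in enumerate(entities, start=1):
        if ent.get("type") not in ALLOWED_TYPES:
            problems.append(f"entity {i}: unknown type {ent.get('type')!r}")
        if not isinstance(ent.get("name"), str) or not ent.get("name"):
            problems.append(f"entity {i}: missing name")
        if not isinstance(ent.get("attrs"), dict):
            problems.append(f"entity {i}: attrs must be an object")
    for key in data.get("evidence", {}):
        match = EVIDENCE_KEY.match(key)
        # Assumption: ent_NNN points at the NNN-th entity in the array (1-based).
        if not match or not 1 <= int(match.group(1)) <= len(entities):
            problems.append(f"evidence key {key!r} does not match any entity")
    return problems
```

A response that fails these checks can be retried or sent to manual review instead of being written to the graph.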
Optimization 1: Image-assisted extraction
If the user uploads reference images, pass them to Gemini 3 Pro Vision together with the text:
python
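def build_prompt(base_prompt: str, product_text: str, image_names: list[str]) -> str:
    """Combine the base prompt, the product text, and reference-image placeholders."""
    # Each placeholder (e.g. "[image_1.jpg]") marks an uploaded image sent with the request.
    image_refs = "\n".join(f"[{name}]" for name in image_names)
    return f"""
{base_prompt}
[Product text]
{product_text}
[Reference images]
{image_refs}
Use the visual information in the images (color, material texture, usage scene) to supplement the entities.
"""
For example, a Lifestyle image can yield "small-apartment kitchen" as a UseCase.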
Optimization 2: Iterative extraction
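The three rounds listed below could be orchestrated along these lines. This is only a sketch: `call_model` is a hypothetical wrapper around the LLM call, the stage strings are invented for illustration, and Round 3 is assumed to return the reviewed entity list, since the document does not specify how the verification output is combined with the earlier rounds.

```python
from typing import Callable

Entity = dict  # {"type": ..., "name": ..., "attrs": {...}}

ROUND_1_TYPES = ["Product", "Material", "Specification"]
ROUND_2_TYPES = ["Compliance", "Audience", "UseCase"]


def merge(*rounds: list) -> list:
    """Merge entity lists, keeping the first entity seen for each (type, name)."""
    seen, merged = set(), []
    for entities in rounds:
        for ent in entities:
            key = (ent["type"], ent["name"].strip().lower())
            if key not in seen:
                seen.add(key)
                merged.append(ent)
    return merged


def extract_iteratively(text: str, call_model: Callable[[str, str, list], list]) -> list:
    """Run the three rounds; call_model(stage, text, context) is hypothetical."""
    # Round 1: basic entities stated directly in the text.
    base = call_model("extract:" + ",".join(ROUND_1_TYPES), text, [])
    # Round 2: inferred entities, with the Round 1 result as context.
    inferred = call_model("infer:" + ",".join(ROUND_2_TYPES), text, base)
    # Round 3: the model reviews the candidates and returns the ones it confirms.
    verified = call_model("verify", text, merge(base, inferred))
    # Final step: deduplicate the reviewed list by exact (type, name) match.
    return merge(verified)
```

Exact-match deduplication only catches trivial repeats such as casing or whitespace differences; near-synonyms need a similarity-based merge.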
text
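Round 1: extract the basic entities
→ "Product, Material, Specification"
Round 2: add inferred entities based on the Round 1 results
→ "Compliance / Audience / UseCase"
Round 3: verification prompt
→ "Please check whether the following entities are accurate..." (the model verifies its own output)
Final: merge the results of the three rounds and deduplicate
Optimization 3: Entity deduplication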
Synonymous entities are merged when the cosine similarity between their 1024-dimensional embedding vectors exceeds 0.92.
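The merge rule can be sketched as a greedy grouping over embedding vectors. This is an illustration under assumptions, not the project's implementation: the embedding model is not named in this document, so the vectors are taken as given, and `merge_synonyms` is a hypothetical helper. How the canonical name for a group is chosen is also left open here.

```python
import math


def cosine(a: list, b: list) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def merge_synonyms(names: list, vectors: list, threshold: float = 0.92) -> list:
    """Greedy single-pass grouping: each name joins the first group whose
    representative vector it matches above the threshold."""
    groups = []  # list of (representative_vector, [member names])
    for name, vec in zip(names, vectors):
        for rep, members in groups:
            if cosine(rep, vec) > threshold:
                members.append(name)
                break
        else:
            groups.append((vec, [name]))
    return [members for _, members in groups]


# Toy 3-dimensional vectors stand in for the real 1024-dimensional embeddings.
toy_groups = merge_synonyms(
    ["304 stainless", "304 SUS", "ABS"],
    [[1.0, 0.0, 0.1], [0.98, 0.05, 0.12], [0.0, 1.0, 0.0]],
)
```

A greedy pass depends on input order; a clustering method would be more robust, at the cost of more code.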
text
"304 stainless"
"304 SUS" ──→ merged into "304 Stainless Steel"
"Grade 304"
"Stainless 304"
Accuracy evaluation
We evaluated on 100 human-annotated, real cross-border e-commerce products:
| Entity type | Annotated | Extracted by AI | Correct | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| Product | 100 | 100 | 100 | 100% | 100% | 1.00 |
| Material | 287 | 271 | 264 | 97.4% | 92.0% | 0.95 |
| Specification | 631 | 612 | 568 | 92.8% | 90.0% | 0.91 |
| UseCase | 412 | 389 | 354 | 91.0% | 85.9% | 0.88 |
| FeatureSellingPoint | 523 | 498 | 462 | 92.8% | 88.3% | 0.91 |
| Compliance | 198 | 184 | 173 | 94.0% | 87.4% | 0.91 |
| Audience | 246 | 231 | 198 | 85.7% | 80.5% | 0.83 |
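| Total | 2,397 | 2,285 | 2,119 | 92.7% | 88.4% | 0.90 |
Overall F1 = 0.90
We consider this level of accuracy sufficient for production use. The Audience type scores lowest (F1 0.83) because audience judgments are inherently subjective.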
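The table's derived columns follow from the three count columns. The sketch below recomputes them, with the counts copied from the table; `prf` is a helper name chosen for illustration. Computed this way, the F1 from the pooled counts (micro-average) comes out near 0.905, and the unweighted mean of the per-type F1 scores (macro-average) near 0.912.

```python
def prf(annotated: int, extracted: int, correct: int) -> tuple:
    """Precision, recall, and F1 from raw counts."""
    precision = correct / extracted
    recall = correct / annotated
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# (annotated, extracted, correct), copied from the evaluation table.
COUNTS = {
    "Product": (100, 100, 100),
    "Material": (287, 271, 264),
    "Specification": (631, 612, 568),
    "UseCase": (412, 389, 354),
    "FeatureSellingPoint": (523, 498, 462),
    "Compliance": (198, 184, 173),
    "Audience": (246, 231, 198),
}

per_type = {name: prf(*counts) for name, counts in COUNTS.items()}
totals = [sum(col) for col in zip(*COUNTS.values())]
micro = prf(*totals)  # pooled over all entity types
macro_f1 = sum(f1 for _, _, f1 in per_type.values()) / len(per_type)

for name, (p, r, f1) in per_type.items():
    print(f"{name:<20} P={p:.1%} R={r:.1%} F1={f1:.2f}")
```

Reporting both averages is useful here because Audience, the weakest type, carries more weight in the macro figure than in the pooled one.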
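Failure mode analysis
Identified failure modes
- Composite materials: "bamboo + stainless steel" is sometimes split into two separate entities
- Synonyms with different surface forms: "Foldable" vs "Collapsible" are occasionally not merged
- Abbreviated proper nouns: "BPA-Free" is occasionally classified as Specification rather than Compliance
- Multilingual input: with mixed Chinese-English input, English entities take precedence
We have added an "entity-merging prompt" and a "proper-noun dictionary" to mitigate these issues.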
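A dictionary-based correction step might look like the following. The real dictionary is not shown in this document, so every entry here is hypothetical, as are the `PROPER_NOUN_TYPES`, `SYNONYMS`, and `normalize` names; the two entries simply mirror the "BPA-Free" and "Collapsible" failure cases listed above.

```python
# Hypothetical entries; the project's actual dictionary is not shown.
PROPER_NOUN_TYPES = {"bpa-free": "Compliance", "lead-free": "Compliance"}
SYNONYMS = {"collapsible": "Foldable"}


def normalize(entity: dict) -> dict:
    """Apply dictionary overrides to one extracted entity without mutating it."""
    key = entity["name"].strip().lower()
    fixed = dict(entity)
    if key in PROPER_NOUN_TYPES:
        # Force the type for known proper nouns the model tends to misclassify.
        fixed["type"] = PROPER_NOUN_TYPES[key]
    if key in SYNONYMS:
        # Map a known surface variant onto its canonical name.
        fixed["name"] = SYNONYMS[key]
    return fixed
```

Running this after extraction and before similarity-based merging keeps the deterministic fixes separate from the fuzzy ones.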
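Prompt iteration history
The students iterated on the GraphRAG entity-extraction prompt 12 times, manually inspecting 50 samples after each run:
| Version | Key improvement | F1 |
|---|---|---|
| v1.0 | Baseline extraction | 0.62 |
| v3.0 | Added the evidence field | 0.71 |
| v5.0 | Added hard constraints + JSON Schema | 0.79 |
| v7.0 | Introduced image assistance | 0.85 |
| v9.0 | Three-round iterative extraction | 0.88 |
| v12.0 | Proper-noun dictionary + entity merging | 0.90 |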
API
http
POST /api/graphrag/extract
Content-Type: application/json

{
"title": "Stainless Steel Kitchen Rack",
"parameters": "...",
"reference_images": ["base64...", "base64..."]
}
Response:
json
{
"graphrag_id": "kg_2026-05-01_xyz",
"entities": [...],
"relations": [...],
"duration_ms": 6234
}
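A client for this endpoint could be sketched with the standard library alone. Several details are assumptions: `BASE_URL` is a placeholder because the document does not state a host, `build_request` and `extract` are hypothetical helper names, and `reference_images` is assumed to take plain base64 strings without a data-URI prefix.

```python
import base64
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # hypothetical host, not given in the document


def build_request(title: str, parameters: str, images: list) -> urllib.request.Request:
    """Build the POST request; images is a list of raw image bytes."""
    payload = {
        "title": title,
        "parameters": parameters,
        # Assumption: the API expects plain base64 strings.
        "reference_images": [base64.b64encode(img).decode("ascii") for img in images],
    }
    return urllib.request.Request(
        BASE_URL + "/api/graphrag/extract",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract(title: str, parameters: str, images: list, timeout: float = 30.0) -> dict:
    """Send the request and return the parsed JSON response."""
    request = build_request(title, parameters, images)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

The sample response reports a `duration_ms` of 6234, so a timeout well above six seconds is a reasonable starting point.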