API Documentation · LeDocExtract

Request flow

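POST the image as multipart form data to /extract/{doc_type}.
The server calls the upstream OCR engine (chandra/rapidocr) and Qwen as configured in the profile, then runs validation.
If validation fails, the server retries once automatically (following the profile's retry.strategy).
The server returns structured JSON with HTTP 200; the success field tells you whether validation passed.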

A validation failure also returns HTTP 200, with the error details in the body. Clients only need to check the success field.

Endpoints

Method	Path	Purpose
POST	/extract/{doc_type}	Main endpoint: upload image + extract + validate
GET	/health	Health check + list of currently loaded profiles
GET	/admin/api/profiles/{doc_type}/raw	Read the raw YAML (admin)
PUT	/admin/api/profiles/{doc_type}/raw	Save the YAML + hot reload (admin)

Examples

curl

curl -F "file=@id_card.jpg" \
  http://127.0.0.1:8090/extract/id_card_front

Python (httpx)

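import httpx

# Upload the image as multipart form data; the form field name is "file".
with open("id_card.jpg", "rb") as f:
    r = httpx.post(
        "http://127.0.0.1:8090/extract/id_card_front",
        files={"file": ("id_card.jpg", f)},
        timeout=300,  # generous timeout for OCR + extraction
    )
# Validation failures still return 200; this only catches other HTTP errors.
r.raise_for_status()
j = r.json()
if j["success"]:
    print(j["fields"]["id_number"]["value"])
else:
    # Validation failed: print each failing field with its message.
    for e in j["validation_errors"]:
        print(e["field"], e["message"])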

Response format

{
  "doc_type": "id_card_front",
  "success": true,
  "fields": {
    "name":       {"value": "曾梅", "confidence": 1.0},
    "id_number":  {"value": "520123199903285848", "confidence": 1.0},
    "birth_date": {"value": "1999-03-28", "confidence": 1.0},
    ...
  },
  "validation_errors": [],
  "retries": 0,
  "elapsed_seconds": 3.5,
  "raw_ocr_text": "...",
  "raw_extract_json": { ... },
  "pipeline_errors": []
}
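
A client may want to look beyond success and flag fields the service was unsure about. The following is a minimal sketch that works on the response shape shown above. The helper name low_confidence_fields and the 0.8 threshold are assumptions for illustration, not part of the API, and the birth_date confidence in the sample is lowered on purpose to show a flagged field.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class FieldResult:
    name: str
    value: Any
    confidence: float


def low_confidence_fields(resp: dict, threshold: float = 0.8) -> list[FieldResult]:
    """Return fields whose confidence is below `threshold` (hypothetical cutoff)."""
    flagged = []
    for name, body in resp.get("fields", {}).items():
        conf = float(body.get("confidence", 0.0))
        if conf < threshold:
            flagged.append(FieldResult(name, body.get("value"), conf))
    return flagged


# Sample mirrors the documented response; birth_date confidence is altered for the demo.
sample = {
    "doc_type": "id_card_front",
    "success": True,
    "fields": {
        "name": {"value": "曾梅", "confidence": 1.0},
        "id_number": {"value": "520123199903285848", "confidence": 1.0},
        "birth_date": {"value": "1999-03-28", "confidence": 0.6},
    },
    "validation_errors": [],
}

flagged = low_confidence_fields(sample)
print([f.name for f in flagged])  # ['birth_date']
```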

Failure example (wrong check digit)

{
  "success": false,
  "fields": {
    "id_number": {
      "value": "440101199001010019",
      "confidence": 0.0,
      "errors": [{"rule": "id_number_checksum",
                  "message": "check digit should be 1, got 9"}]
    }
  },
  "validation_errors": [
    {"field": "id_number", "rule": "id_number_checksum",
     "message": "check digit should be 1, got 9"}
  ],
  "retries": 1
}
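
Since the same error appears both under the field and in the top-level validation_errors list, a client can build a per-field summary from the top-level list alone. A small sketch, assuming only the keys shown in the failure example; the function name failed_rules_by_field is hypothetical and the message text is replaced by a placeholder.

```python
from collections import defaultdict


def failed_rules_by_field(resp: dict) -> dict[str, list[str]]:
    """Map each failing field to the names of the rules it violated."""
    grouped = defaultdict(list)
    for err in resp.get("validation_errors", []):
        grouped[err["field"]].append(err["rule"])
    return dict(grouped)


# Trimmed copy of the failure response; "..." stands in for the message.
failure = {
    "success": False,
    "validation_errors": [
        {"field": "id_number", "rule": "id_number_checksum", "message": "..."},
    ],
    "retries": 1,
}

summary = failed_rules_by_field(failure)
print(summary)  # {'id_number': ['id_number_checksum']}
```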

Validation rule library

Reference these rules by name under validate: in the profile:

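Rule name	Description
id_number_checksum	ISO 7064 check digit of the 18-digit ID number
id_birth_consistency	Digits 7-14 of the ID number == birth_date
id_gender_consistency	Parity of digit 17 of the ID number == gender
date_lt	Date in field A is earlier than date in field B
date_diff_years	Difference in years between two dates falls within [min, max]
degree_years_plausible	Graduation minus enrollment falls within the plausible range for the degree (high school / secondary vocational / junior college / bachelor's / master's / doctorate)
valid_period_canonical	ID card validity period is 5/10/20 years or "长期" (long-term)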
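
To make the three ID-number rules concrete, here is a sketch of the standard calculations they refer to: the ISO 7064 MOD 11-2 check character, the birth date embedded in digits 7-14, and the odd/even convention of digit 17 (odd for male, even for female). The function names are illustrative, and this is not the service's own implementation.

```python
# Position weights and check characters of ISO 7064 MOD 11-2 as used by 18-digit ID numbers.
WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
CHECK_CHARS = "10X98765432"


def expected_check_char(id_number: str) -> str:
    """Compute the check character from the first 17 digits."""
    total = sum(int(d) * w for d, w in zip(id_number[:17], WEIGHTS))
    return CHECK_CHARS[total % 11]


def checksum_ok(id_number: str) -> bool:
    """True if the string is 17 digits plus a matching check character."""
    body = id_number[:17]
    if len(id_number) != 18 or not (body.isascii() and body.isdigit()):
        return False
    return expected_check_char(id_number) == id_number[17].upper()


def birth_date_of(id_number: str) -> str:
    """Digits 7-14 (1-based) as YYYY-MM-DD."""
    d = id_number[6:14]
    return f"{d[:4]}-{d[4:6]}-{d[6:]}"


def is_male(id_number: str) -> bool:
    """Digit 17 (1-based) is odd for male, even for female."""
    return int(id_number[16]) % 2 == 1


print(checksum_ok("520123199903285848"))    # True
print(birth_date_of("520123199903285848"))  # 1999-03-28
print(checksum_ok("440101199001010019"))    # False
```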

Note: field-level rules (required / pattern / enum / date_format / min_length) are written directly in extract.fields[N] and are not referenced under validate:.
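
As a rough illustration of how field-level rules differ from the named rules, the sketch below checks a single value against a field spec given as a plain dict. The function check_field and its return format are assumptions made for this example; the service's real evaluation order and error format may differ.

```python
import re


def check_field(value, spec: dict) -> list[str]:
    """Return the names of the field-level rules that `value` violates."""
    if value is None or value == "":
        # An empty value can only violate `required`; other rules are skipped.
        return ["required"] if spec.get("required") else []

    errors = []
    text = str(value)
    pattern = spec.get("pattern")
    if pattern and not re.search(pattern, text):
        errors.append("pattern")
    if "enum" in spec and value not in spec["enum"]:
        errors.append("enum")
    if len(text) < spec.get("min_length", 0):
        errors.append("min_length")
    return errors


name_spec = {"required": True, "pattern": "^[\u4e00-\u9fa5]{2,}$", "min_length": 2}

print(check_field("曾梅", name_spec))  # []
print(check_field("", name_spec))      # ['required']
print(check_field("A", name_spec))     # ['pattern', 'min_length']
```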

Full DSL structure

doc_type: id_card_front          # used in the URL; must match the file name
name: "身份证正面"                # display name ("ID card, front")
version: 1

ocr:
  engine: rapidocr                # chandra | rapidocr
  options: { scale: 2.0 }

extract:
  model: qwen3.5-35b-a3b
  prompt_template: |             # must contain the {ocr_text} placeholder
    ... {ocr_text} ...
  fields:
    name:
      type: string                # string | date | int | float
      required: true
      pattern: "^[\\u4e00-\\u9fa5]{2,}$"
      enum: ["A", "B"]            # optional
      format: "YYYY-MM-DD"        # only for type=date
      min_length: 2
      max_length: 50

validate:                         # rule names are listed in the previous section
  - { rule: id_number_checksum, target: id_number }
  - { rule: date_lt, targets: [enrollment_date, graduation_date] }

retry:
  max_retries: 1
  strategy: feed_errors_to_prompt # or reocr_other_engine / both
  fallback_engine: chandra        # used when strategy=reocr_other_engine
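
One way to read the retry block is as a loop around OCR, extraction and validation. The sketch below is a guess at that control flow, not the server's code: run_ocr, run_extract and run_validate are hypothetical callables supplied by the caller, and treating "both" as "re-OCR with the fallback engine and also feed the errors back" is an assumption.

```python
def extract_with_retry(image, profile, run_ocr, run_extract, run_validate):
    """Run OCR -> extract -> validate, retrying per profile['retry'] (assumed semantics)."""
    retry = profile.get("retry", {})
    max_retries = retry.get("max_retries", 0)
    strategy = retry.get("strategy", "feed_errors_to_prompt")

    text = run_ocr(image, profile["ocr"]["engine"])
    feedback = []  # validation errors passed to the next extraction attempt
    retries = 0
    while True:
        fields = run_extract(text, feedback)
        errors = run_validate(fields)
        if not errors or retries >= max_retries:
            break
        retries += 1
        if strategy in ("reocr_other_engine", "both"):
            text = run_ocr(image, retry["fallback_engine"])
        if strategy in ("feed_errors_to_prompt", "both"):
            feedback = errors
    return {"success": not errors, "fields": fields,
            "validation_errors": errors, "retries": retries}


# Stub pipeline: extraction only succeeds once it has seen error feedback.
profile = {
    "ocr": {"engine": "rapidocr"},
    "retry": {"max_retries": 1, "strategy": "feed_errors_to_prompt",
              "fallback_engine": "chandra"},
}
result = extract_with_retry(
    b"fake-image-bytes",
    profile,
    run_ocr=lambda image, engine: f"text from {engine}",
    run_extract=lambda text, feedback: {"ok": bool(feedback)},
    run_validate=lambda fields: [] if fields["ok"] else [{"field": "ok", "rule": "stub"}],
)
print(result["success"], result["retries"])  # True 1
```

With max_retries set to 0 the same stubs return success false and retries 0, which matches the retries counter in the response examples.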