Figure Data eXtraction
High-precision scientific figure data extraction skill for Claude Code. Extract numerical data from paper figures (bar, line, scatter, box, heatmap, pie, polar, stacked charts) with up to ±0.5% accuracy.
FigDataX is a Claude Code skill that guides the AI through a rigorous semi-automatic extraction pipeline:
- Load & analyze the figure image (chart type, axes, markers, legend)
- Detect plot area automatically or via manual specification
- Multi-point axis calibration using least-squares fit on 3+ tick marks per axis
- Grid removal via Hough line detection or color-based filtering
- Data point extraction by color matching with sub-pixel centroid refinement, or manual reading from a coordinate grid overlay
- Pixel-to-data conversion using the calibrated axis model
- Validation via side-by-side overlay plot (original vs. reconstructed)
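
The calibration and conversion steps above can be sketched with NumPy alone. This is a minimal illustration under stated assumptions, not FigDataX's implementation: `fit_linear_axis` and `pixel_to_data` are hypothetical helpers, and the tick positions are made-up values for linear axes.

```python
import numpy as np

def fit_linear_axis(pixels, values):
    """Least-squares fit of value = slope * pixel + intercept.

    Hypothetical helper for illustration; not part of the FigDataX API.
    Returns (slope, intercept, rmse), with rmse in data units.
    """
    pixels = np.asarray(pixels, dtype=float)
    values = np.asarray(values, dtype=float)
    if pixels.size < 3:
        raise ValueError("use at least 3 tick marks so the fit is overdetermined")
    slope, intercept = np.polyfit(pixels, values, deg=1)
    residuals = values - (slope * pixels + intercept)
    rmse = float(np.sqrt(np.mean(residuals ** 2)))
    return float(slope), float(intercept), rmse

# Made-up tick positions: pixel coordinates and the axis labels printed at them.
x_slope, x_intercept, x_rmse = fit_linear_axis([85, 200, 315, 430], [0, 10, 20, 30])
# Image rows grow downward, so the Y slope comes out negative.
y_slope, y_intercept, y_rmse = fit_linear_axis([380, 285, 190, 95], [0, 25, 50, 75])

def pixel_to_data(px, py):
    """Convert a pixel coordinate to data coordinates with the fitted models."""
    return x_slope * px + x_intercept, y_slope * py + y_intercept

x, y = pixel_to_data(250, 240)
print(f"x = {x:.3f}, y = {y:.3f}, RMSE = ({x_rmse:.2e}, {y_rmse:.2e})")
```

With three or more ticks per axis, a misread tick shows up as a non-zero RMSE. A two-point fit always has zero residual, so it cannot reveal that kind of mistake.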
The geometric center of each marker (circle, diamond, square, triangle) is the true data point. Large markers (10-20px) can introduce 5-10% error if edges are read instead of centers.
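
To make the center-versus-edge effect concrete, the sketch below draws a synthetic 15 px disc marker and compares its mask centroid with its top edge. It uses only NumPy. The image size, marker radius, and the 140 px plot height are illustrative assumptions, not values taken from FigDataX.

```python
import numpy as np

# Synthetic 60x60 binary mask with a filled disc (the "marker") centered at row 30, col 24.
rows, cols = np.mgrid[0:60, 0:60]
radius = 7  # disc diameter is 15 px
mask = (rows - 30) ** 2 + (cols - 24) ** 2 <= radius ** 2

# Centroid of the mask: the mean row and column of all marker pixels.
marker_rows, marker_cols = np.nonzero(mask)
center_row = marker_rows.mean()
center_col = marker_cols.mean()

# Reading the top edge instead of the center shifts the row by one radius.
top_edge_row = marker_rows.min()
offset_px = center_row - top_edge_row

# Assumed plot-area height in pixels; the offset as a fraction of the full Y range.
axis_height_px = 140
print(f"center = ({center_row:.1f}, {center_col:.1f}), edge offset = {offset_px:.1f} px")
print(f"error if the edge is read: {100 * offset_px / axis_height_px:.1f}% of the Y range")
```

The relative error scales with marker size over plot height, so small, low-resolution figures with large markers are the worst case.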
# Copy into your Claude Code skills directory
cp -r FigDataX ~/.claude/skills/
# Install Python dependencies
pip install opencv-python numpy pandas matplotlib scipy openpyxl scikit-image

Simply tell Claude Code to extract data from a figure image:
> Extract data from /path/to/figure.png
> 提取 /path/to/papers/fig3.png 图片数据
> Digitize the chart in ./results/figure2a.png
Claude Code will automatically:
- Read the image and identify chart type, axes, markers
- Generate a coordinate grid overlay for precise pixel reading
- Perform multi-point axis calibration
- Extract data points (marker centers)
- Save results and validation plot in the same directory as the input image
- Input: Provide the absolute or relative path to the figure image (PNG, JPG, etc.)
- Output: All generated files are saved next to the input image, not in the skill directory
| Output File | Description |
|---|---|
| {name}_extracted.csv | Extracted data table |
| {name}_validation.png | Side-by-side original vs. reconstructed chart |
| {name}_grid.png | Coordinate grid overlay (intermediate) |
Example:
Input:  ~/papers/fig3.png
Output: ~/papers/fig3_extracted.csv
        ~/papers/fig3_validation.png
        ~/papers/fig3_grid.png
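
The naming convention above can be reproduced with `pathlib` when post-processing results in your own scripts. This is a sketch only: `output_paths` is a hypothetical helper, not a FigDataX function, and the relative path is a made-up example.

```python
from pathlib import Path

def output_paths(image_path):
    """Return the three output paths that sit next to the input image.

    Hypothetical helper; the suffixes follow the naming convention listed above.
    """
    image = Path(image_path).expanduser()
    stem = image.stem  # "fig3" for "fig3.png"
    return {
        "data": image.with_name(f"{stem}_extracted.csv"),
        "validation": image.with_name(f"{stem}_validation.png"),
        "grid": image.with_name(f"{stem}_grid.png"),
    }

paths = output_paths("papers/fig3.png")
for kind, path in paths.items():
    print(f"{kind}: {path}")
```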
To extract from multiple figures in a folder, point Claude Code to the directory:
> Extract data from all figures in /path/to/figures/
To use FigDataX as a Python library outside Claude Code:
import sys, os
sys.path.insert(0, os.path.expanduser("~/.claude/skills/FigDataX"))
from scripts.figdatax import calibrate_axes_multipoint, auto_detect_plot_area

| Method | Name | Best For | Accuracy |
|---|---|---|---|
| M1 | Calibrated Semi-Auto | All charts (default) | ±0.5-2% |
| M2 | Fully Automated | High-contrast, distinct-color charts | ±0.5-1% |
| M3 | Hough + Curve Trace | Line charts, continuous curves | ±0.5-1% |
Always use M1. It is the most accurate because it relies on axis reference points verified by the user rather than on AI guesswork.
import sys, os
sys.path.insert(0, os.path.expanduser("~/.claude/skills/FigDataX"))
from scripts.figdatax import (
    auto_detect_plot_area,         # Automatic plot area detection
    calibrate_axes_multipoint,     # Multi-point least-squares axis calibration
    calibrate_axes,                # Simple 2-point calibration
    remove_grid,                   # Grid line removal (Hough/color/adaptive)
    extract_by_color_adaptive,     # Color-based data extraction with sub-pixel refinement
    detect_data_colors,            # K-means auto color detection
    auto_extract_bars,             # Bar chart extraction
    auto_extract_scatter,          # Scatter plot extraction
    trace_curve,                   # Continuous curve tracing
    interpolate_curve,             # Spline interpolation for sparse points
    extract_error_bars,            # Error bar endpoint extraction
    split_panels,                  # Multi-panel figure splitting
    detect_axes_hough,             # Hough-based axis detection
    extract_polar,                 # Polar plot extraction
    generate_grid_overlay,         # 3-level coordinate grid overlay generation
    detect_markers_morphological,  # Morphological marker detection (same-color series)
    cluster_markers_by_x,          # Group markers by X position
    assign_series_with_crossover,  # Series assignment with crossover tracking
    create_validation_plot,        # Validation overlay generation
)

import sys, os
sys.path.insert(0, os.path.expanduser("~/.claude/skills/FigDataX"))
from scripts.figdatax import calibrate_axes_multipoint
# Calibrate using tick mark positions
converter = calibrate_axes_multipoint(
    pixel_points_x=[85, 200, 315, 430],
    data_values_x=[0, 10, 20, 30],
    pixel_points_y=[380, 285, 190, 95],
    data_values_y=[0, 25, 50, 75]
)
# Convert any pixel coordinate to data values
x, y = converter(250, 240)
print(f"Data point: ({x}, {y})")
print(f"Calibration RMSE: X={converter.x_rmse:.4f}, Y={converter.y_rmse:.4f}")

# Semi-auto extraction
python3 scripts/figdatax.py figure.png --mode semi \
    --x-range 0 100 --y-range 0 50 \
    --bbox 80 40 520 380 --color-target 120 200 200 \
    --subpixel --remove-grid --validate

# Auto-extract bar charts
python3 scripts/figdatax.py bars.png --mode auto \
    --y-range 0 100 --bbox 80 40 520 380 \
    --colors "blue:120,200,200" "red:0,200,200"

# Trace a curve
python3 scripts/figdatax.py line.png --mode trace \
    --x-range 0 100 --y-range 0 50 \
    --bbox 80 40 520 380 --color-target 0 200 200 \
    --n-samples 200 --subpixel

- Bar charts (simple, grouped, stacked)
- Line charts (single/multi-series)
- Scatter plots
- Box plots / violin plots
- Heatmaps
- Pie charts
- Polar plots
- Dual Y-axis charts
- Multi-panel figures (a, b, c, d)
- Linear, logarithmic (semi-log, log-log)
- Reciprocal (e.g., wavenumber)
- Date/time axes
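
Non-linear axes can be calibrated with the same least-squares fit once the tick values are mapped into a space where they are evenly spaced in pixels. The sketch below illustrates that idea with NumPy. `fit_axis` and `TRANSFORMS` are hypothetical names, not the FigDataX API, and it assumes a reciprocal axis is linear in 1/value.

```python
import numpy as np

# Forward/inverse transforms that make each axis type linear in pixel space.
TRANSFORMS = {
    "linear": (lambda v: v, lambda t: t),
    "log": (np.log10, lambda t: 10.0 ** t),
    "reciprocal": (lambda v: 1.0 / v, lambda t: 1.0 / t),  # assumption: linear in 1/value
}

def fit_axis(pixels, values, scale="linear"):
    """Return a pixel -> value function fitted by least squares.

    Hypothetical helper for illustration; not part of FigDataX.
    """
    forward, inverse = TRANSFORMS[scale]
    pixels = np.asarray(pixels, dtype=float)
    transformed = forward(np.asarray(values, dtype=float))
    slope, intercept = np.polyfit(pixels, transformed, deg=1)
    return lambda px: inverse(slope * np.asarray(px, dtype=float) + intercept)

# Made-up log axis: decades 1, 10, 100, 1000 spaced 100 px apart.
to_value = fit_axis([100, 200, 300, 400], [1, 10, 100, 1000], scale="log")
print(to_value(250))  # pixel halfway between the 10 and 100 ticks
```

Date/time axes can be handled the same way as linear axes after converting the tick labels to numeric timestamps.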
- Use highest resolution images (300+ DPI from PDF)
- Multi-point calibration with 3+ tick marks per axis
- Always read marker centers, not edges
- Remove grid lines before color-based extraction
- Filter out legend box area to avoid false detections
- For same-color curves, use coordinate grid overlay + manual reading
- Always generate validation overlay plots
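
As an illustration of removing grid lines before color-based extraction, the sketch below paints light, near-neutral pixels white so they cannot be mistaken for a data color. It is a simplified color filter written with NumPy only. `remove_gray_grid` and its thresholds are assumptions made for this example, not the behavior of FigDataX's `remove_grid`.

```python
import numpy as np

def remove_gray_grid(image_rgb, max_channel_spread=12, min_brightness=150, max_brightness=235):
    """Paint light, near-neutral pixels white.

    Hypothetical helper with illustrative thresholds; not FigDataX's remove_grid.
    """
    image = np.asarray(image_rgb).astype(np.int16)  # avoid uint8 overflow on subtraction
    spread = image.max(axis=2) - image.min(axis=2)  # near 0 for gray, large for saturated colors
    brightness = image.mean(axis=2)
    grid_mask = (
        (spread <= max_channel_spread)
        & (brightness >= min_brightness)
        & (brightness <= max_brightness)
    )
    cleaned = np.asarray(image_rgb).copy()
    cleaned[grid_mask] = 255
    return cleaned, grid_mask

# Synthetic 20x20 white image with one gray grid row and one red marker pixel.
demo = np.full((20, 20, 3), 255, dtype=np.uint8)
demo[10, :] = (200, 200, 200)  # light-gray grid line
demo[10, 5] = (220, 30, 30)    # red marker sitting on the grid line
cleaned, grid_mask = remove_gray_grid(demo)
print("grid pixels removed:", int(grid_mask.sum()))
```

Dark axis lines and saturated data colors fall outside the brightness and spread thresholds, so they are left untouched. Gray data series would need a different approach, such as Hough line detection.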
FigDataX/
├── README.md            # This file
├── SKILL.md             # Claude Code skill definition (English)
├── 中文说明.md           # Chinese reference
├── requirements.txt     # Python dependencies
├── LICENSE              # MIT License
└── scripts/
    ├── __init__.py
    └── figdatax.py      # Core library (16 functions + CLI)
- Engauge Digitizer: sub-pixel centroid refinement, curve tracing
- WebPlotDigitizer: color distance metric in HSV space
MIT License. See LICENSE.