silicon-paddle-ocr

Name: silicon-paddle-ocr
Author: aotenjou

OCR skill using PaddleOCR model via SiliconFlow API. This skill should be used when the user asks to "recognize text from an image", "extract text from a…

INSTALLATION

npx skills add https://github.com/aotenjou/silicon-paddleocr --skill silicon-paddle-ocr

Run in your project or agent environment. Adjust flags if your CLI version differs.

SKILL.md

OCR - Image Text Recognition

Use PaddleOCR to extract text content from images. Supports both single-image and batch processing.

Overview

This skill provides optical character recognition (OCR) using the PaddlePaddle/PaddleOCR-VL-1.5 model via the SiliconFlow API. It extracts text from JPG, PNG, WebP, BMP, and GIF images.

When to Use

Invoke this skill when:

User wants to extract text from an image

User asks to OCR a screenshot or photo

User needs to read text from an image file

User mentions text recognition from images

How to Use

Prerequisites

Ensure the SILICONFLOW_API_KEY environment variable is set:

export SILICONFLOW_API_KEY="your_api_key"
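A caller that wraps the script may want to fail fast when the key is missing. The following is a minimal sketch of such a check; `require_api_key` is a hypothetical helper for illustration and is not part of the skill.

```python
import os


def require_api_key(env=None):
    """Return the SiliconFlow API key, or raise a descriptive error.

    `env` defaults to os.environ; passing a plain dict makes the check
    easy to test. This helper is hypothetical, not part of the skill.
    """
    env = os.environ if env is None else env
    key = env.get("SILICONFLOW_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "SILICONFLOW_API_KEY is not set; export it or pass -k/--api-key"
        )
    return key
```

Calling `require_api_key()` before invoking the script surfaces a missing key immediately instead of as an API error later.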

Basic Usage

Execute the OCR script:

python3 scripts/ocr_skill.py [options] image_path [image_path ...]

Arguments

Argument        Description
images          Image file path(s) or glob pattern (required)
-k, --api-key   API key (default: from SILICONFLOW_API_KEY env)
-m, --model     OCR model name (default: PaddlePaddle/PaddleOCR-VL-1.5)
-p, --prompt    Recognition prompt for custom behavior
-j, --json      Output results in JSON format
-o, --output    Save results to specified file
--max-tokens    Maximum tokens in response (default: 2000)
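To make the option semantics concrete, here is a sketch of how a command line with these options could be declared with `argparse`. It mirrors only the names and defaults documented here; the actual parser in scripts/ocr_skill.py may differ, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented options."""
    parser = argparse.ArgumentParser(prog="ocr_skill.py")
    # One or more image paths; the shell usually expands glob patterns.
    parser.add_argument("images", nargs="+",
                        help="Image file path(s) or glob pattern")
    parser.add_argument("-k", "--api-key", default=None,
                        help="API key (falls back to SILICONFLOW_API_KEY)")
    parser.add_argument("-m", "--model",
                        default="PaddlePaddle/PaddleOCR-VL-1.5",
                        help="OCR model name")
    parser.add_argument("-p", "--prompt", default=None,
                        help="Recognition prompt for custom behavior")
    parser.add_argument("-j", "--json", action="store_true",
                        help="Output results in JSON format")
    parser.add_argument("-o", "--output", default=None,
                        help="Save results to specified file")
    parser.add_argument("--max-tokens", type=int, default=2000,
                        help="Maximum tokens in response")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    ["--json", "-o", "results.json", "a.jpg", "b.png"]
)
print(args.images, args.json, args.output, args.max_tokens)
```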

Examples

Single image:

python3 scripts/ocr_skill.py /path/to/image.jpg

Multiple images with glob:

python3 scripts/ocr_skill.py /path/to/images/*.png

JSON output format:

python3 scripts/ocr_skill.py --json /path/to/image.jpg

Custom prompt for table extraction:

python3 scripts/ocr_skill.py -p "Please identify and format table content as Markdown" /path/to/table.jpg

Save to file:

python3 scripts/ocr_skill.py --json --output results.json /path/to/images/*.jpg

Output Format

Text output (default):

--- image.jpg ---

Recognized text content

识别到 X 处文字区域 (in English: "X text regions recognized")

JSON output:

{
  "image.jpg": {
    "image_path": "/path/to/image.jpg",
    "image_size": [width, height],
    "texts": [
      {
        "text": "recognized text",
        "box": [[x1, y1], [x2, y2], [x3, y3], [x4, y4]]
      }
    ],
    "full_text": "all recognized text combined"
  },
  "image2.png": { ... }
}
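Downstream code can consume this structure directly. The sketch below assumes only the documented fields (`texts`, `full_text`); `summarize` is a hypothetical helper, and the inline sample stands in for a file written with `--json --output`.

```python
import json


def summarize(results):
    """Map each image name to (number of text regions, full text)."""
    return {
        name: (len(entry.get("texts", [])), entry.get("full_text", ""))
        for name, entry in results.items()
    }


# Inline sample shaped like the documented output; real data would come
# from json.load() on a file saved with --json --output.
sample = json.loads("""
{
  "image.jpg": {
    "image_path": "/path/to/image.jpg",
    "image_size": [800, 600],
    "texts": [
      {"text": "Hello", "box": [[10, 10], [90, 10], [90, 30], [10, 30]]}
    ],
    "full_text": "Hello"
  }
}
""")
print(summarize(sample))
```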

Coordinates Explanation:

LOC values are normalized coordinates; they are converted to pixel coordinates before being reported

Conversion: pixel = LOC × (image_size / LOC_max_value)

LOC_max_value is approximately 972 (may vary by model/image)

The box field provides the four corner coordinates of each text region in pixel format
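The conversion can be illustrated with a short sketch. It assumes that x values scale with the image width and y values with the height, and that results are rounded to whole pixels; neither detail is stated here, and `loc_to_pixel` and `loc_box_to_pixels` are hypothetical names rather than functions from the skill.

```python
def loc_to_pixel(loc_point, image_size, loc_max=972):
    """Convert one (x, y) LOC point to pixel coordinates.

    loc_max=972 is the approximate value quoted above and may vary.
    Assumption: x is scaled by width, y by height, then rounded.
    """
    width, height = image_size
    x, y = loc_point
    return (round(x * width / loc_max), round(y * height / loc_max))


def loc_box_to_pixels(loc_box, image_size, loc_max=972):
    """Convert the four corners of a LOC box to pixel coordinates."""
    return [loc_to_pixel(point, image_size, loc_max) for point in loc_box]


# A LOC point at the centre of the normalized range, on a 1944x972 image.
print(loc_to_pixel((486, 486), (1944, 972)))  # (972, 486)
```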

Supported Image Formats

JPG/JPEG

PNG

WebP

BMP

GIF

Error Handling

If processing fails:

Check that the image file exists

Verify the SILICONFLOW_API_KEY is valid

Ensure the API endpoint is reachable

If an image fails to process, an error message is shown for it and the remaining images continue processing.
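The continue-on-failure behavior described above can be sketched as a loop that records errors per image instead of aborting. This is an illustration of the pattern, not the script's actual code; `process_batch` and the `recognize` callback are hypothetical.

```python
import os


def process_batch(paths, recognize):
    """Run `recognize` on each path, collecting errors instead of stopping.

    `recognize` is any callable taking a file path and returning a result;
    here it stands in for the real OCR request.
    """
    results, errors = {}, {}
    for path in paths:
        name = os.path.basename(path)
        try:
            if not os.path.isfile(path):
                raise FileNotFoundError(f"image not found: {path}")
            results[name] = recognize(path)
        except Exception as exc:
            # Record the failure and move on to the next image.
            errors[name] = str(exc)
    return results, errors
```

A caller can then print each entry in `errors` alongside the successful results, so one unreadable file does not hide the output for the rest of the batch.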

Additional Resources

Reference Files

**references/api-configuration.md** - API configuration details

Example Files

**examples/sample-usage.sh** - Example usage script

Scripts

**scripts/ocr_skill.py** - The main OCR implementation