Image to Text
Extract all readable text from an image using OCR (Tesseract). Returns the full text along with word-level and line-level bounding boxes and confidence scores.
When to Use
- Reading text content from a screenshot or design mockup
- Extracting UI copy (labels, buttons, headings) so you don't have to retype it
- Getting text positions and bounding boxes from a design image
How It Works
- The image is passed to Tesseract.js for optical character recognition
- Tesseract segments the image into lines and words
- The script returns the full text plus word-level and line-level details (bounding box, confidence)
Usage
bash <skill-path>/scripts/image-to-text.sh <image-path> [language]
Arguments:
image-path: Path to the image file (required)
language: OCR language code (optional, defaults to eng). Common codes: eng, fra, deu, spa, chi_sim, jpn
Examples:
# Extract text from a screenshot
bash <skill-path>/scripts/image-to-text.sh ./screenshot.png
# Extract French text
bash <skill-path>/scripts/image-to-text.sh ./mockup.png fra
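If the script needs to be called from another program, the invocation above can be wrapped in a few lines. This is a minimal Python sketch, not part of the skill itself. It assumes the script prints its JSON result to stdout and exits non-zero on failure; `build_command` and `run_ocr` are hypothetical helper names, and `script_path` stands in for the resolved `<skill-path>/scripts/image-to-text.sh`.

```python
import json
import subprocess
from typing import Any


def build_command(script_path: str, image_path: str, language: str = "eng") -> list[str]:
    """Build the argv list for the documented invocation (language defaults to eng)."""
    return ["bash", script_path, image_path, language]


def run_ocr(script_path: str, image_path: str, language: str = "eng") -> dict[str, Any]:
    """Run the script and parse its output.

    Assumption: the JSON result is written to stdout.
    """
    completed = subprocess.run(
        build_command(script_path, image_path, language),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(completed.stdout)
```

Passing the arguments as a list avoids shell quoting problems when the image path contains spaces.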
Output
{
  "text": "Request work\nSuggestions\nPlumbing\nHVAC\nCleaning\nElectrical",
  "confidence": 87.4,
  "words": [
    {
      "text": "Request",
      "confidence": 94.2,
      "bbox": { "x0": 142, "y0": 180, "x1": 268, "y1": 204 }
    },
    {
      "text": "work",
      "confidence": 96.1,
      "bbox": { "x0": 274, "y0": 180, "x1": 332, "y1": 204 }
    }
  ],
  "lines": [
    {
      "text": "Request work",
      "confidence": 95.1,
      "bbox": { "x0": 142, "y0": 180, "x1": 332, "y1": 204 }
    }
  ]
}
| Field | Type | Description |
| --- | --- | --- |
| text | String | Full extracted text, newline-separated |
| confidence | Number | Overall confidence score (0-100) |
| words | Array | Each word with text, confidence, and bounding box |
| lines | Array | Each line with text, confidence, and bounding box |
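To turn a bounding box into a size or position, subtract the corner coordinates. The sketch below assumes that (x0, y0) is the top-left corner and (x1, y1) the bottom-right corner, both in image pixels. This is the usual Tesseract convention, but the output description above does not state it. `bbox_size` and `bbox_center` are hypothetical helper names.

```python
def bbox_size(bbox: dict) -> tuple[int, int]:
    """Return (width, height), assuming x0/y0 is top-left and x1/y1 is bottom-right."""
    return bbox["x1"] - bbox["x0"], bbox["y1"] - bbox["y0"]


def bbox_center(bbox: dict) -> tuple[float, float]:
    """Return the center point (x, y) of the box."""
    return (bbox["x0"] + bbox["x1"]) / 2, (bbox["y0"] + bbox["y1"]) / 2


# The two word entries from the sample output above.
words = [
    {"text": "Request", "bbox": {"x0": 142, "y0": 180, "x1": 268, "y1": 204}},
    {"text": "work", "bbox": {"x0": 274, "y0": 180, "x1": 332, "y1": 204}},
]

for word in words:
    width, height = bbox_size(word["bbox"])
    print(f"{word['text']}: {width}x{height} px")
# prints:
# Request: 126x24 px
# work: 58x24 px
```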
Present Results to User
After extracting text, present the content grouped by lines:
Extracted text (87.4% confidence):
Request work
Suggestions
Plumbing
HVAC
Cleaning
Electrical
Found 6 lines, 7 words.
Use the extracted text directly when implementing UI copy from a design.
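The summary format above could be produced from the parsed JSON roughly as follows. This is a sketch: `format_results` and `plural` are hypothetical helper names, the counts are taken from the lengths of the `lines` and `words` arrays, and the sample input is trimmed to a single line for brevity.

```python
def plural(count: int, noun: str) -> str:
    """Return e.g. '1 line' or '2 words'."""
    return f"{count} {noun}" if count == 1 else f"{count} {noun}s"


def format_results(result: dict) -> str:
    """Render the OCR result as a header, one row per line, and a count footer."""
    # Strip each line in case the OCR text carries a trailing newline.
    lines = [line["text"].strip() for line in result["lines"]]
    header = f"Extracted text ({result['confidence']}% confidence):"
    footer = f"Found {plural(len(lines), 'line')}, {plural(len(result['words']), 'word')}."
    return "\n".join([header, *lines, footer])


# Trimmed illustrative input: one line, two words.
sample = {
    "text": "Request work",
    "confidence": 95.1,
    "words": [{"text": "Request"}, {"text": "work"}],
    "lines": [{"text": "Request work"}],
}

print(format_results(sample))
# prints:
# Extracted text (95.1% confidence):
# Request work
# Found 1 line, 2 words.
```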
Troubleshooting
Low confidence / garbled text — Tesseract works best with clean, high-contrast text. Screenshots of rendered UI work well. Photos of text at angles or with noise may produce poor results.
Wrong language — Pass the correct language code as the second argument. Tesseract needs the right language model to recognize characters.
First run is slow — Tesseract downloads language data (~4MB for English) on the first run. Subsequent runs are faster.
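When overall confidence is low, it can help to see which individual words Tesseract was least sure about before trusting the text. The sketch below filters the `words` array by confidence. The threshold of 60 is an arbitrary assumption rather than a value recommended by the skill, `low_confidence_words` is a hypothetical helper name, and the third sample word is invented to illustrate a misread.

```python
def low_confidence_words(result: dict, threshold: float = 60.0) -> list[str]:
    """Return the text of every word whose confidence is below the threshold.

    The default threshold is an assumption; tune it for your images.
    """
    return [w["text"] for w in result["words"] if w["confidence"] < threshold]


# Illustrative input: the last entry is an invented misread of "Electrical".
sample = {
    "words": [
        {"text": "Request", "confidence": 94.2},
        {"text": "work", "confidence": 96.1},
        {"text": "Eiectrical", "confidence": 41.3},
    ]
}

print(low_confidence_words(sample))
# prints: ['Eiectrical']
```

Words flagged this way are good candidates for a manual check against the source image.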