wechat-article-extractor

Name: wechat-article-extractor
Author: freestylefly

Parse WeChat Official Account articles to extract metadata, content, and account information.

Extracts article metadata (title, author, publish time, cover image) and account info (name, avatar, alias) from WeChat URLs and HTML

Supports multiple article types: posts, videos, images, voice messages, text, and reposts

Configurable extraction options for content, raw metadata, repost info, embedded links, and tags

Handles error cases including deleted content, expired links, rate limiting, account migration, and copyright blocks

Works with both direct WeChat URLs (mp.weixin.qq.com) and Sogou WeChat search results

INSTALLATION

npx skills add https://github.com/freestylefly/wechat-article-extractor-skill --skill wechat-article-extractor

Run in your project or agent environment. Adjust flags if your CLI version differs.

SKILL.md

WeChat Article Extractor

Extract metadata and content from WeChat Official Account (微信公众号) articles.

Capabilities

Parse WeChat article URLs (mp.weixin.qq.com)

Extract article metadata: title, author, description, publish time

Extract account info: name, avatar, alias, description

Get article content (HTML)

Get cover image URL

Support multiple article types: post, video, image, voice, text, repost

Handle various error cases: deleted content, expired links, access limits

Usage

Basic Extraction from URL

const { extract } = require('./scripts/extract.js');

// Call from inside an async function; CommonJS modules do not allow top-level await.
const result = await extract('https://mp.weixin.qq.com/s?__biz=...');
// Returns: { done: true, code: 0, data: {...} }

Extraction from HTML

// sourceUrl is the article URL the HTML was fetched from
const html = await fetch(sourceUrl).then(r => r.text());
const result = await extract(html, { url: sourceUrl });

Options

const result = await extract(url, {
  shouldReturnContent: true,      // Return HTML content (default: true)
  shouldReturnRawMeta: false,     // Return raw metadata (default: false)
  shouldFollowTransferLink: true, // Follow migrated account links (default: true)
  shouldExtractMpLinks: false,    // Extract embedded mp.weixin links (default: false)
  shouldExtractTags: false,       // Extract article tags (default: false)
  shouldExtractRepostMeta: false  // Extract repost source info (default: false)
});
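As an illustration of combining these flags, the sketch below wraps `extract` in a metadata-only helper. The helper name `extractMetadataOnly` is hypothetical and not part of the skill; `extract` is passed in as a parameter so the snippet makes no assumption about the module path. The option names are the ones listed above, and the effect of `shouldReturnContent: false` on the payload is assumed from its description.

```javascript
// Hypothetical helper (not part of the skill): request metadata without the HTML body.
// `extract` is injected by the caller, e.g. the function exported by scripts/extract.js.
async function extractMetadataOnly(extract, urlOrHtml, overrides = {}) {
  const options = {
    shouldReturnContent: false, // ask the extractor not to return HTML content
    shouldExtractTags: true,    // also request article tags
    ...overrides,               // caller-supplied flags take precedence
  };
  return extract(urlOrHtml, options);
}
```

A call might look like `await extractMetadataOnly(extract, articleUrl, { shouldExtractRepostMeta: true })`, where `articleUrl` is whatever link the caller already has.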

Response Format

Success Response

{
  done: true,
  code: 0,
  data: {
    // Account info
    account_name: "Official Account name",
    account_alias: "WeChat ID",
    account_avatar: "avatar URL",
    account_description: "account description",
    account_id: "original ID",
    account_biz: "biz parameter",
    account_biz_number: 1234567890,
    account_qr_code: "QR code URL",

    // Article info
    msg_title: "article title",
    msg_desc: "article summary",
    msg_content: "HTML content",
    msg_cover: "cover image URL",
    msg_author: "author",
    msg_type: "post", // post|video|image|voice|text|repost
    msg_has_copyright: true,
    msg_publish_time: Date, // a Date value
    msg_publish_time_str: "2024/01/15 10:30:00",

    // Link params
    msg_link: "article link",
    msg_source_url: "source URL (the 阅读原文 / Read more link)",
    msg_sn: "sn parameter",
    msg_mid: 1234567890,
    msg_idx: 1
  }
}
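A caller will usually want to check `done` before touching `data`, and to tolerate missing fields. The following is a minimal sketch of that pattern; `summarizeArticle` and the shape of its return value are hypothetical, and the fallback from `msg_author` to `account_name` is a design choice made for this example, not documented behaviour of the extractor.

```javascript
// Hypothetical helper: reduce an extract() result to a small summary object.
// Field names follow the success response documented above.
function summarizeArticle(result) {
  if (!result || result.done !== true || !result.data) {
    return null; // not a success response
  }
  const d = result.data;
  return {
    title: d.msg_title ?? '(untitled)',
    // Assumption for this sketch: fall back to the account name when no author is set.
    author: d.msg_author ?? d.account_name ?? '(unknown)',
    account: d.account_name ?? null,
    type: d.msg_type ?? null,
    // Prefer the Date value; otherwise use the preformatted string if present.
    publishedAt: d.msg_publish_time instanceof Date
      ? d.msg_publish_time.toISOString()
      : d.msg_publish_time_str ?? null,
    hasContent: typeof d.msg_content === 'string' && d.msg_content.length > 0,
  };
}
```

Returning `null` for anything other than a success response keeps error handling in one place: the caller can branch on `result.code` separately.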

Error Response

{
  done: false,
  code: 1001,
  msg: "无法获取文章信息" // "Unable to retrieve article information"
}

Error Codes

| Code | Message | Description |
| --- | --- | --- |
| 1000 | 文章获取失败 | General failure |
| 1001 | 无法获取文章信息 | Missing title or publish time |
| 1002 | 请求失败 | HTTP request failed |
| 1003 | 响应为空 | Empty response |
| 1004 | 访问过于频繁 | Rate limited |
| 1005 | 脚本解析失败 | Script parsing error |
| 1006 | 公众号已迁移 | Account migrated |
| 2001 | 请提供文章内容或链接 | Missing input |
| 2002 | 链接已过期 | Link expired |
| 2003 | 内容涉嫌侵权 | Content removed (copyright) |
| 2004 | 无法获取迁移后的链接 | Migration link failed |
| 2005 | 内容已被发布者删除 | Content deleted by author |
| 2006 | 内容因违规无法查看 | Content blocked |
| 2007 | 内容发送失败 | Failed to send |
| 2008 | 系统出错 | System error |
| 2009 | 不支持的链接 | Unsupported URL |
| 2010 | 内容获取失败 | Content fetch failed |
| 2011 | 涉嫌过度营销 | Marketing/spam content |
| 2012 | 账号已被屏蔽 | Account blocked |
| 2013 | 账号已自主注销 | Account deleted |
| 2014 | 内容被投诉 | Content reported |
| 2015 | 账号处于迁移流程中 | Account migrating |
| 2016 | 冒名侵权 | Impersonation |
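One way to act on the codes listed above is to group them by how a caller should react. The sketch below is an assumption-laden example, not the skill's own policy: the grouping into retryable and input-error codes, the names `classifyError` and `extractWithRetry`, and the backoff parameters are all hypothetical. It also assumes `extract` resolves to the documented error object rather than throwing.

```javascript
// Assumed grouping of the documented codes; adjust to your own policy.
const RETRYABLE_CODES = new Set([1002, 1003, 1004]); // request failed, empty response, rate limited
const INPUT_ERROR_CODES = new Set([2001, 2009]);     // missing input, unsupported URL

// Map an extract() result to one of: 'ok', 'retry', 'bad-input', 'permanent'.
function classifyError(result) {
  if (result && result.done === true) return 'ok';
  const code = result ? result.code : undefined;
  if (RETRYABLE_CODES.has(code)) return 'retry';
  if (INPUT_ERROR_CODES.has(code)) return 'bad-input';
  return 'permanent'; // every other code is treated as non-retryable in this sketch
}

// Call extract() up to `attempts` times, backing off exponentially between tries.
// `extract` is injected by the caller so no module path is assumed.
async function extractWithRetry(extract, input, { attempts = 3, baseDelayMs = 1000 } = {}) {
  let result;
  for (let i = 0; i < attempts; i++) {
    result = await extract(input);
    if (classifyError(result) !== 'retry') return result;
    if (i < attempts - 1) {
      // Wait baseDelayMs, then 2x, then 4x, and so on before the next attempt.
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
    }
  }
  return result; // still a retryable failure after the last attempt
}
```

Whether code 1000 (general failure) deserves a retry is a judgment call; here it is left in the non-retryable group to avoid hammering the server.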

Dependencies

Required npm packages:

cheerio - HTML parsing

dayjs - Date formatting

request-promise - HTTP requests

qs - Query string parsing

lodash.unescape - HTML entity unescaping

Notes

Handles various WeChat page structures and anti-scraping measures

Automatically detects article type from page content

Supports extracting from Sogou WeChat search results (weixin.sogou.com)

Some fields may be null depending on article type and page structure
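Since `extract` accepts either a URL or raw HTML, a caller may want to pre-screen string input before deciding which path to take. The sketch below checks a string against the two hosts named in this document. The function name `isSupportedArticleUrl` is hypothetical, and the host list is an assumption drawn only from the text above; the extractor itself may accept or reject other URLs.

```javascript
// Hosts named in this document; anything else is rejected by this sketch.
const SUPPORTED_HOSTS = new Set(['mp.weixin.qq.com', 'weixin.sogou.com']);

// Return true when `input` is an absolute http(s) URL on a supported host.
function isSupportedArticleUrl(input) {
  let parsed;
  try {
    parsed = new URL(input); // WHATWG URL, available globally in Node.js and browsers
  } catch {
    return false; // not an absolute URL (for example, raw HTML)
  }
  const isHttp = parsed.protocol === 'http:' || parsed.protocol === 'https:';
  return isHttp && SUPPORTED_HOSTS.has(parsed.hostname);
}
```

A `false` result does not mean the input is invalid: it may simply be HTML that should be passed to `extract` together with a `url` option.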