{"id":53733,"date":"2025-02-16T16:10:08","date_gmt":"2025-02-16T08:10:08","guid":{"rendered":"https:\/\/fwq.ai\/blog\/53733\/"},"modified":"2025-02-16T16:10:08","modified_gmt":"2025-02-16T08:10:08","slug":"markitdown%e6%b7%b1%e5%85%a5%e7%a0%94%e7%a9%b6","status":"publish","type":"post","link":"https:\/\/fwq.ai\/blog\/53733\/","title":{"rendered":"MarkItDown: A Deep Dive"},"content":{"rendered":"<p>MarkItDown is a Python package developed by Microsoft that converts a wide range of file formats to Markdown.<\/p>\n<p>Since its debut, the library has surged in popularity, earning more than 25,000 GitHub stars in just two weeks! \ud83e\udd2f<\/p>\n<h2>1. What makes MarkItDown so popular?<\/h2>\n<p>MarkItDown offers strong support for many file types, for example:<\/p>\n<ul>\n<li>Office formats: Word, PowerPoint, Excel<\/li>\n<li>Media files: images (with EXIF data and descriptions), audio (with transcription support)<\/li>\n<li>Web and data formats: HTML, JSON, XML, CSV<\/li>\n<li>Archives: ZIP files<\/li>\n<\/ul>\n<p>What sets it apart is that it handles not only standard formats such as Word but also multimodal data. For example, it uses OCR and speech recognition to extract content from image and audio files.<\/p>\n<p>The ability to convert almost anything to Markdown makes MarkItDown a powerful tool for LLM 
training. By processing domain-specific documents, it supplies rich context that helps LLM-powered applications generate more accurate and relevant responses.<\/p>\n<h2>2. Getting started with MarkItDown<\/h2>\n<p>Getting started with MarkItDown is simple, taking just four lines of code:<\/p>\n<pre><code>from markitdown import MarkItDown\n\nmd = MarkItDown()\nresult = md.convert(\"test.xlsx\")\nprint(result.text_content)<\/code><\/pre>\n<p>Here are some use cases for MarkItDown.<\/p>\n<p>Converting a Word document produces clean, accurate Markdown.<\/p>\n<p>Even multi-sheet Excel spreadsheets are handled with ease.<\/p>\n<p>ZIP archives? No problem! The library recursively parses every file inside.<\/p>\n<p>Out of the box, image extraction may return nothing.<\/p>\n<p>This is because MarkItDown relies on an LLM to generate image descriptions. You can enable this feature by passing in an LLM client:<\/p>\n<pre><code>from markitdown import MarkItDown\nfrom openai import OpenAI\n\n# Placeholder key; supply your own.\nclient = OpenAI(api_key=\"i-am-not-an-api-key\")\n\nmd = MarkItDown(llm_client=client, llm_model=\"gpt-4o\")<\/code><\/pre>\n<p>Once configured, image files are processed successfully.<\/p>\n<p>Note: the LLM is not applied to image-based PDFs. Such PDFs need OCR preprocessing before their content can be extracted.<\/p>\n<p>In addition, PDFs 
lose their formatting during extraction, so headings cannot be told apart from plain text.<\/p>\n<h2>3. Limitations<\/h2>\n<p>MarkItDown is not without limitations:<\/p>\n<ul>\n<li>It cannot process image-based PDF files unless they have been run through OCR first.<\/li>\n<li>Formatting is not preserved when extracting from PDF files.<\/li>\n<\/ul>\n<p>Still, as an open-source project it is highly customizable. Because the codebase is compact, developers can easily extend its functionality.<\/p>\n<h2>4. How MarkItDown works<\/h2>\n<p>MarkItDown's architecture is simple and modular. Its core logic lives entirely in a single file.<\/p>\n<p>It has a <code>DocumentConverter<\/code> class that defines a generic <code>convert()<\/code> method:<\/p>\n<pre><code>class DocumentConverter:\n    \"\"\"Base class for all document converters.\"\"\"\n\n    def convert(\n        self, local_path: str, **kwargs: Any\n    ) -&gt; Union[None, DocumentConverterResult]:\n        raise NotImplementedError()<\/code><\/pre>\n<p>Individual converters inherit from this base class and are registered dynamically:<\/p>\n<pre><code>self.register_page_converter(PlainTextConverter())\nself.register_page_converter(HtmlConverter())\nself.register_page_converter(DocxConverter())\nself.register_page_converter(XlsxConverter())\nself.register_page_converter(Mp3Converter())\nself.register_page_converter(ImageConverter())\n# 
...<\/code><\/pre>\n<p>This modular approach makes it easy to add support for new file types.<\/p>\n<h2>5. File conversion workflows<\/h2>\n<ul>\n<li>Office documents<\/li>\n<\/ul>\n<p>Office files are converted to HTML using libraries such as mammoth, pandas, or python-pptx, and then to Markdown using BeautifulSoup.<\/p>\n<ul>\n<li>Audio files<\/li>\n<\/ul>\n<p>Audio is transcribed with the SpeechRecognition library, which uses Google's API.<\/p>\n<p>(Microsoft, why not use Azure?)<\/p>\n<ul>\n<li>Images<\/li>\n<\/ul>\n<p>Image processing generates a caption by prompting an LLM:<\/p>\n<blockquote><p>\n  \u201cWrite a detailed caption for this image.\u201d\n<\/p><\/blockquote>\n<ul>\n<li>PDF<\/li>\n<\/ul>\n<p>PDFs are handled by the pdfminer library, which has no built-in OCR. Scanned PDFs must be preprocessed with OCR before text can be extracted.<\/p>\n<h2>6. Deploying MarkItDown as an API<\/h2>\n<p>MarkItDown can run locally, but hosting it as an API unlocks extra flexibility and makes it easy to integrate into workflow tools such as Zapier and n8n.<\/p>\n<p>Here is a simple example of a MarkItDown API built with FastAPI:<\/p>\n<pre><code>import os\nimport shutil\nfrom uuid import uuid4\n\nfrom fastapi import FastAPI, UploadFile\nfrom markitdown import MarkItDown\n\nmd = MarkItDown()\n\napp = FastAPI()\n\n# Note: FastAPI needs the python-multipart package for file uploads.\n@app.post(\"\/convert\")\nasync def convert_markdown(file: UploadFile):\n    # Each request gets its own temporary directory.\n    temp_dir = f\".\/temp\/{uuid4()}\"\n    
os.makedirs(temp_dir, exist_ok=True)\n\n    try:\n        # Use only the base name so the upload cannot escape temp_dir.\n        file_name = os.path.basename(file.filename or \"\") or \"upload\"\n        file_path = os.path.join(temp_dir, file_name)\n        with open(file_path, \"wb\") as f:\n            shutil.copyfileobj(file.file, f)\n        result = md.convert(file_path)\n        content = result.text_content\n    finally:\n        # Clean up even if the conversion fails.\n        shutil.rmtree(temp_dir, ignore_errors=True)\n\n    return {\"result\": content}<\/code><\/pre>\n<p>To call the API:<\/p>\n<pre><code>\/\/ `file` is assumed to be a File or Blob, e.g. from a file input.\nconst formData = new FormData();\nformData.append('file', file);\n\nconst response = await fetch('http:\/\/localhost:8000\/convert', {\n  method: 'POST',\n  body: formData,\n});\nconst { result } = await response.json();<\/code><\/pre>\n<h2>7. Hosting the API for free<\/h2>\n<p>Hosting a Python API can be tricky. Traditional services such as AWS EC2 or DigitalOcean require renting an entire server, which tends to be expensive.<\/p>\n<p>But now you can use a platform that hosts Python codebases in a serverless fashion: it charges only per API call and offers a generous free tier.<\/p>\n<p>Just connect your GitHub repository, define the build and start commands, and you are all set.<\/p>\n<p>You now have a MarkItDown API hosted in the cloud that you can integrate into your workflows and, best of all, it only costs money when it is actually called.<\/p>\n<hr>\n","protected":false},"excerpt":{"rendered":"<p>MarkItDown is a Python package developed by Microsoft that converts a wide range of file formats to Markdown. 
Since its debut, the library has surged in popularity, earning more than 25,000 GitHub stars in just two weeks! \ud83e\udd2f 1. What makes MarkItDown so popular? MarkItDown offers strong support for many file types, for example: Office formats: Word, PowerPoint, Excel Media files: images (with EXIF data and descriptions), audio (with transcription support) Web and data formats: HTML, JSON, XML, CSV Archives: ZIP files What sets it apart is that it handles not only standard formats such as Word but also multimodal data. For example, it uses OCR and speech recognition to extract content from image and audio files. The ability to convert almost anything to Markdown makes MarkItDown a powerful tool for LLM training. By processing domain-specific documents, it supplies rich context that helps LLM-powered applications generate more accurate and relevant responses. 2. Getting started with MarkItDown Getting started with MarkItDown is simple, taking just four lines of code: from markitdown import MarkItDown md = MarkItDown() 
[&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[13],"tags":[],"class_list":["post-53733","post","type-post","status-publish","format-standard","hentry","category-ai"],"_links":{"self":[{"href":"https:\/\/fwq.ai\/blog\/wp-json\/wp\/v2\/posts\/53733","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fwq.ai\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fwq.ai\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/fwq.ai\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/fwq.ai\/blog\/wp-json\/wp\/v2\/comments?post=53733"}],"version-history":[{"count":0,"href":"https:\/\/fwq.ai\/blog\/wp-json\/wp\/v2\/posts\/53733\/revisions"}],"wp:attachment":[{"href":"https:\/\/fwq.ai\/blog\/wp-json\/wp\/v2\/media?parent=53733"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fwq.ai\/blog\/wp-json\/wp\/v2\/categories?post=53733"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fwq.ai\/blog\/wp-json\/wp\/v2\/tags?post=53733"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}