
MegaParse - Your Mega Parser for Every Type of Document


MegaParse is a powerful and versatile parser that can handle various types of documents with ease. Whether you're dealing with text, PDFs, PowerPoint presentations, or Word documents, MegaParse has got you covered. Its focus is on losing no information during parsing.

Key Features 🎯

  • Versatile Parser: Handles various types of documents with ease.
  • No Information Loss: Focuses on losing no information during parsing.
  • Fast and Efficient: Designed with speed and efficiency at its core.
  • Wide File Compatibility: Supports text, PDF, PowerPoint presentations, Excel, CSV, and Word documents.
  • Open Source: Freedom is beautiful, and so is MegaParse. Open source and free to use.

Support

  • Files: PDF, PowerPoint, Word
  • Content: Tables, TOC, Headers, Footers, Images

Example

https://github.com/QuivrHQ/MegaParse/assets/19614572/1b4cdb73-8dc2-44ef-b8b4-a7509bc8d4f3

Installation

pip install megaparse

Usage

  1. Add your OpenAI API key to the .env file.

  2. Install poppler on your computer (needed for images and PDFs).

  3. Install tesseract on your computer (needed for images and PDFs).

from megaparse import MegaParse

# Parse the file into a single document
megaparse = MegaParse(file_path="./test.pdf")
document = megaparse.load()
print(document.page_content)

# Write the parsed content to a Markdown file
megaparse.save_md(document.page_content, "./test.md")
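
The three setup steps above can be verified before the first parse. The following is a minimal, standalone sketch and not part of MegaParse. It assumes the key is exposed as the OPENAI_API_KEY environment variable (the name the OpenAI client reads by default), that poppler can be detected through its pdftoppm utility, and that the tesseract binary is on the PATH. The helper name check_prerequisites is hypothetical. It inspects the process environment only, so values from the .env file must already be loaded.

```python
import os
import shutil


def check_prerequisites(env=None, which=shutil.which):
    """Return the names of missing prerequisites (an empty list means all found).

    Assumptions, not confirmed by the README: the key is read from
    OPENAI_API_KEY, and poppler is detected via its `pdftoppm` utility.
    """
    env = os.environ if env is None else env
    missing = []

    # Step 1: the OpenAI API key must be set and non-empty.
    if not env.get("OPENAI_API_KEY"):
        missing.append("OPENAI_API_KEY")

    # Steps 2 and 3: the poppler and tesseract executables must be on the PATH.
    for tool in ("pdftoppm", "tesseract"):
        if which(tool) is None:
            missing.append(tool)

    return missing


if __name__ == "__main__":
    missing = check_prerequisites()
    if missing:
        print("Missing prerequisites:", ", ".join(missing))
    else:
        print("All prerequisites found.")
```

Passing env and which as parameters keeps the check easy to exercise without touching the real environment.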

(Optional) Use LlamaParse for Improved Results

  1. Create an account on Llama Cloud and get your API key.

  2. Pass your key to MegaParse through the llama_parse_api_key parameter.

from megaparse import MegaParse

# Replace the placeholder with your own Llama Cloud API key
megaparse = MegaParse(file_path="./test.pdf", llama_parse_api_key="llx-your_api_key")
document = megaparse.load()
print(document.page_content)
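
To keep the key out of source code, the constructor arguments can be built from the environment. This is a minimal sketch under stated assumptions: the key is stored in an environment variable named LLAMA_CLOUD_API_KEY (chosen here for illustration), and the helper megaparse_kwargs is hypothetical. Only file_path and llama_parse_api_key come from the examples above. When no key is set, the helper omits the parameter, so MegaParse would be called exactly as in the basic usage example.

```python
import os


def megaparse_kwargs(file_path, env=None):
    """Build keyword arguments for MegaParse, adding the LlamaParse key if present.

    LLAMA_CLOUD_API_KEY is an assumed variable name, not one the README specifies.
    """
    env = os.environ if env is None else env
    kwargs = {"file_path": file_path}

    # Only pass the optional parameter when a non-empty key is available.
    key = env.get("LLAMA_CLOUD_API_KEY", "").strip()
    if key:
        kwargs["llama_parse_api_key"] = key

    return kwargs


if __name__ == "__main__":
    # Show which arguments would be passed, without printing the key itself.
    print(sorted(megaparse_kwargs("./test.pdf")))
    # With megaparse installed, the arguments would be used as:
    #   from megaparse import MegaParse
    #   document = MegaParse(**megaparse_kwargs("./test.pdf")).load()
```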

Benchmark

| Parser                                   | Diff |
|------------------------------------------|------|
| LMM megaparse                            | 36   |
| Megaparse with LLamaParse and GPTCleaner | 74   |
| Megaparse with LLamaParse                | 97   |
| Unstructured Augmented Parse             | 99   |
| LLama Parse                              | 102  |
| Megaparse                                | 105  |

Lower is better
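
The README does not state how the Diff column is computed. Purely as an illustration of why a lower value is better, the sketch below assumes a line-level comparison between a parser's Markdown output and a hand-checked reference, counting the lines that differ. The function line_diff_count and the metric itself are assumptions, not the project's benchmark code.

```python
import difflib


def line_diff_count(reference: str, output: str) -> int:
    """Count lines that differ between a reference text and a parser's output.

    Hypothetical metric: removed, added, and replaced lines each count once
    per line. A score of 0 means the output matches the reference exactly.
    """
    ref_lines = reference.splitlines()
    out_lines = output.splitlines()

    matcher = difflib.SequenceMatcher(None, ref_lines, out_lines, autojunk=False)
    score = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Lines dropped from the reference plus lines added by the parser.
            score += (i2 - i1) + (j2 - j1)
    return score


if __name__ == "__main__":
    reference = "# Title\n\nFirst paragraph.\nSecond paragraph."
    output = "# Title\n\nFirst paragraph."
    print(line_diff_count(reference, output))
```

Under this assumed metric, a parser that drops or mangles more lines moves further from the reference and scores higher.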

Next Steps

  • Improve Table Parsing
  • Improve Image Parsing and Description
  • Add TOC for Docx
  • Add Hyperlinks for Docx
  • Order Headers for Docx to Markdown
  • Add Rye package manager
