mirror of
https://github.com/QuivrHQ/quivr.git
synced 2024-12-15 01:21:48 +03:00
ef90e8e672
# Description
Major PR which, among other things, introduces the possibility of easily
customizing the retrieval workflows. Workflows are based on LangGraph
and can be customized using a [YAML configuration
file](core/tests/test_llm_endpoint.py) and by adding the implementation of
the node logic to
[quivr_rag_langgraph.py](1a0c98437a/backend/core/quivr_core/quivr_rag_langgraph.py).
This is a first, simple implementation that will significantly evolve in
the coming weeks to enable more complex workflows (for instance, with
conditional nodes). We also plan to adopt a similar approach for the
ingestion part, i.e. to enable users to easily customize the ingestion
pipeline.
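As a rough illustration of the idea above (node order comes from configuration, node logic lives in code), here is a minimal, library-free sketch. It does not use LangGraph or the actual Quivr configuration schema; `NODE_REGISTRY`, `node`, `run_workflow`, the node names, and the config layout are all hypothetical and chosen only for this example.

```python
from typing import Any, Callable, Dict

State = Dict[str, Any]

# Hypothetical registry mapping node names (as they might appear in a config
# file) to the Python functions that implement them.
NODE_REGISTRY: Dict[str, Callable[[State], State]] = {}


def node(name: str):
    """Register a function as a workflow node under the given name."""
    def decorator(fn: Callable[[State], State]) -> Callable[[State], State]:
        NODE_REGISTRY[name] = fn
        return fn
    return decorator


@node("rewrite_question")
def rewrite_question(state: State) -> State:
    # Placeholder logic: normalise the question text.
    return {**state, "question": state["question"].strip().lower()}


@node("retrieve")
def retrieve(state: State) -> State:
    # Placeholder retrieval: keep documents sharing a word with the question.
    words = set(state["question"].split())
    docs = [d for d in state["corpus"] if words & set(d.lower().split())]
    return {**state, "docs": docs}


def run_workflow(config: dict, state: State) -> State:
    """Run the nodes in the order listed in the config (linear, no branching)."""
    for name in config["workflow"]["nodes"]:
        state = NODE_REGISTRY[name](state)
    return state


# Stand-in for what a parsed YAML file might contain (assumed layout).
config = {"workflow": {"nodes": ["rewrite_question", "retrieve"]}}
result = run_workflow(
    config,
    {
        "question": "  How does LangGraph work ",
        "corpus": ["LangGraph builds graphs", "Unrelated text"],
    },
)
```

Reordering or dropping entries in `config["workflow"]["nodes"]` changes the pipeline without touching the node functions, which is the kind of customization the description refers to.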
Closes CORE-195, CORE-203, CORE-204
## Checklist before requesting a review
Please delete options that are not relevant.
- [X] My code follows the style guidelines of this project
- [X] I have performed a self-review of my code
- [X] I have commented hard-to-understand areas
- [X] I have ideally added tests that prove my fix is effective or that
my feature works
- [X] New and existing unit tests pass locally with my changes
- [X] Any dependent changes have been merged
## Screenshots (if appropriate):
MegaParse - Your Mega Parser for every type of document
MegaParse is a powerful and versatile parser that can handle various types of documents with ease. Whether you're dealing with text, PDFs, PowerPoint presentations, or Word documents, MegaParse has got you covered. Its focus is on losing no information during parsing.
Key Features 🎯
- Versatile Parser: MegaParse is a powerful and versatile parser that can handle various types of documents with ease.
- No Information Loss: Focuses on losing no information during parsing.
- Fast and Efficient: Designed with speed and efficiency at its core.
- Wide File Compatibility: Supports text, PDF, PowerPoint presentations, Excel, CSV, and Word documents.
- Open Source: Freedom is beautiful, and so is MegaParse. Open source and free to use.
Support
- Files: ✅ PDF ✅ PowerPoint ✅ Word
- Content: ✅ Tables ✅ TOC ✅ Headers ✅ Footers ✅ Images
Example
https://github.com/QuivrHQ/MegaParse/assets/19614572/1b4cdb73-8dc2-44ef-b8b4-a7509bc8d4f3
Installation
pip install megaparse
Usage
- Add your OpenAI API key to the .env file
- Install poppler on your computer (images and PDFs)
- Install tesseract on your computer (images and PDFs)
from megaparse import MegaParse
megaparse = MegaParse(file_path="./test.pdf")
document = megaparse.load()
print(document.page_content)
megaparse.save_md(document.page_content, "./test.md")
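Before running the snippet above, it can help to confirm that the listed prerequisites are in place. The following is a minimal sketch using only the standard library. It assumes the key is exposed as the `OPENAI_API_KEY` environment variable and that poppler and tesseract provide the `pdftoppm` and `tesseract` executables on the PATH. `check_prerequisites` is a hypothetical helper and not part of MegaParse. It inspects the process environment only, so a key that exists solely in the .env file is reported as missing unless something has already loaded that file.

```python
import os
import shutil


def check_prerequisites(env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means all good."""
    env = os.environ if env is None else env
    problems = []

    # Assumption: the OpenAI key is exposed under this variable name.
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")

    # pdftoppm ships with poppler; tesseract is the Tesseract OCR command.
    for binary, package in (("pdftoppm", "poppler"), ("tesseract", "tesseract")):
        if which(binary) is None:
            problems.append(f"{binary} not found on PATH (install {package})")

    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("missing:", problem)
```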
(Optional) Use LlamaParse for Improved Results
- Create an account on Llama Cloud and get your API key.
- Call MegaParse with the llama_parse_api_key parameter
from megaparse import MegaParse
megaparse = MegaParse(file_path="./test.pdf", llama_parse_api_key="llx-your_api_key")
document = megaparse.load()
print(document.page_content)
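To avoid hard-coding the key in source files, the constructor arguments can be assembled from the environment. This is a small sketch under stated assumptions: the variable name `LLAMA_CLOUD_API_KEY` is an assumption made here, and `build_megaparse_kwargs` is a hypothetical helper. Only the `file_path` and `llama_parse_api_key` parameters come from the examples above.

```python
import os


def build_megaparse_kwargs(file_path, env=None):
    """Build keyword arguments for MegaParse, adding the LlamaParse key if set."""
    env = os.environ if env is None else env
    kwargs = {"file_path": file_path}

    # Assumed variable name; adjust to wherever the key is actually stored.
    api_key = env.get("LLAMA_CLOUD_API_KEY")
    if api_key:
        kwargs["llama_parse_api_key"] = api_key
    return kwargs


# Usage (requires megaparse to be installed):
#   from megaparse import MegaParse
#   megaparse = MegaParse(**build_megaparse_kwargs("./test.pdf"))
```

When the variable is unset, the helper returns only `file_path`, so the call falls back to the basic usage shown earlier.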
Benchmark
| Parser | Diff |
|---|---|
| LMM megaparse | 36 |
| Megaparse with LLamaParse and GPTCleaner | 74 |
| Megaparse with LLamaParse | 97 |
| Unstructured Augmented Parse | 99 |
| LLama Parse | 102 |
| Megaparse | 105 |
Lower is better
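The table does not state how the Diff score is computed. One plausible way to obtain a "lower is better" number of this kind is to count the lines that differ between a reference Markdown file and a parser's output. The sketch below does that with the standard-library `difflib`; it is an assumption for illustration, not the benchmark's actual method, and `diff_count` is a hypothetical name.

```python
import difflib


def diff_count(reference: str, candidate: str) -> int:
    """Count lines added or removed when turning reference into candidate."""
    diff = difflib.ndiff(reference.splitlines(), candidate.splitlines())
    # ndiff prefixes removed lines with "- " and added lines with "+ ";
    # unchanged lines ("  ") and hint lines ("? ") are not counted.
    return sum(1 for line in diff if line.startswith(("+ ", "- ")))


reference_md = "# Title\n\nFirst paragraph.\n\n| a | b |"
parsed_md = "# Title\n\nFirst paragraph.\n\na b"
score = diff_count(reference_md, parsed_md)
```

Under this scheme an identical output scores 0, and each mangled line typically adds two to the score (one removal plus one addition).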
Next Steps
- Improve Table Parsing
- Improve Image Parsing and description
- Add TOC for Docx
- Add Hyperlinks for Docx
- Order Headers for Docx to Markdown
- Add Rye package manager