mirror of https://github.com/QuivrHQ/quivr.git synced 2024-12-14 07:59:00 +03:00

History

Stan Girard b767f19f28 feat(assistant): cdp (#3305) Co-authored-by: Zewed <dewez.antoine2@gmail.com>		2024-10-03 06:46:59 -07:00
.github/workflows	feat: introducing configurable retrieval workflows (#3227 )	2024-09-23 09:11:06 -07:00
images	feat: introducing configurable retrieval workflows (#3227 )	2024-09-23 09:11:06 -07:00
megaparse	feat(assistant): cdp (#3305 )	2024-10-03 06:46:59 -07:00
notebooks	feat: introducing configurable retrieval workflows (#3227 )	2024-09-23 09:11:06 -07:00
tests	feat(assistant): cdp (#3305 )	2024-10-03 06:46:59 -07:00
.env.example	feat: introducing configurable retrieval workflows (#3227 )	2024-09-23 09:11:06 -07:00
.gitattributes	feat: introducing configurable retrieval workflows (#3227 )	2024-09-23 09:11:06 -07:00
.gitignore	feat: introducing configurable retrieval workflows (#3227 )	2024-09-23 09:11:06 -07:00
.pre-commit-config.yaml	feat: introducing configurable retrieval workflows (#3227 )	2024-09-23 09:11:06 -07:00
.release-please-manifest.json	feat: introducing configurable retrieval workflows (#3227 )	2024-09-23 09:11:06 -07:00
Dockerfile	feat: introducing configurable retrieval workflows (#3227 )	2024-09-23 09:11:06 -07:00
LICENSE	feat: introducing configurable retrieval workflows (#3227 )	2024-09-23 09:11:06 -07:00
logo.png	feat: introducing configurable retrieval workflows (#3227 )	2024-09-23 09:11:06 -07:00
Makefile	feat: introducing configurable retrieval workflows (#3227 )	2024-09-23 09:11:06 -07:00
pyproject.toml	feat: introducing configurable retrieval workflows (#3227 )	2024-09-23 09:11:06 -07:00
README.md	feat: introducing configurable retrieval workflows (#3227 )	2024-09-23 09:11:06 -07:00
release-please-config.json	feat: introducing configurable retrieval workflows (#3227 )	2024-09-23 09:11:06 -07:00
requirements-dev.lock	feat: introducing configurable retrieval workflows (#3227 )	2024-09-23 09:11:06 -07:00
requirements.lock	feat: introducing configurable retrieval workflows (#3227 )	2024-09-23 09:11:06 -07:00

README.md

MegaParse - Your Mega Parser for every type of document

MegaParse is a powerful and versatile parser that can handle various types of documents with ease. Whether you're dealing with text, PDFs, PowerPoint presentations, or Word documents, MegaParse has got you covered. Its focus is on avoiding information loss during parsing.

Key Features 🎯

Versatile Parser: Handles various types of documents with ease.
No Information Loss: Parsing is designed to preserve as much of the source content as possible.
Fast and Efficient: Designed with speed and efficiency at its core.
Wide File Compatibility: Supports text, PDF, PowerPoint presentations, Excel, CSV, and Word documents.
Open Source: Freedom is beautiful, and so is MegaParse. Open source and free to use.

Support

Files: ✅ PDF ✅ PowerPoint ✅ Word
Content: ✅ Tables ✅ TOC ✅ Headers ✅ Footers ✅ Images

Example

https://github.com/QuivrHQ/MegaParse/assets/19614572/1b4cdb73-8dc2-44ef-b8b4-a7509bc8d4f3

Installation

pip install megaparse

Usage

Add your OpenAI API key to the .env file
Install poppler on your computer (needed for images and PDFs)
Install tesseract on your computer (needed for images and PDFs)
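
The three prerequisites above can be checked before the first run. The following is a minimal sketch, not part of MegaParse: the helper name `missing_prerequisites` is hypothetical, it assumes the key from the .env file has already been loaded into the environment under the conventional name `OPENAI_API_KEY`, and it assumes poppler and tesseract are detectable through their command-line tools (`pdftoppm` ships with poppler).

```python
import os
import shutil


def missing_prerequisites(env=None):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    env = os.environ if env is None else env
    problems = []

    # Assumption: the OpenAI key is exposed as OPENAI_API_KEY.
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")

    # pdftoppm is one of the poppler utilities; tesseract is the OCR engine's CLI.
    for tool in ("pdftoppm", "tesseract"):
        if shutil.which(tool) is None:
            problems.append(f"{tool} not found on PATH")

    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print("Missing:", problem)
```

Running the script prints one line per missing prerequisite and nothing when everything is found.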

from megaparse import MegaParse

megaparse = MegaParse(file_path="./test.pdf")
document = megaparse.load()
print(document.page_content)
megaparse.save_md(document.page_content, "./test.md")

(Optional) Use LlamaParse for Improved Results

Create an account on Llama Cloud and get your API key.
Pass the key to MegaParse through the llama_parse_api_key parameter.

from megaparse import MegaParse

megaparse = MegaParse(file_path="./test.pdf", llama_parse_api_key="llx-your_api_key")
document = megaparse.load()
print(document.page_content)
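
To keep the LlamaParse key out of source code, the constructor arguments can be built from the environment. This is only a sketch: the helper `megaparse_kwargs` and the variable name `LLAMA_CLOUD_API_KEY` are assumptions made for illustration, while `file_path` and `llama_parse_api_key` are the parameters shown in the examples above.

```python
import os


def megaparse_kwargs(file_path, env=None):
    """Build keyword arguments for MegaParse, adding the LlamaParse key only if present."""
    env = os.environ if env is None else env
    kwargs = {"file_path": file_path}

    # Assumption: the key is stored in LLAMA_CLOUD_API_KEY (hypothetical choice of name).
    key = env.get("LLAMA_CLOUD_API_KEY")
    if key:
        kwargs["llama_parse_api_key"] = key

    return kwargs


# Usage, with megaparse installed:
#   from megaparse import MegaParse
#   megaparse = MegaParse(**megaparse_kwargs("./test.pdf"))
#   document = megaparse.load()
```

When the variable is unset, the helper returns only `file_path`, so the same call works with and without LlamaParse.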

Benchmark

Parser	Diff
LMM MegaParse	36
MegaParse with LlamaParse and GPTCleaner	74
MegaParse with LlamaParse	97
Unstructured Augmented Parse	99
LlamaParse	102
MegaParse	105

Lower is better
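
The README does not say how the Diff column is computed. As an illustration of one lower-is-better metric, the sketch below counts the lines that differ between a reference Markdown file and a parser's output using Python's `difflib`. The function name `line_diff_count` is hypothetical, and this is not necessarily the method behind the numbers in the table.

```python
import difflib


def line_diff_count(reference, parsed):
    """Count lines that were added or removed between two texts (0 means identical)."""
    diff = difflib.ndiff(reference.splitlines(), parsed.splitlines())
    # ndiff prefixes removed lines with "- " and added lines with "+ ";
    # unchanged lines ("  ") and hint lines ("? ") are ignored.
    return sum(1 for line in diff if line.startswith(("+ ", "- ")))


if __name__ == "__main__":
    reference = "# Title\nrow 1\nrow 2"
    parsed = "# Title\nrow 1\nrow two"
    print(line_diff_count(reference, parsed))
```

A changed line counts twice under this scheme (one removal plus one addition), so scores from it are only comparable with each other.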

Next Steps

Improve Table Parsing
Improve Image Parsing and description
Add TOC for Docx
Add Hyperlinks for Docx
Order Headers for Docx to Markdown
Add Rye package manager