quivr

mirror of https://github.com/StanGirard/quivr.git synced 2024-11-27 10:20:32 +03:00

Pascal Gula 8bbe6e7054 Adds pytesseract, tesseract and poopler-utils (#1648)	2023-11-22 17:26:11 +01:00

To enable the ingestion of copy protected PDF via OCR instead of text extraction

# Description

Copy protected PDF can't be properly imported via the standard langchain loader. See the following errors:

```
2023-11-15 14:16:31,927 [INFO] models.files: Computing documents from file Cradle to Cradle Criteria for the built environmen.pdf
[nltk_data] Downloading package punkt to
[nltk_data] /home/pascal_gula_luccid_ai/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data] /home/pascal_gula_luccid_ai/nltk_data...
[nltk_data] Unzipping taggers/averaged_perceptron_tagger.zip.
Error processing file: detectron2 is not installed, pytesseract is not installed and the text of the PDF is not extractable. To process this file, install detectron2, install pytesseract, or remove copy protection from the PDF.
```

```
2023-11-15 15:04:14,624 [INFO] models.files: Computing documents from file Cradle to Cradle Criteria for the built environmen.pdf
Error processing file: Unable to get page count. Is poppler installed and in PATH?
```

```
023-11-15 15:59:11,886 [INFO] models.files: Computing documents from file Cradle to Cradle Criteria for the built environmen.pdf
Error processing file: tesseract is not installed or it's not in your PATH. See README file for more information.
```

## Checklist before requesting a review

Please delete options that are not relevant.

- [x] My code follows the style guidelines of this project
- [x] I have performed a self-review of my code
- [x] I have commented hard-to-understand areas
- [x] I have ideally added tests that prove my fix is effective or that my feature works
- [x] New and existing unit tests pass locally with my changes
- [x] Any dependent changes have been merged

## Screenshots (if appropriate):

None

Co-authored-by: Stan Girard <girard.stanislas@gmail.com>
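The three errors quoted in the commit message each come down to an executable missing from PATH. As a rough illustration only (this is not code from the commit), a pre-flight check along the following lines could report which tools are absent before a PDF is sent to OCR. The names `REQUIRED_TOOLS` and `missing_ocr_tools` are hypothetical; the sketch assumes that `pdfinfo` and `pdftoppm` are the poppler-utils binaries needed and that `tesseract` is the engine pytesseract calls.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# Assumed prerequisites: pdfinfo and pdftoppm ship with poppler-utils;
# tesseract is the OCR engine that pytesseract wraps.
REQUIRED_TOOLS = ("pdfinfo", "pdftoppm", "tesseract")


def missing_ocr_tools(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the names of the tools that cannot be found on PATH.

    `which` is injectable so the check can be exercised without
    touching the real environment.
    """
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_ocr_tools()
    if missing:
        print("Missing OCR prerequisites: " + ", ".join(missing))
    else:
        print("All OCR prerequisites found on PATH.")
```

Run inside the backend container, a check like this would show whether the packages added to the Dockerfile are visible to the Python process. It does not verify that the `pytesseract` Python package itself is importable.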
.vscode	feat: ⚙️🐞 configure debugger for the backend (#1345)	2023-10-09 15:23:13 +02:00
llm	feat: 🎸 tokens (#1678)	2023-11-22 08:47:51 +01:00
middlewares	refactor: add modules folder (#1633)	2023-11-15 13:17:51 +01:00
models	feat: allow updating api brain definition (#1682)	2023-11-22 11:15:14 +01:00
modules	feat: 🎸 openai (#1658)	2023-11-20 01:22:03 +01:00
packages	feat: 🎸 marketplace (#1657)	2023-11-19 18:46:12 +01:00
repository	feat: allow updating api brain definition (#1682)	2023-11-22 11:15:14 +01:00
routes	feat: allow updating api brain definition (#1682)	2023-11-22 11:15:14 +01:00
supabase/functions/add-new-email	feat(refacto): changed a bit of things to make better dx (#984)	2023-08-19 13:32:16 +02:00
tests	feat: 🎸 marketplace (#1657)	2023-11-19 18:46:12 +01:00
vectorstore	feat: 🎸 telegram	2023-11-01 22:33:47 +01:00
.dockerignore	feat: 🎸 docker reduced size by 2 (#1653)	2023-11-18 19:23:56 +01:00
celery_task.py	feat: 🎸 marketplace (#1657)	2023-11-19 18:46:12 +01:00
celery_worker.py	feat: 🎸 marketplace (#1657)	2023-11-19 18:46:12 +01:00
chat_service.py	refactor: packages folder be 2 (#1628)	2023-11-14 14:31:02 +01:00
crawl_service.py	refactor: packages folder be 2 (#1628)	2023-11-14 14:31:02 +01:00
Dockerfile	Adds pytesseract, tesseract and poopler-utils (#1648)	2023-11-22 17:26:11 +01:00
Dockerfile.dev	feat: 🎸 docker reduced size by 2 (#1653)	2023-11-18 19:23:56 +01:00
logger.py	feat(refacto): changed a bit of things to make better dx (#984)	2023-08-19 13:32:16 +02:00
main.py	feat: 🎸 docker reduced size by 2 (#1653)	2023-11-18 19:23:56 +01:00
pyrightconfig.json	feat(refacto): changed a bit of things to make better dx (#984)	2023-08-19 13:32:16 +02:00
requirements.txt	Adds pytesseract, tesseract and poopler-utils (#1648)	2023-11-22 17:26:11 +01:00
upload_service.py	refactor: packages folder be 2 (#1628)	2023-11-14 14:31:02 +01:00