Privacy Web Search Engine (not a metasearch engine; uses its own crawler)

Privacy Web Search Engine

Website

https://raw.githubusercontent.com/liameno/librengine/master/preview.gif

Features

Crawler

  • Cache
  • Robots.txt
  • Update site info after a configured interval
  • Proxy
  • Queue (BFS)
  • Detect trackers
  • HTTP to HTTPS
  • Normalize URL (remove #fragment, ?query)
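
The "HTTP to HTTPS" and "Normalize URL" items can be illustrated with a minimal Python sketch. It is not the crawler's actual implementation: the function name normalize_url and the http_to_https flag are hypothetical, and replacing an empty path with "/" is an extra assumption made here so that equivalent URLs compare equal.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str, http_to_https: bool = True) -> str:
    """Drop the ?query and #fragment parts; optionally upgrade http to https.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    if http_to_https and scheme == "http":
        scheme = "https"
    # Assumption: an empty path is treated the same as "/".
    path = parts.path or "/"
    # Empty strings for query and fragment remove them from the result.
    return urlunsplit((scheme, parts.netloc, path, "", ""))


print(normalize_url("http://www.gnu.org/software/?lang=en#top"))
```

Normalizing before a URL is queued or cached means that pages differing only in query string or fragment are treated as one entry.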

Website / CLI

  • Encryption (RSA)
  • API
  • Proxy
  • Node Info
  • Nodes
  • Rating (min=0, max=200, def=100)
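
The rating bounds above (min=0, max=200, def=100) suggest a clamped value. The sketch below only illustrates that clamping; how ratings are actually changed is not described here, so apply_vote and its delta parameter are hypothetical.

```python
RATING_MIN = 0
RATING_MAX = 200
RATING_DEFAULT = 100


def apply_vote(rating, delta):
    """Return the new rating, kept inside [RATING_MIN, RATING_MAX].

    A missing rating (None) starts from RATING_DEFAULT.
    Hypothetical helper for illustration only.
    """
    current = RATING_DEFAULT if rating is None else rating
    return max(RATING_MIN, min(RATING_MAX, current + delta))


print(apply_vote(None, 10))
```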

TODO

  • Encryption (asymmetric)
  • Robots Rules (from headers & html) & crawl-delay
  • Images Crawler
  • Adaptive Website

Dependencies

Arch:

yay -S curl lexbor openssl &&
curl -O https://dl.typesense.org/releases/0.23.0.rc20/typesense-server-0.23.0.rc20-linux-amd64.tar.gz &&
tar -xzf typesense-server-0.23.0.rc20-linux-amd64.tar.gz

Debian:

sudo apt install libcurl4-openssl-dev &&
curl -O https://dl.typesense.org/releases/0.23.0.rc20/typesense-server-0.23.0.rc20-linux-amd64.tar.gz &&
tar -xzf typesense-server-0.23.0.rc20-linux-amd64.tar.gz &&
git clone https://github.com/lexbor/lexbor && 
cd lexbor &&
cmake . && make && sudo make install &&
sudo apt install libssl-dev

Build

#git clone ...
cd librengine &&
sh scripts/build_all.sh

Run

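mkdir -p /tmp/typesense-data &&
./typesense-server --data-dir=/tmp/typesense-data --api-key=xyz --enable-cors &
#the server keeps running, so start it in the background (or in another terminal), wait until it is up, then:
sh scripts/init_db.sh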

CLI

./cli gnu 1 ../../config.json
#[query] [page] [config path]

Crawler

./crawler https://www.gnu.org ../../config.json
#[start_site] [config path]
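
Starting from start_site, the crawler visits pages through a queue (BFS, as listed under Features). The following is a minimal sketch of that traversal order, not the crawler's code: get_links stands in for fetching a page and extracting its links, and max_pages is a hypothetical cap loosely modeled on the max_pages_site setting.

```python
from collections import deque


def crawl_bfs(start_url, get_links, max_pages):
    """Visit pages breadth-first, starting at start_url.

    get_links(url) returns the outgoing links of a page (hypothetical callback).
    Stops after max_pages pages have been visited.
    """
    queue = deque([start_url])
    seen = {start_url}  # URLs already queued, so nothing is visited twice
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Tiny in-memory link graph used instead of real HTTP requests.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(crawl_bfs("a", lambda url: graph.get(url, []), max_pages=10))
```

Because the queue is first-in, first-out, pages closer to the start site are visited before deeper ones.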

Website

./website ../../config.json
#[config path]

Config

//proxy: type://ip:port OR empty ("")
//example: socks5://127.0.0.1:9050

//keys ending in _s are values in seconds

{
  "global": {
    //edit also website/frontend/js/search_encrypt.js
    "rsa_key_length": 1024, //1024|2048|4096
    "max_title_show_size": 55,
    "max_desc_show_size": 350,
    "nodes": [
      {
        "name": "This",
        "url": "http://127.0.0.1:8080"
      }
    ]
  },
  "crawler": {
    "user_agent": "librengine",
    "proxy": "socks5://127.0.0.1:9050",
    "load_page_timeout_s": 10,
    "update_time_site_info_s_after": 864000, //10 days
    "delay_time_s": 3, 
    "max_pages_site": 5,
    "max_page_symbols": 50000000, //50mb
    "max_robots_txt_symbols": 3000,
    "max_lru_cache_size_host": 512,
    "max_lru_cache_size_url": 512,
    "is_http_to_https": true,
    "is_check_robots_txt": true
  },
  "cli": {
    "proxy": "socks5://127.0.0.1:9050"
  },
  "website": {
    "port": 8080,
    "proxy": "socks5://127.0.0.1:9050"
  },
  //edit also init_db.sh
  "db": {
    "url": "http://localhost:8108",
    "api_key": "xyz"
  }
}
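
The config above contains // comments, which strict JSON parsers reject. The sketch below shows one way to read such a file from a script, assuming comments only ever use the // form. It is not how the project itself loads config.json, and strip_line_comments is a hypothetical helper. The stripper tracks string literals so that the // inside values such as "socks5://127.0.0.1:9050" is left untouched.

```python
import json


def strip_line_comments(text: str) -> str:
    """Remove // comments that appear outside of JSON string literals."""
    out = []
    in_string = False
    escaped = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_string:
            out.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            i += 1
        elif ch == '"':
            in_string = True
            out.append(ch)
            i += 1
        elif text.startswith("//", i):
            # Skip everything up to (not including) the end of the line.
            while i < len(text) and text[i] != "\n":
                i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


sample = """
//_s - seconds
{
  "crawler": {
    "proxy": "socks5://127.0.0.1:9050",
    "update_time_site_info_s_after": 864000, //10 days
    "delay_time_s": 3
  }
}
"""

config = json.loads(strip_line_comments(sample))
days = config["crawler"]["update_time_site_info_s_after"] / 86400
print(config["crawler"]["proxy"], days)
```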

License

GNU AFFERO GENERAL PUBLIC LICENSE v3.0