Matt Wells
45cb5c9a0c
fix bugs to try to get sharding working
on crawlbot today
2014-01-21 13:58:21 -08:00
Matt Wells
7065b0ae0c
fixed oops
2014-01-21 13:13:16 -08:00
Matt Wells
dba382f7f7
added max cpu merge threads parm and defaulted to 10
up from 2 for better disk reading latencies.
2014-01-21 13:11:53 -08:00
Matt Wells
9354d06493
menu updates.
2014-01-21 13:01:37 -08:00
Matt Wells
8d5e1cb547
added url download support
2014-01-20 23:17:04 -08:00
Matt Wells
41cdfcef96
inc spider limits in various places
2014-01-20 18:51:15 -08:00
Matt Wells
946a683e39
quite a few spider fixes
2014-01-20 16:45:27 -08:00
Matt Wells
5c86d8a122
simplified spiderdb.cpp scanSpiderdb()
by breaking up into 4 functions.
evalIpLoop(), readSpiderdbList(), ...
2014-01-19 22:18:37 -08:00
Matt Wells
e9bbc16a9f
took out pagecount table. just hafta scan
twice i think because caching counts
gets complicated because of adding
duplicate injection requests!
2014-01-19 20:34:38 -08:00
Matt Wells
58d0c444ac
fixes for the global index quota system
2014-01-19 19:38:23 -08:00
Matt Wells
089d7f34a0
more spiderdb spider request fixes
2014-01-19 18:00:56 -08:00
Matt Wells
970d5b2488
formatting
2014-01-19 16:40:22 -08:00
Matt Wells
fa0e3f784f
formatting
2014-01-19 15:06:02 -08:00
Matt Wells
5c9b688f72
spiderdb fixes for injections
2014-01-19 14:33:27 -08:00
Matt Wells
99de2188e1
formatting
2014-01-19 13:21:58 -08:00
Matt Wells
04b0650301
formatting
2014-01-19 12:37:37 -08:00
Matt Wells
cd91130a6d
formatting
2014-01-19 12:16:26 -08:00
Matt Wells
ca816492b5
doc links
2014-01-19 12:01:32 -08:00
Matt Wells
b6c3ecc20e
more formatting
2014-01-19 11:56:36 -08:00
Matt Wells
471599e9e7
formatting
2014-01-19 10:44:19 -08:00
Matt Wells
e6eb9003b5
more formatting
2014-01-19 01:09:38 -08:00
Matt Wells
b755b4d581
formatting fixes
2014-01-19 00:57:20 -08:00
Matt Wells
fe3a879758
formatting changes
2014-01-19 00:38:02 -08:00
Matt Wells
36b93a1e92
minor cmdline fixes
2014-01-18 21:26:59 -08:00
Matt Wells
4606e88721
code cleanups.
...
xmldoc::injectDoc(), and it'll
add a SpiderRequest as well.
better collectiondb init code.
2014-01-18 21:19:26 -08:00
Matt Wells
10f4443974
quite a few fixes to the quota system, cleanups etc.
2014-01-18 16:23:13 -08:00
Matt Wells
f3000e2763
set m_needsSave in collectionrec when parms updated
2014-01-18 12:51:10 -08:00
Matt Wells
8edfc2ce70
more collection fixes
2014-01-18 12:09:33 -08:00
Matt Wells
fa59c62264
more bug fixes associated with collections
and site page counts in url filters.
2014-01-18 11:54:58 -08:00
Matt Wells
22aa13e34d
do not set indexcode to EFAKEFIRSTIP
for INJECTED urls, just added urls.
fix add url page to not always use 'main'
collection. added reset/restart cmds to spider page.
2014-01-18 11:09:30 -08:00
Matt Wells
178af5f781
cleanup parms a bit.
added diffbotApiUrl to all crawls whether
custom or not, on spider controls page.
2014-01-18 10:29:22 -08:00
Matt Wells
9c1f6197eb
added indexbody control so i can
turn it off for my special json
global index.
2014-01-18 10:04:33 -08:00
Matt Wells
6fb602ae62
hash a little meta info still even if custom crawl
2014-01-18 09:37:07 -08:00
Matt Wells
f9d0a02dbe
test and get gbparenturl: query working.
2014-01-18 09:28:58 -08:00
Matt Wells
0be8a59e9e
hash content checksums for pages
in custom crawls so we can do deduping.
2014-01-17 21:42:02 -08:00
Matt Wells
5b7170e8c6
Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
Conflicts:
Json.cpp
PageAddUrl.cpp
PageStats.cpp
Spider.cpp
2014-01-17 21:07:08 -08:00
Matt Wells
4e803210ee
tons of changes from live github on neo.
lots of core fixes.
took out ppthtml powerpoint convert, it hangs.
dynamic rdbmap to save memory per coll.
fixed disk page cache logic and brought it
back.
2014-01-17 21:01:43 -08:00
Matt Wells
8c4ac3c514
Merge branch 'master' into diffbot
2014-01-17 20:17:40 -08:00
Matt Wells
bb51dd93c8
Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
2014-01-17 20:17:03 -08:00
Matt Wells
403dca707c
do not hash body etc. into posdb if
doing a custom diffbot crawl. saves
a lot of disk space.
2014-01-17 20:16:29 -08:00
Matt Wells
116f90dba3
Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
2014-01-17 18:39:34 -08:00
Matt Wells
94740ed3a1
allow sleeps in main.cpp function
2014-01-17 18:39:20 -08:00
Matt Wells
3ec44c5b35
fix streaming mode for sending back json
downloads/dumps.
2014-01-17 18:28:17 -08:00
Matt Wells
e09496e34e
fix parm updating logic.
2014-01-17 17:48:45 -08:00
Matt Wells
2faba0efd1
fix repeat rounds sticking bug
by adding PF_REBUILDURLFILTERS flag to
spiderroundastarttime parm
2014-01-17 17:17:10 -08:00
Matt Wells
16f8af0d57
added awesome streaming mode support
to tcpserver.cpp for sending back
json objects as we get them from shards.
and as we get them in small pieces so we
don't go oom. made that code much simpler
and more reliable in the long run.
2014-01-17 16:26:17 -08:00
Matt Wells
0844dbf72a
added url process pattern and regex to
xmldoc.cpp.
2014-01-17 11:08:23 -08:00
Matt Wells
01a3282020
fix problem scanning spiderdb.
move dedup spiderdb code to
RdbMerge.cpp where it really should be.
2014-01-16 17:04:08 -08:00
Matt Wells
167d2dc99f
nothing.
2014-01-16 13:40:27 -08:00
Matt Wells
980d63632a
more msg5 re-read fixes.
stop re-reading if increasing minrecsizes did nothing.
fix tight merges so they work over all colls.
fix merge counting to be fast and not loop over
all rdbbases which could be thousands.
add num mirrors to rebalance.txt.
fix updateCrawlInfo to wait for all replies. critical error!
2014-01-16 13:38:22 -08:00