Commit Graph

988 Commits

Author SHA1 Message Date
Matt Wells
45cb5c9a0c fix bugs to try to get sharding working
on crawlbot today
2014-01-21 13:58:21 -08:00
Matt Wells
7065b0ae0c fixed oops 2014-01-21 13:13:16 -08:00
Matt Wells
dba382f7f7 added max cpu merge threads parm and defaulted to 10
up from 2 for better disk reading latencies.
2014-01-21 13:11:53 -08:00
Matt Wells
9354d06493 menu updates. 2014-01-21 13:01:37 -08:00
Matt Wells
8d5e1cb547 added url download support 2014-01-20 23:17:04 -08:00
Matt Wells
41cdfcef96 inc spider limits in various places 2014-01-20 18:51:15 -08:00
Matt Wells
946a683e39 quite a few spider fixes 2014-01-20 16:45:27 -08:00
Matt Wells
5c86d8a122 simplified spiderdb.cpp scanSpiderdb()
by breaking up into 4 functions.
evalIpLoop(), readSpiderdbList(), ...
2014-01-19 22:18:37 -08:00
Matt Wells
e9bbc16a9f took out pagecount table. just hafta scan
twice i think because caching counts
gets complicated because of adding
duplicate injection requests!
2014-01-19 20:34:38 -08:00
Matt Wells
58d0c444ac fixes for the global index quota system 2014-01-19 19:38:23 -08:00
Matt Wells
089d7f34a0 more spiderdb spider request fixes 2014-01-19 18:00:56 -08:00
Matt Wells
970d5b2488 formatting 2014-01-19 16:40:22 -08:00
Matt Wells
fa0e3f784f formatting 2014-01-19 15:06:02 -08:00
Matt Wells
5c9b688f72 spiderdb fixes for injections 2014-01-19 14:33:27 -08:00
Matt Wells
99de2188e1 formatting 2014-01-19 13:21:58 -08:00
Matt Wells
04b0650301 formatting 2014-01-19 12:37:37 -08:00
Matt Wells
cd91130a6d formatting 2014-01-19 12:16:26 -08:00
Matt Wells
ca816492b5 doc links 2014-01-19 12:01:32 -08:00
Matt Wells
b6c3ecc20e more formatting 2014-01-19 11:56:36 -08:00
Matt Wells
471599e9e7 formatting 2014-01-19 10:44:19 -08:00
Matt Wells
e6eb9003b5 more formatting 2014-01-19 01:09:38 -08:00
Matt Wells
b755b4d581 formatting fixes 2014-01-19 00:57:20 -08:00
Matt Wells
fe3a879758 formatting changes 2014-01-19 00:38:02 -08:00
Matt Wells
36b93a1e92 minor cmdline fixes 2014-01-18 21:26:59 -08:00
Matt Wells
4606e88721 code cleanups.
xmldoc::injectDoc(), and it'll
add a SpiderRequest as well.
better collectiondb init code.
2014-01-18 21:19:26 -08:00
Matt Wells
10f4443974 quite a few fixes to the quota system, cleanups etc. 2014-01-18 16:23:13 -08:00
Matt Wells
f3000e2763 set m_needsSave in collectionrec when parms updated 2014-01-18 12:51:10 -08:00
Matt Wells
8edfc2ce70 more collection fixes 2014-01-18 12:09:33 -08:00
Matt Wells
fa59c62264 more bug fixes associated with collections
and site page counts in url filters.
2014-01-18 11:54:58 -08:00
Matt Wells
22aa13e34d do not set indexcode to EFAKEFIRSTIP
for INJECTED urls, just added urls.
fix add url page to not always use 'main'
collection. added reset/restart cmds to spider page.
2014-01-18 11:09:30 -08:00
Matt Wells
178af5f781 cleanup parms a bit.
added diffbotApiUrl to all crawls whether
custom or not, on spider controls page.
2014-01-18 10:29:22 -08:00
Matt Wells
9c1f6197eb added indexbody control so i can
turn it off for my special json
global index.
2014-01-18 10:04:33 -08:00
Matt Wells
6fb602ae62 hash a little meta info still even if custom crawl 2014-01-18 09:37:07 -08:00
Matt Wells
f9d0a02dbe test and get gbparenturl: query working. 2014-01-18 09:28:58 -08:00
Matt Wells
0be8a59e9e hash content checksums for pages
in custom crawls so we can do deduping.
2014-01-17 21:42:02 -08:00
Matt Wells
5b7170e8c6 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
Conflicts:
	Json.cpp
	PageAddUrl.cpp
	PageStats.cpp
	Spider.cpp
2014-01-17 21:07:08 -08:00
Matt Wells
4e803210ee tons of changes from live github on neo.
lots of core fixes.
took out ppthtml powerpoint convert, it hangs.
dynamic rdbmap to save memory per coll.
fixed disk page cache logic and brought it
back.
2014-01-17 21:01:43 -08:00
Matt Wells
8c4ac3c514 Merge branch 'master' into diffbot 2014-01-17 20:17:40 -08:00
Matt Wells
bb51dd93c8 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot 2014-01-17 20:17:03 -08:00
Matt Wells
403dca707c do not hash body etc. into posdb if
doing a custom diffbot crawl. saves
a lot of disk space.
2014-01-17 20:16:29 -08:00
Matt Wells
116f90dba3 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot 2014-01-17 18:39:34 -08:00
Matt Wells
94740ed3a1 allow sleeps in main.cpp function 2014-01-17 18:39:20 -08:00
Matt Wells
3ec44c5b35 fix streaming mode for sending back json
downloads/dumps.
2014-01-17 18:28:17 -08:00
Matt Wells
e09496e34e fix parm updating logic. 2014-01-17 17:48:45 -08:00
Matt Wells
2faba0efd1 fix repeat rounds sticking bug
by adding PF_REBUILDURLFILTERS flag to
spiderroundastarttime parm
2014-01-17 17:17:10 -08:00
Matt Wells
16f8af0d57 added awesome streaming mode support
to tcpserver.cpp for sending back
json objects as we get them from shards.
and as we get them in small pieces so we
don't go oom. made that code much simpler
and more reliable in the long run.
2014-01-17 16:26:17 -08:00
Matt Wells
0844dbf72a added url process pattern and regex to
xmldoc.cpp.
2014-01-17 11:08:23 -08:00
Matt Wells
01a3282020 fix problem scanning spiderdb.
move dedup spiderdb code to
RdbMerge.cpp where it really should be.
2014-01-16 17:04:08 -08:00
Matt Wells
167d2dc99f nothing. 2014-01-16 13:40:27 -08:00
Matt Wells
980d63632a more msg5 re-read fixes.
stop re-reading if increasing minrecsizes did nothing.
fix tight merges so they work over all colls.
fix merge counting to be fast and not loop over
all rdbbases which could be thousands.
add num mirrors to rebalance.txt.
fix updateCrawlInfo to wait for all replies. critical error!
2014-01-16 13:38:22 -08:00