Matt
30a77dd422
checkpoint on massive spidering speed ups.
2015-02-11 17:55:28 -08:00
mwells
87285ba3cd
use gbmemcpy not memcpy so we can get profiler working again
since memcpy can't be interrupted and backtrace() called.
2015-01-13 12:25:42 -07:00
Matt
f57b2e0ab5
always restrict to seed domains.
2014-12-15 17:46:13 -08:00
mwells
7c57283b88
fix tld lang url filter. was being reset.
2014-12-04 14:34:08 -07:00
Matt Wells
832392887c
do not spam the logs with spider request corrupt count msgs.
but store a count for them now in coll rec.
2014-12-04 10:00:13 -07:00
Matt Wells
d3a25db329
take out swap out stuff
2014-11-27 06:31:20 -08:00
Matt Wells
1c3d87968b
Merge branch 'diffbot-testing' into diffbot-matt
2014-11-27 06:29:26 -08:00
Matt Wells
c111b18b29
a few hacks. temp hack for oom to split 4 ways
for custom crawls
2014-11-25 15:01:23 -08:00
Matt
adcef39376
Merge branch 'diffbot-testing' into diffbot-matt
Conflicts:
Collectiondb.cpp
Collectiondb.h
Conf.cpp
Conf.h
Msg39.cpp
PageEvents.cpp
PageResults.cpp
PageTurk.cpp
Pages.cpp
Parms.cpp
Posdb.cpp
Proxy.cpp
Query.cpp
Query.h
RdbBase.cpp
RdbMap.cpp
Repair.cpp
Repair.h
SafeBuf.cpp
Spider.cpp
Tagdb.cpp
TopTree.cpp
XmlDoc.cpp
main.cpp
2014-11-20 16:53:07 -08:00
Matt
931a1c4bc6
good checkpoint. quite a few fixes.
2014-11-17 18:13:36 -08:00
Matt
702785a8ee
disable collection swapping temporarily for
tagdb updates
2014-11-13 15:36:10 -08:00
Matt
c6605d7b33
64 bit somewhat working at runtime. need to test all functionality
to make sure. fixes are pretty trivial.
2014-11-12 19:18:25 -08:00
Matt
96b8197ad3
now it compiles with -m32
2014-11-10 14:45:11 -08:00
emmanuelcharon
790c525820
rename diffbotHopcount to maxHops
2014-11-04 16:05:20 -08:00
emmanuelcharon
c29dedd714
added diffbotHopcount parameter for diffbot crawl and bulk jobs, also updated PageCrawlbot.cpp
2014-10-31 16:34:31 -07:00
Matt Wells
e7dd8f7956
replace long long with int64_t
2014-10-30 13:36:39 -06:00
mwells
f483fccc2e
if no crawl regex, and it has a crawl pattern consisting of
only negative patterns then restrict to domains of seeds
2014-10-09 11:15:33 -06:00
Matt Wells
23d26e26ba
Merge branch 'testing' into diffbot-testing
2014-09-30 16:02:07 -07:00
Matt Wells
8c6d216a14
lots of fixes for collection swapping.
2014-09-29 20:16:39 -07:00
Matt Wells
cfb2ab7e82
fix core when deleting collection
that is not swapped out.
2014-09-29 14:00:10 -07:00
mwells
bca24fb0e6
fix collection swap logic a bunch. seems to work now.
2014-09-29 13:05:20 -07:00
mwells
257a7e3c10
first stab at swapping out collection recs
to save memory when # of collections is high
2014-09-29 11:37:05 -07:00
mwells
29f928a71e
import fixes
2014-09-25 20:48:34 -07:00
mwells
538f6103d5
get qa tests working again.
fixed facet links.
made data import function actually work so we can
import data from one collection (files) into another.
made url filters profile compatible with UFP_ stuff.
2014-09-23 17:48:40 -07:00
mwells
5390a25721
custom profiles fix
2014-09-22 10:40:16 -07:00
mwells
dcc775eae7
added more langs to url filters drop down
2014-09-21 18:16:11 -07:00
mwells
e45c0d32f6
Merge branch 'diffbot-testing' into testing
2014-08-15 17:05:22 -07:00
Matt Wells
2af299da2c
various fixes.
prioritize process only urls over crawl urls to get data faster.
do not merge on high negative rec concentration. we need to fix that more.
allow simplified redirs again for custom crawls to avoid too many dups.
raise crawlinfo delay from 1 sec to 5 secs to reduce network usage
for now. add back in injection enabled parm, but hidden.
2014-08-15 10:27:50 -07:00
mwells
58f5a2dd57
save conf files safely to disk so we don't
lose them because the disk is full.
2014-07-29 10:02:43 -07:00
mwells
0409571262
Merge branch 'diffbot-testing' into testing
Conflicts:
Spider.cpp
2014-07-28 14:37:44 -07:00
Matt Wells
343f783592
another fix for &restartRound=1
2014-07-28 13:58:36 -07:00
mwells
9347b1fc79
Merge branch 'diffbot-testing' into testing
Conflicts:
Collectiondb.cpp
Spider.cpp
2014-07-15 19:30:34 -07:00
Matt Wells
3421befd3a
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
2014-07-15 16:10:50 -07:00
Matt Wells
c1c31c1364
fix for using more than 32k colls
2014-07-15 16:10:35 -07:00
mwells
cd48799030
try to fix core on neo
2014-07-15 10:46:12 -07:00
mwells
61afdc0fb9
fix core from adding/deleting collection
2014-07-12 08:23:40 -07:00
mwells
2f8207ccf7
qa fixes
2014-07-11 19:07:49 -07:00
Matt Wells
b393a1bbbe
Merge branch 'testing' into diffbot-matt
Conflicts:
Errno.cpp
Errno.h
2014-07-10 10:06:55 -07:00
mwells
0da6063983
bring tags back in site list / url filters.
2014-07-10 07:44:16 -07:00
mwells
6434e5cc04
Merge branch 'testing' into diffbot-matt
Conflicts:
Errno.cpp
Errno.h
Parms.h
2014-07-07 09:49:59 -07:00
Matt Wells
886063a3bd
fixes for query reindex.
2014-07-03 12:24:14 -07:00
Matt Wells
e6dd317664
Merge branch 'diffbot-testing' into diffbot-matt
2014-06-30 11:37:12 -07:00
Matt Wells
5e39b7870d
fix for bad crawl info stats
2014-06-30 10:53:11 -07:00
Matt Wells
98b317b421
Merge branch 'diffbot-testing' into diffbot-matt
Conflicts:
Parms.cpp
Query.cpp
2014-06-27 17:23:03 -07:00
Matt Wells
2227d1fca7
Merge branch 'diffbot-matt' of github.com:gigablast/open-source-search-engine into diffbot-matt
Conflicts:
Collectiondb.cpp
2014-06-27 17:18:20 -07:00
Matt Wells
2137e150e7
Merge branch 'testing' into diffbot-matt
Conflicts:
Collectiondb.cpp
Make.depend
Parms.cpp
2014-06-27 17:17:14 -07:00
Matt Wells
3162c83473
add some debug msgs
2014-06-27 08:28:28 -07:00
Matt Wells
e9ff8c48d8
try to remove the sluggishness from
all hosts... should really reduce load.
2014-06-25 17:46:28 -07:00
mwells
651f0f27ac
only send localcrawlinfo if it has
been updated significantly since last time.
should remove the sluggishness/missed heartbeats
from host #0 on neo.
2014-06-25 12:38:51 -06:00
Matt Wells
48a98df71d
make &s=20000 search much faster by skipping
generation of first 20000 summaries if
deduping is off, site clustering is off and
gigabit generation is off (&dr=0&sc=0&dsrt=0).
turn gigabits off on load for all customcrawls(diffbot)
2014-06-23 14:44:21 -07:00