Commit Graph

200 Commits

Author SHA1 Message Date
mwells
7c57283b88 fix tld lang url filter. was being reset. 2014-12-04 14:34:08 -07:00
Matt Wells
832392887c do not spam the logs with spider request corrupt count msgs.
but store a count for them now in coll rec.
2014-12-04 10:00:13 -07:00
Matt Wells
d3a25db329 take out swap out stuff 2014-11-27 06:31:20 -08:00
Matt Wells
1c3d87968b Merge branch 'diffbot-testing' into diffbot-matt 2014-11-27 06:29:26 -08:00
Matt Wells
c111b18b29 a few hacks. temp hack for oom to split 4 ways
for custom crawls
2014-11-25 15:01:23 -08:00
Matt
adcef39376 Merge branch 'diffbot-testing' into diffbot-matt
Conflicts:
	Collectiondb.cpp
	Collectiondb.h
	Conf.cpp
	Conf.h
	Msg39.cpp
	PageEvents.cpp
	PageResults.cpp
	PageTurk.cpp
	Pages.cpp
	Parms.cpp
	Posdb.cpp
	Proxy.cpp
	Query.cpp
	Query.h
	RdbBase.cpp
	RdbMap.cpp
	Repair.cpp
	Repair.h
	SafeBuf.cpp
	Spider.cpp
	Tagdb.cpp
	TopTree.cpp
	XmlDoc.cpp
	main.cpp
2014-11-20 16:53:07 -08:00
Matt
931a1c4bc6 good checkpoint. quite a few fixes. 2014-11-17 18:13:36 -08:00
Matt
702785a8ee disable collection swapping temporarily for
tagdb updates
2014-11-13 15:36:10 -08:00
Matt
c6605d7b33 64 bit somewhat working at runtime. need to test all functionality
to make sure. fixes are pretty trivial.
2014-11-12 19:18:25 -08:00
Matt
96b8197ad3 now it compiles with -m32 2014-11-10 14:45:11 -08:00
emmanuelcharon
790c525820 rename diffbotHopcount to maxHops 2014-11-04 16:05:20 -08:00
emmanuelcharon
c29dedd714 added diffbotHopcount parameter for diffbot crawl and bulk jobs, also updated PageCrawlbot.cpp 2014-10-31 16:34:31 -07:00
Matt Wells
e7dd8f7956 replace long long with int64_t 2014-10-30 13:36:39 -06:00
mwells
f483fccc2e if no crawl regex, and it has a crawl pattern consisting of
only negative patterns then restrict to domains of seeds
2014-10-09 11:15:33 -06:00
Matt Wells
23d26e26ba Merge branch 'testing' into diffbot-testing 2014-09-30 16:02:07 -07:00
Matt Wells
8c6d216a14 lots of fixes for collection swapping. 2014-09-29 20:16:39 -07:00
Matt Wells
cfb2ab7e82 fix core when deleting collection
that is not swapped out.
2014-09-29 14:00:10 -07:00
mwells
bca24fb0e6 fix collection swap logic a bunch. seems to work now. 2014-09-29 13:05:20 -07:00
mwells
257a7e3c10 first stab at swapping out collection recs
to save memory when # of collections is high
2014-09-29 11:37:05 -07:00
mwells
29f928a71e import fixes 2014-09-25 20:48:34 -07:00
mwells
538f6103d5 get qa tests working again.
fixed facet links.
made data import function actually work so we can
import data from one collection (files) into another.
made url filters profile compatible with UFP_ stuff.
2014-09-23 17:48:40 -07:00
mwells
5390a25721 custom profiles fix 2014-09-22 10:40:16 -07:00
mwells
dcc775eae7 added more langs to url filters drop down 2014-09-21 18:16:11 -07:00
mwells
e45c0d32f6 Merge branch 'diffbot-testing' into testing 2014-08-15 17:05:22 -07:00
Matt Wells
2af299da2c various fixes.
prioritize process only urls over crawl urls to get data faster.
do not merge on high negative rec concentration. we need to fix that more.
allow simplified redirs again for custom crawls to avoid too many dups.
raise crawlinfo delay from 1 sec to 5 secs to reduce network usage
for now. add back in injection enabled parm, but hidden.
2014-08-15 10:27:50 -07:00
mwells
58f5a2dd57 save conf files safely to disk so we don't
lose them because the disk is full.
2014-07-29 10:02:43 -07:00
mwells
0409571262 Merge branch 'diffbot-testing' into testing
Conflicts:
	Spider.cpp
2014-07-28 14:37:44 -07:00
Matt Wells
343f783592 another fix for &restartRound=1 2014-07-28 13:58:36 -07:00
mwells
9347b1fc79 Merge branch 'diffbot-testing' into testing
Conflicts:
	Collectiondb.cpp
	Spider.cpp
2014-07-15 19:30:34 -07:00
Matt Wells
3421befd3a Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing 2014-07-15 16:10:50 -07:00
Matt Wells
c1c31c1364 fix for using more than 32k colls 2014-07-15 16:10:35 -07:00
mwells
cd48799030 try to fix core on neo 2014-07-15 10:46:12 -07:00
mwells
61afdc0fb9 fix core from adding/deleting collection 2014-07-12 08:23:40 -07:00
mwells
2f8207ccf7 qa fixes 2014-07-11 19:07:49 -07:00
Matt Wells
b393a1bbbe Merge branch 'testing' into diffbot-matt
Conflicts:
	Errno.cpp
	Errno.h
2014-07-10 10:06:55 -07:00
mwells
0da6063983 bring tags back in site list / url filters. 2014-07-10 07:44:16 -07:00
mwells
6434e5cc04 Merge branch 'testing' into diffbot-matt
Conflicts:
	Errno.cpp
	Errno.h
	Parms.h
2014-07-07 09:49:59 -07:00
Matt Wells
886063a3bd fixes for query reindex. 2014-07-03 12:24:14 -07:00
Matt Wells
e6dd317664 Merge branch 'diffbot-testing' into diffbot-matt 2014-06-30 11:37:12 -07:00
Matt Wells
5e39b7870d fix for bad crawl info stats 2014-06-30 10:53:11 -07:00
Matt Wells
98b317b421 Merge branch 'diffbot-testing' into diffbot-matt
Conflicts:
	Parms.cpp
	Query.cpp
2014-06-27 17:23:03 -07:00
Matt Wells
2227d1fca7 Merge branch 'diffbot-matt' of github.com:gigablast/open-source-search-engine into diffbot-matt
Conflicts:
	Collectiondb.cpp
2014-06-27 17:18:20 -07:00
Matt Wells
2137e150e7 Merge branch 'testing' into diffbot-matt
Conflicts:
	Collectiondb.cpp
	Make.depend
	Parms.cpp
2014-06-27 17:17:14 -07:00
Matt Wells
3162c83473 add some debug msgs 2014-06-27 08:28:28 -07:00
Matt Wells
e9ff8c48d8 try to remove the sluggishness from
all hosts... should really reduce load.
2014-06-25 17:46:28 -07:00
mwells
651f0f27ac only send localcrawlinfo if it has
been updated significantly since last time.
should remove the sluggishness/missed heartbeats
from host #0 on neo.
2014-06-25 12:38:51 -06:00
Matt Wells
48a98df71d make &s=20000 search much faster by skipping
generation of first 20000 summaries if
deduping is off, site clustering is off and
gigabit generation is off (&dr=0&sc=0&dsrt=0).
turn gigabits off on load for all customcrawls(diffbot)
2014-06-23 14:44:21 -07:00
mwells
6da972704b bring back custom home page html into search controls 2014-06-21 09:57:51 -07:00
mwells
a09d4cd723 Merge branch 'master' into diffbot-matt
Conflicts:
	Collectiondb.cpp
	Pages.cpp
	XmlDoc.cpp
	gb.conf
2014-06-20 09:35:39 -07:00
Matt Wells
1bef36c03c emergency bug fixes 2014-06-18 05:04:45 -07:00