Commit Graph

178 Commits

Author SHA1 Message Date
mwells
dcc775eae7 added more langs to url filters drop down 2014-09-21 18:16:11 -07:00
mwells
e45c0d32f6 Merge branch 'diffbot-testing' into testing 2014-08-15 17:05:22 -07:00
Matt Wells
2af299da2c various fixes.
prioritize process only urls over crawl urls to get data faster.
do not merge on high negative rec concentration. we need to fix that more.
allow simplified redirs again for custom crawls to avoid too many dups.
raise crawlinfo delay from 1 sec to 5 secs to reduce network usage
for now. add back in injection enabled parm, but hidden.
2014-08-15 10:27:50 -07:00
mwells
58f5a2dd57 save conf files safely to disk so we don't
lose them because the disk is full.
2014-07-29 10:02:43 -07:00
mwells
0409571262 Merge branch 'diffbot-testing' into testing
Conflicts:
	Spider.cpp
2014-07-28 14:37:44 -07:00
Matt Wells
343f783592 another fix for &restartRound=1 2014-07-28 13:58:36 -07:00
mwells
9347b1fc79 Merge branch 'diffbot-testing' into testing
Conflicts:
	Collectiondb.cpp
	Spider.cpp
2014-07-15 19:30:34 -07:00
Matt Wells
3421befd3a Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing 2014-07-15 16:10:50 -07:00
Matt Wells
c1c31c1364 fix for using more than 32k colls 2014-07-15 16:10:35 -07:00
mwells
cd48799030 try to fix core on neo 2014-07-15 10:46:12 -07:00
mwells
61afdc0fb9 fix core from adding/deleting collection 2014-07-12 08:23:40 -07:00
mwells
2f8207ccf7 qa fixes 2014-07-11 19:07:49 -07:00
Matt Wells
b393a1bbbe Merge branch 'testing' into diffbot-matt
Conflicts:
	Errno.cpp
	Errno.h
2014-07-10 10:06:55 -07:00
mwells
0da6063983 bring tags back in site list / url filters. 2014-07-10 07:44:16 -07:00
mwells
6434e5cc04 Merge branch 'testing' into diffbot-matt
Conflicts:
	Errno.cpp
	Errno.h
	Parms.h
2014-07-07 09:49:59 -07:00
Matt Wells
886063a3bd fixes for query reindex. 2014-07-03 12:24:14 -07:00
Matt Wells
e6dd317664 Merge branch 'diffbot-testing' into diffbot-matt 2014-06-30 11:37:12 -07:00
Matt Wells
5e39b7870d fix for bad crawl info stats 2014-06-30 10:53:11 -07:00
Matt Wells
98b317b421 Merge branch 'diffbot-testing' into diffbot-matt
Conflicts:
	Parms.cpp
	Query.cpp
2014-06-27 17:23:03 -07:00
Matt Wells
2227d1fca7 Merge branch 'diffbot-matt' of github.com:gigablast/open-source-search-engine into diffbot-matt
Conflicts:
	Collectiondb.cpp
2014-06-27 17:18:20 -07:00
Matt Wells
2137e150e7 Merge branch 'testing' into diffbot-matt
Conflicts:
	Collectiondb.cpp
	Make.depend
	Parms.cpp
2014-06-27 17:17:14 -07:00
Matt Wells
3162c83473 add some debug msgs 2014-06-27 08:28:28 -07:00
Matt Wells
e9ff8c48d8 try to remove the sluggishness from
all hosts... should really reduce load.
2014-06-25 17:46:28 -07:00
mwells
651f0f27ac only send localcrawlinfo if it has
been updated significantly since last time.
should remove the sluggishness/missed heartbeats
from host #0 on neo.
2014-06-25 12:38:51 -06:00
Matt Wells
48a98df71d make &s=20000 search much faster by skipping
generation of first 20000 summaries if
deduping is off, site clustering is off and
gigabit generation is off (&dr=0&sc=0&dsrt=0).
turn gigabits off on load for all customcrawls(diffbot)
2014-06-23 14:44:21 -07:00
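A rough sketch of the shortcut described in 48a98df71d, for readers who want to see the condition spelled out. The struct, field, and function names are hypothetical and do not come from the repository. The mapping of &dr, &sc and &dsrt to deduping, site clustering and gigabit generation follows the order given in the commit message and is otherwise an assumption.

```cpp
#include <cstddef>

// Hypothetical container for the relevant query parameters.
struct SearchOptions {
    std::size_t firstResult;  // &s=  offset of the first result requested
    bool dedupResults;        // &dr= deduping on/off
    bool siteCluster;         // &sc= site clustering on/off
    bool gigabits;            // &dsrt= gigabit generation on/off (assumed mapping)
};

// Returns the index of the first result that needs a summary.
// If any of the three features is enabled, summaries are generated from
// result 0; otherwise everything before the requested offset is skipped.
std::size_t firstSummaryIndex(const SearchOptions& o) {
    const bool needsEarlierSummaries =
        o.dedupResults || o.siteCluster || o.gigabits;
    return needsEarlierSummaries ? 0 : o.firstResult;
}
```

With &s=20000 and all three features off, this sketch starts summary generation at result 20000 instead of 0.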
mwells
6da972704b bring back custom home page html into search controls 2014-06-21 09:57:51 -07:00
mwells
a09d4cd723 Merge branch 'master' into diffbot-matt
Conflicts:
	Collectiondb.cpp
	Pages.cpp
	XmlDoc.cpp
	gb.conf
2014-06-20 09:35:39 -07:00
Matt Wells
1bef36c03c emergency bug fixes 2014-06-18 05:04:45 -07:00
mwells
584af942d4 Merge branch 'testing' into diffbot-matt
Conflicts:
	Collectiondb.cpp
	Make.depend
	Parms.cpp
2014-06-16 20:42:28 -07:00
mwells
928456b102 fix merge 2014-06-15 15:05:28 -07:00
mwells
2be7c78f2f Merge branch 'diffbot-testing' into testing
Conflicts:
	Collectiondb.cpp
	Parms.cpp
2014-06-15 15:02:17 -07:00
mwells
5c0b371dc9 Merge branch 'testing' into diffbot-matt
Conflicts:
	Collectiondb.cpp
	HttpServer.cpp
	Make.depend
	Parms.cpp
	Parms.h
2014-06-13 11:00:09 -07:00
Matt Wells
ab7717d065 now use &roundStart=0 to trigger the next crawl round.
now assume all crawl jobs are "repeat", but those that
have repeat of "0" just assume 10 year frequency, 3652.5 days.
that way the &roundStart=0 will do another round of crawling
for them as well.
2014-06-11 18:45:58 -07:00
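The repeat rule in ab7717d065 can be sketched as a small helper. The function and constant names are hypothetical, not taken from the repository; only the 3652.5-day value comes from the commit message.

```cpp
// About 10 years, as stated in the commit message.
constexpr double kTenYearsInDays = 3652.5;

// Hypothetical helper: every crawl job is treated as repeating, and a
// configured repeat of 0 is mapped to the 10-year frequency.
double effectiveRepeatDays(double configuredRepeatDays) {
    return configuredRepeatDays == 0.0 ? kTenYearsInDays
                                       : configuredRepeatDays;
}
```

Under this reading, a job with repeat "0" still has a next round scheduled (far in the future), which is what lets &roundStart=0 trigger that round early.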
Matt Wells
365f29b293 made &spiderRoundStart=1 (or 0) force the next
spider round to begin.
also added pageUrl to XmlDoc::getContentHashJSON32()
so it's not included in the hash to fix some spider-time
deduping issues.
2014-06-10 14:20:41 -07:00
mwells
108c281c33 fix annoying bug when adding new parms. 2014-06-10 12:29:50 -07:00
Matt Wells
56af753c3e fixed nasty bug of resetting RdbBases for
random collnums, causing data loss and corruption.
2014-06-09 10:16:29 -07:00
Matt Wells
74f0a41290 bulk jobs give up after downloading a url
3 times. crawls don't give up on tmperrors,
but retry every 30 days.
2014-06-05 23:11:14 -07:00
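A minimal sketch of the retry policy in 74f0a41290. All names are hypothetical. The commit message does not say whether the 3-download limit for bulk jobs applies only to failed downloads, so this sketch assumes it counts attempts that ended in a temporary error.

```cpp
enum class JobType { Bulk, Crawl };

constexpr int kMaxBulkAttempts = 3;           // from the commit message
constexpr int kCrawlTempErrorRetryDays = 30;  // from the commit message

// Returns true when a url that keeps hitting temporary errors should be
// abandoned. Bulk jobs stop after kMaxBulkAttempts downloads; crawls never
// give up on temporary errors.
bool givesUpOnTempError(JobType type, int attemptsSoFar) {
    if (type == JobType::Bulk) {
        return attemptsSoFar >= kMaxBulkAttempts;
    }
    return false;
}

// Days to wait before a crawl retries a url after a temporary error.
int crawlRetryDelayDays() {
    return kCrawlTempErrorRetryDays;
}
```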
Matt Wells
970eb33a83 sanity checks to ensure fakefirstip
was able to convert to a real good firstip
2014-06-05 16:13:33 -07:00
Matt Wells
d98cf4b2b0 try to prevent slamming diffbot backend
with bulk jobs consisting of hundreds of
different domains/ips.
2014-06-04 12:37:49 -07:00
Daniel Steinberg
fc5cfa2a62 move list of bulk urls to new directory earlier. May fix Defect #2218 if there is something that is causing the bulk job to restart before this function returns 2014-05-15 13:35:32 -07:00
mwells
45b8bb3421 log msg cleanups 2014-05-11 21:55:44 -07:00
mwells
2b37f56e4c Merge branch 'diffbot-matt' into testing 2014-05-10 07:56:45 -07:00
mwells
38a79888b6 Merge branch 'diffbot-testing' into testing 2014-05-10 07:49:29 -07:00
mwells
6048ae849b added support for spidering a particular language
with higher priority.
2014-05-09 10:03:24 -06:00
Matt Wells
3bf52f0f2d if "wpid" is supplied try to update sitelist
for that wpid. hopefully we can get the wp admin
tools to send a /search?wpid=xxxx&sites=xyz.com request so
we can start spidering those sites before they even see the
widget. also it is simpler than trying to update m_siteListBuf
each time someone does a query since those can be hundreds
a second.
2014-05-07 16:10:26 -07:00
mwells
ff121a76d9 fix formatting bugs 2014-04-30 14:17:39 -06:00
Matt Wells
066a01cba6 Merge branch 'diffbot-testing' into diffbot-matt 2014-04-28 14:15:02 -07:00
Matt Wells
e21e0a404c fixed bug for product title extraction.
titledb-saved.dat tree loop corruption bug.
no main coll bug.
put the ajax widget on spider status page so you can
see spider going in realtime. will give customers
a good idea of the spider moving along.
more widget fixes, to use new base64 thumbs, etc.
2014-04-28 13:30:24 -07:00
Matt Wells
20a2729827 added jobCreationTimeUTC and jobCompletionTimeUTC
to json api
2014-04-25 14:12:18 -07:00
Matt Wells
f3c06ced57 try to fix core from deleting coll 2014-04-25 11:52:17 -07:00