651 Commits

Author SHA1 Message Date
mwells
22271c0bb2 do not accept msg4 add requests until in sync with host 0 2013-12-10 13:20:23 -08:00
mwells
f2d5661965 parmdb overhaul. support collection add/del
sync when host comes back online. use udp not tcp.
host #0 can now handle a new incoming request while
a parm change is currently outstanding.
all missed "command" parms will be received when a dead host
comes back online, too, like a tight merge for instance.
does not use msg4, uses msg3e and msg3f for syncing and
sending parms.
2013-12-10 13:09:55 -08:00
mwells
0e47d48d8c test commit 2013-12-10 13:02:52 -08:00
mwells
e04d596288 minor comments update. 2013-12-09 13:42:33 -08:00
Matt Wells
dd3b49faa9 collection name hell 2013-12-08 16:44:37 -07:00
Matt Wells
3353a90a85 fix resuming a killed merge condition. 2013-12-08 15:50:45 -07:00
Matt Wells
ed79b67d2e core dump fixes 2013-12-08 15:36:23 -07:00
Matt Wells
144e2c898e save resources by not doing reads
on an empty doledb priority.
stop saving allSpidersOn and Off parms.
2013-12-08 14:07:31 -07:00
Matt Wells
a2e52a5dc3 little fix 2013-12-08 10:15:54 -07:00
Matt Wells
020d7741b9 new coll.conf for main with ismedia filter.
updated url filters docs some more for "isnew"
and explained the errorcount stuff more.
2013-12-08 10:10:51 -07:00
Matt Wells
65e75167e3 limit posdb merging to 8 files max.
added some more url filters documentation.
2013-12-08 09:41:05 -07:00
Matt Wells
78a4cfe6da forgot to push the .h files 2013-12-07 22:12:48 -07:00
Matt Wells
e1712fc94f fix uninitialized diffbot titlerec
header parms. ignore them when not
a custom crawl.
2013-12-07 22:11:26 -07:00
Matt Wells
06edfddf31 a bunch of bug fixes, mostly spider related.
also some for pagereindex.
2013-12-07 21:56:37 -07:00
Matt Wells
5e4b5a112c Merge branch 'master' into diffbot
Conflicts:

	PageResults.cpp
	Threads.cpp
	XmlDoc.cpp
	XmlDoc.h
2013-12-07 11:34:26 -07:00
Matt Wells
105be1fbdc more core fixes 2013-12-07 10:38:47 -07:00
Matt Wells
8d92a079c2 minor spider error reply time fix 2013-12-07 10:21:51 -07:00
Matt Wells
e731e5a4d8 Merge branch 'diffbot' of git@github.com:gigablast/open-source-search-engine into diffbot 2013-12-07 10:21:21 -07:00
Matt Wells
0e846a9389 minor spider reply error fix 2013-12-07 10:21:02 -07:00
Matt Wells
626a97770c another core fix 2013-12-07 10:14:37 -07:00
Matt Wells
fda7b48500 fix core 2013-12-07 10:11:13 -07:00
Matt Wells
1bc80ab552 fixed pagereindex. we now add spiderreplies
for internal errors like ENOMEM or ENOTFOUND
to try to avoid the "CRITICAL CRITICAL" msgs.
these are considered temporary errors.
2013-12-07 10:01:17 -07:00
Matt Wells
d9b31d3481 quick bug fix 2013-12-06 22:57:49 -07:00
Matt Wells
269c10f648 try to figure out why pagereindex never
displayed html page when done.
2013-12-06 22:56:06 -07:00
Matt Wells
e7bd904765 fix docids only printing. 2013-12-06 09:53:32 -07:00
Matt Wells
c50ef1954f show admin controls on serps if ip is local.
fixed up the "reindex" page for deleting/reindexing
search results for a given query.
2013-12-06 09:48:30 -07:00
Matt Wells
4b3e111bed fix spider dumping to remember
uh48's between list readings.
was showing dups for www.nordicusa.com/webtv
at the end.
2013-12-05 10:09:06 -08:00
Matt Wells
99cc10fccd allow seed urls to match url crawl pattern
regardless.
2013-12-03 17:13:38 -08:00
Matt Wells
432099c4e6 added rebuild=true fix for regex crawl change 2013-12-03 16:23:58 -08:00
Matt Wells
2e46bcc97f Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot 2013-12-03 16:23:20 -08:00
Matt Wells
03219a3057 add regex support back in 2013-12-03 16:23:05 -08:00
Matt Wells
6ab9041f45 fix bug when just getting the crawl parms
was rebuilding the waiting tree.
2013-12-03 16:17:36 -08:00
Matt Wells
9f1d79b124 check for null collrec 2013-12-02 10:13:19 -08:00
Matt Wells
cda5968b75 update common word list 2013-12-01 15:19:33 -07:00
Matt Wells
39f8dc646b default gigabits on for my copy. 2013-12-01 15:07:06 -07:00
Matt Wells
7f4dca7a07 Merge branch 'master' of git@github.com:gigablast/open-source-search-engine 2013-12-01 14:47:16 -07:00
Matt Wells
7874c8d832 added ifdef NEEDSLICENSE 2013-12-01 14:47:08 -07:00
Gigablast
dfe72a76a0 Update LICENSE
updates to license
2013-12-01 13:43:14 -08:00
Matt Wells
d43b55103c show query in msg20 log msg 2013-12-01 12:11:25 -07:00
Matt Wells
1077191e4a fix log msg bug. 2013-12-01 12:08:05 -07:00
Matt Wells
08030865e4 fix compiler warning 2013-12-01 11:57:26 -07:00
Matt Wells
d811a13627 fix small oopsy 2013-12-01 11:56:33 -07:00
Matt Wells
3155869fbf added new log msg for
recording cpu time for summary generation.
2013-12-01 11:53:41 -07:00
Matt Wells
5ee2be8fcf fixed data corruption bug. m_finalCrawlDelay
was being stored in xmldoc titlerec header.
2013-11-27 14:18:15 -08:00
Matt Wells
1129e9b635 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot 2013-11-27 14:09:54 -08:00
Matt Wells
57eb231a4e do not add timestamps to lastdownload
cache if skiphammercheck is true. those
are like robots.txt or redirs or root files.
2013-11-26 14:21:17 -08:00
Matt Wells
0f3374e3f3 measure crawl delay by default from
start of each download now. it is
a parm in msg13request.
2013-11-26 14:07:28 -08:00
Matt Wells
4769ca0881 if pthread_create() returns EAGAIN then do
not always retry, it makes an infinite loop.
2013-11-26 14:52:07 -07:00
Matt Wells
8bb086ac60 crawldelay works now but it measures
from the end of the download, not the
beginning.
2013-11-26 12:58:14 -08:00
Matt Wells
1c7c9a4d80 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot 2013-11-26 09:19:26 -08:00