Commit Graph

654 Commits

Author SHA1 Message Date
mwells
1175478705 got this new parm shit compiling 2013-12-10 12:54:19 -08:00
mwells
9e1976a8e2 new parm stuff almost compiling. 2013-12-10 11:13:43 -08:00
mwells
cc63fd048f Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot 2013-12-09 13:46:08 -08:00
mwells
e04d596288 minor comments update. 2013-12-09 13:42:33 -08:00
Matt Wells
dd3b49faa9 collection name hell 2013-12-08 16:44:37 -07:00
Matt Wells
3353a90a85 fix resuming a killed merge condition. 2013-12-08 15:50:45 -07:00
Matt Wells
ed79b67d2e core dump fixes 2013-12-08 15:36:23 -07:00
Matt Wells
144e2c898e save resources by not doing reads
on an empty doledb priority.
stop saving allSpidersOn and Off parms.
2013-12-08 14:07:31 -07:00
Matt Wells
a2e52a5dc3 little fix 2013-12-08 10:15:54 -07:00
Matt Wells
020d7741b9 new coll.conf for main with ismedia filter.
updated url filters docs some more for "isnew"
and explained the errorcount stuff more.
2013-12-08 10:10:51 -07:00
Matt Wells
65e75167e3 limit posdb merging to 8 files max.
added some more url filters documentation.
2013-12-08 09:41:05 -07:00
Matt Wells
78a4cfe6da forgot to push the .h files 2013-12-07 22:12:48 -07:00
Matt Wells
e1712fc94f fix uninitialized diffbot titlerec
header parms. ignore them when not
a custom crawl.
2013-12-07 22:11:26 -07:00
Matt Wells
06edfddf31 a bunch of bug fixes, mostly spider related.
also some for pagereindex.
2013-12-07 21:56:37 -07:00
Matt Wells
5e4b5a112c Merge branch 'master' into diffbot
Conflicts:

	PageResults.cpp
	Threads.cpp
	XmlDoc.cpp
	XmlDoc.h
2013-12-07 11:34:26 -07:00
Matt Wells
105be1fbdc more core fixes 2013-12-07 10:38:47 -07:00
Matt Wells
8d92a079c2 minor spider error reply time fix 2013-12-07 10:21:51 -07:00
Matt Wells
e731e5a4d8 Merge branch 'diffbot' of git@github.com:gigablast/open-source-search-engine into diffbot 2013-12-07 10:21:21 -07:00
Matt Wells
0e846a9389 minor spider reply error fix 2013-12-07 10:21:02 -07:00
Matt Wells
626a97770c another core fix 2013-12-07 10:14:37 -07:00
Matt Wells
fda7b48500 fix core 2013-12-07 10:11:13 -07:00
Matt Wells
1bc80ab552 fixed pagereindex. we now add spiderreplies
for internal errors like ENOMEM or ENOTFOUND
to try to avoid the "CRITICAL CRITICAL" msgs.
these are considered temporary errors.
2013-12-07 10:01:17 -07:00
Matt Wells
d9b31d3481 quick bug fix 2013-12-06 22:57:49 -07:00
Matt Wells
269c10f648 try to figure out why pagereindex never
displayed html page when done.
2013-12-06 22:56:06 -07:00
mwells
522e81913f another parm overhaul checkpoint 2013-12-06 17:33:55 -08:00
mwells
adf9d807ea Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
Conflicts:
	Process.cpp
2013-12-06 12:31:36 -08:00
mwells
08faf78be9 checkpoint for new parm logic
to allowing syncing with newly added or deleted
collections even if a host was dead when collection
was added/deleted. also added parm change request
queueing.
2013-12-06 12:29:14 -08:00
Matt Wells
e7bd904765 fix docids only printing. 2013-12-06 09:53:32 -07:00
Matt Wells
c50ef1954f show admin controls on serps if ip is local.
fixed up the "reindex" page for deleting/reindexing
search results for a given query.
2013-12-06 09:48:30 -07:00
Matt Wells
4b3e111bed fix spider dumping to remember
uh48's between list readings.
was showing dups for www.nordicusa.com/webtv
at the end.
2013-12-05 10:09:06 -08:00
Matt Wells
99cc10fccd allow seed urls to match url crawl pattern
regardless.
2013-12-03 17:13:38 -08:00
Matt Wells
432099c4e6 added rebuild=true fix for regex crawl change 2013-12-03 16:23:58 -08:00
Matt Wells
2e46bcc97f Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot 2013-12-03 16:23:20 -08:00
Matt Wells
03219a3057 add regex support back in 2013-12-03 16:23:05 -08:00
Matt Wells
6ab9041f45 fix bug when just getting the crawl parms
was rebuilding the waiting tree.
2013-12-03 16:17:36 -08:00
Matt Wells
9f1d79b124 check for null collrec 2013-12-02 10:13:19 -08:00
Matt Wells
cda5968b75 update common word list 2013-12-01 15:19:33 -07:00
Matt Wells
39f8dc646b default gigabits on for my copy. 2013-12-01 15:07:06 -07:00
Matt Wells
7f4dca7a07 Merge branch 'master' of git@github.com:gigablast/open-source-search-engine 2013-12-01 14:47:16 -07:00
Matt Wells
7874c8d832 added ifdef NEEDSLICENSE 2013-12-01 14:47:08 -07:00
Gigablast
dfe72a76a0 Update LICENSE
updates to license
2013-12-01 13:43:14 -08:00
Matt Wells
d43b55103c show query in msg20 log msg 2013-12-01 12:11:25 -07:00
Matt Wells
1077191e4a fix log msg bug. 2013-12-01 12:08:05 -07:00
Matt Wells
08030865e4 fix compiler warning 2013-12-01 11:57:26 -07:00
Matt Wells
d811a13627 fix small oopsy 2013-12-01 11:56:33 -07:00
Matt Wells
3155869fbf added new log msg for
recording cpu time for summary generation.
2013-12-01 11:53:41 -07:00
Matt Wells
5ee2be8fcf fixed data corruption bug. m_finalCrawlDelay
was being stored in xmldoc titlerec header.
2013-11-27 14:18:15 -08:00
Matt Wells
1129e9b635 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot 2013-11-27 14:09:54 -08:00
Matt Wells
57eb231a4e do not add timestamps to lastdownload
cache if skiphammercheck is true. those
are like robots.txt or redirs or root files.
2013-11-26 14:21:17 -08:00
Matt Wells
0f3374e3f3 measure crawl delay by default from
start of each download now. it is
a parm in msg13request.
2013-11-26 14:07:28 -08:00