642 Commits

Author SHA1 Message Date
Matt Wells
020d7741b9 new coll.conf for main with ismedia filter.
updated url filters docs some more for "isnew"
and explained the errorcount stuff more.
2013-12-08 10:10:51 -07:00
Matt Wells
65e75167e3 limit posdb merging to 8 files max.
added some more url filters documentation.
2013-12-08 09:41:05 -07:00
Matt Wells
78a4cfe6da forgot to push the .h files 2013-12-07 22:12:48 -07:00
Matt Wells
e1712fc94f fix uninitialized diffbot titlerec
header parms. ignore them when not
a custom crawl.
2013-12-07 22:11:26 -07:00
Matt Wells
06edfddf31 a bunch of bug fixes, mostly spider related.
also some for pagereindex.
2013-12-07 21:56:37 -07:00
Matt Wells
5e4b5a112c Merge branch 'master' into diffbot
Conflicts:

	PageResults.cpp
	Threads.cpp
	XmlDoc.cpp
	XmlDoc.h
2013-12-07 11:34:26 -07:00
Matt Wells
105be1fbdc more core fixes 2013-12-07 10:38:47 -07:00
Matt Wells
8d92a079c2 minor spider error reply time fix 2013-12-07 10:21:51 -07:00
Matt Wells
e731e5a4d8 Merge branch 'diffbot' of git@github.com:gigablast/open-source-search-engine into diffbot 2013-12-07 10:21:21 -07:00
Matt Wells
0e846a9389 minor spider reply error fix 2013-12-07 10:21:02 -07:00
Matt Wells
626a97770c another core fix 2013-12-07 10:14:37 -07:00
Matt Wells
fda7b48500 fix core 2013-12-07 10:11:13 -07:00
Matt Wells
1bc80ab552 fixed pagereindex. we now add spiderreplies
for internal errors like ENOMEM or ENOTFOUND
to try to avoid the "CRITICAL CRITICAL" msgs.
these are considered temporary errors.
2013-12-07 10:01:17 -07:00
Matt Wells
d9b31d3481 quick bug fix 2013-12-06 22:57:49 -07:00
Matt Wells
269c10f648 try to figure out why pagereindex never
displayed html page when done.
2013-12-06 22:56:06 -07:00
Matt Wells
e7bd904765 fix docids only printing. 2013-12-06 09:53:32 -07:00
Matt Wells
c50ef1954f show admin controls on serps if ip is local.
fixed up the "reindex" page for deleting/reindexing
search results for a given query.
2013-12-06 09:48:30 -07:00
Matt Wells
4b3e111bed fix spider dumping to remember
uh48's between list readings.
was showing dups for www.nordicusa.com/webtv
at the end.
2013-12-05 10:09:06 -08:00
Matt Wells
99cc10fccd allow seed urls to match url crawl pattern
regardless.
2013-12-03 17:13:38 -08:00
Matt Wells
432099c4e6 added rebuild=true fix for regex crawl change 2013-12-03 16:23:58 -08:00
Matt Wells
2e46bcc97f Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot 2013-12-03 16:23:20 -08:00
Matt Wells
03219a3057 add regex support back in 2013-12-03 16:23:05 -08:00
Matt Wells
6ab9041f45 fix bug when just getting the crawl parms
was rebuilding the waiting tree.
2013-12-03 16:17:36 -08:00
Matt Wells
9f1d79b124 check for null collrec 2013-12-02 10:13:19 -08:00
Matt Wells
cda5968b75 update common word list 2013-12-01 15:19:33 -07:00
Matt Wells
39f8dc646b default gigabits on for my copy. 2013-12-01 15:07:06 -07:00
Matt Wells
7f4dca7a07 Merge branch 'master' of git@github.com:gigablast/open-source-search-engine 2013-12-01 14:47:16 -07:00
Matt Wells
7874c8d832 added ifdef NEEDSLICENSE 2013-12-01 14:47:08 -07:00
Gigablast
dfe72a76a0 Update LICENSE
updates to license
2013-12-01 13:43:14 -08:00
Matt Wells
d43b55103c show query in msg20 log msg 2013-12-01 12:11:25 -07:00
Matt Wells
1077191e4a fix log msg bug. 2013-12-01 12:08:05 -07:00
Matt Wells
08030865e4 fix compiler warning 2013-12-01 11:57:26 -07:00
Matt Wells
d811a13627 fix small oopsy 2013-12-01 11:56:33 -07:00
Matt Wells
3155869fbf added new log msg for
recording cpu time for summary generation.
2013-12-01 11:53:41 -07:00
Matt Wells
5ee2be8fcf fixed data corruption bug. m_finalCrawlDelay
was being stored in xmldoc titlerec header.
2013-11-27 14:18:15 -08:00
Matt Wells
1129e9b635 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot 2013-11-27 14:09:54 -08:00
Matt Wells
57eb231a4e do not add timestamps to lastdownload
cache if skiphammercheck is true. those
are like robots.txt or redirs or root files.
2013-11-26 14:21:17 -08:00
Matt Wells
0f3374e3f3 measure crawl delay by default from
start of each download now. it is
a parm in msg13request.
2013-11-26 14:07:28 -08:00
Matt Wells
4769ca0881 if pthread_create() returns EAGAIN then do
not always retry, it makes an infinite loop.
2013-11-26 14:52:07 -07:00
Matt Wells
8bb086ac60 crawldelay works now but it measures
from the end of the download, not the
beginning.
2013-11-26 12:58:14 -08:00
Matt Wells
1c7c9a4d80 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot 2013-11-26 09:19:26 -08:00
Matt Wells
040bdb8039 fix url filters formulation.
fixed extra , in json.
fixed upp and ucp patterns if all substrings
are negative.
2013-11-26 09:17:38 -08:00
Matt Wells
ca544ddb90 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot 2013-11-25 15:06:11 -08:00
Matt Wells
1bbbcff755 fix getTokenizedDiffbotReply()
to look for type: with a {} depth of 1
so it does not pick up on the
type:image in the images array if there is
one in the article.
2013-11-25 13:58:31 -08:00
Matt Wells
61ce4be279 fix major bug when you have twins/mirrors.
queries not returning all the results.
2013-11-25 09:53:53 -07:00
Matt Wells
9a456de178 minor fix 2013-11-24 20:48:47 -07:00
Matt Wells
5da41cd113 fix a couple different cores. 2013-11-24 19:46:44 -07:00
Matt Wells
41ce557627 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot 2013-11-22 18:26:53 -08:00
Matt Wells
e8065a0f0a enforce crawl delay perfectly. 2013-11-22 18:26:34 -08:00
Matt Wells
1826860094 forgot to add diffbot api url parm 2013-11-22 17:55:37 -08:00