Matt Wells
|
020d7741b9
|
new coll.conf for main with ismedia filter.
updated url filters docs some more for "isnew"
and explained the errorcount stuff more.
|
2013-12-08 10:10:51 -07:00 |
|
Matt Wells
|
65e75167e3
|
limit posdb merging to 8 files max.
added some more url filters documentation.
|
2013-12-08 09:41:05 -07:00 |
|
Matt Wells
|
78a4cfe6da
|
forgot to push the .h files
|
2013-12-07 22:12:48 -07:00 |
|
Matt Wells
|
e1712fc94f
|
fix uninitialized diffbot titlerec
header parms. ignore them when not
a custom crawl.
|
2013-12-07 22:11:26 -07:00 |
|
Matt Wells
|
06edfddf31
|
a bunch of bug fixes, mostly spider related.
also some for pagereindex.
|
2013-12-07 21:56:37 -07:00 |
|
Matt Wells
|
5e4b5a112c
|
Merge branch 'master' into diffbot
Conflicts:
PageResults.cpp
Threads.cpp
XmlDoc.cpp
XmlDoc.h
|
2013-12-07 11:34:26 -07:00 |
|
Matt Wells
|
105be1fbdc
|
more core fixes
|
2013-12-07 10:38:47 -07:00 |
|
Matt Wells
|
8d92a079c2
|
minor spider error reply time fix
|
2013-12-07 10:21:51 -07:00 |
|
Matt Wells
|
e731e5a4d8
|
Merge branch 'diffbot' of git@github.com:gigablast/open-source-search-engine into diffbot
|
2013-12-07 10:21:21 -07:00 |
|
Matt Wells
|
0e846a9389
|
minor spider reply error fix
|
2013-12-07 10:21:02 -07:00 |
|
Matt Wells
|
626a97770c
|
another core fix
|
2013-12-07 10:14:37 -07:00 |
|
Matt Wells
|
fda7b48500
|
fix core
|
2013-12-07 10:11:13 -07:00 |
|
Matt Wells
|
1bc80ab552
|
fixed pagereindex. we now add spiderreplies
for internal errors like ENOMEM or ENOTFOUND
to try to avoid the "CRITICAL CRITICAL" msgs.
these are considered temporary errors.
|
2013-12-07 10:01:17 -07:00 |
|
Matt Wells
|
d9b31d3481
|
quick bug fix
|
2013-12-06 22:57:49 -07:00 |
|
Matt Wells
|
269c10f648
|
try to figure out why pagereindex never
displayed html page when done.
|
2013-12-06 22:56:06 -07:00 |
|
Matt Wells
|
e7bd904765
|
fix docids only printing.
|
2013-12-06 09:53:32 -07:00 |
|
Matt Wells
|
c50ef1954f
|
show admin controls on serps if ip is local.
fixed up the "reindex" page for deleting/reindexing
search results for a given query.
|
2013-12-06 09:48:30 -07:00 |
|
Matt Wells
|
4b3e111bed
|
fix spider dumping to remember
uh48's between list readings.
was showing dups for www.nordicusa.com/webtv
at the end.
|
2013-12-05 10:09:06 -08:00 |
|
Matt Wells
|
99cc10fccd
|
allow seed urls to match url crawl pattern
regardless.
|
2013-12-03 17:13:38 -08:00 |
|
Matt Wells
|
432099c4e6
|
added rebuild=true fix for regex crawl change
|
2013-12-03 16:23:58 -08:00 |
|
Matt Wells
|
2e46bcc97f
|
Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
|
2013-12-03 16:23:20 -08:00 |
|
Matt Wells
|
03219a3057
|
add regex support back in
|
2013-12-03 16:23:05 -08:00 |
|
Matt Wells
|
6ab9041f45
|
fix bug where just getting the crawl parms
was rebuilding the waiting tree.
|
2013-12-03 16:17:36 -08:00 |
|
Matt Wells
|
9f1d79b124
|
check for null collrec
|
2013-12-02 10:13:19 -08:00 |
|
Matt Wells
|
cda5968b75
|
update common word list
|
2013-12-01 15:19:33 -07:00 |
|
Matt Wells
|
39f8dc646b
|
default gigabits on for my copy.
|
2013-12-01 15:07:06 -07:00 |
|
Matt Wells
|
7f4dca7a07
|
Merge branch 'master' of git@github.com:gigablast/open-source-search-engine
|
2013-12-01 14:47:16 -07:00 |
|
Matt Wells
|
7874c8d832
|
added ifdef NEEDSLICENSE
|
2013-12-01 14:47:08 -07:00 |
|
Gigablast
|
dfe72a76a0
|
Update LICENSE
updates to license
|
2013-12-01 13:43:14 -08:00 |
|
Matt Wells
|
d43b55103c
|
show query in msg20 log msg
|
2013-12-01 12:11:25 -07:00 |
|
Matt Wells
|
1077191e4a
|
fix log msg bug.
|
2013-12-01 12:08:05 -07:00 |
|
Matt Wells
|
08030865e4
|
fix compiler warning
|
2013-12-01 11:57:26 -07:00 |
|
Matt Wells
|
d811a13627
|
fix small oopsy
|
2013-12-01 11:56:33 -07:00 |
|
Matt Wells
|
3155869fbf
|
added new log msg for
recording cpu time for summary generation.
|
2013-12-01 11:53:41 -07:00 |
|
Matt Wells
|
5ee2be8fcf
|
fixed data corruption bug. m_finalCrawlDelay
was being stored in xmldoc titlerec header.
|
2013-11-27 14:18:15 -08:00 |
|
Matt Wells
|
1129e9b635
|
Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
|
2013-11-27 14:09:54 -08:00 |
|
Matt Wells
|
57eb231a4e
|
do not add timestamps to lastdownload
cache if skiphammercheck is true. those
are like robots.txt or redirs or root files.
|
2013-11-26 14:21:17 -08:00 |
|
Matt Wells
|
0f3374e3f3
|
measure crawl delay by default from
start of each download now. it is
a parm in msg13request.
|
2013-11-26 14:07:28 -08:00 |
|
Matt Wells
|
4769ca0881
|
if pthread_create() returns EAGAIN then do
not always retry; it makes an infinite loop.
|
2013-11-26 14:52:07 -07:00 |
|
Matt Wells
|
8bb086ac60
|
crawldelay works now but it measures
from the end of the download, not the
beginning.
|
2013-11-26 12:58:14 -08:00 |
|
Matt Wells
|
1c7c9a4d80
|
Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
|
2013-11-26 09:19:26 -08:00 |
|
Matt Wells
|
040bdb8039
|
fix url filters formulation.
fixed extra , in json.
fixed upp and ucp patterns if all substrings
are negative.
|
2013-11-26 09:17:38 -08:00 |
|
Matt Wells
|
ca544ddb90
|
Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
|
2013-11-25 15:06:11 -08:00 |
|
Matt Wells
|
1bbbcff755
|
fix getTokenizedDiffbotReply()
to look for type: with a {} depth of 1
so it does not pick up on the
type:image in the images array if there is
one in the article.
|
2013-11-25 13:58:31 -08:00 |
|
Matt Wells
|
61ce4be279
|
fix major bug when you have twins/mirrors.
queries not returning all the results.
|
2013-11-25 09:53:53 -07:00 |
|
Matt Wells
|
9a456de178
|
minor fix
|
2013-11-24 20:48:47 -07:00 |
|
Matt Wells
|
5da41cd113
|
fix a couple different cores.
|
2013-11-24 19:46:44 -07:00 |
|
Matt Wells
|
41ce557627
|
Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
|
2013-11-22 18:26:53 -08:00 |
|
Matt Wells
|
e8065a0f0a
|
enforce crawl delay perfectly.
|
2013-11-22 18:26:34 -08:00 |
|
Matt Wells
|
1826860094
|
forgot to add diffbot api url parm
|
2013-11-22 17:55:37 -08:00 |
|