1565 Commits

Author SHA1 Message Date
mwells
72df0d25d2 added safebuf base64decode func 2014-06-06 16:20:15 -07:00
mwells
965d992f98 Merge branch 'diffbot-testing' into diffbot-matt
Conflicts:
	Msg13.cpp
2014-06-06 15:14:41 -07:00
mwells
3f2dcda4e1 got new floater/proxy logic compiling. 2014-06-06 15:11:51 -07:00
Matt Wells
6b5b83ac85 fixes for gbmin/gbmax being first query term. 2014-06-06 10:20:12 -07:00
Matt Wells
5b6100c77d log format change for errcnt 2014-06-06 09:29:57 -07:00
Matt Wells
d850f5f006 try to prevent job status flip flop from error retries. 2014-06-05 23:38:54 -07:00
Matt Wells
74f0a41290 bulk jobs give up after downloading a url
3 times. crawls don't give up on tmperrors,
but retry every 30 days.
2014-06-05 23:11:14 -07:00
Matt Wells
172d7071a7 fix to rename tagdb0000.002.dat 2014-06-05 22:21:41 -07:00
Matt Wells
8ac691f324 fix merging getting clogged by so many
collections trying to merge tagdb at once
2014-06-05 21:27:33 -07:00
Matt Wells
13243a411c more fixes for fake http reply hack 2014-06-05 20:31:49 -07:00
Matt Wells
ce7294e9a9 more mem leak fixes for fake
bulk job empty http replies
2014-06-05 20:09:12 -07:00
Matt Wells
7f10fca234 no longer add www to url domain if it is just
a domain. was messing up tmblr.co where www.tmblr.co
has no IP.
2014-06-05 17:00:12 -07:00
Matt Wells
3c6a8bf87e fix issue of not retrying diffbot internal errors. 2014-06-05 16:24:52 -07:00
Matt Wells
cfda735194 print error stuff in spiderdb dump. 2014-06-05 16:14:32 -07:00
Matt Wells
970eb33a83 sanity checks to ensure fakefirstip
was able to convert to a real good firstip
2014-06-05 16:13:33 -07:00
Matt Wells
7b4b8b27bd more debug msgs 2014-06-05 14:58:20 -07:00
Matt Wells
5f41840211 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing 2014-06-05 14:48:23 -07:00
Matt Wells
8477ef72f8 support gbmin gbmax gbminint gbmaxint range query
terms properly, when generating the docidvotebuf.
fixes boolean queries using them as well.
2014-06-05 14:47:45 -07:00
Matt Wells
1fe2c94322 add some debug notes 2014-06-05 12:26:06 -07:00
Matt Wells
780fd43aae timestamp bug fix 2014-06-04 15:50:26 -07:00
mwells
2a36e5bde5 Merge branch 'diffbot-testing' into diffbot-matt 2014-06-04 14:40:34 -07:00
Matt Wells
546d135007 fix boolean queries to do the on-demand
mini merges of the termlists. should fix
gbmin:offerprice:100 AND (text:lord OR text:helicopter)
2014-06-04 14:33:54 -07:00
mwells
2c750b2c22 Merge branch 'diffbot-testing' into diffbot-matt 2014-06-04 13:56:44 -07:00
Matt Wells
d98cf4b2b0 try to prevent slamming diffbot backend
with bulk jobs consisting of hundreds of
different domains/ips.
2014-06-04 12:37:49 -07:00
Matt Wells
4298e4e752 sanity checks for debugging duplicate
titledb file bug.
2014-06-04 12:15:12 -07:00
Matt Wells
b7d9002a05 fix log bug 2014-06-04 10:57:25 -07:00
Matt Wells
8b74bd855b Merge branch 'master' into diffbot-testing 2014-06-04 09:37:55 -07:00
Matt Wells
fcc8bc85cc update bulk job restart 2014-06-04 09:36:26 -07:00
Matt Wells
e2ca303fe2 doc updates 2014-06-04 07:38:40 -07:00
mwells
a734240474 minor date change in documentation. 2014-06-04 07:26:46 -07:00
mwells
beba94013e remove clustermaintenance documentation. seemed pretty
obsolete.
2014-06-04 07:26:10 -07:00
mwells
3fd973a53e documentation updates for scaling the cluster 2014-06-04 07:17:34 -07:00
Matt Wells
ec1b66aff5 Merge branch 'master' into diffbot-testing 2014-06-03 20:50:59 -07:00
Matt Wells
585e6a357f parm documentation update for url filters 2014-06-03 20:50:22 -07:00
Matt Wells
b534ac5812 do not print completed time if spidering is going on 2014-06-03 20:30:10 -07:00
Matt Wells
694b19e053 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing 2014-06-03 18:07:51 -07:00
Matt Wells
07cf2f1129 fix core 2014-06-03 18:07:35 -07:00
Matt Wells
f8073b5adc Merge branch 'diffbot-testing' into diffbot-matt 2014-06-03 14:59:31 -07:00
Matt Wells
50468293e7 fix bool expressions with only one operand.
i.e. double parens bug.
2014-06-03 14:46:28 -07:00
Matt Wells
bf70823260 take out <moreResultsFollow> for &stream=1
for now. maybe add back in later but would be
at end of the reply.
2014-06-03 14:09:24 -07:00
mwells
d23032241d fix mem leak when downloading images is turned on. 2014-06-03 13:26:56 -07:00
Matt Wells
da677eb8a4 fix for searching for query pipe operator in quotes. 2014-06-03 13:08:35 -07:00
mwells
91c7115c73 nothing 2014-06-03 11:49:21 -07:00
mwells
a1f1daad16 Merge branch 'master' into diffbot-matt
Conflicts:
	Spider.cpp
2014-06-03 11:41:46 -07:00
mwells
6dcbc10e92 spider proxy updates. 2014-06-03 11:38:44 -07:00
mwells
ba2329808b fix siteListIsEmpty bug causing spider to
spider the whole internet when it shouldn't
2014-06-03 11:37:31 -07:00
Matt Wells
c3a823c99d fix relative url bug when relative url starts with ? 2014-06-03 10:54:50 -07:00
Matt Wells
536b43e19f Merge branch 'master' into diffbot-testing 2014-06-03 10:17:00 -07:00
mwells
a772e21db6 only show proxy stuff in logs when debugging is on for it 2014-06-02 17:37:43 -07:00
mwells
937e275134 zero out the proxy ips 2014-06-02 17:32:13 -07:00