Commit Graph

26 Commits

Author SHA1 Message Date
Matt Wells
c8c56a24da fixed query reindex for diffbot json docs.
added recycle content checkbox to query reindex.
fix gbsortbyint: at end of query core.
only show 'all spiders paused' msg for active jobs.
show error summaries if doc not found and &showerrors=1.
2014-12-15 16:49:20 -08:00
Matt
4e8a42e024 text replacements for bad int32_t substitutions 2014-11-17 18:24:38 -08:00
Matt
931a1c4bc6 good checkpoint. quite a few fixes. 2014-11-17 18:13:36 -08:00
Matt
96b8197ad3 now it compiles with -m32 2014-11-10 14:45:11 -08:00
Matt Wells
b393a1bbbe Merge branch 'testing' into diffbot-matt
Conflicts:
	Errno.cpp
	Errno.h
2014-07-10 10:06:55 -07:00
mwells
5bbdb8e172 got page add url and add url api working. 2014-07-09 20:32:30 -07:00
mwells
d9ae010371 shard gbfacetstr:gbxpathsitehash123456 terms by termid for speed.
got them working again multicasting a msg 0x39 to the appropriate shard.
set special msg39request flag for better performance for those guys.
2014-07-07 12:32:27 -07:00
mwells
6434e5cc04 Merge branch 'testing' into diffbot-matt
Conflicts:
	Errno.cpp
	Errno.h
	Parms.h
2014-07-07 09:49:59 -07:00
mwells
43d0d636ee fix dmoz building. 2014-07-05 22:20:15 -07:00
mwells
29d170631a more api updates 2014-07-05 12:36:01 -07:00
Matt Wells
1361e5728c show actual diffbot error in urls.csv.
do not stop indexing page and harvesting links on diffbot error.
2014-07-02 11:53:24 -07:00
mwells
92799ef393 add support for tunnelling https fetch
through an http proxy using CONNECT
directive. needs more debugging.
2014-07-01 10:43:52 -06:00
Matt Wells
27ffd23345 handle boolean query overflow errors better. 2014-06-10 17:21:55 -07:00
Matt Wells
72c6d032d8 fix query reindex on subdocuments (diffbot json blurbs)
so that they just put in a spiderrequest to reindex
the parent url. Added &diffbotreply= to the injection
interface so dan can provide that along with the
pageUrl he passes in with &u=
2014-05-15 14:11:12 -07:00
Matt Wells
82726879a2 support base64 generated thumbnails in serps. 2014-04-24 14:04:57 -07:00
mwells
be99155986 more updates 2014-04-09 11:03:31 -07:00
Daniel Steinberg
2331b4673d Defect #2099: throw an error if a crawl request was made with a name that already existed for a bulk request (or the other way around) 2014-03-11 16:21:58 -07:00
Matt Wells
6a45e42128 added ability to treat <link xyz.com rel=canonical> as meta redirects.
should help us dedup.
added a function to do looser deduping of spider pages, although it is currently
not enabled; we are still using the more strict one.
added documentation on how we dedup to developer.html for jon to
take a look at.
2014-01-30 10:04:09 -08:00
Matt Wells
f64b53bfb3 almost done with rebalancing code 2014-01-10 14:12:58 -08:00
Matt Wells
b0d77a834a do not spider fake ips requests, just re-add them
with the right firstip
2013-12-20 12:22:02 -08:00
mwells
e04d596288 minor comments update. 2013-12-09 13:42:33 -08:00
Matt Wells
6495dfd86e try to fix json parser overflow error. needs
testing. tried to fix round num from incrementing
for little job because i think server overload.
should be fixed properly some time. just made wait
time 30 secs instead of 10 in Spider.cpp.
2013-11-15 11:30:16 -08:00
mwells
9bf8bf7712 add spider reply even on g_errno now with an error
code of EINTERNAL error in the spider reply.
no longer just sit on the lock. this was blocking
an entire ip when just lock sitting for 3 hrs.
and only do read rate timeouts if there was at least
one byte read. this was causing diffbot reply to
read rate timeout after just 60 seconds even though
its timeout was specified as 90 seconds.
2013-09-29 09:22:20 -06:00
Matt Wells
a50898649b various fixes. 2013-09-16 10:16:49 -07:00
Matt Wells
5dc7bd2ab4 integrate diffbot from svn back into git. 2013-09-13 09:23:18 -07:00
Matt Wells
f6e560c1f4 Initial file population. 2013-08-02 13:12:24 -07:00