1210 Commits

Author SHA1 Message Date
Matt Wells
156b50240a code checkpoint 2014-02-08 16:24:33 -07:00
Matt Wells
e593b6e1de basic controls code checkpoint. 2014-02-08 15:10:06 -07:00
Matt Wells
dabd691626 basic admin controls page structure 2014-02-08 00:34:45 -07:00
Matt Wells
fc47c18aec new printadmintop functionality. 2014-02-07 23:08:04 -07:00
Matt Wells
b634d06287 fix some cores. use olddoc contenthash
for msg13 call for EDOCUNCHANGED errors.
2014-02-07 18:28:09 -08:00
Matt Wells
252d24dc2a fix core of page spiders 2014-02-07 10:46:10 -08:00
Matt Wells
573a04bccd fix bug in gbminint. 2014-02-06 21:36:47 -08:00
Matt Wells
edef3acf37 remove buggy line 2014-02-06 21:19:37 -08:00
Matt Wells
b3453c248e take out buggy statement. 2014-02-06 21:16:30 -08:00
Matt Wells
7b42d2848d formatting fixes 2014-02-06 21:06:31 -08:00
Matt Wells
2d4af1aefe index numbers as integers too, not just floats
so we can sort by spider date without losing
128 seconds of resolution.
2014-02-06 20:57:54 -08:00
Matt Wells
63e95c3b2d show lastSpidered time at end of json item.
it's a float so we should probably store it
as an int as well. we lose 128 seconds of resolution.
2014-02-06 18:56:38 -08:00
Matt Wells
8d534b8ed8 many more fixes for streaming mode 2014-02-06 18:21:22 -08:00
Matt Wells
874311ae52 fixes for streaming mode. 2014-02-06 16:28:42 -08:00
Matt Wells
5787b15884 Merge branch 'diffbot' into diffbot-testing 2014-02-06 15:26:21 -08:00
Matt Wells
8f6a4ee9b6 do not save collrecs all the time.
stop superfluously setting m_needsSave.
try to stop evaluating crawls that have
completed because of lack of urls. we still
need to fix it so if they change url filters
so that more urls become available, that we retry!
2014-02-06 15:27:49 -08:00
Matt Wells
845611ae1b &stream=1 stream mode fixes. 2014-02-06 15:23:53 -08:00
Matt Wells
4cfe69a96f minor link updates 2014-02-06 14:41:33 -08:00
Matt Wells
f9dbd64056 get streaming time sliced results working 2014-02-06 14:25:44 -08:00
Matt Wells
106077c163 fix spiderrequest deduping some more 2014-02-06 09:47:18 -08:00
Matt Wells
4029b0b937 more faster spider fixes. tried to fix
corrupt rdbcache.
2014-02-06 09:25:27 -08:00
Matt Wells
9145d89e3f raise spiderdb minfilestomerge from 2 to 3
to reduce merging since we allow many urls
in doledb for the same firstip now
2014-02-05 19:35:19 -08:00
Matt Wells
203cdc5f99 delete from winnertable when deleting from winnertree 2014-02-05 19:12:33 -08:00
Matt Wells
25e7ba5ef8 fix too many spiders out per ip some more 2014-02-05 17:11:45 -08:00
Matt Wells
2842350e6d gb.conf spiders back on 2014-02-05 16:59:06 -08:00
Matt Wells
5c8b9af1d3 fix rdbcache corruption from -O2 compile bug.
fix too many spiders per ip bug!
2014-02-05 16:58:21 -08:00
Matt Wells
951e9d5068 wait 180 secs for diffbot reply 2014-02-05 15:46:26 -08:00
Matt Wells
c60dcf4ecb show userobots for bulk jobs 2014-02-05 15:45:39 -08:00
Matt Wells
d9f0d57c0c core fixes. csv fixes. 2014-02-05 14:56:22 -08:00
Matt Wells
ecc10c2cb9 dup cache fixes. do not add dups to spiderdb either. 2014-02-05 14:09:35 -08:00
Matt Wells
7806a8a68c fix excessive dupcache deduping. 2014-02-05 13:41:15 -08:00
Matt Wells
c159f80f05 MAX_WINNER_NODES back to 40. 2014-02-05 13:25:04 -08:00
Matt Wells
9c26b85c2f fixed contenthash32 logic for json objects.
fixed hashing of numbers/bools for json objects.
added m_dupCache to reduce spiderrequests added to spiderdb.
do not add urls to waitingtree if ufn is obviously filtered/banned.
do not spider spiderrequest from doledb if maxoutperip would
be violated.
2014-02-05 13:22:03 -08:00
Matt Wells
d86c7b8fbb do not store 40 urls in doledb if
firstip does not have that many urls to begin
with. it's better to just store one url in doledb
for small domains.
2014-02-04 20:39:46 -08:00
Matt Wells
bda134268e added winnertable to avoid dups in winnertree. 2014-02-04 20:09:43 -08:00
Matt Wells
053a9b9a0d spiders seem to be working somewhat now. 2014-02-04 18:23:37 -08:00
Matt Wells
189999509b code checkpoint. time slicing, faster spider code
compiling. now needs debug.
2014-02-04 17:34:43 -08:00
Matt Wells
7f4d3205e5 streaming results code checkpoint. 2014-02-04 17:05:43 -08:00
Matt Wells
3312400fee checkpoint for faster spider code. 2014-02-04 16:15:27 -08:00
Matt Wells
20c31dcc78 Merge branch 'master' into diffbot-slicing 2014-02-04 12:28:43 -08:00
Matt Wells
d2cebad8e7 spidercoll deletion fixes. 2014-02-04 12:28:05 -08:00
Matt Wells
9ded8fa091 faster spiders checkpoint 2014-02-04 12:26:42 -08:00
Matt Wells
258e3cba0d fix maxtocrawl limit thing 2014-02-04 09:25:27 -07:00
Matt Wells
17fff243f9 add connectips back. call them adminIps this time.
if your ip is on the list then you have admin
access. cookie tokens will come later/soon.
2014-02-03 20:47:48 -07:00
Matt Wells
d3b498a057 time slice checkpoint 2014-02-03 19:17:58 -08:00
Matt Wells
5ea852dac3 fix core when thread fails to spawn. 2014-02-03 07:27:32 -07:00
Matt Wells
b46da4c192 prevent msg20/tagdb lookup socket jam up.
throttle back max outstanding msg20s (summary generations)
based on used udp sockets.
2014-02-03 07:09:29 -07:00
Matt Wells
56adb2ee8c nomenclature. url filters -> spider scheduler 2014-02-02 17:00:11 -07:00
Matt Wells
10235bb840 fix add url and cached page getting 2014-02-02 16:49:31 -07:00
Matt Wells
7bf8a2ac49 do not let glibc do malloc checks, we do that. 2014-02-02 13:41:59 -07:00
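
Note on commits 2d4af1aefe and 63e95c3b2d: both mention losing 128 seconds of resolution when the spider date is stored as a float. That figure is consistent with a 32-bit IEEE 754 float holding a Unix timestamp from 2014, although the log does not state the storage width, so single precision is an assumption here. The sketch below is not taken from the repository; the helper names and the sample timestamp are hypothetical and only illustrate the arithmetic.

```python
import struct


def as_float32(value):
    """Round a number to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", float(value)))[0]


def float32_spacing(value):
    """Gap between adjacent float32 values near a positive integer.

    float32 has a 24-bit significand, so for 2**e <= value < 2**(e + 1)
    neighbouring representable values are 2**(e - 23) apart.
    """
    exponent = int(value).bit_length() - 1
    return 2 ** (exponent - 23)


# Hypothetical Unix timestamp from early February 2014 (between 2**30 and 2**31).
spidered_at = 1391749120

print(float32_spacing(spidered_at))                             # 128
print(as_float32(spidered_at) == as_float32(spidered_at + 60))  # True
```

Two pages spidered a minute apart can therefore collapse to the same float value, which is why sorting by spider date benefits from also indexing the number as an integer.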