Matt Wells
156b50240a
code checkpoint
2014-02-08 16:24:33 -07:00
Matt Wells
e593b6e1de
basic controls code checkpoint.
2014-02-08 15:10:06 -07:00
Matt Wells
dabd691626
basic admin controls page structure
2014-02-08 00:34:45 -07:00
Matt Wells
fc47c18aec
new printadmintop functionality.
2014-02-07 23:08:04 -07:00
Matt Wells
b634d06287
fix some cores. use olddoc contenthash
for msg13 call for EDOCUNCHANGED errors.
2014-02-07 18:28:09 -08:00
Matt Wells
252d24dc2a
fix core of page spiders
2014-02-07 10:46:10 -08:00
Matt Wells
573a04bccd
fix bug in gbminint.
2014-02-06 21:36:47 -08:00
Matt Wells
edef3acf37
remove buggy line
2014-02-06 21:19:37 -08:00
Matt Wells
b3453c248e
take out buggy statement.
2014-02-06 21:16:30 -08:00
Matt Wells
7b42d2848d
formatting fixes
2014-02-06 21:06:31 -08:00
Matt Wells
2d4af1aefe
index numbers as integers too, not just floats
so we can sort by spider date without losing
128 seconds of resolution.
2014-02-06 20:57:54 -08:00
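The 128-second figure in the commit above matches the spacing of single-precision floats at current Unix timestamps. A 32-bit float has a 24-bit significand, so between 2^30 and 2^31 adjacent values are 2^7 = 128 apart. The sketch below checks this; it assumes the numbers were indexed as 32-bit IEEE 754 floats, and the helper names and sample timestamp are illustrative, not taken from the codebase.

```python
import struct


def to_float32(value: float) -> float:
    """Round value to the nearest IEEE 754 single-precision float."""
    return struct.unpack("<f", struct.pack("<f", float(value)))[0]


def float32_spacing(value: float) -> float:
    """Gap between the float32 nearest to a positive value and the next one up."""
    bits = struct.unpack("<I", struct.pack("<f", float(value)))[0]
    next_up = struct.unpack("<f", struct.pack("<I", bits + 1))[0]
    return next_up - to_float32(value)


# Illustrative spider timestamp in Unix seconds (early February 2014).
spider_time = 1391749074

print(float32_spacing(spider_time))  # 128.0

# Two spider times one minute apart can collapse to the same float32,
# so sorting by the float field cannot tell them apart.
print(to_float32(spider_time) == to_float32(spider_time + 60))  # True
```

Storing the same value as a 32-bit integer keeps one-second resolution, which is what sorting by spider date needs.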
Matt Wells
63e95c3b2d
show lastSpidered time at end of json item.
it's a float so we should probably store it
as an int as well. we lose 128 seconds of resolution.
2014-02-06 18:56:38 -08:00
Matt Wells
8d534b8ed8
many more fixes for streaming mode
2014-02-06 18:21:22 -08:00
Matt Wells
874311ae52
fixes for streaming mode.
2014-02-06 16:28:42 -08:00
Matt Wells
5787b15884
Merge branch 'diffbot' into diffbot-testing
2014-02-06 15:26:21 -08:00
Matt Wells
8f6a4ee9b6
do not save collrecs all the time.
stop superfluously setting m_needsSave.
try to stop evaluating crawls that have
completed because of lack of urls. we still
need to fix it so if they change url filters
so that more urls become available, that we retry!
2014-02-06 15:27:49 -08:00
Matt Wells
845611ae1b
&stream=1 stream mode fixes.
2014-02-06 15:23:53 -08:00
Matt Wells
4cfe69a96f
minor link updates
2014-02-06 14:41:33 -08:00
Matt Wells
f9dbd64056
get streaming time sliced results working
2014-02-06 14:25:44 -08:00
Matt Wells
106077c163
fix spiderrequest deduping some more
2014-02-06 09:47:18 -08:00
Matt Wells
4029b0b937
more faster spider fixes. tried to fix
corrupt rdbcache.
2014-02-06 09:25:27 -08:00
Matt Wells
9145d89e3f
raise spiderdb minfilestomerge from 2 to 3
to reduce merging since we allow many urls
in doledb for the same firstip now
2014-02-05 19:35:19 -08:00
Matt Wells
203cdc5f99
delete from winnertable when deleting from winnertree
2014-02-05 19:12:33 -08:00
Matt Wells
25e7ba5ef8
fix too many spiders out per ip some more
2014-02-05 17:11:45 -08:00
Matt Wells
2842350e6d
gb.conf spiders back on
2014-02-05 16:59:06 -08:00
Matt Wells
5c8b9af1d3
fix rdbcache corruption from -O2 compile bug.
fix too many spiders per ip bug!
2014-02-05 16:58:21 -08:00
Matt Wells
951e9d5068
wait 180 secs for diffbot reply
2014-02-05 15:46:26 -08:00
Matt Wells
c60dcf4ecb
show userobots for bulk jobs
2014-02-05 15:45:39 -08:00
Matt Wells
d9f0d57c0c
core fixes. csv fixes.
2014-02-05 14:56:22 -08:00
Matt Wells
ecc10c2cb9
dup cache fixes. do not add dups to spiderdb either.
2014-02-05 14:09:35 -08:00
Matt Wells
7806a8a68c
fix excessive dupcache deduping.
2014-02-05 13:41:15 -08:00
Matt Wells
c159f80f05
MAX_WINNER_NODES back to 40.
2014-02-05 13:25:04 -08:00
Matt Wells
9c26b85c2f
fixed contenthash32 logic for json objects.
fixed hashing of numbers/bools for json objects.
added m_dupCache to reduce spiderrequests added to spiderdb.
do not add urls to waitingtree if ufn is obviously filtered/banned.
do not spider spiderrequest from doledb if maxoutperip would
be violated.
2014-02-05 13:22:03 -08:00
Matt Wells
d86c7b8fbb
do not store 40 urls in doledb if
firstip does not have that many urls to begin
with. it's better to just store one url in doledb
for small domains.
2014-02-04 20:39:46 -08:00
Matt Wells
bda134268e
added winnertable to avoid dups in winnertree.
2014-02-04 20:09:43 -08:00
Matt Wells
053a9b9a0d
spiders seem to be working somewhat now.
2014-02-04 18:23:37 -08:00
Matt Wells
189999509b
code checkpoint. time slicing, faster spider code
compiling. now needs debug.
2014-02-04 17:34:43 -08:00
Matt Wells
7f4d3205e5
streaming results code checkpoint.
2014-02-04 17:05:43 -08:00
Matt Wells
3312400fee
checkpoint for faster spider code.
2014-02-04 16:15:27 -08:00
Matt Wells
20c31dcc78
Merge branch 'master' into diffbot-slicing
2014-02-04 12:28:43 -08:00
Matt Wells
d2cebad8e7
spidercoll deletion fixes.
2014-02-04 12:28:05 -08:00
Matt Wells
9ded8fa091
faster spiders checkpoint
2014-02-04 12:26:42 -08:00
Matt Wells
258e3cba0d
fix maxtocrawl limit thing
2014-02-04 09:25:27 -07:00
Matt Wells
17fff243f9
add connectips back. call them adminIps this time.
if your ip is on the list then you have admin
access. cookie tokens will come later/soon.
2014-02-03 20:47:48 -07:00
Matt Wells
d3b498a057
time slice checkpoint
2014-02-03 19:17:58 -08:00
Matt Wells
5ea852dac3
fix core when thread fails to spawn.
2014-02-03 07:27:32 -07:00
Matt Wells
b46da4c192
prevent msg20/tagdb lookup socket jam up.
throttle back max outstanding msg20s (summary generations)
based on used udp sockets.
2014-02-03 07:09:29 -07:00
Matt Wells
56adb2ee8c
nomenclature. url filters -> spider scheduler
2014-02-02 17:00:11 -07:00
Matt Wells
10235bb840
fix add url and cached page getting
2014-02-02 16:49:31 -07:00
Matt Wells
7bf8a2ac49
do not let glibc do malloc checks, we do that.
2014-02-02 13:41:59 -07:00