Matt Wells
20a2729827
added jobCreationTimeUTC and jobCompletionTimeUTC
...
to json api
2014-04-25 14:12:18 -07:00
Matt Wells
f3c06ced57
try to fix core from deleting coll
2014-04-25 11:52:17 -07:00
Matt Wells
ffc68d80d6
fix startup core
2014-04-12 13:22:38 -07:00
mwells
02304073d4
doc updates. core fixes.
2014-04-10 00:31:41 -07:00
mwells
ac5cf7971b
more misc updates.
2014-04-05 18:09:04 -07:00
Matt Wells
d6434191d1
nomenclature changes to reduce collisions.
...
name collection 'qatest123' for doing smoke tests,
not 'test'.
2014-03-31 15:02:17 -07:00
Matt Wells
67202f3731
Merge branch 'diffbot' into diffbot-testing
2014-03-20 15:39:03 -07:00
Matt Wells
99bd9319fd
temp hack to reduce network comm
...
between trinity and neo
2014-03-20 15:42:34 -07:00
Matt Wells
6e23d37e47
Merge branch 'diffbot' into diffbot-testing
2014-03-17 17:27:28 -07:00
Matt Wells
9d3c35ad17
nothing
2014-03-17 13:53:19 -07:00
mwells
7812f5c746
more bool fixes. still needs a little more work
2014-03-13 13:54:23 -07:00
Matt Wells
c4b38a5c72
fix a few cores from previous code updates
2014-03-11 09:36:33 -07:00
Matt Wells
bd4484db3c
Merge branch 'testing' into diffbot-testing
2014-03-10 12:08:23 -07:00
Matt Wells
11e8c16878
new site list updates
2014-03-09 17:53:24 -07:00
Matt Wells
4cb66c31bf
get this new api spidering
2014-03-08 12:02:20 -07:00
Matt Wells
624c1d4e68
nuke doledb fixes
2014-03-08 10:51:15 -07:00
Matt Wells
8aa0662a27
Merge branch 'diffbot' into testing
...
Conflicts:
Make.depend
PageResults.cpp
Parms.cpp
Spider.cpp
Spider.h
gb.conf
2014-03-08 09:38:44 -07:00
Matt Wells
72fab5b61e
Do not end a crawl while urls are still being spidered
...
because they might add more links to spiderdb when
they finally complete.
2014-03-07 09:30:12 -08:00
Matt Wells
7cdd411ef1
Merge branch 'diffbot' into diffbot-testing
2014-03-07 09:26:47 -08:00
Matt Wells
27e8e810d2
use collnum instead of coll string.
...
more stable since resetting collections
keeps string the same but changes the collnum.
2014-03-06 15:48:11 -08:00
Matt Wells
13e33bc261
fix jezebel crawl from hanging.
2014-03-04 19:45:26 -08:00
Matt Wells
1acb16b1ee
tweak empty doledb priority logic.
...
anchor it more to m_doleIpTable for more
reliability. seems like it was causing some
slowdowns during spidering. seems more
continuous now.
2014-03-03 13:48:59 -08:00
Matt Wells
48b5330d9c
only skip checking to spider a url if its
...
doleip table is empty
2014-03-03 13:22:27 -08:00
Matt Wells
a82abe8260
added ^ operator to url crawl patterns.
...
good for tmz crawl.
2014-03-02 14:57:59 -08:00
Matt Wells
7fd6bbd7f5
added ^ support to url crawl expressions
2014-03-02 14:41:25 -08:00
Matt Wells
ceb623bb8f
do not dedup bulks.
...
only respider urls if error is tmp.
mess with msg1 in spider.cpp so niceness
is MAX_NICENESS and not 0 because it was
not able to trigger a doledb dump.
2014-02-23 20:04:46 -08:00
Matt Wells
e87b71caef
fix query reindex core
2014-02-19 21:07:01 -08:00
Matt Wells
dda7648333
try to fix problem of crawls stopping
...
when they shouldn't. seems like it might
be doing the trick.
2014-02-19 00:51:46 -08:00
Matt Wells
b48adc0542
try to fix crawls stopping too early
2014-02-18 10:28:48 -08:00
Matt Wells
ae2aed7066
try to fix a few cores from deleting collections.
...
try to spider urls again if user changes
certain crawling parms. like regex, patterns, etc.
2014-02-18 09:44:15 -08:00
Matt Wells
f942183104
ignore maxtocrawl for bulk jobs too
2014-02-16 22:24:17 -08:00
Matt Wells
a4deb7ff08
exempt bulk jobs from maxtoprocess
2014-02-16 22:14:43 -08:00
Matt Wells
9d0dca71db
fix rapid coll delete bug some more.
2014-02-16 20:13:06 -08:00
Matt Wells
ce652462b0
add color coded circles to coll nav bar.
...
disk usage red box.
2014-02-16 19:59:53 -07:00
Matt Wells
c691b2dd5f
hopcount precedence fix
2014-02-16 16:01:29 -08:00
Matt Wells
f8135e628e
fall back to hop count if priority
...
is tied (and both are due to be spidered).
defaults back to breadth first like it was
doing before.
2014-02-16 15:52:08 -08:00
Matt Wells
08b103f3a4
Merge branch 'diffbot-testing' into diffbot
...
Conflicts:
Spider.cpp
2014-02-13 10:11:56 -08:00
Matt Wells
d2b473e554
checkpoint
2014-02-09 19:09:44 -07:00
Matt Wells
6c9a44367f
code checkpoint
2014-02-09 12:38:40 -07:00
Matt Wells
252d24dc2a
fix core of page spiders
2014-02-07 10:46:10 -08:00
Matt Wells
edef3acf37
remove buggy line
2014-02-06 21:19:37 -08:00
Matt Wells
b3453c248e
take out buggy statement.
2014-02-06 21:16:30 -08:00
Matt Wells
7b42d2848d
formatting fixes
2014-02-06 21:06:31 -08:00
Matt Wells
8f6a4ee9b6
do not save collrecs all the time.
...
stop superfluously setting m_needsSave.
try to stop evaluating crawls that have
completed because of lack of urls. we still
need to fix it so if they change url filters
so that more urls become available, that we retry!
2014-02-06 15:27:49 -08:00
Matt Wells
106077c163
fix spiderrequest deduping some more
2014-02-06 09:47:18 -08:00
Matt Wells
4029b0b937
more faster spider fixes. tried to fix
...
corrupt rdbcache.
2014-02-06 09:25:27 -08:00
Matt Wells
9145d89e3f
raise spiderdb minfilestomerge from 2 to 3
...
to reduce merging since we allow many urls
in doledb for the same firstip now
2014-02-05 19:35:19 -08:00
Matt Wells
203cdc5f99
delete from winnertable when deleting from winnertree
2014-02-05 19:12:33 -08:00
Matt Wells
25e7ba5ef8
fix too many spiders out per ip some more
2014-02-05 17:11:45 -08:00
Matt Wells
5c8b9af1d3
fix rdbcache corruption from -O2 compile bug.
...
fix too many spiders per ip bug!
2014-02-05 16:58:21 -08:00