Commit Graph

274 Commits

Author SHA1 Message Date
Matt Wells
20a2729827 added jobCreationTimeUTC and jobCompletionTimeUTC
to json api
2014-04-25 14:12:18 -07:00
Matt Wells
f3c06ced57 try to fix core from deleting coll 2014-04-25 11:52:17 -07:00
Matt Wells
ffc68d80d6 fix startup core 2014-04-12 13:22:38 -07:00
mwells
02304073d4 doc updates. core fixes. 2014-04-10 00:31:41 -07:00
mwells
ac5cf7971b more misc updates. 2014-04-05 18:09:04 -07:00
Matt Wells
d6434191d1 nomenclature changes to reduce collisions.
name collection 'qatest123' for doing smoke tests,
not 'test'.
2014-03-31 15:02:17 -07:00
Matt Wells
67202f3731 Merge branch 'diffbot' into diffbot-testing 2014-03-20 15:39:03 -07:00
Matt Wells
99bd9319fd temp hack to reduce network comm
between trinity and neo
2014-03-20 15:42:34 -07:00
Matt Wells
6e23d37e47 Merge branch 'diffbot' into diffbot-testing 2014-03-17 17:27:28 -07:00
Matt Wells
9d3c35ad17 nothing 2014-03-17 13:53:19 -07:00
mwells
7812f5c746 more bool fixes. still needs a little more work 2014-03-13 13:54:23 -07:00
Matt Wells
c4b38a5c72 fix a few cores from previous code updates 2014-03-11 09:36:33 -07:00
Matt Wells
bd4484db3c Merge branch 'testing' into diffbot-testing 2014-03-10 12:08:23 -07:00
Matt Wells
11e8c16878 new site list updates 2014-03-09 17:53:24 -07:00
Matt Wells
4cb66c31bf get this new api spidering 2014-03-08 12:02:20 -07:00
Matt Wells
624c1d4e68 nuke doledb fixes 2014-03-08 10:51:15 -07:00
Matt Wells
8aa0662a27 Merge branch 'diffbot' into testing
Conflicts:

	Make.depend
	PageResults.cpp
	Parms.cpp
	Spider.cpp
	Spider.h
	gb.conf
2014-03-08 09:38:44 -07:00
Matt Wells
72fab5b61e Do not end a crawl while urls are still being spidered
because they might add more links to spiderdb when
they finally complete.
2014-03-07 09:30:12 -08:00
Matt Wells
7cdd411ef1 Merge branch 'diffbot' into diffbot-testing 2014-03-07 09:26:47 -08:00
Matt Wells
27e8e810d2 use collnum instead of coll string.
more stable since resetting collections
keeps string the same but changes the collnum.
2014-03-06 15:48:11 -08:00
Matt Wells
13e33bc261 fix jezebel crawl from hanging. 2014-03-04 19:45:26 -08:00
Matt Wells
1acb16b1ee tweak empty doledb priority logic.
anchor it more to m_doleIpTable for more
reliability. seems like it was causing some
slowdowns during spidering. seems more
continuous now.
2014-03-03 13:48:59 -08:00
Matt Wells
48b5330d9c only skip checking to spider a url if its
doleip table is empty
2014-03-03 13:22:27 -08:00
Matt Wells
a82abe8260 added ^ operator to url crawl patterns.
good for tmz crawl.
2014-03-02 14:57:59 -08:00
Matt Wells
7fd6bbd7f5 added ^ support to url crawl expressions 2014-03-02 14:41:25 -08:00
Matt Wells
ceb623bb8f do not dedup bulks.
only respider urls if error is tmp.
mess with msg1 in spider.cpp so niceness
is MAX_NICENESS and not 0 because it was
not able to trigger a doledb dump.
2014-02-23 20:04:46 -08:00
Matt Wells
e87b71caef fix query reindex core 2014-02-19 21:07:01 -08:00
Matt Wells
dda7648333 try to fix problem of crawls stopping
when they shouldn't. seems like it might
be doing the trick.
2014-02-19 00:51:46 -08:00
Matt Wells
b48adc0542 try to fix crawls stopping too early 2014-02-18 10:28:48 -08:00
Matt Wells
ae2aed7066 try to fix a few cores from deleting collections.
try to spider urls again if user changes
certain crawling parms. like regex, patterns, etc.
2014-02-18 09:44:15 -08:00
Matt Wells
f942183104 ignore maxtocrawl for bulk jobs too 2014-02-16 22:24:17 -08:00
Matt Wells
a4deb7ff08 exempt bulk jobs from maxtoprocess 2014-02-16 22:14:43 -08:00
Matt Wells
9d0dca71db fix rapid coll delete bug some more. 2014-02-16 20:13:06 -08:00
Matt Wells
ce652462b0 add color coded circles to coll nav bar.
disk usage red box.
2014-02-16 19:59:53 -07:00
Matt Wells
c691b2dd5f hopcount precedence fix 2014-02-16 16:01:29 -08:00
Matt Wells
f8135e628e fall back to hop count if priority
is tied (and both are due to be spidered).
defaults back to breadth first like it was
doing before.
2014-02-16 15:52:08 -08:00
Matt Wells
08b103f3a4 Merge branch 'diffbot-testing' into diffbot
Conflicts:
	Spider.cpp
2014-02-13 10:11:56 -08:00
Matt Wells
d2b473e554 checkpoint 2014-02-09 19:09:44 -07:00
Matt Wells
6c9a44367f code checkpoint 2014-02-09 12:38:40 -07:00
Matt Wells
252d24dc2a fix core of page spiders 2014-02-07 10:46:10 -08:00
Matt Wells
edef3acf37 remove buggy line 2014-02-06 21:19:37 -08:00
Matt Wells
b3453c248e take out buggy statement. 2014-02-06 21:16:30 -08:00
Matt Wells
7b42d2848d formatting fixes 2014-02-06 21:06:31 -08:00
Matt Wells
8f6a4ee9b6 do not save collrecs all the time.
stop superfluously setting m_needsSave.
try to stop evaluating crawls that have
completed because of lack of urls. we still
need to fix it so if they change url filters
so that more urls become available, that we retry!
2014-02-06 15:27:49 -08:00
Matt Wells
106077c163 fix spiderrequest deduping some more 2014-02-06 09:47:18 -08:00
Matt Wells
4029b0b937 more faster spider fixes. tried to fix
corrupt rdbcache.
2014-02-06 09:25:27 -08:00
Matt Wells
9145d89e3f raise spiderdb minfilestomerge from 2 to 3
to reduce merging since we allow many urls
in doledb for the same firstip now
2014-02-05 19:35:19 -08:00
Matt Wells
203cdc5f99 delete from winnertable when deleting from winnertree 2014-02-05 19:12:33 -08:00
Matt Wells
25e7ba5ef8 fix too many spiders out per ip some more 2014-02-05 17:11:45 -08:00
Matt Wells
5c8b9af1d3 fix rdbcache corruption from -O2 compile bug.
fix too many spiders per ip bug!
2014-02-05 16:58:21 -08:00