Commit Graph

200 Commits

Author SHA1 Message Date
Matt Wells
af9eb8fb73 need to allow clients to not restrict to
seed domains.
2014-02-26 22:27:22 -08:00
Matt Wells
a0697e1bb5 do not allow custom crawls to spider the web any more. 2014-02-26 10:26:09 -08:00
Matt Wells
8bb5d106db fixes for query reindex/delete. 2014-02-25 18:12:45 -08:00
Matt Wells
ceb623bb8f do not dedup bulks.
only respider urls if error is tmp.
mess with msg1 in spider.cpp so niceness
is MAX_NICENESS and not 0 because it was
not able to trigger a doledb dump.
2014-02-23 20:04:46 -08:00
Matt Wells
88dfa20cbe docid based spider rec related fixes. 2014-02-20 08:46:00 -08:00
Matt Wells
ae2aed7066 try to fix a few cores from deleting collections.
try to spider urls again if user changes
certain crawling parms. like regex, patterns, etc.
2014-02-18 09:44:15 -08:00
Matt Wells
9d0dca71db fix rapid coll delete bug some more. 2014-02-16 20:13:06 -08:00
Matt Wells
fe63371622 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing 2014-02-16 13:39:02 -08:00
Matt Wells
4930243de3 minor updates 2014-02-16 13:38:54 -08:00
Matt Wells
fe0f2d3537 allow coll delete if not the one being repaired 2014-02-16 10:55:34 -08:00
Matt Wells
c3d8a143be fix bug of process regex being ignored
when crawl regex was specified.
2014-02-13 10:06:14 -08:00
Matt Wells
3b0a571cea fix security system to actually work now 2014-02-12 00:06:00 -07:00
Matt Wells
69fa6662bc EDOCUNCHANGED fixes for diffbot 2014-02-10 16:23:39 -08:00
Matt Wells
4029b0b937 more faster spider fixes. tried to fix
corrupt rdbcache.
2014-02-06 09:25:27 -08:00
Matt Wells
93021b2f13 Merge branch 'diffbot'
Conflicts:

	Collectiondb.cpp
	Spider.cpp
	Spider.h
2014-02-01 11:31:00 -07:00
Matt Wells
095c47f181 Merge branch 'diffbot'
Conflicts:

	Collectiondb.cpp
	Spider.cpp
	Spider.h
2014-02-01 11:28:31 -07:00
Matt Wells
b40f393f4c fix a couple cores related to deleting collections
in progress. support termlist dump with terms
containing colons.
2014-01-29 15:56:07 -08:00
Matt Wells
726090be83 contains a hack fix to fix things at startup
but now it is commented out.
2014-01-25 15:07:47 -08:00
Matt Wells
e3f769dffe fixes for sudden revitalization of dead crawls. 2014-01-25 11:03:15 -08:00
mwells
308106673c added debug statements for email bug 2014-01-24 14:08:27 -08:00
Matt Wells
27b6ceffa8 fix bug of sending notification email twice
for really really tiny jobs.
2014-01-23 21:22:39 -08:00
Matt Wells
c4a6ad1145 update "this round" counts to at least
the total counts if round # is 0 so we do not
double spider everyone's jobs!
put a check in rebalance loop to see if gb
is exiting so we don't get into an infinite loop.
this should be in redmine now...
2014-01-23 18:22:13 -08:00
Matt Wells
9432ae870d fix bug to pass jenkins. 2014-01-23 09:38:15 -08:00
Matt Wells
e351cb9939 free spidercolls on exit 2014-01-22 23:52:23 -08:00
Matt Wells
bc35b7d0ec fix pagecrawlbot.cpp to support &c=token-name.
cleanup mem at process exit better.
2014-01-22 23:40:38 -08:00
Matt Wells
8a9b1f7a19 added diffbot retry rules.
added maxTotalSpiders parm for
all colls to follow.
tried to fix msg 0x00 socket jam up.
2014-01-22 19:57:38 -08:00
Matt Wells
33c5d9c07f a lot of times rdb tree has invalid collection
numbers in it so fix our counting algo in case
the collection rec no longer exists!
2014-01-21 19:01:44 -08:00
Matt Wells
45cb5c9a0c fix bugs to try to get sharding working
on crawlbot today
2014-01-21 13:58:21 -08:00
Matt Wells
41cdfcef96 inc spider limits in various places 2014-01-20 18:51:15 -08:00
Matt Wells
e9bbc16a9f took out pagecount table. just hafta scan
twice i think because caching counts
gets complicated because of adding
duplicate injection requests!
2014-01-19 20:34:38 -08:00
Matt Wells
58d0c444ac fixes for the global index quota system 2014-01-19 19:38:23 -08:00
Matt Wells
4606e88721 code cleanups.
xmldoc::injectDoc(), and it'll
add a SpiderRequest as well.
better collectiondb init code.
2014-01-18 21:19:26 -08:00
Matt Wells
10f4443974 quite a few fixes to the quota system, cleanups etc. 2014-01-18 16:23:13 -08:00
Matt Wells
fa59c62264 more bug fixes associated with collections
and site page counts in url filters.
2014-01-18 11:54:58 -08:00
Matt Wells
5b7170e8c6 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
Conflicts:
	Json.cpp
	PageAddUrl.cpp
	PageStats.cpp
	Spider.cpp
2014-01-17 21:07:08 -08:00
Matt Wells
4e803210ee tons of changes from live github on neo.
lots of core fixes.
took out ppthtml powerpoint convert, it hangs.
dynamic rdbmap to save memory per coll.
fixed disk page cache logic and brought it
back.
2014-01-17 21:01:43 -08:00
Matt Wells
d8554bfb0f update default parm settings. 2014-01-09 13:22:51 -08:00
Matt Wells
3db562f0f1 bug fixes for pages indexed and manual seeds counting. 2014-01-07 15:38:22 -08:00
Matt Wells
f0c232803a fix page url filters editing 2014-01-07 14:27:58 -08:00
Matt Wells
d69fc065ce just use a global diffbot api url for simplicity
to avoid having to call XmlDoc::getUrlFilterNum()
which is not good practice since url filters are for
spiderdb reads really.
2014-01-07 12:31:25 -08:00
Matt Wells
909022642d Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot 2014-01-07 12:10:59 -08:00
Matt Wells
e366c12470 Merge branch 'master' into diffbot
Conflicts:
	Collectiondb.cpp
	Msg13.cpp
	Parms.cpp
	Spider.h
2014-01-07 12:09:11 -08:00
Matt Wells
4f64677b4f get new global preemptive cache
logic compiling, with section voting
stats.
2014-01-05 11:51:09 -08:00
Matt Wells
7df2111ceb fixed 'gb inject titledb-DIR newhosts.conf' command
for populating an index from titledb files in DIR
and transmitting to appropriate host in newhosts.conf.
also prettied up the gb -h output to use a formatting
function.
2014-01-02 01:20:08 -07:00
Matt Wells
6aac48e487 fix crawl delay wait queue logic.
if coll already exists trying to add, let it be. don't error out.
2013-12-27 14:35:51 -08:00
Matt Wells
100af585a6 parm sync fixes 2013-12-26 11:20:19 -08:00
Matt Wells
3acd6a08d5 add the true spider request when
retrying to spider a fake-ip spider request.
add a EFAKEIP error reply for the fake ip request.
prevents us double spidering the same url.
2013-12-23 10:27:42 -08:00
Matt Wells
2ac8ff2952 compile regex so it's case dependent 2013-12-23 09:30:35 -08:00
Matt Wells
4c7ce819b9 fix core dump 2013-12-19 18:39:29 -08:00
Matt Wells
c2f8445a70 expand reg ex shortcuts like \d to [0-9] 2013-12-19 18:31:37 -08:00
Matt Wells
3092dcecaa rebuild url filters and regexes at startup 2013-12-19 15:56:27 -08:00
Matt Wells
99099505d8 call regfree before changing regex 2013-12-19 15:32:26 -08:00
Matt Wells
7f70e4e887 fix regex logic 2013-12-19 15:19:18 -08:00
Matt Wells
6f0137889b fixes for getUrlFilterNum so
it looks at "hadReply" bit
in SpiderRequest when getting
diffbot api url.
2013-12-18 14:05:41 -08:00
Matt Wells
1b5057ad42 log cleanups mostly.
took out disk page cache,
kinda buggy... need to fix at some point.
2013-12-18 10:57:18 -08:00
Matt Wells
2f2333abd1 parmdb fixes 2013-12-17 14:53:33 -08:00
Matt Wells
33ba8070b5 more bug fixes parmdb 2013-12-17 13:09:05 -08:00
Matt Wells
3f19ece776 parmdb updates 2013-12-16 17:07:15 -08:00
Matt Wells
1fe91cad2f parmdb updates 2013-12-16 14:10:39 -08:00
Matt Wells
0615acff17 zero out url filters checkboxes on submit 2013-12-16 11:03:40 -08:00
mwells
76bb3d05e1 clean up logging so i can see what's going on 2013-12-10 16:41:30 -08:00
mwells
db74af766b fix core in addExistingColl() 2013-12-10 15:46:38 -08:00
mwells
82494baa89 move CollectionRec stuff into Collectiondb files
for simplicity.
2013-12-10 15:28:04 -08:00
mwells
f2d5661965 parmdb overhaul. support collection add/del
sync when host comes back online. use udp not tcp.
host #0 can now handle a new incoming request while
a parm change is currently outstanding.
all missed "command" parms will be received when a dead host
comes back online, too, like a tight merge for instance.
does not use msg4, uses msg3e and msg3f for syncing and
sending parms.
2013-12-10 13:09:55 -08:00
Matt Wells
ed79b67d2e core dump fixes 2013-12-08 15:36:23 -07:00
Matt Wells
06edfddf31 a bunch of bug fixes, mostly spider related.
also some for pagereindex.
2013-12-07 21:56:37 -07:00
Matt Wells
c669f8c138 fix file descriptor leak in Dir class.
try to fix core from Thread getting SIGALRM.
try to set NOFILES to 1024 at startup in case
more are allowed.
2013-11-19 13:41:56 -08:00
Matt Wells
fe1a7d1a75 rdbbase not fully resetting? it was
trying to dump to coll directories that
had been moved to trash folder.
and printing out "deleted from under us".
at least it wasn't corrupting data in RdbMem
this time because i added m_dumpErrno logic.
2013-11-15 09:01:58 -08:00
Matt Wells
62432b3530 support for &restart=1 2013-11-14 14:02:56 -08:00
Matt Wells
af678b7c1b fix a few bugs. 2013-11-10 22:11:13 -08:00
Matt Wells
105a201cde fix mem leak.
check if tree writes are disabled and block
until not when deleting/resetting a collection.
just like we do if tree is being saved.
2013-11-10 16:28:00 -08:00
Matt Wells
9d016b5c3c reset spiderstatus 2013-10-30 13:49:31 -07:00
Matt Wells
3acc1f9a51 deal with if callback is null
when deleting/resetting collnum
2013-10-30 13:18:19 -07:00
Matt Wells
d0ddfb7d7d would block when deleting or resetting
a collection when the rdb tree is saving to
disk. keeps retrying every 100ms since it
modifies the tree.
2013-10-30 13:12:46 -07:00
Matt Wells
b83dd59913 fix bug when we nuke a collnum
from a tree right in the middle of when
saving rdb trees in process.cpp.
2013-10-30 12:27:08 -07:00
Matt Wells
adf4d258ae better crawl status reporting.
allow for _ in coll names.
2013-10-30 10:00:46 -07:00
Matt Wells
54d3375a00 fixes when crawling on distributed 2x2 2013-10-25 14:54:24 -07:00
Matt Wells
e5aa795b76 reset seed dedup table when collrec reset 2013-10-23 18:12:50 -07:00
Matt Wells
c39b45ff88 fix crawl round end detection etc.
inc round counter even if not repeating crawl
2013-10-23 15:53:59 -07:00
Matt Wells
8f5bb4a787 a few core dump fixes. get crawl-delay
working a little. about half way done.
2013-10-22 15:44:10 -07:00
Matt Wells
c7d7e24f9b show spider rounds and round starttime in
json output. fixed url filters display bug.
reset seeds safebuf parm when coll is reset.
2013-10-21 19:20:03 -07:00
Matt Wells
0e4d96b3f8 added "seeds" to json reply. store seed urls
(and dedup them) in collrec. fixed some respidering
issues. any time we re-enter url filters
then rebuild the waiting tree.
2013-10-21 17:35:14 -07:00
Matt Wells
64a1c7c2f2 more bug fixes. if spiders disabled for row
in url filters, don't spider the url.
2013-10-21 14:45:12 -07:00
Matt Wells
978910ca7a fix more bugs. 2013-10-21 14:17:32 -07:00
Matt Wells
605289e130 fix a couple collection related bugs
causing cores in crawlbot.
2013-10-21 11:38:33 -07:00
Matt Wells
85bca4f3d1 can now delete collection while spiders are out 2013-10-18 18:11:14 -07:00
Matt Wells
889583ec4b now we can reset collection mid stream 2013-10-18 17:49:36 -07:00
Matt Wells
ecab57ff0f change collnum of reset collection
so any adds in progress will fail.
2013-10-18 15:46:00 -07:00
Matt Wells
b589b17e63 fix collection resetting. 2013-10-18 15:21:00 -07:00
Matt Wells
f5e5b0f5d3 fix crawlbot bugs 2013-10-16 12:12:22 -07:00
mwells
e7377d72ab fix robots.txt switch. fix collection rec saving.
require collname explicitly for injecturl urldata.
2013-09-27 11:39:23 -06:00
Matt Wells
7fdbd0f66a delete spider coll when deleting coll 2013-09-18 15:36:30 -07:00
Matt Wells
e50da4d012 crawlbot api fixes 2013-09-17 15:47:44 -07:00
Matt Wells
c16fe8601b more crawlbot api fixes 2013-09-17 15:32:28 -07:00
Matt Wells
c81f700bf0 get reset collection kinda working. 2013-09-17 14:13:44 -07:00
Matt Wells
4321f02e4e trying to get reset collection working 2013-09-17 12:21:09 -07:00
Matt Wells
4c11265a98 more updates to crawlbot api 2013-09-16 13:59:11 -07:00
Matt Wells
5dc7bd2ab4 integrate diffbot from svn back into git. 2013-09-13 09:23:18 -07:00
Matt Wells
94e6492916 removed MAX_COLL_RECS so we can have unlimited
collections, really limited by the sizeof(collnum_t) only now,
which is 16bits, 15bits unsigned, which is the limitation.
can always expand this so we can have more than 32k collections.
2013-08-30 16:20:38 -07:00
Matt Wells
f6e560c1f4 Initial file population. 2013-08-02 13:12:24 -07:00