Matt Wells
af9eb8fb73
need to allow clients to not restrict to
...
seed domains.
2014-02-26 22:27:22 -08:00
Matt Wells
a0697e1bb5
do not allow custom crawls to spider the web any more.
2014-02-26 10:26:09 -08:00
Matt Wells
8bb5d106db
fixes for query reindex/delete.
2014-02-25 18:12:45 -08:00
Matt Wells
ceb623bb8f
do not dedup bulks.
...
only respider urls if error is tmp.
mess with msg1 in spider.cpp so niceness
is MAX_NICENESS and not 0 because it was
not able to trigger a doledb dump.
2014-02-23 20:04:46 -08:00
Matt Wells
88dfa20cbe
docid based spider rec related fixes.
2014-02-20 08:46:00 -08:00
Matt Wells
ae2aed7066
try to fix a few cores from deleting collections.
...
try to spider urls again if user changes
certain crawling parms. like regex, patterns, etc.
2014-02-18 09:44:15 -08:00
Matt Wells
9d0dca71db
fix rapid coll delete bug some more.
2014-02-16 20:13:06 -08:00
Matt Wells
fe63371622
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
2014-02-16 13:39:02 -08:00
Matt Wells
4930243de3
minor updates
2014-02-16 13:38:54 -08:00
Matt Wells
fe0f2d3537
allow coll delete if not the one being repaired
2014-02-16 10:55:34 -08:00
Matt Wells
c3d8a143be
fix bug of process regex being ignored
...
when crawl regex was specified.
2014-02-13 10:06:14 -08:00
Matt Wells
3b0a571cea
fix security system to actually work now
2014-02-12 00:06:00 -07:00
Matt Wells
69fa6662bc
EDOCUNCHANGED fixes for diffbot
2014-02-10 16:23:39 -08:00
Matt Wells
4029b0b937
more faster spider fixes. tried to fix
...
corrupt rdbcache.
2014-02-06 09:25:27 -08:00
Matt Wells
93021b2f13
Merge branch 'diffbot'
...
Conflicts:
Collectiondb.cpp
Spider.cpp
Spider.h
2014-02-01 11:31:00 -07:00
Matt Wells
095c47f181
Merge branch 'diffbot'
...
Conflicts:
Collectiondb.cpp
Spider.cpp
Spider.h
2014-02-01 11:28:31 -07:00
Matt Wells
b40f393f4c
fix a couple cores related to deleting collections
...
in progress. support termlist dump with terms
containing colons.
2014-01-29 15:56:07 -08:00
Matt Wells
726090be83
contains a hack fix to fix things at startup
...
but now it is commented out.
2014-01-25 15:07:47 -08:00
Matt Wells
e3f769dffe
fixes for sudden revitalization of dead crawls.
2014-01-25 11:03:15 -08:00
mwells
308106673c
added debug statements for email bug
2014-01-24 14:08:27 -08:00
Matt Wells
27b6ceffa8
fix bug of sending notification email twice
...
for really really tiny jobs.
2014-01-23 21:22:39 -08:00
Matt Wells
c4a6ad1145
update "this round" counts to at least
...
the total counts if round # is 0 so we do not
double spider everyone's jobs!
put a check in rebalance loop to see if gb
is exiting so we don't get into an infinite loop.
this should be in redmine now...
2014-01-23 18:22:13 -08:00
Matt Wells
9432ae870d
fix bug to pass jenkins.
2014-01-23 09:38:15 -08:00
Matt Wells
e351cb9939
free spidercolls on exit
2014-01-22 23:52:23 -08:00
Matt Wells
bc35b7d0ec
fix pagecrawlbot.cpp to support &c=token-name.
...
cleanup mem at process exit better.
2014-01-22 23:40:38 -08:00
Matt Wells
8a9b1f7a19
added diffbot retry rules.
...
added maxTotalSpiders parm for
all colls to follow.
tried to fix msg 0x00 socket jam up.
2014-01-22 19:57:38 -08:00
Matt Wells
33c5d9c07f
a lot of times rdb tree has invalid collection
...
numbers in it so fix our counting algo in case
the collection rec no longer exists!
2014-01-21 19:01:44 -08:00
Matt Wells
45cb5c9a0c
fix bugs to try to get sharding working
...
on crawlbot today
2014-01-21 13:58:21 -08:00
Matt Wells
41cdfcef96
inc spider limits in various places
2014-01-20 18:51:15 -08:00
Matt Wells
e9bbc16a9f
took out pagecount table. just hafta scan
...
twice i think because caching counts
gets complicated because of adding
duplicate injection requests!
2014-01-19 20:34:38 -08:00
Matt Wells
58d0c444ac
fixes for the global index quota system
2014-01-19 19:38:23 -08:00
Matt Wells
4606e88721
code cleanups.
...
xmldoc::injectDoc(), and it'll
add a SpiderRequest as well.
better collectiondb init code.
2014-01-18 21:19:26 -08:00
Matt Wells
10f4443974
quite a few fixes to the quota system, cleanups etc.
2014-01-18 16:23:13 -08:00
Matt Wells
fa59c62264
more bug fixes associated with collections
...
and site page counts in url filters.
2014-01-18 11:54:58 -08:00
Matt Wells
5b7170e8c6
Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
...
Conflicts:
Json.cpp
PageAddUrl.cpp
PageStats.cpp
Spider.cpp
2014-01-17 21:07:08 -08:00
Matt Wells
4e803210ee
tons of changes from live github on neo.
...
lots of core fixes.
took out ppthtml powerpoint convert, it hangs.
dynamic rdbmap to save memory per coll.
fixed disk page cache logic and brought it
back.
2014-01-17 21:01:43 -08:00
Matt Wells
d8554bfb0f
update default parm settings.
2014-01-09 13:22:51 -08:00
Matt Wells
3db562f0f1
bug fixes for pages indexed and manual seeds counting.
2014-01-07 15:38:22 -08:00
Matt Wells
f0c232803a
fix page url filters editing
2014-01-07 14:27:58 -08:00
Matt Wells
d69fc065ce
just use a global diffbot api url for simplicity
...
to avoid having to call XmlDoc::getUrlFilterNum()
which is not good practice since url filters are for
spiderdb reads really.
2014-01-07 12:31:25 -08:00
Matt Wells
909022642d
Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
2014-01-07 12:10:59 -08:00
Matt Wells
e366c12470
Merge branch 'master' into diffbot
...
Conflicts:
Collectiondb.cpp
Msg13.cpp
Parms.cpp
Spider.h
2014-01-07 12:09:11 -08:00
Matt Wells
4f64677b4f
get new global preemptive cache
...
logic compiling, with section voting
stats.
2014-01-05 11:51:09 -08:00
Matt Wells
7df2111ceb
fixed 'gb inject titledb-DIR newhosts.conf' command
...
for populating an index from titledb files in DIR
and transmitting to appropriate host in newhosts.conf.
also prettied up the gb -h output to use a formatting
function.
2014-01-02 01:20:08 -07:00
Matt Wells
6aac48e487
fix crawl delay wait queue logic.
...
if coll already exists trying to add, let it be. don't error out.
2013-12-27 14:35:51 -08:00
Matt Wells
100af585a6
parm sync fixes
2013-12-26 11:20:19 -08:00
Matt Wells
3acd6a08d5
add the true spider request when
...
retrying to spider a fake-ip spider request.
add a EFAKEIP error reply for the fake ip request.
prevents us double spidering the same url.
2013-12-23 10:27:42 -08:00
Matt Wells
2ac8ff2952
compile regex so it's case dependent
2013-12-23 09:30:35 -08:00
Matt Wells
4c7ce819b9
fix core dump
2013-12-19 18:39:29 -08:00
Matt Wells
c2f8445a70
expand regex shortcuts like \d to [0-9]
2013-12-19 18:31:37 -08:00
Matt Wells
3092dcecaa
rebuild url filters and regexes at startup
2013-12-19 15:56:27 -08:00
Matt Wells
99099505d8
call regfree before changing regex
2013-12-19 15:32:26 -08:00
Matt Wells
7f70e4e887
fix regex logic
2013-12-19 15:19:18 -08:00
Matt Wells
6f0137889b
fixes for getUrlFilterNum so
...
it looks at "hadReply" bit
in SpiderRequest when getting
diffbot api url.
2013-12-18 14:05:41 -08:00
Matt Wells
1b5057ad42
log cleanups mostly.
...
took out disk page cache,
kinda buggy... need to fix at some point.
2013-12-18 10:57:18 -08:00
Matt Wells
2f2333abd1
parmdb fixes
2013-12-17 14:53:33 -08:00
Matt Wells
33ba8070b5
more bug fixes parmdb
2013-12-17 13:09:05 -08:00
Matt Wells
3f19ece776
parmdb updates
2013-12-16 17:07:15 -08:00
Matt Wells
1fe91cad2f
parmdb updates
2013-12-16 14:10:39 -08:00
Matt Wells
0615acff17
zero out url filters checkboxes on submit
2013-12-16 11:03:40 -08:00
mwells
76bb3d05e1
clean up logging so i can see what's going on
2013-12-10 16:41:30 -08:00
mwells
db74af766b
fix core in addExistingColl()
2013-12-10 15:46:38 -08:00
mwells
82494baa89
move CollectionRec stuff into Collectiondb files
...
for simplicity.
2013-12-10 15:28:04 -08:00
mwells
f2d5661965
parmdb overhaul. support collection add/del
...
sync when host comes back online. use udp not tcp.
host #0 can now handle a new incoming request while
a parm change is currently outstanding.
all missed "command" parms will be received when a dead host
comes back online, too, like a tight merge for instance.
does not use msg4, uses msg3e and msg3f for syncing and
sending parms.
2013-12-10 13:09:55 -08:00
Matt Wells
ed79b67d2e
core dump fixes
2013-12-08 15:36:23 -07:00
Matt Wells
06edfddf31
a bunch of bug fixes, mostly spider related.
...
also some for pagereindex.
2013-12-07 21:56:37 -07:00
Matt Wells
c669f8c138
fix file descriptor leak in Dir class.
...
try to fix core from Thread getting SIGALRM.
try to set NOFILES to 1024 at startup in case
more are allowed.
2013-11-19 13:41:56 -08:00
Matt Wells
fe1a7d1a75
rdbbase not fully resetting? it was
...
trying to dump to coll directories that
had been moved to trash folder.
and printing out "deleted from under us".
at least it was corrupting data in RdbMem
this time because i added m_dumpErrno logic.
2013-11-15 09:01:58 -08:00
Matt Wells
62432b3530
support for &restart=1
2013-11-14 14:02:56 -08:00
Matt Wells
af678b7c1b
fix a few bugs.
2013-11-10 22:11:13 -08:00
Matt Wells
105a201cde
fix mem leak.
...
check if tree writes are disabled and block
until not when deleting/resetting a collection.
just like we do if the tree is being saved.
2013-11-10 16:28:00 -08:00
Matt Wells
9d016b5c3c
reset spiderstatus
2013-10-30 13:49:31 -07:00
Matt Wells
3acc1f9a51
deal with if callback is null
...
when deleting/resetting collnum
2013-10-30 13:18:19 -07:00
Matt Wells
d0ddfb7d7d
would block when deleting or resetting
...
a collection when the rdb tree is saving to
disk. keeps retrying every 100ms since it
modifies the tree.
2013-10-30 13:12:46 -07:00
Matt Wells
b83dd59913
fix bug when we nuke a collnum
...
from a tree right in the middle of when
saving rdb trees in process.cpp.
2013-10-30 12:27:08 -07:00
Matt Wells
adf4d258ae
better crawl status reporting.
...
allow for _ in coll names.
2013-10-30 10:00:46 -07:00
Matt Wells
54d3375a00
fixes when crawling on distributed 2x2
2013-10-25 14:54:24 -07:00
Matt Wells
e5aa795b76
reset seed dedup table when collrec reset
2013-10-23 18:12:50 -07:00
Matt Wells
c39b45ff88
fix crawl round end detection etc.
...
inc round counter even if not repeating crawl
2013-10-23 15:53:59 -07:00
Matt Wells
8f5bb4a787
a few core dump fixes. get crawl-delay
...
working a little. about half way done.
2013-10-22 15:44:10 -07:00
Matt Wells
c7d7e24f9b
show spider rounds and round starttime in
...
json output. fixed url filters display bug.
reset seeds safebuf parm when coll is reset.
2013-10-21 19:20:03 -07:00
Matt Wells
0e4d96b3f8
added "seeds" to json reply. store seed urls
...
(and dedup them) in collrec. fixed some respidering
issues. any time we re-enter url filters
then rebuild the waiting tree.
2013-10-21 17:35:14 -07:00
Matt Wells
64a1c7c2f2
more bug fixes. if spiders disabled for row
...
in url filters, don't spider the url.
2013-10-21 14:45:12 -07:00
Matt Wells
978910ca7a
fix more bugs.
2013-10-21 14:17:32 -07:00
Matt Wells
605289e130
fix a couple collection related bugs
...
causing cores in crawlbot.
2013-10-21 11:38:33 -07:00
Matt Wells
85bca4f3d1
can now delete collection while spiders are out
2013-10-18 18:11:14 -07:00
Matt Wells
889583ec4b
now we can reset collection mid stream
2013-10-18 17:49:36 -07:00
Matt Wells
ecab57ff0f
change collnum of reset collection
...
so any adds in progress will fail.
2013-10-18 15:46:00 -07:00
Matt Wells
b589b17e63
fix collection resetting.
2013-10-18 15:21:00 -07:00
Matt Wells
f5e5b0f5d3
fix crawlbot bugs
2013-10-16 12:12:22 -07:00
mwells
e7377d72ab
fix robots.txt switch. fix collection rec saving.
...
require collname explicitly for injecturl urldata.
2013-09-27 11:39:23 -06:00
Matt Wells
7fdbd0f66a
delete spider coll when deleting coll
2013-09-18 15:36:30 -07:00
Matt Wells
e50da4d012
crawlbot api fixes
2013-09-17 15:47:44 -07:00
Matt Wells
c16fe8601b
more crawlbot api fixes
2013-09-17 15:32:28 -07:00
Matt Wells
c81f700bf0
get reset collection kinda working.
2013-09-17 14:13:44 -07:00
Matt Wells
4321f02e4e
trying to get reset collection working
2013-09-17 12:21:09 -07:00
Matt Wells
4c11265a98
more updates to crawlbot api
2013-09-16 13:59:11 -07:00
Matt Wells
5dc7bd2ab4
integrate diffbot from svn back into git.
2013-09-13 09:23:18 -07:00
Matt Wells
94e6492916
removed MAX_COLL_RECS so we can have unlimited
...
collections, really limited by the sizeof(collnum_t) only now,
which is 16 bits (15 usable bits since it is signed), which is the limitation.
can always expand this so we can have more than 32k collections.
2013-08-30 16:20:38 -07:00
Matt Wells
f6e560c1f4
Initial file population.
2013-08-02 13:12:24 -07:00