Commit Graph

46 Commits

Author SHA1 Message Date
mwells
2b37f56e4c Merge branch 'diffbot-matt' into testing 2014-05-10 07:56:45 -07:00
mwells
38a79888b6 Merge branch 'diffbot-testing' into testing 2014-05-10 07:49:29 -07:00
Matt Wells
eb49094343 try to start indexing spider replies
as regular search results in the index so
you can query on those. get histograms of
spider status msgs, etc. ability to turn
that and images on/off.
2014-05-09 11:18:24 -07:00
mwells
6048ae849b added support for spidering a particular language
with higher priority.
2014-05-09 10:03:24 -06:00
Matt Wells
066a01cba6 Merge branch 'diffbot-testing' into diffbot-matt 2014-04-28 14:15:02 -07:00
Matt Wells
e21e0a404c fixed bug for product title extraction.
titledb-saved.dat tree loop corruption bug.
no main coll bug.
put the ajax widget on spider status page so you can
see spider going in realtime. will give customers
a good idea of the spider moving along.
more widget fixes, to use new base64 thumbs, etc.
2014-04-28 13:30:24 -07:00
Matt Wells
20a2729827 added jobCreationTimeUTC and jobCompletionTimeUTC
to json api
2014-04-25 14:12:18 -07:00
mwells
8a003e3492 fix url filters profile logic. 2014-04-09 19:51:36 -07:00
mwells
be99155986 more updates 2014-04-09 11:03:31 -07:00
Matt Wells
8aa0662a27 Merge branch 'diffbot' into testing
Conflicts:
	Make.depend
	PageResults.cpp
	Parms.cpp
	Spider.cpp
	Spider.h
	gb.conf
2014-03-08 09:38:44 -07:00
Matt Wells
c143ee1fba fix core when creating a new collection because
we incremented m_numRecs but did not grow the ptr buffer.
also added support for localgb.conf so we can use that
instead of gb.conf to avoid git push/pull conflicts.
2014-03-07 09:05:14 -08:00
Matt Wells
5f3aa24805 took out restrictDomain logic. now we always
only follow links on the same domain as the seed
UNLESS a url crawl pattern or a url crawl regex
was specified.
2014-02-27 19:53:17 -08:00
Matt Wells
9d0dca71db fix rapid coll delete bug some more. 2014-02-16 20:13:06 -08:00
Matt Wells
3b0a571cea fix security system to actually work now 2014-02-12 00:06:00 -07:00
Matt Wells
69fa6662bc EDOCUNCHANGED fixes for diffbot 2014-02-10 16:23:39 -08:00
Matt Wells
9f0d2ad82e parm updates 2014-02-09 23:05:36 -07:00
Matt Wells
e60576c8eb another code checkpoint 2014-02-08 22:57:30 -07:00
Matt Wells
313cffc322 had to add per round page and process counts
in case they had maxToCrawl and respider frequencies
set. simplified round logic in Spider.cpp.
2014-01-23 13:23:09 -08:00
Matt Wells
9432ae870d fix bug to pass jenkins. 2014-01-23 09:38:15 -08:00
Matt Wells
26c76a3240 fixed bug of waiting trees not saving. 2014-01-23 01:04:24 -08:00
Matt Wells
45cb5c9a0c fix bugs to try to get sharding working
on crawlbot today
2014-01-21 13:58:21 -08:00
Matt Wells
e9bbc16a9f took out pagecount table. just hafta scan
twice i think because caching counts
gets complicated because of adding
duplicate injection requests!
2014-01-19 20:34:38 -08:00
Matt Wells
58d0c444ac fixes for the global index quota system 2014-01-19 19:38:23 -08:00
Matt Wells
4606e88721 code cleanups.
xmldoc::injectDoc(), and it'll
add a SpiderRequest as well.
better collectiondb init code.
2014-01-18 21:19:26 -08:00
Matt Wells
9c1f6197eb added indexbody control so i can
turn it off for my special json
global index.
2014-01-18 10:04:33 -08:00
Matt Wells
5b7170e8c6 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
Conflicts:
	Json.cpp
	PageAddUrl.cpp
	PageStats.cpp
	Spider.cpp
2014-01-17 21:07:08 -08:00
Matt Wells
4e803210ee tons of changes from live github on neo.
lots of core fixes.
took out ppthtml powerpoint convert, it hangs.
dynamic rdbmap to save memory per coll.
fixed disk page cache logic and brought it
back.
2014-01-17 21:01:43 -08:00
Matt Wells
501f49c81b gui and parm updates. simplifications. 2014-01-09 17:29:18 -08:00
Matt Wells
d8554bfb0f update default parm settings. 2014-01-09 13:22:51 -08:00
Matt Wells
d69fc065ce just use a global diffbot api url for simplicity
to avoid having to call XmlDoc::getUrlFilterNum()
which is not good practice since url filters are for
spiderdb reads really.
2014-01-07 12:31:25 -08:00
Matt Wells
909022642d Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot 2014-01-07 12:10:59 -08:00
Matt Wells
e366c12470 Merge branch 'master' into diffbot
Conflicts:
	Collectiondb.cpp
	Msg13.cpp
	Parms.cpp
	Spider.h
2014-01-07 12:09:11 -08:00
Matt Wells
4f64677b4f get new global preemptive cache
logic compiling, with section voting
stats.
2014-01-05 11:51:09 -08:00
Matt Wells
7df2111ceb fixed 'gb inject titledb-DIR newhosts.conf' command
for populating an index from titledb files in DIR
and transmitting to appropriate host in newhosts.conf.
also prettied up the gb -h output to use a formatting
function.
2014-01-02 01:20:08 -07:00
Matt Wells
2f2333abd1 parmdb fixes 2013-12-17 14:53:33 -08:00
Matt Wells
33ba8070b5 more bug fixes parmdb 2013-12-17 13:09:05 -08:00
Matt Wells
3f19ece776 parmdb updates 2013-12-16 17:07:15 -08:00
Matt Wells
9b080ff89c more parmdb bug fixes 2013-12-16 13:36:31 -08:00
Matt Wells
0615acff17 zero out url filters checkboxes on submit 2013-12-16 11:03:40 -08:00
mwells
82494baa89 move CollectionRec stuff into Collectiondb files
for simplicity.
2013-12-10 15:28:04 -08:00
mwells
f2d5661965 parmdb overhaul. support collection add/del
sync when host comes back online. use udp not tcp.
host #0 can now handle a new incoming request while
a parm change is currently outstanding.
all missed "command" parms will be received when a dead host
comes back online, too, like a tight merge for instance.
does not use msg4, uses msg3e and msg3f for syncing and
sending parms.
2013-12-10 13:09:55 -08:00
Matt Wells
78a4cfe6da forgot to push the .h files 2013-12-07 22:12:48 -07:00
Matt Wells
62432b3530 support for &restart=1 2013-11-14 14:02:56 -08:00
Matt Wells
d0ddfb7d7d would block when deleting or resetting
a collection when the rdb tree is saving to
disk. keeps retrying every 100ms since it
modifies the tree.
2013-10-30 13:12:46 -07:00
Matt Wells
94e6492916 removed MAX_COLL_RECS so we can have unlimited
collections, really limited only by sizeof(collnum_t) now,
which is 16 bits (15 bits unsigned).
can always expand this so we can have more than 32k collections.
2013-08-30 16:20:38 -07:00
Matt Wells
f6e560c1f4 Initial file population. 2013-08-02 13:12:24 -07:00