mwells
2b37f56e4c
Merge branch 'diffbot-matt' into testing
2014-05-10 07:56:45 -07:00
mwells
38a79888b6
Merge branch 'diffbot-testing' into testing
2014-05-10 07:49:29 -07:00
Matt Wells
eb49094343
try to start indexing spider replies
as regular search results in the index so
you can query on those. get histograms of
spider status msgs, etc. ability to turn
that and images on/off.
2014-05-09 11:18:24 -07:00
mwells
6048ae849b
added support for spidering a particular language
with higher priority.
2014-05-09 10:03:24 -06:00
Matt Wells
066a01cba6
Merge branch 'diffbot-testing' into diffbot-matt
2014-04-28 14:15:02 -07:00
Matt Wells
e21e0a404c
fixed bug for product title extraction.
titledb-saved.dat tree loop corruption bug.
no main coll bug.
put the ajax widget on spider status page so you can
see spider going in realtime. will give customers
a good idea of the spider moving along.
more widget fixes, to use new base64 thumbs, etc.
2014-04-28 13:30:24 -07:00
Matt Wells
20a2729827
added jobCreationTimeUTC and jobCompletionTimeUTC
to json api
2014-04-25 14:12:18 -07:00
mwells
8a003e3492
fix url filters profile logic.
2014-04-09 19:51:36 -07:00
mwells
be99155986
more updates
2014-04-09 11:03:31 -07:00
Matt Wells
8aa0662a27
Merge branch 'diffbot' into testing
Conflicts:
Make.depend
PageResults.cpp
Parms.cpp
Spider.cpp
Spider.h
gb.conf
2014-03-08 09:38:44 -07:00
Matt Wells
c143ee1fba
fix core when creating a new collection because
we incremented m_numRecs but did not grow the ptr buffer.
also added support for localgb.conf so we can use that
instead of gb.conf to avoid git push/pull conflicts.
2014-03-07 09:05:14 -08:00
Matt Wells
5f3aa24805
took out restrictDomain logic. now we always
only follow links on the same domain as the seed
UNLESS a url crawl pattern or a url crawl regex
was specified.
2014-02-27 19:53:17 -08:00
Matt Wells
9d0dca71db
fix rapid coll delete bug some more.
2014-02-16 20:13:06 -08:00
Matt Wells
3b0a571cea
fix security system to actually work now
2014-02-12 00:06:00 -07:00
Matt Wells
69fa6662bc
EDOCUNCHANGED fixes for diffbot
2014-02-10 16:23:39 -08:00
Matt Wells
9f0d2ad82e
parm updates
2014-02-09 23:05:36 -07:00
Matt Wells
e60576c8eb
another code checkpoint
2014-02-08 22:57:30 -07:00
Matt Wells
313cffc322
had to add per round page and process counts
in case they had maxToCrawl and respider frequencies
set. simplified round logic in Spider.cpp.
2014-01-23 13:23:09 -08:00
Matt Wells
9432ae870d
fix bug to pass jenkins.
2014-01-23 09:38:15 -08:00
Matt Wells
26c76a3240
fixed bug of waiting trees not saving.
2014-01-23 01:04:24 -08:00
Matt Wells
45cb5c9a0c
fix bugs to try to get sharding working
on crawlbot today
2014-01-21 13:58:21 -08:00
Matt Wells
e9bbc16a9f
took out pagecount table. just hafta scan
twice i think because caching counts
gets complicated because of adding
duplicate injection requests!
2014-01-19 20:34:38 -08:00
Matt Wells
58d0c444ac
fixes for the global index quota system
2014-01-19 19:38:23 -08:00
Matt Wells
4606e88721
code cleanups.
xmldoc::injectDoc(), and it'll
add a SpiderRequest as well.
better collectiondb init code.
2014-01-18 21:19:26 -08:00
Matt Wells
9c1f6197eb
added indexbody control so i can
turn it off for my special json
global index.
2014-01-18 10:04:33 -08:00
Matt Wells
5b7170e8c6
Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
Conflicts:
Json.cpp
PageAddUrl.cpp
PageStats.cpp
Spider.cpp
2014-01-17 21:07:08 -08:00
Matt Wells
4e803210ee
tons of changes from live github on neo.
lots of core fixes.
took out ppthtml powerpoint convert, it hangs.
dynamic rdbmap to save memory per coll.
fixed disk page cache logic and brought it
back.
2014-01-17 21:01:43 -08:00
Matt Wells
501f49c81b
gui and parm updates. simplifications.
2014-01-09 17:29:18 -08:00
Matt Wells
d8554bfb0f
update default parm settings.
2014-01-09 13:22:51 -08:00
Matt Wells
d69fc065ce
just use a global diffbot api url for simplicity
to avoid having to call XmlDoc::getUrlFilterNum()
which is not good practice since url filters are for
spiderdb reads really.
2014-01-07 12:31:25 -08:00
Matt Wells
909022642d
Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
2014-01-07 12:10:59 -08:00
Matt Wells
e366c12470
Merge branch 'master' into diffbot
Conflicts:
Collectiondb.cpp
Msg13.cpp
Parms.cpp
Spider.h
2014-01-07 12:09:11 -08:00
Matt Wells
4f64677b4f
get new global preemptive cache
logic compiling, with section voting
stats.
2014-01-05 11:51:09 -08:00
Matt Wells
7df2111ceb
fixed 'gb inject titledb-DIR newhosts.conf' command
for populating an index from titledb files in DIR
and transmitting to appropriate host in newhosts.conf.
also prettied up the gb -h output to use a formatting
function.
2014-01-02 01:20:08 -07:00
Matt Wells
2f2333abd1
parmdb fixes
2013-12-17 14:53:33 -08:00
Matt Wells
33ba8070b5
more bug fixes parmdb
2013-12-17 13:09:05 -08:00
Matt Wells
3f19ece776
parmdb updates
2013-12-16 17:07:15 -08:00
Matt Wells
9b080ff89c
more parmdb bug fixes
2013-12-16 13:36:31 -08:00
Matt Wells
0615acff17
zero out url filters checkboxes on submit
2013-12-16 11:03:40 -08:00
mwells
82494baa89
move CollectionRec stuff into Collectiondb files
for simplicity.
2013-12-10 15:28:04 -08:00
mwells
f2d5661965
parmdb overhaul. support collection add/del
sync when host comes back online. use udp not tcp.
host #0 can now handle a new incoming request while
a parm change is currently outstanding.
all missed "command" parms will be received when a dead host
comes back online, too, like a tight merge for instance.
does not use msg4, uses msg3e and msg3f for syncing and
sending parms.
2013-12-10 13:09:55 -08:00
Matt Wells
78a4cfe6da
forgot to push the .h files
2013-12-07 22:12:48 -07:00
Matt Wells
62432b3530
support for &restart=1
2013-11-14 14:02:56 -08:00
Matt Wells
d0ddfb7d7d
would block when deleting or resetting
a collection when the rdb tree is saving to
disk. keeps retrying every 100ms since it
modifies the tree.
2013-10-30 13:12:46 -07:00
Matt Wells
94e6492916
removed MAX_COLL_RECS so we can have unlimited
collections, really limited by the sizeof(collnum_t) only now,
which is 16 bits (15 bits of non-negative range), which is the limitation.
can always expand this so we can have more than 32k collections.
2013-08-30 16:20:38 -07:00
Matt Wells
f6e560c1f4
Initial file population.
2013-08-02 13:12:24 -07:00