Commit Graph

31 Commits

Author SHA1 Message Date
Matt Wells
2137e150e7 Merge branch 'testing' into diffbot-matt
Conflicts:
	Collectiondb.cpp
	Make.depend
	Parms.cpp
2014-06-27 17:17:14 -07:00
mwells
7506d66d4a fixes for page inject 2014-06-15 08:26:27 -07:00
mwells
5c0b371dc9 Merge branch 'testing' into diffbot-matt
Conflicts:
	Collectiondb.cpp
	HttpServer.cpp
	Make.depend
	Parms.cpp
	Parms.h
2014-06-13 11:00:09 -07:00
mwells
ea90e7f755 more fixes for sectiondb markup code 2014-06-12 13:05:45 -07:00
mwells
108c281c33 fix annoying bug when adding new parms. 2014-06-10 12:29:50 -07:00
Matt Wells
8aa0662a27 Merge branch 'diffbot' into testing
Conflicts:
	Make.depend
	PageResults.cpp
	Parms.cpp
	Spider.cpp
	Spider.h
	gb.conf
2014-03-08 09:38:44 -07:00
Matt Wells
953b7c558d parm updates 2014-02-10 21:45:03 -07:00
Matt Wells
c9ef525338 code checkpoint 2014-02-09 12:55:45 -07:00
Matt Wells
8d534b8ed8 many more fixes for streaming mode 2014-02-06 18:21:22 -08:00
Matt Wells
6af9441818 change deduping logic to be first come first
served, but site rank trumps. fixed bug from
fix before.
2014-01-29 16:14:42 -08:00
Matt Wells
313cffc322 had to add per round page and process counts
in case they had maxToCrawl and respider frequencies
set. simplified round logic in Spider.cpp.
2014-01-23 13:23:09 -08:00
Matt Wells
58d0c444ac fixes for the global index quota system 2014-01-19 19:38:23 -08:00
Matt Wells
3ec44c5b35 fix streaming mode for sending back json
downloads/dumps.
2014-01-17 18:28:17 -08:00
Matt Wells
4b27b22949 got rebalancing working right 2014-01-15 17:40:17 -08:00
Matt Wells
8a49e87a61 got code with shard rebalancing compiling.
now we store a "sharded by termid" bit in the posdb
key for checksum and similar keys that are not sharded
by docid. saves having to do disk seeks on every
host in the cluster to do a dup check, etc.
2014-01-11 16:08:42 -08:00
mwells
76bb3d05e1 clean up logging so i can see what's going on 2013-12-10 16:41:30 -08:00
mwells
82494baa89 move CollectionRec stuff into Collectiondb files
for simplicity.
2013-12-10 15:28:04 -08:00
Matt Wells
e0a15194e1 fix json double decoding issue. no more
partial decodes, json parser stores
fully decoded string into separate buf.
2013-11-22 14:16:14 -08:00
Matt Wells
b589b17e63 fix collection resetting. 2013-10-18 15:21:00 -07:00
Matt Wells
a288217e9f a few bug fixes 2013-10-17 18:59:00 -07:00
Matt Wells
9d6c3626d8 json indexing/hashing updates. 2013-10-16 15:41:12 -07:00
mwells
43e4c939eb Merge branch 'master' into diffbot
Conflicts:
	Make.depend
2013-10-02 13:15:07 -06:00
Matt Wells
c911a606c9 renamed matches.h and matches.cpp to
matches2.h and matches2.cpp to avoid potential
confusion with Matches.h and Matches.cpp files.
2013-10-01 07:58:24 -07:00
mwells
d11e9520bd couple fixes to makefile etc. 2013-09-28 16:37:39 -06:00
mwells
fd081478de fix crawlbot to work on a distributed network
as far as adding/deleting/resetting colls
and updating parms. ideally we'd have a Colldb
Rdb where each key was a parm. that would make
syncing easier if a host went down, then it would
get the negative/positive colldb parm keys later.
so it could sync up on all your operations as long
as all your operations are expressed in terms of adding
and deleting database key/value pairs.
2013-09-26 22:41:05 -06:00
Matt Wells
f90d20f4dd diffbot api integration updates 2013-09-18 15:07:47 -07:00
Matt Wells
f974d6a47b fixes for crawlbot universal api. 2013-09-16 10:49:37 -07:00
mwells
e152205765 make depend update 2013-09-09 02:37:47 -06:00
mwells
ca2a024d04 fixed up thread/spider log msgs.
fixed core from calling fprintf in
alarm signal missed quickpoll handler.
2013-08-29 21:15:42 -06:00
mwells
4f4047a3ad new Make.depend. 2013-08-09 17:13:45 -06:00
Matt Wells
f6e560c1f4 Initial file population. 2013-08-02 13:12:24 -07:00
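
Note on commit 6af9441818: the message describes the deduping rule as "first come first served, but site rank trumps". The log does not show the actual code, so the following is only a minimal sketch of one way to read that rule. The `Doc` type, its fields, and the `dedup` function are hypothetical names chosen for illustration and are not taken from the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    # Hypothetical record; field names are assumptions, not the project's.
    url: str
    content_hash: int
    site_rank: int


def dedup(docs):
    """Keep one document per content hash.

    The first document seen for a hash wins (first come, first served),
    unless a later duplicate has a strictly higher site rank, in which
    case the later one replaces it (site rank trumps).
    """
    winners = {}
    for doc in docs:
        current = winners.get(doc.content_hash)
        if current is None or doc.site_rank > current.site_rank:
            winners[doc.content_hash] = doc
    # dicts preserve the order in which each hash was first inserted
    return list(winners.values())


if __name__ == "__main__":
    docs = [
        Doc("a", 1, 3),
        Doc("b", 1, 3),  # same rank as "a": the earlier one stays
        Doc("c", 1, 5),  # higher rank: replaces "a"
        Doc("d", 2, 1),
    ]
    print([d.url for d in dedup(docs)])
```

Under this reading, ties on site rank are resolved in favor of the earlier document, so arrival order only matters when ranks are equal.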