Commit Graph

1315 Commits

Author SHA1 Message Date
mwells
bef076917d use -g for debug mode not -d, that's working dir. 2014-04-10 00:36:00 -07:00
mwells
02304073d4 doc updates. core fixes. 2014-04-10 00:31:41 -07:00
mwells
6675facc4f removed coll.conf 2014-04-09 20:16:41 -07:00
mwells
539a1d188e remove coll.main.0/coll.conf 2014-04-09 20:13:49 -07:00
mwells
f55d4d1230 merge diffbot-testing 2014-04-09 20:10:30 -07:00
mwells
1ea6c597be Merge branch 'diffbot-matt' into diffbot-testing
Conflicts:
	html/admin.html
2014-04-09 20:04:46 -07:00
mwells
8a003e3492 fix url filters profile logic. 2014-04-09 19:51:36 -07:00
mwells
2adf5b9bc5 more awesome fixes 2014-04-09 13:31:11 -07:00
mwells
72dc660598 Merge branch 'testing' into diffbot-matt
Conflicts:
	Collectiondb.cpp
	HttpRequest.h
	PageBasic.cpp
	coll.main.0/coll.conf
2014-04-09 11:18:39 -07:00
mwells
be99155986 more updates 2014-04-09 11:03:31 -07:00
mwells
9e1199f113 hack about 35%ish done 2014-04-08 19:34:43 -07:00
mwells
41284bcf4f add diffbot support to admin doc 2014-04-07 14:24:52 -07:00
mwells
b3fcfb1ab0 updated admin.html 2014-04-06 21:19:39 -07:00
mwells
1b5c6a6278 create hosts.conf into cwd if not there.
pretty up logging system.
update admin.html
2014-04-06 21:12:52 -07:00
mwells
5ee79a4c2f daemonize on ./gb 0 etc. 2014-04-06 15:57:38 -07:00
Matt Wells
9b359aa876 Merge branch 'master' into diffbot-testing 2014-04-06 14:41:03 -07:00
Matt Wells
f2a23f7dd3 Merge branch 'master' into diffbot-testing 2014-04-06 14:39:48 -07:00
mwells
c20c30c53f Merge branch 'testing' of github.com:gigablast/open-source-search-engine into testing 2014-04-06 14:03:13 -07:00
mwells
23e5a94ddf move log file in the binary itself now. 2014-04-06 14:02:51 -07:00
mwells
4e6db38517 quick start doc update 2014-04-05 20:33:47 -07:00
mwells
fa7216f978 Merge branch 'testing' 2014-04-05 19:25:35 -07:00
mwells
5ff88fafbc spider status updates 2014-04-05 18:52:40 -07:00
mwells
264f27b826 fix url filters to have !insitelist directive 2014-04-05 18:40:39 -07:00
mwells
b0dbf833a7 fix sitelist update logic. 2014-04-05 18:26:00 -07:00
mwells
ac5cf7971b more misc updates. 2014-04-05 18:09:04 -07:00
mwells
bd82145626 Merge branch 'diffbot-testing' into testing 2014-04-05 12:34:46 -07:00
mwells
89f5c8c059 Merge branch 'diffbot-matt' into diffbot-testing 2014-04-05 11:34:27 -07:00
mwells
61b4ec4ca6 added some qa testing logic. qa.cpp. 2014-04-05 11:33:42 -07:00
Daniel Steinberg
0988a134d0 Merge remote-tracking branch 'origin/diffbot' into diffbot-dan 2014-04-01 19:48:24 -07:00
Daniel Steinberg
4856cc4c60 ||, not && 2014-04-01 10:45:54 -07:00
Daniel Steinberg
3e38bd169e and return an error 2014-04-01 10:43:17 -07:00
Daniel Steinberg
94b169b8dc only delete if there were no io errors 2014-04-01 10:42:12 -07:00
Daniel Steinberg
6568858e81 implement something that works like mv, which tries rename first, and if that fails copies the bytes. rename doesn't work across devices 2014-03-31 20:44:39 -07:00
Matt Wells
d6434191d1 nomenclature changes to reduce collisions.
name collection 'qatest123' for doing smoke tests,
not 'test'.
2014-03-31 15:02:17 -07:00
Matt Wells
9c8410767d fix critical title alloc/free bug
in title.cpp.
2014-03-28 08:01:01 -07:00
Matt Wells
c1671015c8 Merge branch 'diffbot-dan' into diffbot-testing 2014-03-27 12:19:50 -07:00
Matt Wells
582349334f do not use certain other json fields
when computing checksum for deduping.
like stats, querystring, ...
2014-03-27 12:20:53 -07:00
Matt Wells
402377d2e6 fix bug of gbmin, gbmax etc. not working.
floats were being rounded down to ints
in most cases it seems. so .9 -> 0 etc.
2014-03-26 11:56:06 -07:00
Daniel Steinberg
d67f09feeb also include a timestamp field with an RFC 1123 formatted date 2014-03-25 21:45:21 -07:00
Daniel Steinberg
0efac8c156 Defect #2080: seed URLs duplicated 2014-03-25 17:25:55 -07:00
Daniel Steinberg
e1b1b15a38 bigger buffer 2014-03-25 16:34:40 -07:00
Daniel Steinberg
9846061dff when restarting a bulk job, copy bulkurls.txt to /tmp, and then transfer it back to the new collection folder 2014-03-25 16:20:24 -07:00
Daniel Steinberg
ab90c06d8d add TODO for regex checking 2014-03-25 13:05:43 -07:00
Daniel Steinberg
1ff6c1fae0 Merge remote-tracking branch 'origin/diffbot' into diffbot-dan 2014-03-25 12:53:37 -07:00
Daniel Steinberg
b8836745f0 use SpiderRequest instead of isonsamedomain flag to determine whether to output data in CSV (Defect #2122) 2014-03-25 12:51:08 -07:00
mwells
b6e5424e32 do not download bulkjob urls in crawlbot.
just return a fake http reply.
however, do use crawl-delay throttling
logic. deduping is already turned off for
bulk jobs so it should be ok.
2014-03-21 12:40:38 -07:00
mwells
502752aba4 doc updates 2014-03-21 08:59:13 -07:00
Matt Wells
b33121af7d make all field names lower case without
spaces when we hash them to make the
prefixhash. since json names often have
mixed case field names and spaces.
2014-03-20 16:08:02 -07:00
Matt Wells
98a10d4936 Merge branch 'testing' into diffbot-testing 2014-03-20 15:50:49 -07:00
Matt Wells
bbc8fc0c79 always show admin link 2014-03-20 15:48:51 -07:00