mwells
|
bef076917d
|
use -g for debug mode not -d, that's working dir.
|
2014-04-10 00:36:00 -07:00 |
|
mwells
|
02304073d4
|
doc updates. core fixes.
|
2014-04-10 00:31:41 -07:00 |
|
mwells
|
6675facc4f
|
removed coll.conf
|
2014-04-09 20:16:41 -07:00 |
|
mwells
|
539a1d188e
|
remove coll.main.0/coll.conf
|
2014-04-09 20:13:49 -07:00 |
|
mwells
|
f55d4d1230
|
merge diffbot-testing
|
2014-04-09 20:10:30 -07:00 |
|
mwells
|
1ea6c597be
|
Merge branch 'diffbot-matt' into diffbot-testing
Conflicts:
html/admin.html
|
2014-04-09 20:04:46 -07:00 |
|
mwells
|
8a003e3492
|
fix url filters profile logic.
|
2014-04-09 19:51:36 -07:00 |
|
mwells
|
2adf5b9bc5
|
more awesome fixes
|
2014-04-09 13:31:11 -07:00 |
|
mwells
|
72dc660598
|
Merge branch 'testing' into diffbot-matt
Conflicts:
Collectiondb.cpp
HttpRequest.h
PageBasic.cpp
coll.main.0/coll.conf
|
2014-04-09 11:18:39 -07:00 |
|
mwells
|
be99155986
|
more updates
|
2014-04-09 11:03:31 -07:00 |
|
mwells
|
9e1199f113
|
hack about 35%ish done
|
2014-04-08 19:34:43 -07:00 |
|
mwells
|
41284bcf4f
|
add diffbot support to admin doc
|
2014-04-07 14:24:52 -07:00 |
|
mwells
|
b3fcfb1ab0
|
updated admin.html
|
2014-04-06 21:19:39 -07:00 |
|
mwells
|
1b5c6a6278
|
create hosts.conf into cwd if not there.
pretty up logging system.
update admin.html
|
2014-04-06 21:12:52 -07:00 |
|
mwells
|
5ee79a4c2f
|
daemonize on ./gb 0 etc.
|
2014-04-06 15:57:38 -07:00 |
|
Matt Wells
|
9b359aa876
|
Merge branch 'master' into diffbot-testing
|
2014-04-06 14:41:03 -07:00 |
|
Matt Wells
|
f2a23f7dd3
|
Merge branch 'master' into diffbot-testing
|
2014-04-06 14:39:48 -07:00 |
|
mwells
|
c20c30c53f
|
Merge branch 'testing' of github.com:gigablast/open-source-search-engine into testing
|
2014-04-06 14:03:13 -07:00 |
|
mwells
|
23e5a94ddf
|
move log file in the binary itself now.
|
2014-04-06 14:02:51 -07:00 |
|
mwells
|
4e6db38517
|
quick start doc update
|
2014-04-05 20:33:47 -07:00 |
|
mwells
|
fa7216f978
|
Merge branch 'testing'
|
2014-04-05 19:25:35 -07:00 |
|
mwells
|
5ff88fafbc
|
spider status updates
|
2014-04-05 18:52:40 -07:00 |
|
mwells
|
264f27b826
|
fix url filters to have !insitelist directive
|
2014-04-05 18:40:39 -07:00 |
|
mwells
|
b0dbf833a7
|
fix sitelist update logic.
|
2014-04-05 18:26:00 -07:00 |
|
mwells
|
ac5cf7971b
|
more misc updates.
|
2014-04-05 18:09:04 -07:00 |
|
mwells
|
bd82145626
|
Merge branch 'diffbot-testing' into testing
|
2014-04-05 12:34:46 -07:00 |
|
mwells
|
89f5c8c059
|
Merge branch 'diffbot-matt' into diffbot-testing
|
2014-04-05 11:34:27 -07:00 |
|
mwells
|
61b4ec4ca6
|
added some qa testing logic. qa.cpp.
|
2014-04-05 11:33:42 -07:00 |
|
Daniel Steinberg
|
0988a134d0
|
Merge remote-tracking branch 'origin/diffbot' into diffbot-dan
|
2014-04-01 19:48:24 -07:00 |
|
Daniel Steinberg
|
4856cc4c60
|
||, not &&
|
2014-04-01 10:45:54 -07:00 |
|
Daniel Steinberg
|
3e38bd169e
|
and return an error
|
2014-04-01 10:43:17 -07:00 |
|
Daniel Steinberg
|
94b169b8dc
|
only delete if there were no io errors
|
2014-04-01 10:42:12 -07:00 |
|
Daniel Steinberg
|
6568858e81
|
implement something that works like mv, which tries rename first, and if that fails copies the bytes. rename doesn't work across devices
|
2014-03-31 20:44:39 -07:00 |
|
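The mv-like behaviour described in the commit above (try rename first, copy the bytes if that fails, delete the source only when there were no I/O errors) can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: `moveFile` and `copyThenDelete` are hypothetical names, and falling back only when `errno` is `EXDEV` is an assumption that relies on POSIX `rename` semantics.

```cpp
#include <cerrno>
#include <cstdio>
#include <fstream>
#include <string>

// Hypothetical helper: copy src to dst byte for byte, then delete src.
// The source is removed only if the copy finished without I/O errors.
bool copyThenDelete(const std::string& src, const std::string& dst) {
    std::ifstream in(src, std::ios::binary);
    if (!in) return false;
    std::ofstream out(dst, std::ios::binary | std::ios::trunc);
    if (!out) return false;

    char buf[65536];
    while (in) {
        in.read(buf, sizeof buf);
        std::streamsize n = in.gcount();
        if (n > 0) out.write(buf, n);
        if (!out) break;  // write error: stop copying
    }
    out.close();

    // Success means we hit end-of-file cleanly and every write succeeded.
    bool ok = in.eof() && !in.bad() && !out.fail();
    if (!ok) {
        std::remove(dst.c_str());  // discard the partial copy, keep the source
        return false;
    }
    return std::remove(src.c_str()) == 0;
}

// Hypothetical helper: move src to dst the way mv does.
bool moveFile(const std::string& src, const std::string& dst) {
    // Fast path: rename works when both paths are on the same filesystem.
    if (std::rename(src.c_str(), dst.c_str()) == 0) return true;
    // Assumption: on POSIX, a cross-device rename fails with EXDEV.
    if (errno != EXDEV) return false;
    return copyThenDelete(src, dst);
}
```

Returning `false` and leaving the source in place on any error keeps the caller in control of retries, which matches the follow-up commits about returning an error and deleting only after a clean copy.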
Matt Wells
|
d6434191d1
|
nomenclature changes to reduce collisions.
name collection 'qatest123' for doing smoke tests,
not 'test'.
|
2014-03-31 15:02:17 -07:00 |
|
Matt Wells
|
9c8410767d
|
fix critical title alloc/free bug
in title.cpp.
|
2014-03-28 08:01:01 -07:00 |
|
Matt Wells
|
c1671015c8
|
Merge branch 'diffbot-dan' into diffbot-testing
|
2014-03-27 12:19:50 -07:00 |
|
Matt Wells
|
582349334f
|
do not use certain other json fields
when computing checksum for deduping.
like stats, querystring, ...
|
2014-03-27 12:20:53 -07:00 |
|
Matt Wells
|
402377d2e6
|
fix bug of gbmin, gbmax etc. not working.
floats were being rounded down to ints
in most cases it seems. so .9 -> 0 etc.
|
2014-03-26 11:56:06 -07:00 |
|
Daniel Steinberg
|
d67f09feeb
|
also include a timestamp field with an RFC 1123 formatted date
|
2014-03-25 21:45:21 -07:00 |
|
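For reference, an RFC 1123 date such as the timestamp field mentioned in the commit above can be produced with the standard C time functions. The sketch below is illustrative only: `formatRfc1123` is a hypothetical name, not a function from the repository, and it assumes the default "C" locale so that day and month names come out in English.

```cpp
#include <ctime>
#include <string>

// Hypothetical helper: format a Unix timestamp as an RFC 1123 date in GMT,
// e.g. "Thu, 01 Jan 1970 00:00:00 GMT" for t == 0.
std::string formatRfc1123(std::time_t t) {
    // std::gmtime returns a pointer to shared static storage,
    // so this sketch is not thread-safe.
    std::tm* utc = std::gmtime(&t);
    if (!utc) return std::string();

    char buf[64];
    std::size_t n = std::strftime(buf, sizeof buf,
                                  "%a, %d %b %Y %H:%M:%S GMT", utc);
    return std::string(buf, n);
}
```

A caller would typically pass `std::time(nullptr)` to stamp the current time.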
Daniel Steinberg
|
0efac8c156
|
Defect #2080: seed URLs duplicated
|
2014-03-25 17:25:55 -07:00 |
|
Daniel Steinberg
|
e1b1b15a38
|
bigger buffer
|
2014-03-25 16:34:40 -07:00 |
|
Daniel Steinberg
|
9846061dff
|
when restarting a bulk job, copy bulkurls.txt to /tmp, and then transfer it back to the new collection folder
|
2014-03-25 16:20:24 -07:00 |
|
Daniel Steinberg
|
ab90c06d8d
|
add TODO for regex checking
|
2014-03-25 13:05:43 -07:00 |
|
Daniel Steinberg
|
1ff6c1fae0
|
Merge remote-tracking branch 'origin/diffbot' into diffbot-dan
|
2014-03-25 12:53:37 -07:00 |
|
Daniel Steinberg
|
b8836745f0
|
use SpiderRequest instead of isonsamedomain flag to determine whether to output data in CSV (Defect #2122)
|
2014-03-25 12:51:08 -07:00 |
|
mwells
|
b6e5424e32
|
do not download bulkjob urls in crawlbot.
just return a fake http reply.
however, do use crawl-delay throttling
logic. deduping is already turned off for
bulk jobs so it should be ok.
|
2014-03-21 12:40:38 -07:00 |
|
mwells
|
502752aba4
|
doc updates
|
2014-03-21 08:59:13 -07:00 |
|
Matt Wells
|
b33121af7d
|
make all field names lower case without
spaces when we hash them to make the
prefixhash. since json names often have
mixed case field names and spaces.
|
2014-03-20 16:08:02 -07:00 |
|
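The normalization described in the commit above (lower-case the field name and strip spaces before hashing) can be sketched as below. Everything here is an assumption made for illustration: `normalizeFieldName` and `prefixHash` are hypothetical names, spaces are dropped rather than replaced, and FNV-1a stands in for whatever hash function the repository actually uses.

```cpp
#include <cctype>
#include <cstdint>
#include <string>

// Hypothetical helper: lower-case a field name and drop whitespace,
// so that "Page Title" and "pagetitle" normalize to the same string.
std::string normalizeFieldName(const std::string& name) {
    std::string out;
    out.reserve(name.size());
    for (unsigned char c : name) {
        if (std::isspace(c)) continue;  // assumption: spaces are removed
        out.push_back(static_cast<char>(std::tolower(c)));
    }
    return out;
}

// Stand-in hash (64-bit FNV-1a) over the normalized name.
// The real prefix hash in the codebase may be computed differently.
std::uint64_t prefixHash(const std::string& fieldName) {
    std::uint64_t h = 14695981039346656037ULL;  // FNV offset basis
    for (unsigned char c : normalizeFieldName(fieldName)) {
        h ^= c;
        h *= 1099511628211ULL;  // FNV prime
    }
    return h;
}
```

With this scheme, JSON keys that differ only in case or spacing map to the same prefix hash, which is the effect the commit message describes.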
Matt Wells
|
98a10d4936
|
Merge branch 'testing' into diffbot-testing
|
2014-03-20 15:50:49 -07:00 |
|
Matt Wells
|
bbc8fc0c79
|
always show admin link
|
2014-03-20 15:48:51 -07:00 |
|