Commit Graph

106 Commits

Author SHA1 Message Date
Matt
744cd54131 Merge branch 'ia' into ia-zak 2015-08-31 09:14:27 -06:00
Matt
d9422d8b0e get rid of limits on file sizes. dynamically allocate
file names and fixed-size File array in BigFile class. should
save gigabytes of memory in many-collection systems with
1+ million files or so.
2015-08-14 20:14:50 -06:00
Matt
a1ed368d82 bring back max mem control into master controls.
it's useful to limit per process mem usage to prevent
oom killer because we can't save if we get killed.
overhaul diskpagecache to just use rdbcache. much simpler
and faster, but disabled for now until debugged more.
reduce min files to merge for crawlbot collections so
they stay more tightly merged to conserve fds and mem.
improved logDebugDisk msgs.
overhauled File.cpp fd pool. now it is way faster and
doesn't use any extra mem. much simpler too. it
could be sped up a little by using a linked list, but that
is probably not significant enough to warrant doing right now.
increase mem ptr table from 3M to 8M slots. should really make
dynamic though. fix core from null msg20s[0]->m_r.
only call attemptMergeAll once every 60 seconds really.
do not attempt merge if already merging.
2015-08-14 12:58:54 -06:00
Matt
e9f86f362e Merge branch 'ia' into ia-zak 2015-07-22 12:02:19 -06:00
Matt
16fd428887 fix more cores from the dynamic query size changes.
add how many query terms we truncated in the json/xml replies.
document those fields as well.
2015-07-18 14:15:47 -06:00
Zak Betz
a697b3d5a5 Fix Bad File Descriptor loop bug when downloading a static file on a
slow disk.
2015-07-14 17:00:09 -06:00
Matt
599b33524f wget cookie support 2015-05-02 21:52:58 -07:00
Matt Wells
2421bf3d1d ia checkpoint 2015-05-02 23:51:19 +00:00
Matt
d3c071e4c0 fix gbiaitem page 2015-04-30 21:27:11 -07:00
Matt
9370c8f52e more fixes 2015-04-28 23:20:16 -07:00
Matt
faf2c06d29 some fixes for indexing warcs/arcs. 2015-04-28 22:30:58 -07:00
Matt
0eb415d408 added preliminary support for spidering .warc.gz and .arc.gz files 2015-04-27 21:41:22 -06:00
Matt
ccb53eb4e7 use http://127.0.0.1:8000/iagbcoll/<itemname> as a url whose
content will be the arc/warc files as urls.
2015-04-25 17:50:22 -06:00
Matt Wells
a2feab9a4a tap in some fixes for running the newly updated smokes
for dealing with the new urls.csv format
2015-04-21 15:20:57 -07:00
Matt
ef42a9cf28 new urls.csv polish. moved columns around. added
some new gbss fields, like spidered time.
2015-04-15 17:42:56 -06:00
Matt
43ced700d0 calls NEWS BLOG 2015-04-12 12:33:09 -06:00
Matt
95e3a760e9 proxy fixes 2015-03-05 11:10:40 -08:00
Matt Wells
b80a70a6fd fix for https urls through proxies
using newly updated tcp/loop code.
2015-02-21 09:25:54 -08:00
Matt
2488c1a338 added proper write callback registration into
TcpServer.cpp so we only register write callbacks
when a non-blocking write does not write all the
bytes requested of it, or when a connection does not
complete. also fixed up the sslHandshake() function
which calls SSL_connect().
2015-02-16 14:48:39 -07:00
Matt
c15bd53e52 added support for supplying basic proxy authorization
to spider proxies. username:password@1.2.3.4:80
2015-02-02 13:23:38 -08:00
mwells
87285ba3cd use gbmemcpy not memcpy so we can get profiler working again
since memcpy can't be interrupted and backtrace() called.
2015-01-13 12:25:42 -07:00
Matt
adcef39376 Merge branch 'diffbot-testing' into diffbot-matt
Conflicts:
	Collectiondb.cpp
	Collectiondb.h
	Conf.cpp
	Conf.h
	Msg39.cpp
	PageEvents.cpp
	PageResults.cpp
	PageTurk.cpp
	Pages.cpp
	Parms.cpp
	Posdb.cpp
	Proxy.cpp
	Query.cpp
	Query.h
	RdbBase.cpp
	RdbMap.cpp
	Repair.cpp
	Repair.h
	SafeBuf.cpp
	Spider.cpp
	Tagdb.cpp
	TopTree.cpp
	XmlDoc.cpp
	main.cpp
2014-11-20 16:53:07 -08:00
Matt
4e8a42e024 text replacements for bad int32_t substitutions 2014-11-17 18:24:38 -08:00
Matt
931a1c4bc6 good checkpoint. quite a few fixes. 2014-11-17 18:13:36 -08:00
Matt
4c19453ea9 working with -m32 for basic testing.
compiles for 64-bit.
2014-11-12 11:38:37 -08:00
Matt
96b8197ad3 now it compiles with -m32 2014-11-10 14:45:11 -08:00
Matt Wells
95f6dcf4f7 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
Conflicts:
	HttpServer.cpp
2014-11-01 06:18:20 -07:00
Matt Wells
45972d9837 disregard CONNECT requests for now 2014-11-01 06:17:36 -07:00
Matt Wells
e7dd8f7956 replace long long with int64_t 2014-10-30 13:36:39 -06:00
Matt Wells
033a8b80a0 fix core if json item has column not in table
when dumping json items as csv.
2014-10-10 07:00:11 -07:00
Matt Wells
8bb3545b71 emergency fixes for the out-of-sockets core and for
the bug where a get proxy request timing out caused the spider to hang.
2014-10-09 07:20:04 -07:00
Mike Tung
837974e0ec Valid JSON output for showinput=1. 2014-10-05 21:15:08 -07:00
Matt Wells
7df7fbe721 support the CONNECT for gb squid proxy 2014-10-02 12:36:43 -07:00
mwells
4e7152b487 fix more bugs in squid proxy implementation.
force squid proxy stack to use floaters.
2014-10-02 11:54:50 -07:00
mwells
42b891219d several fixes for floater proxy through squid proxy.
gb needs to act like squid for the rendering machines so
it can do crawl delay backoff and load balancing over the
floaters.
2014-10-02 02:08:38 -07:00
mwells
6de7a3f6b3 get advanced search working again 2014-09-27 11:12:47 -07:00
mwells
783ae1d4e7 print chrome on other pages 2014-09-23 20:59:48 -07:00
mwells
5b69f03b59 more updates 2014-09-20 16:41:14 -07:00
Matt Wells
6b6583fc0a update gui 2014-09-20 11:01:22 -07:00
mwells
65e533bbb7 website updates 2014-09-01 17:23:15 -07:00
mwells
58d8861a34 widget page updates 2014-09-01 17:04:08 -07:00
mwells
25b79684c5 website gui fixes 2014-09-01 13:31:13 -07:00
mwells
1bc5fecb33 website updates 2014-08-31 11:11:12 -07:00
mwells
ef8cb47590 website updates. 2014-08-31 10:51:37 -07:00
mwells
754d5b4755 rename admin.html to faq.html etc. file juggling. 2014-08-31 09:51:21 -07:00
mwells
947be58f10 Merge branch 'diffbot-testing' into testing
Conflicts:
	HttpRequest.cpp
	Msg13.cpp
	XmlDoc.cpp
2014-08-05 17:19:53 -07:00
mwells
cc1ceaaac2 fix nyt.com cookie redir bug.
fixed bug when POSTing injection request with multipart/form-data.
2014-08-05 17:04:11 -07:00
mwells
13743acd5a gui updates 2014-08-03 10:42:45 -07:00
mwells
3cc54b72cc qa updates 2014-07-28 19:15:31 -07:00
mwells
d5805733e5 more api updates 2014-07-13 09:35:44 -07:00