13 Commits

| Author | SHA1 | Message | Date |
|---|---|---|---|
| Zak Betz | 16b6e44bd1 | Show utf8 url in page results. | 2015-09-21 16:44:40 -06:00 |
| Zak Betz | 83190e3bbc | Make punycoded urls printable. | 2015-09-21 09:17:40 -06:00 |
| Zak Betz | 519b2c4f42 | Fix repeating xn--xn-- when there are spaces in the domain. Make gb unittest take a name of the unit test to run. | 2015-09-14 10:24:22 -06:00 |
| Zak Betz | 5622ca47ee | Work on non-ascii domain names. It works on correct inputs, but will crash on some non correct inputs, so it is forced to be disabled. | 2015-09-14 00:34:44 -06:00 |
| Matt | 0ca27638bc | checkpoint. moved warc and arc looping into xmldoc. now will any container doc from pageinject into xmldoc. simplifies pageinject.cpp a lot. and sets up a framework for dealing with container docs. | 2015-05-01 19:11:13 -07:00 |
| Matt | 184b157365 | Merge branch 'diffbot-testing' into ia | 2015-04-29 21:43:00 -07:00 |
| Matt | 09a79d230c | check for .css?* better as media extensions. do it when adding outlinks in xmldoc.cpp. | 2015-04-28 14:42:04 -07:00 |
| Matt | 0eb415d408 | added preliminary support for spidering .warc.gz and .arc.gz files | 2015-04-27 21:41:22 -06:00 |
| Matt Wells | 644ad28912 | debugging the hopcount bug | 2015-04-19 15:51:29 -06:00 |
| Matt | f4ca6d8cd4 | try ddomain only urls with www. when looking up in sitelinks.txt | 2015-01-31 15:33:37 -07:00 |
| Matt | 96b8197ad3 | now it compiles with -m32 | 2014-11-10 14:45:11 -08:00 |
| Matt Wells | e7dd8f7956 | replace long long with int64_t | 2014-10-30 13:36:39 -06:00 |
| Matt Wells | f6e560c1f4 | Initial file population. | 2013-08-02 13:12:24 -07:00 |