Zak Betz
|
16b6e44bd1
|
Show utf8 url in page results.
|
2015-09-21 16:44:40 -06:00 |
|
Zak Betz
|
83190e3bbc
|
Make punycoded urls printable.
|
2015-09-21 09:17:40 -06:00 |
|
Zak Betz
|
519b2c4f42
|
Fix repeating xn--xn-- when there are spaces in the domain.
Make gb unittest take a name of the unit test to run.
|
2015-09-14 10:24:22 -06:00 |
|
Zak Betz
|
5622ca47ee
|
Work on non-ASCII domain names. It works on correct inputs, but
crashes on some incorrect inputs, so it is forcibly disabled.
|
2015-09-14 00:34:44 -06:00 |
|
Matt
|
0ca27638bc
|
checkpoint. moved warc and arc looping into xmldoc.
now will pass any container doc from pageinject into
xmldoc. simplifies pageinject.cpp a lot. and sets up
a framework for dealing with container docs.
|
2015-05-01 19:11:13 -07:00 |
|
Matt
|
184b157365
|
Merge branch 'diffbot-testing' into ia
|
2015-04-29 21:43:00 -07:00 |
|
Matt
|
09a79d230c
|
check for .css?* better as media extensions.
do it when adding outlinks in xmldoc.cpp.
|
2015-04-28 14:42:04 -07:00 |
|
Matt
|
0eb415d408
|
added preliminary support for spidering .warc.gz and .arc.gz files
|
2015-04-27 21:41:22 -06:00 |
|
Matt Wells
|
644ad28912
|
debugging the hopcount bug
|
2015-04-19 15:51:29 -06:00 |
|
Matt
|
f4ca6d8cd4
|
try domain-only urls with www. when looking up
in sitelinks.txt
|
2015-01-31 15:33:37 -07:00 |
|
Matt
|
96b8197ad3
|
now it compiles with -m32
|
2014-11-10 14:45:11 -08:00 |
|
Matt Wells
|
e7dd8f7956
|
replace long long with int64_t
|
2014-10-30 13:36:39 -06:00 |
|
Matt Wells
|
f6e560c1f4
|
Initial file population.
|
2013-08-02 13:12:24 -07:00 |
|