sapling/eden/scm/tests/test-encoding.t


#require py2
Test character encoding
$ hg init t
$ cd t
we need a repo with some legacy latin-1 changesets
$ hg unbundle "$TESTDIR/bundles/legacy-encoding.hg"
adding changesets
adding manifests
adding file changes
added 2 changesets with 2 changes to 1 files
$ hg co
1 files updated, 0 files merged, 0 files removed, 0 files unresolved
$ $PYTHON << EOF
> f = open('latin-1', 'wb'); _ = f.write(b"latin-1 e' encoded: \xe9"); f.close()
> f = open('utf-8', 'wb'); _ = f.write(b"utf-8 e' encoded: \xc3\xa9"); f.close()
> f = open('latin-1-tag', 'wb'); _ = f.write(b"\xe9"); f.close()
> EOF
should fail with encoding error
$ echo "plain old ascii" > a
$ hg st
M a
? latin-1
? latin-1-tag
? utf-8
$ HGENCODING=ascii hg ci -l latin-1
encoding: replace 'ascii' with 'utf-8' automatically

Summary:
`ascii` was used as the default / fallback, which is not a user-friendly choice. Nowadays utf-8 dominates:

- Rust stdlib is utf-8.
- Ruby since 1.9 is utf-8 by default.
- Python 3 is unicode by default.
- Windows 10 adds utf-8 code page.

Given the fact that:

- Our CI sets HGENCODING to utf-8
- Nuclide passes `--encoding=utf-8` to every command.
- Some people have messed up with `LC_*` and complained about hg crashes.
- utf-8 is a super set of ascii, nobody complains that they want `ascii` encoding and the `utf-8` encoding messed their setup up.

Let's just use `utf-8` as the default encoding. More aggressively, if someone sets `ascii` as the encoding, it's almost always a mistake. Auto-correct that to `utf-8` too.

This should also make future integration with Rust easier (where it's enforced utf-8 and does not have an option to change the encoding). In the future we might just drop the flexibility of choosing customized encoding, so this diff autofixes `ascii` to `utf-8`, instead of allowing `ascii` to be set. We cannot enforce `utf-8` yet, because of Windows.

Here is our encoding strategy vs the upstream's:

| item           | upstream current | upstream ideal | ours current  | ours ideal |
| CLI argv       | bytes            | bytes          | utf-8 [1]     | utf-8      |
| path           | bytes            | auto [3]       | migrating [2] | utf-8      |
| commit message | utf-8            | utf-8          | utf-8         | utf-8      |
| bookmark name  | utf-8            | utf-8          | utf-8         | utf-8      |
| file content   | bytes            | bytes          | bytes         | bytes      |

[1]: Argv was accidentally enforced utf-8 for command-line arguments by a Rust wrapper. But it simplified a lot of things and is kind of ok: everything that can be passed as CLI arguments are utf-8: -M commit message, -b bookmark, paths, etc. There is no "file content" passed via CLI arguments.

[2]: Path is controversial, because it's possible for systems to have non-utf8 paths. The upstream behavior is incorrect if a repo gets shared among different encoding systems (ex. both Linux and Windows). We have to know the encoding of paths to be able to convert them suitable for the local system. One way is to enforce UTF-8 for paths. The other is to keep encoding information stored with individual paths (like Ruby strings). The UTF-8 approach is much simpler with the tradeoff that non-utf-8 paths become unsupported, which seems to be a reasonable trade-off.

[3]: See https://www.mercurial-scm.org/wiki/WindowsUTF8Plan.

Reviewed By: singhsrb
Differential Revision: D17098991
fbshipit-source-id: c0ff1e586a887233bd43cdb854fb3538aa9b70c2
2019-09-13 01:05:08 +03:00
abort: decoding near ' encoded: \xe9': 'utf8' codec can't decode byte 0xe9 in position 20: unexpected end of data! (esc)
[255]
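The abort above can be reproduced outside hg. The following is a minimal sketch in plain Python 3, not hg's own decoding path; `try_decode` is a hypothetical helper introduced only for illustration. It shows why the lone byte 0xe9 at offset 20 is rejected by a UTF-8 decoder (it is a lead byte that expects continuation bytes) while the same bytes decode cleanly as latin-1.

```python
# Same bytes the test writes to the `latin-1` message file.
message = b"latin-1 e' encoded: \xe9"


def try_decode(data, encoding):
    """Hypothetical helper: return (text, None) or (None, error)."""
    try:
        return data.decode(encoding), None
    except UnicodeDecodeError as exc:
        return None, exc


# 0xe9 starts a three-byte UTF-8 sequence, but the data ends right after it.
utf8_text, utf8_err = try_decode(message, "utf-8")

# Every byte value is a valid latin-1 character, so this cannot fail.
latin1_text, latin1_err = try_decode(message, "latin-1")

if utf8_err is not None:
    print("utf-8 failed at offset", utf8_err.start, "reason:", utf8_err.reason)
print("latin-1 decoded to:", repr(latin1_text))
```

This is also why the same message file is accepted once HGENCODING=latin-1 is set.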
these should work
$ echo "latin-1" > a
$ HGENCODING=latin-1 hg ci -l latin-1
$ echo "utf-8" > a
$ HGENCODING=utf-8 hg ci -l utf-8
hg log (ascii)
$ hg --encoding ascii log
commit: ca661e7520de
user: test
date: Thu Jan 01 00:00:00 1970 +0000
summary: utf-8 e' encoded: ?
commit: 650c6f3d55dd
user: test
date: Thu Jan 01 00:00:00 1970 +0000
summary: latin-1 e' encoded: ?
commit: 0e5b7e3f9c4a
user: test
date: Mon Jan 12 13:46:40 1970 +0000
summary: koi8-r: ????? = u'\u0440\u0442\u0443\u0442\u044c'
commit: 1e78a93102a3
user: test
date: Mon Jan 12 13:46:40 1970 +0000
summary: latin-1 e': ? = u'\xe9'
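The `?` characters in the ascii log are what a lossy conversion to the output encoding produces. The sketch below assumes the summary is stored as UTF-8 and converted for display with replacement; `to_local` is a hypothetical name for this illustration and is not hg's actual function.

```python
# Summary as stored after `HGENCODING=utf-8 hg ci -l utf-8`.
stored = b"utf-8 e' encoded: \xc3\xa9"


def to_local(stored_utf8, encoding):
    """Hypothetical helper: re-encode a UTF-8 summary for display,
    replacing characters the target encoding cannot represent."""
    return stored_utf8.decode("utf-8").encode(encoding, "replace")


ascii_summary = to_local(stored, "ascii")      # e-acute has no ascii form
latin1_summary = to_local(stored, "latin-1")   # e-acute is the byte 0xe9
utf8_summary = to_local(stored, "utf-8")       # unchanged

print(ascii_summary)
print(latin1_summary)
print(utf8_summary)
```

The three results line up with the summaries shown for commit ca661e7520de under the ascii, latin-1 and utf-8 logs.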
hg log (latin-1)
$ hg --encoding latin-1 log
commit: ca661e7520de
user: test
date: Thu Jan 01 00:00:00 1970 +0000
summary: utf-8 e' encoded: \xe9 (esc)
commit: 650c6f3d55dd
user: test
date: Thu Jan 01 00:00:00 1970 +0000
summary: latin-1 e' encoded: \xe9 (esc)
commit: 0e5b7e3f9c4a
user: test
date: Mon Jan 12 13:46:40 1970 +0000
summary: koi8-r: \xd2\xd4\xd5\xd4\xd8 = u'\\u0440\\u0442\\u0443\\u0442\\u044c' (esc)
commit: 1e78a93102a3
user: test
date: Mon Jan 12 13:46:40 1970 +0000
summary: latin-1 e': \xe9 = u'\\xe9' (esc)
hg log (utf-8)
$ hg --encoding utf-8 log
commit: ca661e7520de
user: test
date: Thu Jan 01 00:00:00 1970 +0000
summary: utf-8 e' encoded: \xc3\xa9 (esc)
commit: 650c6f3d55dd
user: test
date: Thu Jan 01 00:00:00 1970 +0000
summary: latin-1 e' encoded: \xc3\xa9 (esc)
commit: 0e5b7e3f9c4a
user: test
date: Mon Jan 12 13:46:40 1970 +0000
summary: koi8-r: \xc3\x92\xc3\x94\xc3\x95\xc3\x94\xc3\x98 = u'\\u0440\\u0442\\u0443\\u0442\\u044c' (esc)
commit: 1e78a93102a3
user: test
date: Mon Jan 12 13:46:40 1970 +0000
summary: latin-1 e': \xc3\xa9 = u'\\xe9' (esc)
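The koi8-r changeset comes out as `\xc3\x92\xc3\x94...` rather than Cyrillic in the utf-8 log. That output is consistent with the legacy bytes being treated as latin-1 when they are not valid UTF-8. The sketch below only reproduces the byte arithmetic; the fallback order in `decode_with_fallback` is an assumption made for illustration, not a statement of how hg implements it.

```python
# Legacy summary bytes, as shown by `hg --encoding latin-1 log`.
legacy = b"\xd2\xd4\xd5\xd4\xd8"


def decode_with_fallback(data, fallback="latin-1"):
    """Hypothetical helper: try UTF-8 first, then a fallback encoding."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode(fallback)


# Not valid UTF-8, so each byte becomes one latin-1 character, which
# then takes two bytes when re-encoded as UTF-8 for display.
displayed = decode_with_fallback(legacy).encode("utf-8")

# What the author meant: the same bytes read as koi8-r.
intended = legacy.decode("koi8-r")

print(displayed)
print(intended.encode("unicode_escape"))
```

The `intended` value matches the `u'\u0440\u0442\u0443\u0442\u044c'` literal recorded in the summary itself.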
hg log (utf-8)
$ HGENCODING=utf-8 hg log
commit: ca661e7520de
user: test
date: Thu Jan 01 00:00:00 1970 +0000
summary: utf-8 e' encoded: \xc3\xa9 (esc)
commit: 650c6f3d55dd
user: test
date: Thu Jan 01 00:00:00 1970 +0000
summary: latin-1 e' encoded: \xc3\xa9 (esc)
commit: 0e5b7e3f9c4a
user: test
date: Mon Jan 12 13:46:40 1970 +0000
summary: koi8-r: \xc3\x92\xc3\x94\xc3\x95\xc3\x94\xc3\x98 = u'\\u0440\\u0442\\u0443\\u0442\\u044c' (esc)
commit: 1e78a93102a3
user: test
date: Mon Jan 12 13:46:40 1970 +0000
summary: latin-1 e': \xc3\xa9 = u'\\xe9' (esc)
hg log (dolphin)
$ HGENCODING=dolphin hg log
abort: unknown encoding: dolphin
(please check your locale settings)
[255]
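An unknown encoding name such as `dolphin` can be detected up front with Python's codec registry. This is a minimal sketch of that check; `is_known_encoding` is a hypothetical helper, and whether hg validates the name this way is an assumption.

```python
import codecs


def is_known_encoding(name):
    """Hypothetical helper: True if Python has a codec for `name`."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False


# Encodings used elsewhere in this test, plus the bogus one.
results = {
    name: is_known_encoding(name)
    for name in ("ascii", "latin-1", "utf-8", "koi8-r", "dolphin")
}
print(results)
```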
$ cp latin-1-tag .hg/branch
$ HGENCODING=latin-1 hg ci -m 'auto-promote legacy name'
$ cd ..
Test roundtrip encoding/decoding of utf8b for generated data
#if hypothesis
>>> from hypothesishelpers import *
>>> from edenscm.mercurial import encoding
>>> roundtrips(st.binary(), encoding.fromutf8b, encoding.toutf8b)
Round trip OK
#endif
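The roundtrip property checked above can be illustrated without hg or hypothesis. The sketch below uses Python's `surrogateescape` error handler as an analogous scheme for carrying arbitrary bytes through text. It is not the `encoding.toutf8b` / `encoding.fromutf8b` implementation, and the sample inputs are chosen here for illustration.

```python
import os


def roundtrip(data):
    """Decode arbitrary bytes to text and encode them back.

    Invalid UTF-8 bytes are mapped to lone surrogates on the way in
    and restored to the original bytes on the way out.
    """
    text = data.decode("utf-8", "surrogateescape")
    return text.encode("utf-8", "surrogateescape")


samples = [
    b"",
    b"plain old ascii",
    b"utf-8 e' encoded: \xc3\xa9",    # valid UTF-8
    b"latin-1 e' encoded: \xe9",      # invalid UTF-8
    bytes(range(256)),                # every byte value once
    os.urandom(64),                   # random binary data
]

all_ok = all(roundtrip(sample) == sample for sample in samples)
print("Round trip OK" if all_ok else "Round trip FAILED")
```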