Introduction
------------

Paths are used by file systems as well as protocols to identify files
and other resources. We need a generic, type-safe way to represent paths in
general and file system paths in particular.

Path limits
-----------

OS include files generally define PATH_MAX as 4K and NAME_MAX as 255.
However, depending on the file system, it is possible to create paths
longer than these limits.

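As a rough illustration, the limits above can be checked on raw path bytes.
This is a minimal sketch: the constants are simply the values quoted above,
and ``LimitIssue``, ``components`` and ``checkLimits`` are hypothetical names.
Because longer paths can exist on some file systems, the check reports issues
rather than rejecting the path.

```haskell
import Data.Word (Word8)

-- Assumed limits, taken from the values quoted in the text; real limits
-- vary by OS and file system.
pathMax, nameMax :: Int
pathMax = 4096
nameMax = 255

-- | Which conventional limit a raw path exceeds (hypothetical type).
data LimitIssue = PathTooLong Int | NameTooLong Int
    deriving (Eq, Show)

-- | Split a raw path on the POSIX separator byte (47, '/'),
-- dropping empty components.
components :: [Word8] -> [[Word8]]
components bs = filter (not . null) (go bs)
  where
    go xs = case break (== 47) xs of
        (c, [])       -> [c]
        (c, _ : rest) -> c : go rest

-- | Report, rather than reject, every limit the path exceeds.
checkLimits :: [Word8] -> [LimitIssue]
checkLimits path =
    [PathTooLong n | let n = length path, n > pathMax]
        ++ [NameTooLong n | c <- components path, let n = length c, n > nameMax]
```

A caller can treat a non-empty result as a warning and still hand the path to
the OS, which remains the final authority on what it accepts.
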
Scalability
-----------

In general, directory trees can be quite deep and a directory can contain
millions of entries. A good benchmark to measure the efficiency of a path
representation would be to traverse a directory tree recursively and list all
the nodes under the tree. We could repeat the traversal many times so that we
do not need a really big directory tree.

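A baseline for such a benchmark could look like the sketch below. It assumes
the String based ``FilePath`` API of the standard ``directory`` and
``filepath`` packages, not any streamly path type; ``countNodes`` and
``repeatTraversal`` are hypothetical names. Symbolic links are counted but not
followed, so that link cycles cannot cause unbounded recursion.

```haskell
import Control.Monad (foldM)
import System.Directory
    (doesDirectoryExist, listDirectory, pathIsSymbolicLink)
import System.FilePath ((</>))

-- | Count every node (file or directory) below a directory, recursively.
-- Symbolic links are counted but not followed, to avoid cycles.
countNodes :: FilePath -> IO Int
countNodes dir = do
    names <- listDirectory dir
    foldM visit 0 names
  where
    visit acc name = do
        let path = dir </> name
        isLink <- pathIsSymbolicLink path
        isDir <- doesDirectoryExist path
        below <- if isDir && not isLink then countNodes path else pure 0
        pure $! acc + 1 + below

-- | Repeat the traversal n times, returning the count of the last run.
repeatTraversal :: Int -> FilePath -> IO Int
repeatTraversal n dir = go n 0
  where
    go :: Int -> Int -> IO Int
    go k lastCount
        | k <= 0 = pure lastCount
        | otherwise = countNodes dir >>= go (k - 1)
```

Every ``dir </> name`` allocates and re-traverses a String, which is the kind
of per-node cost a more compact path representation would be measured against.
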
Compatibility
-------------

A file system starts with just the root directory; files are then
created in the file system by the user or by programs storing their data
on it. When a directory or file is created, or when a directory is
listed, the following operations are performed:

1. For lookups, an existing directory name must be resolved based on the name
   supplied by the user.
2. For creation, the name of the file to be created is supplied by the user.

When the user asks the file system to look up or create a file or
directory:

1) The operating system passes the name as it is, without any changes
   whatsoever, to the file system. (Open question: is this always true,
   for example on Windows?)
2) The file system may translate the name to its own conventions before a
   lookup or create, e.g. it may:

   * change the name to upper case
   * translate the name to 8.3 chars
   * change the character encoding?
   * change the unicode normalization form of the name (Apple)

When resolving an existing directory name in the file system we need
to supply a path which consists of component names separated by a separator
byte. Separators are of no consequence to the file system; they are
resolved by the OS, and the path components are used to look up the path
one component at a time. The path for lookup is acquired from user
input, from device input, or from the program itself, which previously got
the path entries by traversing the file system.

1) When the path is acquired by user input, the input could be:

   a) a literal string in the program
   b) a path entered via an input device
   c) a path coming from the network

2) If the path was previously acquired from the file system then the
   best thing to do is to never change anything in the path, store it
   as it is, and supply exactly the same path when needed. That way we can
   guarantee that the path remains exactly what it was in the file system.

Handling String Literals
========================

The encoding of the source code file depends on the editor used and the
encoding chosen when saving it. The string literals are parsed
by the GHC parser and then stored in the generated binaries as null
terminated C string literals encoded in UTF-8 (see GHC reference). There are
several possible points of failure here:

a) The GHC parser needs to interpret the source code encoding correctly.
b) We assume that the editor does not perform any translation on the
   literal as entered by the user, e.g. it does not perform unicode
   normalization on it. If it does, the string as entered by the user
   won't remain the same when it reaches the file system.
c) The GHC parser stores the parsed string literal in UTF-8 encoding. We
   assume that GHC does not perform any unicode normalization or any
   other translation on the string.

The UTF-8 encoded string literal can be passed as a blob of bytes to the file
system, or it can be converted to the String type and re-encoded as UTF-8.
Both should work equivalently in this case.

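The re-encoding route can be sketched with a small hand-written UTF-8 encoder
over ``String``. This is an illustrative sketch only: ``encodeChar`` and
``encodeUtf8`` are hypothetical helpers, not GHC or streamly functions, and
the encoder performs no normalization and does not reject surrogate code
points.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- | Encode one code point as UTF-8 (no normalization, no translation).
-- Surrogate code points are not rejected in this sketch.
encodeChar :: Char -> [Word8]
encodeChar c
    | n < 0x80 = [byte n]
    | n < 0x800 = [byte (0xC0 .|. shiftR n 6), cont n]
    | n < 0x10000 = [byte (0xE0 .|. shiftR n 12), cont (shiftR n 6), cont n]
    | otherwise =
        [ byte (0xF0 .|. shiftR n 18)
        , cont (shiftR n 12)
        , cont (shiftR n 6)
        , cont n
        ]
  where
    n = ord c
    byte = fromIntegral :: Int -> Word8
    -- Continuation byte: 10xxxxxx carrying the low six bits.
    cont x = byte (0x80 .|. (x .&. 0x3F))

-- | Encode a whole String, one code point at a time.
encodeUtf8 :: String -> [Word8]
encodeUtf8 = concatMap encodeChar
```

Writing the literal with escapes, e.g. ``encodeUtf8 "caf\x00E9"``, sidesteps
points (a) and (b) above because the source file then contains only ASCII.
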
Handling Input From Devices
===========================

The path provided by the user arrives in some encoding, determined by
the terminal settings or by the encoding assumed by the sender over the
network. The correctness depends on the contract between the two parties,
e.g. the locale setting. We assume that we get a sequence of raw bytes
from the input device. We need to treat the sequence as raw bytes and send
it as it is to the file system without any translation.

Manipulating Paths
------------------

We need to split the path into components at the separator bytes. We assume
that the separator can be identified and removed correctly irrespective
of the encoding. We also make sure that none of the bytes in the
components is a separator byte.

Also, we join the path components by the separator byte
irrespective of the encoding of the components. If the OS treats the
path as a sequence of bytes and nothing else, and the components do not
contain the separator byte, then we are good: we know that the OS also
parses the path based on the separator as a raw byte.

We may perform some validations on the paths, such as checking that file
names are not "." or "..". Such validations could be optional, and we could
also provide a way to skip validation entirely, blindly use the
paths as provided by the user, and let the file system/OS fail.

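The splitting, joining and optional validation described in this section can
be sketched on plain lists of bytes. This is a minimal sketch under
assumptions: ``splitOn``, ``joinWith`` and ``isDotOrDotDot`` are hypothetical
names, and a list of ``Word8`` stands in for whatever array type a real
implementation would use.

```haskell
import Data.List (intercalate)
import Data.Word (Word8)

-- | Split raw path bytes on a separator byte. Empty components (from
-- leading, trailing or repeated separators) are kept, so joining the
-- result restores the input exactly.
splitOn :: Word8 -> [Word8] -> [[Word8]]
splitOn sep bs = case break (== sep) bs of
    (c, [])       -> [c]
    (c, _ : rest) -> c : splitOn sep rest

-- | Join components with the separator byte; Nothing if any component
-- itself contains the separator.
joinWith :: Word8 -> [[Word8]] -> Maybe [Word8]
joinWith sep cs
    | any (elem sep) cs = Nothing
    | otherwise = Just (intercalate [sep] cs)

-- | Optional validation: is the component "." (byte 46) or ".."?
isDotOrDotDot :: [Word8] -> Bool
isDotOrDotDot c = c == [46] || c == [46, 46]
```

Because nothing is decoded, ``joinWith sep (splitOn sep p)`` gives back
``Just p`` for any byte sequence ``p``, whatever encoding the components use.
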
File System Translations
------------------------

As we noted earlier, the file system may translate the paths before
using them. For example, it may store the path after converting it to
the NFD unicode normalization form. Translation may create round-tripping
issues for programs. For example, a program may use a string literal
which is stored in NFC and the file system converts it to NFD. Later,
when the same path is retrieved from the file system and compared with
the string literal that was used to create it, it won't match. In such
cases the program needs to understand the file system and perform
comparisons after applying appropriate translations to the paths. To
perform matching and translations correctly the program needs to
correctly interpret the encoding specific to the file system.

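The mismatch is easy to reproduce with two spellings of the same visible
name. The sketch below is illustrative only: ``composeToy`` is a hypothetical
toy that handles a single character pair, whereas a real program would need a
complete Unicode normalization implementation matching the file system.

```haskell
-- The same visible name with the accented letter as one precomposed code
-- point (NFC) and as 'e' followed by a combining acute accent (NFD).
nfcName, nfdName :: String
nfcName = "caf\x00E9"
nfdName = "cafe\x0301"

-- | Toy composition covering only the e + U+0301 pair. A real program
-- would need a full Unicode normalization implementation.
composeToy :: String -> String
composeToy ('e' : '\x0301' : rest) = '\x00E9' : composeToy rest
composeToy (c : rest) = c : composeToy rest
composeToy [] = []

-- | Compare two names after translating both to the same form.
sameName :: String -> String -> Bool
sameName a b = composeToy a == composeToy b
```

Here ``nfcName == nfdName`` is ``False`` although both display identically,
while ``sameName nfcName nfdName`` is ``True`` because both sides are
translated before the comparison.
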
Displaying Paths
----------------

When we display a path to the user we are forced to interpret
it according to some encoding. To display the path correctly we have to
know exactly how the file system stores it. Otherwise, if we display
it differently, the user may use the displayed result to look for the file
and may not find it.

Type Safety Requirements
------------------------

* Safety against using an absolute path where a relative path is to be
  used and vice-versa.

  * Validations for absolute or relative path when constructing a path.
  * We cannot append an absolute path to another path.

* Safety against using a file name where a directory name is to be used and
  vice-versa.

  * Certain validations can be performed e.g. file names cannot be "." or "..".
  * We should not be appending more directory components to a file path.

In don't-care situations we should be able to use any type
conveniently or cast one type into another. It is desirable that the
programmer can choose the safety level. For example, we should be able
to instantiate a path type where we only care about the distinction
between absolute and relative paths but not about the distinction between
files and directories, or vice versa.

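One way to express these requirements is with phantom type parameters. The
sketch below is an assumption about how such an API could look, not the
design streamly adopts: every name in it (``Abs``, ``Rel``, ``File``, ``Dir``,
``Path``, ``append``, ``unsafeCast`` and the constructors functions) is
hypothetical, and components are plain Strings for brevity.

```haskell
-- Phantom tags; they have no values and exist only at the type level.
data Abs
data Rel
data File
data Dir

-- | A path tagged with its base (Abs/Rel) and its node type (File/Dir).
newtype Path base node = Path [String]
    deriving (Eq, Show)

-- | Reject component names that can never be valid.
validName :: String -> Bool
validName n = not (null n) && n `notElem` [".", ".."] && '/' `notElem` n

-- | Checked constructors; each fixes the tags of the result.
absDir :: [String] -> Maybe (Path Abs Dir)
absDir cs = if all validName cs then Just (Path cs) else Nothing

relDir :: [String] -> Maybe (Path Rel Dir)
relDir cs = if all validName cs then Just (Path cs) else Nothing

relFile :: [String] -> Maybe (Path Rel File)
relFile cs
    | not (null cs) && all validName cs = Just (Path cs)
    | otherwise = Nothing

-- | Only a relative path may be appended, and only to a directory.
append :: Path base Dir -> Path Rel node -> Path base node
append (Path a) (Path b) = Path (a ++ b)

-- | Explicit escape hatch for don't-care situations.
unsafeCast :: Path b1 n1 -> Path b2 n2
unsafeCast (Path cs) = Path cs
```

With these types, appending to a ``Path base File`` or appending a
``Path Abs node`` is rejected at compile time. A weaker safety level could be
obtained by using a single tag for one of the two parameters.
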
Requirement Summary
-------------------

* minimal dependencies, specifically streamly does not depend on bytestring
* round-tripping safety with respect to paths returned by the file system
* type safety for different path types
* support Posix/Windows
* support URI paths and other ways to represent paths where the separator could
  be different.

Design Considerations
---------------------

* Should we store a path as separate components or as a single string with
  separators?

* Should we validate the paths returned from the file system, or trust
  them and use them directly without any validation? We need to see if that
  makes any difference to path-heavy benchmarks. If we want to use them
  directly then we have to store each path as a single string.

* Parameterize the low level APIs with the separator so that we can
  support arbitrary separators when parsing or reconstructing paths.

* The low level API can support path handling in trees/DAGs/graphs in general.
  For example, in trees a child cannot have multiple parents whereas in
  DAGs that is allowed, and in graphs we can have cycles. We may also need
  ways to detect cycles.

* Do we need to support arbitrarily long paths, i.e. streaming of paths? We do
  not need that for file system paths: they are limited in size and the
  operating system requires them in strict buffers anyway. In the case of
  graphs with cycles, paths can be infinite; we could generate the path as a
  stream and the consumer could traverse the graph according to the
  generated stream. If we want to support streaming then we have to store
  paths as a stream of chunks rather than a single string.

* In general, paths need not be strings, e.g. they can be references to
  locations in memory or they can be IP addresses of nodes. At an abstract
  level, paths are just a stream of tokens that represent a certain traversal.

* Relative paths are the most general representation. At a low level,
  all paths are relative: absolute paths are relative to a specified root
  whereas relative paths are relative to a dynamic root, which is the
  current directory.

* Windows can have different drive letters as the root. So to represent paths
  with a root in general we can also store the specific root along with the
  path. In the case of POSIX this will always be "/". In general, it could be
  a host name or IP address, or depend on the protocol whose path we are
  representing.

* We can parameterize the low level path type with the type of path e.g. POSIX,
  WINDOWS, HTTP etc. In general, programs may have to manipulate different
  types of paths at the same time. High level path types can be instantiated
  from the low level type, so they can be as simple as desired.

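Two of the ideas above, storing the root alongside the components and
parameterizing parsing and rendering by the separator, can be sketched
together. Everything here is hypothetical: the ``Root`` constructors, the
``//host`` prefix chosen for host roots, and the use of ``String`` instead of
raw bytes are assumptions made to keep the example short.

```haskell
import Data.List (intercalate)

-- | Root of a rooted path; constructor names are hypothetical.
data Root
    = PosixRoot          -- always "/"
    | Drive Char         -- Windows drive letter, e.g. 'C'
    | Host String        -- host name or IP address for protocol paths
    deriving (Eq, Show)

-- | A path is an optional root plus components; no root means relative
-- to a dynamic root such as the current directory.
data Path = Path (Maybe Root) [String]
    deriving (Eq, Show)

-- | Split on an arbitrary separator character.
splitBy :: Char -> String -> [String]
splitBy sep s = case break (== sep) s of
    (c, [])       -> [c]
    (c, _ : rest) -> c : splitBy sep rest

-- | Render with the given separator; the root supplies its own prefix.
render :: Char -> Path -> String
render sep (Path root cs) = prefix root ++ intercalate [sep] cs
  where
    prefix Nothing = ""
    prefix (Just PosixRoot) = [sep]
    prefix (Just (Drive d)) = [d, ':', sep]
    prefix (Just (Host h)) = "//" ++ h ++ [sep]
```

The same component list can then be rendered as a POSIX path with ``'/'`` or
as a Windows path with ``'\\'`` by changing only the root and the separator.
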
References
----------

Some related links found by web search:

* https://gitlab.haskell.org/ghc/ghc/issues/5218
* https://nodejs.org/fr/docs/guides/working-with-different-filesystems/
* https://unix.stackexchange.com/questions/2089/what-charset-encoding-is-used-for-filenames-and-paths-on-linux
* https://docs.microsoft.com/en-us/windows/win32/intl/character-sets-used-in-file-names
* https://beets.io/blog/paths.html