blob: 15595b50d998d36c8c5c6de786b5dfef5c4367d8 [file] [log] [blame]
% File src/library/base/man/UTFfilepaths.Rd
% Part of the R package, https://www.R-project.org
% Copyright 2019 R Core Team
% Distributed under GPL 2 or later
\name{UTF8filepaths}
\alias{file path encoding}
\alias{UTF-8 file path}
\title{File Paths not in the Native Encoding}
\description{
Most modern file systems store file-path components (names of
directories and files) in a character encoding of wide scope: usually
UTF-8 on a Unix-alike and UCS-2/UTF-16 on Windows. However, this was
not true when \R was first developed and there are still exceptions
amongst file systems, e.g.\sspace{}FAT32.
This was not something anticipated by the C and POSIX standards which
only provide means to access files \emph{via} file paths encoded in
the current locale, for example those specified in Latin-1 in a
Latin-1 locale.
Everything here apart from the specific section on Windows is about
Unix-alikes.
}
\details{
It is possible to mark character strings (elements of character
vectors) as being in UTF-8 or Latin-1 (see \code{\link{Encoding}}).
This allows file paths not in the native encoding to be
expressed in \R character vectors but there is almost no way to use
them unless they can be translated to the native encoding. That is of
course not a problem if that is UTF-8, so these details are really only
relevant to the use of a non-UTF-8 locale (including a C locale) on a
Unix-alike.
Functions to open a file such as \code{\link{file}},
\code{\link{fifo}}, \code{\link{pipe}}, \code{\link{gzfile}},
\code{\link{bzfile}}, \code{\link{xzfile}} and \code{\link{unz}} give
an error for non-native filepaths. Where functions look at existence
such as \code{file.exists}, \code{\link{dir.exists}},
\code{\link{unlink}}, \code{\link{file.info}} and
\code{\link{list.files}}, non-native filepaths are treated as
non-existent.
Many other functions use \code{file} or \code{gzfile} to open their
files.
\code{\link{file.path}} allows non-native file paths to be combined,
marking them as UTF-8 if needed.
\code{\link{path.expand}} only handles paths in the native encoding.
}
\section{Windows}{
Windows provides proprietary entry points to access its file systems,
and these gained \sQuote{wide} versions in Windows NT that allowed
file paths in UCS-2/UTF-16 to be accessed from any locale.
Some \R functions use these entry points when file paths are marked
as Latin-1 or UTF-8 to allow access to paths not in the current
encoding. These include
%
\code{\link{file}}, \code{\link{file.access}},
\code{\link{file.append}}, \code{\link{file.copy}},
\code{\link{file.create}}, \code{\link{file.exists}},
\code{\link{file.info}}, \code{\link{file.link}},
\code{\link{file.remove}}, \code{\link{file.rename}},
\code{\link{file.symlink}}
%
and
%
\code{\link{dir.create}}, \code{\link{dir.exists}},
\code{\link{normalizePath}}, \code{\link{path.expand}},
\code{\link{pipe}}, \code{\link{Sys.glob}},
#ifdef unix
\code{Sys.junction},
#endif
#ifdef windows
\code{\link{Sys.junction}},
#endif
\code{\link{unlink}}
%
but not \code{\link{gzfile}} \code{\link{bzfile}},
\code{\link{xzfile}} nor \code{\link{unz}}.
For functions using \code{\link{gzfile}} (including
\code{\link{load}}, \code{\link{readRDS}}, \code{\link{read.dcf}} and
\code{\link{tar}}), it is often possible to use a \code{\link{gzcon}}
connection wrapping a \code{\link{file}} connection.
Other notable exceptions are \code{\link{list.files}},
\code{\link{list.dirs}}, \code{\link{system}} and file-path inputs for
graphics devices.
}
\section{Historical comment}{
Before \R 4.0.0, file paths marked as being in Latin-1 or UTF-8 were
silently translated to the native encoding using escapes such as
\samp{<e7>} or \samp{<U+00e7>}. This created valid file names but
maybe not those intended.
}
\note{
This document is still a work-in-progress.
}
\keyword{file}