Wednesday, July 23, 2008

man-pages-3.05 is released

I've uploaded man-pages-3.05 into the release directory (or view the online pages). Notable changes in man-pages-3.05 are:

  • A new math_error(7) page describes how to diagnose errors when calling math functions.
  • A new matherr(3) page describes the older SVID-specified mechanism for diagnosing errors from math functions.

There are a few other minor changes, but this week's release is indeed smaller than usual. That's because I'm working on pending changes to almost all of the math man pages to improve: the discussion of return values, treatment of special values (+0, -0, infinities, NaN), and error diagnostics; the accuracy of the description of feature test macro requirements; and various other details. Expect a (very) big release next week.

View of the Grünegg, Konolfingen

Wednesday, July 16, 2008

Capabilities have fully arrived, finally

Linux 2.6.26 is out, which means that a complete Linux capabilities implementation has finally arrived, since we now have:

  • The ability to attach capability sets to files (added in 2.6.24), so that a process can acquire capabilities during an execve(2).
  • A CAP_SETPCAP capability with the proper semantics (since 2.6.25).
  • A per-thread capability bounding set (added in 2.6.25).
  • The per-thread securebits flags (added in 2.6.26), which can be used to restrict a thread and its children to a pure capabilities-only environment (i.e., one in which there is no special treatment of UID 0).

All of the details are provided in the recently revised capabilities(7) man page. A couple of other useful places to look for information on capabilities are Serge Hallyn's article, POSIX file capabilities: Parceling the power of root, and Chris Friedhoff's page on capabilities.
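
As a rough illustration of the per-thread capability bounding set, the following sketch queries it with the PR_CAPBSET_READ operation of prctl(2). The helper name cap_in_bounding_set() is made up for this example, and the CAP_* constants are assumed to come from <linux/capability.h>.

```c
#include <linux/capability.h>   /* CAP_* constants (assumed available) */
#include <sys/prctl.h>          /* prctl(), PR_CAPBSET_READ */

/*
 * Hypothetical helper: returns 1 if 'cap' is in the calling thread's
 * capability bounding set, 0 if it is not, and -1 on error (for example,
 * if 'cap' is not a valid capability number).
 */
int cap_in_bounding_set(int cap)
{
    return prctl(PR_CAPBSET_READ, cap, 0, 0, 0);
}
```

A caller might use cap_in_bounding_set(CAP_CHOWN) to check whether CAP_CHOWN could still be acquired via an execve(2) of a file with attached capabilities.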

man-pages-3.04 is released

I've uploaded man-pages-3.04 into the release directory (or view the online pages). Notable changes in man-pages-3.04 are:

  • A new utimensat(2) page describes the utimensat() system call (new in kernel 2.6.22, fixed in 2.6.26) and the futimens() library function.
  • A new end(3) page describes the end, etext, and edata variables.
  • The capget(2) page was updated by Andrew Morgan, in line with changes in Linux capabilities in kernels 2.6.24, 2.6.25, and 2.6.26.
  • The capabilities(7) page adds discussion of: file capabilities (new in Linux 2.6.24); the per-thread capability bounding set (new in Linux 2.6.25); the changed semantics for CAP_SETPCAP (Linux 2.6.25); per-thread securebits flags (new in Linux 2.6.26); three new capabilities, CAP_MAC_ADMIN, CAP_MAC_OVERRIDE, CAP_SETFCAP (Linux 2.6.25); historical CAP_SETPCAP semantics; the rules governing changes to capability sets; and the rationale for the inheritable set and capability bounding set. There were also many other more minor changes to this page. Thanks to Serge Hallyn for his contributions here.
  • The getrusage(2) page adds a description of the RUSAGE_THREAD option, which is new in Linux 2.6.26.
  • Many changes were made to the prctl(2) page, including adding documentation of: PR_CAPBSET_READ and PR_CAPBSET_DROP (thanks to Serge Hallyn); PR_GET_TSC and PR_SET_TSC (thanks to Erik Bosman); PR_SET_SECCOMP and PR_GET_SECCOMP; and PR_SET_SECUREBITS and PR_GET_SECUREBITS.
  • The proc(5) page adds documentation of /proc/config.gz (new in kernel 2.6); /proc/sys/vm/oom_kill_allocating_task (new in Linux 2.6.24); /proc/sys/vm/oom_dump_tasks (new in Linux 2.6.25); and /proc/sys/vm/panic_on_oom (new in Linux 2.6.18).
  • The getopt(3) page adds details on the use of optind for restarting an argument list scan.
  • The memchr(3) page adds a description of rawmemchr().
  • There were also very many minor content fixes, and minor and major formatting fixes, many of them triggered by the roughly 70 mailed reports that I received from Alain Portal, the French translator of man-pages. (Alain also provided dozens of reports that led to fixes in man-pages-3.03.) Thanks Alain!
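
To make the getopt(3) item above concrete, here is a minimal sketch of restarting an argument list scan by resetting optind. The helper count_v_flags() and the -v option are invented for the example. It uses the traditional reset value of 1; getopt(3) notes that glibc programs relying on GNU extensions should reset optind to 0 instead.

```c
#define _POSIX_C_SOURCE 200809L
#include <unistd.h>     /* getopt(), optind */

/*
 * Hypothetical helper: counts the occurrences of -v in argv.
 * Because getopt() keeps its position in the global variable optind,
 * we reset it before each scan so that the function can be called
 * repeatedly, on the same or on different argument vectors.
 */
int count_v_flags(int argc, char *const argv[])
{
    int opt;
    int count = 0;

    optind = 1;         /* restart the scan at the first argument */
    while ((opt = getopt(argc, argv, "v")) != -1) {
        if (opt == 'v')
            count++;
    }
    return count;
}
```

Without the reset, a second call would start where the previous scan stopped and would typically report no options at all.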

Tuesday, July 8, 2008

man-pages-3.03 is released

I've uploaded man-pages-3.03 into the release directory (or view the online pages). Notable changes in man-pages-3.03 are:

  • A new cpuset(7) page, written by Paul Jackson, describes the cpuset file system, the mechanism introduced in Linux 2.6.12 for confining processes to designated processors and nodes. This enormous page becomes the fourth largest in man-pages. Thanks for a great contribution, Paul!

  • A new getcpu(2) page, written by Andi Kleen, documents the getcpu(2) system call, introduced in Linux 2.6.19.

  • A new sched_getcpu(3) page documents a glibc wrapper for getcpu(2).

  • The readdir(3) page adds a description of readdir_r(3), the reentrant analog of readdir(3).

  • The signal(7) page adds a section on system call restarting (SA_RESTART), and describes the aberrant Linux behavior whereby a stop signal plus SIGCONT can interrupt some system calls, even if no signal handler has been established.
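
The following sketch illustrates the two sides of the signal(7) item above: establishing a handler with SA_RESTART, and retrying a call by hand when it still fails with EINTR. Both helper names are invented for the example; which system calls are restarted automatically, and which can be interrupted by a stop signal plus SIGCONT, is spelled out in signal(7) itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Trivial handler; it exists only so that a handler is established. */
static void on_signal(int sig)
{
    (void) sig;
}

/*
 * Hypothetical helper: establish a handler for 'sig' with SA_RESTART,
 * so that restartable system calls interrupted by the handler are
 * restarted automatically. Returns 0 on success, -1 on error.
 */
int install_restarting_handler(int sig)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return sigaction(sig, &sa, NULL);
}

/*
 * Hypothetical helper: a defensive wrapper that simply repeats read()
 * for as long as it fails with EINTR.
 */
ssize_t read_retry(int fd, void *buf, size_t count)
{
    ssize_t n;

    do {
        n = read(fd, buf, count);
    } while (n == -1 && errno == EINTR);
    return n;
}
```

The manual retry loop is the more conservative choice when a program cannot be sure that every interruption will be covered by SA_RESTART.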

Wednesday, July 2, 2008

man-pages-3.02 is released

I've uploaded man-pages-3.02 into the release directory (or view the online pages).

Notable changes in man-pages-3.02 are:

  • A new clock_nanosleep(2) page, describing the system call introduced in kernel 2.6.

  • A rewritten getgrouplist(3) page, which provides additional information and an example program.

  • A new getutmp(3) page documenting the getutmp(3) and getutmpx(3) functions.

  • A new gnu_get_libc_version(3) page documenting gnu_get_libc_version(3) and gnu_get_libc_release(3).

  • A new sigwait(3) page documenting the sigwait(3) library function.

  • A new shm_overview(7) page providing an overview of the POSIX shared memory API.

  • Additional information in the sigreturn(2) man page.

  • Additions and updates to various pages describing the login accounting APIs, including: a new getutmp(3) page documenting the getutmp(3) and getutmpx(3) functions; and various improvements in the getutent(3) and utmp(5) pages.

Monday, June 30, 2008

What's wrong with kernel-userland interface development?

So I came to write the man pages for utimensat() and futimens(), interfaces that are specified in the upcoming POSIX.1 revision, and recently added to Linux (kernel 2.6.22, glibc 2.6), and it's a familiar story: bugs, and yet more bugs.

This case is a little worse than usual (the gory details are below), but hardly exceptional: it was a similar story with other recent APIs, such as splice(), signalfd(), and the timerfd API (the last of which was ultimately redesigned after some pushing by me). In each case, an interface that was already released in a stable kernel turned out to have easy-to-find bugs when I came to document and test it. All that was required was to come up with something like a reasonable written specification (i.e., a man page), and then start testing against that specification.

The following seems (to me) like a reasonable development model for new kernel userland APIs:

  1. Write a more or less complete design specification (preferably something like a man page, which I'm more than happy to edit and review at this stage in the process), and perhaps even code up an initial version of the interface.

  2. Get review comments about the design; revise the design if necessary.

  3. Implement something like a final version of the interface.

  4. Test the interface. Preferably: test in collaboration with other people who had nothing to do with the implementation (they will think about the interface in different ways from the implementer). If it's sensible to do so (and usually it is), produce a test suite to check the correctness of as many operational cases of the interface as possible (and then send that suite to the Linux Test Project).

  5. Write the interface documentation (i.e., a man page) for userland programmers, and send it to me for editing and review. (If you do this smart, then you can just recycle the work from the first step.)

  6. Get the interface accepted into mainline.

However, as far as I can see, the actual development model for kernel-userland interfaces seems to work something like this:

  1. Code up the new interface.

  2. Post the code to LKML, explaining that "I tested it, and it works", and, maybe, include some brief documentation of the interface.

  3. Often, someone looks at the code and suggests some fixes. Usually, no one else does any testing.

  4. Code gets accepted into mainline.

  5. Usually, some few dot releases later: someone (often me) who had nothing to do with the earlier steps writes the man page.

  6. Often, some few dot releases later: find and fix all of the bugs (e.g., in the case of utimensat(), released for 2.6.22, the fixes will only make it into 2.6.26 at the earliest).

  7. Occasionally, some few dot releases later: realize that the interface could have been better designed, but it's too late now to change it, because we can't break the ABI.

Most kernel-userland interfaces are seeing far too little testing before release. This hurts because:

  • Userland programmers end up dealing with kernel bugs they shouldn't have to deal with. (When this happens often, it can damage the reputation of the Linux kernel-userland interface, and userland programmers may become wary of adopting new interfaces, which can further delay any kind of real-world testing and adoption of the interface.)

  • Code that uses new interfaces may have to special case for ABI bugs that were present in the first few kernel releases that contained the interface.

Similarly, many kernel-userland interfaces could do with more design review as well. This can hurt even more than interfaces released with bugs, since, once an interface is released, it is at best difficult, and often impossible, to modify the design, since doing so would break the kernel-userland ABI. (With timerfd, we got lucky: the interface was in any case broken by a bug, which gave a more or less blank slate for a redesign.) As a result, userland programmers usually just have to live with the bad design.

I have more time for testing and review nowadays, and hopefully I'll catch more problems before they are released into stable kernels, but there's certainly more work than one person can do. And, as described above, there is a culture problem when it comes to adding kernel-userland interfaces on Linux. The question is whether the culture can be changed. (And here it's worth repeating a point that I often make: one could perhaps try and blame individual developers for buggy interfaces, but that doesn't really get to the core issue (everyone is going to write bugs): the real problem is the process by which kernel-userland interface changes are accepted into the kernel.)

The Gory Details

For those who are interested, here are the problems I found with utimensat() and futimens(), explained by annotating the current draft of the man page.

SYNOPSIS
#include <sys/stat.h>


int utimensat(int dirfd, const char *pathname,
const struct timespec times[2], int flags);


int futimens(int fd, const struct timespec times[2]);

DESCRIPTION
utimensat() and futimens() update the timestamps of a
file with nanosecond precision. This contrasts with the
historical utime(2) and utimes(2), which permit only sec-
ond and microsecond precision, respectively, when setting
file timestamps.

This is the first of the advantages of the new interfaces: greater precision for setting timestamps. (The next revision of POSIX.1 specifies nanosecond file timestamps, and extends stat(2) to allow their retrieval. File system support is needed, and can't necessarily be retrofitted to some older file systems. Currently, Linux supports nanosecond timestamps on XFS, JFS, and ext4.)

With utimensat() the file is specified via the pathname
given in pathname. With futimens() the file whose times-
tamps are to be updated is specified via an open file
descriptor, fd.

For both calls, the new file timestamps are specified in
the array times: times[0] specifies the new "last access
time" (atime); times[1] specifies the new "last modifica-
tion time" (mtime). Each of the elements of times speci-
fies a time in seconds and nanoseconds since the Epoch
(00:00:00, 1 Jan 1970, UTC), in a structure of the fol-
lowing form:

struct timespec {
time_t tv_sec; /* seconds */
long tv_nsec; /* nanoseconds */
};

If the tv_nsec field of one of the timespec structures
has the special value UTIME_NOW, then the corresponding
file timestamp is set to the current time. If the
tv_nsec field of one of the timespec structures has the
special value UTIME_OMIT, then the corresponding file
timestamp is left unchanged. In both of these cases, the
value of the corresponding tv_sec field is ignored.

These are the other advantages of the new interfaces.

With utime(2) and utimes(2), to change just one of the timestamps, we must make a call to stat(2) to retrieve the current timestamps, use one of the timestamps to initialize the times element that we don't want to change, and then call utime(2) or utimes(2) with the desired value for the other timestamp. This can be subject to race conditions: if another process updates the file timestamps between the two calls, then that update will be lost. UTIME_OMIT allows us to avoid this problem.

UTIME_NOW exists mainly as a convenience: it allows us to avoid fetching the current time (using gettimeofday(2) or similar) in order to set a file timestamp to "now".
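
To make the UTIME_OMIT point concrete, here is a minimal sketch that changes only the last modification time in a single call, with no preceding stat(2) and hence no window for a lost update. The helper names set_mtime_only() and demo_set_mtime_only(), and the timestamp values used in the demo, are invented for this example; the code assumes a kernel and C library in which the bugs described below are fixed.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>      /* AT_FDCWD, open() */
#include <sys/stat.h>   /* utimensat(), UTIME_OMIT, stat() */
#include <time.h>
#include <unistd.h>     /* close(), unlink() */

/*
 * Hypothetical helper: set the last modification time of 'pathname'
 * to 'mtime_sec' seconds since the Epoch, leaving the last access
 * time untouched. Returns 0 on success, -1 on error.
 */
int set_mtime_only(const char *pathname, time_t mtime_sec)
{
    struct timespec times[2];

    times[0].tv_sec = 0;
    times[0].tv_nsec = UTIME_OMIT;      /* atime: leave unchanged */
    times[1].tv_sec = mtime_sec;        /* mtime: explicit new value */
    times[1].tv_nsec = 0;
    return utimensat(AT_FDCWD, pathname, times, 0);
}

/*
 * Usage sketch: create a scratch file, give it known timestamps,
 * change only the mtime, and check the result with stat().
 * Returns 0 if the atime was preserved and the mtime was updated.
 */
int demo_set_mtime_only(const char *pathname)
{
    struct timespec init[2] = {
        { .tv_sec = 1000, .tv_nsec = 0 },   /* initial atime */
        { .tv_sec = 2000, .tv_nsec = 0 }    /* initial mtime */
    };
    struct stat sb;
    int fd, ok;

    fd = open(pathname, O_CREAT | O_WRONLY, 0600);
    if (fd == -1)
        return -1;
    close(fd);

    ok = utimensat(AT_FDCWD, pathname, init, 0) == 0
        && set_mtime_only(pathname, 3000) == 0
        && stat(pathname, &sb) == 0
        && sb.st_atime == 1000
        && sb.st_mtime == 3000;

    unlink(pathname);
    return ok ? 0 : -1;
}
```

Replacing the explicit mtime with a tv_nsec value of UTIME_NOW would set that timestamp to the current time without a separate call to fetch it.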

If times is NULL, then both timestamps are set to the
current time.

Permissions requirements
To set both file timestamps to the current time (i.e.,
times is NULL, or both tv_nsec fields specify UTIME_NOW),
either:

1. the caller must have write access to the file;

2. the caller's effective user ID must match the owner of
the file; or

3. the caller must have appropriate privileges.

To make any change other than setting both timestamps to
the current time (i.e., times is not NULL, and both
tv_nsec fields are not UTIME_NOW and both tv_nsec fields
are not UTIME_OMIT), either condition 2 or 3 above must
apply.

If both tv_nsec fields are specified as UTIME_OMIT, then
no file ownership or permission checks are performed, and
the file timestamps are not modified, but other error
conditions may still be detected.

utimensat() specifics
[Details of dirfd and flags arguments omitted.]

For details of the dirfd argument, see openat(2).

RETURN VALUE
On success, utimensat() and futimens() return 0. On
error, -1 is returned and errno is set to indicate the
error.


ERRORS
[Various other errors that are irrelevant to this post
have been omitted.]

EACCES times is NULL, or both tv_nsec values are
UTIME_NOW, but the effective user ID of the
caller does not match the owner of the file, the
caller does not have write access to the file, and
the caller is not privileged (Linux: does not have
either the CAP_FOWNER or the CAP_DAC_OVERRIDE
capability).

EACCES times is NULL, or both tv_nsec values are
UTIME_NOW, and the file is marked immutable.

EPERM The caller attempted to change one or both times-
tamps to a value other than the current time, or
to change one of the timestamps to the current
time while leaving the other timestamp unchanged,
(i.e., times is not NULL, both tv_nsec fields are
not UTIME_NOW, and both tv_nsec fields are not
UTIME_OMIT) but the caller's effective user ID
does not match the owner of the file, and the caller
is not privileged (Linux: does not have the
CAP_FOWNER capability).

EPERM times is not NULL, and the file is marked append-
only or immutable.

NOTES
On Linux, futimens() is a library function implemented on
top of the utimensat() system call. To support this, the
Linux utimensat() system call implements a non-standard
feature: if pathname is NULL, then the system call modi-
fies the timestamps of the open file referred to by the
file descriptor dirfd (which may refer to any type of
file). Using this feature, the call futimens(fd, times)
is implemented as:

utimensat(fd, NULL, times, 0);

BUGS
Several bugs afflict utimensat() and futimens() on ker-
nels before 2.6.??.

Really, there should have been a test suite to go along with the initial implementation. There wasn't, unfortunately, so I wrote one (later revised a little before it went to LTP) whose results can be seen here.

I've posted patches to fix all of the bugs, and Andrew Morton has accepted them into -mm, and from there, they seem to have gone upstream to Al Viro. It's just a question of when they will get pushed into mainline. It'd be nice if they make it into 2.6.26, but maybe they won't, since it is already getting very late in the -rc cycle. (Update, 6 Jul 08: the patches have gone into -rc9.)

These bugs are either non-conformances with the POSIX.1 draft
specification or inconsistencies with historical Linux behavior.

* POSIX.1 specifies that if one of the tv_nsec fields has
the value UTIME_NOW or UTIME_OMIT, then the value of
the corresponding tv_sec field should be ignored.
Instead, the value of the tv_sec field is required to
be 0 (or the error EINVAL results).

This was a simple and obvious divergence from the specification.

* Various bugs mean that for the purposes of permission
checking, the case where both tv_nsec fields are set to
UTIME_NOW isn't always treated the same as specifying
times as NULL, and the case where one tv_nsec value is
UTIME_NOW and the other is UTIME_OMIT isn't treated the
same as specifying times as a pointer to a structure
containing arbitrary time values. As a result, in some
cases: a) file timestamps can be updated by a process
that shouldn't have permission to perform updates; b)
file timestamps can't be updated by a process that
should have permission to perform updates; and c) the
wrong errno value is returned in case of an error.

There are multiple error cases here:

  • If one of the tv_nsec fields is UTIME_OMIT and the other is UTIME_NOW, then the error EPERM should occur if the process's effective user ID does not match the file owner and the process is not privileged. Instead, the call successfully changes one of the timestamps.
  • If the file is not writable by the effective user ID of the process and the process's effective user ID does not match the file owner and the process is not privileged, and times is NULL, then the error EACCES results. This error should also occur if times points to a structure in which both tv_nsec fields are UTIME_NOW. Instead the call succeeds.
  • If a file is marked as append-only (see chattr(1)), then Linux traditionally (i.e., utime(2), utimes(2)), permits a NULL times argument to be used in order to update both timestamps to the current time. For consistency, utimensat() and futimens() should also produce the same result when given a times argument that points to a structure in which both tv_nsec fields are UTIME_NOW. Instead, the call fails with the error EPERM.
  • If a file is marked as immutable (see chattr(1)), then Linux traditionally (i.e., utime(2), utimes(2)), gives an EACCES error if times is NULL. For consistency, utimensat() and futimens() should also produce the same result when given a times that points to a structure in which both tv_nsec fields are UTIME_NOW. Instead, the call fails with the error EPERM.

* POSIX.1 says that a process that has write access to
the file can make a call with times as NULL, or with
times pointing to a structure in which both tv_nsec
fields are UTIME_NOW, in order to update both
timestamps to the current time. However, futimens()
instead checks whether the access mode of the file
descriptor allows writing.

This means that a process with a file descriptor that allows writing could change the timestamps of a file for which it does not have write permission; conversely, a process with a read-only file descriptor won't be able to update the timestamps of a file, even if it has write permission on the file.

Wednesday, June 25, 2008

man-pages-3.01 is released

I've uploaded man-pages-3.01 into the release directory (or view the online pages).

Notable changes in man-pages-3.01 are:

  • A new hostname(7) page (based on the FreeBSD page) describing hostname resolution.
  • A new symlink(7) page (heavily modified from the original FreeBSD page) describing symbolic links.
  • A rewritten acct(5) page providing much more detail on the information written to the process accounting file.
  • The getrlimit(2) page adds a description of the RLIMIT_RTTIME limit, new in Linux 2.6.25.
  • The mkstemp(3) page adds a description of mkostemp(), new in glibc 2.7.
  • The core(5) page adds a description of (and an example program for) the core_pattern pipe syntax, which appeared in Linux 2.6.19, and documents /proc/PID/coredump_filter, new in kernel 2.6.23.
  • The proc(5) page adds details for a number of previously undocumented /proc interfaces, including /proc/PID/oom_score, /proc/PID/oom_adj, /proc/PID/limits, /proc/PID/fdinfo/*, /proc/PID/mountinfo, /proc/PID/mountstats, and /proc/PID/status.
  • The time(7) page enhances the discussion of jiffies, and adds a section on high-resolution timers.
  • The unix(7) page adds a clear description of the three types of address that can appear in the sockaddr_un structure: pathname, unnamed, and abstract.
  • Many other pages saw significant changes, including brk(2), chmod(2), chown(2), nanosleep(2), open(2), sched_setscheduler(2), syscalls(2), ftime(3), getaddrinfo(3), inet(3), inet_pton(3), scanf(3), strerror(3), random(4), and locale(5).
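
Of the three sockaddr_un address types mentioned in the unix(7) item above, the abstract one is the least obvious, so here is a small sketch of building such an address. The helper make_abstract_addr() is invented for this example. The key points are that sun_path starts with a null byte, and that the address length passed to bind(2) or connect(2) determines how many of the following bytes form the name.

```c
#include <stddef.h>         /* offsetof() */
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/*
 * Hypothetical helper: fill 'addr' with an abstract socket address
 * whose name is 'name'. Returns the address length to pass to
 * bind()/connect(), or 0 if the name does not fit.
 */
socklen_t make_abstract_addr(struct sockaddr_un *addr, const char *name)
{
    size_t len = strlen(name);

    if (len > sizeof(addr->sun_path) - 1)
        return 0;

    memset(addr, 0, sizeof(*addr));
    addr->sun_family = AF_UNIX;

    /* sun_path[0] stays '\0': this is what marks the address as
       abstract rather than as a file system pathname. */
    memcpy(addr->sun_path + 1, name, len);

    return (socklen_t) (offsetof(struct sockaddr_un, sun_path) + 1 + len);
}
```

By contrast, a pathname address holds a null-terminated file system path in sun_path, and an unnamed address (for example, that of a socket created by socketpair(2)) has no name at all.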

Tuesday, June 17, 2008

man-pages-3.00 is released

I've uploaded man-pages-3.00 into the release directory (or view the online pages). Notable changes in man-pages-3.00 are:

  • The POSIX man pages (man pages sections 0p, 1p, 3p) are now split out into a separate package, which can be downloaded here. This makes sense because the POSIX pages are logically separate: they are copies of specifications from the POSIX.1 standard (to be precise, we are currently redistributing pages from the 2003 Technical Corrigendum 1), that we have kindly been given permission to publish by the IEEE and The Open Group. Furthermore, the POSIX pages are only updated rarely (e.g., occasional formatting fixes, and periodic updates when a new version of the standard is released, such as the revision due to appear later this year), so there is no need to redistribute them with each release of man-pages.

  • After Stuart Brady noted that the quotes around single-quoted characters were not formatted well in UTF-8 output, I've globally replaced the characters used in the source files for quoting in man-pages in order to try and produce better formatted output for both ASCII and UTF-8 xterms, and also to bring greater consistency in general to the use of single and double quotes in the pages. This may not be the final word on the subject, but at least things should be somewhat improved.

Monday, June 9, 2008

man-pages is now supported

Last month I started on what is, for the moment, my dream job: man-pages finally has a paid, full-time maintainer, thanks to a fellowship from the Linux Foundation. For the foreseeable future, that means I'll be working on:

  • Documenting every new Linux kernel-userland (and glibc) API, and every API change, that is released into the mainline kernel, ideally before actual release. (That's the ideal, but there's quite a backlog, so I'm not going to achieve the ideal immediately.)

  • Testing new APIs, again ideally before they are released into the mainline kernel, and probably doing some light bug fixing while I'm at it.

  • Design review of new kernel-userland APIs.

  • And of course accepting patches and dealing with bug reports for existing man pages.

Other than that, I'll be helping out with in-kernel documentation (the /Documentation directory) by providing editorial, review, and possibly other assistance.

When I find time, I'll try to post a longer job description (but have a look here in the meantime). Meanwhile, I'm looking forward to going to the Linux Plumbers Conference, September 17-19, in Portland, OR, USA, where there is currently a planned track on kernel-userland APIs.

Friday, June 6, 2008

man-pages-2.80 and man-pages-fr-2.80.0 are released

The Linux man-pages maintainer, Michael Kerrisk, and the maintainer of the French translation of man-pages, Alain Portal, proudly announce the simultaneous release of man-pages-2.80 in English original and French translation.

The English version can be downloaded from the usual release directory (or viewed online at the usual location).

The French translation can be downloaded from http://manpagesfr.free.fr/download.html, or viewed online at http://manpagesfr.free.fr/consulter.html.

There is a long list of changes in man-pages-2.80. Among the more notable changes are the following:

  • A new random_r(3) page describes random_r(3), srandom_r(3), initstate_r(3), and setstate_r(3), the reentrant equivalents of random(3), srandom(3), initstate(3), and setstate(3).
  • getpriority(2) adds text describing the punchier effect of nice values since kernel 2.6.23.
  • mmap(2) describes improvements for the support of MAP_POPULATE in kernel 2.6.23.
  • sched_setscheduler(2) adds a description of the SCHED_IDLE policy, new in 2.6.23.
  • futimes(3) adds a description of lutimes(3), which was added in glibc 2.6.
  • credentials(7) adds some text describing how NPTL maintains the same UIDs and GIDs for all threads in a process, even though the Linux kernel allows each thread to have distinct credentials.
  • inotify(7) documents the SIGIO support that was added for inotify in kernel 2.6.25.
  • standards(7) adds a description of the upcoming POSIX.1 revision.
  • A large number of consistency and formatting fix-ups were made to many pages.






Monday, March 3, 2008

man-pages-2.79 is released

I've uploaded man-pages-2.79 into the release directory (or view the online pages).

Notable changes in man-pages-2.79 are:

  • a new page describing the timerfd API (timerfd_create(2), timerfd_settime(2), timerfd_gettime(2)), which is new in kernel 2.6.25
  • various additions and improvements to the syslog(2) page (thanks in part to Jeremy Kerr)
  • many clarifications and additions to the epoll(7) page (with help from Davide Libenzi and Chris Heath)
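
For readers who have not yet used the timerfd API listed above, here is a minimal sketch of its basic shape: create a timer file descriptor, arm it, and read(2) the number of expirations. The helper wait_with_timerfd() is invented for this example, and the choice of CLOCK_MONOTONIC and a one-shot timer is just one reasonable set of assumptions.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <string.h>
#include <sys/timerfd.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: arm a one-shot timer for 'delay_ms'
 * milliseconds and block until it expires. Returns the number of
 * expirations read from the file descriptor (1 for a one-shot
 * timer), or -1 on error.
 */
long wait_with_timerfd(long delay_ms)
{
    struct itimerspec its;
    uint64_t expirations;
    ssize_t n;
    int fd;

    /* An it_value of zero would disarm the timer, and the read()
       below would then block forever, so reject it up front. */
    if (delay_ms <= 0)
        return -1;

    fd = timerfd_create(CLOCK_MONOTONIC, 0);
    if (fd == -1)
        return -1;

    memset(&its, 0, sizeof(its));       /* it_interval = 0: one-shot */
    its.it_value.tv_sec = delay_ms / 1000;
    its.it_value.tv_nsec = (delay_ms % 1000) * 1000000L;

    if (timerfd_settime(fd, 0, &its, NULL) == -1) {
        close(fd);
        return -1;
    }

    /* read() blocks until the timer has expired at least once. */
    n = read(fd, &expirations, sizeof(expirations));
    close(fd);

    return (n == (ssize_t) sizeof(expirations)) ? (long) expirations : -1;
}
```

Because the timer is an ordinary file descriptor, it can also be monitored with select(2), poll(2), or epoll(7) alongside other descriptors.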

Tuesday, February 12, 2008

man-pages-2.78 is released

I've uploaded man-pages-2.78 into the release directory (or view the online pages).

Notable changes in man-pages-2.78 are new pages describing the eventfd(2) and signalfd(2) system calls, and a substantial addition to open(2) by Greg Banks describing the semantics and use of O_DIRECT.

Thursday, January 31, 2008

man-pages-2.77 is released

I've uploaded man-pages-2.77 into the release directory (or view the online version, which now includes facilities for searching the pages).

Look here for a list of changes in this release.

Monday, January 14, 2008

man-pages-2.76 is released

I've uploaded man-pages-2.76 into the release directory (or view the online pages).

Changes include rewrites of (parts of) the gettid(2), pipe(2), and umask(2) pages.

Tuesday, January 8, 2008

man-pages-2.75 and man-pages-fr-2.75.0 are released

The Linux man-pages maintainer, Michael Kerrisk, and the maintainer of the French translation of man-pages, Alain Portal, proudly announce the simultaneous release of man-pages-2.75 in English original and French translation.

The English version can be downloaded from the usual release directory (or view the online pages).

The French translation can be downloaded from http://manpagesfr.free.fr/download.html. The translated French pages are viewable online, along with some translated man pages from other projects, at http://manpagesfr.free.fr/consulter.html. Comments and suggestions for the French translation can be sent to manpagesfr AT free DOT fr.

A huge thanks to Alain, who has over the last two and a half years made a terrific effort to bring the French translation fully up to date with the English original.