Monday, June 30, 2008

What's wrong with kernel-userland interface development?

So I came to write the man pages for utimensat() and futimens(), interfaces that are specified in the upcoming POSIX.1 revision, and recently added to Linux (kernel 2.6.22, glibc 2.6), and it's a familiar story: bugs, and yet more bugs.

This case is a little worse than usual (the gory details are below), but hardly exceptional: it was a similar story with other recent APIs, such as splice(), signalf(), and the timerfd API (the last of which was ultimately redesigned after some pushing by me). In each case, an interface that was already released in a stable kernel turned out to have easy-to-find bugs when I came to document and test it. All that was required was to come up with something like a reasonable written specification (i.e., a man page), and then start testing against that specification.

The following seems (to me) like a reasonable development model for new kernel userland APIs:

  1. Write a more or less complete design specification (preferably something like a man page, which I'm more than happy to edit and review at this stage in the process), and perhaps even code up an initial version of the interface.

  2. Get review comments about the design; revise the design if necessary.

  3. Implement something like a final version of the interface.

  4. Test the interface. Preferably: test in collaboration with other people who had nothing to do with the implementation (they will think about the interface in different ways from the implementer). If it's sensible to do so (and usually it is), produce a test suite to check the correctness of as many operational cases of the interface as possible (and then send that suite to the Linux Test Project).

  5. Write the interface documentation (i.e., a man page) for userland programmers, and send it to me for editing and review. (If you do this smart, then you can just recycle the work from the first step.)

  6. Get the interface accepted into mainline.
However, as far as I can see, the actual development model for kernel-userland interfaces seems to work something like this:
  1. Code up the new interface.

  2. Post the code to LKML, explaining that "I tested it, and it works", and, maybe, include some brief documentation of the interface.

  3. Often, someone looks at the code and suggests some fixes. Usually, no one else does any testing.

  4. Code gets accepted into mainline.

  5. Usually, some few dot releases later: someone (often me) who had nothing to do with the earlier steps writes the man page.

  6. Often, some few dot releases later: find and fix all of the bugs (e.g., in the case of utimensat(), released for 2.6.22, the fixes will only make it into 2.6.26 at the earliest).

  7. Occasionally, some few dot releases later: realize that the interface could have been better designed, but it's too late now to change it, because we can't break the ABI.
Most kernel-userland interfaces are seeing far too little testing before release. This hurts because:
  • Userland programmers end up dealing with kernel bugs they shouldn't have to deal with. (When this happens often, it can damage the reputation of the Linux kernel-userland interface, and userland programmers may become wary of adopting new interfaces, which can further delay any kind of real-world testing and adoption of the interface.)

  • Code that uses new interfaces may have to special case for ABI bugs that were present in the first few kernel releases that contained the bug.
Similarly, many kernel-userland interfaces could do with more design review as well. This can hurt even more than interfaces released with bugs, since, once an interface is released, it is at least difficult, but often impossible to modify the design, since doing so would break the kernel-userland ABI. (With timerfd, we got lucky: the interface was in any case broken by a bug, which gave a more or less blank slate for a redesign.) As a result, userland programmers usually just have to live with the bad design.

I'll have more time for testing and review nowadays, and hopefully I'll catch more problems before they are released into stable kernels, but there's certainly more than one person can do. And, as described above, there is a culture problem when it comes to adding kernel-userland interfaces on Linux. The question is whether the culture can be changed. (And here its worth repeating a point that I often make: one could perhaps try and blame individual developers for buggy interfaces, but that doesn't really get to the core issue (everyone is going to write bugs): the real problem is the process by which kernel-userland interface changes are accepted into the kernel.)

The Gory Details
For those who are interested, here are the problems I found with utimensat() and futimens(), explained by annotating the current draft of the man page.

SYNOPSIS
    #include <sys/stat.h>
 
    int utimensat(int dirfd, const char *pathname,
                  const struct timespec times[2], int flags);
 
    int futimens(int fd, const struct timespec times[2]);
 
DESCRIPTION
    utimensat() and futimens() update  the  timestamps  of  a
    file  with nanosecond precision.  This contrasts with the
    historical utime(2) and utimes(2), which permit only sec-
    ond and microsecond precision, respectively, when setting
    file timestamps.

This is the first of the advantages of the new interfaces: greater precision for setting timestamps. (The next revision of POSIX.1 specifies nanosecond file timestamps, and extends stat(2) to allow their retrieval. File system support is needed, and can't necessarily be retrofitted to some older file systems. Currently, Linux supports nanosecond timestamps on XFS, JFS, and ext4.)

    With utimensat() the file is specified via  the  pathname
    given in pathname.  With futimens() the file whose times-
    tamps are to be updated is specified  via  an  open  file
    descriptor, fd.
 
    For  both calls, the new file timestamps are specified in
    the array times: times[0] specifies the new "last  access
    time" (atime); times[1] specifies the new "last modifica-
    tion time" (mtime).  Each of the elements of times speci-
    fies  a  time  in seconds and nanoseconds since the Epoch
    (00:00:00, 1 Jan 1970, UTC), in a structure of  the  fol-
    lowing form:
 
        struct timespec {
            time_t tv_sec;        /* seconds */
            long   tv_nsec;       /* nanoseconds */
        };
 
    If  the  tv_nsec  field of one of the timespec structures
    has the special value UTIME_NOW, then  the  corresponding
    file  timestamp  is  set  to  the  current  time.  If the
    tv_nsec field of one of the timespec structures  has  the
    special  value  UTIME_OMIT,  then  the corresponding file
    timestamp is left unchanged.  In both of these cases, the
    value of the corresponding tv_sec field is ignored.

These are the other advantages of the new interfaces.

With utime(2) and utimes(2), to change just one of the timestamps, we must make a call to stat(2) to retrieve the current timestamps, use one of the timestamps to initialize the times element that we don't want to change, and then call utimensat() with the desired value for the other timestamp. This can be subject to race conditions: if another process updates the file timestamps between the two calls, then that update will be lost. UTIME_OMIT allows us to avoid this problem.

UTIME_NOW exists mainly as a convenience: it allows us to avoid fetching the current time (using gettimeofday(2) or similar) in order to set a file timestamp to "now".

    If  times  is  NULL,  then both timestamps are set to the
    current time.
 
Permissions requirements
    To set both file timestamps to the  current  time  (i.e.,
    times is NULL, or both tv_nsec fields specify UTIME_NOW),
    either:
 
    1. the caller must have write access to the file;
 
    2. the caller's effective user ID must match the owner of
       the file; or
 
    3. the caller must have appropriate privileges.
 
    To  make any change other than setting both timestamps to
    the current time (i.e.,  times  is  not  NULL,  and  both
    tv_nsec  fields are not UTIME_NOW and both tv_nsec fields
    are not UTIME_OMIT), either condition 2 or 3  above  must
    apply.
 
    If  both tv_nsec fields are specified as UTIME_OMIT, then
    no file ownership or permission checks are performed, and
    the  file  timestamps  are  not modified, but other error
    conditions may still be detected.
 
utimensat() specifics
    [Details of dirfd and flags arguments omitted.]

For details of the dirfd argument, see openat(2).

RETURN VALUE
    On success, utimensat()  and  futimens()  return  0.   On
    error,  -1  is  returned and errno is set to indicate the
    error.
 
ERRORS
    [Various other errors that are irrelevant  to  this  post
    have been omitted.]
 
    EACCES times   is   NULL,  or  both  tv_nsec  values  are
           UTIME_NOW, but the effective effective ID  of  the
           caller  does  not match the owner of the file, the
           caller does not have write access to the file, and
           the caller is not privileged (Linux: does not have
           either  the  CAP_FOWNER  or  the  CAP_DAC_OVERRIDE
           capability).
 
    EACCES times   is   NULL,  or  both  tv_nsec  values  are
           UTIME_NOW, and the file is marked immutable.
 
    EPERM  The caller attempted to change one or both  times-
           tamps  to  a value other than the current time, or
           to change one of the  timestamps  to  the  current
           time  while leaving the other timestamp unchanged,
           (i.e., times is not NULL, both tv_nsec fields  are
           not  UTIME_NOW,  and  both  tv_nsec fields are not
           UTIME_OMIT) but the  caller's  effective  user  ID
           does  not  match the owner of file, and the caller
           is  not  privileged  (Linux:  does  not  have  the
           CAP_FOWNER capability).
 
    EPERM  times  is not NULL, and the file is marked append-
           only or immutable.
 
NOTES
    On Linux, futimens() is a library function implemented on
    top of the utimensat() system call.  To support this, the
    Linux utimensat() system call implements  a  non-standard
    feature:  if pathname is NULL, then the system call modi-
    fies the timestamps of the open file referred to  by  the
    file  descriptor  dirfd  (which  may refer to any type of
    file).  Using this feature, the call  futimens(fd, times)
    is implemented as:
 
        utimensat(fd, NULL, times, 0);
 
BUGS
    Several  bugs  afflict utimensat() and futimens() on ker-
    nels before 2.6.??.  

Really, there should have been a test suite to go along with the initial implementation. There wasn't, unfortunately, so I wrote one (later revised a little before it went to LTP) whose results can be seen here.

I've posted patches to fix all of the bugs, and Andrew Morton has accepted them into -mm, and from there, they seem to have gone upstream to Al Viro. It's just a question of when they will get pushed into mainline. It'd be nice if they make it into 2.6.26, but maybe they won't, since it is already getting very late in the -rc cycle. (Update, 6 Jul 08: the patches have gone into -rc9.)

                         These bugs  are  either  non-confor-
    mances  with the POSIX.1 draft specification or inconsis-
    tencies with historical Linux behavior.
 
    * POSIX.1 specifies that if one of the tv_nsec fields has
      the  value  UTIME_NOW  or UTIME_OMIT, then the value of
      the  corresponding  tv_sec  field  should  be  ignored.
      Instead,  the  value of the tv_sec field is required to
      be 0 (or the error EINVAL results).   

This was a simple and obvious divergence from the specification.

    * Various bugs mean that for the purposes  of  permission
      checking, the case where both tv_nsec fields are set to
      UTIME_NOW isn't always treated the same  as  specifying
      times  as NULL, and the case where one tv_nsec value is
      UTIME_NOW and the other is UTIME_OMIT isn't treated the
      same  as  specifying  times as a pointer to a structure
      containing arbitrary time values.  As a result, in some
      cases:  a)  file timestamps can be updated by a process
      that shouldn't have permission to perform  updates;  b)
      file  timestamps  can't  be  updated  by a process that
      should have permission to perform updates; and  c)  the
      wrong errno value is returned in case of an error.   

There are multiple error cases here:
  • If one of the tv_nsec fields is UTIME_OMIT and the other is UTIME_NOW, then the error EPERM should occur if the process's effective user ID does not match the file owner and the process is not privileged. Instead, the call successfully changes one of the timestamps.
  • If the file is not writable by the effective user ID of the process and the process's effective user ID does not match the file owner and the process is not privileged, and times is NULL, then the error EACCES results. This error should also occur if times points to a structure in which both tv_nsec fields are UTIME_NOW. Instead the call succeeds.
  • If a file is marked as append-only (see chattr(1)), then Linux traditionally (i.e., utime(2), utimes(2)), permits a NULL times argument to be used in order to update both timestamps to the current time. For consistency, utimensat() and futimens() should also produce the same result when given a times argument that points to a structure in which both tv_nsec fields are UTIME_NOW. Instead, the call fails with the error EPERM.
  • If a file is marked as immutable (see chattr(1)), then Linux traditionally (i.e., utime(2), utimes(2)), gives an EACCES error if times is NULL. For consistency, utimensat() and futimens() should also produce the same result when given a times that points to a structure in which both tv_nsec fields are UTIME_NOW. Instead, the call fails with the error EPERM.
    * POSIX.1  says  that  a process that has write access to
      the file can make a call with times as  NULL,  or  with
      times  pointing  to  a  structure in which both tv_nsec
      fields are UTIME_NOW,  in  order  to  update  the  both
      timestamps  to  the  current time.  However, futimens()
      instead checks whether the  access  mode  of  the  file
      descriptor  allows  writing.

This means that a process with a file descriptor that allows writing could change the timestamps of a file for which it does not have write permission; conversely, a process with a read-only file descriptor won't be able to update the timestamps of a file, even if it has write permission on the file.

Updated, 2012-03-06: Changed link to tarball containing tests

Wednesday, June 25, 2008

man-pages-3.01 is released

I've uploaded man-pages-3.01 into the release directory (or view the online pages).

Notable changes in man-pages-3.01 are:

  • A new hostname(7) page (based on the FreeBSD page) describing hostname resolution.
  • A new symlink(7) page (heavily modified from the original FreeBSD page) describing symbolic links.
  • A rewritten acct(5) page providing much more detail on the information written to the process accounting file.
  • The getrlimit(2) page adds a description of RLIMIT_RTTIME limit, new in Linux 2.6.25.
  • The mkstemp(3) page adds a description of mkostemp(), new in glibc 2.7.
  • The core(5) page adds a description (and an example program for) of the core_pattern pipe syntax, which appeared in Linux 2.6.19, and documents /proc/PID/coredump_filter, new in kernel 2.6.23.
  • The proc(5) page adds details for a number of previously undocumented /proc interfaces, including /proc/PID/oom_score, /proc/PID/oom_adj, /proc/PID/limits, /proc/PID/fdinfo/*, /proc/PID/mountinfo, /proc/PID/mountstats, and /proc/PID/status.
  • The time(7) page enhances the discussion of jiffies, and adds a section on high-resolution timers.
  • The unix(7) page adds a clear description of the three types of address that can appear in the sockaddr_un structure: pathname, unnamed, and abstract.
  • Many other pages saw significant changes, including brk(2), chmod(2), chown(2), nanosleep(2), open(2), sched_setscheduler(2), syscalls(2), ftime(3), getaddrinfo(3), inet(3), inet_pton(3), scanf(3), strerror(3), random(4), and locale(5).

Tuesday, June 17, 2008

man-pages-3.00 is released

I've uploaded man-pages-3.00 into the release directory (or view the online pages). Notable changes in man-pages-3.00 are:

  • The POSIX man pages (man pages sections 0p, 1p, 3p) are now split out into a separate package, which can be downloaded here. This makes sense because the POSIX pages are logically separate: they are copies of specifications from the POSIX.1 standard (to be precise, we are currently redistributing pages from the 2003 Technical Corrigendum 1), that we have kindly been given permission to publish by the IEEE and The Open Group. Furthermore, the POSIX pages are only updated rarely (e.g., occasional formatting fixes, and periodic updates when a new version of the standard is released, such as the revision due to appear later this year), so there is no need to redistribute them with each release of man-pages.

  • After Stuart Brady noted that the quotes around single-quoted characters were not formatted well in UTF-8 output, I've globally replaced the characters used in the source files for quoting in man-pages in order to try and produce better formatted output for both ASCII and UTF-8 xterms, and also to bring greater consistency in general to the use of single and double quotes in the pages. This may not be the final word on the subject, but at least things should be somewhat improved.

Monday, June 9, 2008

man-pages is now supported

Last month I started on what is, for the moment, my dream job: man-pages finally has a paid, full-time maintainer, thanks to a fellowship from the Linux Foundation. For the foreseeable future, that means I'll be working on:

  • Documenting every new Linux kernel-userland (and glibc) API, and every API change, that is released into the mainline kernel, ideally before actual release. (That's the ideal, but there's a quite a backlog, so I'm not going to achieve the ideal immediately.)

  • Testing new APIs, again ideally before they are released into the mainline kernel, and probably doing some light bug fixing while I'm at it.

  • Design review of new kernel-userland APIs.

  • And of course accepting patches and dealing with bug reports for existing man pages.

Other than that, I'll be helping out with in-kernel documentation (the /Documentation directory) by providing editorial, review, and possibly other assistance.

When I find time, I'll try to post a longer job description (but have a look here in the meantime). In the meantime, I'm looking forward to going to the Linux Plumbers Conference, September 17-19, in Portland, OR, USA, where there is currently a planned track on kernel-userland APIs.

Friday, June 6, 2008

man-pages-2.80 and man-pages-fr-2.80.0 are released

The Linux man-pages maintainer, Michael Kerrisk, and the maintainer of the French translation of man-pages, Alain Portal, proudly announce the simultaneous release of man-pages-2.80 in English original and French translation.

The English version can be downloaded from the usual release directory (or viewed online at the usual location).

The French translation can be downloaded from http://manpagesfr.free.fr/download.html, or viewed online at http://manpagesfr.free.fr/consulter.html.

There is long list of changes in man-pages-2.80. Among the more notable changes are the following:

  • A new random_r(3) page describes random_r(3), srandom_r(3), initstate_r(3), and setstate_r(3), the reentrant equivalents of random(3), srandom(3), initstate(3), and setstate(3).
  • getpriority(2) adds text describing the punchier effect of nice values since kernel 2.6.23.
  • mmap(2) describes improvements for the support of MAP_POPULATE in kernel 2.6.23.
  • sched_setscheduler(2) adds a description of the SCHED_IDLE policy, new in 2.6.23.
  • futimes(3) adds a description of lutimes(3), which was added in glibc 2.6
  • credentials(7) adds some text describing how NPTL maintains the same UIDs and GIDs for all threads in a process, even though the Linux kernel allows each thread to have distinct credentials.
  • inotify(7) documents the SIGIO support that was added for inotify in kernel 2.6.25.
  • standards(7) adds a description of the upcoming POSIX.1 revision.
  • A large number of consistency and formatting fix-ups were made to many pages.