Thursday, October 30, 2008

Recent changes in file descriptor system calls

In recent Linux kernels, especially 2.6.27, a number of system calls have changed, or new versions of existing system calls have been added, to allow more control over the file descriptors created by those system calls. (Most of this work has been done by Ulrich Drepper.) These changes have taken the form of either adding new bits to the flags bit-mask argument of an existing system call, if it had such an argument, or creating a new version of the system call that adds an extra flags argument. In most cases, two new flags have been added: a close-on-exec flag, and a non-blocking flag, which we describe shortly.

The changes are summarized in the table below. In this table, the Kernel column indicates the kernel version where the change occurred, and the Glibc column indicates the version of glibc that adds the corresponding wrapper functions and/or header file definitions. (Note: glibc 2.9 is not yet released.)

| Interface | Changes | Kernel | Glibc | Notes |
|---|---|---|---|---|
| open(2) | New flag: O_CLOEXEC | 2.6.23 | 2.7 | Flag also supported for openat(2). These syscalls already supported O_NONBLOCK. |
| fcntl(2) | New flag: F_DUPFD_CLOEXEC | 2.6.24 | 2.7 | Performs a similar task to dup3(2) |
| recvmsg(2) | New flag: MSG_CMSG_CLOEXEC | 2.6.23 | 2.7 | - |
| dup3(2) | New syscall, like dup2(2), but adds flags argument (O_CLOEXEC) | 2.6.27 | 2.9 | Requires new glibc interface |
| pipe2(2) | New syscall, like pipe(2), but adds flags argument: O_CLOEXEC, O_NONBLOCK | 2.6.27 | 2.9 | Requires new glibc interface |
| socket(2) | New flags in type argument: SOCK_CLOEXEC, SOCK_NONBLOCK | 2.6.27 | 2.9 | - |
| socketpair(2) | New flags in type argument: SOCK_CLOEXEC, SOCK_NONBLOCK | 2.6.27 | 2.9 | - |
| epoll_create1(2) | New syscall, like epoll_create(2), but adds flags argument: EPOLL_CLOEXEC; the new system call drops epoll_create()'s obsolete size argument | 2.6.27 | 2.9 | Requires new glibc interface |
| inotify_init1(2) | New syscall, like inotify_init(2), but adds flags argument: IN_CLOEXEC, IN_NONBLOCK | 2.6.27 | 2.9 | Requires new glibc interface |
| eventfd2(2) | New syscall, like eventfd(2), but adds flags argument: EFD_CLOEXEC, EFD_NONBLOCK | 2.6.27 | 2.9 | The glibc eventfd() wrapper already allowed a flags argument, so no new wrapper is required |
| signalfd4(2) | New syscall, like signalfd(2), but adds flags argument: SFD_CLOEXEC, SFD_NONBLOCK | 2.6.27 | 2.9 | The glibc signalfd() wrapper already allowed a flags argument, so no new wrapper is required |
| timerfd_create(2) | New flags: TFD_CLOEXEC, TFD_NONBLOCK | 2.6.27 | 2.9 | - |
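To make the "new flags in type argument" entries for socket(2) and socketpair(2) concrete, here is a minimal sketch that ORs both flags into the type argument and then checks the result with fcntl(2). The helper names (make_flagged_socketpair(), fd_has_cloexec(), fd_is_nonblocking()) are made up for this illustration, and the code assumes a kernel and glibc new enough to define SOCK_CLOEXEC and SOCK_NONBLOCK.

```c
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: create a connected pair of UNIX domain sockets
   with both new flags ORed into the type argument. Returns 0 on
   success, -1 on error (errno set by socketpair()). */
int make_flagged_socketpair(int sv[2])
{
    return socketpair(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC | SOCK_NONBLOCK,
                      0, sv);
}

/* Returns 1 if the close-on-exec flag is set on fd, 0 if it is not,
   and -1 on error. This is a per-descriptor flag (F_GETFD). */
int fd_has_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags == -1)
        return -1;
    return (flags & FD_CLOEXEC) != 0;
}

/* Returns 1 if O_NONBLOCK is set on the open file description that fd
   refers to, 0 if it is not, and -1 on error (F_GETFL). */
int fd_is_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    return (flags & O_NONBLOCK) != 0;
}
```

Both sockets in the pair should come back with both flags already set, with no follow-up fcntl() calls needed.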


An analogous change was proposed for accept(2): paccept(), which supported the SOCK_CLOEXEC and SOCK_NONBLOCK flags and took a signal mask argument that was treated as in pselect(2). That proposal was debated and then spent some time in limbo, but it has recently re-emerged in a somewhat modified form, accept4() (which was in fact the original proposal), and will probably go into Linux 2.6.28 or 2.6.29.

Perhaps one day there might even be an analogous change for mq_notify(3), since (on Linux, but not on most other systems) a message queue descriptor is really just a file descriptor.

The close-on-exec flag (*_CLOEXEC)

The addition of a close-on-exec flag was the primary motivator for the system call changes. Specifying this flag causes the file descriptor created by the system call to automatically have its close-on-exec flag set. (This flag causes the file descriptor to automatically be closed if the process does a successful execve(2).)

Before the existence of this flag, the close-on-exec flag of a file descriptor could be set only after the descriptor had been created, using the fcntl(2) F_GETFD and F_SETFD operations. The two additional system calls were less of a problem than the fact that setting the flag on a new file descriptor took multiple (non-atomic) steps. This opened up race conditions in multithreaded programs, where one thread was trying to set a file descriptor's close-on-exec flag at the same time as another thread was performing a fork() plus execve(). Ulrich Drepper explains the resulting security issues in more detail.

The non-blocking flag (*_NONBLOCK)

The *_NONBLOCK flag causes the non-blocking flag to be set on the open file description associated with the new file descriptor. (For a discussion of the relationship of a file descriptor to an open file description, see the open(2) man page.)

Unlike the *_CLOEXEC flag, the *_NONBLOCK flag exists merely as a convenience: it saves two system call operations (fcntl(2) F_GETFL and F_SETFL) if we want to immediately set the non-blocking flag when opening a file descriptor.
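As an illustration of the convenience, the sketch below creates a non-blocking eventfd both ways. The helper names are made up for this example, and it assumes a glibc whose eventfd() wrapper accepts EFD_NONBLOCK.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Older approach (hypothetical helper): create the descriptor, then
   use two fcntl() calls to turn on O_NONBLOCK. */
int eventfd_nonblock_legacy(void)
{
    int fd = eventfd(0, 0);
    if (fd == -1)
        return -1;

    int flags = fcntl(fd, F_GETFL);
    if (flags == -1 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Newer approach (hypothetical helper): one call does everything. */
int eventfd_nonblock_direct(void)
{
    return eventfd(0, EFD_NONBLOCK);
}

/* Returns 1 if a read() from fd fails with EAGAIN, which is what a
   non-blocking eventfd does while its counter is zero; 0 otherwise. */
int read_would_block(int fd)
{
    uint64_t value;
    ssize_t n = read(fd, &value, sizeof(value));
    return n == -1 && errno == EAGAIN;
}
```

Since the counter starts at zero in both cases, a read() from either descriptor should fail immediately with EAGAIN rather than block.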

Note that there deliberately is no *_NONBLOCK flag for dup3(2). This would not be sensible, since the new file descriptor shares an open file description with the old file descriptor.

There is also deliberately no *_NONBLOCK flag for epoll_create1(2), since equivalent functionality can be obtained by passing a zero timeout to epoll_wait(2).
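A minimal sketch of the zero-timeout approach, using a made-up helper name: the epoll instance is created with EPOLL_CLOEXEC, and a timeout of 0 makes epoll_wait(2) return at once instead of blocking.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: create an (empty) epoll instance and check it
   once without blocking. Returns the number of ready events, or -1
   on error. */
int epoll_check_once(void)
{
    int epfd = epoll_create1(EPOLL_CLOEXEC);
    if (epfd == -1)
        return -1;

    struct epoll_event events[1];

    /* A timeout of 0 means "return immediately, even if no events
       are ready". */
    int nready = epoll_wait(epfd, events, 1, 0);

    close(epfd);
    return nready;
}
```

Because nothing has been added to the interest list, the call should return 0 straight away.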

Other flags?

The flags argument added for the new system calls allows for other kinds of functionality to be added to these system calls in the future.

Future standards?

Ulrich Drepper already did some work on getting some of these interface changes into the POSIX.1-2008 standard, which includes specifications of the O_CLOEXEC flag for open() and the F_DUPFD_CLOEXEC operation for fcntl(). In the future, some of the other changes may also make their way into the standard.

A note on the new system call names

The numbers in the names of the new system calls refer to the number of arguments that each system call has. This is an extension of a convention that was used for some existing Unix system calls, notably dup2(2), wait3(2), and wait4(2). Note that while the wrapper function for signalfd(2) has three arguments, the underlying signalfd4() system call really does have four arguments, as described in the man page. (However, this suggests that, in the end, this naming scheme might not have been the best choice.)

man-pages-3.12 is released

I've uploaded man-pages-3.12 into the release directory (or view the online pages). Notable changes in man-pages-3.12 are:


Tuesday, October 7, 2008

man-pages-3.11 is released

I've uploaded man-pages-3.11 into the release directory (or view the online pages). Notable changes in man-pages-3.11 are:

  • A new umount(2) page has been created by splitting the umount() and umount2() material out of the old mount(2) page.
  • The mount(2) page adds a description of per-process namespaces.
  • Various fixes and improvements in getdents(2), including the addition of an example program.
  • Many improvements and additions in signal(7), the page that provides an overview of signals on Linux.
  • Numerous fixes to many other pages.