Austin Group Defect Tracker

Aardvark Mark IV

ID 0000768
Category [1003.1(2013)/Issue7+TC1] System Interfaces
Severity Editorial
Type Enhancement Request
Date Submitted 2013-10-11 13:24
Last Update 2015-03-26 13:21
Reporter jlayton
View Status public
Assigned To
Priority normal
Resolution Open
Status New
Name Jeff Layton
Organization Red Hat
User Reference
Section fcntl
Page Number 814-815
Line Number
Interp Status ---
Final Accepted Text
Summary 0000768: add "fd-private" POSIX locks to spec
Description At this year's Linux Storage and Filesystem summit, there was a lively discussion about what implementors of userland fileservers need. One of the things brought up was the problematic behavior of POSIX locks when a file is closed. The existing spec says:

"All locks associated with a file for a given process shall be removed when a
 file descriptor for that file is closed by that process or the process holding
 that file descriptor terminates."

The problem here is that userland programs may need to open more than one file
descriptor on a file, but they have to keep track of them and refrain from
closing *any* of them until they know that it's ok to release any locks held on
them. This behavior surprises most people when they first encounter it, and it greatly complicates writing certain types of userland programs.
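To illustrate the behavior described above, here is a minimal sketch in C. The scratch file path and the function names are made up for this example. It takes a classic F_SETLK write lock through one descriptor, closes a second descriptor that was never used for locking, and uses a forked child to check whether the lock is still held.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that tries to write-lock the whole file at `path`.
 * Returns 1 if the child got the lock, 0 if it was refused, -1 on error.
 * A separate process is needed because classic locks held by a process
 * never conflict with that same process. */
static int another_process_can_lock(const char *path)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
        int fd = open(path, O_RDWR);
        if (fd < 0)
            _exit(2);
        _exit(fcntl(fd, F_SETLK, &fl) == 0 ? 0 : 1);
    }
    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status) ||
        WEXITSTATUS(status) == 2)
        return -1;
    return WEXITSTATUS(status) == 0;
}

/* Returns 1 if closing a second, unrelated descriptor released the lock. */
int classic_lock_lost_on_close(void)
{
    char path[] = "/tmp/posix-lock-demo-XXXXXX";  /* hypothetical scratch file */
    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };  /* l_len 0: whole file */

    int locked_fd = mkstemp(path);
    int other_fd = open(path, O_RDONLY);
    assert(locked_fd >= 0 && other_fd >= 0);

    int rc = fcntl(locked_fd, F_SETLK, &fl);
    assert(rc == 0);

    int before = another_process_can_lock(path);  /* lock held: child is refused */
    close(other_fd);                              /* never used for locking */
    int after = another_process_can_lock(path);   /* lock gone: child succeeds */

    close(locked_fd);
    unlink(path);
    return before == 0 && after == 1;
}
```

On an implementation that follows the quoted text, the function should return 1: the lock taken through the first descriptor disappears as soon as the second one is closed.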
Desired Action I've posted an initial rough draft of a patchset for Linux that addresses this by adding two new struct flock.l_type values, F_RDLCKP and F_WRLCKP. In addition to adding this to Linux, I'd like to have it considered for adoption in the formal POSIX spec, since I'm fairly sure this is an issue on other POSIX-y OSes too.

The initial patch posting is here: [^]

...and the semantics for the new lock types are laid out there. The basic idea is to have them behave exactly like any other F_RDLCK or F_WRLCK lock, except that they should only be released on close when the file descriptor on which they were acquired is closed. Closing other file descriptors shouldn't affect them.

This changes how lock merging works a bit. The 'P' lock types can't be merged with 'non-P' types since they have different semantics when a fd is closed. Similarly, 'P' locks acquired on different file descriptors can't be merged either, since they need to be released separately on close.

At this point, the draft patchset is quite rough and I expect it'll need changes in response to review and comments. In the meantime, I'd like to also open the floor for other POSIX players to add feedback on this while I'm trying to finalize the interface.
Tags No tags attached.
Attached Files

- Relationships
related to 0000824 (Closed): children should not inherit fcntl file locks from parent calling posix_spawn

-  Notes
jilles (reporter)
2013-10-11 15:27

Another file locking API available on Linux and BSDs is flock(), originally from 4.2BSD. Although it only allows locking whole files (no byte-ranges), it has much better semantics: locks are tied to open file descriptions instead of processes. As a result, not only can processes open and close a file another time without affecting the lock, but a child process can inherit a lock from its parent and mutual exclusion works between threads in the same process if they each open the file themselves.

If fcntl(F_GETLK) returns a lock set with flock() (this is possible on FreeBSD but Linux man pages say it is not possible on Linux because fcntl() and flock() locks are independent), then the l_pid member cannot be a process ID because such a lock is not held by a process as such. FreeBSD sets it to -1 in that case.

The proposed F_RDLCKP and F_WRLCKP types seem to address the close issue only.

So I suggest standardizing flock() and/or a fcntl()-based locking mechanism with the same semantics as flock().
jlayton (reporter)
2013-10-11 15:45

The close() issue is the main one I'm trying to address.

Standardizing flock() wouldn't help much since it doesn't allow you to lock byte-ranges. I've no objection to standardizing that, but it's sort of orthogonal to the problem at hand.

Linux could be altered to return a flock() lock on F_GETLK, but it historically hasn't since flock and POSIX locks operate entirely independently. I don't think that would really buy us much. Do flock and fcntl locks conflict with one another on BSD as well?

The question of inheritance of a lock is more interesting however. It's not 100% clear to me what sort of semantics the people asking for this actually want in that respect, so I'll look into that.

I'm inclined to steer away from introducing an entirely new locking mechanism however, since I think we want these new lock types to conflict with "legacy" POSIX locks. IOW, programs written to use the older F_RDLCK/F_WRLCK types should continue to work in parallel with ones written for the new ones and provide the same exclusion guarantees. We could still do that with a new interface, but I worry that it'll be more confusing for developers.
jlayton (reporter)
2013-11-08 12:18

Ok, after looking at this, I think you're correct that emulating flock()'s semantics with respect to close and inheritance is the right thing to do. I've got a new draft patchset that does this.

It does mean that I'll need a new F_UNLCKP variant flag if we stick with the F_SETLK interface, but I think that's a reasonable addition anyway.
Don Cragun (manager)
2013-11-14 16:17

This was discussed during the November 14, 2013 conference call.

Although we agree that a new lock type might be a good idea, before we could consider anything, we would need actual changes to be added to the text to define the behavior of this feature. Furthermore, there would have to be an existing implementation that is shipping as part of a product. This bug will be kept open for a while as a placeholder for such a specification.
jlayton (reporter)
2013-12-10 19:59

Thanks for the consideration so far. I'm still pushing forward with proposed patches on the Linux mailing lists. Once I have the semantics a bit more settled, I'll see about writing up the actual changes to the text.
jlayton (reporter)
2013-12-12 12:05

First rough draft of an update to the text in the spec. This just covers the F_SETLK piece. We may need some updates to the F_GETLK piece. As a side note, documenting this makes it clear that the F_SETLK documentation is pretty complex. Would we be better served by doing this with new cmd values instead, i.e. F_GETLKP and F_SETLKP? Then we could just reuse the existing definitions for F_RDLCK and F_WRLCK.



...existing text:

    Set or clear a file segment lock according to the lock description
pointed to by the third argument, arg, taken as a pointer to type struct
flock, defined in <fcntl.h>. F_SETLK can establish shared (or read)
locks (F_RDLCK) or exclusive (or write) locks (F_WRLCK), as well as to
remove either type of lock (F_UNLCK). F_RDLCK, F_WRLCK, and F_UNLCK are
defined in <fcntl.h>. If a shared or exclusive lock cannot be set,
fcntl() shall return immediately with a return value of -1.

...new text:


    Set or clear a file segment lock according to the lock description
pointed to by the third argument, arg, taken as a pointer to type struct
flock, defined in <fcntl.h>. F_SETLK can establish shared (or read)
locks (F_RDLCK and F_RDLCKP) or exclusive (or write) locks (F_WRLCK and
F_WRLCKP), as well as to remove either type of lock (F_UNLCK and
F_UNLCKP). F_RDLCK, F_RDLCKP, F_WRLCK, F_WRLCKP, F_UNLCK, and F_UNLCKP
are defined in <fcntl.h>. If a shared or exclusive lock cannot be set,
fcntl() shall return immediately with a return value of -1. Locks set
with F_RDLCK and F_WRLCK can only be unset with F_UNLCK, and locks set
with F_RDLCKP and F_WRLCKP can only be unset with F_UNLCKP.

---------------[part 2]---------------

...existing text:

All locks associated with a file for a given process shall be removed
when a file descriptor for that file is closed by that process or the
process holding that file descriptor terminates. Locks are not inherited
by a child process.

...new text:

All locks associated with a file for a given process that were set with
F_RDLCK or F_WRLCK shall be removed when a file descriptor for that file
is closed by that process or the process holding that file descriptor
terminates. Also those sorts of locks are not inherited by a child process.

Locks set with F_RDLCKP and F_WRLCKP are removed when the last reference
to the open file on which they were set is closed. These locks are
inherited by child processes.
jlayton (reporter)
2013-12-26 12:29

Now that I've looked at the complexity of adding the above text, I think it might be better to implement this with a new set of cmd values instead, a'la:


I've made that change to the Linux implementation of this patchset and am testing it now. I'll plan to write up a documentation update once that's complete.
eblake (manager)
2014-02-27 16:20

During discussion of 0000824, the question was raised whether it might be useful to have read locks be inheritable across fork/posix_spawn, so that a child can be started with ownership of the same sections locked as the parent instead of having to re-grab the lock. This should be considered when adding new lock constructs (and the consideration may be that dropping all locks, including read locks, is still the best policy)
jlayton (reporter)
2014-02-27 18:44

With the current Linux implementation, I've taken the suggestion of jilles above and adopted BSD (flock()) lock semantics with respect to inheritance and on close(). Thus, file-private locks are associated with the open file so any entity that gets a reference to that open file won't need to reset locks.

The article here explains the semantics in more depth: [^]

...though this is not yet merged into the Linux kernel, so the semantics are not yet set in stone.
jlayton (reporter)
2014-04-18 00:01

There is currently a discussion running on several Linux-related mailing lists about what we should call these new locks. The original name I gave them was "file-private" locks but that's not as descriptive as it should be.

The current favorite is "file-description locks" since these follow the open file description, with a corresponding change of the macros to make them more visually distinct:

F_FD_SETLKW

It would be nice to have some input on this front from the "powers that be" at the Austin Group. A link to the discussion on LKML is here: [^]

Please feel free to chime in on the discussion if you can. Unfortunately, the window to rename these is rather short. We only have around 6 weeks before v3.15 of the kernel ships with this feature, and I need to have this fixed well before then.
jlayton (reporter)
2014-12-18 20:57

Sorry for the long delay on this. I had a job change which ended up sidetracking my efforts here. A progress report:

The locks were renamed to "open file description" (OFD) locks, with the constants as:

    #define F_OFD_GETLK 36
    #define F_OFD_SETLK 37
    #define F_OFD_SETLKW 38

The code was merged into v3.15. Both Samba and NFS-Ganesha are looking to use this new facility to simplify their locking code, but both are currently works in progress.
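As a rough illustration of how the new commands behave, here is a minimal Linux-only sketch in C. The function names and scratch path are invented for this example, and the fallback #define simply repeats the Linux value listed above in case the headers do not expose it. It shows that an OFD lock is unaffected by closing an unrelated descriptor, that a second open file description in the same process conflicts with it, and that the lock goes away when the description that owns it is closed.

```c
#define _GNU_SOURCE            /* glibc only exposes F_OFD_* with this defined */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#ifndef F_OFD_SETLK
#define F_OFD_SETLK 37         /* Linux value, as listed above */
#endif

/* Try to write-lock bytes [0, 100) of the file behind fd with an OFD lock.
 * Returns 0 on success, -1 with errno set otherwise. */
static int ofd_try_wrlock(int fd)
{
    struct flock fl = {
        .l_type = F_WRLCK,
        .l_whence = SEEK_SET,
        .l_start = 0,
        .l_len = 100,
        .l_pid = 0,            /* must be zero for the F_OFD_* commands */
    };
    return fcntl(fd, F_OFD_SETLK, &fl);
}

/* Returns 1 if the OFD lock behaved as described in the comments below. */
int ofd_lock_survives_unrelated_close(void)
{
    char path[] = "/tmp/ofd-lock-demo-XXXXXX";  /* hypothetical scratch file */
    int fd1 = mkstemp(path);
    int fd2 = open(path, O_RDWR);
    int fd3 = open(path, O_RDWR);
    assert(fd1 >= 0 && fd2 >= 0 && fd3 >= 0);

    int first = ofd_try_wrlock(fd1);   /* owned by fd1's open file description */
    close(fd2);                        /* would have dropped a classic lock */

    int second = ofd_try_wrlock(fd3);  /* different description: conflicts */
    int second_errno = errno;

    close(fd1);                        /* last reference: lock is released */
    int third = ofd_try_wrlock(fd3);   /* now succeeds */

    close(fd3);
    unlink(path);
    return first == 0 && second == -1 &&
           (second_errno == EAGAIN || second_errno == EACCES) && third == 0;
}
```

This assumes a Linux kernel of v3.15 or later. Note that the conflict between fd1 and fd3 arises inside a single process, which is what makes these locks usable between threads that each open the file themselves.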

When I did the (trivial) patch for glibc, I also added a section to its manual that describes the new locks and their semantics: [^]

I'm still interested in seeing this adopted into the POSIX standard.
olly (reporter)
2015-03-09 02:52
edited on: 2015-03-09 02:52

Xapian [^] 1.2.20 was released on 2015-03-04 and will use F_OFD_SETLKW if available, which avoids having to fork a subprocess to hold the lock (which seems to be the sanest way to protect a lock obtained with F_SETLKW from being inadvertently released).

My feedback from having worked with the new API is that it perfectly addresses the issues we encountered with the current POSIX fcntl() locking. It would be really great to see this included in POSIX.

I also noticed this request for implementing it on FreeBSD: [^]

mtk (reporter)
2015-03-09 08:08

Just by the bye, Jeff Layton and I produced man page text that describes the OFD locks API. If it's useful, I'm sure we(*) could make it available under whatever license Austin needs. The text can be found here: [^]



(*) Speaking for myself here, but I'm pretty sure Jeff would be agreeable also.
ned14 (reporter)
2015-03-09 18:32

@olly: It was me who added the feature request to FreeBSD. Proposed Boost.AFIO should gain portable byte range advisory file locking next release, and on POSIX it currently has to duplicate process-side an implementation of byte range locking to fuse locks from multiple threads into a single set of requests per process. I'd rather have that implementation go away on FreeBSD like it can on Linux, moreover then AFIO doesn't fail when combined with another library which also uses broken POSIX locks.

Regarding the proposal for standardisation, my sole big concern with it is that we really ought to also standardise on a method for asynchronously requesting a lock and being notified when that lock has been granted. Right now, you can only do polling and retry via F_ORD_SETLK or block up a whole dedicated thread using F_ORD_SETLKW. If one could have select() tell us when a lock has been removed for some file offset, that would be enormous.
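For reference, the polling-and-retry approach available today might look roughly like the following Linux-only sketch in C. It uses the F_OFD_SETLK spelling from the Linux headers; the function name, the attempt limit and the backoff values are arbitrary choices made for this illustration.

```c
#define _GNU_SOURCE            /* glibc only exposes F_OFD_* with this defined */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <time.h>
#include <unistd.h>

#ifndef F_OFD_SETLK
#define F_OFD_SETLK 37         /* Linux value */
#endif

/* Poll for an OFD write lock on bytes [start, start + len) of fd.
 * Returns 0 once the lock is held. Returns -1 with errno set if a
 * non-conflict error occurs, or with errno == EAGAIN if the range is
 * still locked elsewhere after max_attempts tries. */
int ofd_wrlock_polling(int fd, off_t start, off_t len, int max_attempts)
{
    struct timespec delay = { .tv_sec = 0, .tv_nsec = 1000000 };  /* 1 ms */

    assert(max_attempts > 0);
    for (int attempt = 0; attempt < max_attempts; attempt++) {
        struct flock fl = {
            .l_type = F_WRLCK,
            .l_whence = SEEK_SET,
            .l_start = start,
            .l_len = len,
        };
        if (fcntl(fd, F_OFD_SETLK, &fl) == 0)
            return 0;
        if (errno != EAGAIN && errno != EACCES)
            return -1;                     /* not a lock conflict */
        nanosleep(&delay, NULL);           /* back off before retrying */
        if (delay.tv_nsec < 64000000)
            delay.tv_nsec *= 2;            /* cap the delay at 64 ms */
    }
    errno = EAGAIN;
    return -1;
}
```

A loop like this wakes up and issues system calls even when nothing has changed, which is exactly the cost a notification mechanism would avoid.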

As one way of doing this (note, I don't necessarily recommend this), right now as per POSIX regular file fds always return ready to read, ready to write and ready with error with select(). What could happen is that if the current file offset in a file descriptor is inside a range locked inside another fd somewhere, select() for reading could block if there were an exclusive lock on that range, and block for writing if there were a non-exclusive lock on that range. The pattern to more efficiently lock could then be:

1. Try F_OFD_SETLK, if success then proceed.

2. Set file offset in fd to start of range we wish to lock, and add to fds being waited upon with select().

3. When select returns ready on our fd, try F_OFD_SETLK again. If success then proceed, else loop to step 2.

As I mentioned, I wouldn't recommend this solution when something kqueues based is enormously superior as you can tell kqueues about entire byte ranges, and it can intelligently notify you when your specifically requested byte range has been granted to you without you needing to repeat F_OFD_SETLK and unnecessarily do many syscalls when none were needed. However POSIX doesn't standardise BSD kqueues, so I suppose the next best alternative is to extend the POSIX aio API with asynchronous byte range locking support. However, the POSIX aio API is not popular, and it seems overkill to require implementations merely needing notification of lock state changes to retool into POSIX aio just for that.

The AWG may be interested in how the NT kernel fares with byte range locking (ref: NtLockFile() [^]). These are fully asynchronous, so you can queue up one or many ranges to lock and it'll signal a win32 event or the HANDLE when you've been awarded the lock. This allows lock waits to be composed into WaitForMultipleObjects, which is the Win32 equivalent to select().

This NT kernel implementation is highly efficient relative to Linux and FreeBSD with O(1) to waiters complexity. See [^] for some benchmarks. Part of that efficiency is an optimal API design. I really wish POSIX, or even any POSIX implementation, had something equivalent.

jlayton (reporter)
2015-03-10 20:00

Yes, I'd definitely be amenable to making the manpage text available for the POSIX spec. There's also a more comprehensive writeup in the glibc manual, and we might be able to make some of that text available as well (I'd have to check with them on that though).

As far as adding an async interface, for Linux we might want to add something to epoll for that. Not a bad idea!

As far as a standard async interface goes though, that's a much taller order. Given that we have a working sync implementation, I'd suggest that we consider that for standardization now and an async interface at a later date.
olly (reporter)
2015-03-17 22:43

Standardising the interface as implemented by Linux (and described above) separately seems the best plan - it's essentially how the fcntl locking API which has long existed really should have worked all along, and it should be easy to update existing code to use it, and also easy to fall back to the older API where the new one isn't yet supported (it certainly was for the Xapian code).
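A fallback of the kind described here could be sketched as follows in C. This is an illustration only and not the Xapian code; the enum and function names are invented. It also assumes that a kernel which does not know the new command rejects it with EINVAL, which is how Linux treats unrecognised fcntl commands.

```c
#define _GNU_SOURCE            /* glibc only exposes F_OFD_* with this defined */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

enum lock_kind { LOCK_FAILED = -1, LOCK_CLASSIC = 0, LOCK_OFD = 1 };

/* Write-lock the whole file behind fd without blocking. Prefer an OFD
 * lock and fall back to a classic per-process lock where the new command
 * is missing at build time or at run time. */
enum lock_kind lock_whole_file(int fd)
{
    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };  /* l_pid stays 0 */

    assert(fd >= 0);
#ifdef F_OFD_SETLK
    if (fcntl(fd, F_OFD_SETLK, &fl) == 0)
        return LOCK_OFD;
    if (errno != EINVAL)
        return LOCK_FAILED;    /* genuine failure, e.g. a conflicting lock */
    /* EINVAL: the headers know the command but the running kernel does not. */
#endif
    return fcntl(fd, F_SETLK, &fl) == 0 ? LOCK_CLASSIC : LOCK_FAILED;
}
```

Callers that get LOCK_CLASSIC back still have to guard against the close() problem themselves, for example by never opening the file a second time in that process.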

Blocking the standardisation of this while people work on an async locking API doesn't seem helpful - this has an existing working implementation which is already widely deployed, whereas there doesn't appear to even be a proposed design for an async interface yet.
ned14 (reporter)
2015-03-19 13:09

@jlayton: I did have a go at sketching out support for byte range lock notification to epoll + eventfd, but eventfd just doesn't transport enough information for it to work, so epoll would need API changes. A suitable enhancement of BSD kqueues is similarly stymied by insufficient fields to send which byte range lock was just granted, though one can encapsulate the struct flock into a following fake struct kevent. It still isn't great.

Can I therefore suggest a small change now which makes async support much easier later? If struct flock additionally had:

1. An intptr_t data
2. A void *udata

... then the caller of F_ORD_SETLK would have the means to set those members of struct kevent, and we thus solve the problem of lock identification for BSD because the lock requester could supply two unique numbers which make sense to itself. I'm thinking on Linux for epoll + eventfd that the intptr_t data value could specify an eventfd to notify and the udata value be returned by eventfd to indicate which lock request was granted. Not as nice as kqueues, but sufficient.

From the point of view of POSIX however, struct flock simply gains two new fields called data and udata. Nothing more is said about them (for now).

@olly: I wasn't advocating blocking fd-private locks awaiting a proper async locking API. I am advocating thinking fd-private locks through before standardisation without an async locking API. Adding two fields to struct flock to help out vendor proprietary async notification systems might be wise.
jlayton (reporter)
2015-03-19 18:50

My thinking was that you could just epoll() on the fd and look for an EPOLLLCK event or something, and then just reattempt getting the lock. That's obviously less cool than having the lock granted to you though.

As far as adding fields to struct flock, won't that cause horrible ABI problems? Think of all of those poor applications already built using struct flock that doesn't have the extra fields.

While I think an async interface is interesting and makes a lot of sense, I don't think we can overload struct flock with that info without causing horrible breakage in a lot of code.

Why not consider adding a new "struct flock_async" or something? a'la:

struct flock_async {
    struct flock fla_flock;
    intptr_t fla_data1;
    void *fla_data2;
};

...that would allow you to do basically what you want, but we wouldn't break the ABI for existing code. You could add that to FreeBSD in the near future, and then ask POSIX to consider adding your interface in the context of a separate request. My understanding is that they prefer to see this sort of thing in shipping code before standardizing it anyway.
ned14 (reporter)
2015-03-19 19:57

Throwing the idea of a new struct type right back at you, surely the struct flock taken by F_ORD_SETLK could be that exact new struct flock2?

If there is any time to break ABI, it's at the point of adding the new API. So, I propose that F_ORD_SETLK fcntl calls don't take the existing struct flock, but a new struct flock_ord:

struct flock_ord {
       short l_type;    /* lock operation type */
       short l_whence;  /* lock base indicator */
       off_t l_start;   /* starting offset from base */
       off_t l_len;     /* lock length; l_len == 0 means
                           until end of file */
       int l_sysid;     /* system ID running process holding lock */
       pid_t l_pid;     /* process ID of process holding lock */
       intptr_t data;   /* requester supplied fd or data identifying lock */
       void *udata;     /* requester supplied data identifying lock */
};

Does this seem reasonable?
jlayton (reporter)
2015-03-19 20:53

FWIW, the constants used in Linux currently have an F_OFD_ prefix, not F_ORD_

Also, looking at the struct above, l_sysid and l_pid don't really make much sense for an OFD lock. These locks are tied to the open file description, not a particular process.

They certainly could be a flock2 struct, but why would we add these to the current synchronous interface? What's the benefit of doing so? I don't think it's a good idea to add extra fields to the argument struct in the absence of a concrete proposal to add such an interface to POSIX.

If we ever did want to add an asynchronous interface, there's no reason that it would need to use the same argument struct as the synchronous one. It would be perfectly fine for the synchronous fcntl-based interface to use plain old struct flock (as the one shipping on Linux does today) and for the new async interface to use struct flock2 (or whatever).
ned14 (reporter)
2015-03-20 00:32

Sorry about the typo. I use my (email) laptop without my glasses, and F's and R's look very similar. Regarding the l_sysid and l_pid, they came from the FreeBSD man page. They are used with F_GETLK. You surprise me that they don't make much sense for an OFD lock as your own man page says l_pid is supplied. I certainly can see the use of using l_pid to decide if a process is a zombie, or to message or signal it somehow.

You raise the very good point that any async interface would necessarily use a different fcntl number, and therefore that different fcntl can take any kind of struct flock it likes. So I guess the existing proposal is fair enough and we can press ahead with standardisation for the proposal as is.

I would encourage you however to consider extending your Linux implementation using the scheme just outlined, though for Linux it's probably more appropriate to extend the Linux event file IO mechanism via additions to [^] where usefully you get a free hand to schedule and cancel advisory locks asynchronously using any custom struct of your choice, so you're not constrained by eventfd et al. Not only would I find async notification of lock grants very useful in Boost.AFIO, but I suspect so would SQLite3, Samba, NFS and plenty of others, anyone who does a lot of file locking. We all still have to write code which deals with the old semantics of course, but being able to let the kernel fairly schedule file locks instead of first past the post makes enormous sense for electricity consumption and performance. And it would bring Linux to feature parity with Windows, always a good thing.

Anyway, thank you for your proposal. It's a huge win for everyone.
mtk (reporter)
2015-03-20 07:00

> Regarding the l_sysid and l_pid, they came from
> the FreeBSD man page. They are used with
> F_GETLK. You surprise me that they don't make
> much sense for an OFD lock as your own man page
> says l_pid is supplied.

l_pid can't be meaningful for OFD locks, because they are associated with an open file description, which can be shared by multiple processes (fork(), file descriptor passing across UNIX domain sockets). One can have the situation:

    <acquire lock>
    if (fork())
        exit(0);    /* Parent, which acquired the lock, exits */
    /* Child holds the lock */

At that point, the lock is still held, but the PID that created it has gone away.


jlayton (reporter)
2015-03-24 01:32

I'd even go so far as to say that fcntl might be the wrong interface for handling locks asynchronously. But, I don't really know for sure. That's the problem, of course. No one has sat down to design a mechanism for handling locking asynchronously. That's a bit trickier than a simple, synchronous interface.

Also, Michael is correct about the l_pid field. Since it has no real meaning with an OFD lock, Linux always fills out that field with '-1' with an OFD lock.

FWIW, that behavior was actually copied from FreeBSD. On FreeBSD, flock and fcntl locks can conflict with one another, and if you do an F_GETLK request and a flock lock would conflict, the kernel also fills out the l_pid field with '-1'.
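The l_pid behavior described above can be observed with a small Linux-only sketch in C. The function name and scratch path are invented for this example, and the fallback #defines repeat the Linux values in case the headers do not expose them. One open file description takes an OFD write lock, and a second one asks F_OFD_GETLK which lock would block it.

```c
#define _GNU_SOURCE            /* glibc only exposes F_OFD_* with this defined */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#ifndef F_OFD_SETLK
#define F_OFD_GETLK 36         /* Linux values */
#define F_OFD_SETLK 37
#endif

/* Returns the l_pid that F_OFD_GETLK reports for a conflicting OFD lock,
 * or 0 if no conflicting lock was reported. */
pid_t reported_owner_of_ofd_lock(void)
{
    char path[] = "/tmp/ofd-getlk-demo-XXXXXX";  /* hypothetical scratch file */
    struct flock held = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
    struct flock probe = { .l_type = F_WRLCK, .l_whence = SEEK_SET };

    int holder = mkstemp(path);
    int prober = open(path, O_RDWR);
    assert(holder >= 0 && prober >= 0);

    int rc = fcntl(holder, F_OFD_SETLK, &held);
    assert(rc == 0);

    /* On return, probe describes a lock that would block the request,
     * or has l_type == F_UNLCK if nothing would block it. */
    rc = fcntl(prober, F_OFD_GETLK, &probe);
    assert(rc == 0);
    pid_t owner = (probe.l_type == F_UNLCK) ? 0 : probe.l_pid;

    close(prober);
    close(holder);
    unlink(path);
    return owner;
}
```

With the Linux behavior described here, the function returns -1, since the blocking lock belongs to an open file description and not to any particular process.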
ned14 (reporter)
2015-03-25 16:33

@jlayton: Jeff would you be interested in collaborating with me on a white paper discussing API options for implementing asynchronous byte range file locks? A rough outline might be:

* How asynchronous byte range file locks are implemented on other platforms (Windows, QNX etc) and benchmarked performance thereof in various use case scenarios.

* How the Linux KAIO API could be extended (io_submit etc).

* How the POSIX AIO API could be extended (FreeBSD, OS X etc).

If you're interested, I don't mind producing a first draft. My next two weeks I am off from work writing conference papers anyway, and this white paper could form part of my CppCon proceedings.

The other question is whether there is any interest in standardising such a thing? I'm sure the vendors of database or package management or network filing system software would be extremely interested, especially in any guaranteed fair kernel implemented algorithm for locking multiple files concurrently, but this is a very niche topic.
jlayton (reporter)
2015-03-26 13:21

Sure, I'd be happy to review such a white paper. Probably best to carry on that discussion via email though -- FWIW, I'm no longer at Red Hat.

I know that the Samba team have wished for an asynchronous file locking mechanism so I can speak to that particular use-case (or track down others who can better than I).

- Issue History
Date Modified Username Field Change
2013-10-11 13:24 jlayton New Issue
2013-10-11 13:24 jlayton Name => Jeff Layton
2013-10-11 13:24 jlayton Organization => Red Hat
2013-10-11 13:24 jlayton Section => fcntl
2013-10-11 13:24 jlayton Page Number => 814-815
2013-10-11 13:39 jlayton Issue Monitored: jlayton
2013-10-11 15:27 jilles Note Added: 0001879
2013-10-11 15:45 jlayton Note Added: 0001880
2013-11-08 12:18 jlayton Note Added: 0001975
2013-11-14 16:17 Don Cragun Note Added: 0001985
2013-11-27 17:44 grawity Issue Monitored: grawity
2013-12-04 11:06 Florian Weimer Issue Monitored: Florian Weimer
2013-12-10 19:59 jlayton Note Added: 0002062
2013-12-12 12:05 jlayton Note Added: 0002069
2013-12-26 12:29 jlayton Note Added: 0002093
2014-02-27 16:20 eblake Note Added: 0002163
2014-02-27 16:22 eblake Relationship added related to 0000824
2014-02-27 18:44 jlayton Note Added: 0002168
2014-04-18 00:01 jlayton Note Added: 0002230
2014-12-18 20:57 jlayton Note Added: 0002508
2015-03-09 02:52 olly Note Added: 0002574
2015-03-09 02:52 olly Note Edited: 0002574
2015-03-09 08:08 mtk Note Added: 0002575
2015-03-09 18:32 ned14 Note Added: 0002576
2015-03-10 20:00 jlayton Note Added: 0002577
2015-03-17 22:43 olly Note Added: 0002589
2015-03-19 13:09 ned14 Note Added: 0002590
2015-03-19 18:50 jlayton Note Added: 0002592
2015-03-19 19:57 ned14 Note Added: 0002593
2015-03-19 20:53 jlayton Note Added: 0002594
2015-03-20 00:32 ned14 Note Added: 0002595
2015-03-20 07:00 mtk Note Added: 0002596
2015-03-24 01:32 jlayton Note Added: 0002602
2015-03-25 16:33 ned14 Note Added: 0002603
2015-03-26 13:21 jlayton Note Added: 0002605
