0000684: posix_fallocate() should be allowed to return ENOTSUP

Notes
(0001554) nico (reporter) 2013-04-27 05:40	Note that there exist applications that either treat all posix_fallocate() failures as fatal (e.g., SQLite3 3.7.16.2), or have code to fake it up if posix_fallocate() either does not exist (SQLite3 3.7.16.2) or fails (glibc; see comments in the SQLite3 3.7.16.2 source code). Note too that SQLite3 apparently will stop using posix_fallocate() (see http://www.sqlite.org/src/info/1bbb4be1a2). One option might be for filesystems to treat posix_fallocate() as advisory rather than mandatory, given that this is often how applications handle it (either ignoring its failures or calling a stub that fakes up fallocation on failure, like glibc apparently does).

(0001555) nico (reporter) 2013-04-27 16:56	Also, applications that use posix_fallocate() still have to check for and handle (and recover from) write()/pwrite() failures. Perhaps posix_fallocate() should be considered purely advisory, and should not fail when not implemented.
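
A minimal sketch of the "advisory" treatment described in the two notes above, assuming a POSIX system: the reservation attempt is never fatal, and every pwrite() is still checked. The helper names try_reserve() and write_all() are hypothetical and do not come from SQLite3 or glibc.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: try to reserve space, but treat failure as a hint only.
 * Returns 1 if the reservation was reported as successful, 0 otherwise. */
int try_reserve(int fd, off_t offset, off_t len)
{
    /* posix_fallocate() returns an error number directly; it does not set errno. */
    return posix_fallocate(fd, offset, len) == 0;
}

/* Hypothetical helper: write the whole buffer at the given offset.
 * Returns 0 on success or an errno value; the caller must still recover
 * from ENOSPC, EIO and so on, whether or not try_reserve() returned 1. */
int write_all(int fd, const void *buf, size_t count, off_t offset)
{
    const char *p = buf;

    while (count > 0) {
        ssize_t n = pwrite(fd, p, count, offset);
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted before writing: retry */
            return errno;
        }
        if (n == 0)
            return EIO;            /* no progress: give up rather than spin */
        p += n;
        count -= (size_t)n;
        offset += n;
    }
    return 0;
}
```

An application written this way behaves the same whether posix_fallocate() succeeds, fails, or is unsupported, which is the sense in which the call is only advisory.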

(0001556) dalias (reporter) 2013-04-27 21:36	If a robust application needs to know that writes will not fail due to insufficient space, it would be harmful for posix_fallocate to report false success. I do not think the standard should permit such behavior, but that's not to say that implementations can't offer it as an option. The simplest solution would be for implementations to simply document that they become non-conforming (and possibly dangerous) if you use filesystems that cannot or will not honor posix_fallocate. Ideally, such filesystems would offer an option to reserve storage (possibly by reserving an upper bound much larger than they'll actually need) so that they can still be used, albeit suboptimally, in deployments with strong robustness requirements. The closest analogy I can come up with is memory overcommit for malloc and writable mappings. Clearly it's non-conforming for a correct application to crash when physical memory can't be instantiated for an allocation for which the implementation already reported success, but implementations do it anyway. To be conforming, and to meet the needs of robust applications, they need to provide a way to turn off overcommit, and need to document the dangerous behavior that can occur when it's enabled. Falsely reporting success for posix_fallocate is basically just overcommit of filesystem storage space, so my view is that it should be treated the same way.

(0001557) nico (reporter) 2013-04-29 18:15	I agree that whether posix_fallocate() lies should be a filesystem option, and in fact I'd already proposed as much on the illumos-zfs list. HOWEVER, a) the application still has to be able to recover from write(2) failures (e.g., EIO) after a posix_fallocate(), b) it is already the case that posix_fallocate() often lies (see glibc). Add to this the idea of a filesystem option to allow posix_fallocate() to lie. This all means that applications that really want to count on posix_fallocate()... can't. The inescapable conclusion is that today a portable application can only count on posix_fallocate() being advisory. It'd be nice if apps that want posix_fallocate() could tell if it will lie (so they can warn or fail). We can't fix it by adding "and we really mean it". But we can fix it by adding a pathconf() to indicate whether posix_fallocate() is advisory or mandatory for the given file. Therefore I think we need not only to allow ENOTSUP and add a NOTES section, but we must also add a pathconf().
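
A hedged sketch of how an application might use the pathconf() query proposed above. The note does not name a variable; _PC_FALLOC is an assumption here, as are the enum and function names, so the code is guarded to compile where no such name exists.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <unistd.h>

/* Hypothetical result type for the query. */
enum falloc_support { FALLOC_UNKNOWN, FALLOC_UNSUPPORTED, FALLOC_SUPPORTED };

/* Ask whether posix_fallocate() is honored for the file behind fd.
 * _PC_FALLOC is an assumed name; without it the answer is always UNKNOWN. */
enum falloc_support query_falloc(int fd)
{
#ifdef _PC_FALLOC
    errno = 0;
    long v = fpathconf(fd, _PC_FALLOC);
    if (v == -1)
        /* -1 with errno unchanged means "not supported"; otherwise an error. */
        return errno == 0 ? FALLOC_UNSUPPORTED : FALLOC_UNKNOWN;
    return v > 0 ? FALLOC_SUPPORTED : FALLOC_UNSUPPORTED;
#else
    (void)fd;
    return FALLOC_UNKNOWN;
#endif
}
```

An application that needs a real reservation could warn or refuse to run unless the result is FALLOC_SUPPORTED, and treat the other two results as "posix_fallocate() may be advisory".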

(0001558) nico (reporter) 2013-04-30 00:13	The "Desired Action" should be updated (I don't think I can edit it) to say that we need both, to list ENOTSUP as a permitted failure case, and to specify a pathconf for determining whether the filesystem supports posix_fallocate() for a given file (and, if it does, then posix_fallocate() may not lie).

(0001561) nick (manager) 2013-05-02 16:18 edited on: 2013-10-03 15:32	Interpretation response
------------------------
The standard does not speak to this issue, and as such no conformance distinction can be made between alternative implementations based on this. This is being referred to the sponsor.
Rationale:
-------------
If a system conforms to the Advisory Information Option it must also provide a filesystem that will support that option (see XSH 2.1.5). However, such systems may also support other, non-conforming filesystems, and the standard currently requires an EINVAL error for these. ENOTSUP is preferred so that this error can be distinguished from the other EINVAL cases.
Notes to the Editor (not part of this interpretation):
------------------------------------------------------
On page 1425 line 47084 delete (from the EINVAL error):
    , or the underlying file system does not support this operation
Add on page 1425 after line 47088 (to the "shall fail" errors):
    [ENOTSUP] The underlying file system does not support this operation.
Add to APPLICATION USAGE (page 1426, after the sentence on 47096):
    Not all filesystems are capable of supporting posix_fallocate(), in which case the function will return ENOTSUP. However, if a system supports the option, there must be at least one filesystem that is capable of supporting this function and is available for use in conforming environments.
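
A minimal sketch of what the ENOTSUP change gives callers: "the filesystem cannot do this" becomes distinguishable from "the arguments were wrong". The enum and function names are hypothetical. Accepting EOPNOTSUPP as well is an assumption for portability (on Linux the two constants share a value).

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

/* Hypothetical classification of posix_fallocate() results. */
enum reserve_result {
    RESERVE_OK,
    RESERVE_UNSUPPORTED,   /* ENOTSUP: filesystem cannot honor the request */
    RESERVE_NO_SPACE,      /* ENOSPC: not enough free space */
    RESERVE_BAD_ARGS,      /* EINVAL: bad offset/len (and, under the old text,
                              possibly an unsupported filesystem too) */
    RESERVE_OTHER
};

enum reserve_result classify_fallocate(int rc)
{
    if (rc == 0)
        return RESERVE_OK;
    /* Compared with if rather than switch: the two values may be equal. */
    if (rc == ENOTSUP || rc == EOPNOTSUPP)
        return RESERVE_UNSUPPORTED;
    if (rc == ENOSPC)
        return RESERVE_NO_SPACE;
    if (rc == EINVAL)
        return RESERVE_BAD_ARGS;
    return RESERVE_OTHER;
}

/* posix_fallocate() returns the error number itself, so pass it straight on. */
enum reserve_result reserve(int fd, off_t offset, off_t len)
{
    return classify_fallocate(posix_fallocate(fd, offset, len));
}
```

A caller can then fall back to ordinary writes on RESERVE_UNSUPPORTED while still treating RESERVE_BAD_ARGS as a programming error.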

(0001813) ajosey (manager) 2013-09-06 04:56	Interpretation Proposed 6 Sep 2013

(0001857) shware_systems (reporter) 2013-09-27 19:06 edited on: 2013-10-02 21:43	I was looking over the pending Interps and this caught my eye. I don't agree with it and I feel if it is applied it will have to be rolled back for Issue 8, but that's doable as part of ER #687.
----------------------------------------------------------------------
Re: #1561
Suggested changes, for clarity, if this interpretation stands:
First off, it's missing the closing " at end of text.
Change
    "However, if a system supports the option, there must be at least one filesystem that is capable of supporting this function."
to
    "However, if a system supports the option, there also must be at least one filesystem that is capable of supporting this function provided with the implementation, though at application execution time an instance of this filesystem may not be accessible."
At Line 47084:
Remove ", or the underlying file system does not support this operation" from EINVAL description
or
Change "underlying file system" to "file was opened for writing but has been marked read only since and so".
----------------------------------------------------------------------
Note: That last shouldn't happen, but chmod( ) appears to allow it for any writable file system, according to its App Usage section. The 'this will not happen...' part is not normative so can't be relied on.
This report seems to me a file system quality-of-implementation issue more than a standard defect, though. Adding ENOTSUP here opens a can of worms for other write interfaces, and not just those that permit arbitrary seeks past end-of-file before attempting writes, which is why I feel the proposed text is a stopgap measure. Using it to indicate just that the posix_fadvise( ) optimizations aren't supported is plausible, as a warning rather than outright error, but the space should still be allocated on the media if other errors aren't detected.
I think the Interpretation should be:
----------------------------------------------------------------------
Response:
The standard is clear enough and right; implementations exhibiting other behaviors are non-conforming and shall not claim to support the Advisory Information Option.
Rationale:
The posix_fallocate( ) routine can be implemented using, except one, only the functionality of interfaces from <stdio.h> that all writable file systems are required to support, with the caveat that a thread private FILE record instance allocated as part of the thread state is needed to avoid conflicts with per-process open FILE * limits. This is necessary because there is no 'int flush(int fildes)' interface to complement 'int fflush(FILE *stream)' used as part of the Line 47064, '...ensure that any required storage for regular file data starting at offset and continuing for len bytes is allocated on the file system storage media' specification. The exception is also required to be supported as part of the Base conformance requirements. Such a routine is usable to provide the required behavior whether or not a file system supports optimizing the allocation request in accordance with any posix_fadvise( ) settings that may have been applied. Conforming implementations therefore have no need to return EINVAL due to a file system not supporting the interface; only implementations providing non-conforming extensions require this.
Notes to the Editor (not part of this interpretation):
At Line 47084:
Remove ", or the underlying file system does not support this operation" from EINVAL description.
or
Change "underlying file system" to "file was opened for writing but has been marked read only since and so".
After Line 47096, in Application Usage, add:
    It is up to the application to write lock the segment being allocated using the same handle with fcntl( ) if it is unknown whether a handle open in another process might invalidate the effect of this interface before the application has a chance to write final data to this allocated segment and close that handle. This behavior is not normatively part of the interface so as to not slow down applications where it doesn't apply.
Replace in Future Directions, Line 47100, "None." with:
    "It is recognized that some implementations currently permit use of this interface with some file systems in a manner that does not guarantee the behavior specified here and by other required interfaces because those file system implementations are unreliable when media usage is near its intrinsic or allocated capacity. A future version of the standard may require such file systems to be accessed as if it were read only media by conforming implementations to avoid data loss due to this unreliability."
In See Also, Line 47102, insert:
    ", fcntl( )" after "creat( )"
----------------------------------------------------------------------
In other words, "yes we really mean it" isn't optional or needs to be added. As to details, though I'm not providing the full code, it only needs from <stdio.h>:
    int fseek(FILE *stream, long offset, int whence);
    size_t fwrite(const void *restrict ptr, size_t size, size_t nitems, FILE *restrict stream);
    int fclose(FILE *stream);
    int fflush(FILE *stream);
and as CX extensions:
    FILE *fdopen(int fildes, const char *mode);
    int fseeko(FILE *stream, off_t offset, int whence);
From <unistd.h> it would use:
    int ftruncate(int fildes, off_t length);
for a nominally complete basic implementation. All of these are part of the Base requirements.
Taking into account posix_fadvise( ) is implementation and file system dependent still, but defaulting to a basic mode when a file system doesn't support the optimizations is supposed to be transparent to all applications, that I can see (Lines 47070-1). An internal fdopen( ) that uses a FILE reference as parameter rather than allocating one from the pool would be needed, but the mechanics of filling in the FILE record would be the same; it just wouldn't return allocation errors. If there were an 'int flush(int fildes)' interface, xxx(int fildes) versions of these interfaces could be used directly without fdopen/fclose. An internal fdtruncate( ) that operates the same as ftruncate( ) but is passed a FILE parameter might be useful also, but isn't really needed - ftruncate( ) can be attempted before the fdopen( ) to extend the file if necessary.
The cases specified in the description are for when the file systems are caching data prior to a flush. After a flush or file close, if there is a physical space problem because of metadata reallocation, that's an issue for that file system and the hosting environment, so it is beyond the scope of the standard. What overcommits or undercommits are required is not the standard's concern. Such activities are expected to be transparent to an application once the interface reports success, even if optimal performance suffers accommodating this. A range lock might be needed to prevent another thread or process from shrinking the file too much, but that applies with any file system.
As I see it, when such problems exist the implementation shouldn't allow any writable mounts of that file system because it doesn't support the expectations of the basic C99 FILE * based stream operations, much less all the POSIX fildes operations where writes are concerned. A copy-on-write file system should return ENOSPC or EIO if it can't fully flush the file extension to disk as the result of one of the internal write( ) commits of the fflush( ) interface.
Even for a file system that does sparse allocations when large blocks of zeros are detected the routine can succeed by writing a data value or pattern that disables this behavior. So, IMO those implementations that lie about success are simply non-conforming for Issue 7. Prediction of space requirements isn't necessary; the ability to roll back a flush attempt is, even for a single fputc( ) that appends to a file. They may return from fflush( ) what C99 expects as errors, but not the CX errors POSIX defines and requires to enable rollback detection and act accordingly, so it seems. This affects ER #687 as well. Read only media wouldn't support this, regardless of filesystem, and this should be discernible via f/pathconf( ) separate from the strictly advisory interfaces, as proposed there, before an open( ) of files on that media is attempted. Those file systems that do have issues would be candidates for mounting in a read only mode instead of read write rather than return ENOTSUP, and as proposed here I think the language now should reflect that requirement instead as a future direction. If it does the interface also becomes a candidate for moving to the Base in Issue 8. Then the pathconf( ) could use 'media read only mounted, file is read only, supports basic write modes[ADV], supports fadvise( ) optimizations[]' as its return values.
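
The note above says the full code is not provided. The following is only a rough sketch of the general idea (extend the file, touch every block in the range, then flush), not the author's implementation. It swaps in pread()/pwrite()/fsync() for the stdio calls listed in the note, preserves existing bytes instead of writing a pattern, and takes no range lock, so it is racy against concurrent writers. The name emulate_fallocate() is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>      /* declares posix_fallocate(), which this stands in for */
#include <limits.h>
#include <stdint.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 0 or an error number, mirroring posix_fallocate()'s convention. */
int emulate_fallocate(int fd, off_t offset, off_t len)
{
    /* Largest positive off_t, computed because POSIX defines no OFF_MAX. */
    const off_t off_max =
        (off_t)((((uintmax_t)1) << (sizeof(off_t) * CHAR_BIT - 1)) - 1);
    struct stat st;

    if (offset < 0 || len <= 0)
        return EINVAL;
    if (len > off_max - offset)
        return EFBIG;
    if (fstat(fd, &st) != 0)
        return errno;
    if (!S_ISREG(st.st_mode))
        return ENODEV;

    off_t end = offset + len;
    off_t blk = st.st_blksize > 0 ? (off_t)st.st_blksize : 512;

    /* Extend the file first so every byte in the range is readable. */
    if (st.st_size < end && ftruncate(fd, end) != 0)
        return errno;

    /* Rewrite one byte in each block so the block has to be allocated. */
    for (off_t pos = offset;;) {
        unsigned char byte = 0;
        if (pread(fd, &byte, 1, pos) < 0)
            return errno;
        errno = 0;
        if (pwrite(fd, &byte, 1, pos) != 1)
            return errno ? errno : EIO;
        off_t step = blk - pos % blk;   /* distance to the next block start */
        if (step >= end - pos)
            break;
        pos += step;
    }
    /* Push the writes out so ENOSPC or EIO surfaces here, not later. */
    return fsync(fd) == 0 ? 0 : errno;
}
```

Whether rewriting a byte really forces allocation depends on the file system (copy-on-write and compressing file systems are the cases debated in this thread), so a zero return here is only as trustworthy as the file system underneath.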

(0002161) ajosey (manager) 2014-02-21 15:41	Interpretation Proposed 21 Feb 2014

(0002202) ajosey (manager) 2014-03-25 13:42	Interpretation Approved: 25 March 2014

Issue History
Date Modified	Username	Field	Change
2013-04-27 05:14	nico	New Issue
2013-04-27 05:14	nico	Name	=> Nicolas Williams
2013-04-27 05:14	nico	Organization	=> Cryptonector LLC
2013-04-27 05:14	nico	Section	=> posix_fallocate
2013-04-27 05:14	nico	Page Number	=> 1
2013-04-27 05:14	nico	Line Number	=> 1
2013-04-27 05:40	nico	Note Added: 0001554
2013-04-27 05:40	nico	Issue Monitored: nico
2013-04-27 16:56	nico	Note Added: 0001555
2013-04-27 21:36	dalias	Note Added: 0001556
2013-04-29 18:15	nico	Note Added: 0001557
2013-04-30 00:13	nico	Note Added: 0001558
2013-05-02 16:18	nick	Page Number	1 => 1425
2013-05-02 16:18	nick	Line Number	1 => 47078
2013-05-02 16:18	nick	Interp Status	=> ---
2013-05-02 16:18	nick	Note Added: 0001561
2013-05-02 16:19	nick	Interp Status	--- => Pending
2013-05-02 16:19	nick	Final Accepted Text	=> See Note: 0001561
2013-05-02 16:19	nick	Status	New => Interpretation Required
2013-05-02 16:19	nick	Resolution	Open => Accepted As Marked
2013-05-02 16:20	nick	Note Edited: 0001561
2013-05-02 16:21	nick	Note Edited: 0001561
2013-05-02 16:25	nick	Tag Attached: tc2-2008
2013-05-02 17:10	nick	Issue cloned	0000687
2013-05-02 17:10	nick	Relationship added	related to 0000687
2013-09-06 04:56	ajosey	Interp Status	Pending => Proposed
2013-09-06 04:56	ajosey	Note Added: 0001813
2013-09-27 19:06	shware_systems	Note Added: 0001857
2013-10-02 21:43	shware_systems	Note Edited: 0001857
2013-10-03 15:26	geoffclare	Tag Detached: tc2-2008
2013-10-03 15:26	geoffclare	Tag Attached: issue8
2013-10-03 15:31	geoffclare	Note Edited: 0001561
2013-10-03 15:32	geoffclare	Note Edited: 0001561
2013-10-03 15:32	geoffclare	Interp Status	Proposed => Pending
2014-02-21 15:41	ajosey	Interp Status	Pending => Proposed
2014-02-21 15:41	ajosey	Note Added: 0002161
2014-03-25 13:42	ajosey	Interp Status	Proposed => Approved
2014-03-25 13:42	ajosey	Note Added: 0002202
2020-03-24 15:48	geoffclare	Status	Interpretation Required => Applied
