0001585: kill - add -j option to avoid PID reuse race

Notes
(0005835) geoffclare (manager) 2022-05-16 08:21 edited on: 2022-05-16 08:33	As far as I can see, the only time this -j option would be useful is if an application wants to send a signal just to a process group leader without sending it to the other processes in the group. This seems like a very rare thing for an application to need to do. It would not be useful in the other two possible cases, which are:
1. The signal is to be sent to the whole process group. In this case, the application can just use "kill JOB".
2. The signal is to be sent to one or more of the processes comprising a process group, but not the whole group and not just the leader. In this case the use of -j does not solve the problem, as any of those process IDs (other than the leader) could have been reused even though the leader is still running and thus the job still exists.
Anyway, a discussion of the technical merits of the proposal is pointless unless there is a shell which already implements this kill -j option. None of the shells I have available do. Does anybody know of one that does? If not, this request should be rejected as invention.

(0005836) kre (reporter) 2022-05-16 10:08	I agree that this is invention, and should be rejected, but even if some shell did try implementing it, I cannot see how they would do so in a way that would meet the objectives of the issue raised. As best I can tell, doing anything as suggested requires kernel assistance, as no matter how carefully the shell checks, there's no way that it can avoid race conditions: it must check first, and then do the kill sys call, and in the intervening period, things might have altered. If the kernel had a kill_my_child() sys call, then it could be made to work I think, as the shell could check the pids it knows belong to its children, and avoid creating any new ones between that check and doing the kill_my_child() sys call. But since I don't know of any system that implements a sys call like that (or an option on kill; it could be done by setting a high order bit in the signal number), I cannot see how a shell could possibly make this work well enough to make adding such an option sensible. kre ps: such a new sys call would work on pgrps just the same as kill() does.

(0005837) steffen (reporter) 2022-05-16 13:37	First of all, a correction: what i really meant was
-j Process the kill request only when the given jobs or saved-away process identifiers are known to the shell, and have not yet terminated, otherwise exit with status 66 (EX_NOINPUT from sysexits.h).

(0005838) steffen (reporter) 2022-05-16 13:54	re 5835 and 5836: It is clear the idea is that the shell's children will remain in the operating system's table of active processes until they have been wait(2)ed for, therefore only the sh(1)ell as the parent process can kill(2) the child safely. -j is thus meant to give the sh(1)ell script writer access to the race-free capability that the sh(1)ell as such has in its internals (anyway).
Regarding process groups. Yes, this is true, of course, but i think it is weak reasoning to not offer the hand because somewhere down the process chain such things may happen. Quite the opposite, if i have the possibility everywhere, i can write race-free sh(1)ell scripts on all (subshell) levels. And programs with direct access to wait(2) that start child processes are hopefully doing it right anyway. They at least could using POSIX interfaces. But sh(1)ell script can not, even though the sh(1)ell as such can, or even has to do right. This is what this issue wants to change.
Regarding operating system support. Oh, that is true!! They do start implementing this but unfortunately non-portable, POSIX is late and should possibly have tried to set a scent here in the past. Linux has the prctl(2)s PR_SET_CHILD_SUBREAPER and PR_GET_CHILD_SUBREAPER:
    A subreaper fulfills the role of init(1) for its descendant processes. When a process becomes orphaned (i.e., its immediate parent terminates), then that process will be reparented to the nearest still living ancestor subreaper. Subsequently, calls to getppid(2) in the orphaned process will now return the PID of the subreaper process, and when the orphan terminates, it is the subreaper process that will receive a SIGCHLD signal and will be able to wait(2) on the process to discover its termination status.
and FreeBSD has an even more sophisticated approach that allows iterating over the "descendants of the reaper", especially as with PROC_REAP_KILL the possibility to kill only a subset of these. It could be that in order to implement timeout(1) properly portions of this functionality need to be implemented kernel-wise. Or already have done so.
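Editorial illustration of the claim in note 0005838 that a child stays in the process table until its parent waits for it: the following is a minimal Python sketch, not taken from any shell. The function name `probe_child_pid` is hypothetical. It forks a child that exits at once, waits for the exit without reaping (WNOWAIT), probes the PID with signal 0, then reaps and probes again. It assumes a Unix system where Python provides `os.waitid`. After reaping, the result of the second probe depends on whether the system has already handed the PID to another process, so the sketch reports it rather than assuming it.

```python
import os


def probe_child_pid():
    """Return (exists_before_reap, exists_after_reap) for a short-lived child."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: terminate immediately

    # Block until the child has exited, but leave it unreaped (a zombie).
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)

    # Signal 0 performs only the existence and permission check.
    try:
        os.kill(pid, 0)
        exists_before_reap = True
    except ProcessLookupError:
        exists_before_reap = False

    os.waitpid(pid, 0)  # reap: from here on the PID may be reused

    try:
        os.kill(pid, 0)
        exists_after_reap = True   # PID already reused by another process
    except ProcessLookupError:
        exists_after_reap = False  # PID currently unassigned
    except PermissionError:
        exists_after_reap = True   # PID reused by a process we may not signal
    return exists_before_reap, exists_after_reap


if __name__ == "__main__":
    print(probe_child_pid())
```

The first probe is the safe window that only the parent can rely on; the second shows why a saved PID means nothing once the parent has reaped the child.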

(0005839) steffen (reporter) 2022-05-16 23:41	Maybe i do misunderstand. Then i would retract the issue. What this issue wants to achieve is to close the gap in between wait(1) and waitpid(2)/x(2).
- A saved-away process identifier will be known to the shell unless wait(1) has been called on it.
- The process itself is known to the operating system, aka kept in the process table, until it has been waitpid(2)ed for.
If i kill(1) a process that is still known to the sh(1)ell because wait(1) has not yet been called, but the shell itself has already waitpid(2)ed on the child, after having received SIGCLD or for whatever reason, then the operating system may already have reused the process identifier as such. The -j option to kill(1) should overcome this gap in that the sh(1)ell is forced to check, for each given identifier, whether the process has already been waitpid(2)ed for or not. If it has, "kill -j -SIG PID" shall fail.
I do come here for a reason. Years ago i "saved-away process identifiers", and was under the impression that the sh(1)ell applies special care for those process identifiers until wait(1) has been called on them. But it turned out i could kill(1) a process that was no longer mine, even though i had not yet wait(1)ed on the process identifier. The shell simply called kill(2) on the process identifier, which in the meantime was reused by the operating system.
So maybe this was a sh(1)ell bug, and shells are not allowed to call waitpid(2) on a process where the saved-away process identifier has not been wait(1)ed for (that is, may only call it when the latter is called). If this was so i would retract this issue. If not then the default behaviour of sh(1)ell wait(1) cannot be changed since this could break things in the wild. There should be an option to explicitly request that saved-away process identifiers should be still-alive when kill(1)ing them.
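Editorial illustration of the behaviour requested in note 0005839: a minimal Python sketch of the bookkeeping a shell would need, not an implementation from any existing shell. The names `JobTable`, `spawn`, `reap`, `kill_j` and `demo` are hypothetical. The sketch simplifies the proposal by treating "known to the shell and not yet terminated" as "not yet reaped by this parent", and it assumes reaping happens only through `reap()` (no signal handler reaps behind its back). The status 66 is the EX_NOINPUT value named in note 0005837.

```python
import os
import signal
import time

EX_NOINPUT = 66  # exit status proposed in note 0005837


class JobTable:
    """Hypothetical model of a shell's record of its own children."""

    def __init__(self):
        self.unreaped = set()  # PIDs the kernel still reserves for this parent

    def spawn(self, fn):
        """Fork a child that runs fn() and then exits."""
        pid = os.fork()
        if pid == 0:
            try:
                fn()
            finally:
                os._exit(0)
        self.unreaped.add(pid)
        return pid

    def reap(self, pid):
        """Collect the child's status; afterwards its PID may be reused."""
        _, status = os.waitpid(pid, 0)
        self.unreaped.discard(pid)
        return status

    def kill_j(self, pid, sig):
        """Signal pid only while it is still an unreaped child of ours."""
        if pid not in self.unreaped:
            return EX_NOINPUT  # refuse: the PID may belong to someone else now
        os.kill(pid, sig)      # safe: an unreaped child's PID cannot be reused
        return 0


def demo():
    table = JobTable()
    pid = table.spawn(lambda: time.sleep(30))
    first = table.kill_j(pid, signal.SIGTERM)   # child is ours and unreaped
    table.reap(pid)
    second = table.kill_j(pid, signal.SIGTERM)  # already reaped: refused
    return first, second


if __name__ == "__main__":
    print(demo())
```

The check and the kill(2) call do not race here because only the parent can remove an entry from `unreaped`, and it does so only after waitpid(2) returns.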

(0005840) geoffclare (manager) 2022-05-17 08:37	As I said before, discussing the technical merits of the proposal here is pointless. It will not even be considered for addition to POSIX until it has been implemented in at least one widely used shell. If a kill -j option is needed, the people you need to convince of that are the various shell authors/maintainers, not the Austin Group.

(0005841) steffen (reporter) 2022-05-17 14:17	re #5840: So i will try this as i have time to, and leave this issue open, hoping to come back for good. (Good would mean it finally could become standardized.) Thank you so far.

(0006018) Don Cragun (manager) 2022-10-31 16:06	The timing is such that it will not now be possible for this to be included in Issue 8 draft 3 (which is intended to be feature complete). Therefore this is being rejected for Issue 8 but can be resubmitted for consideration in Issue 9 if and when it is implemented by a widely used shell.

Issue History
Date Modified	Username	Field	Change
2022-05-14 22:05	steffen	New Issue
2022-05-14 22:05	steffen	Name	=> steffen
2022-05-14 22:05	steffen	Section	=> Vol. 3: Shell and Utilities
2022-05-14 22:05	steffen	Page Number	=> 2879
2022-05-14 22:05	steffen	Line Number	=> 94942
2022-05-16 08:21	geoffclare	Note Added: 0005835
2022-05-16 08:33	geoffclare	Note Edited: 0005835
2022-05-16 10:08	kre	Note Added: 0005836
2022-05-16 13:37	steffen	Note Added: 0005837
2022-05-16 13:54	steffen	Note Added: 0005838
2022-05-16 23:41	steffen	Note Added: 0005839
2022-05-17 08:37	geoffclare	Note Added: 0005840
2022-05-17 14:17	steffen	Note Added: 0005841
2022-10-31 16:06	Don Cragun	Interp Status	=> ---
2022-10-31 16:06	Don Cragun	Note Added: 0006018
2022-10-31 16:06	Don Cragun	Status	New => Closed
2022-10-31 16:06	Don Cragun	Resolution	Open => Rejected

Relationships
