History log of /seL4-refos-master/libs/libmuslc/src/internal/pthread_impl.h
Revision Date Author Comments
# f548fc36 02-Apr-2018 Kent McLeod <Kent.Mcleod@data61.csiro.au>

RISC-V: Revert pthread_impl definitions


# 2435f79d 30-Jul-2016 Hesham Almatary <hesham.almatary@data61.csiro.au>

RISC-V port


# 6f186676 06-Dec-2016 Rich Felker <dalias@aerifal.cx>

remove largish unused field from pthread structure


# 4078a5c3 08-Nov-2016 Rich Felker <dalias@aerifal.cx>

fix build regression on archs with variable page size

commit 31fb174dd295e50f7c5cf18d31fcfd5fe5a063b7 used
DEFAULT_GUARD_SIZE from pthread_impl.h in a static initializer,
breaking build on archs where its definition, PAGE_SIZE, is not a
constant. instead, just define DEFAULT_GUARD_SIZE as 4096, the minimal
page size on any arch we support. pthread_create rounds up to whole
pages anyway, so defining it to 1 would also work, but a moderately
meaningful value is nicer to programs that use
pthread_attr_getguardsize on default-initialized attribute objects.
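
A minimal sketch of the rounding the message refers to, assuming a power-of-two page size. The helper name round_up_to_page is hypothetical and is not taken from the source tree; only the value 4096 comes from the message above.

```c
#include <assert.h>
#include <stddef.h>

/* Value named in the commit message: the minimal page size on any supported arch. */
#define DEFAULT_GUARD_SIZE 4096

/* Hypothetical helper: round size up to a whole number of pages.
 * page_size is assumed to be a nonzero power of two. */
static size_t round_up_to_page(size_t size, size_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return (size + page_size - 1) & ~(page_size - 1);
}
```

With this rounding, a guard size of 1 and a guard size of 4096 both yield one full page on a 4096-byte-page system, which is why either constant would work for pthread_create.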


# 6ba5517a 25-Jun-2015 Rich Felker <dalias@aerifal.cx>

fix local-dynamic model TLS on mips and powerpc

the TLS ABI spec for mips, powerpc, and some other (presently
unsupported) RISC archs has the return value of __tls_get_addr offset
by +0x8000 and the result of DTPOFF relocations offset by -0x8000. I
had previously assumed this part of the ABI was actually just an
implementation detail, since the adjustments cancel out. however, when
the local dynamic model is used for accessing TLS that's known to be
in the same DSO, either of the following may happen:

1. the -0x8000 offset may already be applied to the argument structure
passed to __tls_get_addr at ld time, without any opportunity for
runtime relocations.

2. __tls_get_addr may be used with a zero offset argument to obtain a
base address for the module's TLS, to which the caller then applies
immediate offsets for individual objects accessed using the local
dynamic model. since the immediate offsets have the -0x8000 adjustment
applied to them, the base address they use needs to include the
+0x8000 offset.

it would be possible, but more complex, to store the pointers in the
dtv[] array with the +0x8000 offset pre-applied, to avoid the runtime
cost of adding 0x8000 on each call to __tls_get_addr. this change
could be made later if measurements show that it would help.
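
The two cases above can be illustrated with a toy model. This is only a sketch of the arithmetic: toy_tls_get_addr and dtpoff_reloc are hypothetical names, module bases are plain integers, and nothing here is the real dynamic linker code.

```c
#include <assert.h>
#include <stdint.h>

#define DTP_OFFSET 0x8000u  /* bias described in the commit message */

/* What a DTPOFF-style relocation yields: the real offset minus the bias.
 * Unsigned wraparound is intentional and well-defined. */
static uintptr_t dtpoff_reloc(uintptr_t real_offset)
{
    return real_offset - DTP_OFFSET;
}

/* Toy model of __tls_get_addr: module base plus stored offset plus the bias. */
static uintptr_t toy_tls_get_addr(uintptr_t module_base, uintptr_t stored_offset)
{
    assert(module_base != 0);
    return module_base + stored_offset + DTP_OFFSET;
}
```

In case 1 the stored offset already carries the -0x8000 adjustment, so toy_tls_get_addr(base, dtpoff_reloc(off)) gives base + off. In case 2 the caller asks for offset zero and then adds dtpoff_reloc(off) itself, which only works because the returned base includes +0x8000.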


# 484194db 06-May-2015 Rich Felker <dalias@aerifal.cx>

fix stack protector crashes on x32 & powerpc due to misplaced TLS canary

i386, x86_64, x32, and powerpc all use TLS for stack protector canary
values in the default stack protector ABI, but the location only
matched the ABI on i386 and x86_64. on x32, the expected location for
the canary contained the tid, thus producing spurious mismatches
(resulting in process termination) upon fork. on powerpc, the expected
location contained the stdio_locks list head, so returning from a
function after calling flockfile produced spurious mismatches. in both
cases, the random canary was not present, and a predictable value was
used instead, making the stack protector hardening much less effective
than it should be.

in the current fix, the thread structure has been expanded to have
canary fields at all three possible locations, and archs that use a
non-default location must define a macro in pthread_arch.h to choose
which location is used. for most archs (which lack TLS canary ABI) the
choice does not matter.


# 01d42747 18-Apr-2015 Rich Felker <dalias@aerifal.cx>

make dlerror state and message thread-local and dynamically-allocated

this fixes truncation of error messages containing long pathnames or
symbol names.

the dlerror state was previously required by POSIX to be global. the
resolution of bug 97 relaxed the requirements to allow thread-safe
implementations of dlerror with thread-local state and message buffer.


# fa807876 14-Apr-2015 Alexander Monakov <amonakov@ispras.ru>

add missing 'void' in prototypes of internal pthread functions


# f08ab9e6 10-Apr-2015 Rich Felker <dalias@aerifal.cx>

redesign and simplify vmlock system

this global lock allows certain unlock-type primitives to exclude
mmap/munmap operations which could change the identity of virtual
addresses while references to them still exist.

the original design mistakenly assumed mmap/munmap would conversely
need to exclude the same operations which exclude mmap/munmap, so the
vmlock was implemented as a sort of 'symmetric recursive rwlock'. this
turned out to be unnecessary.

commit 25d12fc0fc51f1fae0f85b4649a6463eb805aa8f already shortened the
interval during which mmap/munmap held their side of the lock, but
left the inappropriate lock design and some inefficiency.

the new design uses a separate function, __vm_wait, which does not
hold any lock itself and only waits for lock users which were already
present when it was called to release the lock. this is sufficient
because of the way operations that need to be excluded are sequenced:
the "unlock-type" operations using the vmlock need only block
mmap/munmap operations that are precipitated by (and thus sequenced
after) the atomic-unlock they perform while holding the vmlock.

this allows for a spectacular lack of synchronization in the __vm_wait
function itself.


# 204a69d2 10-Mar-2015 Szabolcs Nagy <nsz@port70.net>

copy the dtv pointer to the end of the pthread struct for TLS_ABOVE_TP archs

There are two main abi variants for thread local storage layout:

(1) TLS is above the thread pointer at a fixed offset and the pthread
struct is below that. So the end of the struct is at known offset.

(2) the thread pointer points to the pthread struct and TLS starts
below it. So the start of the struct is at known (zero) offset.

Assembly code for the dynamic TLSDESC callback needs to access the
dynamic thread vector (dtv) pointer which is currently at the front
of the pthread struct. So in case of (1) the asm code needs to hard
code the offset from the end of the struct which can easily break if
the struct changes.

This commit adds a copy of the dtv at the end of the struct. New members
must not be added after dtv_copy, only before it. The size of the struct
is increased a bit, but there is opportunity for size optimizations.
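
A sketch of why a trailing copy helps variant (1). The struct toy_pthread below is hypothetical and much smaller than the real thread structure; it assumes no trailing padding after the last member, which holds here because every member is at most pointer-aligned.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified thread structure. */
struct toy_pthread {
    void **dtv;       /* at the front: fixed offset from the start */
    int tid;
    int errno_val;
    void *result;     /* members may be added or removed in the middle */
    void **dtv_copy;  /* must remain the last member */
};

/* Distance of dtv_copy from the end of the struct. */
static size_t dtv_copy_offset_from_end(void)
{
    return sizeof(struct toy_pthread) - offsetof(struct toy_pthread, dtv_copy);
}

/* Read the dtv the way end-relative code would: from a fixed negative offset. */
static void **dtv_from_end(struct toy_pthread *t)
{
    char *end = (char *)(t + 1);
    void ***slot = (void ***)(end - sizeof(void **));
    assert(*slot == t->dtv_copy);
    return *slot;
}
```

As long as dtv_copy stays last, the end-relative offset is sizeof(void *) no matter how the members before it change.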


# 56fbaa3b 03-Mar-2015 Rich Felker <dalias@aerifal.cx>

make all objects used with atomic operations volatile

the memory model we use internally for atomics permits plain loads of
values which may be subject to concurrent modification without
requiring that a special load function be used. since a compiler is
free to make transformations that alter the number of loads or the way
in which loads are performed, the compiler is theoretically free to
break this usage. the most obvious concern is with atomic cas
constructs: something of the form tmp=*p;a_cas(p,tmp,f(tmp)); could be
transformed to a_cas(p,*p,f(*p)); where the latter is intended to show
multiple loads of *p whose resulting values might fail to be equal;
this would break the atomicity of the whole operation. but even more
fundamental breakage is possible.

with the changes being made now, objects that may be modified by
atomics are modeled as volatile, and the atomic operations performed
on them by other threads are modeled as asynchronous stores by
hardware which happens to be acting on the request of another thread.
such modeling of course does not itself address memory synchronization
between cores/cpus, but that aspect was already handled. this all
seems less than ideal, but it's the best we can do without mandating a
C11 compiler and using the C11 model for atomics.

in the case of pthread_once_t, the ABI type of the underlying object
is not volatile-qualified. so we are assuming that accessing the
object through a volatile-qualified lvalue via casts yields volatile
access semantics. the language of the C standard is somewhat unclear
on this matter, but this is an assumption the linux kernel also makes,
and seems to be the correct interpretation of the standard.
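
A sketch of the cas construct discussed above. The a_cas here is a hypothetical stand-in built on the GCC/Clang __sync builtin, not the per-arch implementation in the tree; atomic_update and add_three are illustrative names.

```c
#include <assert.h>

/* Stand-in for an a_cas-style primitive: returns the value that was in *p. */
static int a_cas(volatile int *p, int expected, int desired)
{
    return __sync_val_compare_and_swap(p, expected, desired);
}

/* Atomically replace *p with f(*p); returns the value that was replaced. */
static int atomic_update(volatile int *p, int (*f)(int))
{
    int old;
    assert(p && f);
    do {
        /* volatile access: one load per iteration, reused for both arguments */
        old = *p;
    } while (a_cas(p, old, f(old)) != old);
    return old;
}

/* Example update function. */
static int add_three(int x)
{
    return x + 3;
}
```

Because the object is accessed through a volatile-qualified lvalue, the compiler may not re-load *p separately for the comparison value and for f, which is the transformation the message warns about.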


# 0fc317d8 02-Mar-2015 Rich Felker <dalias@aerifal.cx>

factor cancellation cleanup push/pop out of futex __timedwait function

previously, the __timedwait function was optionally a cancellation
point depending on whether it was passed a pointer to a cleanup
function and context to register. as of now, only one caller actually
used such a cleanup function (and it may face removal soon); most
callers either passed a null pointer to disable cancellation or a
dummy cleanup function.

now, __timedwait is never a cancellation point, and __timedwait_cp is
the cancellable version. this makes the intent of the calling code
more obvious and avoids ugly dummy functions and long argument lists.


# 23614b0f 07-Sep-2014 Rich Felker <dalias@aerifal.cx>

add C11 thread creation and related thread functions

based on patch by Jens Gustedt.

the main difficulty here is handling the difference between start
function signatures and thread return types for C11 threads versus
POSIX threads. pointers to void are assumed to be able to represent
faithfully all values of int. the function pointer for the thread
start function is cast to an incorrect type for passing through
pthread_create, but is cast back to its correct type before calling so
that the behavior of the call is well-defined.
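
The cast-and-cast-back idea can be sketched without creating real threads. All names below (toy_thread, toy_create_c11, toy_run_c11, sample_start) are hypothetical; the sketch also assumes, as the message does, that a pointer to void can faithfully carry any int value.

```c
#include <assert.h>
#include <stdint.h>

typedef int (*c11_start_t)(void *);      /* C11-style start function */
typedef void *(*posix_start_t)(void *);  /* POSIX-style start function */

/* Toy stand-in for what the creation path stores for the new thread. */
struct toy_thread {
    posix_start_t start;
    void *arg;
};

static void toy_create_c11(struct toy_thread *t, c11_start_t f, void *arg)
{
    assert(t && f);
    t->start = (posix_start_t)f;  /* stored under the "incorrect" type */
    t->arg = arg;
}

static void *toy_run_c11(struct toy_thread *t)
{
    /* cast back to the correct type before calling, so the call is well-defined */
    c11_start_t f = (c11_start_t)t->start;
    int r = f(t->arg);
    return (void *)(intptr_t)r;  /* int result carried in a void pointer */
}

static int sample_start(void *arg)
{
    return *(int *)arg + 1;
}
```

The function pointer is never called through the wrong type; it is only stored and passed along in that form.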

changes to the existing threads implementation were kept minimal to
reduce the risk of regressions, and duplication of code that carries
implementation-specific assumptions was avoided for ease and safety of
future maintenance.


# 5345c9b8 23-Aug-2014 Rich Felker <dalias@aerifal.cx>

fix false ownership of stdio FILEs due to tid reuse

this is analogous to commit fffc5cda10e0c5c910b40f7be0d4fa4e15bb3f48
which fixed the corresponding issue for mutexes.

the robust list can't be used here because the locks do not share a
common layout with mutexes. at some point it may make sense to simply
incorporate a mutex object into the FILE structure and use it, but
that would be a much more invasive change, and it doesn't mesh well
with the current design that uses a simpler code path for internal
locking and pulls in the recursive-mutex-like code when the flockfile
API is used explicitly.


# b8ca9eb5 22-Aug-2014 Rich Felker <dalias@aerifal.cx>

fix fallback checks for kernels without private futex support

for unknown syscall commands, the kernel produces ENOSYS, not EINVAL.
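
A sketch of the fallback check, written against a fake syscall so it runs anywhere. The TOY_ constants mirror the Linux FUTEX_WAKE and FUTEX_PRIVATE_FLAG values; wake_with_fallback and old_kernel_futex are hypothetical and do not reproduce the code in the tree.

```c
#include <assert.h>
#include <errno.h>

#define TOY_FUTEX_WAKE    1
#define TOY_FUTEX_PRIVATE 128

/* Signature of a futex-like call returning a count or a negated errno. */
typedef long (*futex_fn)(volatile int *addr, int op, int val);

/* Try the private op first; retry without the flag only on -ENOSYS. */
static long wake_with_fallback(futex_fn futex, volatile int *addr, int cnt, int priv)
{
    int flag = priv ? TOY_FUTEX_PRIVATE : 0;
    long r;
    assert(futex);
    r = futex(addr, TOY_FUTEX_WAKE | flag, cnt);
    if (r == -ENOSYS && flag)
        r = futex(addr, TOY_FUTEX_WAKE, cnt);
    return r;
}

/* Fake "old kernel": rejects private ops as unknown commands. */
static long old_kernel_futex(volatile int *addr, int op, int val)
{
    (void)addr;
    if (op & TOY_FUTEX_PRIVATE)
        return -ENOSYS;
    return val;  /* pretend val waiters were woken */
}
```

Checking for -EINVAL instead would leave the fallback path unreachable on such a kernel, which is the bug this commit describes.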


# 37195db8 17-Aug-2014 Rich Felker <dalias@aerifal.cx>

redesign cond var implementation to fix multiple issues

the immediate issue that was reported by Jens Gustedt and needed to be
fixed was corruption of the cv/mutex waiter states when switching to
using a new mutex with the cv after all waiters were unblocked but
before they finished returning from the wait function.

self-synchronized destruction was also handled poorly and may have had
race conditions. and the use of sequence numbers for waking waiters
admitted a theoretical missed-wakeup if the sequence number wrapped
through the full 32-bit space.

the new implementation is largely documented in the comments in the
source. the basic principle is to use linked lists initially attached
to the cv object, but detachable on signal/broadcast, made up of nodes
residing in automatic storage (stack) on the threads that are waiting.
this eliminates the need for waiters to access the cv object after
they are signaled, and allows us to limit wakeup to one waiter at a
time during broadcasts even when futex requeue cannot be used.

performance is also greatly improved, roughly doubling in some tests.

basically nothing is changed in the process-shared cond var case,
where this implementation does not work, since processes do not have
access to one another's local storage.


# de7e99c5 16-Aug-2014 Rich Felker <dalias@aerifal.cx>

make pointers used in robust list volatile

when manipulating the robust list, the order of stores matters,
because the code may be asynchronously interrupted by a fatal signal
and the kernel will then access the robust list in what is essentially
an async-signal context.

previously, aliasing considerations made it seem unlikely that a
compiler could reorder the stores, but proving that they could not be
reordered incorrectly would have been extremely difficult. instead
I've opted to make all the pointers used as part of the robust list,
including those in the robust list head and in the individual mutexes,
volatile.

in addition, the format of the robust list has been changed to point
back to the head at the end, rather than ending with a null pointer.
this is to match the documented kernel robust list ABI. the null
pointer, which was previously used, only worked because faults during
access terminate the robust list processing.


# bc09d58c 15-Aug-2014 Rich Felker <dalias@aerifal.cx>

make futex operations use private-futex mode when possible

private-futex uses the virtual address of the futex int directly as
the hash key rather than requiring the kernel to resolve the address
to an underlying backing for the mapping in which it lies. for certain
usage patterns it improves performance significantly.

in many places, the code using futex __wake and __wait operations was
already passing a correct fixed zero or nonzero flag for the priv
argument, so no change was needed at the site of the call, only in the
__wake and __wait functions themselves. in other places, especially
where the process-shared attribute for a synchronization object was
not previously tracked, additional new code is needed. for mutexes,
the only place to store the flag is in the type field, so additional
bit masking logic is needed for accessing the type.

for non-process-shared condition variable broadcasts, the futex
requeue operation is unable to requeue from a private futex to a
process-shared one in the mutex structure, so requeue is simply
disabled in this case by waking all waiters.

for robust mutexes, the kernel always performs a non-private wake when
the owner dies. in order not to introduce a behavioral regression in
non-process-shared robust mutexes (when the owning thread dies), they
are simply forced to be treated as process-shared for now, giving
correct behavior at the expense of performance. this can be fixed by
adding explicit code to pthread_exit to do the right thing for
non-shared robust mutexes in userspace rather than relying on the
kernel to do it, and will be fixed in this way later.

since not all supported kernels have private futex support, the new
code detects EINVAL from the futex syscall and falls back to making
the call without the private flag. no attempt to cache the result is
made; caching it and using the cached value efficiently is somewhat
difficult, and not worth the complexity when the benefits would be
seen only on ancient kernels which have numerous other limitations and
bugs anyway.


# ac31bf27 10-Jun-2014 Rich Felker <dalias@aerifal.cx>

simplify errno implementation

the motivation for the errno_ptr field in the thread structure, which
this commit removes, was to allow the main thread's errno to keep its
address when lazy thread pointer initialization was used. &errno was
evaluated prior to setting up the thread pointer and stored in
errno_ptr for the main thread; subsequently created threads would have
errno_ptr pointing to their own errno_val in the thread structure.

since lazy initialization was removed, there is no need for this extra
level of indirection; __errno_location can simply return the address
of the thread's errno_val directly. this does cause &errno to change,
but the change happens before entry to application code, and thus is
not observable.


# 7356c255 03-Aug-2013 Rich Felker <dalias@aerifal.cx>

fix multiple bugs in SIGEV_THREAD timers

1. the thread result field was reused for storing a kernel timer id,
but would be overwritten if the application code exited or cancelled
the thread.

2. low pointer values were used as the indicator that the timer id is
a kernel timer id rather than a thread id. this is not portable, as
mmap may return low pointers under some conditions. instead, use the fact
that pointers must be aligned and kernel timer ids must be
non-negative to map pointers into the negative integer space.

3. signals were not blocked until after the timer thread started, so a
race condition could allow a signal handler to run in the timer thread
when it's not supposed to exist. this is mainly problematic if the
calling thread was the only thread where the signal was unblocked and
the signal handler assumes it runs in that thread.
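
Item 2 can be sketched as follows. The function names are hypothetical, and the encoding shown (shift right by one, set the sign bit) is one way to realize the idea, not necessarily the exact expression used in the tree. It assumes pointers are at least 2-byte aligned and that converting an out-of-range uintptr_t to intptr_t wraps, as it does on GCC and Clang.

```c
#include <assert.h>
#include <stdint.h>

/* Map an aligned pointer into the negative integer space. */
static intptr_t encode_thread_ptr(void *p)
{
    uintptr_t u = (uintptr_t)p;
    assert((u & 1) == 0);  /* alignment frees the low bit */
    return (intptr_t)((uintptr_t)INTPTR_MIN | (u >> 1));
}

/* Kernel timer ids are non-negative, so they are stored unchanged. */
static intptr_t encode_kernel_id(int kernel_id)
{
    assert(kernel_id >= 0);
    return kernel_id;
}

/* The sign alone tells the two kinds of id apart. */
static int is_thread_ptr(intptr_t id)
{
    return id < 0;
}

/* Undo the encoding: the shift discards the sign bit and restores the low bit. */
static void *decode_thread_ptr(intptr_t id)
{
    return (void *)((uintptr_t)id << 1);
}
```

Unlike a "low pointer value" test, this makes no assumption about which addresses mmap may return.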


# 2c074b0d 26-Apr-2013 Rich Felker <dalias@aerifal.cx>

transition to using functions for internal signal blocking/restoring

there are several reasons for this change. one is getting rid of the
repetition of the syscall signature all over the place. another is
sharing the constant masks without costly GOT accesses in PIC.

the main motivation, however, is accurately representing whether we
want to block signals that might be handled by the application, or all
signals.


# 14a835b3 31-Mar-2013 Rich Felker <dalias@aerifal.cx>

implement pthread_getattr_np

this function is mainly (purely?) for obtaining stack address
information, but we also provide the detach state since it's easy to
do anyway.


# ccc7b4c3 26-Mar-2013 Rich Felker <dalias@aerifal.cx>

remove __SYSCALL_SSLEN arch macro in favor of using public _NSIG

the issue at hand is that many syscalls require as an argument the
kernel-ABI size of sigset_t, intended to allow the kernel to switch to
a larger sigset_t in the future. previously, each arch was defining
this size in syscall_arch.h, which was redundant with the definition
of _NSIG in bits/signal.h. as it's used in some not-quite-portable
application code as well, _NSIG is much more likely to be recognized
and understood immediately by someone reading the code, and it's also
shorter and less cluttered.

note that _NSIG is actually 65/129, not 64/128, but the division takes
care of throwing away the off-by-one part.
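
The arithmetic in the note above, as a sketch. The helper name kernel_sigset_bytes is hypothetical; the values 65 and 129 are the ones given in the message.

```c
#include <assert.h>

/* Kernel sigset_t size in bytes for an _NSIG-style value of nsig. */
static int kernel_sigset_bytes(int nsig)
{
    assert(nsig > 0);
    return nsig / 8;  /* integer division drops the off-by-one remainder */
}
```

So 65/8 yields 8 bytes and 129/8 yields 16 bytes, the same results as 64/8 and 128/8.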


# facc6acb 01-Feb-2013 Rich Felker <dalias@aerifal.cx>

replace __wake function with macro that performs direct syscall

this should generate faster and smaller code, especially with inline
syscalls. the conditional with cnt is ugly, but thankfully cnt is
always a constant anyway so it gets evaluated at compile time. it may
be preferable to make separate __wake and __wakeall macros without a
count argument.

priv flag is not used yet; private futex support still needs to be
done at some point in the future.


# 1e21e78b 11-Nov-2012 Rich Felker <dalias@aerifal.cx>

add support for thread scheduling (POSIX TPS option)

linux's sched_* syscalls actually implement the TPS (thread
scheduling) functionality, not the PS (process scheduling)
functionality which the sched_* functions are supposed to have.
omitting support for the PS option (and having the sched_* interfaces
fail with ENOSYS rather than omitting them, since some broken software
assumes they exist) seems to be the only conforming way to do this on
linux.


# efd4d87a 08-Nov-2012 Rich Felker <dalias@aerifal.cx>

clean up sloppy nested inclusion from pthread_impl.h

this mirrors the stdio_impl.h cleanup. one header which is not
strictly needed, errno.h, is left in pthread_impl.h, because since
pthread functions return their error codes rather than using errno,
nearly every single pthread function needs the errno constants.

in a few places, rather than bringing in string.h to use memset, the
memset was replaced by direct assignment. this seems to generate much
better code anyway, and makes many functions which were previously
non-leaf functions into leaf functions (possibly eliminating a great
deal of bloat on some platforms where non-leaf functions require ugly
prologue and/or epilogue).
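
A sketch of replacing memset with direct assignment. The struct toy_attr and toy_attr_init are hypothetical stand-ins for an attribute object and its init function; a C99 compound literal does the zeroing without a library call.

```c
#include <assert.h>

/* Hypothetical attribute object. */
struct toy_attr {
    unsigned long stacksize;
    unsigned long guardsize;
    int detach;
};

/* Zero the object by assignment instead of memset(a, 0, sizeof *a). */
static int toy_attr_init(struct toy_attr *a)
{
    assert(a);
    *a = (struct toy_attr){0};
    return 0;
}
```

Since no call is made, the function stays a leaf function and does not need string.h.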


# dcd60371 05-Oct-2012 Rich Felker <dalias@aerifal.cx>

support for TLS in dynamic-loaded (dlopen) modules

unlike other implementations, this one reserves memory for new TLS in
all pre-existing threads at dlopen-time, and dlopen will fail with no
resources consumed and no new libraries loaded if memory is not
available. memory is not immediately distributed to running threads;
that would be too complex and too costly. instead, assurances are made
that threads needing the new TLS can obtain it in an async-signal-safe
way from a buffer belonging to the dynamic linker/new module (via
atomic fetch-and-add based allocator).

I've re-appropriated the lock that was previously used for __synccall
(synchronizing set*id() syscalls between threads) as a general
pthread_create lock. it's a "backwards" rwlock where the "read"
operation is safe atomic modification of the live thread count, which
multiple threads can perform at the same time, and the "write"
operation is making sure the count does not increase during an
operation that depends on it remaining bounded (__synccall or dlopen).
in static-linked programs that don't use __synccall, this lock is a
no-op and has no cost.


# 9b153c04 04-Oct-2012 Rich Felker <dalias@aerifal.cx>

beginnings of full TLS support in shared libraries

this code will not work yet because the necessary relocations are not
supported, and cannot be supported without some internal changes to
how relocation processing works (coming soon).


# 2f437040 09-Aug-2012 Rich Felker <dalias@aerifal.cx>

fix (hopefully) all hard-coded 8's for kernel sigset_t size

some minor changes to how hard-coded sets for thread-related purposes
are handled were also needed, since the old object sizes were not
necessarily sufficient. things have gotten a bit ugly in this area,
and i think a cleanup is in order at some point, but for now the goal
is just to get the code working on all supported archs including mips,
which was badly broken by linux rejecting syscalls with the wrong
sigset_t size.


# bbbe87e3 12-Jul-2012 Rich Felker <dalias@aerifal.cx>

fix several locks that weren't updated right for new futex-based __lock

these could have caused memory corruption due to invalid accesses to
the next field. all should be fixed now; I found the errors with fgrep
-r '__lock(&', which is bogus since the argument should be an array.


# 819006a8 09-Jun-2012 Rich Felker <dalias@aerifal.cx>

add pthread_attr_setstack interface (and get)

i originally omitted these (optional, per POSIX) interfaces because i
considered them backwards implementation details. however, someone
later brought to my attention a fairly legitimate use case: allocating
thread stacks in memory that's set up for sharing and/or fast transfer
between CPU and GPU so that the thread can move data to a GPU directly
from automatic-storage buffers without having to go through additional
buffer copies.

perhaps there are other situations in which these interfaces are
useful too.


# 13b3645c 02-Jun-2012 Rich Felker <dalias@aerifal.cx>

increase default thread stack size to 80k

I've been looking for data that would suggest a good default, and
since little has shown up, i'm doing this based on the limited data I
have. the value 80k is chosen to accommodate 64k of application data
(which happens to be the size of the buffer in git that made it crash
without a patch to call pthread_attr_setstacksize) plus the max stack
usage of most libc functions (with a few exceptions like crypt, which
will be fixed soon to avoid excessive stack usage, and [n]ftw, which
inherently uses a fair bit in recursive directory searching).

if further evidence emerges suggesting that the default should be
larger, I'll consider changing it again, but I'd like to avoid it
getting too large to avoid the issues of large commit charge and rapid
address space exhaustion on 32-bit machines.


# 7efd14ec 24-May-2012 Rich Felker <dalias@aerifal.cx>

remove cruft from pthread structure (old cancellation stuff)


# 9ae1cf6d 21-May-2012 Rich Felker <dalias@aerifal.cx>

fix out-of-bounds array access in pthread barriers on 64-bit

it's ok to overlap with integer slot 3 on 32-bit because only slots
0-2 are used on process-local barriers.


# 58aa5f45 03-May-2012 Rich Felker <dalias@aerifal.cx>

overhaul SSP support to use a real canary

pthread structure has been adjusted to match the glibc/GCC abi for
where the canary is stored on i386 and x86_64. it will need variants
for other archs to provide the added security of the canary's entropy,
but even without that it still works as well as the old "minimal" ssp
support. eventually such changes will be made anyway, since they are
also needed for GCC/C11 thread-local storage support (not yet
implemented).

care is taken not to attempt initializing the thread pointer unless
the program actually uses SSP (by reference to __stack_chk_fail).


# 5a2e1809 02-Oct-2011 Rich Felker <dalias@aerifal.cx>

synchronize cond var destruction with exiting waits


# 9cee9307 28-Sep-2011 Rich Felker <dalias@aerifal.cx>

improve pshared barriers

eliminate the sequence number field and instead use the counter as the
futex. because of the way the lock is held, sequence numbers are
completely useless, and this frees up a field in the barrier structure
to be used as a waiter count for the count futex, which lets us avoid
some syscalls in the best case.

as of now, self-synchronized destruction and unmapping should be fully
safe. before any thread can return from the barrier, all threads in
the barrier have obtained the vm lock, and each holds a shared lock on
the barrier. the barrier memory is not inspected after the shared lock
count reaches 0, nor after the vm lock is released.


# 60164570 27-Sep-2011 Rich Felker <dalias@aerifal.cx>

process-shared barrier support, based on discussion with bdonlan

this implementation is rather heavy-weight, but it's the first
solution i've found that's actually correct. all waiters actually wait
twice at the barrier so that they can synchronize exit, and they hold
a "vm lock" that prevents changes to virtual memory mappings (and
blocks pthread_barrier_destroy) until all waiters are finished
inspecting the barrier.

thus, it is safe for any thread to destroy and/or unmap the barrier's
memory as soon as pthread_barrier_wait returns, without further
synchronization.


# 1fa05210 25-Sep-2011 Rich Felker <dalias@aerifal.cx>

fix lost signals in cond vars

due to moving waiters from the cond var to the mutex in bcast, these
waiters upon wakeup would steal slots in the count from newer waiters
that had not yet been signaled, preventing the signal function from
taking any action.

to solve the problem, we simply use two separate waiter counts, so
that the original "total" waiters count is undisturbed by broadcast
and still available for signal.


# 729d6368 25-Sep-2011 Rich Felker <dalias@aerifal.cx>

redo cond vars again, use sequence numbers

testing revealed that the old implementation, while correct, was
giving way too many spurious wakeups due to races changing the value
of the condition futex. in a test program with 5 threads receiving
broadcast signals, the number of returns from pthread_cond_wait was
roughly 3 times what it should have been (2 spurious wakeups for every
legitimate wakeup). moreover, the magnitude of this effect seems to
grow with the number of threads.

the old implementation may also have had some nasty race conditions
with reuse of the cond var with a new mutex.

the new implementation is based on incrementing a sequence number with
each signal event. this sequence number has nothing to do with the
number of threads intended to be woken; it's only used to provide a
value for the futex wait to avoid deadlock. in theory there is a
danger of race conditions due to the value wrapping around after 2^32
signals. it would be nice to eliminate that, if there's a way.

testing showed no spurious wakeups (though they are of course
possible) with the new implementation, as well as slightly improved
performance.


# cba4e1c0 25-Sep-2011 Rich Felker <dalias@aerifal.cx>

new futex-requeue-based pthread_cond_broadcast implementation

this avoids the "stampede effect" where pthread_cond_broadcast would
result in all waiters waking up simultaneously, only to immediately
contend for the mutex and go back to sleep.


# 4b153ac4 22-Sep-2011 Rich Felker <dalias@aerifal.cx>

fix deadlock in condition wait whenever there are multiple waiters

it's amazing none of the conformance tests i've run even bothered to
check whether something so basic works...


# 3f72cdac 18-Sep-2011 Rich Felker <dalias@aerifal.cx>

overhaul clone syscall wrapping

several things are changed. first, i have removed the old __uniclone
function signature and replaced it with the "standard" linux
__clone/clone signature. this was necessary to expose clone to
applications anyway, and it makes it easier to port __clone to new
archs, since it's now testable independently of pthread_create.

secondly, i have removed all references to the ugly ldt descriptor
structure (i386 only) from the c code and pthread structure. in places
where it is needed, it is now created on the stack just when it's
needed, in assembly code. thus, the i386 __clone function takes the
desired thread pointer as its argument, rather than an ldt descriptor
pointer, just like on all other sane archs. this should not affect
applications since there is really no way an application can use clone
with threads/tls in a way that doesn't horribly conflict with and
clobber the underlying implementation's use. applications are expected
to use clone only for creating actual processes, possibly with new
namespace features and whatnot.


# 407d9330 12-Aug-2011 Rich Felker <dalias@aerifal.cx>

pthread and synccall cleanup, new __synccall_wait op

fix up clone signature to match the actual behavior. the new
__synccall_wait function allows a __synccall callback to wait for other
threads to continue without returning, so that it can resume action
after the caller finishes. this interface could be made significantly
more general/powerful with minimal effort, but i'll wait to do that
until it's actually useful for something.


# 50304f2e 03-Aug-2011 Rich Felker <dalias@aerifal.cx>

overhaul rwlocks to address several issues

like mutexes and semaphores, rwlocks suffered from a race condition
where the unlock operation could access the lock memory after another
thread successfully obtained the lock (and possibly destroyed or
unmapped the object). this has been fixed in the same way it was fixed
for other lock types.

in addition, the previous implementation favored writers over readers.
in the absence of other considerations, that is the best behavior for
rwlocks, and posix explicitly allows it. however posix also requires
read locks to be recursive. if writers are favored, any attempt to
obtain a read lock while a writer is waiting for the lock will fail,
causing "recursive" read locks to deadlock. this can be avoided by
keeping track of which threads already hold read locks, but doing so
requires unbounded memory usage, and there must be a fallback case
that favors readers in case memory allocation failed. and all of this
must be synchronized. the cost, complexity, and risk of errors in
getting it right is too great, so we simply favor readers.

tracking of the owner of write locks has been removed, as it was not
useful for anything. it could allow deadlock detection, but it's not
clear to me that returning EDEADLK (which a buggy program is likely to
ignore) is better than deadlocking; at least the latter behavior
prevents further data corruption. a correct program cannot invoke this
situation anyway.

the reader count and write lock state, as well as the "last minute"
waiter flag have all been combined into a single atomic lock. this
means all state transitions for the lock are atomic compare-and-swap
operations. this makes establishing correctness much easier and may
improve performance.
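
A sketch of a lock whose whole state lives in one atomic word and changes only by compare-and-swap. The encoding is an assumption made for illustration (0 is unlocked, TOY_WRLOCKED is write-locked, any other value is a reader count); the waiter flag mentioned above and all futex waiting are omitted, so only the try and unlock paths are shown. It uses C11 atomics rather than the tree's own primitives.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

#define TOY_WRLOCKED 0x7fffffff  /* hypothetical sentinel for "write-locked" */

struct toy_rwlock {
    atomic_int state;  /* 0 = free, TOY_WRLOCKED = writer, else reader count */
};

static int toy_tryrdlock(struct toy_rwlock *rw)
{
    int val = atomic_load(&rw->state);
    do {
        if (val == TOY_WRLOCKED) return EBUSY;       /* writer holds it */
        if (val == TOY_WRLOCKED - 1) return EAGAIN;  /* reader count is full */
    } while (!atomic_compare_exchange_weak(&rw->state, &val, val + 1));
    return 0;
}

static int toy_trywrlock(struct toy_rwlock *rw)
{
    int expected = 0;
    if (atomic_compare_exchange_strong(&rw->state, &expected, TOY_WRLOCKED))
        return 0;
    return EBUSY;
}

static void toy_unlock(struct toy_rwlock *rw)
{
    int val = atomic_load(&rw->state);
    int next;
    assert(val != 0);  /* unlocking a free lock is a caller bug */
    do {
        next = (val == TOY_WRLOCKED) ? 0 : val - 1;
    } while (!atomic_compare_exchange_weak(&rw->state, &val, next));
}
```

Every transition is a single compare-and-swap on one word, so there is no window in which the reader count and the writer state disagree.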

finally, some code duplication has been cleaned up. more is called
for, especially the standard __timedwait idiom repeated in all locks.


# ec381af9 02-Aug-2011 Rich Felker <dalias@aerifal.cx>

unify and overhaul timed futex waits

new features:

- FUTEX_WAIT_BITSET op will be used for timed waits if available. this
saves a call to clock_gettime.

- error checking for the timespec struct is now inside __timedwait so
it doesn't need to be duplicated everywhere. cond_timedwait still
needs to duplicate it to avoid unlocking the mutex, though.

- pushing and popping the cancellation handler is delegated to
__timedwait, and cancellable/non-cancellable waits are unified.
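
for context on the clock_gettime saving: FUTEX_WAIT takes a relative
timeout, while FUTEX_WAIT_BITSET takes an absolute one, so only the
fallback path has to read the clock and subtract. the sketch below shows
the kind of validation and conversion that fallback needs. the function
name and signature are hypothetical, and the current time is passed in
as a parameter (rather than read with clock_gettime) so the arithmetic
can be checked in isolation.

```c
#include <errno.h>
#include <time.h>

/* Convert an absolute deadline 'at' into a relative timeout 'rel',
 * given the current time 'now' on the same clock.
 * Returns 0 on success, EINVAL for a malformed timespec, and
 * ETIMEDOUT if the deadline has already passed. */
int sketch_relative_timeout(const struct timespec *at,
                            const struct timespec *now,
                            struct timespec *rel)
{
    /* centralized sanity check, so callers need not repeat it */
    if (at->tv_nsec < 0 || at->tv_nsec >= 1000000000L)
        return EINVAL;

    rel->tv_sec = at->tv_sec - now->tv_sec;
    rel->tv_nsec = at->tv_nsec - now->tv_nsec;
    if (rel->tv_nsec < 0) {
        /* borrow one second */
        rel->tv_sec--;
        rel->tv_nsec += 1000000000L;
    }
    if (rel->tv_sec < 0)
        return ETIMEDOUT;
    return 0;
}
```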


# dba68bf9 30-Jul-2011 Rich Felker <dalias@aerifal.cx>

add proper futex-based locking for stdio

previously, stdio used spinlocks, which would be unacceptable if we
ever add support for thread priorities, and which yielded
pathologically bad performance if an application attempted to use
flockfile on a key file as a major/primary locking mechanism.

i had held off on making this change for fear that it would hurt
performance in the non-threaded case, but actually support for
recursive locking had already inflicted that cost. by having the
internal locking functions store a flag indicating whether they need
to perform unlocking, rather than using the actual recursive lock
counter, i was able to combine the conditionals at unlock time,
eliminating any additional cost, and also avoid a nasty corner case
where a huge number of calls to ftrylockfile could cause deadlock
later at the point of internal locking.
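
the "flag instead of counter" idea can be sketched roughly as follows.
the internal lock function reports whether it actually took the lock,
and the caller unlocks only if it did, so a thread that already holds
the stream lock pays a single test at unlock time. all names here are
hypothetical, thread ids are passed explicitly, and the busy-retry loop
merely stands in for the futex wait a real implementation would use.

```c
#include <stdatomic.h>

typedef struct {
    _Atomic int owner;  /* tid of the holder, 0 when free */
} sketch_file;

/* Returns 1 if the lock was acquired here (caller must unlock),
 * 0 if the calling thread already owned it (caller must not unlock). */
int sketch_lockfile(sketch_file *f, int tid)
{
    if (atomic_load(&f->owner) == tid)
        return 0;
    int expected = 0;
    while (!atomic_compare_exchange_weak(&f->owner, &expected, tid))
        expected = 0;  /* a real implementation would sleep on a futex */
    return 1;
}

void sketch_unlockfile(sketch_file *f)
{
    atomic_store(&f->owner, 0);
}

/* Shape of an internal stdio operation using the flag. */
void sketch_stream_op(sketch_file *f, int tid)
{
    int need_unlock = sketch_lockfile(f, tid);
    /* ... operate on the stream here ... */
    if (need_unlock)
        sketch_unlockfile(f);
}
```

since the internal path never increments a recursion counter, repeated
internal locking cannot overflow one.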

this commit also fixes some issues with usage of pthread_self
conflicting with __attribute__((const)) which resulted in crashes with
some compiler versions/optimizations, mainly in flockfile prior to
pthread_create.


# acb04806 29-Jul-2011 Rich Felker <dalias@aerifal.cx>

new attempt at making set*id() safe and robust

changing credentials in a multi-threaded program is extremely
difficult on linux because it requires synchronizing the change
between all threads, which have their own thread-local credentials on
the kernel side. this is further complicated by the fact that changing
the real uid can fail due to exceeding RLIMIT_NPROC, making it
possible that the syscall will succeed in some threads but fail in
others.

the old __rsyscall approach being replaced was robust in that it would
report failure if any one thread failed, but in this case, the program
would be left in an inconsistent state where individual threads might
have different uid. (this was not as bad as glibc, which would
sometimes even fail to report the failure entirely!)

the new approach being committed refuses to change real user id when
it cannot temporarily set the rlimit to infinity. this is completely
POSIX conformant since POSIX does not require an implementation to
allow real-user-id changes for non-privileged processes whatsoever.
still, setting the real uid can fail due to memory allocation in the
kernel, but this can only happen if there is not already a cached
object for the target user. thus, we forcibly serialize the syscall
attempts, and fail the entire operation on the first failure. this
*should* lead to an all-or-nothing success/failure result, but it's
still fragile and highly dependent on kernel developers not breaking
things worse than they're already broken.

ideally linux will eventually add a CLONE_USERCRED flag that would
give POSIX conformant credential changes without any hacks from
userspace, and all of this code would become redundant and could be
removed ~10 years down the line when everyone has abandoned the old
broken kernels. i'm not holding my breath...


# 7779dbd2 13-Jun-2011 Rich Felker <dalias@aerifal.cx>

fix race condition in pthread_kill

if thread id was reused by the kernel between the time pthread_kill
read it from the userspace pthread_t object and the time of the tgkill
syscall, a signal could be sent to the wrong thread. the tgkill
syscall was supposed to prevent this race (versus the old tkill
syscall) but it can't; it can only help in the case where the tid is
reused in a different process, but not when the tid is reused in the
same process.

the only solution i can see is an extra lock to prevent threads from
exiting while another thread is trying to pthread_kill them. it should
be very very cheap in the non-contended case.


# f09e78de 13-Jun-2011 Rich Felker <dalias@aerifal.cx>

fix sigset macro for 64-bit systems (<< was overflowing due to wrong type)
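
the commit does not show the macro itself, so the following is only a
generic illustration of this class of bug: a shift performed on a plain
int constant overflows once the bit position reaches the width of int,
and the operand has to be widened before shifting. the helper name is
hypothetical.

```c
#include <stdint.h>

/* Signal numbers start at 1, so signal n occupies bit n-1 of the mask.
 * Writing 1 << (sig - 1) would shift a plain int, which is undefined
 * once sig - 1 reaches the width of int (typically sig > 32).
 * Widening the constant first keeps the shift well defined up to 64. */
uint64_t sketch_sig_bit(int sig)
{
    return UINT64_C(1) << (sig - 1);
}
```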


# 11c531e2 29-May-2011 Rich Felker <dalias@aerifal.cx>

implement uselocale function (minimal)


# 4c4e22d7 07-May-2011 Rich Felker <dalias@aerifal.cx>

optimize compound-literal sigset_t's not to contain useless hurd bits


# 99b8a25e 07-May-2011 Rich Felker <dalias@aerifal.cx>

overhaul implementation-internal signal protections

the new approach relies on the fact that the only ways to create
sigset_t objects without invoking UB are to use the sig*set()
functions, or from the masks returned by sigprocmask, sigaction, etc.
or in the ucontext_t argument to a signal handler. thus, as long as
sigfillset and sigaddset avoid adding the "protected" signals, there
is no way the application will ever obtain a sigset_t including these
bits, and thus no need to add the overhead of checking/clearing them
when sigprocmask or sigaction is called.

note that the old code actually *failed* to remove the bits from
sa_mask when sigaction was called.

the new implementations are also significantly smaller, simpler, and
faster due to ignoring the useless "GNU HURD signals" 65-1024, which
are not used and, if there's any sanity in the world, never will be
used.
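
a minimal sketch of the approach, assuming a 64-bit mask and assuming,
purely for illustration, that signals 32 through 34 are the ones
reserved for the implementation. the sketch_* names are hypothetical.
if the set-building functions refuse to produce the protected bits, no
application-built set can contain them, and nothing needs to be filtered
later.

```c
#include <errno.h>
#include <stdint.h>

#define SKETCH_NSIG            65  /* valid signals are 1..64 */
#define SKETCH_PROTECTED_FIRST 32  /* assumed reserved range, */
#define SKETCH_PROTECTED_LAST  34  /* for illustration only   */

typedef struct { uint64_t bits; } sketch_sigset;

static int sketch_is_protected(int sig)
{
    return sig >= SKETCH_PROTECTED_FIRST && sig <= SKETCH_PROTECTED_LAST;
}

/* Refuses out-of-range and protected signals. */
int sketch_sigaddset(sketch_sigset *set, int sig)
{
    if (sig < 1 || sig >= SKETCH_NSIG || sketch_is_protected(sig)) {
        errno = EINVAL;
        return -1;
    }
    set->bits |= UINT64_C(1) << (sig - 1);
    return 0;
}

/* "Full" means every signal except the protected ones. */
int sketch_sigfillset(sketch_sigset *set)
{
    set->bits = ~UINT64_C(0);
    for (int sig = SKETCH_PROTECTED_FIRST; sig <= SKETCH_PROTECTED_LAST; sig++)
        set->bits &= ~(UINT64_C(1) << (sig - 1));
    return 0;
}
```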


# f16a3089 06-May-2011 Rich Felker <dalias@aerifal.cx>

completely new barrier implementation, addressing major correctness issues

the previous implementation had at least 2 problems:

1. the case where additional threads reached the barrier before the
first wave was finished leaving the barrier was untested and seemed
not to be working.

2. threads leaving the barrier continued to access memory within the
barrier object after other threads had successfully returned from
pthread_barrier_wait. this could lead to memory corruption or crashes
if the barrier object had automatic storage in one of the waiting
threads and went out of scope before all threads finished returning,
or if one thread unmapped the memory in which the barrier object
lived.

the new implementation avoids both problems by making the barrier
state essentially local to the first thread which enters the barrier
wait, and forces that thread to be the last to return.


# feee9890 17-Apr-2011 Rich Felker <dalias@aerifal.cx>

overhaul pthread cancellation

this patch improves the correctness, simplicity, and size of
cancellation-related code. modulo any small errors, it should now be
completely conformant, safe, and resource-leak free.

the notion of entering and exiting cancellation-point context has been
completely eliminated and replaced with alternative syscall assembly
code for cancellable syscalls. the assembly is responsible for setting
up execution context information (stack pointer and address of the
syscall instruction) which the cancellation signal handler can use to
determine whether the interrupted code was in a cancellable state.

these changes eliminate race conditions in the previous generation of
cancellation handling code (whereby a cancellation request received
just prior to the syscall would not be processed, leaving the syscall
to block, potentially indefinitely), and remedy an issue where
non-cancellable syscalls made from signal handlers became cancellable
if the signal handler interrupted a cancellation point.

x86_64 asm is untested and may need a second try to get it right.


# 016a5dc1 13-Apr-2011 Rich Felker <dalias@aerifal.cx>

use a separate signal from SIGCANCEL for SIGEV_THREAD timers

otherwise we cannot support an application's desire to use
asynchronous cancellation within the callback function. this change
also slightly debloats pthread_create.c.


# 82171d6a 09-Apr-2011 Rich Felker <dalias@aerifal.cx>

greatly improve SIGEV_THREAD timers

calling pthread_exit from, or pthread_cancel on, the timer callback
thread will no longer destroy the timer.


# b2486a89 06-Apr-2011 Rich Felker <dalias@aerifal.cx>

move rsyscall out of pthread_create module

this is something of a tradeoff, as now set*id() functions, rather
than pthread_create, are what pull in the code overhead for dealing
with linux's refusal to implement proper POSIX thread-vs-process
semantics. my motivations are:

1. it's cleaner this way, especially cleaner to optimize out the
rsyscall locking overhead from pthread_create when it's not needed.
2. it's expected that only a tiny number of core system programs will
ever use set*id() functions, whereas many programs may want to use
threads, and making thread overhead tiny is an incentive for "light"
programs to try threads.


# b8be64c4 29-Mar-2011 Rich Felker <dalias@aerifal.cx>

optimize timer creation and possibly protect against some minor races

the major idea of this patch is not to depend on having the timer
pointer delivered to the signal handler, and instead use the thread
pointer to get the callback function address and argument. this way,
the parent thread can make the timer_create syscall while the child
thread is starting, and it should never have to block waiting for the
barrier.


# bf619d82 28-Mar-2011 Rich Felker <dalias@aerifal.cx>

major improvements to cancellation handling

- there is no longer any risk of spoofing cancellation requests, since
the cancel flag is set in pthread_cancel rather than in the signal
handler.

- cancellation signal is no longer unblocked when running the
cancellation handlers. instead, pthread_create will cause any new
threads created from a cancellation handler to unblock their own
cancellation signal.

- various tweaks in preparation for POSIX timer support.


# 70c31c7b 29-Mar-2011 Rich Felker <dalias@aerifal.cx>

some preliminaries for adding POSIX timers


# 83b6c9e0 28-Mar-2011 Rich Felker <dalias@aerifal.cx>

remove useless field in pthread struct (wasted a good bit of space)


# 047e434e 17-Mar-2011 Rich Felker <dalias@aerifal.cx>

implement robust mutexes

some of this code should be cleaned up, e.g. using macros for some of
the bit flags, masks, etc. nonetheless, the code is believed to be
working and correct at this point.


# 93cc986a 17-Mar-2011 Rich Felker <dalias@aerifal.cx>

reorder mutex struct fields to make room for pointers (upcoming robust mutexes)

the layout has been chosen so that pointer slots 3 and 4 fit between
the integer slots on 32-bit archs, and come after the integer slots on
64-bit archs.


# b1c43161 16-Mar-2011 Rich Felker <dalias@aerifal.cx>

unify lock and owner fields of mutex structure

this change is necessary to free up one slot in the mutex structure so
that we can use doubly-linked lists in the implementation of robust
mutexes.
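
one way to picture a unified field, as a sketch under stated
assumptions: a single atomic integer holds 0 when the mutex is free and
the owner's thread id otherwise, so acquiring the lock and recording the
owner are the same compare-and-swap. the names, the explicit tid
parameter, and the error-checking behavior shown are hypothetical and
not taken from the commit.

```c
#include <errno.h>
#include <stdatomic.h>

typedef struct {
    _Atomic int lock;  /* 0 = unlocked, otherwise tid of the owner */
} sketch_mutex;

/* Try to take the mutex on behalf of thread 'tid' (must be nonzero). */
int sketch_mutex_trylock(sketch_mutex *m, int tid)
{
    int expected = 0;
    if (atomic_compare_exchange_strong(&m->lock, &expected, tid))
        return 0;
    /* on failure 'expected' holds the current owner */
    return expected == tid ? EDEADLK : EBUSY;
}

/* Only the owner may unlock; the same word proves ownership. */
int sketch_mutex_unlock(sketch_mutex *m, int tid)
{
    int expected = tid;
    return atomic_compare_exchange_strong(&m->lock, &expected, 0)
        ? 0 : EPERM;
}
```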


# 5fcebcde 10-Mar-2011 Rich Felker <dalias@aerifal.cx>

optimize pthread termination in the non-detached case

we can avoid blocking signals by simply using a flag to mark that the
thread has exited and prevent it from getting counted in the rsyscall
signal-pingpong. this restores the original pthread create/join
throughput from before the sigprocmask call was added.


# 4820f926 08-Mar-2011 Rich Felker <dalias@aerifal.cx>

fix and optimize non-default-type mutex behavior

problem 1: mutex type from the attribute was being ignored by
pthread_mutex_init, so recursive/errorchecking mutexes were never
being used at all.

problem 2: ownership of recursive mutexes was not being enforced at
unlock time.


# 5fd4a981 07-Mar-2011 Rich Felker <dalias@aerifal.cx>

use the selected clock from the condattr for pthread_cond_timedwait


# e8827563 17-Feb-2011 Rich Felker <dalias@aerifal.cx>

reorganize pthread data structures and move the definitions to alltypes.h

this allows sys/types.h to provide the pthread types, as required by
POSIX. this design also facilitates forcing ABI-compatible sizes in
the arch-specific alltypes.h, while eliminating the need for
developers changing the internals of the pthread types to poke around
with arch-specific headers they may not be able to test.


# 7b2dd223 15-Feb-2011 Rich Felker <dalias@aerifal.cx>

finish unifying thread register handling in preparation for porting


# 0b2006c8 15-Feb-2011 Rich Felker <dalias@aerifal.cx>

begin unifying clone/thread management interface in preparation for porting


# 0b44a031 11-Feb-2011 Rich Felker <dalias@aerifal.cx>

initial check-in, version 0.5.0