density
C++11 library for paged memory management, function queues, heterogeneous queues and lifo memory management
|
Density is a C++ library providing very efficient concurrent heterogeneous queues, non-concurrent stacks, and a page-based memory manager.
Heterogeneous queues can store objects of any type:
This table shows all the available queues:
concurrency strategy | function queue | heterogeneous queue | Consumers cardinality | Producers cardinality |
---|---|---|---|---|
single threaded | function_queue | heter_queue | - | - |
locking | conc_function_queue | conc_heter_queue | multiple | multiple |
lock-free | lf_function_queue | lf_heter_queue | single or multiple | single or multiple |
spin-locking | sp_function_queue | sp_heter_queue | single or multiple | single or multiple |
Elements are stored linearly in memory pages, achieving a great memory locality. Furthermore allocating an element most times means just adding the size to allocate to a pointer.
density provides a lifo_allocator built upon the page allocator, and a thread-local lifo memory pool (the data-stack), which has roughly the same performance of the unsafe, C-ish and non-standard alloca. The data stack can only be used indirectly with lifo_array and lifo_buffer.
This library is tested against these compilers:
The data-stack is a thread-local paged memory pool dedicated to lifo allocations. The lifo ordering implies that a thread may reallocate or deallocate only the most recently allocated living block. A violation of this constraint causes undefined behavior. The data stack is actually composed by a set of memory pages and an internal thread-local pointer (the top of the stack). The underlying algorithm is very simple: allocations just add the requested size to the top pointer, and deallocations set the top pointer to the block to deallocate. In case of page switch, the execution jumps to a non-inlined slow path. The data stack has constant initialization, so it doesn't slow down thread creation or require dynamic any initialization guard by the compiler on access. The first time a thread uses the data stack it allocates the first memory page.
The data stack can be accessed only indirectly, with lifo_array and lifo_buffer.
lifo_array
is the easiest and safest way of using the data-stack. A lifo_array
is very similar to a raw array, but its size is not a compile time constant. The elements are allocated on the data-stack, rather than on the call-stack, so there is no risk of stack overflow. In case of out of system memory, a std::bad_alloc
is thrown.
To avoid breaking the lifo constraint and therefore causing the undefined behavior, lifo_array
s should be instantiated only in the automatic storage (locally in a function or code block). Following this simple rule there is no way to way to break the lifo constraint.
Actually it is possible instantiating a lifo_array
anywhere, as long as the lifo constraint on the calling thread is not broken. The C++ language is very LIFO friendly, as members of structs and elements of arrays are destroyed in the reverse order they are constructed. Anyway this may be dangerous, so it's not recommended.
Just like built-in arrays and std::array
, lifo_array
does not initialize elements if they have POD type, unless an explicit initialization value is provided to the constructor. This is a big difference with std::vector
.
A lifo_buffer
allocates on the data stack an untyped raw memory block with dynamic size. Unlike lifo_array
it supports resizing, but only on the most recently instantiated living instance, and only if a more recent living lifo_array
doesn't exist. If the resize changes the address of the block, the surviving content is preserved memcpy
'ing it.
Internally the data stack is a thread-local instance of lifo_allocator, a class template that provides lifo memory management. This class can be used to exploit performances of lifo memory in other contexts, but it is a low-level class: it should be wrapped in some data structure, rather than used directly.
A function queue is an heterogeneous FIFO pseudo-container that stores callable objects, each with a different type, but all invokable with the same signature. Fundamentally a function queue is a queue of std::function
-like objects that uses an in-page linear allocation for the storage of the captures.
If the return type of the signature is void
, the function try_consume
returns a boolean indicating whether a function has been called. Otherwise it returns an optional
containing the return value of the function, or empty if no function was present. Any callable object can be pushed on the queue: a lambda, the result of an std::bind
, a (possibly local) class defining an operator (), a pointer to member function or variable. The actual signature of the callable does not need to match the one of the queue, as long as an implicit conversion exists:
All queues in density have a very simple implementation: a tail pointer, an head pointer, and a set of memory pages. Paging is a kind of quantization of memory: all pages have a fixed size (by default slightly less than 64 kibibyte) and a fixed alignment (by default 64 kibibyte), so that the allocator can be simple and efficient. Values are arranged in the queue by linear allocation: allocating a value means just adding its size to the tail pointer. If the new tail pointer would point outside the last page, a new page is allocated, and the linear allocation is performed from there. In case of very huge values, only a pointer is allocated in the pages, and the actual storage for the value is allocated in the heap.
The functions function_queue::push
and function_queue::emplace
append a callable at the end of the queue. This is the quick way of doing a put transaction. We can have more control breaking it using the start_ put functions:
The start_* put functions return a put_transaction by which the caller can:
put_transaction
becomes empty).put_transaction
becomes empty).char
s, float
s) that are linearly allocated in the pages, right after the last allocated value. There is no function to deallocate raw blocks: they are automatically deallocated after the associated element has been canceled or consumed.When a non-empty put_transaction
is destroyed, the bound transaction is canceled. As a consequence, if the raw_allocate_copy
in the code above throws an exception, the transaction is discarded with no side effects.
Raw memory blocks are handled in the same way of canceled and consumed values (they are referred as dead values). Internally the storage of dead values is deallocated when the whole page is returned to the page allocator, but this is an implementation detail. When a consume is committed, the head pointer is advanced, skipping any dead element.
Internally instant puts are implemented in terms of transactional puts, so there is no performance difference between them:
... = omissis
template <...> void emplace(...)
{
start_emplace(...).commit();
}
Consume operations have the start_*
variant in heterogeneous queues (but not in function queues). Anyway this operation is not a transaction, as the element disappears from the queue when the operation starts, and will reappear if the operation is canceled.
During a put or a consume operation an heterogeneous or function queue is not in a consistent state for the caller thread. So accessing the queue in any way, in the between of a start_* function and the cancel/commit, causes undefined behavior. Anyway, especially for consumers, reentrancy is sometimes necessary: a callable object, during the invocation, may need to push another callable object to the same queue. For every put or consume function, in every queue, there is a reentrant variant.
If an operation is not reentrant the implementation can do some optimizations: single-thread queues start puts in the committed state, so that the commit is a no-operation. Furthermore reentrancy can affect thread synchronization: conc_function_queue
and conc_heter_queue
, when executing a non-reentrant operation, lock their internal mutex when starting, and unlock it when committing or canceling. In contrast, when they execute a reentrant operation, they have to lock and unlock when starting, and lock and unlock again when committing or canceling. The internal mutex is not recursive, so if a thread starts a non-reentrant operation and then tries to access the queue, it causes undefined behavior.
Function queues use type erasure to handle callable objects of heterogeneous types. By default two operations are captured for each element type: invoke-destroy, and just-destroy. The third template parameter of all function queues is an enum of type function_type_erasure that controls the type erasure: the value function_manual_clear excludes the second operation, so that the function queue will not be able to destroy a callable without invoking it. This gives a performance benefit, at the price that the queue can't be cleared, and that the user must ensure that the queue is empty when destroyed. Internally, in manual-clean mode, the layout of a value in the function queue is composed by:
So if you put a capture-less lambda or a pointer to a function, you are advancing the tail pointer by the space required by 2 pointers. Anyway lock-free queues and spin-locking queues align their values to density::destructive_interference_size, so they are less dense than the other queues.
All queues but function_queue
and heter_queue
are concurrency enabled. By default they allow multiple producers and multiple consumers. The class templates lf_function_queue, lf_hetr_queue, sp_function_queue and sp_hetr_queue allow to specify, with 2 independent template arguments of type concurrency_cardinality, whether multiple threads are allowed to produce, and whether multiple threads are allowed to consume:
When dealing with a multiple producers, the tail pointer is an atomic variable. Otherwise it is a plain variable. When dealing with a multiple consumers, the head pointer is an atomic variable. Otherwise it is a plain variable.
The class templates lf_function_queue and lf_hetr_queue allow a further optimization with a template argument of type consistency_model: by default the queue is sequential consistent (that is all threads observe the operations happening in the same order). If consistency_relaxed is specified, this guarantee is removed, with a great performance benefit.
For all queues, the functions try_consume
and try_reentrant_consume
have 2 variants:
There is no functional difference between the two consumes. Anyway, currently only for lock-free and spin-locking queues supporting multi-producers, the second consume can be much faster. The reason has to do with the way they ensure that a consumer does not deallocate a page while another consumer is reading it. When a consumer needs to access a page, it increments a ref-count in the page (it pins the page), to notify to the allocator that it is using it. When it has finished, the consumer decrements the ref-count (it unpins the page). If a thread performs many consecutive consumes, it will ends up doing many atomic increments and decrements of the same page (which is a somewhat expensive operation). Since the pin logic is encapsulated in the consume_operation
, if the consumer thread keeps the consume_operation
alive, pinning and unpinning will be performed only in case of page switch. Note: a forgotten consume_operation
which has pinned a page prevents the page from being recycled by the page allocator, even if it was deallocated by another consumer.
Every function queue is actually an adaptor for the corresponding heterogeneous pseudo-container. Heterogeneous queues have put and consume functions, just like function queues, but elements are not required to be callable objects.
The first 3 template parameters are the same for all the heterogeneous queues.
Template parameter | Meaning | Constraints | Default |
---|---|---|---|
typename COMMON_TYPE | Common type of elements | Must decay to itself (see std::decay) | void |
typename RUNTIME_TYPE | Type eraser type | Must model RuntimeType | runtime_type |
typename ALLOCATOR_TYPE | Allocator | Must model both PageAllocator and UntypedAllocator | default_allocator |
An element can be pushed on a queue if its address is is implicitly convertible to COMMON_TYPE*
. By default any type is allowed in the queue. The RUNTIME_TYPE
type allows much more customization than the function_type_erasure template parameter of function queues. Even using the built-in runtime_type you can select which operations the elements of the queue should support, and add your own.
Output:
This document describes some of the internals of density.
RuntimeType | runtime_type TypeFeature | f_size, f_alignment, f_copy_construct, f_hash, f_rtti UntypedAllocator | default_allocator, basic_default_allocator PagedAllocator | default_allocator, basic_default_allocator