Density Overview

Density is a C++ library providing very efficient concurrent heterogeneous queues, non-concurrent stacks, and a page-based memory manager.

Heterogeneous queues can store objects of any type:

    // lock-free queue
    lf_heter_queue<> lf_queue;
    lf_queue.push(42);
    lf_queue.emplace<std::complex<double>>(1, 2);
    using type = runtime_type<default_type_features, f_ostream>;
    // non-concurrent queue
    heter_queue<type> queue;
    queue.push(42);
    queue.emplace<std::string>("abc");
    queue.emplace<std::complex<double>>(1.2, 3.4);
    for (const auto & val : queue)
        std::cout << val << std::endl;

This table shows all the available queues:

concurrency strategy	function queue	heterogeneous queue	Consumers cardinality	Producers cardinality
single threaded	function_queue	heter_queue	-	-
locking	conc_function_queue	conc_heter_queue	multiple	multiple
lock-free	lf_function_queue	lf_heter_queue	single or multiple	single or multiple
spin-locking	sp_function_queue	sp_heter_queue	single or multiple	single or multiple

Elements are stored linearly in memory pages, achieving a great memory locality. Furthermore allocating an element most times means just adding the size to allocate to a pointer.

density provides a lifo_allocator built upon the page allocator, and a thread-local lifo memory pool (the data-stack), which has roughly the same performance of the unsafe, C-ish and non-standard alloca. The data stack can only be used indirectly with lifo_array and lifo_buffer.

This library is tested against these compilers:

Msc (Visual Studio 2017)
g++-4.9 and g++-7
clang++-4.0 and clang++-5.0

Github Repository

The data stack

The data-stack is a thread-local paged memory pool dedicated to lifo allocations. The lifo ordering implies that a thread may reallocate or deallocate only the most recently allocated living block. A violation of this constraint causes undefined behavior. The data stack is actually composed by a set of memory pages and an internal thread-local pointer (the top of the stack). The underlying algorithm is very simple: allocations just add the requested size to the top pointer, and deallocations set the top pointer to the block to deallocate. In case of page switch, the execution jumps to a non-inlined slow path. The data stack has constant initialization, so it doesn't slow down thread creation or require dynamic any initialization guard by the compiler on access. The first time a thread uses the data stack it allocates the first memory page.

The data stack can be accessed only indirectly, with lifo_array and lifo_buffer.

lifo_array is the easiest and safest way of using the data-stack. A lifo_array is very similar to a raw array, but its size is not a compile time constant. The elements are allocated on the data-stack, rather than on the call-stack, so there is no risk of stack overflow. In case of out of system memory, a std::bad_alloc is thrown.

    void concat_and_print(const char * i_str_1, const char * i_str_2)
    {
        using namespace density;
        auto const len_1 = strlen(i_str_1);
        auto const len_2 = strlen(i_str_2);
        lifo_array<char> string(len_1 + len_2 + 1);
        memcpy(string.data(), i_str_1, len_1);
        memcpy(string.data() + len_1, i_str_2, len_2);
        string[len_1 + len_2] = 0;
        std::cout << string.data() << std::endl;
    }

To avoid breaking the lifo constraint and therefore causing the undefined behavior, lifo_arrays should be instantiated only in the automatic storage (locally in a function or code block). Following this simple rule there is no way to way to break the lifo constraint.

Actually it is possible instantiating a lifo_array anywhere, as long as the lifo constraint on the calling thread is not broken. The C++ language is very LIFO friendly, as members of structs and elements of arrays are destroyed in the reverse order they are constructed. Anyway this may be dangerous, so it's not recommended.

            struct MyStruct
            {
                lifo_array<std::string> m_strings{6};
                lifo_array<std::string> m_other_strings{6};
            };
            struct MyStruct1
            {
                lifo_array<MyStruct> m_structs{6};
                lifo_array<MyStruct> m_other_structs{6};
            };
            // In C++ array elements and struct members have lifo-compliant lifetime
            lifo_array<MyStruct> structs{10};
            // Still legal, but don't go too far
            lifo_array<std::unique_ptr<MyStruct1>> other_structs{10};
            std::generate(other_structs.begin(), other_structs.end(), []() {
                return std::unique_ptr<MyStruct1>(new MyStruct1);
            });

Just like built-in arrays and std::array, lifo_array does not initialize elements if they have POD type, unless an explicit initialization value is provided to the constructor. This is a big difference with std::vector.

A lifo_buffer allocates on the data stack an untyped raw memory block with dynamic size. Unlike lifo_array it supports resizing, but only on the most recently instantiated living instance, and only if a more recent living lifo_array doesn't exist. If the resize changes the address of the block, the surviving content is preserved memcpy'ing it.

    void func(size_t i_size)
    {
        using namespace density;
        lifo_buffer buffer_1(i_size);
        assert(buffer_1.size() == i_size);
        lifo_buffer buffer_2; // now buffer_1 can't be resized until buffer_2 is destroyed
        assert(buffer_2.size() == 0);
        auto mem = buffer_2.resize(sizeof(int));
        assert(mem == buffer_2.data());
        *static_cast<int *>(mem) = 5;
        mem = buffer_2.resize(sizeof(int) * 20);
        assert(*static_cast<int *>(mem) == 5);
        lifo_array<int> other_numbers(7);
        // buffer_2.resize(20); <---- violation of the lifo constraint, other_numbers is more recent!
    }

Internally the data stack is a thread-local instance of lifo_allocator, a class template that provides lifo memory management. This class can be used to exploit performances of lifo memory in other contexts, but it is a low-level class: it should be wrapped in some data structure, rather than used directly.

About function queues

A function queue is an heterogeneous FIFO pseudo-container that stores callable objects, each with a different type, but all invokable with the same signature. Fundamentally a function queue is a queue of std::function-like objects that uses an in-page linear allocation for the storage of the captures.

            // put a lambda
            function_queue<void()> queue;
            queue.push([] { std::cout << "Printing..." << std::endl; });
            // we can have a capture of any size
            double pi = 3.1415;
            queue.push([pi] { std::cout << pi << std::endl; });
            // now we execute the functions
            int executed = 0;
            while (queue.try_consume())
                executed++;

If the return type of the signature is void, the function try_consume returns a boolean indicating whether a function has been called. Otherwise it returns an optional containing the return value of the function, or empty if no function was present. Any callable object can be pushed on the queue: a lambda, the result of an std::bind, a (possibly local) class defining an operator (), a pointer to member function or variable. The actual signature of the callable does not need to match the one of the queue, as long as an implicit conversion exists:

function_queue<std::string(const char * i_prefix)> queue;

queue.push([](std::string i_prefix) { return i_prefix + "..."; });

All queues in density have a very simple implementation: a tail pointer, an head pointer, and a set of memory pages. Paging is a kind of quantization of memory: all pages have a fixed size (by default slightly less than 64 kibibyte) and a fixed alignment (by default 64 kibibyte), so that the allocator can be simple and efficient. Values are arranged in the queue by linear allocation: allocating a value means just adding its size to the tail pointer. If the new tail pointer would point outside the last page, a new page is allocated, and the linear allocation is performed from there. In case of very huge values, only a pointer is allocated in the pages, and the actual storage for the value is allocated in the heap.

Transactional puts and raw allocations

The functions function_queue::push and function_queue::emplace append a callable at the end of the queue. This is the quick way of doing a put transaction. We can have more control breaking it using the start_ put functions:

            struct Message
            {
                const char * m_text;
                void operator()() { std::cout << m_text << std::endl; }
            };
            function_queue<void()> queue;
            auto transaction             = queue.start_emplace<Message>();
            transaction.element().m_text = transaction.raw_allocate_copy("Hello world");
            transaction.commit();
            bool const invoked = queue.try_consume();
            assert(invoked);

The start_* put functions return a put_transaction by which the caller can:

access the object being pushed before it becomes observable
commit the transaction so that the element becomes observable to consumers (the put_transaction becomes empty).
cancel the transaction, in which case the transaction is discarded with no observable side effects (the put_transaction becomes empty).
allocate raw memory blocks, that is uninitialized memory, or arrays of a trivially destructible type (chars, floats) that are linearly allocated in the pages, right after the last allocated value. There is no function to deallocate raw blocks: they are automatically deallocated after the associated element has been canceled or consumed.

When a non-empty put_transaction is destroyed, the bound transaction is canceled. As a consequence, if the raw_allocate_copy in the code above throws an exception, the transaction is discarded with no side effects.

Raw memory blocks are handled in the same way of canceled and consumed values (they are referred as dead values). Internally the storage of dead values is deallocated when the whole page is returned to the page allocator, but this is an implementation detail. When a consume is committed, the head pointer is advanced, skipping any dead element.

Internally instant puts are implemented in terms of transactional puts, so there is no performance difference between them:

... = omissis template <...> void emplace(...) { start_emplace(...).commit(); }

Consume operations have the start_* variant in heterogeneous queues (but not in function queues). Anyway this operation is not a transaction, as the element disappears from the queue when the operation starts, and will reappear if the operation is canceled.

Reentrancy

During a put or a consume operation an heterogeneous or function queue is not in a consistent state for the caller thread. So accessing the queue in any way, in the between of a start_* function and the cancel/commit, causes undefined behavior. Anyway, especially for consumers, reentrancy is sometimes necessary: a callable object, during the invocation, may need to push another callable object to the same queue. For every put or consume function, in every queue, there is a reentrant variant.

                conc_function_queue<void(), default_allocator, ERASURE> queue;
                auto func1 = [&queue] {
                    std::cout << (queue.empty() ? "The queue is empty" : "The queue is not empty")
                              << std::endl;
                };
                auto func2 = [&queue, func1] { queue.push(func1); };
                queue.push(func1);
                queue.push(func2);
                /* The callable objects we are going to invoke will access the queue, so we
                    must use a reentrant consume. Note: during the invoke of the last function
                    the queue is empty to any observer. */
                while (queue.try_reentrant_consume())
                    ;
                // Output:
                // The queue is not empty
                // The queue is empty

If an operation is not reentrant the implementation can do some optimizations: single-thread queues start puts in the committed state, so that the commit is a no-operation. Furthermore reentrancy can affect thread synchronization: conc_function_queue and conc_heter_queue, when executing a non-reentrant operation, lock their internal mutex when starting, and unlock it when committing or canceling. In contrast, when they execute a reentrant operation, they have to lock and unlock when starting, and lock and unlock again when committing or canceling. The internal mutex is not recursive, so if a thread starts a non-reentrant operation and then tries to access the queue, it causes undefined behavior.

Relaxed guarantees

Function queues use type erasure to handle callable objects of heterogeneous types. By default two operations are captured for each element type: invoke-destroy, and just-destroy. The third template parameter of all function queues is an enum of type function_type_erasure that controls the type erasure: the value function_manual_clear excludes the second operation, so that the function queue will not be able to destroy a callable without invoking it. This gives a performance benefit, at the price that the queue can't be cleared, and that the user must ensure that the queue is empty when destroyed. Internally, in manual-clean mode, the layout of a value in the function queue is composed by:

an overhead pointer (that points to the next value, and keeps the state of the value in the least significant bits)
the runtime type, actually a pointer to the invoke-destroy function
the eventual capture

So if you put a capture-less lambda or a pointer to a function, you are advancing the tail pointer by the space required by 2 pointers. Anyway lock-free queues and spin-locking queues align their values to density::destructive_interference_size, so they are less dense than the other queues.

All queues but function_queue and heter_queue are concurrency enabled. By default they allow multiple producers and multiple consumers. The class templates lf_function_queue, lf_hetr_queue, sp_function_queue and sp_hetr_queue allow to specify, with 2 independent template arguments of type concurrency_cardinality, whether multiple threads are allowed to produce, and whether multiple threads are allowed to consume:

            // single producer, multiple consumers:
            using Lf_SpMc_FuncQueue = lf_function_queue<
              void(),
              default_allocator,
              function_standard_erasure,
              concurrency_single,
              concurrency_multiple>;
            // multiple consumers, single producer:
            using Lf_MpSc_FuncQueue = lf_function_queue<
              void(),
              default_allocator,
              function_standard_erasure,
              concurrency_multiple,
              concurrency_single>;
            // multiple producer, multiple consumers (the default):
            using Lf_MpMc_FuncQueue = lf_function_queue<void()>;

When dealing with a multiple producers, the tail pointer is an atomic variable. Otherwise it is a plain variable. When dealing with a multiple consumers, the head pointer is an atomic variable. Otherwise it is a plain variable.

The class templates lf_function_queue and lf_hetr_queue allow a further optimization with a template argument of type consistency_model: by default the queue is sequential consistent (that is all threads observe the operations happening in the same order). If consistency_relaxed is specified, this guarantee is removed, with a great performance benefit.

For all queues, the functions try_consume and try_reentrant_consume have 2 variants:

one returning a consume operation. If the queue was empty, an empty consume is returned.
one taking a reference to a consume as parameter and returning a boolean, recommended if a thread performs many consecutive consumes.

There is no functional difference between the two consumes. Anyway, currently only for lock-free and spin-locking queues supporting multi-producers, the second consume can be much faster. The reason has to do with the way they ensure that a consumer does not deallocate a page while another consumer is reading it. When a consumer needs to access a page, it increments a ref-count in the page (it pins the page), to notify to the allocator that it is using it. When it has finished, the consumer decrements the ref-count (it unpins the page). If a thread performs many consecutive consumes, it will ends up doing many atomic increments and decrements of the same page (which is a somewhat expensive operation). Since the pin logic is encapsulated in the consume_operation, if the consumer thread keeps the consume_operation alive, pinning and unpinning will be performed only in case of page switch. Note: a forgotten consume_operation which has pinned a page prevents the page from being recycled by the page allocator, even if it was deallocated by another consumer.

Heterogeneous queues

Every function queue is actually an adaptor for the corresponding heterogeneous pseudo-container. Heterogeneous queues have put and consume functions, just like function queues, but elements are not required to be callable objects.

            heter_queue<> queue;
            queue.push(19);                     // the parameter can be an l-value or an r-value
            queue.emplace<std::string>(8, '*'); // pushes "********"

The first 3 template parameters are the same for all the heterogeneous queues.

Template parameter	Meaning	Constraints	Default
typename `COMMON_TYPE`	Common type of elements	Must decay to itself (see std::decay)	`void`
typename `RUNTIME_TYPE`	Type eraser type	Must model RuntimeType	runtime_type
typename `ALLOCATOR_TYPE`	Allocator	Must model both PageAllocator and UntypedAllocator	default_allocator

An element can be pushed on a queue if its address is is implicitly convertible to COMMON_TYPE*. By default any type is allowed in the queue. The RUNTIME_TYPE type allows much more customization than the function_type_erasure template parameter of function queues. Even using the built-in runtime_type you can select which operations the elements of the queue should support, and add your own.

    /* This feature calls an update function on any object. The update has not to be virtual, as
        type erasure already is a kind of virtualization. */
    struct feature_call_update
    {
        void (*m_func)(void * i_object, float i_elapsed_time);
        void operator()(void * i_object, float i_elapsed_time) const
        {
            m_func(i_object, i_elapsed_time);
        }
        template <typename TARGET_TYPE> constexpr static feature_call_update make() noexcept
        {
            return feature_call_update{&invoke<TARGET_TYPE>};
        }
      private:
        template <typename TARGET_TYPE>
        static void invoke(void * i_object, float i_elapsed_time) noexcept
        {
            static_cast<TARGET_TYPE *>(i_object)->update(i_elapsed_time);
        }
    };

            struct ObjectA
            {
                void update(float i_elapsed_time)
                {
                    std::cout << "ObjectA::update(" << i_elapsed_time << ")" << std::endl;
                }
            };
            struct ObjectB
            {
                void update(float i_elapsed_time)
                {
                    std::cout << "ObjectB::update(" << i_elapsed_time << ")" << std::endl;
                }
            };
            // concatenates feature_call_update to the default features (f_destroy, f_size, f_alignment)
            using MyFeatures = feature_list<default_type_features, feature_call_update>;
            // create a queue with 3 objects
            heter_queue<runtime_type<MyFeatures>, default_allocator> my_queue;
            my_queue.emplace<ObjectA>();
            my_queue.emplace<ObjectB>();
            my_queue.emplace<ObjectB>();
            // call update on all the objects
            auto const end_it = my_queue.end();
            for (auto it = my_queue.begin(); it != end_it; ++it)
            {
                auto const update_func = it->type().get_feature<feature_call_update>();
                update_func(it->address(), 1.f / 60.f);
            }

Output:

ObjectA::update(0.0166667)
ObjectB::update(0.0166667)
ObjectB::update(0.0166667)

Implementation details

This document describes some of the internals of density.

Named requirements

RuntimeType | runtime_type TypeFeature | f_size, f_alignment, f_copy_construct, f_hash, f_rtti UntypedAllocator | default_allocator, basic_default_allocator PagedAllocator | default_allocator, basic_default_allocator

Benchmarks

function_queue_benchmarks

lifo_array_benchmarks