# A small toy-sized lab for vectorization of Many-body codes pattern This repository contains the shared work done by (some) participants in the [SPACE CoE](https://www.space-coe.eu/) in experimenting the vectorization achieved when different data layouts (AoS, SoA, SoAos) are adopted. What we address here is the many-body problem, i.e. the pattern from codes that solve the gravitational interaction among particles that are _not_ “true bodies” but instead represent matter samples in the phase-space. A such, they do not need to deploy highest order $\propto N^2$ integration methods to correclty solve close encounters. At odds, to speed-up the calculation approximate methods can be used while obtaining an adequately small error. Typically, approximate methods are coupled with a tree structure and a Barnes&Hut algorithm. Hence, the contribution to the force of the “closest” neighbouring particles is accounted in direct summation, as in N-Body codes, while the contribution of “distant” particles is accounted using different approximation (multipole expansion, PM, …). Here we focus on the loop over the close particles, that resembles the N-body typical loop but for the ffollowing facts: - the array of neighbours is in general unique per each particles, by definition; then, every particle loops over a different array of partners - the fact that $n_i$ particles are neighbour of a target particle does not mean that they are close in memory; on the contrary, the very likely pattern is that looping over the neighbours results in memory jumps - a loop of the kind ```c for ( int = 0; i < Nneighbours; i++ ) { int k = neighbours_list[k]; force += contributtion( Particles[k] ); } ``` is difficult to be vectorized, and definitely not automatically vectorizable by the compilers. The main focus of the toy codes here is to reprodice the pattern aforementioned: - $N_p$ particles are randomly generated - A subset of them, of size $N_A$, will be considered “active” - it means that in a real code, in which the timesteps are not all equal, they would need to calculate the force, update velocities, etc. - Every active particles $i$ is assigned with $N_n^i$ neighbours (where $N_n^i \simeq \pm 10\%$) and has to calculate the $\sum_j1/r_{ij}^2$ contribution. We are experimenting 2 data layouts SoA and SoAos, using $(i)$ a **plain** implementation, $(ii)$ using **builtin vector types** nowadays supported by all the compilers and $(iii)$ **using vector intrinsics**. Since the memory toll due to the non-cache friendly pattern, we compare 3 different solutions: **v1**. trying to hide the memory latency by prefetching of “next” neigbours **v2**. copying all the neighbours into a memory buffer so that the loop is then linear on consequent memory addresses (of course could hardly been done in a real code) **v3**. evolving the v2 in a more realistic way, having a thread that fetches neighbours particles in a limited buffer while other threads process them