

# Some GPU-based algorithms

- Taken from a particle-based fluid simulation
- Constraint based
  - Calculate a constraint violation for each particle
  - Violation is based on neighboring particles
  - Use violations to create a system of equations
  - Solve the system to find restoration forces
  - Update particle state

## Required operations

- Collision detection between single-sized spheres
- Vector operations
- Sparse matrix operations
- Time integration

## Collision detection

- Based on spatial hashing in a uniform grid
- Implemented in four steps
  1. Calculate a hash value for each particle
  2. Sort the hash values
  3. Reorder particles to match the new ordering of the hash list
  4. Particle-particle near-field tests

## Collision detection: ReorderParticlesAndFindCellBoundaries

- Requires cooperation between threads
- One thread per particle
- Read hash value to shared memory
- Compare with the neighbor's hash
- If different, we found the start of a cell
- Read particle from the index found in the hash list
- Write particle to the thread ID index

OpenCL version:

```c
void reorderDataAndFindCellStartD(...)
{
    uint global_tid = get_global_id(0);
    uint local_tid  = get_local_id(0);

    // hash.x is the cell hash, hash.y the particle's index before sorting
    uint2 hash = particleHash[global_tid];

    // Slot 0 is reserved for the hash of the particle before this work-group
    sharedHash[local_tid+1] = hash.x;
    if (local_tid == 0 && global_tid > 0) {
        uint2 prevData = particleHash[global_tid-1];
        sharedHash[0] = prevData.x;
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    // A hash different from the previous particle's marks a cell boundary
    if (global_tid == 0 || hash.x != sharedHash[local_tid]) {
        cellStart[hash.x] = global_tid;
        if (global_tid > 0)
            cellEnd[sharedHash[local_tid]] = global_tid;
    }
    if (global_tid == numParticles - 1) {
        cellEnd[hash.x] = global_tid + 1;
    }

    // Gather the particle from its old position into sorted order
    float4 pos = particlePositions[hash.y];
    float4 vel = particleVelocities[hash.y];
    sortedPos[global_tid] = pos;
    sortedVel[global_tid] = vel;
}
```

CUDA version:

```cuda
void reorderDataAndFindCellStartD(...)
{
    uint index = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;

    // hash.x is the cell hash, hash.y the particle's index before sorting
    uint2 hash = particleHash[index];

    // Slot 0 is reserved for the hash of the particle before this block
    sharedHash[threadIdx.x+1] = hash.x;
    if (index > 0 && threadIdx.x == 0) {
        uint2 prevData = particleHash[index-1];
        sharedHash[0] = prevData.x;
    }
    __syncthreads();

    if (index < numParticles) {
        // A hash different from the previous particle's marks a cell boundary
        if (index == 0 || hash.x != sharedHash[threadIdx.x]) {
            cellStart[hash.x] = index;
            if (index > 0)
                cellEnd[sharedHash[threadIdx.x]] = index;
        }
        if (index == numParticles - 1) {
            cellEnd[hash.x] = index + 1;
        }

        // Gather the particle from its old position into sorted order
        float4 pos = FETCH(particlePositions, hash.y);
        float4 vel = FETCH(particleVelocities, hash.y);
        sortedPos[index] = pos;
        sortedVel[index] = vel;
    }
}
```

## Collision detection: Near-field tests

- One thread per particle
- Ignore symmetries, i.e. find all collisions twice
- Storing the results is hard: the number of collisions each thread will write is not known in advance
- Memory operations "mostly" coalesced

## Collision detection: Near-field tests - pseudo code

```text
read my particle's position
calculate containing cell
loop over all 27 cells that contain collision candidates
    calculate cell hash
    read cell start and end indices
    if cell is empty
        next cell
    loop over indices defined by cell range
        if index points to my particle
            next particle
        read other particle's position
        calculate distance between particles
        if distance < 2 * particle radius
            record collision
```

## Matrix operations

- Matrix storage format

## Matrix operations: Matrix-vector multiplication

- One thread per output element, i.e. one per row
- Very good memory access pattern for matrix data
- Very bad memory access pattern for input vector data
  - 2-step look-up for input vector data
- No usage of shared memory -> candidate for optimization

```c
uint index = get_global_id(0);
if (index >= numRows) return;

float4 G_block;
float4 x_block;
float4 acc;

// Start with the diagonal block, which uses the same element from the input vector
G_block = diagonalBlocks[index];
x_block = inputVector[index];
acc = G_block * x_block;

// Then loop over the off-diagonal blocks, adding the results to the accumulator
uint rowLength = rowLengths[index];
for (uint itr = 0; itr < rowLength; ++itr) {
    uint bufferIndex = itr*numRows + index;
    uint column = columnIndicies[bufferIndex];
    G_block = offDiagonalBlocks[bufferIndex];
    x_block = inputVector[column];
    acc += G_block * x_block;
}

// Done, write result to global memory
outputVector[index] = acc.x + acc.y + acc.z;
```
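
A sequential CPU reference can help when checking the output of the matrix-vector kernel above. The following is a minimal C++ sketch under stated assumptions, not code from the presentation: `Float4`, `BlockSparseMatrix`, and `multiply` are hypothetical names, and the buffer layout is inferred only from the kernel's indexing (`itr*numRows + index`), which places the k-th off-diagonal block of every row next to each other. The field is spelled `columnIndices` here rather than the kernel's `columnIndicies`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for the kernel's float4 (hypothetical name).
using Float4 = std::array<float, 4>;

// Hypothetical container mirroring the buffers the kernel reads.
struct BlockSparseMatrix {
    std::size_t numRows = 0;
    std::vector<Float4> diagonalBlocks;     // one block per row
    std::vector<unsigned> rowLengths;       // number of off-diagonal blocks per row
    std::vector<unsigned> columnIndices;    // indexed as [itr * numRows + row]
    std::vector<Float4> offDiagonalBlocks;  // same indexing as columnIndices
};

// Sequential equivalent of the kernel: one loop iteration per GPU thread.
std::vector<float> multiply(const BlockSparseMatrix& m,
                            const std::vector<Float4>& x) {
    assert(x.size() == m.numRows);
    std::vector<float> y(m.numRows, 0.0f);
    for (std::size_t row = 0; row < m.numRows; ++row) {
        Float4 acc{};
        // Diagonal block: multiplied component-wise with the row's own input element.
        for (int k = 0; k < 4; ++k)
            acc[k] = m.diagonalBlocks[row][k] * x[row][k];
        // Off-diagonal blocks: the column index selects the input element
        // (the "2-step look-up" from the slide).
        for (unsigned itr = 0; itr < m.rowLengths[row]; ++itr) {
            std::size_t bufferIndex = itr * m.numRows + row;
            unsigned column = m.columnIndices[bufferIndex];
            for (int k = 0; k < 4; ++k)
                acc[k] += m.offDiagonalBlocks[bufferIndex][k] * x[column][k];
        }
        // Like the kernel, only the first three components are summed.
        y[row] = acc[0] + acc[1] + acc[2];
    }
    return y;
}
```

With this layout, neighboring rows read neighboring entries of `offDiagonalBlocks` and `columnIndices` on each loop iteration, which matches the slide's remark about the good access pattern for matrix data; the reads from the input vector follow the column indices and are therefore scattered.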
