As the exploration of the state space increases in scale, SIMD code is required so that several trajectories are computed by the same instruction. This gives a nice speedup but, on the CPU, the next roadblock is the constant testing of the results, even with a branch predictor that almost always guesses right (since hits are rare). On a GPU, the computations must always take the same number of instructions, so a branchless approach is required.
As the PEAC width increases, the probability of encountering a crossing drops dramatically, so the code can be optimised for bulk scanning, for example 64 iterations at once with 8 parallel lanes (a block of 512 computations overall). After this chunk is computed, a reduced "trace" value is analysed by the scalar core to find which SIMD lane has a crossing, and that trajectory is recomputed step by step to find the crossing iteration.
64 iterations is a good chunk size because it is not too long (so the computation can be resumed quickly), and the SIMD instructions can detect a 32-bit word that is cleared in one lane. The corresponding time-step is flagged in a 64-bit register, at the matching bit position, so the outer loop can inspect the result faster. The other "trace" register indicates which lane was hit.
Later, the time spent recomputing the intermediate steps from the start of the chunk to the crossing will remain low compared to the bulk of the scanning on a massively parallel GPU: the CPU will spend relatively little time recomputing the last steps of a trajectory. However, the more GPU ALUs there are, the more frequently a crossing will be found, so the length of a chunk should be kept small. To keep the machine running at full steam, a sort of double- or triple-buffering algorithm is required, or else the GPU will idle too much: one batch is computed while another transmits its results and a third is analysed by the CPU (or something like that).