The efficient parallelization of fast multipole-based algorithms for the N-body problem is one of the most challenging topics in high performance scientific computing. The emergence of non-local, irregular communication patterns generated by these algorithms can easily create an insurmountable bottleneck on supercomputers with hundreds of thousands of cores. To overcome this obstacle we have developed an innovative parallelization strategy for Barnes–Hut tree codes on present and upcoming HPC multicore architectures. This scheme, based on a combined MPI–Pthreads approach, permits an efficient overlap of computation and data exchange. We highlight the capabilities of this method on the full IBM BlueGene/P system JUGENE at Jülich Supercomputing Centre and demonstrate scaling across 294,912 cores with up to 2,048,000,000 particles. Applying our implementation pepc to laser–plasma interaction and vortex particle methods close to the continuum limit, we demonstrate its potential for ground-breaking advances in large-scale particle simulations.