Troubleshooting
Common issues and their solutions.
Self-test: 64-bit Atomic Operations Failed
Symptom: THOR aborts at startup with a self-test failure:
- FAIL: Device #0 [CPU] ... - 64-bit atomic FAILED: bad_function_call
-- Hint: This device does not support 64-bit atomics required by THOR.
Try switching to the OpenMP backend instead.
You may also see AdaptiveCpp errors mentioning Symbols not found: [ _Z8atom_addPU3AS4Vll ].
Cause: The selected SYCL device (typically an OpenCL CPU or integrated GPU) does not support 64-bit atomic operations (cl_khr_int64_base_atomics), which THOR requires. This commonly happens when building with -DACPP_TARGETS="generic" (SSCP/JIT compilation) and the runtime picks an OpenCL device.
Solutions:
- Switch to the OpenMP backend (recommended for CPU execution):
- Build with the OpenMP target instead of generic (CPU-only build, no GPU support; never combine targets like "generic;omp", see Compiler Settings):
- Disable self-tests (not recommended; the simulation will likely crash later):
GPU: "Could not open file libkernel-sscp-*.bc" / "Code object construction failed"
Symptom: GPU runs abort at startup or first kernel launch with a message
like Could not open file libkernel-sscp-ptx-full.bc or
Code object construction failed.
Cause: The AdaptiveCpp runtime cannot find its own runtime files —
the llvm-to-backend libraries and bitcode files installed under
<prefix>/lib/hipSYCL/. This happens with incomplete AdaptiveCpp
installations (some versions' make install have been observed to skip these
files) or when LD_LIBRARY_PATH points at a different AdaptiveCpp than the
one used at build time.
Solution: Verify the files exist in the install prefix:
ls <prefix>/lib/hipSYCL/llvm-to-backend/libllvm-to-ptx.so
ls <prefix>/lib/hipSYCL/bitcode/libkernel-sscp-ptx-full.bc
If they are missing, reinstall AdaptiveCpp and confirm that both files are present afterwards.
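The two checks above can be wrapped in a small helper that reports each file and returns a non-zero status when one is absent. This is a minimal sketch: the function name check_acpp_runtime is hypothetical, and the file list covers only the two PTX-related files named above.

```shell
#!/usr/bin/env bash
# Sketch: report which AdaptiveCpp runtime files are missing under an install prefix.
# check_acpp_runtime is a hypothetical helper, not part of AdaptiveCpp or THOR.
check_acpp_runtime() {
    local prefix="$1" missing=0 f
    for f in \
        "lib/hipSYCL/llvm-to-backend/libllvm-to-ptx.so" \
        "lib/hipSYCL/bitcode/libkernel-sscp-ptx-full.bc"; do
        if [ -e "$prefix/$f" ]; then
            echo "ok:      $prefix/$f"
        else
            echo "MISSING: $prefix/$f"
            missing=1
        fi
    done
    return "$missing"
}

# Usage (the prefix is a placeholder for your AdaptiveCpp install location):
#   check_acpp_runtime /path/to/adaptivecpp || echo "runtime files missing"
```

Run it against the prefix of the AdaptiveCpp installation that is actually picked up at runtime, since a mismatched LD_LIBRARY_PATH is one of the causes listed above.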
CMake: "Could NOT find MPI (missing: MPI_CXX_FOUND)"
Symptom: CMake configuration fails to find MPI even though an MPI installation is present.
Cause: CMake's find_package(MPI) compiles a small test program, which
can fail when the C++ compiler is acpp.
Solution: First, point CMake at your MPI compiler wrapper — this is the supported route and usually fixes detection on its own:
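A minimal sketch of this step, assuming the wrapper is called mpicxx and is on your PATH. MPI_CXX_COMPILER is the standard FindMPI variable for the C++ wrapper; the helper name mpi_wrapper_arg is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: build the CMake argument that points FindMPI at an MPI C++ wrapper.
# mpi_wrapper_arg is a hypothetical helper; the wrapper name defaults to mpicxx.
mpi_wrapper_arg() {
    local wrapper="${1:-mpicxx}" path
    path=$(command -v "$wrapper") || {
        echo "MPI wrapper '$wrapper' not found in PATH" >&2
        return 1
    }
    printf -- '-DMPI_CXX_COMPILER=%s\n' "$path"
}

# Usage:
#   cmake -S . -B build "$(mpi_wrapper_arg mpicxx)"
```

Resolving the wrapper through PATH keeps the configure step tied to the same MPI environment you have loaded in the shell.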
If the compile probe itself still fails under acpp, bypass it and provide
the pieces explicitly (using the supported FindMPI variables —
MPI_CXX_INCLUDE_PATH/MPI_CXX_LIBRARIES are deprecated):
cmake -S . -B build \
  -DMPI_CXX_WORKS=TRUE \
  -DMPI_CXX_LIB_NAMES="mpi_cxx;mpi" \
  -DMPI_mpi_cxx_LIBRARY=/path/to/openmpi/lib/libmpi_cxx.so \
  -DMPI_mpi_LIBRARY=/path/to/openmpi/lib/libmpi.so \
  -DMPI_CXX_ADDITIONAL_INCLUDE_DIRS=/path/to/openmpi/include \
  ...
Loader: Missing Header Attributes (Units, Cosmology)
Symptom: The gadget loader aborts with a message like
GadgetReader: UnitLength_in_cm is in neither the HDF5 Header nor the YAML loader config. ...
GadgetReader: Omega0 is in neither the HDF5 Header nor the YAML loader config. ...
Cause: Not all Gadget-family codes write unit and cosmology attributes
into the snapshot HDF5 header. GADGET-4, for example, keeps them in
param.txt only. THOR refuses to guess — silently assuming default units
would corrupt all distances and densities by orders of magnitude.
Solution: Supply the values in the loader subsection of your config (values from your simulation's parameter file):
pointcloud_voronoi:
  gadget:
    UnitLength_in_cm: 3.085678e24    # e.g. Mpc
    UnitMass_in_g: 1.989e43          # e.g. 10^10 Msun
    UnitVelocity_in_cm_per_s: 1e5    # e.g. km/s
    Omega0: 0.272
    OmegaLambda: 0.728
    HubbleParam: 0.702
See Unit Normalization for the h-factor
conventions these values are interpreted with.
Self-test: ASSERT Device Compatibility Failed
Symptom: THOR aborts at startup with:
- FAIL: Device #0 ... - ASSERT device compatibility FAILED
-- Hint: The ASSERT macro is using ASSERT_THROW which is not device-compatible.
Cause: In debug builds with -DACPP_TARGETS="generic", the ASSERT_THROW macro pulls std::ostringstream and spdlog into device code, which cannot be JIT-compiled for GPU/OpenCL backends.
Solutions:
- Use the OpenMP target for debug builds:
- Switch to simple asserts:
MPI Segfault at Startup
Symptom: Segmentation fault in MPI_Comm_size immediately on launch:
user@host:~$ mpirun -np 1 ./thor ./config.yaml
*** Process received signal ***
Signal: Segmentation fault (11)
Failing at address: 0x440000e8
...libmpi.so.40(MPI_Comm_size+0x3b)...
Cause: MPI library mismatch - the executable was compiled against a different MPI library than the one loaded at runtime.
Solution: Ensure consistent MPI environment:
# Check which MPI the executable was linked against
ldd /path/to/thor | grep mpi
# Verify mpirun comes from the same installation
which mpirun
If the two disagree, rebuild THOR using the MPI installation you intend to use at runtime.
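The comparison can be scripted. The sketch below is a heuristic only: it assumes a conventional <prefix>/bin and <prefix>/lib layout for the MPI installation, and the helper names mpi_dir_from_ldd and mpi_prefix_matches are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: check whether the libmpi a binary links against lives under the same
# install prefix as the mpirun found in PATH. Helper names are hypothetical.

# Read `ldd` output on stdin and print the directory of the first resolved libmpi.
mpi_dir_from_ldd() {
    local lib
    lib=$(awk '$1 ~ /^libmpi\.so/ && $3 ~ /^\// { print $3; exit }')
    [ -n "$lib" ] && dirname "$lib"
}

# Succeed if lib_dir lies under the prefix implied by mpirun_path (<prefix>/bin/mpirun).
mpi_prefix_matches() {
    local lib_dir="$1" mpirun_path="$2" prefix
    prefix=$(dirname "$(dirname "$mpirun_path")")
    if [ "$prefix" = "/" ]; then prefix=""; fi
    case "$lib_dir/" in
        "$prefix"/*) return 0 ;;
        *)           return 1 ;;
    esac
}

# Usage:
#   lib_dir=$(ldd /path/to/thor | mpi_dir_from_ldd)
#   mpi_prefix_matches "$lib_dir" "$(command -v mpirun)" || echo "MPI mismatch suspected"
```

A match under a shared system prefix such as /usr does not prove the libraries are consistent, so treat a positive result as a hint and a negative result as a strong reason to rebuild.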
Voronoi Tessellation Hangs
Symptom: One or more THOR processes appear to hang while initializing a
pointcloud_voronoi dataset. Logs usually stop during Voronoi/Delaunay
tessellation. This can happen across separate THOR processes, not just MPI
ranks.
Cause: CGAL's parallel Delaunay (THOR_CGAL_PARALLEL=ON) defaults to a
spin-yield lock policy (Tag_priority_blocking). When two or more THOR
processes oversubscribe the same node, the workers' sched_yield() calls
cascade between processes — every yield picks another spinner instead of a
lock holder, and forward progress stalls. CPU sits at 100% with near-zero
throughput.
Detection + policy: every THOR entering parallel CGAL construction
writes a marker at /tmp/thor-cgal-<pid>.active for the duration of the
build, and scans the directory at startup. The policy when other live
THORs are detected is configurable per run:
pointcloud_voronoi:
  construction:
    cgal_marker_mode: warn                    # off | warn | wait (default: warn)
    cgal_marker_wait_poll_seconds: 5          # 'wait' mode only
    cgal_marker_wait_timeout_seconds: 1800    # 'wait' mode safety net; 0 disables
- off: disable the marker mechanism entirely (no file, no scan).
- warn (default): proceed regardless, but log a warning naming the conflicting PIDs.
- wait: block until no THOR with an older start_time remains. Useful when jobs are launched in parallel by a script but must serialise their CGAL phase. The older process always wins; newer arrivals queue behind it. The poll interval is configurable. If the timeout is reached (blocker hard-crashed on a node where its PID isn't reused), THOR logs a warning and proceeds anyway rather than hanging forever.
You can also enumerate live constructions from a shell:
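A minimal sketch of such an enumeration, assuming only the marker file naming described above (thor-cgal-<pid>.active in /tmp, or in THOR_CGAL_MARKER_DIR when that variable is set). The helper name list_cgal_markers is hypothetical and not part of THOR.

```shell
#!/usr/bin/env bash
# Sketch: print the PIDs that currently have a THOR CGAL marker file.
# list_cgal_markers is a hypothetical helper; it only inspects file names.
list_cgal_markers() {
    local dir="${1:-${THOR_CGAL_MARKER_DIR:-/tmp}}" f pid
    for f in "$dir"/thor-cgal-*.active; do
        [ -e "$f" ] || continue      # the glob matched nothing
        pid=${f##*/thor-cgal-}       # strip directory and file name prefix
        pid=${pid%.active}           # strip the suffix
        echo "$pid"
    done
}

# Usage: show the processes behind the markers.
#   for pid in $(list_cgal_markers); do ps -o pid,etime,args -p "$pid"; done
```

This listing does not validate process start times, so a marker left by a crashed process can still show up here until a THOR scan removes it.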
Markers are validated against the start-time field of /proc/<pid>/stat.
Stale markers (left behind by a crashed process) are cleaned up
automatically on the next scan, so old files never produce false warnings.
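To inspect the start-time field by hand, the sketch below reads field 22 (starttime, in clock ticks since boot) from a /proc/<pid>/stat line. It shows only how to read the field; how THOR records the value in its marker is not covered here, and the helper name stat_starttime is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: extract the starttime field (field 22) from one /proc/<pid>/stat line
# read on stdin. The comm field (field 2) may contain spaces or parentheses, so
# everything up to the last ") " is stripped first; starttime is then the 20th
# remaining field.
stat_starttime() {
    sed 's/^.*) //' | awk '{ print $20 }'
}

# Usage:
#   stat_starttime < /proc/$$/stat
```

Two processes that reuse the same PID will report different start times, which is why the pair (pid, start time) identifies a process more reliably than the PID alone.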
Same-MPI-job ranks (mpirun-launched on the same node, sharing parent or
process group) are filtered out — they will not warn about each other,
since intra-job rank contention isn't something the user can avoid. This
heuristic catches the common mpirun/orted, hydra and srun-default cases;
it may miss srun --mpi=pmix if each task gets a private session.
Cache-loaded meshes (pointcloud_voronoi.mesh_cache_mode: load or
auto with a valid cache) skip parallel CGAL construction entirely, so
they don't write a marker, don't scan, and don't appear in another
THOR's marker scan. Coordination is per-CGAL-build, not per-run.
The marker directory defaults to /tmp, which may be private to each job or
container (for example, with Slurm per-task scratch or a container's own
mount namespace). To coordinate across processes that don't share /tmp, set
the environment variable THOR_CGAL_MARKER_DIR to a shared, bind-mounted
directory before launching THOR.
Limitation — PID namespaces. Markers are keyed by pid + process
start-time, both validated against /proc/<pid>/stat in the scanning
process's namespace. If the THORs you want to coordinate run in
different PID namespaces (independent Docker containers, podman pods,
etc.), the same numeric PID in those namespaces refers to unrelated
processes, and cross-namespace markers will be misclassified as stale
and silently removed. For container-to-container coordination, run the
THORs in the same PID namespace (e.g., docker run --pid=host or share
a pod) or use a coordination mechanism outside this subsystem.
Solutions:
- Avoid concurrent CGAL mesh construction on the same node. Serialize startup, or launch jobs so that only one pointcloud_voronoi run builds its mesh at a time. The marker scan above makes this enforceable from a wrapper script.
- Opt in to non-blocking locks (THOR_CGAL_NONBLOCKING_LOCKS=ON). Rebuild with -DTHOR_CGAL_NONBLOCKING_LOCKS=ON. This switches CGAL's spatial-lock policy to Tag_non_blocking, removing the sched_yield cascade, so concurrent runs will at least make forward progress.
  Trade-off: this option is OFF by default because the fail-fast lock primitive imposes a measurable single-process tax (~25% on SPH zoom-in data) and can degrade further if any cross-process contention occurs during the run (busy-spin on hot-spot cells). Use it only when concurrent CGAL jobs on the same machine are unavoidable.
- Cache the mesh and reuse it for repeated runs:
  Let one run save the mesh, then have later runs load it instead of entering CGAL construction again.
- For experiments only, consider the SYCL Voronoi backend:
  This backend is still experimental, and the default use_cgal_fallback: true can still call CGAL for unconverged cells. Only use it after validating the specific dataset.
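For the first solution above (serializing startup), a wrapper script can poll the marker directory before launching. This is a minimal sketch that assumes only the marker naming described earlier; the helper name wait_for_no_cgal_markers and its default poll and timeout values are illustrative choices, not THOR behaviour.

```shell
#!/usr/bin/env bash
# Sketch: block until no thor-cgal-*.active marker exists in the marker
# directory, or give up after a timeout. Hypothetical helper, not part of THOR.
#   $1 marker directory (default: THOR_CGAL_MARKER_DIR or /tmp)
#   $2 poll interval in seconds (default: 5)
#   $3 timeout in seconds (default: 1800)
wait_for_no_cgal_markers() {
    local dir="${1:-${THOR_CGAL_MARKER_DIR:-/tmp}}" poll="${2:-5}" timeout="${3:-1800}" waited=0
    while ls "$dir"/thor-cgal-*.active >/dev/null 2>&1; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1                 # markers still present after the timeout
        fi
        sleep "$poll"
        waited=$((waited + poll))
    done
    return 0
}

# Usage:
#   wait_for_no_cgal_markers && mpirun -np 1 ./thor ./config.yaml
```

The check and the launch are not atomic, so two wrappers started at the same instant can still both proceed, and a stale marker blocks the wrapper until it is removed or the timeout expires. For strict ordering, the built-in wait mode is likely the better fit.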
Apptainer: fuse2fs not found / gocryptfs not found
Symptom: When running THOR via Apptainer, you see one or both of these warnings:
INFO: fuse2fs not found, will not be able to mount EXT3 filesystems
INFO: gocryptfs not found, will not be able to use gocryptfs
Cause: Apptainer checks for optional host utilities at startup. These are only needed for EXT3 overlay filesystems and encrypted containers, respectively.
Solution: These are harmless informational messages and can be safely ignored. THOR containers use Docker-based images and do not require either feature.
TBB Not Found
Symptom: CMake fails with an error about TBB not being found:
CMake Error at /usr/share/cmake/Modules/FindPackageHandleStandardArgs.cmake:XXX:
Could NOT find TBB (missing: TBB_DIR)
Cause: TBB (Threading Building Blocks) is required when THOR_CGAL_PARALLEL is enabled (the default), which allows parallel CGAL triangulation construction.
Solutions:
- Disable parallel CGAL construction (easiest):
- Install TBB: Install TBB via your distribution's package manager where available. TBB has native CMake support via TBBConfig.cmake. If CMake can't find TBB after installation, add the TBB installation directory to CMAKE_PREFIX_PATH. For example, with Intel oneAPI this might be: