Troubleshooting

Common issues and their solutions.

Self-test: 64-bit Atomic Operations Failed

Symptom: THOR aborts at startup with a self-test failure:

- FAIL: Device #0 [CPU] ... - 64-bit atomic FAILED: bad_function_call
-- Hint: This device does not support 64-bit atomics required by THOR.
         Try switching to the OpenMP backend instead.

You may also see AdaptiveCpp errors mentioning Symbols not found: [ _Z8atom_addPU3AS4Vll ].

Cause: The selected SYCL device (typically an OpenCL CPU or integrated GPU) does not support 64-bit atomic operations (cl_khr_int64_base_atomics), which THOR requires. This commonly happens when building with -DACPP_TARGETS="generic" (SSCP/JIT compilation) and the runtime picks an OpenCL device.

Solutions:

Switch to the OpenMP backend (recommended for CPU execution):

# Option A: Set the device in your config.yaml
device: "cpu-openmp"

# Option B: Hide non-OpenMP backends from the AdaptiveCpp runtime
export ACPP_VISIBILITY_MASK="omp"   # AdaptiveCpp env var

Build with the OpenMP target instead of generic. This gives a CPU-only build without GPU support. Never combine targets such as "generic;omp" (see Compiler Settings):
```
cmake -S . -B build -DACPP_TARGETS="omp" ...
```
Disable self-tests (not recommended — the simulation will likely crash later):
```
selftest: false
```
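To see which OpenCL devices advertise the extension named in the cause above, the output of the clinfo utility can be filtered. The following is a minimal sketch, not part of THOR: the helper name int64_atomics_by_device is hypothetical, and it assumes clinfo's default human-readable layout, where each device block contains a "Device Name" line followed later by a "Device Extensions" line.

```shell
# Sketch: read `clinfo` output on stdin and print, per device, whether
# cl_khr_int64_base_atomics appears in its extension list.
# Assumption: default human-readable clinfo layout ("Device Name" and
# "Device Extensions" labels). The function name is hypothetical.
int64_atomics_by_device() {
    awk '
        # Remember the most recent device name.
        /^ *Device Name / {
            sub(/^ *Device Name +/, "")
            name = $0
            next
        }
        # On the extensions line, report yes/no for the remembered device.
        /^ *Device Extensions / {
            flag = index($0, "cl_khr_int64_base_atomics") ? "yes" : "no "
            print flag, name
        }
    '
}

# Usage (requires clinfo to be installed):
#   clinfo | int64_atomics_by_device
```

A device reported as "no" is a likely candidate for the self-test failure described here. The check only inspects OpenCL devices and says nothing about the OpenMP backend.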

GPU: "Could not open file libkernel-sscp-*.bc" / "Code object construction failed"

Symptom: GPU runs abort at startup or first kernel launch with a message like Could not open file libkernel-sscp-ptx-full.bc or Code object construction failed.

Cause: The AdaptiveCpp runtime cannot find its own runtime files: the llvm-to-backend libraries and bitcode files installed under <prefix>/lib/hipSYCL/. This happens with incomplete AdaptiveCpp installations (make install in some versions has been observed to skip these files) or when LD_LIBRARY_PATH points at a different AdaptiveCpp installation than the one used at build time.

Solution: Verify the files exist in the install prefix:

ls <prefix>/lib/hipSYCL/llvm-to-backend/libllvm-to-ptx.so
ls <prefix>/lib/hipSYCL/bitcode/libkernel-sscp-ptx-full.bc

If either file is missing, the AdaptiveCpp installation is incomplete and needs to be reinstalled.
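The two ls checks can be wrapped in a small helper that reports each file and returns a non-zero status when something is absent. This is a sketch only: the function name check_acpp_runtime_files is hypothetical, and it tests just the two PTX-related paths listed in this section.

```shell
# Sketch: report which of the two files named in this section exist under
# an AdaptiveCpp install prefix. Returns 1 if any of them is missing.
# The function name is hypothetical; the relative paths come from the text.
check_acpp_runtime_files() {
    local prefix="$1"
    local missing=0
    local rel
    for rel in \
        "lib/hipSYCL/llvm-to-backend/libllvm-to-ptx.so" \
        "lib/hipSYCL/bitcode/libkernel-sscp-ptx-full.bc"
    do
        if [ -e "$prefix/$rel" ]; then
            echo "ok       $prefix/$rel"
        else
            echo "MISSING  $prefix/$rel"
            missing=1
        fi
    done
    return "$missing"
}

# Usage:
#   check_acpp_runtime_files /path/to/adaptivecpp/prefix
```

Running it against the prefix used at build time and against the prefix that LD_LIBRARY_PATH points to makes it easier to spot the case where the two differ.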

CMake: "Could NOT find MPI (missing: MPI_CXX_FOUND)"

Symptom: CMake configuration fails to find MPI even though an MPI installation is present.

Cause: CMake's find_package(MPI) compiles a small test program, which can fail when the C++ compiler is acpp.

Solution: First, point CMake at your MPI compiler wrapper — this is the supported route and usually fixes detection on its own:

cmake -S . -B build -DMPI_CXX_COMPILER=/path/to/openmpi/bin/mpicxx ...

If the compile probe itself still fails under acpp, bypass it and provide the pieces explicitly (using the supported FindMPI variables — MPI_CXX_INCLUDE_PATH/MPI_CXX_LIBRARIES are deprecated):

cmake -S . -B build \
  -DMPI_CXX_WORKS=TRUE \
  -DMPI_CXX_LIB_NAMES="mpi_cxx;mpi" \
  -DMPI_mpi_cxx_LIBRARY=/path/to/openmpi/lib/libmpi_cxx.so \
  -DMPI_mpi_LIBRARY=/path/to/openmpi/lib/libmpi.so \
  -DMPI_CXX_ADDITIONAL_INCLUDE_DIRS=/path/to/openmpi/include \
  ...

Loader: Missing Header Attributes (Units, Cosmology)

Symptom: The gadget loader aborts with a message like:

GadgetReader: UnitLength_in_cm is in neither the HDF5 Header nor the YAML loader config. ...
GadgetReader: Omega0 is in neither the HDF5 Header nor the YAML loader config. ...

Cause: Not all Gadget-family codes write unit and cosmology attributes into the snapshot HDF5 header. GADGET-4, for example, keeps them in param.txt only. THOR refuses to guess — silently assuming default units would corrupt all distances and densities by orders of magnitude.

Solution: Supply the values in the loader subsection of your config (values from your simulation's parameter file):

pointcloud_voronoi:
  gadget:
    UnitLength_in_cm: 3.085678e24    # e.g. Mpc
    UnitMass_in_g: 1.989e43          # e.g. 10^10 Msun
    UnitVelocity_in_cm_per_s: 1e5    # e.g. km/s
    Omega0: 0.272
    OmegaLambda: 0.728
    HubbleParam: 0.702

See Unit Normalization for the h-factor conventions these values are interpreted with.

Self-test: ASSERT Device Compatibility Failed

Symptom: THOR aborts at startup with:

- FAIL: Device #0 ... - ASSERT device compatibility FAILED
-- Hint: The ASSERT macro is using ASSERT_THROW which is not device-compatible.

Cause: In debug builds with -DACPP_TARGETS="generic", the ASSERT_THROW macro pulls std::ostringstream and spdlog into device code, which cannot be JIT-compiled for GPU/OpenCL backends.

Solutions:

Use the OpenMP target for debug builds:

cmake -S . -B build -DCMAKE_BUILD_TYPE=Debug -DACPP_TARGETS="omp" ...

Switch to simple asserts:

cmake -S . -B build -DTHOR_DEBUG_ASSERT=OFF ...

MPI Segfault at Startup

Symptom: Segmentation fault in MPI_Comm_size immediately on launch:

user@host:~$ mpirun -np 1 ./thor ./config.yaml
*** Process received signal ***
Signal: Segmentation fault (11)
Failing at address: 0x440000e8
...libmpi.so.40(MPI_Comm_size+0x3b)...

Cause: MPI library mismatch. The executable was compiled against a different MPI library than the one loaded at runtime.

Solution: Ensure a consistent MPI environment:

# Check which MPI the executable was linked against
ldd /path/to/thor | grep mpi

# Verify mpirun comes from the same installation
which mpirun

Rebuild THOR using the MPI installation you intend to use at runtime.
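The two manual checks can be combined into a rough consistency test. The sketch below uses hypothetical helper names (libmpi_from_ldd, lib_matches_launcher) and assumes the launcher lives in <prefix>/bin with the MPI library somewhere under the same <prefix>. That holds for typical self-built installations, but it is only a weak check for distribution packages installed under /usr.

```shell
# Sketch: compare the libmpi an executable resolves to with the install
# prefix of the mpirun on PATH. Helper names are hypothetical.

# Extract the resolved libmpi path from `ldd` output read on stdin.
# Lines such as "libmpi.so.40 => not found" produce no output.
libmpi_from_ldd() {
    awk '/libmpi\.so/ && $2 == "=>" && $3 ~ /^\// { print $3; exit }'
}

# Succeed if library path $1 lies under the prefix of launcher $2,
# assuming the launcher is installed as <prefix>/bin/<launcher>.
lib_matches_launcher() {
    local lib="$1" launcher="$2"
    local prefix
    prefix="$(dirname "$(dirname "$launcher")")"
    case "$lib" in
        "$prefix"/*) return 0 ;;
        *)           return 1 ;;
    esac
}

# Usage:
#   lib="$(ldd /path/to/thor | libmpi_from_ldd)"
#   lib_matches_launcher "$lib" "$(command -v mpirun)" \
#       || echo "possible MPI mismatch: thor resolves $lib"
```

A reported mismatch is a hint rather than proof: symlinked launchers or module systems can place the library and the launcher under different prefixes while still belonging to the same MPI installation.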

Voronoi Tessellation Hangs

Symptom: One or more THOR processes appear to hang while initializing a pointcloud_voronoi dataset. Logs usually stop during Voronoi/Delaunay tessellation. This can happen across separate THOR processes, not just MPI ranks.

Cause: CGAL's parallel Delaunay (THOR_CGAL_PARALLEL=ON) defaults to a spin-yield lock policy (Tag_priority_blocking). When two or more THOR processes oversubscribe the same node, the workers' sched_yield() calls cascade between processes — every yield picks another spinner instead of a lock holder, and forward progress stalls. CPU sits at 100% with near-zero throughput.

Detection and policy: every THOR process that enters parallel CGAL construction writes a marker at /tmp/thor-cgal-<pid>.active for the duration of the build and scans the marker directory at startup. The policy applied when other live THOR processes are detected is configurable per run:

pointcloud_voronoi:
  construction:
    cgal_marker_mode: warn                  # off | warn | wait   (default: warn)
    cgal_marker_wait_poll_seconds: 5        # 'wait' mode only
    cgal_marker_wait_timeout_seconds: 1800  # 'wait' mode safety net; 0 disables

off — disable the marker mechanism entirely (no file, no scan).

warn (default) — proceed regardless, but log a warning naming the conflicting PIDs:

[warning] 1 other THOR process(es) already in parallel CGAL construction
          on this node — concurrent CGAL Delaunay will be slow.
[warning]   - PID 12345 (exe /path/to/thor)

wait — block until no THOR process with an earlier start time remains. Useful when jobs are launched in parallel by a script but must serialise their CGAL phase. The older process always wins; newer arrivals queue behind it. The poll interval is configurable. If the timeout is reached (blocker hard-crashed on a node where its PID isn't reused), THOR logs a warning and proceeds anyway rather than hanging forever.

You can also enumerate live constructions from a shell:

ls /tmp/thor-cgal-*.active 2>/dev/null

Stale markers (process crashed) are validated against /proc/<pid>/stat's start-time field and cleaned up automatically on the next scan, so old files never produce false warnings.

Same-MPI-job ranks (mpirun-launched on the same node, sharing parent or process group) are filtered out — they will not warn about each other, since intra-job rank contention isn't something the user can avoid. This heuristic catches the common mpirun/orted, hydra and srun-default cases; it may miss srun --mpi=pmix if each task gets a private session.

Cache-loaded meshes (pointcloud_voronoi.mesh_cache_mode: load or auto with a valid cache) skip parallel CGAL construction entirely, so they don't write a marker, don't scan, and don't appear in another THOR's marker scan. Coordination is per-CGAL-build, not per-run.

The marker directory defaults to /tmp, which is private to each namespace under Slurm task scratch and inside containers. To coordinate across processes that do not share /tmp, set the environment variable THOR_CGAL_MARKER_DIR to a shared, bind-mounted directory before launching THOR.

Limitation — PID namespaces. Markers are keyed by pid + process start-time, both validated against /proc/<pid>/stat in the scanning process's namespace. If the THORs you want to coordinate run in different PID namespaces (independent Docker containers, podman pods, etc.), the same numeric PID in those namespaces refers to unrelated processes, and cross-namespace markers will be misclassified as stale and silently removed. For container-to-container coordination, run the THORs in the same PID namespace (e.g., docker run --pid=host or share a pod) or use a coordination mechanism outside this subsystem.
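For wrapper scripts, the marker files described above can be inspected from the shell. The following sketch is an external approximation, not THOR's own logic: the function names are hypothetical, and it only checks that the PID in the file name still exists. It does not validate the process start time the way THOR does, so a reused PID would be reported as live.

```shell
# Sketch: list markers whose PID still exists, and optionally wait until
# none remain. Function names are hypothetical. The directory follows
# THOR_CGAL_MARKER_DIR if set, otherwise /tmp, as described in the text.

# Print "PID path" for every marker whose process is still present.
live_cgal_markers() {
    local dir="${THOR_CGAL_MARKER_DIR:-/tmp}"
    local f pid
    for f in "$dir"/thor-cgal-*.active; do
        [ -e "$f" ] || continue            # glob matched nothing
        pid="${f##*/thor-cgal-}"
        pid="${pid%.active}"
        case "$pid" in
            ''|*[!0-9]*) continue ;;       # skip names without a numeric PID
        esac
        if kill -0 "$pid" 2>/dev/null || [ -d "/proc/$pid" ]; then
            echo "$pid $f"
        fi
    done
}

# Block until no live marker remains. Arguments: timeout and poll interval
# in seconds. Returns 0 when idle, 1 if the timeout expires first.
wait_for_cgal_idle() {
    local timeout="${1:-1800}" poll="${2:-5}" waited=0
    while [ -n "$(live_cgal_markers)" ]; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep "$poll"
        waited=$((waited + poll))
    done
    return 0
}

# Usage:
#   wait_for_cgal_idle 1800 5 && mpirun -np 1 ./thor ./config.yaml
```

Two wrappers that start at the same moment can both see an empty directory and proceed together, so this does not replace the built-in wait mode. It is mainly useful for launch scripts that stagger jobs.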

Solutions:

Avoid concurrent CGAL mesh construction on the same node. Serialize startup, or launch jobs so that only one pointcloud_voronoi run builds its mesh at a time. The marker scan above makes this enforceable from a wrapper script.

Opt in to non-blocking locks: rebuild with -DTHOR_CGAL_NONBLOCKING_LOCKS=ON. This switches CGAL's spatial-lock policy to Tag_non_blocking and removes the sched_yield cascade, so concurrent runs will at least make forward progress.

Trade-off: this option is OFF by default because the fail-fast lock primitive imposes a measurable single-process tax (~25% on SPH zoom-in data) and can degrade further if any cross-process contention occurs during the run (busy-spin on hot-spot cells). Use it only when concurrent CGAL jobs on the same machine are unavoidable.

Cache the mesh and reuse it for repeated runs:
```
pointcloud_voronoi:
  mesh_cache_mode: "auto"
  mesh_save_path: /path/to/voronoi_mesh_cache.h5
```
Let one run save the mesh, then have later runs load it instead of entering CGAL construction again.

For experiments only, consider the SYCL Voronoi backend:
```
pointcloud_voronoi:
  construction:
    use_sycl_voronoi: true
```
This backend is still experimental, and the default use_cgal_fallback: true can still call CGAL for unconverged cells. Only use it after validating the specific dataset.

Apptainer: "fuse2fs not found" / "gocryptfs not found"

Symptom: When running THOR via Apptainer, you see one or both of these warnings:

INFO:    fuse2fs not found, will not be able to mount EXT3 filesystems
INFO:    gocryptfs not found, will not be able to use gocryptfs

Cause: Apptainer checks for optional host utilities at startup. These are only needed for EXT3 overlay filesystems and encrypted containers, respectively.

Solution: These are harmless informational messages and can be safely ignored. THOR containers use Docker-based images and do not require either feature.

TBB Not Found

Symptom: CMake fails with an error about TBB not being found:

CMake Error at /usr/share/cmake/Modules/FindPackageHandleStandardArgs.cmake:XXX:
  Could NOT find TBB (missing: TBB_DIR)

Cause: TBB (Threading Building Blocks) is required when THOR_CGAL_PARALLEL is enabled (the default), which allows parallel CGAL triangulation construction.

Solutions:

Disable parallel CGAL construction (easiest):

cmake -S . -B build -DTHOR_CGAL_PARALLEL=OFF ...

Install TBB: use your distribution's package manager where available. TBB ships native CMake support through TBBConfig.cmake. If CMake still cannot find TBB after installation, add the TBB installation directory to CMAKE_PREFIX_PATH. For example, with Intel oneAPI this might be:
```
cmake -S . -B build -DCMAKE_PREFIX_PATH=/opt/intel/oneapi/tbb/latest ...
```
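When it is unclear where TBB was installed, searching for TBBConfig.cmake shows which directory CMake needs. The helper below is a sketch with a hypothetical name (find_tbb_config). The search roots are examples to adapt, not a list of locations THOR checks.

```shell
# Sketch: print every TBBConfig.cmake found under the given directories.
# The function name is hypothetical; pass whichever roots apply to your system.
find_tbb_config() {
    local root
    for root in "$@"; do
        [ -d "$root" ] || continue         # ignore roots that do not exist
        find "$root" -name TBBConfig.cmake -print 2>/dev/null || true
    done
}

# Usage:
#   find_tbb_config /usr/lib /usr/lib64 /opt/intel/oneapi/tbb/latest
#   cfg="$(find_tbb_config /opt/intel/oneapi/tbb/latest | head -n 1)"
#   cmake -S . -B build -DTBB_DIR="$(dirname "$cfg")" ...
```

TBB_DIR is the variable named in the CMake error message. Setting it to the directory that contains TBBConfig.cmake is an alternative to extending CMAKE_PREFIX_PATH.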
