
Show HN: Benchmarking Apple Silicon unified mem for GPU-accelerated SQL analysis

February 18, 2026

Does Apple Silicon's unified memory architecture eliminate the CPU-to-GPU data transfer bottleneck that plagues GPU-accelerated database analytics?

This project benchmarks six TPC-H analytical queries across three execution engines -- DuckDB (optimized SQL), NumPy (CPU kernels), and MLX (GPU kernels) -- on an Apple M4 to quantify the real-world impact of unified memory on analytical query performance. It also includes an optional GPU-favorable showcase query (QX) for stress-testing scan + arithmetic + high-cardinality grouped aggregation.

GPU-accelerated databases (BlazingSQL, HeavyDB, Crystal) promise order-of-magnitude speedups for analytical workloads, but a fundamental bottleneck limits their effectiveness: the PCIe bus.

On discrete GPU systems, data must be explicitly transferred from host RAM to GPU VRAM before compute can begin:
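A minimal illustration of that pattern (CuPy appears here purely for illustration; this project targets MLX, and the bandwidth figure is an assumption about typical PCIe 4.0 x16 throughput):

    import numpy as np
    # import cupy as cp   # illustrative only; requires an NVIDIA GPU, not a dependency of this project

    host_col = np.random.rand(60_000_000).astype(np.float32)   # ~240 MB column sitting in host RAM

    # On a discrete GPU, the column must cross the PCIe bus before any kernel can run:
    # gpu_col = cp.asarray(host_col)   # explicit host RAM -> GPU VRAM copy at PCIe speed (~25-50 GB/s)
    # revenue = float(gpu_col.sum())   # compute starts only after the transfer completes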

For scan-heavy analytical queries processing hundreds of megabytes to gigabytes of columnar data, this transfer overhead can dominate total execution time -- often exceeding the GPU compute time itself.

Apple Silicon's unified memory architecture eliminates this bus entirely. CPU and GPU cores share a single physical memory pool with 120 GB/s of bandwidth. There is no separate device memory to copy into; the GPU reads from the same memory addresses the CPU wrote to. The question is: how much does this actually matter in practice?
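In MLX there is accordingly no explicit device-transfer step at all. A minimal sketch (array contents are illustrative):

    import numpy as np
    import mlx.core as mx

    host_col = np.random.rand(6_000_000).astype(np.float32)   # built on the CPU

    col = mx.array(host_col)   # lands in unified memory (still a copy, but at unified-memory speed; discussed below)
    mx.eval(mx.sum(col))       # GPU kernels read the same physical memory pool; no PCIe hop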

Results from benchmarking on Apple M4 (10 GPU cores, 16 GB RAM, 120 GB/s unified memory):

MLX GPU vs NumPy CPU (same algorithms, fair comparison):

At SF1 (~6M lineitem rows), MLX GPU kernels outperform equivalent NumPy CPU kernels on compute-heavy queries. At SF10 (~60M rows), the GPU advantage grows for aggregation-heavy queries, but join-heavy queries show diminishing returns due to the gather/scatter overhead of sparse index lookups on the GPU.
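To make the "same algorithms" framing concrete, a Q6-style filtered reduction written against both backends looks nearly identical (a sketch, not the project's exact kernels; column names and predicate constants are illustrative):

    import numpy as np
    import mlx.core as mx

    def filtered_sum_numpy(price, discount, qty):
        mask = (discount >= 0.05) & (discount <= 0.07) & (qty < 24)
        return float(np.sum(np.where(mask, price * discount, 0.0)))

    def filtered_sum_mlx(price, discount, qty):
        mask = mx.logical_and(mx.logical_and(discount >= 0.05, discount <= 0.07), qty < 24)
        out = mx.sum(mx.where(mask, price * discount, 0.0))
        mx.eval(out)               # MLX is lazy; force execution so timing is honest
        return out.item()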

Supplemental GPU-showcase query (QX):

QX is not a TPC-H query. It is intentionally designed to favor GPU throughput: full lineitem scan, arithmetic feature synthesis, and 4096-bucket scatter-add aggregation without joins.

The QX result demonstrates that MLX can outperform both NumPy and DuckDB when the workload is dominated by massively parallel arithmetic aggregation and avoids join-heavy optimizer-dependent logic.
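A sketch of what a QX-like kernel reduces to (the feature formula and bucket assignment are placeholders rather than the project's exact definition, and the scatter-add assumes MLX's JAX-style a.at[idx].add() update):

    import mlx.core as mx

    def qx_like(price, discount, tax, bucket_id, n_buckets=4096):
        # full-column scan with arithmetic feature synthesis, no joins
        feature = price * (1.0 - discount) * (1.0 + tax)
        # high-cardinality grouped aggregation: scatter-add into 4096 buckets
        # (bucket_id is an int32 bucket index per row)
        sums = mx.zeros((n_buckets,), dtype=mx.float32)
        sums = sums.at[bucket_id].add(feature)   # assumed scatter-add API
        mx.eval(sums)
        return sums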

Three-baseline comparison at SF1:

DuckDB's vectorized engine with a full query optimizer consistently outperforms hand-written kernels, especially at larger scale factors. At SF1, DuckDB is 1.5-4.6x faster than MLX on most queries. This is expected -- DuckDB benefits from decades of query optimization research, adaptive execution strategies, and C++ implementation, while the MLX/NumPy kernels use straightforward scatter-add algorithms without query planning.

Unified memory transfer overhead:

The scenario model shows that if the M4 GPU were behind a PCIe 4.0 bus, data transfer would add 10-36% overhead to GPU compute time depending on query data volume.
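The scenario model itself is back-of-envelope arithmetic. A sketch with assumed numbers (the data volume, bandwidth, and compute time below are illustrative, not the project's measurements):

    # What-if: the same GPU compute, but with inputs arriving over a PCIe 4.0 bus
    data_bytes    = 500e6    # ~500 MB of columnar input for one query (illustrative)
    pcie_bw       = 25e9     # ~25 GB/s effective PCIe 4.0 x16 bandwidth (illustrative)
    gpu_compute_s = 0.060    # measured GPU kernel time for the query (illustrative)

    transfer_s = data_bytes / pcie_bw          # ~0.020 s spent copying before compute can start
    overhead   = transfer_s / gpu_compute_s    # ~33% added on top of GPU compute time
    print(f"PCIe transfer would add {overhead:.0%} overhead")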

MLX GPU scaling across data sizes:

The benchmark pipeline has three layers:

What each baseline measures:

Benchmark protocol:

The project depends on:

The full pipeline (run_all.py) executes 9 steps:

The benchmark uses six TPC-H queries (Q1, Q3, Q5, Q6, Q12, Q14) selected to cover a range of analytical patterns:

Q6 is the purest memory bandwidth test -- four columns, conjunction of filters, single scalar reduction. Q1 tests scatter-add group-by performance with a high-selectivity filter (~98% of rows pass). Q3 and Q5 test join-heavy workloads that stress the GPU's ability to handle sparse index lookups. Q12 and Q14 combine joins with conditional aggregation.
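For reference, Q6 with the standard substitution parameters (the project may use different parameter values) is just a predicated scan and one scalar sum:

    import duckdb

    Q6 = """
        SELECT sum(l_extendedprice * l_discount) AS revenue
        FROM lineitem
        WHERE l_shipdate >= DATE '1994-01-01'
          AND l_shipdate <  DATE '1995-01-01'
          AND l_discount BETWEEN 0.05 AND 0.07
          AND l_quantity < 24
    """
    # revenue = duckdb.connect("tpch.db").execute(Q6).fetchone()[0]   # assumes a populated TPC-H database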

Scale factors and data sizes:

The GPU kernels work around several MLX constraints:

No boolean fancy indexing. array[bool_mask] raises ValueError in MLX. All filter operations use mx.where(mask, value, zero) with an overflow bin pattern: masked-out rows are routed to a discard group at index N, and results are sliced to [:N] before returning.
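A sketch of that overflow-bin pattern (function and variable names are illustrative; the scatter-add assumes MLX's a.at[idx].add() update):

    import mlx.core as mx

    def masked_group_sum(values, group_idx, n_groups, mask):
        # route masked-out rows to a discard bin at index n_groups instead of boolean indexing
        idx  = mx.where(mask, group_idx, n_groups)
        vals = mx.where(mask, values, 0.0)
        sums = mx.zeros((n_groups + 1,), dtype=mx.float32)
        sums = sums.at[idx].add(vals)    # assumed scatter-add API
        mx.eval(sums)
        return sums[:n_groups]           # slice off the overflow bin before returning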

No argwhere or nonzero. Single-argument mx.where(mask) is not supported. When extracting indices of matching elements is required (e.g., Q3's top-k selection), the code converts to NumPy: np.where(np_array > 0)[0].

mx.array(numpy_array) is a copy, not zero-copy. Confirmed by MLX documentation. The copy occurs at unified memory bandwidth (~120 GB/s) rather than PCIe bandwidth (25-50 GB/s), but it is still a real copy with real overhead. The loading pipeline times each stage separately to honestly account for this.

Float32 only for practical purposes. All columns are downcast from DuckDB's float64/int64 to float32/int32 before GPU transfer. This halves memory usage and bandwidth requirements but introduces precision loss.

No lexsort. Multi-key sorting (needed for Q3: revenue DESC, orderdate ASC) falls back to NumPy.
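The NumPy fallback for that sort is a one-liner (a sketch; names are illustrative):

    import numpy as np

    def topk_order(revenue, orderdate, k=10):
        # np.lexsort sorts by the last key first, so negate revenue to get a descending primary key
        order = np.lexsort((orderdate, -revenue))   # revenue DESC, then orderdate ASC
        return order[:k]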

Scatter-add over millions of rows in float32 accumulates rounding errors. At SF1, Q1 aggregate sums show ~0.03-0.08% relative error versus DuckDB's float64 results. The validation framework uses a 0.1% relative tolerance (FLOAT_RTOL = 1e-3) for float aggregates while requiring exact matches for counts and dimension keys.
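The validation check therefore amounts to something like this (a sketch; function names are illustrative, the tolerance is the project's FLOAT_RTOL):

    import numpy as np

    FLOAT_RTOL = 1e-3   # 0.1% relative tolerance for float aggregates

    def matches_duckdb(candidate, reference, is_float_aggregate):
        if is_float_aggregate:
            return np.allclose(candidate, reference, rtol=FLOAT_RTOL)
        return np.array_equal(candidate, reference)   # counts and dimension keys must match exactly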

The project pre-computes memory budgets before execution to prevent OOM crashes. At SF10, the largest queries (Q1, Q5) require ~3-5 GB of MLX arrays. The loading pipeline uses per-column conversion (load one column to MLX, then delete the NumPy source) to minimize transient peak memory, keeping it to roughly 2.5x the final array size rather than 3x.
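A sketch of that per-column conversion (names are illustrative):

    import mlx.core as mx

    def to_mlx_per_column(np_columns):
        # convert one column at a time and drop the NumPy source immediately,
        # keeping transient peak memory near 2.5x the final MLX footprint
        mlx_columns = {}
        for name in list(np_columns):
            mlx_columns[name] = mx.array(np_columns.pop(name))   # copy into unified memory
        mx.eval(*mlx_columns.values())
        return mlx_columns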

Key memory thresholds:

String columns are encoded to integers inside DuckDB using SQL CASE expressions before extraction, since MLX cannot process string data. The full pipeline runs the encoding SQL in DuckDB, extracts the result columns to NumPy, downcasts them to 32-bit types, and converts them to MLX arrays.
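A sketch of that flow (the column selection, dictionary codes, and database path are illustrative assumptions):

    import duckdb
    import mlx.core as mx

    ENCODE_SQL = """
        SELECT
            l_quantity::FLOAT      AS l_quantity,
            l_extendedprice::FLOAT AS l_extendedprice,
            CASE l_returnflag WHEN 'A' THEN 0 WHEN 'N' THEN 1 WHEN 'R' THEN 2 END AS l_returnflag
        FROM lineitem
    """

    con      = duckdb.connect("tpch.db")              # assumes a populated TPC-H database
    np_cols  = con.execute(ENCODE_SQL).fetchnumpy()   # SQL projection + extraction to NumPy
    mlx_cols = {k: mx.array(v) for k, v in np_cols.items()}   # copy into unified memory
    mx.eval(*mlx_cols.values())                       # materialize before timing the kernels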

Each stage is individually timed. The transfer benchmark module repeats this 100 times to measure the conversion overhead with statistical confidence.

The pipeline produces 7 chart types in results/:

License: MIT

If you use this benchmark in research:
