Edge AI SDK
Hand-tuned CUDA kernels for hybrid neuromorphic workloads. 23 ms inference, 111× speedup and 4× memory reduction on a single Jetson AGX Orin — under a hard 60 W envelope.
Inference latency
23 ms
64 M params · warm
Speedup vs baseline
111×
vs PyTorch reference
Memory footprint
4× lower
fused SNN-SSM kernels
GPU utilization
98.25%
sustained · MAXN
Direct measurement · Jetson AGX Orin 64 GB · Power mode MAXN · CUDA 12.x
What it is
NeuraTensor is the GPU acceleration layer that powers the NeuratronLLM-Edge family. It replaces stock PyTorch and TensorFlow operators with hand-tuned CUDA kernels designed for spiking, state-space and fusion workloads, built to use the full capability of NVIDIA Ampere edge silicon.
Fused SNN-SSM kernels
Spiking and state-space layers compile into single kernels. Fusion eliminates intermediate tensor traffic and cuts memory bandwidth pressure by 4×.
Hardware-aware scheduling
Occupancy is tuned for SM 8.7 (Ampere edge). Tensor Cores are exploited for FP16 / INT8 mixed precision; warps are aligned to the spike-router topology.
Zero-copy edge runtime
Unified-memory aware. Inputs and KV cache live in shared LPDDR5 with no host↔device shuttling, so none of the 204.8 GB/s of memory bandwidth is spent on redundant copies.
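The occupancy tuning mentioned under hardware-aware scheduling can be reasoned about with a simplified theoretical-occupancy estimate. The Python sketch below is illustrative only and is not NeuraTensor's scheduler. The per-SM limits (48 resident warps, 16 resident blocks, 65,536 registers, 164 KB of shared memory) are assumptions for compute capability 8.7 that should be checked against the CUDA documentation for your toolkit. The names `SmLimits` and `theoretical_occupancy` are hypothetical, and allocation granularity is ignored.

```python
from dataclasses import dataclass

WARP_SIZE = 32


@dataclass(frozen=True)
class SmLimits:
    # Assumed per-SM limits for compute capability 8.7; verify before relying on them.
    max_warps: int = 48
    max_blocks: int = 16
    registers: int = 65536
    shared_mem_bytes: int = 164 * 1024


def theoretical_occupancy(threads_per_block, regs_per_thread, smem_per_block,
                          limits=SmLimits()):
    """Return (resident_blocks, occupancy), ignoring allocation granularity."""
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceiling division
    # Each resource caps the number of blocks that can be resident on one SM.
    by_warps = limits.max_warps // warps_per_block
    by_regs = limits.registers // (regs_per_thread * warps_per_block * WARP_SIZE)
    by_smem = (limits.shared_mem_bytes // smem_per_block
               if smem_per_block else limits.max_blocks)
    blocks = min(limits.max_blocks, by_warps, by_regs, by_smem)
    occupancy = blocks * warps_per_block / limits.max_warps
    return blocks, occupancy


# A light kernel fills the SM; a register-heavy one is limited by the register file.
print(theoretical_occupancy(256, 32, 0))
print(theoretical_occupancy(256, 128, 0))
```

Under these assumed limits, raising register use from 32 to 128 per thread drops the resident block count from 6 to 2. That is the kind of trade-off a per-target tuning profile has to balance.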
Architecture
The stack runs from the Ampere GPU all the way up to the industrial REST surface. Layer 3 (counting up from the hardware), the custom CUDA kernel set, is where NeuraTensor extracts its speedup of two orders of magnitude. The layers above it are ordinary Python, with C++ in the industrial integration layer.
Layer 7 · Application interface
REST API + WebSocket server · Python
Layer 6 · Orchestration & control
Execution loop, monitoring, scheduling
Layer 5 · Industrial integration
SLA monitor · SEMI/GEM protocols · Python/C++
Layer 4 · Model architecture
Hybrid neural network implementation · Python
Layer 3 · CUDA acceleration
Custom GPU kernels · CUDA C++ · the NeuraTensor core
Layer 2 · Hardware abstraction
Runtime + device management · Python/Shell
Layer 1 · Physical hardware
Jetson AGX Orin 64 GB · Ampere SM 8.7
Measured performance
Benchmarks run on the same SDK build that ships to customers. The PyTorch baseline executes the identical 64 M-parameter model with stock operators. NeuraTensor delivers the same numerics while cutting wall-clock time by two orders of magnitude (111×).
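As a quick sanity check on the "two orders of magnitude" claim, the arithmetic below works only from the two headline figures quoted on this page (23 ms and 111×). The implied baseline latency is derived from those figures and is not a separately reported measurement. The throughput line assumes back-to-back single inferences, and the variable names are illustrative.

```python
import math

latency_ms = 23.0   # NeuraTensor inference latency quoted on this page
speedup = 111.0     # speedup vs the PyTorch reference quoted on this page

implied_baseline_ms = latency_ms * speedup      # derived, not a reported measurement
orders_of_magnitude = math.log10(speedup)       # 2.0 would be exactly 100x
inferences_per_second = 1000.0 / latency_ms     # assumes sequential, batch-1 inference

print(f"implied baseline: {implied_baseline_ms:.0f} ms")
print(f"orders of magnitude: {orders_of_magnitude:.2f}")
print(f"sequential throughput: {inferences_per_second:.1f} inferences/s")
```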
Inference latency
23 ms
64 M params · warm
PyTorch baseline
0ms
same model · stock ops
Speedup
111×
kernel fusion + occupancy
Memory reduction
4×
vs reference path
GPU utilization
98.25%
sustained
Resident GPU memory
0GB
LLM + vision co-resident
Mean power
44.52 W
under 60 W TDP
Junction temp
40.92 °C
peak 42.06 °C
Cold start
0s
disk → GPU
Hardware platform
NeuraTensor targets the Ampere edge stack directly: SM 8.7, Tensor Cores, unified LPDDR5 memory. The kernel set is portable across the Jetson Orin family; the performance numbers on this page are anchored to the AGX Orin 64 GB validation host.
Where NeuraTensor matters
Anywhere latency, power and on-device autonomy are non-negotiable. NeuraTensor is the substrate beneath every Neuramorphic foundation model deployment — and is licensable standalone for partner workloads.
Foundation models on edge
Ships under NeuratronLLM-Edge (Caroline). Runs a 4 B-parameter LLM and a real-time vision pipeline concurrently on a single Jetson AGX Orin under 60 W.
Industrial control & robotics
Sub-25 ms inference closes the loop tight enough for motion control, line-side perception and adaptive process control on tool-mounted compute.
Defense & autonomous systems
Disconnected operation by design. The kernel set runs without internet, telemetry or driver phone-home — the entire compute path stays on the device.
Healthcare imaging
Bedside inference where PHI cannot leave the room. The 4× memory reduction lets larger diagnostic models fit on Orin-class compute next to the patient.
Semiconductor fab tools
Tool-mounted reasoning and anomaly detection co-located with the equipment. NeuraTensor is the acceleration layer behind regulated industrial deployments.
Energy & grid edge
Substations, wind turbines, remote pipelines. Inference budgets in single-digit watts; NeuraTensor keeps the kernels inside the envelope without sacrificing model size.
FAQ
NeuraTensor is a custom CUDA kernel set built for hybrid neuromorphic inference at the edge. It replaces stock PyTorch / TensorFlow operators with hand-tuned, fused kernels that target the NVIDIA Ampere edge stack (SM 8.7) and deliver 23 ms inference, 111× speedup and 4× memory reduction on Jetson AGX Orin.
cuDNN and TensorRT accelerate standard transformer operators. NeuraTensor accelerates hybrid neuromorphic operators — fused spiking + state-space + sparse attention paths — that have no production-grade equivalent in vendor libraries. The two are complementary; NeuraTensor sits at Layer 3 of the stack and falls back to cuBLAS / cuDNN where appropriate.
Primary target is the NVIDIA Jetson AGX Orin 64 GB (Ampere, SM 8.7, 2,048 CUDA cores, 64 Tensor Cores, 204.8 GB/s LPDDR5). The kernel set is portable across the Jetson Orin family (Nano, NX, AGX) and runs on workstation Ampere/Hopper for pre-deployment profiling.
The PyTorch reference path issues hundreds of small ops per token, each with allocation and launch overhead. NeuraTensor fuses spiking, state-space and attention into a handful of compiled kernels, aligns occupancy to SM 8.7, exploits Tensor Cores for FP16 / INT8 mixed precision and saturates the unified memory bus. Same numerics, two orders of magnitude faster wall-clock.
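The fusion argument can be illustrated with a toy model in plain Python. This is not NeuraTensor's kernel code. The layer equations (a scalar linear state-space recurrence feeding a leaky integrate-and-fire neuron), the function names and the default constants are all assumptions chosen for illustration. The unfused path materializes an intermediate sequence between the two stages, while the fused path computes the same spikes in a single pass with no intermediate buffer.

```python
def ssm_scan(xs, a=0.9, b=0.1):
    """Unfused stage 1: h[t] = a*h[t-1] + b*x[t], materialized as a list."""
    h, out = 0.0, []
    for x in xs:
        h = a * h + b * x
        out.append(h)
    return out


def lif_spikes(currents, decay=0.8, threshold=0.5):
    """Unfused stage 2: leaky integrate-and-fire over an input current sequence."""
    v, spikes = 0.0, []
    for i in currents:
        v = decay * v + i
        fired = v >= threshold
        spikes.append(1 if fired else 0)
        if fired:
            v = 0.0  # reset the membrane potential after a spike
    return spikes


def fused_ssm_lif(xs, a=0.9, b=0.1, decay=0.8, threshold=0.5):
    """Fused: both recurrences advance in one loop; no intermediate list is stored."""
    h, v, spikes = 0.0, 0.0, []
    for x in xs:
        h = a * h + b * x
        v = decay * v + h
        fired = v >= threshold
        spikes.append(1 if fired else 0)
        if fired:
            v = 0.0
    return spikes


xs = [1.0, 2.0, 0.5, 3.0, 0.0, 1.5, 2.5, 0.2]
# Both paths apply the same floating-point operations in the same order.
print(fused_ssm_lif(xs) == lif_spikes(ssm_scan(xs)))
```

On a GPU the same idea removes a global-memory write and read of the intermediate tensor, along with one kernel launch per fused stage.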
Air-gapped operation is supported. The kernel set runs entirely on-device with no network, telemetry or driver phone-home. It is the substrate behind the Caroline air-gap deployment, which is launched inside a Linux network namespace and validated to block 4 of 4 external egress probes at the kernel level.
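An egress check of the kind mentioned here can be sketched as a small probe that attempts outbound TCP connections and counts how many fail. This is an illustrative sketch, not the validation suite used for the Caroline deployment. The function name `probe_egress`, the target list and the timeout are hypothetical. The example swaps in a stub connector that simulates an isolated namespace, so it runs without touching the network.

```python
import socket


def probe_egress(targets, timeout=2.0, connect=socket.create_connection):
    """Try a TCP connection to each (host, port); return the targets that were blocked."""
    blocked = []
    for host, port in targets:
        try:
            conn = connect((host, port), timeout=timeout)
        except OSError:
            blocked.append((host, port))  # unreachable, refused, or timed out
        else:
            conn.close()  # the connection succeeded, so egress is NOT blocked
    return blocked


def _no_network(address, timeout):
    """Stub connector that simulates a namespace with no route to the outside."""
    raise OSError("network unreachable")


# Placeholder targets from documentation-reserved ranges; a real audit must use
# real external endpoints, because these addresses are unreachable everywhere.
targets = [("192.0.2.1", 443), ("192.0.2.1", 80),
           ("198.51.100.7", 53), ("203.0.113.9", 123)]
blocked = probe_egress(targets, connect=_no_network)
print(f"blocked {len(blocked)} of {len(targets)} egress probes")
```

Omitting the `connect` argument uses `socket.create_connection`, which is what a real run inside the namespace would exercise.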
On Jetson AGX Orin under sustained concurrent LLM + vision workload at 98.25% GPU utilization: estimated mean power 44.52 W, peak 45.08 W — inside the published 60 W platform envelope. Junction temperature mean 40.92 °C, peak 42.06 °C.
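The headroom implied by these figures is simple arithmetic, sketched below using only the power numbers quoted above. The variable names are illustrative.

```python
ENVELOPE_W = 60.0      # published platform envelope quoted above
mean_power_w = 44.52   # estimated mean power quoted above
peak_power_w = 45.08   # estimated peak power quoted above

peak_headroom_w = round(ENVELOPE_W - peak_power_w, 2)
mean_fraction = mean_power_w / ENVELOPE_W
peak_fraction = peak_power_w / ENVELOPE_W

print(f"peak headroom: {peak_headroom_w} W")
print(f"mean draw: {mean_fraction:.1%} of envelope, peak: {peak_fraction:.1%}")
```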
The NeuraTensor SDK is licensed standalone for partner workloads and ships embedded inside the NeuratronLLM-Edge foundation models. Source, kernel internals and per-target tuning profiles are released under NDA. Contact support@neuramorphic.ai for an evaluation build.
NeuraTensor exposes a Python API surface that mirrors PyTorch tensor operations for the supported neuromorphic ops, plus a C++ runtime for low-overhead embedded deployment. A reference Docker image for JetPack 6.x is available for partners under NDA.
Patent pending
The NeuraTensor kernel set, the fused SNN-SSM compiler path and the hardware-aware scheduler are proprietary to Neuramorphic, Inc. and are protected by USPTO patent applications.
Implementation details, kernel source, occupancy maps and per-target tuning profiles are confidential and shared under NDA. Contact support@neuramorphic.ai for SDK access.