Moshe Tanach, CEO and Co-Founder at NeuReality – Interview Series

Moshe Tanach is the CEO & co-founder of NeuReality. Before founding NeuReality, Moshe served as Director of Engineering at Marvell and Intel, where he led the development of complex wireless and networking products to mass production. He also served as AVP of R&D at DesignArt Networks (later acquired by Qualcomm), where he contributed to the development of 4G base station products.

NeuReality’s mission is to simplify AI adoption. By taking a system-level approach to AI, NeuReality’s team of industry experts delivers AI inference holistically, identifying pain points and providing purpose-built, silicon-to-software AI inference solutions that make AI both affordable and accessible.

With your extensive experience leading engineering projects at Marvell, Intel, and DesignArt Networks, what inspired you to co-found NeuReality, and how did your previous roles influence the vision and direction of the company?

NeuReality was built from inception to solve the future cost, complexity and climate problems that would be inevitable in AI inferencing – which is the deployment of trained AI models and software into production-level AI data centers. Where AI training is how AI is created, AI inference is how it is used and how it interacts with billions of people and devices around the world.

We are a team of systems engineers, so we look at all angles and all the facets of end-to-end AI inferencing, including GPUs and all classes of purpose-built AI accelerators. It became clear to us going back to 2015 that CPU-reliant AI chips and systems – which is every GPU, TPU, LPU, NRU, ASIC and FPGA out there – would hit a significant wall by 2020. It’s a system limitation: the AI accelerator has become better and faster in terms of raw performance, but the underlying infrastructure has not kept up.

As a result, we decided to break away from the big giants riddled with bureaucracy that protect successful businesses, like CPU and NIC manufacturers, and disrupt the industry with a better AI architecture that is open, agnostic, and purpose-built for AI inference. One of the conclusions of reimagining ideal AI inference is that boosting GPU utilization and system-level efficiency calls for a new AI compute and network infrastructure – in our case, one powered by our novel NR1 server-on-chip that replaces the host CPU and NICs. As an ingredient brand and companion to any GPU or AI accelerator, we can remove the market barriers that deter 65% of organizations from innovating and adopting AI today – chiefly underutilized GPUs, which lead to buying more than what’s really needed (because they run idle more than 50% of the time) – all the while reducing energy consumption, AI data center real-estate challenges, and operational costs.

This is a once-in-a-lifetime opportunity to really transform AI system architecture for the better based on everything I learned and practiced for 30 years, opening the doors for new AI innovators across industries and removing CPU bottlenecks, complexity, and carbon footprints.

NeuReality’s mission is to democratize AI. Can you elaborate on what “AI for All” means to you and how NeuReality plans to achieve this vision?

Our mission is to democratize AI by making it more accessible and affordable to all organizations big and small – by unleashing the maximum capacity of any GPU or any AI accelerator so you get more from your investment. In other words, get MORE from the GPUs you buy, rather than buying more GPUs that run idle more than 50% of the time. We can boost AI accelerators up to 100% of their full capability, while delivering up to 15X energy efficiency and slashing system costs by up to 90%. These are order-of-magnitude improvements. We plan to achieve this vision with our NR1 AI Inference Solution, the world’s first data center system architecture tailored for the AI age. It runs high-volume, high-variety AI data pipelines affordably and efficiently with the added benefit of a reduced carbon footprint.
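
To make the utilization argument concrete, here is a minimal arithmetic sketch in Python. The function name and all of the numbers (target queries per second, peak throughput per accelerator) are hypothetical and chosen only for illustration; they are not NeuReality measurements.

```python
import math


def accelerators_needed(required_qps, peak_qps_per_accelerator, utilization):
    """Return how many accelerators are needed to sustain a target throughput.

    utilization is the fraction of peak throughput each accelerator
    actually delivers in the deployed system (0 < utilization <= 1).
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    effective_qps = peak_qps_per_accelerator * utilization
    return math.ceil(required_qps / effective_qps)


# Hypothetical workload: 10,000 queries/s on accelerators that peak at 1,000 queries/s.
half_idle = accelerators_needed(10_000, 1_000, 0.5)   # accelerators idle half the time
fully_used = accelerators_needed(10_000, 1_000, 1.0)  # accelerators kept fully busy

print(half_idle, fully_used)  # 20 10
```

Under these assumed numbers, a fleet that sits idle half the time needs twice as many accelerators for the same workload, which is the "buying more GPUs" effect described above.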

Achieving AI for all also means making it easy to use. At NeuReality, we simplify AI infrastructure deployment, management, and scalability, enhance business processes and profitability, and advance sectors such as public health, safety, law enforcement and customer service. Our impact spans sectors such as medical imaging, clinical trials, fraud detection, AI content creation and many more.

Currently, our first NR1-S AI Inference Appliances are commercially available with Qualcomm Cloud AI 100 Ultra accelerators and through Cirrascale, a cloud service provider.

The NR1 AI Inference Solution is touted as the first data center system architecture tailored for the AI age, and purpose-built for AI inference. What were the key innovations and breakthroughs that led to the development of the NR1?

NR1™ is the name of the entire silicon-to-software system architecture we’ve designed and delivered to the AI industry – an open, fully compatible AI compute and networking infrastructure that complements any GPU or AI accelerator. If I had to break it down to the most unique and exciting innovations that led to this end-to-end NR1 Solution and differentiate us, I’d say:

  • Optimized AI Compute Graphs: The team designed a Programmable Graph Execution Accelerator to optimize the processing of Compute Graphs, which are crucial for AI and various other workloads like media processing, databases, and more. Compute Graphs represent a series of operations with dependencies, and this broader applicability positions NR1 as potentially disruptive beyond just super boosting GPUs and other AI accelerators. It simplifies AI model deployment by generating optimized Compute Graphs (CGs) based on pre-processed AI data and software APIs, leading to significant performance gains.
  • NR1 NAPU™ (Network Addressable Processing Unit): Our AI inference architecture is powered by the NR1 NAPU™ – a 7nm server-on-chip that enables direct network access for AI pre- and post-processing. We pack 6.5x more punch on a smaller NR1 chip than a typical general-purpose host CPU. Traditionally, pre-processing tasks (like data cleaning, formatting, and feature extraction) and post-processing tasks (like result interpretation and formatting) are handled by the CPU. By offloading these tasks to the NR1 NAPU™, we displace both the CPU and the NIC. This reduces bottlenecks, allowing for faster overall processing, lightning-fast response times and lower cost per AI query.
  • NR1™ AI-Hypervisor™ technology: The NR1’s patented hardware-based AI-Hypervisor™ optimizes AI task orchestration and resource utilization, improving efficiency and reducing bottlenecks.
  • NR1™ AI-over-Fabric™ Network Engine: The NR1 incorporates a unique AI-over-Fabric™ network engine that ensures seamless network connectivity and efficient scaling of AI resources across multiple NR1 chips – which are coupled with any GPU or AI Accelerator – within the same inference server or NR1-S AI inference appliance.
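
The idea of a compute graph as "a series of operations with dependencies" can be illustrated with a small, generic Python sketch. This is not NeuReality's implementation or API: the node names (`decode`, `normalize`, `infer`, `label`) and the stand-in "model" are hypothetical, and the scheduling simply uses the standard library's `graphlib.TopologicalSorter` (Python 3.9+).

```python
from graphlib import TopologicalSorter


def run_graph(ops, deps, inputs):
    """Execute a compute graph in dependency order.

    ops:    node name -> callable that takes the results of its dependencies
    deps:   node name -> list of node names it depends on
    inputs: node name -> value already available (e.g. the raw request)
    """
    results = dict(inputs)
    # static_order() yields every node after all of its dependencies.
    for name in TopologicalSorter(deps).static_order():
        if name in results:  # input nodes already have a value
            continue
        args = [results[d] for d in deps.get(name, [])]
        results[name] = ops[name](*args)
    return results


# Hypothetical pipeline: pre-processing, a stand-in "model", post-processing.
ops = {
    "decode": lambda raw: [int(x) for x in raw.split(",")],   # pre-processing
    "normalize": lambda pixels: [p / 255 for p in pixels],    # pre-processing
    "infer": lambda x: sum(x) / len(x),                       # stand-in for the AI model
    "label": lambda score: "bright" if score > 0.5 else "dark",  # post-processing
}
deps = {
    "decode": ["request"],
    "normalize": ["decode"],
    "infer": ["normalize"],
    "label": ["infer"],
}

out = run_graph(ops, deps, {"request": "255,255,0,255"})
print(out["label"])  # bright
```

In this toy version every step runs on the same processor. The point of the sketch is only the structure: once the pipeline is expressed as a graph, a scheduler is free to decide where and in what order each node runs.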

NeuReality’s recent performance data highlights significant cost and energy savings. Could you provide more details on how the NR1 achieves up to 90% cost savings and 15x better energy efficiency compared to traditional systems?

NeuReality’s NR1 slashes the cost and energy consumption of AI inference by up to 90% and 15x, respectively. This is achieved through:

  • Specialized Silicon: Our purpose-built AI inference infrastructure is powered by the NR1 NAPU™ server-on-chip, which absorbs the functionality of the CPU and NIC into one – and eliminates the need for CPUs in inference. Ultimately the NR1 maximizes the output of any AI accelerator or GPU in the most efficient way possible.
  • Optimized Architecture: By streamlining AI data flow and incorporating AI pre- and post-processing directly within the NR1 NAPU™, we offload and replace the CPU. This results in reduced latency, linear scalability, and lower cost per AI query.
  • Flexible Deployment: You can buy the NR1 in two primary ways: 1) inside the NR1-M™ Module which is a PCIe card that houses multiple NR1 NAPUs (typically 10) designed to pair with your existing AI accelerator cards. 2) inside the NR1-S™ Appliance, which pairs NR1 NAPUs with an equal number of AI accelerators (GPU, ASIC, FPGA, etc.) as a ready-to-go AI Inference system.

At Supercomputing 2024 in November, you will see us demonstrate an NR1-S Appliance with 4x NR1 chips per 16x Qualcomm Cloud AI 100 Ultra accelerators. We’ve tested the same with Nvidia AI inference chips. NeuReality is revolutionizing AI inference with its open, purpose-built architecture.

How does the NR1-S AI Inference Appliance, paired with Qualcomm® Cloud AI 100 accelerators, compare against traditional CPU-centric inference servers with Nvidia® H100 or L40S GPUs in real-world applications?

NR1, combined with Qualcomm Cloud AI 100 or NVIDIA H100 or L40S GPUs, delivers a substantial performance boost over traditional CPU-centric inference servers in real-world AI applications across large language models like Llama 3, computer vision, natural language processing and speech recognition. In other words, running your AI inference system with NR1 optimizes the performance, system cost, energy efficiency and response times across images, sound, language, and text – either separately (single modality) or together (multi-modality).

The end result? When paired with NR1, a customer gets MORE from the expensive GPU investments they make, rather than BUYING more GPUs to achieve desired performance.

Beyond maximizing GPU utilization, the NR1 delivers exceptional efficiency, resulting in 50-90% better price/performance and up to 13-15x greater energy efficiency. This translates to significant cost savings and a reduced environmental footprint for your AI infrastructure.

The NR1-S demonstrates linear scalability with no performance drop-offs. Can you explain the technical aspects that allow such seamless scalability?

The NR1-S Appliance, coupling our NR1 chips with AI accelerators of any type or quantity, redefines AI infrastructure. We’ve moved beyond CPU-centric limitations to achieve a new level of performance and efficiency.

Instead of the traditional NIC-to-CPU-to-accelerator bottleneck, the NR1-S integrates direct network access, AI pre-processing, and post-processing within our Network Addressable Processing Units (NAPUs). With typically 10 NAPUs per system, each handling tasks like vision, audio, and DSP processing, and our AI-Hypervisor™ orchestrating workloads, streamlined AI data flow is achieved. This translates to linear scalability: add more accelerators, get proportionally more performance.
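
The difference between a shared host bottleneck and per-accelerator front ends can be sketched with a toy throughput model in Python. The two functions and every number below are illustrative assumptions, not measured NR1-S results: the first models a single host CPU and NIC feeding all accelerators, the second models each accelerator being paired with its own front-end processor.

```python
def shared_host_throughput(n_accel, accel_qps, host_qps):
    """All accelerators are fed by one host whose capacity is fixed."""
    return min(n_accel * accel_qps, host_qps)


def paired_frontend_throughput(n_accel, accel_qps, frontend_qps):
    """Each accelerator has its own front end, so capacity grows with n_accel."""
    return n_accel * min(accel_qps, frontend_qps)


# Hypothetical capacities in queries per second.
ACCEL_QPS = 1_000     # one accelerator at full utilization
HOST_QPS = 2_500      # one shared host CPU + NIC
FRONTEND_QPS = 1_000  # one front end per accelerator

for n in (1, 2, 4, 8):
    print(
        n,
        shared_host_throughput(n, ACCEL_QPS, HOST_QPS),
        paired_frontend_throughput(n, ACCEL_QPS, FRONTEND_QPS),
    )
# 1 1000 1000
# 2 2000 2000
# 4 2500 4000
# 8 2500 8000
```

In this toy model the shared-host system stops scaling once the host saturates, while the paired design keeps growing in proportion to the number of accelerators, which is what "linear scalability" means here.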

The result? 100% utilization of AI accelerators is consistently observed. While overall cost and energy efficiency vary depending on the specific AI chips used, a maximized hardware investment and improved performance are consistently delivered. As AI inference needs scale, the NR1-S provides a compelling alternative to traditional architectures.

NeuReality aims to address the barriers to widespread AI adoption. What are the most significant challenges businesses face when adopting AI, and how does your technology help overcome these?

When poorly implemented, AI software and solutions can become troublesome. Many businesses cannot adopt AI due to the cost and complexity of building and scaling AI systems. Today’s AI solutions are not optimized for inference, with training pods typically having poor efficiency and inference servers having high bottlenecks. To take on this challenge and make AI more accessible, we have developed the first complete AI inference solution – a compute and networking infrastructure powered by our NAPU – which makes the most of its companion AI accelerator and reduces market barriers around excessive cost and energy consumption.

Our system-level approach to AI inference – versus trying to develop a better GPU or AI accelerator where there is already a lot of innovation and competition – means we are filling a significant industry gap for dozens of AI inference chip and system innovators. Our team attacked the shortcomings in AI inference systemically and holistically, by determining pain points, architecture gaps and AI workload projections, to deliver the first purpose-built, silicon-to-software, CPU-free AI inference architecture. And by developing a top-to-bottom AI software stack with open standards from Python and Kubernetes combined with NeuReality Toolchain, Provisioning, and Inference APIs, our integrated set of software tools combines all components into a single high-quality UI/UX.

In a competitive AI market, what sets NeuReality apart from other AI inference solution providers?

To put it simply, we’re open and accelerator-agnostic. Our NR1 inference infrastructure supercharges any AI accelerator – GPU, TPU, LPU, ASIC, you name it – creating a truly optimized end-to-end system. AI accelerators were initially brought in to help CPUs handle the demands of neural networks and machine learning at large, but they have become so powerful that they’re now held back by the very CPUs they were meant to assist.

Our solution? The NR1. It’s a complete, reimagined AI inference architecture. Our secret weapon? The NR1 NAPU™ was designed as a co-ingredient to maximize AI accelerator performance without guzzling extra power or breaking the bank. We’ve built an open ecosystem, seamlessly integrating with any AI inference chip and popular software frameworks like Kubernetes, Python, TensorFlow, and more.

NeuReality’s open approach means we’re not competing with the AI landscape; we’re here to complement it through strategic partnerships and technology collaboration. We provide the missing piece of the puzzle: a purpose-built, CPU-free inference architecture that not only unlocks AI accelerators to benchmark performance, but also makes it easier for businesses and governments to adopt AI. Imagine unleashing the full power of NVIDIA H100s, Google TPUs, or AMD MI300s – giving them the infrastructure they deserve.

NeuReality’s open, efficient architecture levels the playing field, making AI more accessible and affordable for everyone. I’m passionate about seeing different industries – fintech, biotech, healthtech – experience the NR1 advantage firsthand. Compare your AI solutions on traditional CPU-bound systems versus the modern NR1 infrastructure and witness the difference. Today, only 35% of businesses and governments have adopted AI and that is based on incredibly low qualifying criteria. Let’s make it possible for over 50% of enterprise customers to adopt AI by this time next year without harming the planet or breaking the bank.

Looking ahead, what is NeuReality’s long-term vision for the role of AI in society, and how do you see your company contributing to this future?

I envision a future where AI benefits everyone, fostering innovation and improving lives. We’re not just building technology; we’re building the foundation for a better future.

Our NR1 is key to that vision. It’s a complete AI inference solution that starts to shatter the cost and complexity barriers hindering mass AI business adoption. We’ve reimagined both the infrastructure and the architecture, delivering a revolutionary system that maximizes the output of any GPU, any AI accelerator, without increasing operational costs or energy consumption.

The business model really matters for scaling and for giving end customers real choices rather than a concentrated AI autocracy, as I’ve written about before. So we’re building an open ecosystem where our silicon works with other silicon, not against it. That’s why we designed NR1 to integrate seamlessly with all AI accelerators and with open models and software, making it as easy as possible to install, manage and scale.

But we’re not stopping there. We’re collaborating with partners to validate our technology across various AI workloads and deliver “inference-as-a-service” and “LLM-as-a-service” through cloud service providers, hyperscalers, and directly with companion chip makers. We want to make advanced AI accessible and affordable to all.

Imagine the possibilities if we could boost AI inference performance, energy efficiency, and affordability by double-digit percentages. Imagine a robust, AI-enabled society with more voices and choices becoming a reality. So, we must all do the demanding work of proving business impact and ROI when AI is implemented in daily data center operations. Let’s focus on revolutionary AI implementation, not just AI model capability.

This is how we contribute to a future where AI benefits everyone – a win for profit margins, people, and the planet.

Thank you for the great interview. Readers who wish to learn more should visit NeuReality.