Jason Knight is Co-founder and Vice President of Machine Learning at OctoAI, a platform that delivers a complete stack for app builders to run, tune, and scale their AI applications in the cloud or on-premises.
OctoAI was spun out of the University of Washington by the original creators of Apache TVM, an open source stack for ML portability and performance. TVM enables ML models to run efficiently on any hardware backend, and has quickly become a key part of the architecture of popular consumer devices like Amazon Alexa.
Can you share the inspiration behind founding OctoAI and the core problem you aimed to solve?
AI has traditionally been a complex field, accessible only to those comfortable with the mathematics and high-performance computing required to build with it. But AI unlocks the ultimate computing interfaces, those of text, voice, and imagery, programmed by examples and feedback, and it brings the full power of computing to everyone on Earth. Before AI, only programmers could get computers to do what they wanted, by writing in arcane programming languages.
OctoAI was created to accelerate our path to that reality so that more people can use and benefit from AI. And people, in turn, can use AI to create yet more benefits by accelerating the sciences, medicine, art, and more.
Reflecting on your experience at Intel, how did your previous roles prepare you for co-founding and leading the development at OctoAI?
Intel, and the AI hardware and biotech startups before it, gave me the perspective to see how hard AI is for even the most sophisticated technology companies, and yet how valuable it can be to those who have figured out how to use it. I also saw that the gap between those benefiting from AI and those who aren't yet is primarily one of infrastructure, compute, and best practices, not magic.
What differentiates OctoStack from other AI deployment solutions available in the market today?
OctoStack is the industry’s first complete technology stack designed specifically for serving generative AI models anywhere. It offers a turnkey production platform that provides highly optimized inference, model customization, and asset management at an enterprise scale.
OctoStack allows organizations to achieve AI autonomy by running any model in their preferred environment with full control over data, models, and hardware. It also delivers unmatched performance and cost efficiency, with cost savings of up to 12X compared to proprietary models like GPT-4.
Can you explain the advantages of deploying AI models in a private environment using OctoStack?
Models these days are ubiquitous, but assembling the right infrastructure to run those models and apply them with your own data is where the business-value flywheel truly starts to spin. Using these models on your most sensitive data, and then turning that into insights, better prompt engineering, RAG pipelines, and fine-tuning, is where you can get the most value out of generative AI. But it's still difficult for all but the most sophisticated companies to do this alone, which is where a turnkey solution like OctoStack can accelerate your efforts and bring the best practices together in one place for your practitioners.
Deploying AI models in a private environment using OctoStack offers several advantages, including enhanced security and control over data and models. Customers can run generative AI applications within their own VPCs or on-premises, ensuring that their data remains secure and within their chosen environments. This approach also provides businesses with the flexibility to run any model, be it open-source, custom, or proprietary, while benefiting from cost reductions and performance improvements.
What challenges did you face in optimizing OctoStack to support a wide range of hardware, and how were these challenges overcome?
Optimizing OctoStack to support a wide range of hardware involved ensuring compatibility and performance across various devices, such as NVIDIA and AMD GPUs and AWS Inferentia. OctoAI overcame these challenges by leveraging its deep AI systems expertise, developed through years of research and development, to create a platform that continuously updates and supports additional hardware types, GenAI use cases, and best practices. This allows OctoAI to deliver market-leading performance and cost efficiency.
Additionally, getting the latest generative AI capabilities, such as multi-modality, function calling, strict JSON schema following, and efficient fine-tune hosting, into the hands of your internal developers will accelerate your AI takeoff.
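As a concrete illustration of that developer experience, here is a minimal sketch of strict JSON output requested through an OpenAI-compatible chat completions API. The base URL and model identifier are hypothetical placeholders for a private deployment, not confirmed OctoStack values.

```python
# Minimal sketch: requesting JSON-constrained output from an
# OpenAI-compatible endpoint. base_url and model are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://example-octostack.internal/v1",  # hypothetical private deployment
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="meta-llama-3.1-8b-instruct",  # placeholder model identifier
    messages=[
        {"role": "system", "content": "Extract the order details as JSON."},
        {"role": "user", "content": "Two lattes and a croissant, for pickup at 9am."},
    ],
    response_format={"type": "json_object"},  # ask the server to enforce valid JSON
)

print(response.choices[0].message.content)
```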
OctoAI has a rich history of leveraging Apache TVM. How has this framework influenced your platform’s capabilities?
We created Apache TVM to make it easier for sophisticated developers to write efficient AI libraries for GPUs and accelerators. We did this because getting the most performance out of GPU and accelerator hardware was as critical for AI inference then as it is now.
We’ve since leveraged that same mindset and expertise for the entire Gen AI serving stack to deliver automation for a broader set of developers.
Can you discuss any significant performance improvements that OctoStack offers, such as the 10x performance boost in large-scale deployments?
OctoStack offers significant performance improvements, including cost savings of up to 12X compared to proprietary models like GPT-4, without sacrificing speed or quality. It also provides 4X better GPU utilization and a 50 percent reduction in operational costs, enabling organizations to run large-scale deployments efficiently and cost-effectively.
Can you share some notable use cases where OctoStack has significantly improved AI deployment for your clients?
A notable use case is Apate.ai, a global service combating telephone scams using generative conversational AI. Apate.ai leveraged OctoStack to efficiently run their suite of language models across multiple geographies, benefiting from OctoStack’s flexibility, scale, and security. This deployment allowed Apate.ai to deliver custom models supporting multiple languages and regional dialects while meeting both their performance targets and their security-sensitive requirements.
In addition, we serve hundreds of fine-tunes for our customer OpenPipe. Were they to spin up a dedicated instance for each of these, serving their customers’ use cases would be economically infeasible, since those customers continually grow, evolve their use cases, and re-train their parameter-efficient fine-tunes for maximum output quality at cost-effective prices.
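For readers curious how serving hundreds of fine-tunes can be economical, the sketch below shows the common multi-tenant pattern for parameter-efficient fine-tunes (for example, LoRA adapters): many fine-tunes share one endpoint and one set of base-model weights, and each request selects a fine-tune by model identifier. The endpoint and model names are hypothetical placeholders, and this is an assumed pattern rather than a confirmed OctoStack implementation detail.

```python
# Sketch: many parameter-efficient fine-tunes served from one shared
# endpoint, selected per request by model id, instead of one dedicated
# GPU instance per fine-tune. All names are hypothetical placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://example-octostack.internal/v1",  # hypothetical shared endpoint
    api_key="YOUR_API_KEY",
)

# Each customer's fine-tune is just a different model id on the same
# endpoint; the base model weights are shared across all of them.
for fine_tune_id in ["acme-support-v3", "globex-summarizer-v1"]:
    reply = client.chat.completions.create(
        model=fine_tune_id,
        messages=[{"role": "user", "content": "Summarize today's tickets."}],
    )
    print(fine_tune_id, "->", reply.choices[0].message.content)
```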
Thank you for the great interview. Readers who wish to learn more should visit OctoAI.