Saurabh Vij is the CEO and co-founder of MonsterAPI. He previously worked as a particle physicist at CERN and recognized the potential for decentralized computing from projects like LHC@home.
MonsterAPI leverages lower-cost commodity GPUs, from crypto mining farms to smaller idle data centres, to provide scalable, affordable GPU infrastructure for machine learning, allowing developers to access, fine-tune, and deploy AI models at significantly reduced costs without writing a single line of code.
Before MonsterAPI, he ran two startups, including one that developed a wearable safety device for women in India, in collaboration with the Government of India and IIT Delhi.
Can you share the genesis story behind MonsterGPT?
Our mission has always been “to help software developers fine-tune and deploy AI models faster and in the easiest manner possible.” We realised that developers face multiple complex challenges when they want to fine-tune and deploy an AI model, from dealing with code to setting up Docker containers on GPUs and scaling them on demand.
Given the pace at which the ecosystem is moving, just fine-tuning is not enough. It needs to be done the right way: avoiding underfitting and overfitting, optimizing hyperparameters, and incorporating the latest methods like LoRA and QLoRA to perform faster and more economical fine-tuning. Once fine-tuned, the model needs to be deployed efficiently.
This made us realise that offering a tool for just a small part of the pipeline is not enough. A developer needs the entire optimised pipeline, from fine-tuning to evaluation and final deployment of their models, coupled with a great interface they are familiar with.
I asked myself a question: As a former particle physicist, I understand the profound impact AI could have on scientific work, but I don’t know where to start. I have innovative ideas but lack the time to learn all the skills and nuances of machine learning and infrastructure.
What if I could simply talk to an AI, provide my requirements, and have it build the entire pipeline for me, delivering the required API endpoint?
This led to the idea of a chat-based system to help developers fine-tune and deploy effortlessly.
MonsterGPT is our first step towards this journey.
There are millions of software developers, innovators, and scientists like us who could leverage this approach to build more domain-specific models for their projects.
Could you explain the underlying technology behind Monster API’s GPT-based deployment agent?
MonsterGPT leverages advanced technologies to efficiently deploy and fine-tune open source Large Language Models (LLMs) such as Phi-3 from Microsoft and Llama 3 from Meta.
- RAG with Context Configuration: Automatically prepares configurations with the right hyperparameters for fine-tuning LLMs or deploying models using scalable REST APIs from MonsterAPI.
- LoRA (Low-Rank Adaptation): Enables efficient fine-tuning by updating only a subset of parameters, reducing computational overhead and memory requirements.
- Quantization Techniques: Utilizes GPT-Q and AWQ to optimize model performance by reducing precision, which lowers memory footprint and accelerates inference without significant loss in accuracy.
- vLLM Engine: Provides high-throughput LLM serving with features like continuous batching, optimized CUDA kernels, and parallel decoding algorithms for efficient large-scale inference (a minimal serving sketch follows this list).
- Decentralized GPUs for scale and affordability: Our fine-tuning and deployment workloads run on a network of low-cost GPUs from multiple vendors, from smaller data centres to emerging GPU clouds like CoreWeave, providing lower costs, high optionality, and availability of GPUs to ensure scalable and efficient processing.
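To make the quantization and vLLM serving pieces above a little more concrete, here is a minimal, illustrative sketch of offline inference with vLLM’s Python API on an AWQ-quantized model. The checkpoint name and sampling settings are assumptions for illustration only; this is not MonsterAPI’s internal code.

```python
# Illustrative sketch only: offline inference with vLLM using an AWQ-quantized model.
# Assumptions: vLLM is installed and the AWQ checkpoint below is available on the Hugging Face Hub.
from vllm import LLM, SamplingParams

llm = LLM(
    model="casperhansen/llama-3-8b-instruct-awq",  # assumed AWQ-quantized Llama 3 checkpoint
    quantization="awq",                            # lower-precision weights -> smaller memory footprint
)

sampling = SamplingParams(temperature=0.7, max_tokens=128)

# vLLM batches these prompts internally (continuous batching) for high throughput.
outputs = llm.generate(
    [
        "Summarize the benefits of LoRA fine-tuning.",
        "Explain AWQ quantization in one sentence.",
    ],
    sampling,
)
for out in outputs:
    print(out.outputs[0].text)
```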
Check out our latest blog post on deploying Llama 3 using MonsterGPT.
How does it streamline the fine-tuning and deployment process?
MonsterGPT provides a chat interface with the ability to understand natural-language instructions for launching, tracking, and managing complete fine-tuning and deployment jobs. This abstracts away many complex steps, such as:
- Building a data pipeline
- Figuring out the right GPU infrastructure for the job
- Configuring appropriate hyperparameters
- Setting up the ML environment with compatible frameworks and libraries
- Implementing fine-tuning scripts for efficient LoRA/QLoRA fine-tuning with quantization strategies (a minimal sketch follows this list)
- Debugging issues like out-of-memory and code-level errors
- Designing and implementing multi-node auto-scaling with high-throughput serving engines such as vLLM for LLM deployments
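For the LoRA/QLoRA item above, the following is a minimal sketch of what such a fine-tuning setup can look like using the Hugging Face transformers, peft, and bitsandbytes libraries. The base model, target modules, and hyperparameters are assumptions for illustration; this is not MonsterAPI’s actual implementation, which the agent configures automatically.

```python
# Illustrative QLoRA setup: 4-bit quantized base model + low-rank adapters.
# Assumptions: transformers, peft, and bitsandbytes are installed and a GPU is available.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Meta-Llama-3-8B"  # assumed base model

# 4-bit quantization (the "Q" in QLoRA) keeps the frozen base weights small in memory.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# LoRA: train only small low-rank adapter matrices instead of all model weights.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
# From here, the model can be passed to a standard Trainer / SFT loop on the target dataset.
```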
What kind of user interface and commands can developers expect when interacting with Monster API’s chat interface?
The user interface is a simple chat UI in which users can prompt the agent to fine-tune an LLM for a specific task such as summarization, chat completion, code generation, or blog writing. Once the model is fine-tuned, the agent can be further instructed to deploy the LLM, and the deployed model can be queried from the same chat interface. Some examples of commands include:
- Finetune an LLM for code generation on X dataset
- I want a model finetuned for blog writing
- Give me an API endpoint for Llama 3 model.
- Deploy a small model for blog writing use case
This is extremely useful because finding the right model for your project can often be a time-consuming task, and with new models emerging daily, the choice can get confusing.
How does Monster API’s solution compare in terms of usability and efficiency to traditional methods of deploying AI models?
Monster API’s solution significantly enhances usability and efficiency compared to traditional methods of deploying AI models.
For Usability:
- Automated Configuration: Traditional methods often require extensive manual setup of hyperparameters and configurations, which can be error-prone and time-consuming. MonsterAPI automates this process using RAG with context, simplifying setup and reducing the likelihood of errors.
- Scalable REST APIs: MonsterAPI provides intuitive REST APIs for deploying and fine-tuning models, making it accessible even for users with limited machine learning expertise. Traditional methods often require deep technical knowledge and complex coding for deployment.
- Unified Platform: It integrates the entire workflow, from fine-tuning to deployment, within a single platform. Traditional approaches may involve disparate tools and platforms, leading to inefficiencies and integration challenges.
For Efficiency:
MonsterAPI offers a streamlined pipeline: LoRA fine-tuning with built-in quantization for efficient memory utilization, and vLLM-powered LLM serving that achieves high throughput through continuous batching and optimized CUDA kernels, all running on a cost-effective, scalable, and highly available decentralized GPU cloud with simplified monitoring and logging.
This entire pipeline enhances developer productivity by enabling the creation of production-grade custom LLM applications while reducing the need for complex technical skills.
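As a rough sketch of what consuming such a high-throughput deployment can look like, the snippet below queries a self-hosted vLLM OpenAI-compatible server over plain HTTP. The host, port, and model name are assumptions for illustration, not MonsterAPI’s actual endpoint schema.

```python
# Illustrative only: querying a vLLM OpenAI-compatible server.
# Assumption: a server was started separately, e.g.
#   python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-8B-Instruct
# and is reachable at localhost:8000.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "messages": [
            {"role": "user", "content": "Write a two-sentence product description for a hiking boot."}
        ],
        "max_tokens": 128,
        "temperature": 0.7,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```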
Can you provide examples of use cases where Monster API has significantly reduced the time and resources needed for model deployment?
An IT consulting company needed to fine-tune and deploy the Llama 3 model to serve their client’s business needs. Without MonsterAPI, they would have required a team of 2-3 MLOps engineers with a deep understanding of hyperparameter tuning to improve the model’s quality on the provided dataset, and then to host the fine-tuned model as a scalable REST API endpoint with auto-scaling and orchestration, likely on Kubernetes. Additionally, to optimize the economics of serving the model, they wanted to use frameworks like LoRA for fine-tuning and vLLM for model serving to improve cost metrics while reducing memory consumption. This is a complex challenge for many developers and can take weeks or even months to reach a production-ready solution.
With MonsterAPI, they were able to experiment with multiple fine-tuning runs within a day and host the fine-tuned model with the best evaluation score within hours, without requiring multiple engineering resources with deep MLOps skills.
In what ways does Monster API’s approach democratize access to generative AI models for smaller developers and startups?
Small developers and startups often struggle to produce and use high-quality AI models due to a lack of capital and technical skills. Our solutions empower them by lowering costs, simplifying processes, and providing robust no-code/low-code tools to implement production-ready AI pipelines.
By leveraging our decentralized GPU cloud, we offer affordable and scalable GPU resources, significantly reducing the cost barrier for high-performance model deployment. The platform’s automated configuration and hyperparameter tuning simplify the process, eliminating the need for deep technical expertise.
Our user-friendly REST APIs and integrated workflow combine fine-tuning and deployment into a single, cohesive process, making advanced AI technologies accessible even to those with limited experience. Additionally, the use of efficient LoRA fine-tuning and quantization techniques like GPT-Q and AWQ ensures optimal performance on less expensive hardware, further lowering entry costs.
This approach empowers smaller developers and startups to implement and manage advanced generative AI models efficiently and effectively.
What do you envision as the next major advancement or feature that Monster API will bring to the AI development community?
We are working on a couple of innovative products to further advance our thesis: helping developers customise and deploy models faster, more easily, and in the most economical way.
The immediate next step is a full MLOps AI assistant that researches new optimisation strategies for LLMOps and integrates them into existing workflows, reducing developer effort in building new and better-quality models while also enabling complete customization and deployment of production-grade LLM pipelines.
Let’s say you need to generate 1 million images per minute for your use case. This can be extremely expensive. Traditionally, you would use the Stable Diffusion model and spend hours finding and testing optimization frameworks like TensorRT to improve your throughput without compromising the quality and latency of the output.
However, with MonsterAPI’s MLOps agent, you won’t need to waste all those resources. The agent will find the best framework for your requirements, leveraging optimizations like TensorRT tailored to your specific use case.
How does Monster API plan to continue supporting and integrating new open-source models as they emerge?
In 3 major ways:
- Bring access to the latest open-source models
- Provide the simplest interface for fine-tuning and deployment
- Optimise the entire stack for speed and cost with the most advanced and powerful frameworks and libraries
Our mission is to help developers of all skill levels adopt Gen AI faster, reducing their time from an idea to a well-polished and scalable API endpoint.
We will continue our efforts to provide access to the latest and most powerful frameworks and libraries, integrated into a seamless workflow for implementing end-to-end LLMOps. We are dedicated to reducing complexity for developers with our no-code tools, thereby boosting their productivity in building and deploying AI models.
To achieve this, we continuously support and integrate new open-source models, optimization frameworks, and libraries by monitoring advancements in the AI community. We maintain a scalable decentralized GPU cloud and actively engage with developers for early access and feedback. By leveraging automated pipelines for seamless integration, enhancing flexible APIs, and forming strategic partnerships with AI research organizations, we ensure our platform remains cutting-edge.
Additionally, we provide comprehensive documentation and robust technical support, enabling developers to quickly adopt and utilize the latest models. MonsterAPI keeps developers at the forefront of generative AI technology, empowering them to innovate and succeed.
What are the long-term goals for Monster API in terms of technology development and market reach?
Long term, we want to help 30 million software engineers become MLOps developers with the help of our MLOps agent and all the tools we are building.
This would require us to build not just a full-fledged agent but also a lot of fundamental proprietary technology around optimization frameworks, containerisation methods, and orchestration.
We believe that a combination of great, simple interfaces, 10x more throughput, and low-cost decentralised GPUs has the potential to transform a developer’s productivity and thus accelerate GenAI adoption.
All our research and efforts are in this direction.
Thank you for the great interview. Readers who wish to learn more should visit MonsterAPI.