Quick bio
This is your second interview at The Sequence. Please tell us a bit about yourself: your background, your current role, and how you got started in AI.
I grew up in the suburbs of Montreal and have always been passionate about mathematics. I left Montreal to study math and computer science at the University of Waterloo in Canada. I currently live in Toronto with my wonderful girlfriend and our little dog, Skywalker, who enjoys kayaking with us around the beaches and bluffs. I am a Principal Software Engineer at Microsoft, where I have worked on various AI projects and have been a core contributor to the development of Microsoft Copilot. While my colleagues recognize me as a diligent engineer, only a few have had the opportunity to witness my prowess as a skier.
I have been dedicated to building AI applications since I was in university 15 years ago. During my studies, I joined Maluuba as one of the early engineers. We developed personal assistants for phones, TVs, and cars that handled a wide range of commands. We started with classical machine learning models such as SVMs, Naive Bayes, and CRFs before adopting deep learning. We sold Maluuba to Microsoft in 2017 to help Microsoft incorporate AI into more products. I have worked on a few AI projects at Microsoft, including some research to put trainable models in Ethereum smart contracts, which we talked about in our last interview. Since 2020, I have been working on a chat system built into Bing, which evolved into Microsoft Copilot. I am currently a Principal Software Engineer on the Copilot Platform team, where we’re focused on developing a generalized platform for copilots across Microsoft.
🛠 ML Work
Your recent work includes working on Microsoft Copilot, which is a central piece of Microsoft’s AI vision. Tell us about the vision and core capabilities of the platform.
We have built a platform for copilots and apps that want to leverage large language models (LLMs) and easily take advantage of the latest developments in AI. Many products use our platform, such as Windows, Edge, Skype, Bing, SwiftKey, and many Office products. Their needs and customization points vary. It’s a fun engineering challenge to build a system that’s designed to work well for many different types of clients in different programming languages and that scales from simple LLM usage to more sophisticated integrations with plugins and custom middleware. Many teams benefit not only from the power of our customizable platform but also from the many Responsible AI (RAI) and security safeguards built into our system.
Are copilots/agents the automation paradigm of the AI era? How would you compare copilots with previous automation trends, such as robotic process automation (RPA), middleware platforms, and others?
Copilots help us automate many types of tasks and get our work done more quickly in a breadth of scenarios, but right now, we still often need to review their work, such as the emails or code they write. Other types of automation might be hard for an individual to configure, but once configured, they are designed to run autonomously and be trusted because their scope is limited. Another big difference between LLMs and previous automation trends is that we can now use the same model to help with many different types of tasks when given the right instructions and examples. When given the right instructions and grounding information, LLMs can often generalize to new scenarios.
There are capabilities such as planning/reasoning, memory, and action execution that are fundamental building blocks of copilots or agents. What are some of the most exciting research and technologies you have seen in these areas?
AutoGen is an interesting paradigm that adapts classical ideas, like ensembling techniques from previous eras of AI, to the new, more generalized LLMs. AutoGen can have multiple different LLMs collaborate to solve a task.
Semantic Kernel is a great tool for orchestrating LLMs and easily integrating plugins, RAG, and different models. It also works well with my favorite tool for easily running models locally: Ollama.
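As a rough illustration of how simple local experimentation can be, here is a minimal sketch that queries a model served by Ollama over its REST API. It assumes Ollama is running on its default port and that a model has already been pulled; the model name is just an example.

```typescript
// Minimal sketch: ask a local model served by Ollama for a completion.
// Assumes Ollama is running at its default address (http://localhost:11434)
// and that a model such as "phi3" (an example name) has been pulled.
async function askLocalModel(prompt: string): Promise<string> {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "phi3", prompt, stream: false }),
  });
  const body = await res.json();
  return body.response; // Ollama returns the generated text in `response`.
}

askLocalModel("Explain RAG in one sentence.").then(console.log);
```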
Here’s a somewhat controversial question: copilots/agents are typically constructed as an LLM surrounded by features like RAG, action execution, memory, etc. How much of those capabilities do you foresee becoming part of the LLM (maybe fine-tuned) themselves versus what remains external? In other words, does the model become the copilot?
It’s really helpful to have features like RAG available as external integrations for brand new scenarios and to ensure that we cite the right data. When training models, we talk about the ‘cold start’ problem: how do we get data and examples for new use cases? Very large models can absorb certain desired knowledge during training, but it’s hard to foresee what will be required in this quickly changing space. Many teams using our Copilot Platform expect to use RAG and plugins to easily integrate their stored knowledge from various sources that update often, such as content from the web based on news, or documentation that changes daily. It would be outlandish to tell them to collect lots of training data, even if it’s unlabeled or unstructured, and to fine-tune a model hourly or even more often as the world changes. We’re not ready for that yet.

Citing the right data is also important. Without RAG, current models hallucinate too much and cannot yet be trusted to cite the original source of information. With RAG, we know what information is available for the model to cite at runtime, and we include those links in the UI along with the model’s response, even if the model did not choose to cite them, because they’re helpful as references for us to learn more about a topic.
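A hypothetical sketch of the pattern described here: retrieve fresh documents at runtime, ground the prompt in them, and surface every retrieved link as a reference whether or not the model cites it. The `searchIndex` and `llm` functions are illustrative placeholders, not the platform’s actual APIs.

```typescript
// Hypothetical RAG-with-citations sketch; names are placeholders.
interface SourceDoc { title: string; url: string; snippet: string; }

async function answerWithReferences(
  question: string,
  searchIndex: (q: string) => Promise<SourceDoc[]>, // placeholder retriever
  llm: (prompt: string) => Promise<string>,         // placeholder model call
) {
  const docs = await searchIndex(question);
  const grounding = docs
    .map((d, i) => `[${i + 1}] ${d.title}: ${d.snippet}`)
    .join("\n");
  const prompt =
    `Answer using only the sources below and cite them like [1].\n` +
    `${grounding}\n\nQuestion: ${question}`;
  const answer = await llm(prompt);
  // Include every retrieved link as a reference, even if the model
  // did not cite it, so the UI can always show where the data came from.
  return { answer, references: docs.map((d) => d.url) };
}
```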
One of the biggest challenges with using models such as GPT-4 for domain-specific agents or copilots is that they are just too big and expensive. Microsoft has recently been pushing the small language model (SLM) trend with models like Orca or Phi. Are SLMs generally better suited for business copilot scenarios?
SLMs are very useful for specific use cases and can be more easily fine-tuned. The biggest caveat for SLMs is that many only work well in a few languages, whereas GPT-4 knows many languages. I’ve had a great time playing around with them using Ollama. It’s easy to experiment and build an application with a local SLM, especially while you’re more focused on traditional engineering problems and designing parts of your project. Once you’re ready to scale to many languages and meet the full array of customer needs, a more powerful model might be more useful. I think the real answer will be hybrid systems that find ways to leverage small and large models.
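One way to picture such a hybrid system: route easy requests to a local SLM and escalate harder or non-English ones to a larger hosted model. The routing heuristic and both model functions below are toy placeholders, not a description of how any Microsoft product actually routes traffic.

```typescript
// Hypothetical hybrid routing sketch; the heuristic is deliberately naive.
type Model = (prompt: string) => Promise<string>;

function makeHybrid(small: Model, large: Model): Model {
  return async (prompt: string) => {
    // Toy heuristic: short, plain-ASCII prompts go to the small local model;
    // everything else escalates to the larger, more multilingual model.
    const isSimple = prompt.length < 200 && /^[\x00-\x7F]*$/.test(prompt);
    return isSimple ? small(prompt) : large(prompt);
  };
}
```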
How important are guardrails in Microsoft’s Copilots, and how do you think about that space?
We have many important guardrails for Responsible AI (RAI) built into our Copilot Platform, from inspecting user input to verifying model output. These protections are one of the main reasons that many teams use our platform. RAI is a core part of every design review, and we standardize how RAI works for everything that goes into and comes out of our platform. We work with many teams across Microsoft to standardize what to validate and to share knowledge. We also ensure that the long prompt with special instructions, examples, and formatting is treated securely, just like code, and not exposed outside of our platform.
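To make the shape of such a pipeline concrete, here is an illustrative sketch of checking user input before the model runs and model output before it is shown. The checker functions are placeholders, not Microsoft’s actual RAI stack.

```typescript
// Illustrative guardrail pipeline; all names here are hypothetical.
type Check = (text: string) => Promise<{ ok: boolean; reason?: string }>;

async function guardedCompletion(
  userInput: string,
  inputChecks: Check[],   // e.g., jailbreak or abuse classifiers
  outputChecks: Check[],  // e.g., harmful-content or leakage classifiers
  llm: (prompt: string) => Promise<string>,
): Promise<string> {
  // Inspect the user's input before it ever reaches the model.
  for (const check of inputChecks) {
    const verdict = await check(userInput);
    if (!verdict.ok) return `Request blocked: ${verdict.reason}`;
  }
  const output = await llm(userInput);
  // Verify the model's output before showing it to the user.
  for (const check of outputChecks) {
    const verdict = await check(output);
    if (!verdict.ok) return "Response withheld by safety checks.";
  }
  return output;
}
```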
Your team was very vocal about its work on the Copilot user experience, using technologies like SignalR. What makes the UX for copilots/agents different from previous paradigms?
We built new user experiences for our copilots to integrate them into existing products, and we wrote a blog post to share some of our design choices, such as how we stream responses and how we designed the platform to work with many different types of clients in different programming languages. I also discussed some of the topics from the blog post further on a podcast. One of the biggest noticeable differences from previous assistants or agents is how an answer is streamed word by word, or token by token, as the response is generated. The largest and most powerful models can also be the slowest ones, and it can take many seconds or sometimes minutes to generate a full response with grounding data and references, so it’s important for us to start showing the user an answer as quickly as possible. We use SignalR to help us simplify streaming the answer to the client. SignalR automatically detects and chooses the best transport method among the various web standard protocols and techniques. WebSockets are used as the transport method by default for most of our applications, and we can gracefully fall back to Server-Sent Events or long polling. SignalR also simplifies bidirectional communication, such as when the application needs to send information to the service to interrupt the streaming of a response.
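A minimal sketch of a streaming SignalR client is below. The hub URL and the `StreamResponse` hub method name are assumptions for illustration; what the library genuinely provides is the transport negotiation (WebSockets by default, with fallback to Server-Sent Events or long polling) and the streaming subscription API.

```typescript
import * as signalR from "@microsoft/signalr";

// Sketch of a client that renders an answer token by token as it streams in.
const connection = new signalR.HubConnectionBuilder()
  .withUrl("https://example.com/copilotHub") // hypothetical endpoint
  .withAutomaticReconnect()
  .build();

async function streamAnswer(prompt: string, render: (partial: string) => void) {
  await connection.start();
  let answer = "";
  // "StreamResponse" is an assumed server-side hub method for this sketch.
  connection.stream<string>("StreamResponse", prompt).subscribe({
    next: (token) => render((answer += token)), // show partial answer immediately
    complete: () => render(answer),
    error: (err) => console.error(err),
  });
}

streamAnswer("Summarize today's news.", console.log);
```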
We use Adaptive Cards and Markdown to easily scale to displaying responses in multiple different applications or different programming languages. We use the new object-basin library that we built to generalize and simplify streaming pieces of JSON, so that we can modify the JSON in Adaptive Cards that have already been streamed to the application. This gives the service a lot of control over what is displayed in the applications, and the application can easily tweak how the response is formatted, for example, by changing CSS.
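A conceptual stand-in for the idea (this is not object-basin’s real API): a cursor identifies a location in JSON that has already been sent to the client, and later chunks are applied at that location, so a rendered Adaptive Card keeps updating in place.

```typescript
// Toy illustration of cursor-based JSON streaming; types are simplified.
type Card = { body: Array<{ type: string; text: string }> };

// Append a streamed text chunk to the card element the cursor points at.
function applyChunk(card: Card, cursor: number, chunk: string): Card {
  card.body[cursor].text += chunk;
  return card;
}

const card: Card = { body: [{ type: "TextBlock", text: "" }] };
for (const chunk of ["Hello", ", ", "world!"]) {
  applyChunk(card, 0, chunk); // the client re-renders the card after each update
}
```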
💥 Miscellaneous – a set of rapid-fire questions
What is your favorite area of research outside of generative AI?
Quantum Computing.
Is reasoning and planning the next big thing for LLMs and, consequently, copilots? What other research areas are you excited about?
Reasoning and planning are important for some complex scenarios beyond question answering where multiple steps are involved, such as planning a vacation or determining the phases of a project. I’m also excited about ways that we can use smaller and simpler local models securely for simple scenarios.
How far can the LLM scaling laws take us? All the way to AGI?
I’m confident that we will get far with LLMs because we’ve seen them do awesome things already. My personal observation is that we tend to make giant leaps in AI every few years and then the progress is slower and more incremental in the years between the giant leaps. I think at least one more giant leap will be required before AGI is achieved, but I’m confident that LLMs will help us make that giant leap sooner by making us more productive. Language is just one part of intelligence. Models will need to understand the qualia associated with sensory experiences to become truly intelligent.
Describe a programmer’s world with copilots in five years.
Copilots will be integrated more into the development experience, but I hope they don’t eliminate coding completely. Copilots will help us even more with our tasks, and going back to not having a copilot already feels weird and lonely to me. I like coding and feeling like I built something, but I’m happy to let a copilot take over more tedious tasks or help me discover different techniques. Copilots will help us get more done faster as they get more powerful and increase in context size to understand more of a project instead of just a couple of sections or files. Copilots will also need to become more proactive, rather than only responding when prompted. We will have to be careful to build helpful systems that do not pester us.
Who are your favorite mathematicians and computer scientists, and why?
I don’t think I can pick a specific person that I fully admire, but right now, even though we wouldn’t typically call them mathematicians, Amos Tversky and Daniel Kahneman come to mind. People have been talking more about them lately because Daniel Kahneman passed away a few months ago. Having read “The Undoing Project” and “Thinking, Fast and Slow” a few years ago, I still think about them, System 1 vs. System 2 thinking, and slowing down to apply logic, a deep kind of mathematics.