
Implementing response streaming from LLMs

Introduction

Large Language Models (LLMs) like OpenAI's GPT and Google's Gemini are AI models trained to generate human-like text. These models are the foundation for many chat applications, enabling users to have conversations with AI in a natural and interactive way.

Recently, we built an AI chat app that allows users to compare responses from multiple LLMs concurrently. This project was developed in React.js using the Next.js framework with TypeScript. It was purely a client-side app, which means AI-related API calls are sent directly from the client's browser. In this post, I’ll walk you through how the app integrates with LLMs like OpenAI and Gemini. I'll also cover how to handle streaming responses to improve the chat experience.

Handling OpenAI API calls without real-time streaming

First, let's look at how the chat feature was implemented using the OpenAI API. There is an OpenAI API endpoint called chat/completions, which accepts a user's prompt and responds with an answer. For TypeScript/JavaScript, OpenAI provides an official NPM package called openai, which offers convenient access to the OpenAI REST API.

You can install it via:

npm install openai

To build the chat app, we started with the following code:

import OpenAI from "openai";

export const chatCompletion = async ({
  message,
  apiKey,
}: { message: string, apiKey: string }) => {

  const client = new OpenAI({
    apiKey,
    dangerouslyAllowBrowser: true,
  });

  try {
    const completion = await client.chat.completions.create({
      model: "gpt-4o-mini",
      messages: [
        {
          role: "user",
          content: message,
        },
      ],
    });

    return completion.choices[0].message.content;
  } catch (error) {
    console.error(error);
    throw new Error("API error");
  }
};

Our chatCompletion function accepts a parameter called message, which is the prompt the user enters in our chat app. Since the API call is made from the client side, we have to set the option dangerouslyAllowBrowser: true when configuring the OpenAI client. We then call await client.chat.completions.create, which sends the user's prompt to OpenAI and waits for the full response. The response content is returned to the caller.
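
For context, here is a rough sketch of how this function can be wired up on the UI side (the handleSend handler, the response state, and the "./chatCompletion" import path are our own illustrative choices, not something the OpenAI SDK prescribes):

import { useState } from "react";
import { chatCompletion } from "./chatCompletion";

...
const [response, setResponse] = useState("");

// Hypothetical send handler: it forwards the user's prompt to chatCompletion
// and stores the full answer in state for display.
const handleSend = async (message: string, apiKey: string) => {
  try {
    const answer = await chatCompletion({ message, apiKey });
    setResponse(answer ?? "");
  } catch (error) {
    console.error(error);
  }
};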

Enhancing the user experience with real-time response streaming

Sometimes OpenAI takes a significant amount of time to respond, especially when the expected answer is long. As a result, the user is left waiting until the full response arrives. To improve the user experience, we explored a streaming approach where partial responses are updated on the screen progressively.

Here is our code snippet using the API in streaming mode:

export const chatStream = async ({
  message,
  apiKey,
  onStream,
}: { message: string, apiKey: string, onStream: (chunk: any) => void }) => {

  const client = new OpenAI({
    apiKey,
    dangerouslyAllowBrowser: true,
  });

  try {
    const stream = await client.chat.completions.create({
      model: "gpt-4o-mini",
      messages: [
        {
          role: "user",
          content: message,
        },
      ],
      stream: true,                             // Option to enable streaming
      stream_options: { include_usage: true },  // Include usage data in the streaming response
    });

    for await (const chunk of stream) {
      onStream(chunk);
    }
  } catch (error) {
    console.error(error);
    throw new Error("API error");
  }
};

To enable streaming, we need to set the option stream: true when calling client.chat.completions.create. By default, in streaming mode, usage data, including the number of tokens used, is excluded from the responses. To include the usage data in the response, you can set the option stream_options: { include_usage: true }. You can then obtain the token usage data in the last chunk response.
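
For example, the onStream callback passed to chatStream can pick the usage data out of that final chunk (a minimal sketch, with the logging purely illustrative; note that this final chunk arrives with an empty choices array):

onStream: (chunk) => {
  // The final chunk sent when include_usage is enabled has an empty
  // choices array and a populated usage object with token counts.
  if (chunk.usage) {
    console.log("Prompt tokens:", chunk.usage.prompt_tokens);
    console.log("Completion tokens:", chunk.usage.completion_tokens);
  }
},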

Our chatStream function accepts a parameter called onStream. This is a callback function that needs to be defined by the caller to process the chunk responses for display. Below is an example of how we define the onStream callback:

import { useState } from "react";

const [response, setResponse] = useState("");
...

try {
  await chatStream({
    message: message,
    apiKey,
    onStream: (chunk) => {
      const chunkContent = chunk.choices[0]?.delta?.content || "";
      setResponse((prev) => prev + chunkContent);
    },
  });
} catch (error) {
  console.error(error);
}

We use a React state to represent the response: const [response, setResponse] = useState("");. When a streaming response arrives, we append it to the previous response and set it to the React state: setResponse((prev) => prev + chunkContent);. Our React component then uses the response state to display the response to the user. With this approach, we can display the response progressively on the screen without waiting for the full response from the API call.
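
On top of this, the call site can also track whether a stream is still in progress so the UI can, for instance, disable the input field while the answer is being generated (a sketch; the isStreaming flag and handleSend handler are our own additions, not part of the OpenAI SDK):

const [response, setResponse] = useState("");
const [isStreaming, setIsStreaming] = useState(false);

const handleSend = async (message: string, apiKey: string) => {
  setResponse("");       // clear the previous answer
  setIsStreaming(true);  // e.g. used to disable the input field
  try {
    await chatStream({
      message,
      apiKey,
      onStream: (chunk) => {
        const chunkContent = chunk.choices[0]?.delta?.content || "";
        setResponse((prev) => prev + chunkContent);
      },
    });
  } catch (error) {
    console.error(error);
  } finally {
    setIsStreaming(false);
  }
};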

Setting up response streaming for Gemini API

Next, we'll show how to implement a similar streaming approach with a Gemini API call. We use @google/generative-ai, which is the official Node.js/TypeScript library for the Google Gemini API.

You can install it via:

npm install @google/generative-ai

Here’s our code snippet for making Gemini API calls:

import { GoogleGenerativeAI } from "@google/generative-ai";
import { useState } from "react";

...
const [response, setResponse] = useState("");

const genAI = new GoogleGenerativeAI(apiKey);
const model = genAI.getGenerativeModel({
  model: 'gemini-1.5-flash',
});

const chat = model.startChat();

const result = await chat.sendMessageStream(message);

for await (const chunk of result.stream) {
  const chunkContent = chunk.text();
  setResponse((prev) => prev + chunkContent);
}

model.startChat() returns a chat session that can be reused for multi-turn conversations. We then send the user's prompt to the API via chat.sendMessageStream(message), and the response comes back as a stream. The chunks are processed in a loop, where each new piece of text is appended to the previous one and stored in a React state: setResponse((prev) => prev + chunkContent);. The UI displays the text using the response state.
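
As a side note, the object returned by sendMessageStream also exposes a response promise that resolves to the full aggregated reply once streaming has finished, which is convenient for storing the completed message afterwards (a small sketch; saveToHistory is a purely hypothetical helper):

const result = await chat.sendMessageStream(message);

for await (const chunk of result.stream) {
  setResponse((prev) => prev + chunk.text());
}

// Resolves once every chunk has been received.
const fullReply = await result.response;
saveToHistory(fullReply.text());  // saveToHistory is a hypothetical helper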

Rendering the streamed response

Since React re-renders a component on every state change, we can simply output the response and let React handle the DOM updates. In our app, the parent component holds the response state and passes it down to a MessageBubble component as its message prop:

import { useState } from "react";

...
const [response, setResponse] = useState("");

...
// The streamed text is passed down to the bubble as its message prop.
<MessageBubble message={response} />

...
export function MessageBubble({ message }: { message: string }) {
...
  return <p>{message}</p>;
}

One useful detail of both OpenAI's and Gemini's responses is that they are written in Markdown (a lightweight markup language that formats text without the verbose tags of HTML or XML). This means we can further improve the user experience by converting the Markdown response into proper HTML, which allows displaying (among many other things) code blocks, lists, emphasized text, and headings.

We can therefore wrap our response with a component that renders Markdown as HTML; in our case we used the react-markdown package together with highlight.js for syntax highlighting of code blocks.

import { ClassAttributes, HTMLAttributes } from "react";
import ReactMarkdown, { ExtraProps } from "react-markdown";
import hljs from "highlight.js";

// Props that react-markdown passes to a custom `code` component.
type ReactMarkdownComponentProps = ClassAttributes<HTMLElement> &
  HTMLAttributes<HTMLElement> &
  ExtraProps;

...
function highlightedCode(props: ReactMarkdownComponentProps) {
  // `node` is pulled out so that it is not spread onto the DOM element.
  const { children, className, node, ...rest } = props;
  // Fenced code blocks get a className like "language-ts" from react-markdown.
  const match = /language-(\w+)/.exec(className || "");
  const chosenLanguage = match ? match[1] : "plaintext";
  // Fall back to plaintext if highlight.js does not know the language.
  const detectedLanguage = hljs.getLanguage(chosenLanguage);
  const language = detectedLanguage?.scope?.toString() ?? "plaintext";
  const highlighted = hljs.highlight(String(children), {
    language,
    ignoreIllegals: true,
  });

  return (
    <code
      {...rest}
      className={`${className ?? "language-plaintext font-bold"} hljs my-4`}
      dangerouslySetInnerHTML={{ __html: highlighted.value }}
    ></code>
  );
}

...

export function MessageBubble({ message }: { message: string }) {

...
  return (
    <ReactMarkdown
      components={{
        code(props) {
          return highlightedCode(props);
        },
      }}
    >
      {message}
    </ReactMarkdown>
  );
}

The highlightedCode function runs highlight.js on the blocks that ReactMarkdown identifies as code blocks. We need dangerouslySetInnerHTML because highlight.js returns HTML markup, and we want to inject it as-is without escaping.
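
One more detail: for the highlighted code to be visibly styled, highlight.js also needs one of its theme stylesheets imported somewhere in the app, for example (the github theme is an arbitrary choice):

import "highlight.js/styles/github.css";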

Challenges and Considerations

This all works well, but we noticed a performance issue: as conversations grew longer, the whole UI became slow and unresponsive.

We realized this was because, as the amount of text to be converted from Markdown grew, the cost of each component re-render increased.

Our solution was to use useMemo to cache the converted Markdown so that it is only recomputed when the message changes, and not on every re-render triggered by other state changes.

const highlightedMarkdownText = useMemo(() => {
  return (
    <ReactMarkdown
      components={{
        code(props) {
          return highlightedCode(props);
        },
      }}
    >
      {message}
    </ReactMarkdown>
  );
}, [message]);
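
Putting it together, the memoized element is simply returned from the bubble component (a sketch; the wrapper markup is illustrative, and the imports and highlightedCode function from the earlier snippets are assumed to be in scope):

export function MessageBubble({ message }: { message: string }) {
  // Recomputed only when `message` changes, not on every re-render.
  const highlightedMarkdownText = useMemo(() => {
    return (
      <ReactMarkdown
        components={{
          code(props) {
            return highlightedCode(props);
          },
        }}
      >
        {message}
      </ReactMarkdown>
    );
  }, [message]);

  return <div>{highlightedMarkdownText}</div>;
}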

Conclusion

In this article, I showcased how to implement streaming responses from both the OpenAI and Gemini APIs. By using the streaming approach, we can significantly enhance the user experience in a chat application powered by LLMs. We also addressed a performance issue where the UI slows down as the conversation grows, and proposed a fix using useMemo.

Need help building your product?

Reach out to us by filling out the form on our contact page. If you need an NDA, just let us know, and we’ll gladly provide one!
