Introduction
Nowadays you can't seem to escape the mention of AI anywhere online: from Microsoft shoehorning Copilot buttons into laptops and keyboards, to CPU manufacturers jamming NPUs into otherwise cramped silicon. Even something as benign as thermal paste has failed to escape the AI-fication of products and services.
In all seriousness though, it really is only a matter of time before we software developers are called on to work on products that require some sort of AI for their features. Just like git became the de facto version control system over the last decade, we're seeing the rise of AI (along with the alphabet soup of ML, LLM, ANN, CV, NLP, etc.) reshape how we write code, do business, and interact with services.
One of the biggest hurdles with AI (specifically, LLM) development, however, is cost: unlike storage or server resources whose costs are generally predictable, costs associated with LLMs are primarily charged per token, which can vary widely (and wildly) depending on the query. Sometimes the variation depends not just on what the query is, but on who is doing the querying (e.g. users who don't just pose questions but ask to generate entire articles). Multiply this cost uncertainty by the iterative nature of AI/LLM development, and we've got ourselves a recipe for ballooning expenses and blown budgets.
We experienced this firsthand on an internal project, where we quickly blew through our (admittedly intentionally small) budget while dealing with auto-scroll and its various edge cases. We wanted the UI to scroll automatically while the AI model responds, but stop scrolling once the user manually scrolls the viewport (presumably to read or refer back to earlier paragraphs). Since this particular interaction requires a lengthy streaming response, we decided to iterate on the solution by interfacing directly with the API rather than rigging up a mock stream and spending time debugging the mock.
We did have to adjust the budget to complete the feature (we just added a couple more dollars to our account), and while this was a calculated decision for us, I can easily see development teams being caught flat-footed in similar situations. Our project was small in scope, but the costs can be substantial in a proportionately larger one, especially when dealing with more complex features and longer iteration cycles.
I've hinted at it previously, but one way to keep costs under control is to abstract out certain areas of the pipeline (e.g. creating a mock stream for iterating on the auto-scroll feature) so that expenses are focused on work that necessarily requires full integration with paid-for APIs (e.g. fine-tuning prompts geared towards a specific LLM such as Google's Gemini or OpenAI's ChatGPT). That way the developer can test, iterate, and improve workflow code cheaply, and only start accumulating expenses closer to production or release.
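To make that concrete, here is a minimal sketch of what such a mock stream might look like in Ruby. The class, constants, and chunk sizes are hypothetical; the point is that the callback shape (an object responding to chat_completion) mirrors the streaming interface of the LLM client used later in this article, so UI behaviour like auto-scroll can be exercised without touching a paid API.
# A hypothetical stand-in for a streaming LLM client. It yields canned text
# in small chunks with a short delay, mimicking how tokens arrive from a
# real provider, so frontend behaviour (auto-scroll, spinners, partial
# rendering) can be iterated on at no cost.
class MockLLMStream
  # Mimics the response object whose #chat_completion we read on each chunk.
  Chunk = Struct.new(:chat_completion)

  CANNED_TEXT = "This is a canned response used only to exercise the UI. " * 20

  def chat(messages:)
    CANNED_TEXT.scan(/.{1,12}/m).each do |piece|
      sleep(0.05)            # simulate network / token latency
      yield Chunk.new(piece) # same callback shape as the real client
    end
  end
end

# Usage: swap the real client for the mock while iterating on UI behaviour.
MockLLMStream.new.chat(messages: [{ role: "user", content: "ignored" }]) do |r|
  print r.chat_completion # the real app would broadcast this over a websocket
end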
To evaluate this very issue, we developed a small proof-of-concept application in Ruby on Rails to explore both the benefits and the downsides of this approach to controlling expenses. I'll also recommend some practical tips to help development teams avoid unnecessary costs and explore some ways to mitigate them.
Use the Machines you Already Have to run Local AI (e.g. Ollama)
A really simple way to save on costs associated with working with AI/LLM models is to run the models yourself, locally, on your own machine. We've reached a point where running local AI is more accessible than ever before; work that used to take specialized hardware and expertise available only to large research labs and universities can now be done on a device as common as a gaming laptop.
One such way to run models locally is through Ollama. It is an open source platform designed to make it easy to run large language models on your local machine.
I run Ubuntu on my development machine so my examples will be from that perspective, but the Ollama installation instructions should cover other setups as well.
curl -fsSL https://ollama.com/install.sh | sh
This will install Ollama on your local machine and (at least on my Ubuntu install) set it up to run as a daemon process managed via systemctl.
One big downside of running LLMs locally is that you need relatively modern hardware to get good performance. You don't necessarily need the latest and greatest computer, but as a rule of thumb, the more powerful your system, the better the LLM performance will be.
In my case I was able to repurpose my old system for this task:
- 11th gen Intel Core i7-11700 (8 cores, 16 threads)
- NVIDIA RTX 2060 Super (8GB GDDR6)
- 128GB DDR4 RAM
Note that an LLM's performance on your system correlates somewhat with its parameter count. For example, the gemma2 model (which, by the way, is built from the same research and technology behind Google's commercial offering, Gemini) is available in three sizes: 2B, 9B, and 27B. I can comfortably and quickly run the 2B version on this system, but struggle to run the 9B and 27B versions; they still run, but the server offloads the spillover data that doesn't fit in GPU memory onto the much slower system RAM, which hurts performance.
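As a rough way to predict whether a model will fit, you can multiply the parameter count by the bytes each parameter takes at a given quantization level and add some headroom. This is my own back-of-the-envelope heuristic, not Ollama's actual allocation logic (real figures also include the tokenizer, KV cache, and runtime buffers, which is why the numbers reported later in this article are a bit higher), but it explains why the 27B model can't stay inside 8GB of VRAM:
# Ballpark memory estimate: parameters x bytes per parameter at the given
# quantization, with ~20% headroom for the KV cache and runtime overhead.
def approx_model_gb(params_billions, bits_per_param, overhead: 1.2)
  bytes = params_billions * 1_000_000_000 * bits_per_param / 8.0
  (bytes * overhead) / 1024**3
end

puts approx_model_gb(2, 4).round(1)   # ~1.1 GB -> fits easily in 8GB of VRAM
puts approx_model_gb(9, 4).round(1)   # ~5.0 GB -> tight, but mostly fits
puts approx_model_gb(27, 4).round(1)  # ~15.1 GB -> spills into system RAM on an 8GB GPU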
In any case, with Ollama installed you can now run it and check its help message:
$ ollama
Usage:
ollama [flags]
ollama [command]
Available Commands:
serve Start ollama
create Create a model from a Modelfile
show Show information for a model
run Run a model
pull Pull a model from a registry
push Push a model to a registry
list List models
ps List running models
cp Copy a model
rm Remove a model
help Help about any command
Flags:
-h, --help help for ollama
-v, --version Show version information
Use "ollama [command] --help" for more information about a command.
I like to run the gemma2:2b model since it has the best performance on my system and lets me iterate quickly on my application without waiting too long for the LLM to respond.
ollama run gemma2:2b
pulling manifest
pulling 7462734796d6... 100% ▕████████████████▏ 1.6 GB
pulling e0a42594d802... 100% ▕████████████████▏ 358 B
pulling 097a36493f71... 100% ▕████████████████▏ 8.4 KB
pulling 2490e7468436... 100% ▕████████████████▏ 65 B
pulling e18ad7af7efb... 100% ▕████████████████▏ 487 B
verifying sha256 digest
writing manifest
removing any unused layers
success
>>> Send a message (/? for help)
This will pull the model from the Ollama servers, install it in /usr/share/ollama/.ollama/models, and give you an interactive console to try things out:
>>> who are you?
I am Gemma, an AI assistant. 😊
Think of me as a friendly and helpful computer program that can understand
your questions and generate human-like text in response. I'm still under
development, but I can already do some pretty cool things like:
* **Answer your questions:** Whether you need information or just want to
chat, I can help!
* **Write different kinds of creative content:** I can write stories,
poems, summaries, and even code in different languages.
* **Help with brainstorming:** If you're stuck on a project, I can offer
ideas and suggestions.
How can I help you today? 😊
As I mentioned previously, system specs play a big role in the performance of an LLM. The reason the 2b version of gemma2 runs well on my system is that the whole model fits neatly inside the GPU's RAM.
If the model does not fit entirely inside the GPU's RAM, part of it gets offloaded to the relatively slower system memory. This can be seen clearly when running ollama ps:
$ ollama ps
NAME ID SIZE PROCESSOR UNTIL
gemma2:27b 53261bc9c192 18 GB 63%/37% CPU/GPU 4 minutes from now
The 27b version of gemma2 requires 18GB of RAM, which doesn't fully fit in the 2060 Super's 8GB of VRAM, so part of the model is offloaded to system RAM (marked here as CPU).
Here's how it looks with the 9b version:
$ ollama ps
NAME ID SIZE PROCESSOR UNTIL
gemma2:latest ff02c3702f32 8.0 GB 11%/89% CPU/GPU 4 minutes from now
Now that we've seen how to run an LLM locally and save on expenses by using the hardware we already have, let's take a look at how to leverage it.
Use an Abstraction Framework (e.g. LangChain)
Running a local LLM with Ollama is all well and good, but writing your code to directly interface with Ollama's REST API would cause rework once you're ready to start using commercial models such as Gemini or OpenAI's offerings. You could write your own abstraction layer so you can easily switch LLM providers, but why reinvent the wheel?
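For context, talking to Ollama directly looks something like the sketch below, using the Faraday HTTP client against Ollama's /api/chat endpoint. The response fields follow Ollama's API documentation, but treat this as an illustrative snippet rather than production code; every provider exposes a slightly different request and response shape like this, which is exactly the kind of detail you don't want scattered throughout your application.
require "faraday"
require "json"

# Calling Ollama's REST API directly. The payload and response shape are
# provider-specific, which is the coupling an abstraction layer saves you
# from rewriting when you switch to a hosted model later.
response = Faraday.post(
  "http://localhost:11434/api/chat",
  {
    model: "gemma2:2b",
    messages: [{ role: "user", content: "Summarise this meeting in one line." }],
    stream: false # ask for a single JSON response instead of a stream
  }.to_json,
  "Content-Type" => "application/json"
)

body = JSON.parse(response.body)
puts body.dig("message", "content") # the assistant's reply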
LangChain is a framework often used to abstract the components that come up when building LLM-powered applications, such as vector databases for storing embeddings and performing vector search. Using such components, you can easily interface with different LLM providers, and you can also connect (or "chain") various data transformations and outputs into your pipeline.
Covering all the benefits LangChain affords us is a topic of its own, so we'll limit our discussion to the basic features needed to get our example working. In this case, that's the ability to switch between LLMs: from local AI to something hosted like OpenAI or Gemini.
The original LangChain framework has official support for Python and JavaScript/TypeScript, but there are also unofficial implementations that take the spirit of LangChain and bring it to previously unsupported languages. Since the language I am most familiar with is Ruby, I was glad to find such an implementation in langchain.rb, which (conveniently) "wraps LLMs in a unified interface allowing you to easily swap out and test out different models."
As a sidenote, I tried out a way to develop a Ruby on Rails application in a single file (in the spirit of https://github.com/rails/rails/tree/main/guides/bug_report_templates) and found it an interesting way to quickly bootstrap a working application for tutorials. So that's what I did, and I think it makes writing articles like this a lot easier because all of the necessary pieces are present and there are no extraneous files or folders to distract from the point.
Preparing and Seeding the Dataset for the Meeting Assistant
To show off the langchainrb framework, I decided to use a dataset called MeetingBank, which is a "... dataset created from the city councils of 6 major U.S. cities ... [and] contains 1,366 meetings with over 3,579 hours of video, as well as transcripts ...." The sheer amount of data in this dataset also highlights the cost issues that arise when developing with AI: having this much data to iterate on can be very costly when using paid-for hosted services.
I've already downloaded and stored the whole dataset which you can look at here: https://github.com/Hivekind/docschat/blob/main/train.json
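For reference, each line of train.json is a standalone JSON object; the fields the seeding script below relies on are summary, uid, and transcript, with the uid encoding the council unit and the meeting date. The record below uses made-up placeholder values purely to show the shape; it is not an actual entry from the dataset:
{"uid": "ExampleCityCouncil_01062020_item1", "summary": "Council discusses the example ordinance.", "transcript": "Thank you. The motion carries. Moving on to the next item on the agenda ..."}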
To start off, here's the Gemfile and the seeds.rb file I made to quickly generate summaries and action items for each meeting transcript:
Gemfile
source "https://rubygems.org"
gem "rails", "~> 7.1"
gem "pg", "~> 1.1"
gem "faraday", "~> 2.11"
gem "uri", "~> 0.13"
gem "redis", "~> 4.2.5"
gem "puma", "~> 5.2.2"
gem "dotenv"
gem "langchainrb", "~> 0.15"
gem "ruby-progressbar"
seeds.rb
# frozen_string_literal: true
require 'rails/all'
require "dotenv"
require "langchain"
require "faraday"
require 'ruby-progressbar'
Dotenv.load
ENV["DATABASE_URL"] ||= "postgres://#{ENV['DB_USER']}:#{ENV['DB_PASS']}@#{ENV['DB_HOST']}:#{ENV['DB_PORT']}/#{ENV['DB_NAME']}?schema=public"
ActiveRecord::Base.establish_connection
# ActiveRecord::Base.logger = Logger.new(STDOUT)
ActiveRecord::Schema.define do
enable_extension "plpgsql"
create_table :meetings, if_not_exists: true do |t|
t.string :topic
t.text :entry
t.string :unit
t.date :date
t.string :uid
t.text :ai_summary
t.text :ai_action_items
t.timestamps
end
end
Langchain.logger.level = :warn
class Logger
end
model = "gemma2:2b"
llm = Langchain::LLM::Ollama.new(url: "http://localhost:11434", default_options: {
chat_completion_model_name: model,
completion_model_name: model,
embeddings_model_name: model,
})
def prompt_summary(text)
"
Write an accurate summary of the following TEXT. Do not include the word summary, just provide the summary.
TEXT: #{text}
CONCISE SUMMARY:
"
end
def prompt_action_items(text)
"
Compile a list of action items or concerns, with their owners or speakers, for the following TEXT. For every key point, mention the likely owner. Infer as much as you can. If the transcript does not explicitly list owners, try to infer the likely owners. It is very important to infer the likely owners of each action item. Do not show an action item if you are unable to match it with an owner. I don't want a summary. Try your very best. Please respond in a formal manner.
TEXT: #{text}
ACTION ITEMS WITH THEIR OWNERS:
"
end
class Meeting < ActiveRecord::Base
end
line_count = `wc -l "train.json"`.strip.split(' ')[0].to_i
File.open("train.json", "r") do |f|
puts "Processing #{line_count} entries"
progressbar = ProgressBar.create(total: line_count, format: '%a <%B> %p%% %t Processed: %c from %C %e')
f.each_line do |line|
# topic group date entry
json = JSON.parse(line)
topic = json["summary"]
uid = json["uid"]
unless (m = Meeting.find_by(uid: uid)).nil?
puts "skipping #{uid} #{m.id}"
progressbar.total -= 1
next
end
transcript = json["transcript"]
unit_date = uid.split("_")
puts "Generating ai summary"
ai_summary = ""
llm.chat(messages: [{role: "user", content: prompt_summary(transcript)}]) do |r|
resp = r.chat_completion
ai_summary += "#{resp}"
print resp
end
puts "Generating ai action items"
ai_action_items = ""
llm.chat(messages: [{role: "user", content: prompt_action_items(transcript)}]) do |r|
resp = r.chat_completion
ai_action_items += "#{resp}"
print resp
end
puts
Meeting.create!(
topic: topic,
entry: transcript,
unit: unit_date[0],
date: Date.strptime(unit_date[1], "%m%d%Y"),
uid: uid,
ai_summary: ai_summary,
ai_action_items: ai_action_items,
)
puts "Processed #{json["uid"]}"
puts
progressbar.increment
puts
end
end
Just spin up a Postgres database, fill in the env vars, make sure Ollama is working and has already downloaded the gemma2:2b model, and then run ruby seeds.rb to generate meeting summaries and action items for the dataset.
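The environment variable names below come from the DATABASE_URL interpolation in seeds.rb (and app.rb later); the values are placeholders for whatever your local Postgres uses, so adjust them in a .env file next to the scripts:
# .env -- placeholder values, adjust to your local Postgres setup
DB_USER=postgres
DB_PASS=postgres
DB_HOST=localhost
DB_PORT=5432
DB_NAME=docschat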
I'll bring your attention to these lines:
model = "gemma2:2b"
llm = Langchain::LLM::Ollama.new(url: "http://localhost:11434", default_options: {
chat_completion_model_name: model,
completion_model_name: model,
embeddings_model_name: model,
})
We're using Ollama with the gemma2:2b model in this example, but as we can see from https://github.com/patterns-ai-core/langchainrb?tab=readme-ov-file#supported-llms-and-features, it is quite simple to switch "engines" as needed.
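For example, swapping the local model for a hosted one is mostly a matter of constructing a different client. This is a sketch based on langchainrb's README; the exact constructor options (and the model name used here) may differ between gem versions, so check the documentation for the version you install:
# Local model via Ollama (what this article uses)
llm = Langchain::LLM::Ollama.new(
  url: "http://localhost:11434",
  default_options: { chat_completion_model_name: "gemma2:2b" }
)

# Hosted model via OpenAI -- same interface, different construction.
# (Option names follow langchainrb's README and may vary by gem version.)
llm = Langchain::LLM::OpenAI.new(
  api_key: ENV["OPENAI_API_KEY"],
  default_options: { chat_completion_model_name: "gpt-4o-mini" }
)

# The rest of the code keeps calling llm.chat(messages: ...) unchanged.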
I ran the seeding on my development machine and it took around 9 hours in total to complete. I've exported the seeds I generated into an SQL file here, which you can import into your own database in case you want to follow along but don't want to spend the time seeding it yourself.
Running and Testing the Ruby Application
Now that we have a local test database with some seed data, let's make use of it.
Here's the single-file Rails application, which you can run with ruby app.rb:
app.rb
# frozen_string_literal: true
require "action_controller/railtie"
require "rails/command"
require "rails/commands/server/server_command"
require "rails/all"
require "dotenv"
require "langchain"
require "faraday"
require "json"
Dotenv.load
ENV[
"DATABASE_URL"
] ||= "postgres://#{ENV["DB_USER"]}:#{ENV["DB_PASS"]}@#{ENV["DB_HOST"]}:#{ENV["DB_PORT"]}/#{ENV["DB_NAME"]}?schema=public"
ActiveRecord::Base.establish_connection
class Meeting < ActiveRecord::Base
end
model = "gemma2:2b"
LLM =
Langchain::LLM::Ollama.new(
url: "http://localhost:11434",
default_options: {
chat_completion_model_name: model,
completion_model_name: model,
embeddings_model_name: model
}
)
CHAT_HISTORY = Hash.new { |h, k| h[k] = [] }
module ApplicationCable
class Connection < ActionCable::Connection::Base
identified_by :uuid
def connect
self.uuid = SecureRandom.urlsafe_base64
end
end
class Channel < ActionCable::Channel::Base
end
end
class MessagesChannel < ApplicationCable::Channel
def subscribed
Rails.logger.info("User subscribed to MessagesChannel #{params}")
stream_from "#{params[:room]}"
end
def unsubscribed
Rails.logger.info("User unsubscribed from MessagesChannel")
end
def chat_message(data)
Rails.logger.info("User sent a message #{params} #{data}")
meeting = Meeting.find(data["meeting_id"])
history = CHAT_HISTORY[data["meeting_id"]]
# prime the AI with the meeting entry
history << { role: "user", content: meeting.entry } if history.length == 0
message = data["message"]
history << { role: "user", content: message } if message.present?
Rails.logger.info("Chat history #{history}")
ai_response = ""
LLM.chat(messages: history) do |r|
resp = r.chat_completion
ai_response += "#{resp}"
ActionCable.server.broadcast(
"meeting",
{ type: "partial", content: resp }
)
end
ActionCable.server.broadcast(
"meeting",
{ type: "full", content: ai_response }
)
history << { role: "assistant", content: ai_response }
end
end
class ApplicationController < ActionController::Base
end
class RootController < ApplicationController
def index
render inline: <<~HTML.strip
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<title>DocsChat</title>
<script type="importmap">
{
"imports": {
"react": "https://esm.sh/react",
"react-dom": "https://esm.sh/react-dom",
"react/jsx-runtime": "https://esm.sh/react/jsx-runtime",
"@mui/material": "https://esm.sh/@mui/material",
"@mui/icons-material": "https://esm.sh/@mui/icons-material",
"@rails/actioncable": "https://esm.sh/@rails/actioncable",
"@tanstack/react-query": "https://esm.sh/@tanstack/react-query",
"react-window": "https://esm.sh/react-window",
"react-virtualized-auto-sizer": "https://esm.sh/react-virtualized-auto-sizer",
"marked": "https://esm.sh/marked"
}
}
</script>
<script src="https://unpkg.com/@babel/standalone/babel.min.js"></script>
<script src="https://cdn.jsdelivr.net/npm/marked/marked.min.js"></script>
<script src="https://cdn.tailwindcss.com?plugins=forms,typography,aspect-ratio,container-queries"></script>
<link rel="stylesheet" href="https://fonts.googleapis.com/icon?family=Material+Icons">
</head>
<body>
<div id="root"></div>
<script type="text/babel" data-presets="react" data-type="module" src="/index.js">
</script>
</body>
</html>
HTML
end
end
class MeetingsController < ApplicationController
def index
meetings =
Meeting
.all
.order(date: :asc)
.pluck(:id, :date, :unit, :topic)
.map do |id, date, unit, topic|
{ id: id, date: date, unit: unit, topic: topic }
end
render json: meetings
end
def show
meeting = Meeting.find(params[:id])
render json: {
aiSummary: meeting.ai_summary,
aiActionItems: meeting.ai_action_items,
entry: meeting.entry,
date: meeting.date,
unit: meeting.unit,
id: meeting.id
}
end
end
class DocsApp < Rails::Application
config.root = __dir__
config.action_controller.perform_caching = true
config.consider_all_requests_local = true
config.public_file_server.enabled = true
config.secret_key_base = "change_me"
config.eager_load = false
config.session_store :cache_store
routes.draw do
mount ActionCable.server => "/cable"
resources :meetings, only: %i[index show]
root "root#index"
end
end
ActionCable.server.config.cable = {
"adapter" => "redis",
"url" => "redis://localhost:6379/1"
}
Rails.cache =
ActiveSupport::Cache::RedisCacheStore.new(url: "redis://localhost:6379/1")
Rails.logger =
ActionCable.server.config.logger =
ActiveRecord::Base.logger = Logger.new(STDOUT)
Rails::Server.new(app: DocsApp, Host: "0.0.0.0", Port: 3000).start
Aside from the database dependency, you'll also need Redis (which is required for the websockets feature).
You can probably surmise this from the code above, but here are the highlights of the application:
- it uses React.js for the UI and marked for transforming markdown text (which the LLM uses for its output)
- it uses ActionCable and websockets to stream the LLM response to the frontend
- chat history (for every meeting transcript) is passed on to the LLM to provide further context
And here's the React application (you'll need to put it in the public folder, in accordance with Rails conventions). Similar to the single-file Rails application, this React application uses import maps and JavaScript modules, allowing imports from URLs. This means you don't need a build system (e.g. webpack, bun, etc.) in order to run the application; everything runs within the browser (albeit more slowly, since none of the optimization or minification steps commonly done during JavaScript builds are present):
public/index.js
import React, { useState, useEffect, useRef } from "react";
import { createConsumer } from "@rails/actioncable";
import {
useQuery,
QueryClient,
QueryClientProvider,
} from "@tanstack/react-query";
import { marked } from "marked";
import {
List,
ListItemButton,
ListItemAvatar,
Avatar,
ListItemText,
Icon,
Tooltip,
TextField,
Button,
Accordion,
AccordionSummary,
AccordionDetails,
Typography,
} from "@mui/material";
import { FixedSizeList } from "react-window";
import AutoSizer from "react-virtualized-auto-sizer";
const queryClient = new QueryClient();
const consumer = createConsumer();
function ArrowUpDownIcon() {
return (
<svg
className="h-6 w-6 rotate-0 transform text-gray-400 group-open:rotate-180"
xmlns="http://www.w3.org/2000/svg"
fill="none"
viewBox="0 0 24 24"
stroke-width="2"
stroke="currentColor"
aria-hidden="true"
>
<path
stroke-linecap="round"
stroke-linejoin="round"
d="M19 9l-7 7-7-7"
></path>
</svg>
);
}
function MeetingItem(props) {
const { currentMeetingId } = props;
const { isPending, error, data } = useQuery({
queryKey: ["meetings", currentMeetingId],
queryFn: () =>
fetch("/meetings/" + currentMeetingId).then((res) => res.json()),
});
if (isPending) return "Loading...";
if (error) return "An error has occurred: " + error.message;
const { aiSummary, aiActionItems, entry, date, unit, id } = data;
return (
<>
<Accordion defaultExpanded>
<AccordionSummary expandIcon={<ArrowUpDownIcon />}>
<Typography>AI Summary</Typography>
</AccordionSummary>
<AccordionDetails>
<div
className="prose"
dangerouslySetInnerHTML={{
__html: marked.parse(aiSummary),
}}
></div>
</AccordionDetails>
</Accordion>
<Accordion>
<AccordionSummary expandIcon={<ArrowUpDownIcon />}>
<Typography>AI Action Items</Typography>
</AccordionSummary>
<AccordionDetails>
<div
className="prose"
dangerouslySetInnerHTML={{
__html: marked.parse(aiActionItems),
}}
></div>
</AccordionDetails>
</Accordion>
<Accordion>
<AccordionSummary expandIcon={<ArrowUpDownIcon />}>
<Typography>Full (raw) Transcript </Typography>
</AccordionSummary>
<AccordionDetails>
<div
className="prose"
dangerouslySetInnerHTML={{
__html: marked.parse(entry),
}}
></div>
</AccordionDetails>
</Accordion>
</>
);
}
function MeetingList({ currentMeetingId, setCurrentMeetingId }) {
const { isPending, error, data } = useQuery({
queryKey: ["meetings"],
queryFn: () => fetch("/meetings").then((res) => res.json()),
});
if (isPending) return "Loading...";
if (error) return "An error has occurred: " + error.message;
const renderRow = ({ index, data, style }) => {
const meeting = data[index];
const { id, date, unit, topic } = meeting;
return (
<Tooltip
title={topic}
placement="right-end"
arrow
slotProps={{
popper: {
modifiers: [
{
name: "offset",
options: {
offset: [0, 16],
},
},
],
},
}}
>
<ListItemButton
style={style}
component="div"
disablePadding
selected={currentMeetingId == meeting.id}
onClick={() => setCurrentMeetingId(id)}
>
<ListItemText primary={`${unit}-${id}`} secondary={date} />
</ListItemButton>
</Tooltip>
);
};
return (
<AutoSizer>
{({ height, width }) => (
<FixedSizeList
height={height}
width={width}
itemCount={data.length}
itemSize={64}
itemData={data}
overscanCount={5}
>
{renderRow}
</FixedSizeList>
)}
</AutoSizer>
);
}
function Spinner() {
return (
<span role="status">
<svg
aria-hidden="true"
className="align-text-bottom inline w-4 h-4 text-gray-200 animate-spin dark:text-gray-600 fill-blue-600"
viewBox="0 0 100 101"
fill="none"
xmlns="http://www.w3.org/2000/svg"
>
<path
d="M100 50.5908C100 78.2051 77.6142 100.591 50 100.591C22.3858 100.591 0 78.2051 0 50.5908C0 22.9766 22.3858 0.59082 50 0.59082C77.6142 0.59082 100 22.9766 100 50.5908ZM9.08144 50.5908C9.08144 73.1895 27.4013 91.5094 50 91.5094C72.5987 91.5094 90.9186 73.1895 90.9186 50.5908C90.9186 27.9921 72.5987 9.67226 50 9.67226C27.4013 9.67226 9.08144 27.9921 9.08144 50.5908Z"
fill="currentColor"
/>
<path
d="M93.9676 39.0409C96.393 38.4038 97.8624 35.9116 97.0079 33.5539C95.2932 28.8227 92.871 24.3692 89.8167 20.348C85.8452 15.1192 80.8826 10.7238 75.2124 7.41289C69.5422 4.10194 63.2754 1.94025 56.7698 1.05124C51.7666 0.367541 46.6976 0.446843 41.7345 1.27873C39.2613 1.69328 37.813 4.19778 38.4501 6.62326C39.0873 9.04874 41.5694 10.4717 44.0505 10.1071C47.8511 9.54855 51.7191 9.52689 55.5402 10.0491C60.8642 10.7766 65.9928 12.5457 70.6331 15.2552C75.2735 17.9648 79.3347 21.5619 82.5849 25.841C84.9175 28.9121 86.7997 32.2913 88.1811 35.8758C89.083 38.2158 91.5421 39.6781 93.9676 39.0409Z"
fill="currentFill"
/>
</svg>
<span className="sr-only">...</span>
</span>
);
}
function Messages(props) {
const { messages, stream } = props;
return (
<div className="w-full">
{messages.map((message, index) => (
<div className="flex flex-col w-full leading-1.5 p-4 border-gray-200 bg-gray-100 rounded-e-xl rounded-es-xl dark:bg-gray-700 my-2">
<div className="flex items-center space-x-2 rtl:space-x-reverse">
<span className="flex text-sm font-semibold text-gray-900 dark:text-white">
{message.role === "user" ? "You" : "Assistant"}
</span>
</div>
<span
key={index}
dangerouslySetInnerHTML={{
__html: marked.parse(message.content),
}}
></span>
</div>
))}
{stream != "" && (
<div className="flex flex-col w-full leading-1.5 p-4 border-gray-200 bg-gray-100 rounded-e-xl rounded-es-xl dark:bg-gray-700 my-2">
<div className="flex items-center space-x-2 rtl:space-x-reverse">
<span className="text-sm font-semibold text-gray-900 dark:text-white">
Assistant <Spinner />
</span>
</div>
<span
dangerouslySetInnerHTML={{
__html: marked.parse(stream),
}}
></span>
</div>
)}
</div>
);
}
function Chat(props) {
const { currentMeetingId, subscription, messages, setMessages, stream } =
props;
const [value, setValue] = useState("");
const inputRef = useRef(null);
useEffect(() => {
setValue("");
setMessages([]);
inputRef.current.focus();
}, [currentMeetingId]);
const onSubmit = async (e) => {
e.preventDefault();
setMessages([...messages, { role: "user", content: value }]);
subscription.send({
action: "chat_message",
message: value,
meeting_id: currentMeetingId,
});
setValue("");
};
return (
<div className="flex flex-col">
<div className="flex">
<Messages messages={messages} stream={stream} />
</div>
<form onSubmit={onSubmit}>
<div className="flex w-max items-center absolute bottom-2 right-2">
<TextField
autoFocus
size="small"
fullWidth
value={value}
onChange={(e) => setValue(e.target.value)}
ref={inputRef}
placeholder="Type your prompt here..."
/>
<Button>Send</Button>
</div>
</form>
</div>
);
}
function App() {
const [currentMeetingId, setCurrentMeetingId] = useState(461);
const [subscription, setSubscription] = useState(null);
const [messages, setMessages] = useState([]);
const [stream, setStream] = useState("");
useEffect(() => {
const sub = consumer.subscriptions.create(
{ channel: "MessagesChannel", room: `meeting` },
{
received: (recv) => {
if (recv.type === "full") {
setMessages((messages) => [
...messages,
{ role: "assistant", content: recv.content },
]);
setStream("");
} else if (recv.type === "partial") {
setStream((stream) => stream + recv.content);
}
},
}
);
setSubscription(sub);
return () => sub.unsubscribe();
}, []);
return (
<QueryClientProvider client={queryClient}>
<section className="w-screen h-screen m-0 p-0 ">
<div className="flex h-full overflow-x-hidden">
<div className="w-1/5 overflow-y-auto bg-gray-100">
<MeetingList
currentMeetingId={currentMeetingId}
setCurrentMeetingId={setCurrentMeetingId}
/>
</div>
<div className="w-2/5 flex h-full">
<div className="bg-white p-4 overflow-y-auto border-r-2 w-full">
<MeetingItem currentMeetingId={currentMeetingId} />
</div>
</div>
<div className="w-2/5 flex">
<div className="bg-white p-4 overflow-y-auto w-full max-h-[calc(100vh-4rem)] shadow">
<Chat
currentMeetingId={currentMeetingId}
subscription={subscription}
messages={messages}
setMessages={setMessages}
stream={stream}
/>
</div>
</div>
</div>
</section>
</QueryClientProvider>
);
}
// Mount and bind the React app to the DOM
import { createRoot } from "react-dom";
const domNode = document.getElementById("root");
const root = createRoot(domNode);
root.render(<App />);
It's verbose, but all it's really doing is creating a few window panes for the meeting list, the transcript, and its pre-generated summary and action items (see seeds.rb), and providing a bridge between the app and the server so that the user can ask questions about the meeting and the server can stream the LLM's response through the websocket.
You might also note that the websocket communication is done through the actioncable package, which is wired up here outside of the usual Rails conventions, so it may be more verbose than it otherwise would be.
Use Failure as a Stepping Stone (e.g. Breakable Toys from Apprenticeship Patterns)
Now that we've seen an example application using Ollama and LangChain, it's probably time to address the elephant in the room: why did I write the application in Ruby instead of Python? Isn't Python the de facto language of AI and machine learning?
While it's true that Python (and increasingly JavaScript) is used for knitting together AI components and has become synonymous with data science, the foundational concepts transcend programming languages. As I've mentioned, I am more familiar with Ruby than Python, and it is much easier for me to adapt AI learning material to Ruby than to learn both Python and AI concepts at the same time.
This allowed me to experiment and iterate a lot faster than if I had worked in a language I'm less familiar with. Instead of struggling to translate my thoughts into Pythonese, I'm able to get into a much better flow with Ruby. Not to throw shade at Python; it's just an effect of my own preference and experience.
It does mean, however, that there's less material immediately usable for copying and pasting, since many tutorials and explainer videos assume you're writing your application in Python. The upside of that friction is that I spend more time distilling concepts and exploring ideas through trial and error.
This follows one of the recommendations from the book Apprenticeship Patterns, called breakable toys, which advocates starting "toy" projects that are expected to break whenever you have something new to learn or are diving into uncharted territory. It removes the burden of keeping things working (as with client products or production issues) and gives you the freedom to fail without much consequence.
For example, it was through this breakable toy that I was reminded of GIGO (Garbage In, Garbage Out). My enthusiasm for connecting my application to LLMs made me forget to consider the rest of the pipeline.
Here's a screenshot of what I'm talking about:
You can see that the transcript should have said "The motion. There's [a] motion and a second," but due to the error the LLM got confused and presented this sentence as evidence of potential disagreement.
I had been assuming that the dataset I was using was "accurate" and had been asking questions based on those presumably accurate transcriptions, but as you can see in the screenshot, there were some slight transcription errors that wholly change the meaning and can mislead the LLM.
It's great that I experienced this "failure" while building my breakable toy, because I can easily see it being encountered during client work, where it can be extremely difficult to trace why the LLM is giving misleading answers. I was lucky to ask a specific question while looking at the transcript, out of curiosity about how and where it was mining the information from. Now that I've experienced it for real and not just in theory, I am much more aware of garbage input as a likely culprit for LLM misbehavior.
Caveats, Limitations, and Conclusions
I've learned a lot while working on this mini project, and while I would love to continue working on it I think I'll need to put a cap on it for a bit and work on something tangential (which I'll be writing about as well). I do have a few items that I think could still be improved:
- implement something like tiktoken to better estimate the costs of using a hosted LLM when switching over to production (see the rough sketch after this list)
- use a library with a more Ruby-esque design (e.g. https://github.com/BoxcarsAI/boxcars)
- experiment with other models and do some comparisons between them
- refine the project documentation and extract out key points or areas of interest.
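On that first point, even a crude estimate goes a long way toward budgeting before switching to a paid API. Here's a rough sketch that uses the common "roughly four characters per token" rule of thumb with made-up per-million-token prices; a real estimate would use a proper tokenizer such as tiktoken and the provider's actual price sheet.
# Very rough cost estimate: ~4 characters per token is a common rule of
# thumb, not a real tokenizer, and the prices here are placeholders.
def estimate_cost_usd(prompt, expected_output_chars: 2_000,
                      input_price_per_m: 0.15, output_price_per_m: 0.60)
  input_tokens  = prompt.length / 4.0
  output_tokens = expected_output_chars / 4.0
  (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000
end

# Use the first raw line of the dataset as a stand-in for a long transcript.
sample = File.open("train.json", &:readline)
puts format("~$%.4f per summarisation call", estimate_cost_usd(sample))
Multiply that per-call figure by the number of records in the dataset and the number of iterations you expect, and you get a feel for what full-scale seeding would cost on a hosted model.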
AI development doesn't have to be expensive. I encourage you to try out the strategies above to help control costs while iterating on features and learning the fundamentals of designing an AI pipeline.
You can find the project repository on Hivekind's GitHub: https://github.com/Hivekind/docschat