We'll use the newly released, production-ready Spring AI 1.0 GA to build a chat application with Spring AI, Ollama, and Docker, complete with out-of-the-box chat memory management.
Let's enable Java developers to quickly and easily add AI to their projects.
Dependencies:
- Java 21
- Spring Web
- Ollama
- H2 Database
- JDBC Chat Memory Repository
- Spring Boot Actuator
Why Spring AI + Ollama?
The AI engineering landscape is no longer just Python-centric. With Spring AI, Java developers can now build AI-powered applications using open-source models like Llama 3, Gemma, Deepseek-R1 and many more!
And the best part is: You can start hosting them locally via Ollama.
In this article, you’ll learn how to:
- Set up Ollama as your local LLM inference server (Docker).
- Integrate it with Spring AI as a Java-based AI engineering framework.
- Create multi-user sessions with conversation history.
- Build a streaming chatbot using Server-Sent Events (SSE) and a simple frontend (HTML/CSS/JS).
- Dockerize everything for local development and usage.
Let’s dive in!
Architecture
1. Setting Up Ollama & Downloading Models
We'll start with compact models of roughly 1B parameters. For text-generation tasks, small models are a good choice to get started.
Ollama lets you run open-source LLMs locally. Here’s how to get started:
Install Ollama (via Docker)
docker run -d -v ./ollama/ollama-server:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
Download Models (pick one or more)
docker exec ollama ollama pull llama3.2:1b # Meta's Llama 3.2
docker exec ollama ollama pull gemma3:1b # Google's Gemma
docker exec ollama ollama pull deepseek-r1:1.5b # Deepseek's R1
Verify it’s running:
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2:1b",
"prompt": "Hello, world!"
}'
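If you prefer to run this sanity check from Java instead of curl, here's a minimal sketch using the JDK's built-in HTTP client (the class name is made up for illustration; "stream": false asks Ollama for a single JSON response instead of a token stream):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class OllamaSmokeTest {
    public static void main(String[] args) throws Exception {
        // Same request as the curl above, but asking for one JSON object instead of a stream.
        String body = """
                {"model": "llama3.2:1b", "prompt": "Hello, world!", "stream": false}
                """;

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:11434/api/generate"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // The generated text is in the "response" field of the returned JSON.
        System.out.println(response.body());
    }
}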
2. AI Engineering 101: Beyond Python
While Python dominates AI tooling, Java is catching up with frameworks like Spring AI.
Key concepts:
Foundation Models: Pre-trained LLMs (e.g., Llama 3) that you can fine-tune.
Inference APIs: Tools like Ollama let you run these models locally.
AI Engineering: The art of integrating LLMs into real-world apps (e.g., chatbots, RAG systems).
3. Spring AI + Ollama: Java Meets LLMs
Spring AI is a natural choice for bringing AI capabilities to the Spring ecosystem. Here's how to connect it to Ollama:
- Step 3.1: Add Spring AI to Your Project
<!-- use Ollama as LLM inference server and Model provider -->
<dependency>
<groupId>org.springframework.ai</groupId>
<artifactId>spring-ai-starter-model-ollama</artifactId>
</dependency>
<!-- use JDBC to store messages in a relational database. -->
<dependency>
<groupId>org.springframework.ai</groupId>
<artifactId>spring-ai-starter-model-chat-memory-repository-jdbc</artifactId>
</dependency>
<dependency>
<groupId>com.h2database</groupId>
<artifactId>h2</artifactId>
<scope>runtime</scope>
</dependency>
- Step 3.2: Configure Ollama in application.yml
spring:
application:
name: demo-chatbot
ai:
ollama:
base-url: http://localhost:11434
chat:
model: llama3.2:1b # deepseek-r1:1.5b, gemma3:1b
chat:
memory:
repository:
jdbc:
# https://docs.spring.io/spring-ai/reference/1.0/api/chat-memory.html#_schema_initialization
initialize-schema: always
schema: classpath:sql/schema-h2.sql
datasource:
url: jdbc:h2:mem:~/demo-chatbot
driverClassName: org.h2.Driver
username: sa
password: password
h2:
console:
enabled: true
path: /h2
- Step 3.3: Call the LLM from Java
The ChatClient offers a fluent API for communicating with an AI model.
The default system prompt creates a simple prompt template and sets the tone for responses.
The Advisors API provides a flexible way to intercept, modify, and enhance interactions with a model.
Large Language Models (LLMs) are stateless, meaning they do not retain information about previous interactions. Spring AI auto-configures a ChatMemory bean that allows you to store and retrieve messages across multiple interactions. For H2 you have to create the schema yourself. Place it inside src/main/resources/sql/schema-h2.sql:
CREATE TABLE IF NOT EXISTS SPRING_AI_CHAT_MEMORY (
conversation_id VARCHAR(36) NOT NULL,
content TEXT NOT NULL,
type VARCHAR(10) NOT NULL CHECK (type IN ('USER', 'ASSISTANT', 'SYSTEM', 'TOOL')),
"timestamp" TIMESTAMP NOT NULL
);
CREATE INDEX IF NOT EXISTS SPRING_AI_CHAT_MEMORY_CONVERSATION_ID_TIMESTAMP_IDX
ON SPRING_AI_CHAT_MEMORY(conversation_id, "timestamp");
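Spring AI auto-configures the ChatMemory bean on top of the JDBC repository, so no extra code is strictly required. If you want to control how many messages are kept per conversation, here's a hedged sketch of an explicit bean definition (the class name and the maxMessages value of 20 are arbitrary examples):

import org.springframework.ai.chat.memory.ChatMemory;
import org.springframework.ai.chat.memory.ChatMemoryRepository;
import org.springframework.ai.chat.memory.MessageWindowChatMemory;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class ChatMemoryConfig {
    // Optional override of the auto-configured ChatMemory: keep only the last N
    // messages per conversation, backed by the JDBC repository configured above.
    @Bean
    public ChatMemory chatMemory(ChatMemoryRepository chatMemoryRepository) {
        return MessageWindowChatMemory.builder()
                .chatMemoryRepository(chatMemoryRepository)
                .maxMessages(20) // arbitrary example value
                .build();
    }
}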
@Configuration
public class ChatConfig {
@Bean
public ChatClient chatClient(ChatClient.Builder builder, ChatMemory chatMemory) {
String defaultSystemPrompt = """
You are a helpful AI assistant; your responsibility is to answer users' questions
about a variety of topics.
When answering a question, always greet first and state your name as JavaChat.
When unsure about the answer, simply state that you don't know.
""";
return builder
.defaultSystem(defaultSystemPrompt)
.defaultAdvisors(
    new SimpleLoggerAdvisor(), // logs requests to and responses from the model
    PromptChatMemoryAdvisor.builder(chatMemory).build() // lets Spring AI persist conversation memory in the DB
)
.build();
}
}
@RequestMapping("/api/chat")
@RestController
public class ChatController {
@Autowired
private ChatClient chatClient;
@GetMapping
public String chat(@RequestParam String question, @RequestParam String chatId) {
return chatClient
.prompt()
.user(question)
.advisors(advisor -> advisor
.param(ChatMemory.CONVERSATION_ID, chatId))
.call()
.content();
}
}
Test it: curl "http://localhost:8080/api/chat?question=Tell%20me%20a%20joke&chatId=1"
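Because we pulled several models in step 1, you can also override the model (or sampling options) per request with portable ChatOptions instead of editing application.yml. Here's a hedged sketch of an extra endpoint inside the same ChatController (the endpoint path, model name, and temperature are just examples):

// import org.springframework.ai.chat.prompt.ChatOptions;
@GetMapping("/gemma")
public String chatWithGemma(@RequestParam String question, @RequestParam String chatId) {
    return chatClient
            .prompt()
            .user(question)
            .options(ChatOptions.builder()
                    .model("gemma3:1b")   // one of the models pulled in step 1
                    .temperature(0.7)     // optional sampling tweak
                    .build())
            .advisors(advisor -> advisor
                    .param(ChatMemory.CONVERSATION_ID, chatId))
            .call()
            .content();
}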
4. Streaming Chat with Server-Sent Events (SSE)
SSE is a lightweight protocol for real-time, one-way streaming from server to client (perfect for chatbots). Unlike WebSockets (bidirectional), SSE is simpler for use cases like LLM streaming.
SSE also provides a better UX for end users, because partial responses are published as soon as they're ready (complex replies can take a minute or more). Let's stream responses using SSE:
@GetMapping(value = "/stream", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
public Flux<ChunkResponseDTO> streamChat(@RequestParam String question, @RequestParam String chatId) {
return chatClient
.prompt()
.user(question)
.advisors(advisor -> advisor
.param(ChatMemory.CONVERSATION_ID, chatId))
.stream()
.content()
.map(chunk -> new ChunkResponseDTO(chunk));
}
Key Details:
- TEXT_EVENT_STREAM_VALUE: the text/event-stream content type enables SSE.
- SSE format: each message must end with \n\n and be prefixed with data: for compliance.
- Reactive streams: Flux (from Project Reactor) handles asynchronous streaming.
public record ChunkResponseDTO(String value) {}
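Besides the browser, you can smoke-test the streaming endpoint from Java. Here's a hedged sketch using Spring's reactive WebClient (it assumes spring-boot-starter-webflux, or some other source of WebClient and Jackson, is on the classpath; the question is a placeholder):

import org.springframework.http.MediaType;
import org.springframework.web.reactive.function.client.WebClient;

public class StreamingClientDemo {
    record ChunkResponseDTO(String value) {}

    public static void main(String[] args) {
        WebClient client = WebClient.create("http://localhost:8080");

        client.get()
                .uri(uriBuilder -> uriBuilder
                        .path("/api/chat/stream")
                        .queryParam("chatId", "1")
                        .queryParam("question", "Tell me a joke")
                        .build())
                .accept(MediaType.TEXT_EVENT_STREAM)
                .retrieve()
                .bodyToFlux(ChunkResponseDTO.class) // each SSE data: line is decoded into a DTO
                .doOnNext(chunk -> System.out.print(chunk.value()))
                .blockLast(); // blocking is fine for a demo; stay reactive in real code
    }
}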
Limitations with HTTP/1.1
Connection Limits:
- Browsers allow only 6 concurrent HTTP/1.1 connections per domain.
- SSE consumes one connection per stream, which can block other requests.
Upgrading to HTTP/2 for Performance
HTTP/2 fixes SSE bottlenecks with:
Multiplexing: Multiple streams over a single TCP connection. The maximum number of simultaneous HTTP streams is negotiated between the server and the client (defaults to 100)
How to Enable HTTP/2 in Spring Boot
- Step 4.1: Configure HTTP/2 in application.yml
server:
http2:
enabled: true
ssl:
enabled: true
key-store: classpath:keystore.p12
key-store-password: yourpassword
- Step 4.2: Generate a Self-Signed Certificate (for testing only):
keytool -genkeypair -alias mydomain -keyalg RSA -keysize 2048 -storetype PKCS12 -keystore keystore.p12 -validity 365
Verify HTTP/2 is Active (-k to trust self signed certificate)
curl --head -k https://localhost:8080/actuator/health
Frontend:
Starting with the JavaScript:
const chatStream = (question) => {
const eventSource = new EventSource(`https://localhost:8080/api/chat/stream?chatId=1&question=${encodeURIComponent(question)}`);
eventSource.onmessage = (e) => {
console.log('New message:', JSON.parse(e.data).value);
// Append to UI (e.g., a chat div)
document.getElementById('messages').innerHTML += JSON.parse(e.data).value;
};
eventSource.onerror = (e) => {
console.error('SSE error:', e);
eventSource.close();
};
};
// Usage
chatStream("Tell me about Java");
Key Details:
- EventSource: native browser API for SSE (no libraries needed).
- Automatic reconnection: built-in retry logic if the connection drops.
Secure Frontend Rendering for LLM Output
LLM responses often include Markdown or HTML (e.g., **bold**, <script>), which can lead to XSS vulnerabilities if rendered naively.
Here’s how to secure your frontend:
- Step 4.3: Sanitize Markdown/HTML (Critical!) Use DOMPurify to sanitize raw LLM output before rendering:
<script src="https://cdnjs.cloudflare.com/ajax/libs/dompurify/3.0.6/purify.min.js"></script>
eventSource.onmessage = (e) => {
const chunkResponse = JSON.parse(e.data).value;
console.log('New message:', chunkResponse);
const sanitized = DOMPurify.sanitize(chunkResponse); // Strips malicious scripts
// Append to UI (e.g., a chat div)
document.getElementById('messages').innerHTML += sanitized;
};
- Step 4.4: For Markdown Support (Optional)
If you want to render Markdown safely, use a library like Marked + DOMPurify:
<script src="https://cdn.jsdelivr.net/npm/marked/marked.min.js"></script>
let chunkResponses = '';
eventSource.onmessage = (e) => {
chunkResponses += JSON.parse(e.data).value;
// Sanitize all chunks received so far.
DOMPurify.sanitize(chunkResponses);
// Check if the output was insecure.
if (DOMPurify.removed.length) {
// If the output was insecure, immediately stop what you were doing.
// Reset the parser and flush the remaining Markdown.
chunkResponses = '';
return;
}
// Render the Markdown, then sanitize the resulting HTML before appending to the UI (e.g., a chat div)
document.getElementById('messages').innerHTML = DOMPurify.sanitize(marked.parse(chunkResponses));
};
Key Security Considerations: Never Trust LLM Output (nor users' input)
- Assume all LLM responses may contain malicious code (even unintentionally).
- Assume users will try to break your code and test your security.
- Example attack: Hey
<script>fetch('/steal-cookie')</script>
Limitations with EventSource API
Even though using SSE on the client side is easy, the EventSource API has some restrictions:
- No Custom Request Headers: Custom request headers are not allowed.
- HTTP GET Only: There is no way to specify another HTTP method.
- No Request Body: all the chat messages must go in the URL, which should be kept under roughly 2,000 characters for broad browser compatibility.
- If you need more control, check extension libraries and alternatives for SSE: Fetch Event Source, or the Fetch API with getReader().
- Step 4.5: Starting the HTML Structure
Here's the HTML structure, which includes a form for user input, a container to display the streamed data, and a sidebar for message history.
<!DOCTYPE html>
<html lang="en">
<head>
<title>Spring AI Chat</title>
<link rel="stylesheet" href="layout.css">
</head>
<body>
<!-- Sidebar for chat history -->
<div id="sidebar">
<h3>Chat History</h3>
<ul id="history-list"></ul>
</div>
<!-- Main chat area -->
<div id="chat-container">
<div id="messages"></div>
<form id="input-form">
<input type="text" id="prompt" placeholder="Type your message..." autocomplete="off">
<button type="submit">Send</button>
</form>
</div>
<script src="main.js"> </script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/dompurify/3.0.6/purify.min.js"></script>
<script src="https://cdn.jsdelivr.net/npm/marked/marked.min.js"></script>
</body>
</html>
Place the HTML in src/main/resources/static/index.html
Put the JavaScript in src/main/resources/static/js/main.js
5. Deploy Your Spring AI + Ollama Chatbot with Docker 🚀
- Step 5.1 Docker Compose Setup
Create ollama-docker-compose.yaml
(P.S. If your machine supports GPU, you can enable GPU acceleration inside Docker containers. Ollama Image docs)
services:
# Ollama LLM inference server
ollama:
volumes: # Ollama with persistent storage (no redownloading models).
- ./ollama/ollama-server:/root/.ollama
container_name: ollama
pull_policy: always
tty: true
restart: unless-stopped
image: docker.io/ollama/ollama:latest
ports:
- 11434:11434
environment:
- OLLAMA_KEEP_ALIVE=24h
# Enable GPU support (remove this deploy block if you don't have an NVIDIA GPU)
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
# Spring AI Backend
chat-app:
build:
context: . # Dockerfile in the root folder
container_name: chat-app
ports:
- "8080:8080"
environment:
- SPRING_AI_OLLAMA_BASE_URL=http://ollama:11434
depends_on:
- ollama
- Step 5.2 Spring Boot Dockerfile
# Maven build stage
FROM maven:3.9.9-eclipse-temurin-21-alpine AS build
WORKDIR /app
COPY pom.xml .
RUN mvn dependency:go-offline
COPY src/ ./src/
RUN mvn clean package
# Spring Boot package stage
FROM eclipse-temurin:21-jre-alpine
COPY --from=build app/target/*.jar app.jar
EXPOSE 8080
ENTRYPOINT ["java", "-jar", "app.jar"]
- Start everything using docker-compose build && docker-compose up -d
- Navigate to https://localhost:8080 and start your chat session.
- If you want to see how messages are stored in the database, open the H2 console at https://localhost:8080/h2
Conclusion: Your Java AI Future
You just built a locally hosted, open-source chatbot with Spring AI and Ollama, no OpenAI API costs or Python required!
SSE + HTTP/2 + Spring AI = scalable, real-time LLM streaming.
Where to Go Next?
- Check out the full code
- Experiment with RAG (Retrieval-Augmented Generation) using Spring AI's embedding model API and vector databases; a tiny retrieval sketch follows below.
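To give a flavor of that last point, here's a hedged, minimal retrieval sketch. It assumes a VectorStore bean has been configured (for example via one of Spring AI's vector store starters); the class name, documents, and query are placeholders:

import java.util.List;
import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.VectorStore;

public class RagSketch {
    // Assumes a configured VectorStore bean is injected by Spring.
    void indexAndSearch(VectorStore vectorStore) {
        // 1. Index some documents; embeddings are computed by the configured embedding model.
        vectorStore.add(List.of(
                new Document("Spring AI brings LLMs to the Spring ecosystem."),
                new Document("Ollama runs open-source models locally.")));

        // 2. Retrieve the documents most similar to a user question.
        List<Document> results = vectorStore.similaritySearch("How do I run models locally?");

        // 3. Stuff the results into the prompt (e.g. via an advisor) before calling the model.
        results.forEach(System.out::println);
    }
}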
What’ll you build? Share your thoughts in the comments! 👇
(P.S. Follow me for more Java + AI tutorials!)