Applied AI for Enterprise Java
Development
How to Successfully Leverage Generative AI,
Large Language Models, and Machine Learning
in the Java Enterprise
With Early Release ebooks, you get books in their earliest
form—the author’s raw and unedited content as they write—
so you can take advantage of these technologies long before
the official release of these titles.
Alex Soto Bueno, Markus Eisele, and Natale Vinto
Applied AI for Enterprise Java Development
by Alex Soto Bueno, Markus Eisele, and Natale Vinto
Copyright © 2025 Alex Soto Bueno, Markus Eisele, Natale Vinto. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are
also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional
sales department: 800-998-9938 or corporate@oreilly.com.
Editors: Melissa Potter and Brian Guerin
Production Editor: Katherine Tozer
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Kate Dullea
September 2025: First Edition
Revision History for the Early Release
2024-10-30: First Release
2024-12-12: Second Release
2025-02-18: Third Release
2025-04-28: Fourth Release
See http://oreilly.com/catalog/errata.csp?isbn=9781098174507 for release details.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Applied AI for Enterprise Java Develop‐
ment, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
The views expressed in this work are those of the authors and do not represent the publisher’s views.
While the publisher and the authors have used good faith efforts to ensure that the information and
instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility
for errors or omissions, including without limitation responsibility for damages resulting from the use
of or reliance on this work. Use of the information and instructions contained in this work is at your
own risk. If any code samples or other technology this work contains or describes is subject to open
source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use
thereof complies with such licenses and/or rights.
This work is part of a collaboration between O’Reilly and Red Hat. See our statement of editorial
independence.
978-1-098-17444-6
[FILL IN]
Table of Contents
Brief Table of Contents (Not Yet Final). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Preface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
1. The Enterprise AI Conundrum. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Understanding the AI Landscape: A Technical Perspective all the way to Gen AI 17
Machine Learning (ML): The Foundation of today’s AI 18
Deep Learning: A Powerful Tool in the AI Arsenal 18
Generative AI: The Future of Content Generation 19
Open-Source Models and Training Data 21
Why Open Source is an Important Driver for Gen AI 21
Open Source Training Data 21
Adding Company specific Data to LLMs 22
Explainable and transparent AI decisions 23
Ethical and Sustainability Considerations 23
The lifecycle of LLMs and ways to influence their behaviour 24
MLOps vs DevOps 25
Conclusion 27
2. The New Types of Applications. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Understanding Large Language Models 30
Key Elements of a Large Language Model 31
Deploying Models 37
Understanding Tool Use and Function Calling 44
Choosing the Right LLM for Your Application 45
Example Categorization 48
Foundation Models or Expert Models - Where are we headed? 49
Prompts for Developers - Why Prompts Matter in AI-Infused Applications 50
Types of Prompts 51
Principles of Writing Effective Prompts 51
Prompting Techniques 52
Advanced Strategies 54
Supporting Technologies 57
Vector Databases & Embedding Models 57
Caching & Performance Optimization 57
AI Agent Frameworks 58
Model Context Protocol (MCP) 58
API Integration 58
Model Security, Compliance & Access Control 59
Conclusion 60
3. Inference API. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
What is an Inference API? 62
Examples of Inference APIs 63
Deploying Inference Models in Java 67
Inferencing models with DJL 68
Under the hood 76
Inferencing Models with gRPC 77
Next Steps 83
4. Accessing the Inference Model with Java. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Connecting to an Inference API with Quarkus 85
Architecture 86
The Fraud Inference API 87
Creating the Quarkus project 87
REST Client interface 87
REST Resource 88
Testing the example 89
Connecting to an inference API with Spring Boot WebClient 90
Adding WebClient Dependency 90
Using the WebClient 90
Connecting to the Inference API with Quarkus gRPC client 91
Adding gRPC Dependencies 92
Implementing the gRPC Client 92
Going Beyond 95
5. Image Processing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
OpenCV 99
Initializing the Library 100
Manual Installation 100
Bundled Installation 100
Load and Save Images 101
Basic Transformations 102
Overlaying 106
Image Processing 111
Reading QR/BarCode 127
Stream Processing 130
Processing Videos 131
Processing WebCam 132
OpenCV and Java 133
OCR 135
Next steps 138
Brief Table of Contents (Not Yet Final)
Chapter 1: The Enterprise AI Conundrum (available)
Chapter 2: The New Types of Applications (available)
Chapter 3: AI Architectures for Applications (unavailable)
Chapter 4: Embedding Vectors, Vector Stores, and Running Models (unavailable)
Chapter 5: Inference API (available)
Chapter 6: Accessing the Inference Model with Java (available)
Chapter 7: Langchain4J (unavailable)
Chapter 8: Vector Embeddings and Stores (unavailable)
Chapter 9: LangGraph (unavailable)
Chapter 10: Image Processing (available)
Chapter 11: Enterprise Use Cases (unavailable)
Chapter 12: Architecture AI Patterns (unavailable)
Preface
Why We Wrote the Book
The demand for AI skills in the enterprise Java world is exploding, but let’s face
it: learning AI can be intimidating for Java developers. Many resources are too
theoretical, focus heavily on data science, or rely on programming languages that
are unfamiliar to enterprise environments. As seasoned programmers with years of
experience in large-scale enterprise Java projects, we’ve faced the same challenges.
When we started exploring AI and LLMs, we were frustrated by the lack of practical
resources tailored to Java developers. Most materials seemed out of reach, buried
under layers of Python code and abstract concepts.
That’s why we wrote this book. It’s the practical guide we wish we had, designed
for Java developers who want to build real-world AI applications using the tools
and frameworks they already know and love. Inside, you’ll find clear explanations
of essential AI techniques, hands-on examples, and real-world projects that will help
you to integrate AI into your existing Java projects.
Who Should Read This Book
This book is designed for developers who are interested in learning how to build
systems that use AI and Deep Learning together with technologies they already know
and love: cloud-native infrastructure and Java-based applications and services.
Developers like yourself, who are curious about the potential of Artificial Intelligence
(AI), and specifically Deep Learning (DL) and, of course, Large Language Models. We
want not only to help you understand the basics but also to give you the ability to apply
core technologies and concepts to transform your projects into modern applications.
Whether you’re a seasoned developer or just starting out, this book will guide you
through the process of applying AI concepts and techniques to real-world problems
with concrete examples.
This book is perfect for:
• Java developers looking to expand their skill set into AI and machine learning
• IT professionals seeking to understand the practical implementation of the busi‐
ness value that AI promises to deliver
As the title already implies, we intend to keep this book practical and development
centric. This book isn’t a perfect fit but will still benefit:
• Business leaders and decision-makers. We focus on code and implementation
details a lot. While the introductory chapters provide some context and introduce
challenges, we will not talk a lot about business challenges.
• Data scientists and analysts. They could get some use out of our tuning
approaches but won’t find a complete overview of the data science theory behind
the magic.
How the Book Is Organized
In this book, you’ll gain a deeper understanding of how to apply AI techniques
like machine learning (ML), natural language processing (NLP), and deep learning
(DL) to solve real-world problems. Each chapter is designed to build your knowledge
progressively, giving you the practical skills needed to apply AI within the Java
ecosystem.
Chapter 1: The Enterprise AI Conundrum - Fundamentals of AI and Deep Learning
We begin with the foundational concepts necessary for working on modern AI
projects, focusing on the key principles of machine learning and deep learning.
This chapter covers the minimal knowledge needed to collaborate effectively with
Data Scientists and use AI frameworks. Think about it as building a common
taxonomy. We also provide a brief history of AI and DL, explaining their evolu‐
tion and how they’ve shaped today’s landscape. From here, we introduce how
these techniques can be applied to real-world problems, touching on the impor‐
tance and role of Open Source within the new world, the challenge of training
data, and the side effects developers face when working with these data-driven
models.
Chapter 2: The New Types of Applications - Generative AI and Language Models
In this chapter, we explore the world of large language models. After a brief
introduction to AI classifications, you’ll get an overview of the most common
taxonomies used to describe generative AI models. We’ll dive into the mechanics
of tuning models, including the differences between alignment tuning, prompt
tuning, and prompt engineering. By the end of this chapter, you’ll understand
how to “query” models and apply different tuning strategies to get the results you
need.
Chapter 3: Models: Serving, Inference, and Architectures - Architectural Concepts for
AI-Infused Applications
Now that we have the basics in place, we move into the architectural aspects of
AI applications. This chapter walks you through best practices for integrating
AI into existing systems, focusing on modern enterprise architectures like APIs,
microservices, and cloud-native applications. We’ll start with a simple scenario
and build out more complex solutions, adding one conceptual building block at a
time.
Chapter 4: Public Models - Exploring AI Models and Model Serving Infrastructure
This chapter talks about the most prominent AI models and their unique special‐
ties. We help you understand the available models and you’ll learn how to choose
the right model for your use case. We also cover model serving infrastructure—
how to deploy, scale, and manage AI models in both cloud and local environ‐
ments. This chapter equips you with the knowledge to serve models efficiently in
production.
Chapter 5: Inference API - Inference and Querying AI Models with Java
We take a closer look at the process of “querying” AI models, often referred to
as inference or asking a model to make a prediction. We introduce the standard
APIs that allow you to perform inference and walk through practical Java exam‐
ples that show how to integrate AI models seamlessly into your applications. By
the end of this chapter, you’ll be proficient in writing Java code that interacts with
AI models to deliver real-time results.
Chapter 6: Accessing the Inference Model with Java - Building a Full Quarkus-Based AI
Application
This hands-on chapter walks you through the creation of a full AI-infused appli‐
cation using Quarkus, a lightweight Java framework. You’ll learn how to integrate
a trained model into your application using both REST and gRPC protocols and
explore testing strategies to ensure your AI components work as expected. By the
end, you’ll have your first functional AI-powered Java application.
Chapter 7: Introduction to LangChain4J
LangChain4J is a powerful library that simplifies the integration of large language
models (LLMs) into Java applications. In this chapter, we introduce the core
concepts of LangChain4J and explain its key abstractions.
Chapter 8: Image Processing - Stream-Based Processing for Images and Video
This chapter takes you through stream-based data processing, where you’ll learn
to work with complex data types like images and videos. We’ll walk you through
image manipulation algorithms and cover video processing techniques, including
optical character recognition (OCR).
Chapter 9: Enterprise Use Cases
Chapter nine covers enterprise use cases. We’ll discuss real life examples that we
have seen and how they make use of either generative or predictive AI. It is a
selection of experiences you can use to extend your problem solving toolbox with
the help of AI.
Chapter 10: Architecture AI Patterns
In this final chapter, we shift focus from foundational concepts and basic
implementations to the patterns and best practices you’ll need for building AI
applications that are robust, efficient, and production-ready. While the previous
chapters provided clear, easy-to-follow examples, real-world AI deployments
often require more sophisticated approaches to address the unique challenges
that arise at scale; you will experience a selection of them in this chapter.
Prerequisites and Software
While the first chapter introduces a lot of concepts that are likely not familiar to
you yet, we’ll dive into coding later on. For this, you need some software packages
installed on your local machine. Make sure to download and install the following
software:
• Java 17+ (https://openjdk.java.net/projects/jdk/17/)
• Maven 3.8+ (https://maven.apache.org/download.cgi)
• Podman Desktop v1.11.1+ (https://podman-desktop.io/)
• Podman Desktop AI lab extension
We are assuming that you’ll run the examples from this book on your laptop and you
have a solid understanding of Java already. The models we are going to work with are
publicly accessible and we will help you download, install, and access them when we
get to later chapters. If you have a GPU at hand, perfect. But it won’t be necessary
for this book. Just make sure you have a reasonable amount of disk space available on
your machine.
CHAPTER 1
The Enterprise AI Conundrum
A Note for Early Release Readers
With Early Release ebooks, you get books in their earliest form—the author’s raw and
unedited content as they write—so you can take advantage of these technologies long
before the official release of these titles.
This will be the 1st chapter of the final book. Please note that the GitHub repo will be
made active later on.
If you have comments about how we might improve the content and/or examples in
this book, or if you notice missing material within this chapter, please reach out to the
editor at mpotter@oreilly.com.
Artificial Intelligence (AI) has rapidly become an essential part of modern enterprise
systems. We witness how it is reshaping industries and transforming the way busi‐
nesses operate. And this includes the way developers work with code. However,
understanding the landscape of AI and its various classifications can be overwhelm‐
ing, especially when trying to identify how it fits into the enterprise Java ecosystem
and existing applications. In this chapter, we aim to provide you with a foundation by
introducing the core concepts, methodologies, and terminologies that are critical to
building AI-infused applications.
While the focus of this chapter is on setting the stage, it is not just about abstract
definitions or acronyms. The upcoming sections will cover:
A Technical Perspective All the Way to Generative AI
While large language models
(LLMs) are getting most of the attention today, the field of artificial intelligence has
a much longer history. Understanding how AI has developed over time is important
when deciding how to use it in your projects. AI is not just about the latest trends
—it’s about recognizing which technologies are reliable and ready for real-world
applications. By learning about AI’s background and how different approaches have
evolved, you will be able to separate what is just hype from what is actually useful
in your daily work. This will help you make smarter decisions when it comes to
choosing AI solutions for your enterprise projects.
Open-Source Models and Training Data
AI is only as good as the data it learns from.
High-quality, relevant, and well-organized data is crucial to building AI systems
that produce accurate and reliable results. In this chapter, you’ll learn why using
open-source models and data is a great advantage for your AI projects. The open-
source community shares tools and resources that help everyone, including smaller
companies, access the latest advancements in AI.
Ethical and Sustainability Considerations
As AI becomes more common in business,
it’s important to think about the ethical and environmental impacts of using these
technologies. Building AI systems that respect privacy, avoid bias, and are transparent
in how they make decisions is becoming more and more important. And training
large models requires significant computing power, which has an environmental
impact. We’ll introduce some of the key ethical principles you should keep in mind
when building AI systems, along with the importance of designing AI in ways that are
environmentally friendly.
The Lifecycle of LLMs and Ways to Influence Their Behavior
If you’ve used AI chatbots
or other tools that respond to your questions, you’ve interacted with large language
models (LLMs). But these models don’t just work by magic—they follow a lifecycle,
from training to fine-tuning for specific tasks. In this chapter, we’ll explain how LLMs
are created and how you can influence their behavior. You’ll learn the very basics
about prompt tuning, prompt engineering, and alignment tuning, which are ways to
guide a model’s responses. By understanding how these models work, you’ll be able to
select the right technique for your projects.
DevOps vs. MLOps
As AI becomes part of everyday software development, it’s impor‐
tant to understand how traditional DevOps practices interact with machine learning
operations (MLOps). DevOps focuses on the efficient development and deployment
of software, while MLOps applies similar principles to the development and deploy‐
ment of AI models. These two areas are increasingly connected, and development
teams need to understand how they complement each other. We’ll briefly outline the
key similarities and differences between DevOps and MLOps, and show how both are
necessary and interconnected to successfully deliver AI-powered applications.
Fundamental Terms
AI comes with a lot of technical terms and abbreviations, and it
can be easy to get lost in all the jargon. Throughout this book, we will introduce you
to important AI terms in simple, clear language. From LLMs to MLOps, we’ll explain
everything in a way that’s easy to understand and relevant to your projects. You’ll
also find a glossary at the end of the book that you can refer to whenever you need a
quick reminder. Understanding these basic terms will help you communicate with AI
specialists and apply these concepts in your own Java development projects.
By the end of this chapter, you’ll have a clearer understanding of the AI landscape and
the fundamental principles. Let’s begin by learning some basics and setting the stage
for your journey into enterprise-level AI development.
Understanding the AI Landscape: A Technical Perspective
all the way to Gen AI
Gen AI employs neural networks and deep learning algorithms to identify patterns
within existing data, generating original content as a result. By analyzing large vol‐
umes of data, Gen AI algorithms synthesize knowledge to create novel text, images,
audio, video, and other forms of output. The history of AI spans decades, marked by
progress, occasional setbacks, and periodic breakthroughs. The individual disciplines
and specializations can be thought of as a nested box system as shown in Figure
1-1. Foundational ideas in AI date back to the early 20th century, while classical AI
emerged in the 1950s and gained traction in the following decades. Machine learning
(ML) is a comparatively newer discipline that rose to prominence in the 1980s; it involves
training computer algorithms to learn patterns and make predictions based on data.
The popularity of neural networks during this period was inspired by the structure
and functioning of the human brain.
Figure 1-1. What Gen AI is and how it is positioned within the AI stack.
What initially sounds like individual disciplines can be summarized under the gen‐
eral term Artificial Intelligence (AI). And AI itself is a multidisciplinary field within
Computer Science that boldly strives to create systems capable of emulating and
surpassing human-level intelligence. While traditional AI can be looked at as a mostly
rule-based system, the next evolutionary step is ML, which we’ll dig into next.
Machine Learning (ML): The Foundation of today’s AI
ML is the foundation of today’s AI technology. It was the first approach that allowed
computers to learn from data without the need to be explicitly programmed for every
task. Instead of following predefined rules, ML algorithms can analyze patterns and
relationships within large sets of data. This enables them to make decisions, classify
objects, or predict outcomes based on what they’ve learned. The key idea behind ML
is that it focuses on finding relationships between input data (features) and the results
we want to predict (targets). This makes ML incredibly versatile, as it can be applied
to a wide range of tasks, from recognizing images to predicting trends in data.
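To make the idea of learning a relationship between features and targets concrete, here is a minimal sketch in plain Java: a one-feature linear model fitted by least squares. The method and variable names are ours, purely for illustration; real ML work relies on libraries rather than hand-rolled math, but the principle is the same.

    // Minimal illustration: "learn" the relationship y ≈ w*x + b from data
    // by finding the w and b that minimize the squared prediction error.
    static double[] fitLine(double[] features, double[] targets) {
        int n = features.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) { meanX += features[i]; meanY += targets[i]; }
        meanX /= n; meanY /= n;
        double covariance = 0, variance = 0;
        for (int i = 0; i < n; i++) {
            covariance += (features[i] - meanX) * (targets[i] - meanY);
            variance   += (features[i] - meanX) * (features[i] - meanX);
        }
        double w = covariance / variance;   // slope learned from the data
        double b = meanY - w * meanX;       // intercept learned from the data
        return new double[] { w, b };       // the "model" is just these two numbers
    }

Everything that follows in this chapter, from deep learning to generative AI, scales this same idea up: more parameters, more data, and far more elaborate functions.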
Machine Learning has far-reaching implications across various industries and
domains. One prominent application is Image Classification, where ML algorithms
can be trained to identify objects, scenes, and actions from visual data. For instance,
self-driving cars rely on image classification to detect pedestrians, roads, and obsta‐
cles.
Another application is Natural Language Processing (NLP), which enables computers
to comprehend, generate, and process human language. NLP has numerous practi‐
cal uses, such as chatbots that can engage in conversation, sentiment analysis for
customer feedback, and machine translation for real-time language interpretation.
Speech Recognition is another significant application of ML, allowing devices to
transcribe spoken words into text. This technology has changed the way we interact
with devices. Its early iterations brought us voice assistants like Siri, Google Assistant,
and Alexa. Finally, Predictive Analytics uses ML to analyze data and forecast future
outcomes. For example, healthcare providers use predictive analytics to identify
high-risk patients and prevent complications, while financial institutions utilize this
technology to predict stock market trends and make informed investment decisions.
Deep Learning: A Powerful Tool in the AI Arsenal
While it may have seemed like everyone was just interested in talking about LLMs,
the basic theories of Machine Learning still made real progress in recent years. ML’s
progress was followed by Deep Learning (DL), which added another evolutionary step to
the artificial intelligence toolbox. As a subset of ML, DL involves the use of neural
networks to analyze and learn from data, leveraging their unique ability to learn hier‐
archical representations of complex patterns. This allows DL algorithms to excel
at tasks that require understanding and decision-making, such as image recognition,
object detection, and segmentation in computer vision applications.
Looking at Machine Learning (ML) compared to Deep Learning (DL), many people
assume that they’re one and the same. However, while both techniques share some
similarities, they also have major differences in the way they model reality. One key
difference lies in their depth - ML algorithms can be shallow or deep, whereas DL
specifically refers to the use of neural networks with multiple layers. This added com‐
plexity gives DL its unique ability to learn complex patterns and relationships in data.
But what about the complexity itself? In most cases, DL algorithms are indeed more
complex, and therefore computationally more expensive, than ML algorithms. This is
because they require larger amounts of data to train and validate models, whereas
ML can often work with smaller datasets. And yet, despite these differences, both ML
and DL have a wide range of applications across various fields - from image classifi‐
cation and speech recognition to predictive analytics and game playing AI. The key
difference lies in their suitability for specific tasks: while ML is well-suited for more
straightforward pattern recognition, DL shines when it comes to complex problems
that require hierarchical representations of data. Machine Learning encompasses a
broader range of techniques and algorithms, while Deep Learning specifically focuses
on the use of neural networks to analyze and learn from data.
Generative AI: The Future of Content Generation
The advances in deep learning have laid the groundwork for Generative AI. Genera‐
tive AI is all about generating new content, such as text, images, code, and more.
This area has received the most attention in recent years, mainly because of its
impressive demos and results around text generation and live chats. Generative AI
is considered both a distinct research discipline and an application of deep learning
(DL) techniques to create new behaviors. As a distinct research discipline, Gen AI
integrates a wide range of techniques and approaches that focus on generating origi‐
nal content, such as text, images, audio, or videos. Researchers in this field explore
various new methods for training models to generate coherent, realistic, and often
creative outputs that come very close to perfectly mimicking human-like behavior.
At its center, Gen AI uses neural networks, enriching them with specialized archi‐
tectures to further improve the results DL can already achieve. For instance, convolu‐
tional neural networks (CNNs) are used for image synthesis, where complex patterns
and textures are learned from unbelievably large datasets. This allows generative AI
to produce almost photorealistic images that are closer to being indistinguishable
from real-world counterparts than ever before. Similarly, recurrent neural networks
(RNNs) are employed for language modeling, enabling GenAI to generate coherent
and grammatically correct text. Think about it as a Siri 2.0. With the addition of
transformer architectures for text generation, generative AI can efficiently process
sequential data and respond in almost real time. In particular, the transformer archi‐
tecture has changed the field of NLP and Large Language Models by introducing a
more efficient and effective architecture for sequence-processing tasks. The core innovation
is the self-attention mechanism, which allows the model to weigh different parts
of the input sequence simultaneously, enabling it to capture long-range
dependencies and context information. This is enhanced by an encoder-decoder
architecture, where the encoder processes the input sequence and generates a contex‐
tualized representation, and the decoder generates the output sequence based on this
representation.
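To make the self-attention idea less abstract, here is a simplified sketch in plain Java of scaled dot-product attention for a single query vector. It is our own illustrative code, not taken from any transformer library, and it ignores the multi-head and masking details that real models use.

    // Simplified scaled dot-product attention for one query vector q:
    // attention(q, K, V) = softmax(q · K^T / sqrt(d)) · V
    static double[] attention(double[] q, double[][] keys, double[][] values) {
        int n = keys.length, d = q.length;
        double[] scores = new double[n];
        for (int i = 0; i < n; i++) {
            double dot = 0;
            for (int j = 0; j < d; j++) dot += q[j] * keys[i][j];
            scores[i] = dot / Math.sqrt(d);              // similarity of q to each token's key
        }
        double max = Double.NEGATIVE_INFINITY;           // softmax turns scores into weights
        for (double s : scores) max = Math.max(max, s);
        double sum = 0;
        double[] weights = new double[n];
        for (int i = 0; i < n; i++) { weights[i] = Math.exp(scores[i] - max); sum += weights[i]; }
        for (int i = 0; i < n; i++) weights[i] /= sum;
        double[] out = new double[values[0].length];     // weighted mix of all value vectors
        for (int i = 0; i < n; i++)
            for (int j = 0; j < out.length; j++) out[j] += weights[i] * values[i][j];
        return out;
    }

The weights computed here are the “attention” each token receives: tokens whose keys are most similar to the query contribute most to the output, which is how the model picks up long-range dependencies.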
Beyond neural networks, Gen AI also leverages generative adversarial networks
(GANs) to create new data samples. GANs consist of two components: a generator
network that produces new data samples and a discriminator network that evaluates
the generated samples. This approach ensures that the generated data is not only real‐
istic but also diverse and meaningful. Variational autoencoders (VAEs) are another
type of DL model used by GenAI for image and audio generation. VAEs learn to
compress and reconstruct data. This capability enables applications that generate
high-quality audio samples simulating real-world sounds or even produce images
that blend the styles of different artists. By combining DL techniques with new data
chunking and transforming approaches, gen AI pushed applications a lot closer to
being able to produce human-like content.
Beyond the advancements in research, the ongoing development of more sophisticated
computing hardware has also significantly contributed to the visibility of generative AI,
namely Floating-Point Units (FPUs), Graphics Processing Units (GPUs), and
Tensor Processing Units (TPUs). An FPU excels at tasks like multiplying matrices,
using specific math functions, and normalizing data. Matrix multiplication is a funda‐
mental part of neural network calculations, and FPUs are designed to do this super
fast. They also efficiently handle activation functions like sigmoid, tanh, and ReLU,
which enables the execution of complex neural networks. Additionally, FPUs can
perform normalization operations like batch normalization, helping to stabilize the
learning process.
GPUs, originally designed for rendering graphics, have evolved into specialized
processors that excel in machine learning tasks due to their unique architecture.
By leveraging multiple cores to process many tasks simultaneously, GPUs provide
parallel processing capabilities that are particularly well-suited for handling
large amounts of data. TPUs are custom-built ASICs (Application-Specific Integrated
Circuits) specifically designed for accelerating machine learning and deep learning
computations, particularly matrix multiplications and other deep learning operations.
The speed and efficiency gains provided by FPUs, GPUs, and TPUs have a direct
impact on the overall performance of machine learning models, both for training and
for querying them.
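The following naive Java matrix multiplication is a rough illustration of why this hardware matters: every output cell is computed independently, so specialized chips can calculate thousands of them in parallel, while a plain CPU loop like this one works through them one at a time. This is our own sketch, not production numerics code.

    // Naive matrix multiplication C = A x B. Each c[i][j] depends only on row i of A
    // and column j of B, so all cells could be computed in parallel on a GPU or TPU.
    static double[][] multiply(double[][] a, double[][] b) {
        int rows = a.length, cols = b[0].length, inner = b.length;
        double[][] c = new double[rows][cols];
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                for (int k = 0; k < inner; k++)
                    c[i][j] += a[i][k] * b[k][j];
        return c;
    }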
One particular takeaway from this is that running LLMs on local developer machines
is quite challenging. These models are often very large in size, requiring significant
computational resources that can easily overwhelm not just the CPU, but also mem‐
ory and disk space on typical development machines. This makes working with
such models difficult in local environments. In later chapters, particularly Chapter 4,
we will dive into model classification and explore strategies to overcome this issue.
One such approach is model quantization, a technique that reduces the size and
complexity of models by lowering the precision of the numbers used in calculations,
without sacrificing too much accuracy. By quantizing models, you can reduce their
memory footprint and computational load, making them more suitable for local
testing and development, while still keeping them close enough to the performance
you’d expect in production.
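As a rough illustration of what quantization does, the sketch below compresses 32-bit floating-point weights into 8-bit integers plus a single scale factor. Real quantization schemes (per-block scales, 4-bit formats, and so on) are more sophisticated, but the core trade of precision for memory is the same. The method names here are ours, for illustration only.

    // Symmetric 8-bit quantization: store each float weight as a byte plus one
    // shared scale factor, cutting memory use to roughly a quarter.
    static byte[] quantize(float[] weights, float[] scaleOut) {
        float maxAbs = 0;
        for (float w : weights) maxAbs = Math.max(maxAbs, Math.abs(w));
        float scale = maxAbs == 0 ? 1f : maxAbs / 127f;
        scaleOut[0] = scale;
        byte[] quantized = new byte[weights.length];
        for (int i = 0; i < weights.length; i++) {
            quantized[i] = (byte) Math.round(weights[i] / scale);
        }
        return quantized;
    }

    // Approximate reconstruction of the original weight; the rounding error is
    // the accuracy traded away for the smaller footprint.
    static float dequantize(byte quantized, float scale) {
        return quantized * scale;
    }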
Open-Source Models and Training Data
One very important piece of the AI ecosystem is open source models. What you
know and love from source code and libraries is something less common in the world
of machine learning but has been gaining a lot more attention lately.
Why Open Source is an Important Driver for Gen AI
A simplified view of AI models breaks them down into two main parts. First, there’s
a collection of mathematical functions, often called “layers,” that are designed to solve
specific problems. These layers process data and make predictions based on the input
they receive. The second part involves adjusting these functions to work well with the
training data. This adjustment happens through a process called “backpropagation,”
which helps the model find the best values for its functions. These values, known as
“weights,” are what allow the model to make accurate predictions. Once a model is
trained, it consists of two main parts: the mathematical functions (the neural network
itself) and the weights, which are the learned values that allow the model to make
accurate predictions. Both the functions and the weights can be shared or published,
much like source code in a traditional software project. However, sharing the training
data is less common, as it is often proprietary or sensitive. As you might imagine,
open sourcing of the necessary amounts of data to train the most capable models
out there is something not every vendor would want to do. Not only might it
cost them their competitive advantage, there is also speculation about the proper
attribution and usage rights for some of the largest models out there. For the purpose
of this book, we do use Open Source Models only. Not only because of the mostly
hidden usage restrictions or legal limitations but also because we, the authors, believe
that Open Source is an essential part of software development and the open source
community is a great place to learn.
Open Source Training Data
As you may have guessed, the training data is the ultimate factor that makes a model
capable of generating specific outputs. If you train a model on legal paperwork, it
will not perform well enough for sports predictions. The domain
and context of the training data is crucial for the success of a model. We’ll talk about
picking the right model for certain requirements and the selection process in chapter
two, but note that it is generally important to understand the impact of data quality
for training the models. Low-quality data can lead to a range of problems, including
reduced accuracy, increased error rates, overfitting, underfitting, and biased outputs.
Overfitting happens when a model learns the specific details of the training data
so well that it fails to generalize to new, unseen data. This means that the model
will perform very poorly on test or validation data, which is drawn from the same
population as the training data but was not used during training.
In contrast to that, an underfitted model is like trying to fit a square peg into a
round hole - it just doesn’t match up with the true nature of the data. As a result, the
model fails to accurately predict or classify new, unseen data. In this context, data
that is messy or contains errors is called “noisy”; it makes
it harder for AI models to learn accurately. For example, if you’re training a model
to recognize images of cats, “noisy” data might include blurry pictures, mislabelled
images, or photos that aren’t even cats. This kind of incorrect or irrelevant data can
confuse the model, leading it to make mistakes or give inaccurate results. In addition,
data that is inconsistent, like missing values or using different formats for the same
kind of information, can also cause problems. If the model doesn’t have clean, reliable
data to learn from, its performance will suffer, resulting in poor or biased predictions.
For instance, if an AI model is trained on data that includes biased or stereotypical
information, it can end up making unfair decisions based on those biases, which
could negatively impact people or groups.
You can mitigate these risks by prioritizing data quality from the beginning. This
involves collecting high-quality data from the right sources, cleaning and preprocess‐
ing the data to remove noise, outliers, and inconsistencies, validating the data to
ensure it meets required standards, and regularly updating and refining the model
using new, high-quality data. And as you may have guessed already, this is some‐
thing that developers will only rarely do themselves but absolutely need to be aware of.
Especially if they observe that their models are not performing as expected. A very
simple example why this is relevant to you can be JSON processing for what is called
“function calling” or “agent integration”. While we will talk about this in Chapter 10
in more detail, you need to know that a model that has not been trained on JSON
data will not be able to generate it. This is a very common problem that developers
face.
Adding Company specific Data to LLMs
Beyond the field of general purpose skills for large language models, there is also
a growing need for task specific optimizations in certain applications. These can
range from small scale edge scenarios with highly optimized models to larger scale
enterprise level solutions. The most powerful feature for business applications is to
add company specific data to the model. This allows it to learn more about the
context of the problem at hand, which in turn improves its performance. What
sounds like a job comparable to a database update is indeed more complex. There are
different approaches to this which provide different benefits. We will look at training
techniques that can be used for this later in chapter two when we talk about the
classification of LLMs and will talk about architectural approaches in chapter three.
For now it is essential to keep in mind that no serious business application is
possible without proper integration of business-relevant data into the AI-infused
applications.
Explainable and transparent AI decisions
Another advantage of open source models relates to the growing need for explainable
and transparent decision making by models. With recent incidents caused by
corporate chatbots being only individual examples, it is important that companies
understand how their models work and can trust their output. As AI becomes
more relevant in all areas, people want to know how decisions are made and what
factors influence them. Instead of treating models like black-boxes, transparency and
openness builds trust in AI systems. On top, governments and regulatory bodies are
starting to require a certain level of transparency from AI-driven decision-making
processes, especially in healthcare, finance, and law enforcement. Lastly, there are also
growing concerns about the potential for bias and unfair treatment. While there
are technical approaches to explaining model behaviour and giving insights into
it, they aren’t perfect, and a certain amount of additional safeguards is still
required. We will talk about this in chapter three.
Ethical and Sustainability Considerations
While explainability of results is one part of the challenge, there are also a lot of
ethical considerations. The most important thing to remember is that AI models are
defined by the underlying training data. This means that AI systems will always
be biased towards the training data. This doesn’t seem to carry a risk on first sight
but there is a lot of potential for bias. For example, a model trained on racist com‐
ments might be biased towards white people. A model trained on political comments
might be biased towards democrats or republicans. And these are just two obvious
examples. AI models will reflect and reinforce societal biases present in the data they
are trained on. UNESCO released recommendations on AI ethics. This is a good
starting point for understanding the potential biases that models might have.
But there are other thoughts that need to be taken into account when working
with AI-infused applications. The energy consumption of large model deployments is
dramatic; it is our duty as software architects and developers to pay close attention
to executing and measuring the sustainability of these systems. While there is a
growing movement to direct AI usage towards good uses, towards the sustainable
development goals, for example, it is important to address the sustainability of devel‐
oping and using AI systems. A study by Strubell et al. illustrated that the process of
training a single, deep learning, NLP model on GPUs can lead to approx. 300 tons
of carbon dioxide emissions. This is roughly the equivalent of five cars over their
lifetime. Other studies looked at Google’s AlphaGo Zero, which generated almost 100
tons of CO2 over 40 days of training which is the equivalent of 1000 hours of air
travel. In a time of global warming and commitment to reducing carbon emissions, it
is essential to ask the question of whether using algorithms for simplistic tasks is
really worth the cost.
The lifecycle of LLMs and ways to influence their
behaviour
Now that you know a bit more about the history of AI and the major components
of LLMs and how they are built, let’s take a deeper look at the lifecycle of LLMs and
how we can influence their behaviour as outlined in Figure 1-2.
Figure 1-2. Training, Tuning, Inference.
You’ve already heard about training data, so it should come as no surprise that at the
heart of the lifecycle lies something called the training phase. This is where LLMs
are fed unbelievable amounts of data to learn from and adapt to. Once an LLM has
been trained, it is somewhat of a general purpose model. Usually, those models are also
referred to as foundation models. In particular, if we look at very large models like
Llama 3, their execution requires huge amounts of resources and they are
generally exceptionally good at general purpose tasks. The next phase a model usually
goes through is known as tuning. This is where we adjust the model’s parameters
to optimize its performance on specific tasks or datasets. Through the process of
hyperparameter tuning, model architects can fine-tune models for greater accuracy,
efficiency, and scalability. This is generally called “hyperparameter optimization” and
includes techniques like: grid search, random search, and Bayesian methods. We do
not dive deeper into this in this book as both training and traditional fine-tuning are
more a Data Scientist’s realm. You can learn more about this in Natural Language
Processing with Transformers, Revised Edition. However we do cover two very spe‐
cific and more developer-relevant tuning methods in chapter Two, most importantly
prompt tuning and alignment tuning with InstructLab.
The last and probably most well known part of the lifecycle is inference, which is
another word for “querying” a model. The word “inference” itself derives from the
Latin inferre, “to bring in” or “to conclude”. In the context of LLMs, “inference” refers
to the process of drawing conclusions from observations or premises, which is a much
more accurate description of what a model actually delivers. There are several ways to “query” a
model and they can affect the quality and accuracy of the results, so it’s important to
understand the different approaches. One key aspect is how you structure your query;
this is where prompt engineering comes into play. Prompt engineering involves
crafting the input or question in a way that guides the model toward providing the
most useful and relevant response. Another important concept is data enrichment,
which refers to enhancing the data the model has access to during its processing.
One powerful technique for this is RAG (Retrieval-Augmented Generation), where
the model combines its internal knowledge with external, up-to-date information
retrieved from a database or document source. In chapter Three, we will explore
these techniques in more detail.
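To give you a first feel for inference before we get to the dedicated chapters, here is a minimal sketch that sends a prompt to a locally running, OpenAI-compatible chat endpoint using only the JDK’s HTTP client. The URL, port, and model name are placeholder assumptions (many local runtimes expose such an endpoint); treat it as an illustration of the request and response shape rather than a recommended integration.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class FirstInferenceCall {
        public static void main(String[] args) throws Exception {
            // Prompt engineering starts here: the "content" string is the prompt.
            String body = """
                {
                  "model": "local-model-name",
                  "messages": [
                    {"role": "user", "content": "In one sentence, what is inference?"}
                  ],
                  "temperature": 0.2
                }
                """;
            // Placeholder endpoint for a local OpenAI-compatible server; adjust to your setup.
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:11434/v1/chat/completions"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body()); // raw JSON answer from the model
        }
    }

In later chapters we replace this hand-rolled call with proper client libraries, enrich the prompt with retrieved context, and map the response to typed objects.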
For now it is important to remember that models undergo a lifecycle within software
projects. They are not static and should not be treated as such. While inferencing
a model does not change model behavior in any way, models are only knowledgeable
up to the so-called “cutoff date” of their training data. If new information occurs or
existing model “knowledge” needs to be changed, the weights ultimately will have
to be adjusted. Either fine-tuned or re-trained. While this initially sounds like a
responsibility for a Data Science Team, it is not always possible to draw straight lines
between the ultimate responsibilities of the Data Science Team and the actual application
developers. This book does draw a clear line though, as we do not cover training
at all. We do however look in more detail into tuning techniques and inferencing
architectures. But how do these teams work together in practice?
MLOps vs DevOps
Two important terms have been coined during the last few years to describe the
way modern software development and production operations happen. The first
is DevOps, a term coined in 2009 by Patrick Debois to refer to “development” and
“operations”. The second is Machine Learning Operations, or MLOps, initially used
by David Aronchick in 2017. MLOps is a derived term and basically describes the
application of DevOps principles to the Machine Learning field. The most obvious
difference is the central artifact they are grouped around. The DevOps team is
focused on business applications and the MLOps team is more focused on Machine
Learning models. Both describe the process of developing an artifact and making it
ready for consumption in production.
DevOps and MLOps share many similarities, as both are focused on streamlining and
automating workflows to ensure continuous integration (CI), continuous delivery
(CD), and reliable deployment in production environments. Figure 1-3 describes one
possible combination of DevOps and MLOps.
Figure 1-3. DevOps and MLOps
The shared practices, such as cross-functional collaboration, using Git as a single
source of truth, repeatability, automation, security, and observability are at the
core. Both DevOps and MLOps rely on collaboration between developers, data
scientists, and operations teams to ensure that code, models, and configurations
are well-coordinated. Automation and repeatability are emphasized for building,
testing, and deploying both applications and models, ensuring consistent and reliable
results. However, MLOps introduces additional layers, such as model training and
data management, which are distinct from typical DevOps pipelines. The need to
constantly monitor models for drift and ensure their performance over time adds
complexity to MLOps, but both processes share a focus on security and observability
to maintain trust and transparency in production systems.
The two approaches are ultimately deeply intertwined and complement each other.
To make things even more complicated, MLOps is kind of a catch-all term. There
are plenty of other closely related terms, including ModelOps, LLMOps, and
DataOps. They all refer to a similar set of practices in different combinations. There
is no single “correct” way to combine DevOps and MLOps because the approach will
depend on the organization, the teams involved, and their established practices. Some
teams may prefer closer integration between data science and development, while
others may want clear divisions between model development and application devel‐
opment. Organizational structure, team expertise, and project goals play significant
roles in how the workflows are integrated.
For example, in a smaller organization, data scientists and developers might work
more closely, sharing codebases and automating model deployment alongside appli‐
cation code. In larger organizations, teams might be more specialized, requiring
distinct processes for model management and software engineering, leading to a
more modular approach to integration.
In summary, DevOps and MLOps work together by borrowing best practices from
each other, but with the added layers of data management, model training, and
continuous monitoring in MLOps. The right approach will depend on team collabo‐
ration, project complexity, and organizational needs.
Conclusion
In conclusion, the development and deployment of Large Language Models (LLMs)
require a solid understanding of the training, tuning, and inference processes
involved. As the field of MLOps continues to evolve, it is essential to recognize
the key differences between DevOps and MLOps, with the latter focusing on
the specific needs of machine learning model development and deployment. By
acknowledging the intersecting approaches required for cloud-native application
and model development, teams can effectively collaborate across disciplines and
bring AI-infused applications successfully into production.
Chapter Two will introduce you to various classifications of LLMs and unveil more
of their inner workings. We’ll provide an overview of the most common taxonomies
used to describe these models. We will also dive into the mechanics of tuning these
models, breaking down the differences between alignment tuning, prompt tuning,
and prompt engineering.
CHAPTER 2
The New Types of Applications
A Note for Early Release Readers
With Early Release ebooks, you get books in their earliest form—the author’s raw and
unedited content as they write—so you can take advantage of these technologies long
before the official release of these titles.
This will be the 2nd chapter of the final book. Please note that the GitHub repo will be
made active later on.
If you have comments about how we might improve the content and/or examples in
this book, or if you notice missing material within this chapter, please reach out to the
editor at mpotter@oreilly.com.
Java developers have spent decades refining best practices for building scalable, main‐
tainable, and performant applications. From enterprise web services to cloud-native
microservices, the language and its ecosystem have been shaped by the needs of
real-world applications. Now, with generative AI (GenAI) and other AI-infused
capabilities, new types of applications are becoming more prominent and require
additional knowledge, architecture, and tooling.
We hope that you already understand that GenAI is not a radical break from past
advancements but rather an evolution of AI research in deep learning combined with
the foundations of software engineering. Just as Java developers have adapted to the
shift from monoliths to microservices and from imperative to reactive programming,
they now face the challenge of integrating AI models into their applications in a way
that aligns with the principles they already know: modularity, scalability, testability,
and maintainability.
To effectively use AI in Java applications, an understanding of the fundamental
components that make these systems work is not only helpful but necessary. In this
chapter, we will break down the key aspects of AI integration:
Understanding Large Language Models
LLMs are a special class of AI models trained on vast amounts of text data to
perform natural language processing tasks. We will explore how they generate
responses and their limitations, and introduce the details you need to be able
to classify models and choose the right ones for your requirements.
Understanding Model Types
Not all AI models are created equal. While generative models like large language
models (LLMs) and diffusion models capture the most attention, they are just
one part of the AI landscape. We will explore different types of models, including
classifiers, embeddings, and retrieval-augmented generation (RAG), and how
they map to real-world application needs.
Supporting Technologies
AI models do not run in isolation; they rely on a rich ecosystem of tools and
frameworks. From vector databases that store and retrieve knowledge efficiently
to APIs that expose models as services, understanding the AI stack is crucial
for Java developers who want to build applications that are both powerful and
maintainable.
Teaching Models New Tricks
Unlike traditional software, AI-infused applications improve through different
means: fine-tuning, prompt engineering, retrieval augmentation, and reinforce‐
ment learning. We’ll discuss these techniques and their trade-offs, particularly in
enterprise environments where control and customization are key.
It sounds like a lot of ground to cover but we promise to cut down where possible and
equip you with the most basic knowledge.
Understanding Large Language Models
As a Java developer, you might be used to working with structured data, type-safe
environments, and explicit control over program execution. Large Language Mod‐
els (LLMs) operate in a completely different way. Instead of executing pre-defined
instructions like a Java method, they generate responses probabilistically based on
learned patterns. You can think of an LLM as a powerful autocomplete function on
steroids—one that doesn’t just predict the next character but understands the broader
context of entire conversations.
If you’ve ever worked with compilers, you know that source code is transformed into
an intermediate representation before execution. Similarly, LLMs don’t process raw
text directly; instead, they convert it into numerical representations that make com‐
putations efficient. You can compare this to Java bytecode—while human-readable
Java code is structured and understandable, it’s the compiled bytecode that the JVM
actually executes. In an LLM, tokenization plays a similar role: it translates human
language into a numerical format that the model can work with.
Another useful comparison is how Java Virtual Machines (JVMs) manage Just-In-
Time (JIT) compilation. A JIT compiler dynamically optimizes code at runtime
based on execution patterns. Similarly, LLMs adjust their text generation dynamically,
predicting words based on probability distributions instead of following a hardcoded
set of rules. This probabilistic nature allows them to be flexible and creative but
also means they can sometimes produce unexpected or incomplete results. Now, let’s
break down their components, starting with the key elements.
Key Elements of a Large Language Model
LLMs rely on several foundational elements that define their effectiveness and applic‐
ability. While their training data is important, there are other elements that
also play a role. For instance, attention mechanisms allow models to weigh the
importance of different words in a sequence, while tokenization strategies determine
how efficiently input is processed. Additionally, factors like context length, memory
constraints, and computational efficiency decide how well an LLM can handle com‐
plex prompts and interactions. A base understanding of these core components is
necessary to successfully integrate these new features into applications because they
influence performance, scalability, and overall user experience.
How LLMs Generate Responses
At a high level, LLMs process text input (Natural Language Understanding, NLU)
and generate meaningful responses (Natural Language Generation, NLG). There are
different acronyms and terms you need to know in order to understand model
use-cases and decide which model to use for your specific application.
Natural Language Understanding (NLU)
focuses on interpreting and analyzing human input. It is meant for tasks like
intent recognition, entity extraction, text classification, and sentiment analysis.
This is conceptually similar to how Java applications parse JSON or XML,
extracting key data for business logic. If you are building an AI-powered search
or recommendation system, an encoder-based model (e.g., BERT) optimized for
NLU is typically a good fit.
Natural Language Generation (NLG)
is responsible for constructing meaningful and coherent responses. This is use‐
ful for chatbots, report generation, and text summarization. Conceptually, this
mirrors Java’s templating engines (e.g., Thymeleaf, Freemarker) that dynamically
generate output based on structured input. Decoder-based models (e.g., GPT) are
more suited for these tasks.
Tokenization
Before processing, LLMs break input and output into smaller chunks (tokens),
similar to how Java tokenizes strings using StringTokenizer or regex. Token
limits affect how much context a model can “remember” in a single request.
When outputting tokens, the model adds a degree of randomness, injecting
non-deterministic behaviour. This randomness is added to simulate the
process of creative thinking, and it can be tuned using a model parameter called
temperature (see the sketch after this list).
Self-Attention Mechanism (Transformer Architecture)
Large Language Models (LLMs) built on transformers use a self-attention mecha‐
nism to determine which words (tokens) matter most within a sentence. Instead
of treating each word equally, the model assigns higher importance—or “atten‐
tion”—to key words based on their relevance to the overall meaning. You can
think of it as similar to dependency resolution in Java build tools: when Maven
or Gradle resolves conflicting library versions, it actively selects the most critical
dependency based on scopes or position in the dependency tree. Just as certain
dependencies take precedence in Java builds, specific tokens receive more atten‐
tion within transformer models.
Context Windows
The context window is analogous to a buffer size or a stack frame in Java. Just
as a method in Java has a limited stack frame size to store local variables, an
LLM has a fixed memory space to store input and output tokens. For example,
an LLM with a 4K-token context window (such as GPT-3.5) can process roughly
3,000 words at a time before discarding older tokens. Larger models (e.g., GPT-4
Turbo, Claude 3 Opus) support 128K+ tokens, which allow for much longer
interactions without losing past context.
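Because exact token counts depend on a model’s tokenizer, during development you often only need a rough estimate of how much of the context window a prompt consumes. The following minimal sketch uses the common approximation of about four characters per token to check a prompt against an assumed 4K-token window; both the heuristic and the window size are assumptions, not exact values for any particular model.

public class TokenBudget {

    // Rough heuristic: ~4 characters per token for English text.
    // Real tokenizers (BPE, SentencePiece) produce different counts.
    private static final double CHARS_PER_TOKEN = 4.0;

    static int estimateTokens(String text) {
        return (int) Math.ceil(text.length() / CHARS_PER_TOKEN);
    }

    public static void main(String[] args) {
        String prompt = "Explain Java's garbage collection in one paragraph.";
        int contextWindow = 4096; // assumed limit, e.g. a 4K-token model

        int estimated = estimateTokens(prompt);
        System.out.printf("Estimated tokens: %d of %d available%n",
                estimated, contextWindow);

        if (estimated > contextWindow) {
            System.out.println("Prompt likely exceeds the context window; trim or summarize it.");
        }
    }
}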
You’ve read about the basic terms now but there is more to know about how models
work. The most important part is the underlaying model architecture. You don’t need
to remember all of this for now. We just don’t want you to be surprised when we use
certain descriptions later on. Treat the following overview as a place to revisit when
you stumble over something later in the book.
Model Architectures
Just as Java frameworks are designed for specific workloads (Quarkus for microser‐
vices, Lucene for search, and Jackson for JSON processing), different types of AI
models are optimized for specific use cases. LLMs generally fall into three categories:
encoder-only, decoder-only, and encoder-decoder models, each with unique charac‐
teristics.
Encoder-only models. Models such as BERT, RoBERTa, and E5 are designed for understanding
text rather than generating it. These models process entire inputs at once, extract‐
ing semantic meaning and relationships between words. They are widely used in
retrieval-augmented generation (RAG) pipelines, where their ability to generate vec‐
tor embeddings enables semantic search in vector databases like Weaviate, Pinecone,
and FAISS. By converting text into numerical representations, these models enhance
enterprise search by retrieving relevant documents based on meaning rather than
keywords. You can integrate encoder models with traditional Lucene-based search
engines to combine lexical and semantic retrieval techniques, improving the accuracy
and relevance of search results.
Beyond search, encoder models are also valuable in classification tasks, such as intent
recognition for chatbots, spam detection, and fraud analysis. They enable named
entity recognition (NER) and information extraction, making them useful in docu‐
ment processing applications where structured data must be extracted from legal,
financial, or compliance-related texts. In recommendation systems, these models
generate embeddings that help match users with relevant articles, documentation, or
products, improving personalization. Security applications also benefit from encoder
models, as they can classify logs and detect anomalies in system monitoring and
fraud prevention workflows.
Decoder-only models. Models such as GPT, LLaMA, and Mistral focus on text generation.
Unlike encoders, which analyze entire inputs at once, decoders generate text one
token at a time, predicting the next word based on prior context. This makes them
ideal for chatbots and conversational AI, where dynamic, context-aware responses
are necessary. Java applications integrating AI-powered customer support can use
decoder models to generate replies, assist agents with suggested responses, and
provide automated insights. In software development, decoder models are widely
used in code generation and auto-completion, helping developers by predicting
Java code snippets, completing function calls, and even explaining complex code
in natural language. Java-based enterprise applications can also leverage these models
for report generation and content automation, creating summaries, legal documents,
and personalized customer communications. In text rewriting and summarization,
decoder models can be applied to simplify, paraphrase, or expand content dynami‐
cally, enhancing content creation workflows.
Encoder-decoder models. Models such as T5, BART, and FLAN-T5 combine the strengths
of both architectures, making them particularly effective for structured input-to-
output transformations. Unlike decoder-only models that generate text sequentially,
encoder-decoder models first process input using an encoder, then generate struc‐
tured output using a decoder. This design is well-suited for machine translation,
enabling Java applications to support multilingual users by translating UI elements,
emails, and user-generated content in real time. Documentation localization is
another practical use case, allowing businesses to translate software manuals and
API documentation efficiently. In text summarization, these models extract key
information from large documents, such as legal contracts, financial reports, or mon‐
itoring logs, making complex information easier to review. Java developers working
with knowledge management systems can use encoder-decoder models to refine,
paraphrase, and restructure content, ensuring clarity and consistency in enterprise
communications.
Recent advancements in LLM architectures focus on improving efficiency without
sacrificing performance. Techniques such as Mixture of Experts (MoE), used in
models like GPT-4.5 and Gemini 2.5, selectively activate only a portion of the model
parameters during inference, reducing computational overhead while maintaining
high accuracy. This approach is conceptually similar to lazy-loading mechanisms in
Java frameworks, where resources are only loaded when needed. Quantization and
model distillation allow developers to deploy smaller, resource-efficient versions of
large models without significant loss of accuracy, much like JVM optimizations that
improve runtime performance. Emerging memory-efficient techniques, such as flash
attention and sparse computation, further reduce hardware costs, akin to Java’s use of
memory-mapped files for optimizing performance in high-throughput applications.
Selecting the right model depends on the specific needs of an application. Java devel‐
opers integrating semantic search in a RAG pipeline will benefit most from encoder-
only models like BERT or E5. Applications requiring chat-based interactions, code
suggestions, or dynamic content generation are best suited for decoder-only models
such as GPT or LLaMA. For tasks involving machine translation, structured docu‐
ment transformation, and summarization, encoder-decoder models like T5 or FLAN-
T5 provide the best results. Understanding these architectures allows developers to
make informed decisions, balancing accuracy, efficiency, and cost while integrating
AI into enterprise Java applications.
Size and Complexity
Large Language Models (LLMs) come in a variety of sizes, typically measured by
the number of parameters. Parameters are basically the internal numerical values
that define how well a model can predict and generate text. Just as a Java developer
carefully selects the right database, caching strategy, or framework to balance per‐
formance and scalability, choosing the right LLM size ensures efficient inference,
cost-effectiveness, and deployment feasibility.
Smaller models, generally in the 7 billion to 13 billion parameter range (e.g., Mistral
7B, TinyLlama), are optimized for local execution and require minimal computa‐
tional resources. These models are well-suited for applications that need low-latency
responses, such as edge AI, embedded systems, or lightweight chatbot applications.
Running such a model locally is comparable to using an embedded database like
SQLite—it is efficient, self-contained, and practical for single-user workloads.
Medium-sized models, ranging from 30 billion to 65 billion parameters (e.g., LLaMA
65B), provide better contextual awareness and accuracy but demand higher mem‐
ory and GPU resources. They are ideal for server-side deployment in enterprise
applications, powering AI-driven customer service bots, enterprise search, document
summarization, and intelligent automation tools. Their infrastructure footprint is
similar to managing a Redis caching layer or a lightweight microservice cluster, where
performance optimization is essential to avoid excessive resource consumption.
What’s in a Name - Model Naming
Model names often include suffixes that signal their intended use case or level of opti‐
mization. “Base” models are unmodified foundation models trained on large-scale
data sets without specific fine-tuning. “Instruct” or “Chat” variants are adapted for
interactive conversation tasks, making them ideal for chatbot development. “Code”
models are fine-tuned on programming languages, making them useful for code
completion, bug fixing, and AI-assisted software development. Other common suf‐
fixes like “QA” (question-answering) and “RAG-ready” (Retrieval-Augmented Gener‐
ation optimized) indicate models specifically tuned for enterprise knowledge retrieval
and document-based AI workflows.
At the highest tier, large-scale models exceeding 175 billion parameters, such as
GPT-4, Claude 3, and Gemini Ultra, require specialized hardware and distributed
inference. These models deliver superior contextual reasoning, multi-turn conversa‐
tion capabilities, and complex problem-solving. However, the infrastructure demands
are immense, requiring cloud-based inference solutions due to their size and energy
consumption. Using these models is akin to operating a distributed system like
Apache Kafka or Elasticsearch, where scalability and resource allocation are primary
concerns. Most Java developers interacting with large models will do so via cloud
APIs, integrating them into applications without the need for direct infrastructure
management.
Wait: What does “7 Billion Parameters” even mean here? When we say that Mistral 7B
has 7 billion parameters, we are referring to the total number of trainable weights that
define the model’s behavior. These parameters are stored in tensors and define how
a model processes input data and generates output, similar to how
Java developers configure class variables and constants that dictate an application’s
behavior. In mathematical terms, an LLM is essentially a giant function with billions
of parameters, and these parameters exist as multi-dimensional tensors. A simple
analogy would be how Java handles matrices using multi-dimensional arrays. Sup‐
pose we have a Java program for image processing that uses a 3D array to represent
an RGB image:
int[][][] image = new int[256][256][3]; // A 256x256 image with 3 color channels
In deep learning, tensors work similarly but at a much larger scale. A single LLM
layer could have weight tensors shaped like [12288, 4096], meaning it has 12,288
input features and 4,096 output features. This is much like a huge adjacency matrix,
where each weight value determines how one input influences an output. Working
with pre-trained LLMs means working with tensor weights stored in formats like
Safetensors or GGUF (more on this later). These formats efficiently load precompu‐
ted parameters into memory, similar to how Java loads compiled bytecode into the
JVM for execution.
And tensors come in different precisions. While a model’s parameter count defines
its capacity for reasoning, contextual depth, and overall accuracy, the tensor type
determines how efficiently those parameters are stored, loaded, and processed. The
higher the precision of the tensor type, the more memory and computational power
is required per parameter. Conversely, lower-precision tensors allow for compression
and faster execution, enabling larger models to run on smaller hardware.
Tensors come in full-precision (FP32 or FP16) and quantized (INT8, INT4) versions.
For large-scale models exceeding 175B parameters, full-precision inference is only
feasible on massively distributed systems; think of how distributed databases partition
and process large datasets. For smaller or local deployments, INT8 or INT4
quantization reduces the memory footprint while keeping functional accuracy.
Optimizing Model Size with Quantization and Compression. Going from one precision to
another is something you can think of as optimizing JVM memory usage and garbage
collection settings to improve application performance. For LLMs the techniques
used are called quantization and compression. Quantization reduces the precision
of model weights, typically from 32-bit floating point (FP32) to 16-bit (FP16), 8-bit
(INT8), or even 4-bit (INT4) representations.
Compression techniques such as weight pruning and distillation further reduce
model size. Weight pruning removes less critical parameters, effectively shrinking
the model while maintaining most of its predictive capabilities. Distillation, on the
other hand, involves training a smaller “student” model to mimic a larger “teacher”
model, capturing its behavior while being far more efficient. Think of it as something
similar to JIT optimizations in the JVM or the use of compressed indexes in search
engines, where efficiency is achieved without sacrificing too much accuracy.
In summary, understanding parameter counts and the precision they are stored in helps
you optimize performance and hardware requirements, as the following list and the
sketch after it illustrate:
Memory Considerations
The 7B parameters must be loaded into GPU VRAM or RAM. Using FP16
tensors instead of FP32 reduces memory usage by half.
Inference Speed
Larger models require more tensor computations per token generated. Using
quantized INT8 or INT4 models reduces processing time at the cost of slight
accuracy loss.
Context Windows
More context (longer input prompts) means more activations, increasing VRAM
usage. A 4K token context consumes significantly more memory than a 1K token
context.
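To make these memory considerations concrete, the following sketch estimates the raw weight storage for a 7B-parameter model at different precisions. It deliberately ignores activations, KV-cache, and runtime overhead, so treat the results as rough lower bounds rather than exact requirements.

public class ModelMemoryEstimate {

    // Bytes per parameter for common tensor precisions.
    enum Precision {
        FP32(4), FP16(2), INT8(1), INT4(0.5);

        final double bytesPerParam;
        Precision(double bytesPerParam) { this.bytesPerParam = bytesPerParam; }
    }

    static double estimateGigabytes(long parameters, Precision precision) {
        return parameters * precision.bytesPerParam / (1024.0 * 1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        long parameters = 7_000_000_000L; // a 7B-parameter model such as Mistral 7B
        for (Precision p : Precision.values()) {
            System.out.printf("%s: ~%.1f GB just for the weights%n",
                    p, estimateGigabytes(parameters, p));
        }
    }
}

Running this prints roughly 26 GB for FP32, 13 GB for FP16, 6.5 GB for INT8, and 3.3 GB for INT4, which is why quantized variants are the usual choice for local deployment.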
Deploying and Serving Models
So far, we have explored the inner workings of transformer models to give you
a foundational understanding of how they function and what the common terms
and acronyms mean in this domain. The world of creating, training, and serving
these models is heavily centered around Python, with very little involvement from
Java. While there are exceptions, such as TensorFlow for Java, most tooling and
frameworks are designed with Python in mind.
Deploying an LLM involves several steps. First, the model must be exported in a format
compatible with an inference engine, which handles loading the model weights,
optimizing execution, and managing resources like GPU memory. Unlike traditional
Java applications, AI models are packaged in formats such as ONNX, GGUF, or Safetensors,
each designed for different execution environments. Choosing an inference
engine determines how efficiently the model runs, what hardware it supports,
and how well it integrates with existing applications. While Java developers typically
do not make these choices themselves, they help formulate the non-functional
requirements that inform the right choice, because it directly affects factors
such as latency and throughput, all of which need to align with your application’s requirements.
In modern cloud-native architectures, inference engines are typically accessed as
services deployed either on-premise, in the cloud, or within containerized environ‐
ments. Java applications interact with these services through REST APIs or gRPC
to send input data and receive model predictions. This aligns with scalable, service-
based architectures, where AI models are deployed as independent services that can
be load-balanced and auto-scaled like other application components. Many cloud
hosted offerings such as OpenAI, Hugging Face, and cloud inference endpoints
from AWS, Azure, and Google Cloud, expose standardized APIs that allow Java
applications to integrate seamlessly without needing direct model deployment. We cover
inference APIs and selected providers in Chapter 5.
For self-hosted or on-device models, Java can leverage native bindings via JNI (Java
Native Interface) or JNA (Java Native Access) to directly invoke inference engines
like llama.cpp or ONNX Runtime without needing external services. For scenar‐
ios requiring low-latency, on-device inference, Java applications can integrate with
frameworks like Deep Java Library (DJL), which provides high-level APIs to load and
execute models directly on supported hardware.
With all that in mind, let’s look at some of the most popular inference engines for
LLM deployment, their capabilities, and how you can access them through Java.
• vLLM: An inference engine optimized for high-throughput, low-latency LLM serving. It features PagedAttention, an efficient memory management technique that significantly improves batch processing, streaming token generation, and GPU memory efficiency. Java Integration: OpenAI-compatible API server (see the Java sketch after this list).
• TensorRT: NVIDIA’s TensorRT is an SDK for running LLMs on NVIDIA GPUs. It offers inference, graph optimizations, quantization support (FP8, INT8, INT4), and more. Java Integration: TensorRT uses the Triton Inference Server, which offers several client libraries and examples of how to use those libraries.
• ONNX Runtime: Provides optimized inference for models converted into the ONNX format, enabling cross-platform execution on CPU, GPU, and specialized AI accelerators. Java Integration: Native Java bindings.
• llama.cpp: An inference engine designed to run quantized models (GGUF format) on standard hardware without requiring a GPU. It is one of the most common options for self-hosting an LLM on a local machine or deploying it on edge devices. Java Integration: Can be accessed via JNI (Java Native Interface) bindings or a REST API wrapper.
• OpenVINO: An inference engine designed by Intel to optimize AI workloads on Intel-specific processors. Java Integration: JNI bindings and REST API wrapper.
• RamaLama: A tool that facilitates local management and serving of AI models from OCI images. Java Integration: Uses llama.cpp REST API endpoints.
• Ollama: The Ollama server and various tools to run models on local hardware. Java Integration: Either a native Java client library or REST API endpoints.
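As a minimal illustration of the REST-based integration path, the following sketch sends a chat completion request to an OpenAI-compatible endpoint using only the JDK’s built-in HTTP client (Java 11+, text blocks require Java 15+). The URL, port, and model name are assumptions for a locally served model (for example via vLLM or Ollama); adjust them to whatever your inference engine actually exposes.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class LocalChatClient {

    public static void main(String[] args) throws Exception {
        // Assumed local OpenAI-compatible endpoint; port and path depend on the engine.
        String endpoint = "http://localhost:8000/v1/chat/completions";

        String body = """
                {
                  "model": "mistral-7b-instruct",
                  "messages": [
                    {"role": "system", "content": "You are a concise assistant."},
                    {"role": "user", "content": "Explain Java records in two sentences."}
                  ]
                }
                """;

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(endpoint))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // The raw JSON response carries the generated text under choices[0].message.content.
        System.out.println(response.body());
    }
}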
When running models and inference engines locally, containerization simplifies
packaging the runtime, libraries, and optimizations into a single environment. Some
inference engines already provide pre-built containers. Tools like Podman further
streamline management of these containers, including the ability to pull model
images or create custom containers for your specific hardware. Podman Desktop
provides a user interface for easily spinning up and testing these AI services.
But there is even more. As the technology advances quickly, there are additional, more
specialized ways to access models, as outlined in the list below. And it does
not look like the options will be slowing down anytime soon.
• Cloud-Native Serving: Cloud platforms like AWS, Google Cloud, and Azure offer fully managed model serving solutions. You can deploy models through their respective marketplaces or tooling, often with automatic scaling and built-in monitoring. This reduces operational overhead but may introduce vendor lock-in.
• Edge AI: Deploying models at the edge—on IoT devices or local gateways—reduces latency and network usage. Frameworks geared for edge AI often include optimizations for low-power hardware, making it viable for real-time or mission-critical scenarios in remote locations.
• Model Registry: Model registries help you store and organize versions of your trained models. Popular services like Hugging Face Model Hub allow you to discover, share, or fine-tune community models, aiding reproducibility and easy updates.
• Knative Serving: Knative Serving is a serverless framework for Kubernetes. It automates scaling, deployment, and versioning for containerized workloads, which can simplify hosting AI inference services alongside other cloud-native applications in a unified environment.
You now have a good overview of how models work internally and how they can
be served. We now pivot to the more delicate tweaks that can be made to
models.
Key Hyperparameters for Model Inference
We’ve already talked about model parameters parameters, but there’s more we should
discuss: So called hyperparameters help by optimizing inference speed, response
quality, and memory efficiency. While parameters (weights) define the model’s
learned knowledge, hyperparameters control inference behavior, allowing developers
to fine-tune the creativity, accuracy, and efficiency of model responses. Many of
these can be changed via API calls or Java APIs. You should experiment with hyper‐
parameter tuning to achieve the best results for your use cases. It’s a great way to
change between precise, deterministic output or creative, open-ended responses. The
following list contains the most comon Hyperparameters:
• Temperature: Controls the randomness of text generation
— Low values (0.2–0.5) result in deterministic, factual responses.
— Temperature = 0.2 - “Java garbage collection manages memory automati‐
cally.”
— High values (0.7–1.2) push more creative, diverse outputs.
— Temperature = 1.0 - “Java’s garbage collection is like an unseen janitor,
tidying memory dynamically.”
• Top-k Sampling: Limits the number of token choices to the top-k most probable
tokens.
— A lower k results in more deterministic responses, while a higher k adds
variability.
• Top-p Sampling (or Nucleus Sampling): Chooses tokens from the top p% of
probability mass.
— Helps generate more natural-sounding responses by adjusting sampling
dynamically.
• Repetition Penalty: Penalizes or reduces the probability of generating tokens that
have recently appeared.
— Encourages the model to generate more diverse and non-repetitive output.
• Context Length: Defines how many tokens the model remembers in a single
request.
— Short context (4K tokens) - Fast, but forgets earlier parts of a conversation.
— Long context (128K tokens) - Better recall, higher computational cost.
Make sure to check your API documentation to confirm which hyperparameters are
supported, if any.
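With an OpenAI-compatible API, most of these hyperparameters are simply fields in the JSON request body. The snippet below shows where they would go; the exact field names and supported ranges vary by provider (for example, some expose frequency_penalty rather than a repetition penalty), so treat the names here as assumptions to verify against your API documentation.

// Hypothetical request body for an OpenAI-compatible chat endpoint.
// Field names and supported values differ between providers and engines.
String requestBody = """
        {
          "model": "mistral-7b-instruct",
          "messages": [
            {"role": "user", "content": "Explain Java garbage collection."}
          ],
          "temperature": 0.2,
          "top_p": 0.9,
          "max_tokens": 256
        }
        """;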
Model Tuning - Beyond tweaking the output
We’ve talked about models, adapters and tuning so far, but we’ve tried to avoid over‐
loading you with knowledge that isn’t directly applicable for working with models.
However, sometimes just using an existing model and slightly tweaking its inputs
and outputs isn’t enough and you can’t find a specific model for your use-case. This
is when you have to look into other ways to create a specific adaptation or even
create a new model. We want to make sure you understand the various ways to
influence model behaviour in terms of complexity and invasiveness, to give you a
better understanding of what you can probably do yourself and when you need help
from a data scientist. We do not cover all the details here, as most of them are
clear Data Scientist specialties, but want to mention them for completeness. You can
learn more about this in the excellent O’Reilly book AI Engineering by Chip Huyen.
Tuning in the traditional sense changes the model’s weights and adapts a pre-trained
model to a specific need. But there are ways to change model behaviour
without modifying the pre-trained weights while still influencing the inner workings of the
model. This approach, known as model adapters, allows the base model to retain its
general knowledge while the adapter layers add task-specific knowledge on top.
Common adapter techniques include:
LoRA (Low-Rank Adaptation)
LoRA inserts small trainable layers into existing transformer weights rather than
modifying the entire model.
PEFT (Parameter-Efficient Fine-Tuning)
PEFT encompasses various adapter techniques, including LoRA, Prefix Tuning,
and Adapter Layers, to fine-tune models efficiently while keeping most parame‐
ters unchanged.
Prefix Tuning & Prompt Tuning
These methods add trainable prefixes to input prompts rather than modifying
model weights, allowing task-specific customization closer to the model. Think
of them as system prompts that are built in.
Adapter models can be integrated via different techniques in inference engines and
effectively layered on top of existing models. Think of it as additional layers of a con‐
tainer image. Adapters are commonly referenced in the model name, and their documentation
explains which applications they are adapted for.
When the data scientist community talks about fine-tuning, they can refer to different
things with different complexities and cost implications. Figure 2-1 gives you an
overview of the different approaches, ranked by effort and their usefulness for
certain scenarios.
Figure 2-1. Common Tuning Techniques applied to LLMs
Let’s take a look at each of these in more detail.
Prompt Tuning or Engineering. Prompt tuning is like optimizing SQL queries or tweak‐
ing configuration files, where adjustments to inputs improve overall performance
without modifying the underlying system. It differs from prompt engineering, which
focuses more on crafting better prompts without systematic experimentation. Prompt
tuning systematically refines input patterns and embeddings so the model produces
more desirable outputs. Applications or automated systems typically implement it
using structured templates that modify prompts based on user interactions. This
allows developers to guide model behavior with minimal overhead, making it an
accessible and cost-effective approach to improving AI responses. However, unlike
fine-tuning, prompt tuning does not alter the model’s internal parameters, which
means its effectiveness is constrained by the base model’s existing capabilities.
Prompt Learning. Prompt learning extends prompt tuning by training a model with
structured input-output prompt pairs. This enables the model to refine its responses
based on structured examples rather than simple trial and error. This is typically done
through fine-tuning methods where labeled examples guide the model’s learning,
helping it adjust to specific patterns or desired behaviors. One effective approach is
using Low-Rank Adaptation (LoRA), which allows adjustments to be made without
retraining the entire model, making it more efficient and resource-friendly. This
method is mainly used in applications that require predefined response structures.
Good examples are enforcement of compliance in AI-generated text, maintaining
consistency in customer support interactions, or applying business logic guardrails. It
can roughly be compared to writing test-driven development (TDD) tests where
expected inputs and outputs help refine software behavior iteratively, ensuring
predictable and improved model performance over time. Unlike using pre-trained
adapters, this will have to be executed by Data Scientists.
Parameter Efficient Fine Tuning (PEFT and LoRA). As discussed earlier, model adapters
allow developers to modify specific behaviors without retraining the entire model,
making this approach much more accessible.
Full Fine Tuning. Full fine-tuning means adjusting all model weights by retraining
it on domain-specific data, requiring specialized knowledge and significant computa‐
tional resources. This process is akin to recompiling an entire application with a
new framework version instead of just upgrading a single dependency. Unlike lighter
tuning methods, full fine-tuning demands expertise in machine learning, access to
high-performance hardware, and a well-prepared dataset to ensure optimal results.
Because of these complexities, it is not typically accessible to traditional developers
and instead is done by a dedicated Data Science team.
Alignment Tuning. Alignment tuning adjusts a model’s outputs to ensure compliance
with ethical guidelines, safety regulations, and industry standards, making it essential
for responsible AI deployment. This process involves modifying the model’s decision-
making process to align with predefined rules, much like defining security policies
and implementing Role-Based Access Control (RBAC) to enforce access restrictions
across a system. Similar to enforcing API rate limits, deploying security patches, or
establishing governance policies at an enterprise level, alignment tuning ensures the
AI operates within acceptable boundaries, mitigating risks associated with uninten‐
ded behavior or biased decision-making.
The InstructLab project takes a novel approach to alignment tuning by using syn‐
thetic data generation and reinforcement learning from human feedback (RLHF)
to refine AI behavior. InstructLab generates curated datasets that help models learn
ethical reasoning, industry-specific regulations, and business logic while ensuring
knowledge consistency across different applications. This approach allows developers
to integrate AI systems that are adaptable to compliance needs without requiring full
retraining, reducing costs and effort while maintaining safety and reliability.
Table 2-1 provides a complete overview of the tuning methods with their respective
resource and cost implications and an indication of their advantages and disadvan‐
tages.
Table 2-1. Overview of Tuning Methods

Prompt Tuning or Engineering
    Effort: Low; Resources: Minimal; Cost: Low; Skills Required: Developers
    Pros: No extra infrastructure; immediate, easy changes; ideal for minor tweaks
    Cons: Limited deeper control; trial and error needed

Prompt Learning
    Effort: Medium; Resources: Moderate; Cost: Medium; Skills Required: Data Scientists
    Pros: Better reliability without parameter changes; more effective than prompt tuning
    Cons: Needs labeled data; limited by model constraints

Parameter-Efficient Fine-Tuning (PEFT, LoRA)
    Effort: Medium; Resources: Moderate; Cost: Medium; Skills Required: Developers, Data Scientists
    Pros: Lower training costs than full fine-tuning; runs on standard GPUs; adapts model without losing original knowledge
    Cons: Requires data prep and updates; gains vary by task complexity

Full Fine-Tuning
    Effort: Very High; Resources: Extensive; Cost: Very High; Skills Required: Data Scientists
    Pros: Maximum behavior control; ideal for proprietary or highly specific tasks
    Cons: High compute/storage costs; requires advanced ML expertise

Alignment Tuning
    Effort: High; Resources: High; Cost: High; Skills Required: Developers
    Pros: Ensures ethical AI; essential for regulated industries
    Cons: Complex implementation; dedicated AI infrastructure needed; high ongoing costs
Understanding Tool Use and Function Calling
LLMs can go beyond simple text generation by interacting with external systems,
APIs, and tools to enhance their capabilities. When a model claims “tool use” or
“function calling,” it means it has been specifically designed or fine-tuned to interpret,
generate, and execute structured function calls rather than just responding with raw
text. These features make LLMs more actionable and useful when it comes to retriev‐
ing data from APIs, running database queries, performing calculations, or triggering
automated workflows.
“Tool use” refers to an LLM’s ability to decide when and how to use external
resources to complete a task. Instead of answering a question directly, the model
can invoke a predefined tool, retrieve the necessary information, and incorporate it
into its response. One way LLMs enable “tool use” is through system prompts, where
models rely on predefined instructions embedded in system messages to determine
when and how to call external tools. These prompts guide the model’s behavior,
helping it recognize when an API call or external retrieval is needed instead of a
direct response. An example system prompt could be the following:
In this environment you have access to a set of tools you can use to answer
the user's question.
{{ FORMATTING INSTRUCTIONS }}
Lists and objects should use JSON format. The output is expected to be valid XML.
Here are the functions available in JSONSchema format:
{{ TOOL DEFINITIONS IN JSON SCHEMA }}
{{ USER SYSTEM PROMPT }}
{{ TOOL CONFIGURATION }}
By carefully crafting system prompts, you can direct the model to make decisions
on when to respond with which answer in a structured way. This is just one way of
prompting. You’ll learn more about prompting in the section “Prompts for Develop‐
ers”.
For more advanced tool use, custom model adapters can be used to add a fine-tuned
layer to existing models. This approach requires additional training or a specific
adapter model but allows you to adapt a model to a specific use case or even domain.
Instead of relying on general-purpose instructions, fine-tuned adapters improve a
model’s ability to detect when external functions should be invoked and generate
precise API requests that align with a given service’s requirements.
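To make the idea of structured tool definitions more tangible, here is a sketch of how a Java application might describe a single tool in JSON Schema and embed it into a system prompt that follows the template above. The schema layout, tool name, and instruction wording are illustrative assumptions; real providers each define their own tool-calling request format.

public class ToolPromptBuilder {

    public static void main(String[] args) {
        // A hypothetical tool definition expressed in JSON Schema form.
        String weatherTool = """
                {
                  "name": "get_current_weather",
                  "description": "Returns the current weather for a city",
                  "parameters": {
                    "type": "object",
                    "properties": {
                      "city": {"type": "string", "description": "City name"}
                    },
                    "required": ["city"]
                  }
                }
                """;

        // Embed the tool definition into a system prompt, mirroring the template shown earlier.
        String systemPrompt = """
                In this environment you have access to a set of tools you can use
                to answer the user's question.
                Here are the functions available in JSONSchema format:
                %s
                When a tool is needed, respond only with a JSON object containing
                the tool name and its arguments.
                """.formatted(weatherTool);

        System.out.println(systemPrompt);
    }
}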
Choosing the Right LLM for Your Application
Categorizing LLMs is challenging due to their diverse capabilities, architectures, and
applications. When selecting one for your project, it helps to categorize available
options based on common attributes. Similar to how you evaluate every other service
or library you use in your applications, you want to consider both functional and
non-functional requirements. Utilizing everything we discussed so far, this section
includes a common set of categories and attributes that might help you to decide
which kind of model you need for your application. However, keep in mind there
are additional attributes you might consider, and the final choice often depends on
evaluating results—a task typically handled by data scientists. For a deeper treatment of
model evaluation, see Chapter 3, “Evaluation Methodology,” in Chip Huyen’s O’Reilly
book AI Engineering.
Model Type
Choosing the right LLM type depends on how you plan to use it, balancing specific‐
ity, efficiency, and adaptability. We can group models by how they generate, retrieve,
or understand text and other inputs.
Text Generation models perform well for open-ended tasks, such as chatbots, auto‐
mated documentation, and summarization. They may need tuning to align with
business requirements. Instruction-tuned chat models specialize in conversational
interfaces, making them suitable for customer support and AI-driven assistants; they
respond to structured prompts with refined contextual understanding.
If a use case requires external knowledge, Retrieval-Augmented Generation (RAG)
models integrate dynamic data sources for more accurate, domain-specific answers.
Embedding models focus on semantic search, classification, and similarity match‐
ing to enhance AI-driven search and recommendation systems. Multimodal mod‐
els process images and text for tasks like optical character recognition (OCR) or
image-based question answering. Code generation models target developer produc‐
tivity, assisting with automated refactoring and AI-assisted coding. Function and
tool-calling models interact with enterprise systems to automate workflows or trigger
specific API actions.
When deciding among these options, consider whether you need freeform text gen‐
eration, structured responses, external knowledge, or specialized features such as
coding or multimodal capabilities. This is the primary decision-making category for
you.
Model Size & Efficiency
Model size influences cost, accuracy, and latency. Small models (≤7B parameters)
fit edge or on-premises deployments where low latency and limited resources
matter most. Medium-sized models (7B–30B parameters) trade off efficiency and
performance, making them a sensible choice without excessive infrastructure
requirements. Large models (≥30B parameters) offer advanced reasoning but
demand substantial compute resources.
Decide whether to prioritize lower cost, higher performance, or compatibility with
existing hardware. In many scenarios, smaller or quantized models can provide
results close to those of larger models, reducing hardware investments without losing
essential functionality.
Deployment Approaches
The deployment strategy that you choose affects scalability, data security, and opera‐
tional complexity. Consider that:
• API-based or 3rd party hosted models are straightforward to integrate, with
almost no infrastructure overhead. They scale easily but may raise concerns
about vendor lock-in, latency, and ongoing usage fees.
• Self-hosted models provide more control over data and can reduce inference
costs when scaled out. However, they require managing GPUs or other special‐
ized hardware and handling ongoing optimizations. This approach suits enter‐
prises with strict compliance needs or those aiming to minimize reliance on
external providers.
• Edge or local deployment offers low-latency, offline operation. It works well
for mobile or IoT devices as well as developer machines, but faces constraints
due to reduced model size and complexity.
Your choice depends on ease of integration, security requirements, and cost con‐
straints. If you are handling sensitive data, you will likely want self-hosted or hybrid
approaches. When you want to deploy and scale quickly, you may opt for API-based
or third-party hosted models.
Supported Precision & Hardware Optimization
You already learned how numeric precision affects speed and memory usage. Full
precision (FP32, BF16, FP16) delivers the highest accuracy. Quantized models (INT8,
INT4) reduce memory demands and provide more speed. Furthermore, the hardware
choice influences the available precision options. For example, CUDA- and TensorRT-
based inference optimizes performance for NVIDIA GPUs, whereas ONNX, Open‐
VINO, and CoreML open deployment possibilities on Intel or Apple Silicon. Evaluate
whether you need specialized accelerators or if general-purpose hardware will suffice.
Ethical Considerations & Bias
Bias and ethical risks arise when training on broad datasets. Mitigation strategies
help ensure fairness and align with regulations. Some enterprise models implement
built-in bias filtering, whereas open-source models may require extra oversight to
manage potentially harmful outputs.
Regulatory compliance is also vital, particularly when handling personally identifiable
information (PII). Content filtering features in proprietary models can help address
these concerns, while open-source implementations demand custom safeguards.
Transparency matters as well; models with open weights enable deeper scrutiny
of training data and decision-making. Strike a balance between ethical obligations,
operational constraints, and responsible AI practices.
Community & Documentation Support
Success with LLMs often relies on a robust developer ecosystem, solid documenta‐
tion, and community support. Widely adopted open-source projects tend to offer
extensive forums, software development kits (SDKs), and established best practices.
Enterprises that prefer vendor-backed services can look for solutions with service-
level agreements (SLAs) and direct support. Comprehensive documentation, libra‐
ries, and frameworks—especially those with Java-friendly APIs—streamline the
integration process. When deciding, consider both the reliability of the model and
the ecosystem’s maturity to ensure a smoother rollout.
Closed vs. Open Source
LLMs can be categorized by their licensing models, which influence accessibility, cus‐
tomization options, and long-term sustainability. The choice between closed-source
and open-source models carries significant consequences for enterprises, especially
with respect to control, cost, and flexibility.
Closed-source models are proprietary solutions often provided through cloud-hosted
APIs or software products. They typically feature specific capabilities and benefit
from ongoing updates from the vendor. However, they can limit visibility into the
underlying mechanisms, introduce vendor lock-in, and raise potential data privacy
issues. This is similar to using a proprietary Java framework, where you gain enterprise-level
support but relinquish detailed control over the implementation.
Open-source models offer transparency, community-driven development, and the
freedom to self-host and customize. Organizations with strict data governance often
prefer these models because they maintain full authority over deployment and fine-
tuning. Yet, open-source solutions generally require more in-house engineering for
maintenance and optimization. This is comparable to open-source Java frameworks,
which grant flexibility but demand internal expertise.
Enterprises must weigh their need for control, compliance, and cost-efficiency when
deciding which approach to adopt. Closed-source offerings may provide an out-of-
the-box experience, while open-source alternatives allow greater adaptability and
independence from vendor constraints.
Example Categorization
Table 2-2 shows an example matrix that helps you weigh these attributes based
on your project’s priorities. The matrix includes potential considerations for each
attribute and how you might rate them for different use cases (for instance, on a scale
of Low, Medium, High).
Table 2-2. Decision Making Matrix (example rating: Low/Med/High for each attribute)

Model Type
    Decision factors: general vs. domain focus; flexibility vs. specialization
Model Size & Efficiency
    Decision factors: resource consumption (CPU/GPU/memory); response time requirements
Deployment Modality
    Decision factors: data privacy needs; infrastructure control vs. convenience
Supported Precision & Hardware
    Decision factors: need for high throughput; hardware availability (GPUs vs. CPUs, etc.)
Ethical Considerations & Bias
    Decision factors: user trust; regulations and compliance
Community & Documentation Support
    Decision factors: maturity of ecosystem; availability of tutorials / community expertise
Closed vs. Open Source
    Decision factors: proprietary vs. transparency; data and model ownership and flexibility
Function & Tool Calling
    Decision factors: application integration requirements; real-time data or external services
Let’s walk through how to Use the Matrix:
+Define your primary goals (e.g., text classification, code completion, or domain-
specific question answering). Then note whether you need a broad or niche
solution. .Prioritize the Attributes +Determine which attributes matter most. For
instance, if data security is paramount, you might score “Deployment Modality”
and “Ethical Considerations” as “High.” .Assign Ratings +Rate each attribute based
on how critical it is to your project. A “Low” rating indicates it is less important,
while “High” signals a crucial requirement. .Evaluate Trade-offs +Review high-rated
attributes to see if they conflict. For example, you might want robust tool calling
and low resource usage, but a smaller model may not offer extensive integration
options. .Choose a Model Category After weighing the trade-offs, you can see which
48 | Chapter 2: The New Types of Applications
LLM category (general, specialized, large, small, etc.) or deployment approach (cloud,
on-premises) best meets your needs.
Foundation Models or Expert Models - Where are we headed?
After examining different ways to categorize models, it’s helpful to look ahead and
consider how these categories might evolve in the future. One important distinction
that is clearly emerging is between Foundation Models (FMs) and Expert Models
(EMs). A Foundation Model in the context of Large Language Models refers to
a type of pre-trained model that serves as the basis for a variety of specialized
applications. Much like the foundation of a skyscraper supports structures of varying
complexity, FMs provide a general-purpose framework that can be fine-tuned or
adapted for specific tasks. These models are trained on vast datasets (think of
almost all the available public text, code, images, and other data sources) to learn
broad linguistic, factual, and contextual representations. While FMs are designed
to be general-purpose, many real-world applications benefit from Expert Models,
which are much smaller and optimized for specific domains. They often outperform
general-purpose FMs in their niche areas by focusing on task-specific accuracy
and efficiency. In practice, organizations often deploy ensembles of expert models
rather than relying on a single FM. By combining domain-specialized models with a
general-purpose FM, companies can achieve higher precision in critical applications
while still leveraging the broad knowledge embedded in the foundational model.
Industry Perspectives: Large vs. Small Models vs. Task Oriented vs. Domain Specific
Researchers and practitioners continue to debate the trade-offs between large and
small models. Initially, many viewed bigger models (measured in billions or even
trillions of parameters) as the path forward, since they seem to demonstrate better gen‐
eralization and language capabilities. However, scaling and using large models comes at a
significant cost at every stage of the model lifecycle. As a result, there has been a shift
toward smaller, task-optimized models that perform well with significantly less com‐
putational demands. Techniques such as distillation, pruning, and quantization help
compress large models while keeping the desired model capabilities. Open-source
models like Mistral 7B and LLaMA 2 13B are examples of this trend. They offer good
performance at a fraction of the size of models like GPT-4 or Gemini.
Some industry experts argue that small, specialized models working in concert will
outperform monolithic large models in specific applications. This is where model
chaining and hybrid architectures come into play.
Mixture of Experts (MoE), Multi-Modal Models, Model Chaining, et al
A growing trend in AI is moving beyond purely text-based models to multi-modal
models that integrate text, images, audio, and video, allowing users to interact
with AI in more natural ways. These models expand the traditional foundation model
paradigm by enabling use-cases that combine inputs. Another evolving concept is
model chaining, where multiple specialized models collaborate dynamically instead
of relying on a single monolithic FM. Instead of deploying a general-purpose model
to handle all tasks, task-specific expert models (e.g., a summarization model, a
retrieval-augmented generation (RAG) model, or a reasoning engine) work together
to achieve better accuracy and efficiency. This aligns with the shift toward retrieval-
augmented generation (RAG) pipelines, where smaller models retrieve relevant docu‐
ments before generating responses, reducing the need for massive parameter counts.
Instead of chaining complete models, the MoE approach works inside a single model.
Neural sub-networks represent multiple “experts” within a larger neural network, and
a router selectively activates only those experts best suited to handle the input. Many
systems already broadly use this approach.
DeepSeek and the Future of Model Architectures
Innovations like DeepSeek introduce hybrid model architectures that combine tra‐
ditional neural networks with new reasoning and retrieval mechanisms. These
approaches try to enhance the efficiency of FMs by focusing on modular, interpreta‐
ble, and adaptable architectures rather than expensive scaling. Techniques such as
adaptive model scaling, task-specific adapters, and memory-augmented transformers
push the boundaries of what FMs can achieve. Data science moves quickly, and new
models and approaches appear constantly, driven in part by the intense competition
in a very active field.
Now that we’ve covered the technical details of large language models, let’s focus on
practical application for developers. The upcoming section gives you an overview of
how to write effective prompts.
Prompts for Developers - Why Prompts Matter in AI-
Infused Applications
Prompts are the primary mechanism for interacting with LLMs. They define how an
AI system responds, influencing the quality, relevance, and reliability of generated
content. For Java developers building AI-infused applications, understanding prompt
design is one of the most important skills. A well-structured prompt can reduce
hallucinations, improve consistency, and optimize performance without requiring
fine-tuning of the model. There are many different recommendations out there on
how to write effective prompts and which techniques to use. Examples include the
OpenAI Prompt Engineering Guide and the book Prompt Engineering for Generative
AI. Consider this a brief overview and the beginning of your learning journey.
Types of Prompts
Prompts differ based on their source and how they guide the model. Key types
include:
User Prompts: Direct input from the user
User prompts are the raw input provided by end users. These are typically unstruc‐
tured and need preprocessing or context enrichment to ensure accurate responses.
String userPrompt = "What is the capital of France?";
Handling user prompts effectively requires input sanitization, intent recognition, and
context enhancement. We will get to this in more detail in the next chapter.
System Prompts: Instructions that guide model behavior
System prompts define how the model behaves in a session. These are often set at the
start of an interaction and remain hidden from the user. They can be used to establish
tone, enforce constraints, or guide the model’s responses.
String systemPrompt = "You are a helpful AI assistant "
        + "that provides concise and factual responses.";
System prompts help define the boundaries of the LLM within applications. They
can also be used to enforce certain outputs or contain tool-calling instructions.
Contextual Prompts: Pre-populated or dynamically generated inputs
Contextual prompts include background information, past interactions, or domain-
specific knowledge added to the prompt to improve responses. These can be dynami‐
cally generated based on user history or external data. This is also a very effective way
to inject memory into conversations. We’ll cover more on the architectural aspects of
this in the next chapter.
String context = "User previously asked about European capitals.";
String fullPrompt = context + " " + userPrompt;
Contextual prompts enhance the relevance of responses, particularly in multi-turn
conversations, and function as de facto memory, helping LLMs maintain conversational
cohesiveness.
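One simple way to provide this de facto memory is to carry recent conversation turns along with the new user prompt. The sketch below joins a short history into the context; the history size here is an arbitrary assumption and would in practice be bounded by the model’s token budget.

import java.util.List;

public class ConversationContext {

    public static void main(String[] args) {
        // Previous turns kept by the application (in practice bounded by the token budget).
        List<String> history = List.of(
                "User: What is the capital of France?",
                "Assistant: The capital of France is Paris.");

        String userPrompt = "And what is its population?";

        // Join past turns and the new question into a single contextual prompt.
        String fullPrompt = String.join("\n", history)
                + "\nUser: " + userPrompt;

        System.out.println(fullPrompt);
    }
}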
Principles of Writing Effective Prompts
Specificity and structure are essential for effective prompt engineering. By being
precise and organizing your prompts logically, you can significantly improve the
quality and relevance of the responses you receive from LLMs. Investing time in
crafting well-structured and specific prompts is a crucial step in getting the most out
of these powerful tools. Here are two examples, one too vague and one appropriately
specific:
String vaguePrompt = "Tell me about Java."; // Too broad
String specificPrompt = "Explain Java's garbage collection "
        + "mechanisms in one paragraph."; // Much better
The vaguePrompt is so open-ended that the LLM could respond with anything
related to Java: its history, its uses, its syntax, and so on. The specificPrompt, on the
other hand, clearly states what information is needed and even specifies the desired
length (one paragraph).
Crafting good prompts for Large Language Models (LLMs) is important for getting
the responses you want. Common mistakes can make this difficult. One problem
is being too vague, which leads to unclear or general results. Giving the model too
much information can also confuse it. It’s also important to keep prompt length in
mind, as very long prompts may be cut off. Finally, test and iterate on your
prompts to improve them over time.
Prompting Techniques
Different prompting techniques offer structured ways to interact with models, rang‐
ing from direct instructions and examples to more complex methods that enhance
reasoning or incorporate external knowledge.
Zero-shot prompting: Asking without context
Zero-shot prompting is where you ask the model to do something without giving
it any examples. The model uses what it learned during training to understand and
complete the request. It’s like asking someone who knows a lot about a subject a
question without giving them any background. Zero-shot prompts rely entirely on
the model’s pre-trained knowledge. A simple example looks like this:
String zeroShotPrompt = "Define polymorphism in object-oriented programming.";
While zero-shot prompting can work well, it has limits. Accuracy can vary with task
complexity. Ambiguous prompts can be misinterpreted. It may have trouble with
completely new tasks. Zero-shot prompting is good for quickly testing an LLM’s
abilities. It’s a good starting point, but other methods might be needed for more
complex tasks.
Few-shot prompting: Providing examples to guide responses
With few-shot prompting you provide a few examples of the task you want the model
to perform. These examples demonstrate the desired input-output relationship and
help steer the model towards generating the correct type of response. It’s like showing
someone a couple of examples of how to do something before asking them to do it
themselves. Working with examples in Java can look like this:
String fewShotPrompt = "Translate the following phrases to Spanish:\n\n" +
"English: Hello\n" +
"Spanish: Hola\n\n" + // Example 1
"English: Good morning\n" +
"Spanish: Buenos días\n\n" + // Example 2
"English: How are you?\n" +
"Spanish: "; // The LLM completes this
In this example, you’re giving the LLM two examples of English-Spanish translations.
This helps the model understand that you want a Spanish translation for “How are
you?”. It’s more likely to give a correct translation (“¿Cómo estás?”) than if you had
used a zero-shot prompt. By seeing a few examples, the LLM can better understand
the pattern or rule you want it to follow. It can generalize from these examples and
apply the learned pattern to new, unseen inputs. This is particularly helpful for tasks
where the desired output format is specific or where the task is slightly ambiguous.
Chain-of-Thought (CoT) prompting: Encouraging step-by-step reasoning
Chain-of-Thought (CoT) prompting is intended to help LLMs perform complex rea‐
soning by explicitly generating a series of intermediate steps, or a “chain of thought,”
before arriving at a final answer. It’s like asking someone to “show their work” on
a math problem. Instead of just getting the final result, you see the step-by-step
reasoning that led to it. This works great for word problems like the following:
Problem: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
Chain of Thought:
Roger started with 5 tennis balls.
He bought 2 cans of 3 tennis balls each, so he bought 2 * 3 = 6 tennis balls.
In total, he has 5 + 6 = 11 tennis balls.
Answer: 11
You’re not just giving the LLM the problem and asking for the answer. You’re showing
it how to solve the problem by breaking it down into steps. If you then give the
LLM a similar problem, it’s more likely to generate its own chain of thought and
arrive at the correct answer. LLMs have learned a lot about reasoning from the
massive datasets they were trained on. However, they don’t always explicitly use this
reasoning ability when answering questions. CoT prompting encourages them to
activate and utilize their reasoning capabilities by providing examples of how to think
step-by-step.
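In application code, chain-of-thought prompting often amounts to including one worked example plus an explicit nudge to show the reasoning. A minimal sketch of such a prompt string follows; the exact wording is just one possible phrasing and assumes Java text blocks (Java 15+).

String cotPrompt = """
        Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
        Each can has 3 tennis balls. How many tennis balls does he have now?
        A: Roger started with 5 balls. 2 cans of 3 balls each is 6 balls.
        5 + 6 = 11. The answer is 11.

        Q: A library has 12 shelves. It adds 3 new shelves with 20 books each.
        How many new books were added?
        A: Let's think step by step.
        """;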
Self-Consistency: Improving accuracy by generating multiple responses
Self-consistency is an approach to improve the accuracy of LLM responses, particu‐
larly on reasoning tasks. The core idea is simple: instead of relying on a single gener‐
ated response, you generate multiple responses and then select the most consistent
one. This leverages the idea that while an LLM might make occasional errors in its
reasoning, the correct answer is more likely to appear consistently across multiple
attempts. LLMs are probabilistic models. They don’t always produce the same output
for the same input. Sometimes they make mistakes, especially on complex reasoning
tasks. However, the assumption behind self-consistency is that the correct answer
is more likely to be generated repeatedly, even if the LLM makes occasional errors.
By generating multiple responses, you increase the chances of capturing the correct
answer and filtering out the incorrect ones.
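A minimal sketch of the idea in Java: call the model several times with the same prompt and keep the answer that appears most often. The getLLMResponse method is the same hypothetical call used in the other examples (java.util.Map and java.util.HashMap imports omitted), and a real implementation would sample with a non-zero temperature and extract the final answer from each full response before voting.
// Sketch: majority vote over several sampled responses.
String question = "Roger has 5 tennis balls and buys 2 cans of 3. How many does he have?";
Map<String, Integer> votes = new HashMap<>();
for (int i = 0; i < 5; i++) {
    String answer = getLLMResponse(question); // hypothetical LLM call
    votes.merge(answer, 1, Integer::sum);
}
String mostConsistentAnswer = votes.entrySet().stream()
        .max(Map.Entry.comparingByValue())
        .map(Map.Entry::getKey)
        .orElseThrow();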
Instruction Prompting: Directing the model explicitly
Instruction prompting is a straightforward approach. It involves giving a model
explicit instructions about what you want it to do. Instead of relying on implicit cues
or indirect suggestions, you directly tell the LLM what task to perform and what kind
of output you expect. It’s like giving someone very clear and specific directions. A
very explicit prompt looks like the following:
TRANSLATE: English to Spanish
TEXT: "Hello, world!"
Instruction prompting is often used in conjunction with other prompting
approaches. You might use instruction prompting within a few-shot learning setup,
where the examples also include clear instructions. It can also be combined with
chain-of-thought prompting by instructing the model to “show its reasoning step-by-
step” before providing the final answer.
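Combining an explicit instruction with chain-of-thought can be as simple as adding that instruction to the prompt string; a small sketch:
String instructionPrompt =
    "INSTRUCTION: Solve the following problem. Show your reasoning " +
    "step-by-step, then give the final answer on its own line.\n" +
    "PROBLEM: A train travels 60 km in 45 minutes. " +
    "What is its average speed in km/h?";
// The instruction tells the model both what to do and how to format the output.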
Retrieval-Augmented Generation (RAG): Enhancing prompts with external data
RAG is essentially a form of contextual prompting. Instead of relying on arbitrary text context, RAG uses external sources, such as databases or documents, to retrieve more current or specific information before answering a question. Technically, this is more an architectural approach than a prompting technique. We will look at how this works in more detail in the next chapter.
Advanced Strategies
Building on fundamental techniques, this section covers advanced prompting strate‐
gies for creating more dynamic, reliable, and optimized interactions with language
models. You will see many of these general examples again in chapters 6 and 7.
Dynamic Prompt Construction: Combining static and generated inputs
This approach involves building prompts on the fly by combining fixed text with
information generated by other parts of the system. For example, you might have a
template prompt for summarizing a product, but the specific product details (name,
description, price) are pulled from a database and inserted into the prompt before
sending it to the model. This allows for flexible and context-aware prompts. You can
use Java’s String features to dynamically add content like in the following example:
String productTemplate = "Summarize product:\n" +
        "Name: {productName}\n" +
        "Description: {productDescription}";
String productName = "Wireless Headphones";
String productDescription = "Noise-canceling, Bluetooth 5.0";
String dynamicPrompt = productTemplate
        .replace("{productName}", productName)
        .replace("{productDescription}", productDescription);
// The 'dynamicPrompt' can be sent to the model.
Use code to build prompts dynamically from templates and variable data for
increased flexibility and context awareness.
Using Prompt Chaining to Maintain Context
If you need to maintain context across multiple interactions with a model, you can use prompt chaining. Instead of treating each prompt in isolation, you link them together. The output of one prompt becomes part of the input for the next. This is useful for multi-step tasks, like building a story or answering complex questions that require multiple pieces of information. Think of a conversation where you still remember the beginning instead of just replying to the latest question. Picture this as a very small prompt-to-prompt memory, like in the following:
String initialPrompt = "List the ingredients in a Margherita pizza.";
String firstResponse = getLLMResponse(initialPrompt); // Hypothetical LLM call
String followUpPrompt = "How do I make a pizza with these ingredients:\n"
+ firstResponse;
String finalResponse = getLLMResponse(followUpPrompt); // Hypothetical LLM call
Prompt chaining enables multi-step problem-solving or a very small conversational
memory by incorporating previous model responses into subsequent prompts.
Guardrails and Validations for Safer Outputs
You’ll find a lot of definitions for the terms guardrails and validations. They all basically mean the same thing: ensuring that a model’s output is safe and reliable. Guardrails might involve filtering or rejecting outputs that contain harmful content. Validations
might check if the output conforms to a certain format or logic. For example, if you
ask a model to generate code, you might validate that the code compiles correctly.
We will look into more specific implementations in chapter 7. You can build them manually into your code, as in the example below, or use library features such as those that Quarkus or LangChain4J provide for certain models.
String llmR = getLLMResponse("Write a short story."); // Hypothetical LLM call
// Simple guardrail: Check for harmful content
if (llmR.contains("violent") || llmR.contains("hate")) {
System.out.println("Response flagged for inappropriate content.");
} else {
// Validation (example: check length)
if (llmR.length() > 500) {
System.out.println("Response too long. Truncating.");
llmR = llmR.substring(0, 500);
}
System.out.println(llmR);
}
Apply guardrails and validations to model outputs to enforce safety standards and
verify conformance with expectations.
Leveraging APIs for Prompt Customization
Model providers often offer APIs that let you customize the prompting process.
These APIs might allow you to set hyperparameters that control the model’s behavior, or they might provide tools for managing and organizing your prompts. What you can do depends on the API used and the functionality it exposes.
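As a small illustration, the following request body shows the kind of parameters such APIs expose. The parameter names used here (temperature, max_tokens, top_p) follow the OpenAI-style chat completions API that is covered later in this book; other providers expose similar knobs under different names, so treat this as a sketch rather than a universal contract.
// Sketch: OpenAI-style request parameters, built as a Java text block.
String requestBody = """
    {
      "model": "gpt-4o",
      "messages": [
        {"role": "user", "content": "Summarize our return policy in two sentences."}
      ],
      "temperature": 0.2,
      "max_tokens": 120,
      "top_p": 0.9
    }
    """;
// Lower temperature and a max_tokens limit trade creativity and length for
// predictability and cost control.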
Optimizing for Performance vs. Cost
Generating longer responses or making many API calls adds up, whether in usage fees or in compute resources. Therefore, it’s important to optimize for both performance (getting good
results) and cost (minimizing expenses). This might involve using shorter prompts,
caching common responses, or choosing a less expensive LLM for less critical tasks.
Debugging Prompts: Troubleshooting Poor Responses
“Debugging” prompts involves figuring out why the LLM gave a bad response and
then revising the prompt to fix the problem. This often requires careful analysis of
the prompt and the LLM’s output to pinpoint the issue. It’s like debugging code, but
instead of code, you’re debugging your questions.
Mastering prompting techniques gives us direct control over model interaction, but
AI infused applications require more technical components. Let’s look at the support‐
ing technologies you usually find, such as embeddings, vector databases, caches,
agents, and frameworks that facilitate more complex solutions.
Supporting Technologies
LLMs require a full-stack ecosystem beyond just model inference. Technologies like
vector databases, caching, orchestration frameworks, function calling, and security
layers are necessary for a production-ready application.
Vector Databases & Embedding Models
Vector databases and embedding models are key elements of AI system architectures, especially when working with lots of unstructured information and complex searches. Embedding models transform text into vectors, which are lists of
numbers that capture the text’s meaning. Similar texts have vectors that are close
together, enabling vector searches based on meaning. Vector databases then store and
quickly search these vectors, using specialized indexing to find the vectors closest to
a search query. This allows for fast retrieval of relevant documents, even in huge data‐
sets. In a RAG setup, a user’s question is converted into a vector and used to query the
vector database. The database returns similar document vectors, and the correspond‐
ing documents are retrieved. These documents are combined with the user’s question
to create a prompt for the LLM, which then generates a more informed response.
Essentially, embedding models provide semantic understanding, and vector databases
provide efficient search, working together to help get specific information out of models without changing their weights. If you cannot wait to see more of the code behind how this actually works, turn ahead to chapter 4.
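To make the idea of “vectors that are close together” concrete, here is a minimal sketch of cosine similarity between two embedding vectors in plain Java. The vectors are made-up placeholder values, not real embeddings, which typically have hundreds or thousands of dimensions produced by an embedding model.
public class CosineSimilarity {

    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];      // how much the vectors point in the same direction
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        float[] catEmbedding = {0.12f, -0.03f, 0.88f};    // placeholder values
        float[] kittenEmbedding = {0.10f, -0.01f, 0.85f}; // placeholder values
        // Values close to 1.0 mean the two texts are semantically similar.
        System.out.println(cosine(catEmbedding, kittenEmbedding));
    }
}
Vector databases essentially run this kind of comparison at scale, using specialized indexes so they do not have to compare the query against every stored vector.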
Caching & Performance Optimization
Caching involves storing the results of LLM requests so that identical or similar
requests can be served directly from the cache, avoiding repeated calls to the model
and saving both time and resources. Key considerations for caching include deter‐
mining the appropriate cache key for identifying similar requests, establishing a cache
invalidation strategy to handle outdated or changed data, selecting a suitable storage
mechanism (in-memory, local files, or dedicated services) and managing cache size.
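As a minimal sketch of the caching idea (not a production cache), the following uses an in-memory map keyed by the exact prompt text; a real setup would add an invalidation strategy, a size limit, and possibly semantic (embedding-based) keys. The model call itself is represented by the injected function, in the spirit of the hypothetical getLLMResponse used earlier.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class PromptCache {

    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> llmCall;

    public PromptCache(Function<String, String> llmCall) {
        this.llmCall = llmCall; // the actual model call, e.g. a hypothetical getLLMResponse
    }

    public String respond(String prompt) {
        // Identical prompts are served from the cache, skipping the model call.
        return cache.computeIfAbsent(prompt, llmCall);
    }
}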
Beyond caching, other performance optimization techniques are also helpful. We’ve
already talked about prompt optimizations but there’s also the possibility to batch
requests into single calls to reduce overhead. Asynchronous or stream-based requests let applications continue with other tasks while waiting for model responses. Continuous monitoring and profiling help to identify performance bottlenecks, just as with any other traditional system, so it is important to implement a suitable monitoring solution from the very beginning of your project.
AI Agent Frameworks
Models alone are not enough to build intelligent applications. You need tools to
integrate these models into workflows, interact with external systems, and handle
structured decision-making. AI agent frameworks promise to bridge this gap by
managing tool execution, memory, and reasoning, all in one place. The term “agent” is used differently across various discussions today. At a basic level, an agent can be anything from a model that simply invokes tools on request to a fully autonomous system that plans and executes multi-step workflows. Most real-world implementations today focus on tool invocation, while full agent-based architectures are still developing.
LangChain4J for example offers a structured way to integrate AI-driven tools into
applications, using declarative tool definitions together with prompt management
and structured responses, enriched by context handling and memory management.
IBM’s Bee Agent Framework takes a different approach, focusing on multi-agent
workflows. It also contains a notion of distributed agents with explicit task execution and planning, as well as customized integrations. Bee is in early development but
presents an alternative to single-agent models by allowing multiple agents to work
together.
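As an illustration of what a declarative tool definition looks like, here is a minimal sketch using LangChain4J’s @Tool annotation and its AiServices builder. The WeatherTools class and Assistant interface are hypothetical names, and the exact builder methods may differ between LangChain4J versions, so treat this as a sketch rather than reference usage; chapters 6 and 7 cover the real integration in detail.
import dev.langchain4j.agent.tool.Tool;
import dev.langchain4j.service.AiServices;

// Hypothetical tool: the framework decides when to invoke it based on
// the user's request and the tool description.
class WeatherTools {

    @Tool("Returns the current temperature in Celsius for a city")
    double currentTemperature(String city) {
        return 21.5; // placeholder implementation
    }
}

// Hypothetical assistant interface; LangChain4J generates the implementation.
interface Assistant {
    String chat(String userMessage);
}

// Wiring (sketch): 'model' would be a previously configured chat model instance.
// Assistant assistant = AiServices.builder(Assistant.class)
//         .chatLanguageModel(model)
//         .tools(new WeatherTools())
//         .build();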
Model Context Protocol (MCP)
The Model Context Protocol (MCP) emerged as a very early standard and defines
how applications provide contextual information to AI models. MCP is probably
going to replace traditional tool/function calling with a structured, session-based
approach that separates context, intent, and execution. Instead of issuing one-shot
function calls packed into prompts, the model operates within defined contexts,
expresses goals as intents, and interacts with typed resources through clear protocols.
This design enables lifecycle management, state awareness, and consistent behavior
across runtimes. Unlike tightly coupled, tool-specific implementations, MCP pro‐
motes model-agnostic interoperability, better debugging, and long-lived, goal-driven
interactions—ideal for building robust, agentic systems.
API Integration
API integration is another important part of modern system architectures. It
makes AI models accessible and manageable in production environments. While
models provide the intelligence, APIs handle communication, security, monitoring,
and performance optimization. You can find traditional API management solutions
with specific offerings enhanced for model access but also features integrated into
AI platforms or even model registries. The main responsibilities of API management
frameworks are:
• Authentication & Authorization: Using OAuth, JWT, or API keys to restrict
access.
• Role-Based Access Control (RBAC): Grants permissions based on user roles (e.g.,
developers vs. inference consumers).
• Audit Logs: Tracks API requests for monitoring and compliance.
• Rate limiting: Preventing excessive API usage.
• Load balancing: Distributing requests across multiple instances.
• Observability - Latency detection: Measuring response times.
• Observability - Request volume monitoring: Identifying trends and potential
scaling needs.
• Caching: Reducing redundant API calls for efficiency.
Beyond handling requests, API integration also serves as the first line of defense for
securing model access and enforcing usage policies. But API gateways alone aren’t
enough—deeper, layered security is required to protect models, data, and operations
end to end.
Model Security, Compliance & Access Control
While API management is usually the outermost layer of an AI-infused system, more elements come into play when it comes to further security considerations. Security, compliance, and access control are layered across different parts of an AI system, from model storage and access control to runtime monitoring and compliance enforcement. This ultimately adds more complexity to these applications. What makes it particularly challenging is the heterogeneity in infrastructure, runtimes, and even languages. While we do know how to build robust cloud-native applications, the age of AI-infused applications has just started, and outside of a very few offerings, there is hardly any cohesive all-in-one platform available today.
To manage these risks effectively, different components of an AI platform require
their own security measures. The following areas highlight where targeted controls
and best practices are essential and need to be implemented.
Model Storage and Registry Security
A model registry stores and tracks different versions of models, ensuring gover‐
nance and traceability. Security in this layer includes:
• Encryption – Protects stored models from unauthorized access.
• Integrity Checks – Uses hashing or digital signatures to ensure the model has
not been tampered with.
• Access Controls – Limits who can upload, modify, or retrieve models.
Compliance and Governance
AI models must comply with regulations like GDPR, HIPAA, and SOC 2,
depending on their use case. Compliance involves:
• Data Anonymization – Ensures personal data used in training does not
expose sensitive information.
• Audit Trails – Logs all model training, versioning, and inference requests for
traceability.
• Bias & Fairness Audits – Implements tools like IBM AI Fairness 360 to detect
biases in models.
Runtime Security and Model Monitoring
Once deployed, models need continuous monitoring for security threats, perfor‐
mance degradation, or adversarial attacks. This includes:
• Input Validation – Prevents injection attacks and malformed inputs that
could crash the model.
• Drift Detection – Alerts when input data distribution changes significantly.
• Rate Limiting – Controls excessive API requests to prevent abuse.
Conclusion
AI integration in Java applications builds on established principles rather than replac‐
ing them. Just as we adapted to cloud-native architectures and reactive programming, working with AI models now requires new tools and concepts while maintaining modularity, scalability, and maintainability.
This chapter covered the foundational elements of AI integration—understanding
large language models, architectural choices, and supporting technologies. We
explored different model types, deployment strategies, and the trade-offs between
open and closed-source solutions. We also introduced retrieval-augmented genera‐
tion (RAG), function calling, and tuning techniques that help integrate AI into
enterprise systems effectively.
The challenge for Java developers is not only in understanding AI models but in
applying them in ways that align with existing software practices. Whether optimiz‐
ing inference, selecting deployment methods, or refining model responses, the ability
to integrate AI effectively will shape future applications.
The next chapter will focus on structuring AI-powered applications, ensuring these
capabilities fit seamlessly into enterprise systems.
CHAPTER 3
Inference API
A Note for Early Release Readers
With Early Release ebooks, you get books in their earliest form—the author’s raw and
unedited content as they write—so you can take advantage of these technologies long
before the official release of these titles.
This will be the 5th chapter of the final book. Please note that the GitHub repo will be
made active later on.
If you have comments about how we might improve the content and/or examples in
this book, or if you notice missing material within this chapter, please reach out to the
editor at mpotter@oreilly.com.
You’ve already expanded your knowledge about AI and the many types of models. Moreover, you deployed these models locally (where possible) and tested them with some queries. But when it is time to use models, you need to expose them properly,
follow your organization’s best practices, and provide developers with an easy way to
consume the model.
An Inference API helps solve these problems, making models accessible to all devel‐
opers.
This chapter will explore how to expose an AI/ML model using an Inference API in
Java.
What is an Inference API?
An Inference API allows developers to send data (in any protocol, such as HTTP,
gRPC, Kafka, etc.) to a server with a machine learning model deployed and receive
the predictions or classifications as a result.
Practically, every time you access cloud models like OpenAI or Gemini or models
deployed locally using ollama, you do so through their Inference API.
Even though it is common these days to use big models trained by big corporations
like Google, IBM, or Meta, mostly for LLM purposes, you might need to use small
custom-trained models to solve one specific problem for your business.
Usually, these models are developed by your organization’s data scientists, and you must develop some code to run inference against them.
Let’s take a look at the following example:
Suppose you are working for a bank, and data scientists have trained a custom model
to detect whether a credit card transaction can be considered fraud.
The model is in ONNX format with five input parameters and one output parameter of type float.
As input parameters:
distance_from_last_transaction
The distance from the last transaction that happened. For example,
0.3111400080477545.
ratio_to_median_price
Ratio of purchased price transaction to median purchase price. For example,
1.9459399775518593.
used_chip
Whether the transaction was made through the chip. 1.0 if true, 0.0 if false.
used_pin_number
Whether the transaction was made using a PIN number. 1.0 if true, 0.0 if false.
online_order
Whether the transaction is an online order. 1.0 if true, 0.0 if false.
And the output parameter:
prediction
The probability the transaction is fraudulent. For example, 0.9625362.
A few things you might notice here are:
• Everything is a float, even when referring to a boolean like in the used_chip field.
• The output is a probability, but from the business point of view, you want to
know if there has been fraud.
• Developers prefer using classes instead of individual parameters.
This is a typical use case for creating an Inference API for the model to add an
abstraction layer that makes consuming the model easier.
Figure 3-1 shows the transformation between a JSON document and the model parameters performed by the Inference API:
Figure 3-1. Inference API Schema
The advantages of having an Inference API are:
• The models are easily scalable. The model has a standard API, and because of the
stateless nature of models, you can scale up and down as any other application of
your portfolio.
• The models are easy to integrate with any service as they offer a well-known API
(REST, Kafka, gRPC, …)
• It offers an abstraction layer to add features like security, monitoring, logging, …
Now that we understand why having an Inference API for exposing a model is
important let’s explore some examples of Inference APIs.
Examples of Inference APIs
Open (and not Open) Source tools offer an inference API to consume models from
any application. In most cases, the model is exposed using a REST API with a
documented format. The application only needs a REST Client to interact with the
model.
Nowadays, there are two popular Inference APIs that might become the de facto API
in the LLM space. We already discussed them in the previous chapter: one is OpenAI,
and the other is Ollama.
Let’s explore each of these APIs briefly. The idea is not to provide full documentation of these APIs but to give you concrete examples of Inference APIs so that, in case you develop one, you can get some ideas from them.
OpenAI
OpenAI offers different Inference APIs, such as chat completions, embeddings, image,
image manipulation, or fine tuning.
To interact with those models, create an HTTP request including the following parts:
• The HTTP method used to communicate with the API is POST.
• OpenAI uses a Bearer token to authenticate requests to the model. Hence, any call must have an HTTP header named Authorization with the value Bearer $OPENAI_API_KEY.
• The body content of the request is a JSON document.
In the case of chat completions, two fields are mandatory: the model to use and the
messages to send to complete.
An example of body content sending a simple question is shown in the following
snippet:
{
"model": "gpt-4o",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "What is the Capital of Japan?"
}
],
"temperature": 0.2
}
Model to use
Messages sent to the model with the role
Role system allows you to specify the way the model answers questions
Role user is the question
Temperature value defaults to 1.
And the response contains multiple fields, the most important being choices, which holds the responses calculated by the model:
{
"id": "chatcmpl-123",
...
"choices": [{
"index": 0,
"message": {
"role": "assistant",
"content": "\n\nThe capital of Japan is Tokyo.",
},
"logprobs": null,
"finish_reason": "stop"
}],
...
}
A list of chat completion choices.
The role of the author of this message.
The response of the message
In the case of embeddings, model and input fields are required:
{
"input": "This is a cat",
"model": "text-embedding-ada-002"
}
String to vectorize
Model to use
The response contains an array of floats in the data field containing the vector data:
{
"object": "list",
"data": [
{
"object": "embedding",
"embedding": [
0.0023064255,
-0.009327292,
.... (1536 floats total for ada-002)
-0.0028842222,
],
"index": 0
}
],
...
}
The vector data
These are two examples of OpenAI Inference API, but you can find the documenta‐
tion at https://platform.openai.com/docs/overview.
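To show what such a call looks like from plain Java, here is a minimal sketch using java.net.http.HttpClient to send the chat completions request above. It assumes the API key is available in the OPENAI_API_KEY environment variable and prints the raw JSON instead of parsing it.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class OpenAiChatExample {

    public static void main(String[] args) throws Exception {
        String apiKey = System.getenv("OPENAI_API_KEY");

        String body = """
            {
              "model": "gpt-4o",
              "messages": [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": "What is the Capital of Japan?"}
              ],
              "temperature": 0.2
            }
            """;

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://api.openai.com/v1/chat/completions"))
                .header("Authorization", "Bearer " + apiKey)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // Raw JSON response; the answer is inside choices[0].message.content.
        System.out.println(response.body());
    }
}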
Ollama
Ollama provides an Inference API to access LLM models that are running in Ollama. Ollama has taken a significant step forward by making itself compatible with the OpenAI Chat Completions API, making it possible to use more tooling and applications with Ollama. This effectively means that chat completions against models running in Ollama can be done either with the OpenAI API or with the Ollama API.
It uses the POST HTTP method, and the body content of the request is a JSON
document, requiring two fields, model and prompt:
{
"model": "llama3",
"prompt": "Why is the sky blue?",
"stream": false
}
Name of the model to send the request
Message sent to the model
The response is returned as a single response object rather than a stream
The response is:
{
"model": "llama3",
...
"response": "The sky is blue because it is the color of the sky.",
"done": true,
...
}
The generated response
In a similar way to OpenAI, Ollama provides an API for calculating embeddings. The request format is quite similar, requiring the model and input fields:
{
"model": "all-minilm",
"input": ["Why is the sky blue?"]
}
The response is a list of embeddings:
{
"model": "all-minilm",
"embeddings": [[
0.010071029, -0.0017594862, 0.05007221, 0.04692972, 0.054916814,
0.008599704, 0.105441414, -0.025878139, 0.12958129, 0.031952348
]]
}
These are two examples of the Ollama Inference API, but you can find the documentation at https://github.com/ollama/ollama/blob/main/docs/api.md.
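The same plain-Java approach shown for OpenAI works against a local Ollama instance. The sketch below assumes Ollama’s default local port 11434 and the /api/generate endpoint used above; no API key is needed.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class OllamaGenerateExample {

    public static void main(String[] args) throws Exception {
        String body = """
            {"model": "llama3", "prompt": "Why is the sky blue?", "stream": false}
            """;

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:11434/api/generate"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // Raw JSON response; the generated text is in the "response" field.
        System.out.println(response.body());
    }
}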
In these sections, we discussed why an Inference API is important and explored some
existing ones, mostly for LLM models.
Next, let’s get back to our fraud detection model introduced at the beginning of this
chapter. Let’s discuss how to implement an Inference API for the model and, even
more importantly, how to do it in Java.
In the next section, we’ll develop an Inference API in Java, deploy it, and send some
queries to validate its correct behavior.
Deploying Inference Models in Java
Deep Java Library (DJL) is an open-source Java project created by Amazon to build, train, test, and run inference on machine learning and deep learning models natively in Java.
DJL provides a set of APIs abstracting the complexity involved in developing deep learning models, providing a unified way of training and inferencing for the most popular AI/ML frameworks, such as Apache MXNet, PyTorch, TensorFlow, the ONNX format, or even the popular Hugging Face AutoTokenizer and Pipeline.
DJL contains a high-level abstraction layer that connects to the corresponding AI/ML
model to use, making a change on the runtime almost transparent from the Java
application layer.
You can configure DJL to use CPU or GPU; both modes are supported based on the
hardware configuration.
A model is just a file (or a set of files). DJL will load the model and offer a programmatic way to interact with it. The model can be trained using DJL or any other training tool (for example, Python’s scikit-learn) as long as the model is saved in a file format supported by DJL.
Figure 3-2 shows an overview of the DJL architecture. The bottom layer shows the integration between DJL and the CPU/GPU, the middle layer contains the native libraries that run the models, and these layers are controlled using plain Java:
Figure 3-2. DJL Architecture
Even though DJL provides a layer of abstraction, you still need a basic understanding of common machine learning concepts.
Inferencing models with DJL
The best way to understand DJL for inferencing models is to develop an example.
Let’s develop a Java application using DJL to create an Inference API to expose the
onnx fraud detection model described previously.
Let’s use Spring Boot to create a REST endpoint to infer the model. Figure 3-3 shows what we want to implement:
Figure 3-3. Spring Boot Rest API Schema
First, generate a simple Spring Boot application named fraud-detection with the Spring Web dependency. You can use Spring Initializr (https://start.spring.io/) to scaffold the project or start from scratch.
Figure 3-4 shows the Spring Initializr parameters for this example:
Figure 3-4. Spring Initializr
With the basic layout of the project, let’s work through the details, starting with
adding the DJL dependencies.
Dependencies
DJL offers multiple dependencies depending on the AI/ML framework used. The DJL project provides a Bill of Materials (BOM) dependency to manage the versions of
the project’s dependencies, offering a centralized location to define and update these
versions.
Add the BOM dependency (in the dependencyManagement section) in the pom.xml
file of the project:
<dependencyManagement>
<dependencies>
<dependency>
<groupId>ai.djl</groupId>
<artifactId>bom</artifactId>
<version>0.29.0</version>
<type>pom</type>
<scope>import</scope>
</dependency>
</dependencies>
</dependencyManagement>
Since the model is in onnx format, add the following dependency containing the
ONNX engine: onnxruntime-engine:
<dependency>
<groupId>ai.djl.onnxruntime</groupId>
<artifactId>onnxruntime-engine</artifactId>
</dependency>
No version is required as it is inherited from BOM
The next step is creating two Java records, one representing the request and another
representing the response.
POJOs
The request is a simple Java record with all the transaction details.
public record TransactionDetails(String txId,
float distanceFromLastTransaction,
float ratioToMedianPrice, boolean usedChip,
boolean usedPinNumber, boolean onlineOrder) {}
The response is also a Java record, returning a boolean indicating whether the transaction is fraudulent.
public record FraudResponse(String txId, boolean fraud) {
}
The next step is configuring and loading the model into memory.
Loading the model
We’ll use two classes to configure and load the fraud detection model: ai.djl.repository.zoo.Criteria and ai.djl.repository.zoo.ZooModel. Let’s look at each of
those in more detail:
Criteria
This class configures the location of the model and the interaction with it. Criteria supports loading models from multiple storages (local, S3, HDFS, URL) or lets you implement your own protocol (FTP, JDBC, …). Moreover, you configure the transformation from Java parameters to model parameters and vice versa.
ZooModel
The ModelZoo API offers a standardized method for loading models while
abstracting from the engine. Its declarative approach provides excellent flexibility
for testing and deploying the model.
Create a Spring Boot Configuration class to instantiate these classes. A Spring
Boot Configuration class needs to be annotated with @org.springframework.context.annotation.Configuration.
@Configuration
public class ModelConfiguration {
}
Then, create two methods, one instantiating a Criteria and the other one a ZooModel.
The first method creates a Criteria object with the following parameters:
• The location of the model file, in this case, the model is stored at classpath.
• The data type that developers send to the model, for this example, the Java record
created previously with all the transaction information.
• The data type returned by the model, a boolean indicating whether the given
transaction is fraudulent.
• The transformer to adapt the data types from Java code (TransactionDetails,
Boolean) to the model parameters (ai.djl.ndarray.NDList).
• The engine of the model.
@Bean
public Criteria<TransactionDetails, Boolean> criteria() {
String modelLocation = Thread.currentThread()
.getContextClassLoader()
.getResource("model.onnx").toExternalForm();
return Criteria.builder()
.setTypes(TransactionDetails.class, Boolean.class)
.optModelUrls(modelLocation)
.optTranslator(new TransactionTransformer(THRESHOLD))
.optEngine("OnnxRuntime")
.build();
}
The Criteria object is parametrized with the input and output types
Gets the location of the model within the classpath
Sets the types
The Model location
Instantiates the Transformer to adapt the parameter.
The Runtime. This is especially useful when more than one engine is present in
the classpath.
The second method creates the ZooModel instance from the Criteria object created
in the previous method:
@Bean
public ZooModel<TransactionDetails, Boolean> model(
@Qualifier("criteria") Criteria<TransactionDetails, Boolean> criteria)
throws Exception {
return criteria.loadModel();
}
Criteria object is injected
Calls the method to load the model
One piece is missing from the previous implementation, and it is the Transaction
Transformer class code.
Transformer
The transformer is a class implementing the ai.djl.translate.NoBatchifyTranslator to adapt the model’s input and output parameters to Java business classes.
The model input and output classes are of type ai.djl.ndarray.NDList, which
represents a list of arrays of floats.
For the fraud model, the input is an array in which the first position is the distanceFromLastTransaction parameter value, the second position is the value of ratioToMedianPrice, and so on. For the output, it is an array of one position with the
probability of fraud.
The transformer has the responsibility to have this knowledge and adapt it according
to the model.
Let’s implement one transformer for this use case:
public class TransactionTransformer
implements NoBatchifyTranslator<TransactionDetails, Boolean> {
private final float threshold;
public TransactionTransformer(float threshold) {
this.threshold = threshold;
}
@Override
public NDList processInput(TranslatorContext ctx, TransactionDetails input)
throws Exception {
NDArray array = ctx.getNDManager().create(toFloatRepresentation(input),
new Shape(1, 5));
return new NDList(array);
}
private static float[] toFloatRepresentation(TransactionDetails td) {
return new float[] {
td.distanceFromLastTransaction(),
td.ratioToMedianPrice(),
booleanAsFloat(td.usedChip()),
booleanAsFloat(td.usedPinNumber()),
booleanAsFloat(td.onlineOrder())
};
}
private static float booleanAsFloat(boolean flag) {
return flag ? 1.0f : 0.0f;
}
@Override
public Boolean processOutput(TranslatorContext ctx, NDList list)
throws Exception {
NDArray result = list.getFirst();
float prediction = result.toFloatArray()[0];
System.out.println("Prediction: " + prediction);
return prediction > threshold;
}
}
Interface with types to transform
Parameter set to decide when fraud is considered
Transforming business inputs to model inputs
Shape is the size of the array (5 parameters)
Process the output of the model
Calculates if the probability of fraud is beyond the threshold or not
With the model in memory, it is time to query it with some data.
Predict
The model is accessed through the ai.djl.inference.Predictor interface. The
predictor is the main class that orchestrates the inference process.
The predictor is not thread-safe, so performing predictions in parallel requires one
instance for each thread. There are multiple ways to handle this problem. One option
is creating the Predictor instance per request. Another option is to create a pool of
Predictor instances so threads can access them.
Moreover, it is very important to close the predictor when it is no longer required to
free memory.
Our advice here is to measure the performance of creating a Predictor instance per request and then decide whether that cost is acceptable or whether another option, such as pooling, is the better fit.
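As a minimal sketch of the pooling option (plain Java, not a DJL-provided API), a blocking queue can hand out Predictor instances to threads and take them back afterwards; pool sizing and error handling are simplified here, and the TransactionDetails record is the one defined earlier.
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

import ai.djl.inference.Predictor;
import ai.djl.repository.zoo.ZooModel;

// Sketch: a fixed-size pool of Predictor instances shared by request threads.
public class PredictorPool {

    private final BlockingQueue<Predictor<TransactionDetails, Boolean>> pool;

    public PredictorPool(ZooModel<TransactionDetails, Boolean> model, int size) {
        this.pool = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            pool.add(model.newPredictor());
        }
    }

    public boolean predict(TransactionDetails details) throws Exception {
        Predictor<TransactionDetails, Boolean> predictor = pool.take(); // blocks if none is free
        try {
            return predictor.predict(details);
        } finally {
            pool.put(predictor); // return the instance to the pool
        }
    }
}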
To implement the per-request strategy in Spring Boot, return a java.util.function.Supplier instance, so you have control over when the object is created and closed.
@Bean
public Supplier<Predictor<TransactionDetails, Boolean>>
predictorProvider(ZooModel<TransactionDetails, Boolean> model) {
return model::newPredictor;
}
Returns a Supplier instance of the parametrized Predictor
ZooModel created previously is injected
Creates the Supplier
The last thing to do is expose the model through a REST API.
REST Controller
To create a REST API in Spring Boot, annotate a class with @org.springframework.web.bind.annotation.RestController.
Moreover, since the request to detect fraud should go through the POST HTTP method, annotate the method with the @org.springframework.web.bind.annotation.PostMapping annotation.
The Predictor supplier instance is injected using the @jakarta.annotation.Resource annotation.
@RestController
public class FraudDetectionInferenceController {
@Resource
private Supplier<Predictor<TransactionDetails, Boolean>> predictorSupplier;
@PostMapping("/inference")
FraudResponse detectFraud(@RequestBody TransactionDetails transactionDetails)
throws TranslateException {
try (var p = predictorSupplier.get()) {
boolean fraud = p.predict(transactionDetails);
return new FraudResponse(transactionDetails.txId(), fraud);
}
}
}
Injects the supplier
Creates a new instance of the Predictor
Predictor implements Autoclosable, so try-with-resources is used
Makes the call to the model
Builds the response
The service is ready to start and expose the model.
Testing the example
Go to the terminal window, move to the application folder, and start the service by
calling the following command:
./mvnw clean spring-boot:run
Then send two requests to the service, one with no fraud parameters and another one
with fraud parameters:
// None Fraud Transaction
curl -X POST localhost:8080/inference \
-H 'Content-type:application/json' \
-d '{"txId": "1234",
"distanceFromLastTransaction": 0.3111400080477545,
"ratioToMedianPrice": 1.9459399775518593,
"usedChip": true,
"usedPinNumber": true,
"onlineOrder": false}'
// Fraud Transaction
curl -X POST localhost:8080/inference \
-H 'Content-type:application/json' \
-d '{"txId": "5678",
"distanceFromLastTransaction": 0.3111400080477545,
"ratioToMedianPrice": 1.9459399775518593,
"usedChip": true,
"usedPinNumber": false,
"onlineOrder": false}'
And the output of both requests are:
{"txId":"1234","fraud":false}
{"txId":"5678","fraud":true}
Moreover, if you inspect the Spring Boot console logs, you’ll see the calculated
probability of fraud done by the model.
Prediction: 0.4939952
Prediction: 0.9625362
Now, you’ve successfully run an Inference API exposing a model using only Java.
Let’s take a look at what’s happening under the hood when the application starts the DJL framework:
Under the hood
The JAR file doesn’t bundle the AI/ML engine for size reasons. In this example, if the JAR contained the ONNX runtime, it would have to include the ONNX runtime for all supported platforms: for operating systems like Linux or macOS and for all possible hardware, such as ARM or x86 architectures.
To avoid this problem, when we start an application using DJL, it automatically
downloads the model engine for the running architecture.
DJL uses cache directories to store model engine-specific native files; they are down‐
loaded only once. By default, cache directories are located in the .djl.ai directory
under the current user’s home directory.
You can change this by setting the DJL_CACHE_DIR system property or environment
variable. Adjusting this variable will alter the storage location for both model and
engine native files.
DJL does not automatically clean obsolete cache in the current version. Users can
manually remove unused models or native engine files.
If you plan to containerize the application, we recommend bundling the engine inside the container to avoid downloading it every time the container is started. This also improves start-up time.
One of the best features of the DJL framework is its flexibility in not requiring a spe‐
cific protocol for model inferencing. You can opt for the Kafka protocol if you have
an event-driven system or the gRPC protocol for high-performance communication.
Let’s see how the current example changes when using gRPC.
Inferencing Models with gRPC
gRPC is an open-source API framework following the Remote Procedure Call (RPC)
model. Although the RPC model is general, gRPC serves as a particular implementa‐
tion. gRPC employs Protocol Buffers and HTTP/2 for data transmission.
gRPC is only the protocol definition; every language and framework has an implementation of the two main elements of a gRPC application, the gRPC Server and the gRPC Stub.
gRPC Server
It is the server part of the application, where you define the endpoint and
implement the business logic.
gRPC Stub
It is the client part of the application, the code that makes remote calls to the
server part.
Figure 3-5 provides a high-level overview of the gRPC architecture of an application. You see a gRPC service implemented in Java and two clients connecting to this service (one in Java and the other in Ruby) using the Protocol Buffers format.
Figure 3-5. gRPC Architecture
gRPC offers advantages over REST when implementing high-performance systems,
with high data loads, or when you need real-time applications. In most cases, gRPC
is used for internal system communication, for example, between internal services in a microservices architecture. Our intention here is not to go deep into gRPC, but to show you the versatility of inferencing models with Java.
Throughout the book, you’ll see more ways of doing this, but for now let’s transform
the Fraud Detection example into a gRPC application.
Protocol Buffers
The initial step in using protocol buffers is to define the structure for the data you
want to serialize, along with the services, specifying the RPC method parameters and
return types as protocol buffer messages. This information is defined in a .proto file
used as the Interface Definition Language (IDL).
Let’s implement the gRPC Server in the Spring Boot project.
Create a fraud.proto file in src/main/proto with the following content expressing
the Fraud Detection contract.
syntax = "proto3";
option java_multiple_files = true;
option java_package = "org.acme.stub";
package fraud;
service FraudDetection {
rpc Predict (TxDetails) returns (FraudRes) {}
}
message TxDetails {
string tx_id = 1;
float distance_from_last_transaction = 2;
float ratio_to_median_price = 3;
bool used_chip = 4;
bool used_pin_number = 5;
bool online_order = 6;
}
message FraudRes {
string tx_id = 1;
bool fraud = 2;
}
Defines package where classes are going to be materialized
Defines Service name
Defines method signature
Defines the data transferred
Integer is the order of the field
With the contract API created, use a gRPC compiler to scaffold all the required classes
for implementing the server side.
Figure 3-6 summarizes the process:
Figure 3-6. gRPC Generation Code
Let’s create the gRPC server by reusing the Spring Boot project, but now implementing the Inference API for the Fraud Detection model using gRPC and Protocol Buffers.
Implementing the gRPC Server
To implement the server part, open the pom.xml file and add the dependencies for coding the gRPC server with the Spring Boot ecosystem. Add the Maven extension and plugin to automatically read the src/main/proto/fraud.proto file and generate the required stub and skeleton classes.
These generated classes are the data messages (TxDetails and FraudRes) and the
base classes containing the logic for running the gRPC server.
Add the following dependencies:
<dependency>
<groupId>io.grpc</groupId>
<artifactId>grpc-protobuf</artifactId>
<version>1.62.2</version>
</dependency>
<dependency>
<groupId>io.grpc</groupId>
<artifactId>grpc-stub</artifactId>
<version>1.62.2</version>
</dependency>
<dependency>
<groupId>net.devh</groupId>
<artifactId>grpc-server-spring-boot-starter</artifactId>
<version>3.1.0.RELEASE</version>
</dependency>
<dependency>
<groupId>javax.annotation</groupId>
<artifactId>javax.annotation-api</artifactId>
<version>1.3.2</version>
<scope>provided</scope>
<optional>true</optional>
</dependency>
<build>
...
<extensions>
<extension>
<groupId>kr.motd.maven</groupId>
<artifactId>os-maven-plugin</artifactId>
<version>1.7.1</version>
</extension>
</extensions>
...
<plugins>
<plugin>
<groupId>org.xolstice.maven.plugins</groupId>
<artifactId>protobuf-maven-plugin</artifactId>
<version>0.6.1</version>
<configuration>
<protocArtifact>
com.google.protobuf:protoc:3.25.1:exe:${os.detected.classifier}
</protocArtifact>
<pluginId>grpc-java</pluginId>
<pluginArtifact>
io.grpc:protoc-gen-grpc-java:1.62.2:exe:${os.detected.classifier}
</pluginArtifact>
</configuration>
<executions>
<execution>
<id>protobuf-compile</id>
<goals>
<goal>compile</goal>
<goal>test-compile</goal>
</goals>
</execution>
<execution>
<id>protobuf-compile-custom</id>
<goals>
<goal>compile-custom</goal>
<goal>test-compile-custom</goal>
</goals>
</execution>
</executions>
</plugin>
...
</plugins>
Adds an extension that gets OS information and stores it as system properties
Registers plugin to compile the protobuf file
Configures the plugin using properties set by the os-maven-plugin extension to
download the correct version of protobuf compiler
Links the plugin lifecycle to the Maven compile lifecycle
At this point, every time you compile the project through Maven, the protobuf-maven-plugin generates the required gRPC classes from the .proto file. These classes are generated in the target/generated-sources/protobuf directory, automatically added to the classpath, and packaged in the final JAR file.
Some IDEs don’t recognize these directories as source code, giving
you compilation errors. To avoid these problems, register these
directories as source directories in IDE configuration or using
Maven.
In a terminal window, run the following command to generate these classes:
./mvnw clean compile
The final step is to implement the business logic executed by the gRPC server.
The generated classes are placed in the package defined by the java_package option in the fraud.proto file; in this case, it is org.acme.stub.
To implement the service, create a new class annotated with @net.devh.boot.grpc.server.service.GrpcService and extend the base class org.acme.stub.FraudDetectionGrpc.FraudDetectionImplBase, generated previously by the protobuf plugin, which contains all the code for binding the service.
@GrpcService
public class FraudDetectionInferenceGrpcController
extends org.acme.stub.FraudDetectionGrpc.FraudDetectionImplBase {
}
The base class name is derived from the service name defined in the fraud.proto file
Since the project uses the Spring Boot framework, you can inject dependencies using
the @Autowired or @Resource annotations.
Inject the ai.djl.inference.Predictor supplier, as done in the REST Controller, to access the model:
@Resource
private Supplier<Predictor<org.acme.TransactionDetails, Boolean>>
predictorSupplier;
Finally, implement the rpc method defined in the fraud.proto file under the FraudDetection service. This method is the remote method invoked when the gRPC client makes
the request to the Inference API.
Because of the streaming nature of gRPC, the response is sent using a reactive call
through the io.grpc.stub.StreamObserver class.
@Override
public void predict(TxDetails request,
StreamObserver<FraudRes> responseObserver) {
org.acme.TransactionDetails td =
new org.acme.TransactionDetails(
request.getTxId(),
request.getDistanceFromLastTransaction(),
request.getRatioToMedianPrice(),
request.getUsedChip(),
request.getUsedPinNumber(),
request.getOnlineOrder()
);
try (var p = predictorSupplier.get()) {
boolean fraud = p.predict(td);
FraudRes fraudResponse = FraudRes.newBuilder()
.setTxId(td.txId())
.setFraud(fraud).build();
responseObserver.onNext(fraudResponse);
responseObserver.onCompleted();
} catch (TranslateException e) {
throw new RuntimeException(e);
}
}
RPC Method receives input parameters and the StreamObserver instance to send
the output result.
Transforms the gRPC messages to DJL classes.
Gets the predictor as we did in the REST Controller
Creates the gRPC message for the output
Sends the result
Finishes the stream for the current request
Both REST and gRPC implementations can coexist in the same project. Start the
service with the spring-boot:run goal to notice that both endpoints are available:
./mvnw clean spring-boot:run
o.s.b.w.embedded.tomcat.TomcatWebServer: Tomcat started on port 8080
(http) with context path '/'
n.d.b.g.s.s.GrpcServerLifecycle: gRPC Server started,
listening on address: *, port: 9090
Sending requests to a gRPC server is not as easy as with REST; you can use tools
like grpc-client-cli (https://github.com/vadimi/grpc-client-cli), but in the following
chapter, you’ll learn how to access both implementations from Java.
Next Steps
You’ve completed the first step in inferencing models in Java. DJL has more advanced
features, such as training models, automatic download of popular models (resnet,
yolo, …), image manipulation utilities, or transformers.
This chapter’s example was simple, but depending on the model, things might be
more complicated, especially when images are involved.
In later chapters, we’ll explore more complex examples of inferencing models using
DJL and show you other useful enterprise use cases and models.
In the next chapter, you’ll learn how to consume the Inference APIs defined in this
chapter before diving deep into DJL.
CHAPTER 4
Accessing the Inference Model with Java
A Note for Early Release Readers
With Early Release ebooks, you get books in their earliest form—the author’s raw and
unedited content as they write—so you can take advantage of these technologies long
before the official release of these titles.
This will be the 6th chapter of the final book. Please note that the GitHub repo will be
made active later on.
If you have comments about how we might improve the content and/or examples in
this book, or if you notice missing material within this chapter, please reach out to the
editor at mpotter@oreilly.com.
In the previous chapter, you learned to develop and expose a model that produces predictions through an Inference API. That chapter covered half of the development; you only
learned how to expose the model, but how about consuming this model from another
service? Now it is time to cover the other half, which involves writing the code to
consume the API.
In this chapter, we’ll complete the previous example, you’ll create Java clients to
consume the Fraud Inference APIs to detect if a given transaction can be considered
fraudulent or not.
We’ll show you how to write clients for Spring Boot and Quarkus using both REST and gRPC.
Connecting to an Inference API with Quarkus
Quarkus provides two methods for implementing REST Clients:
• The Jakarta REST Client is the standard Jakarta EE approach for interacting with
RESTful services.
• The MicroProfile REST Client provides a type-safe approach to invoke RESTful
services over HTTP using as much of the Jakarta RESTful Web Services spec
as possible. The REST client is defined as a Java interface, making it type-safe
and providing the network configuration using Jakarta RESTful Web Services
annotations.
In this section, you’ll develop a Quarkus service consuming the Fraud Detection
model using the MicroProfile REST Client.
Architecture
Let’s create a Quarkus service sending requests to the Fraud Service Inference API
developed in the previous chapter.
This service contains a list of all transactions done and exposes an endpoint to
validate whether a given transaction ID can be considered fraudulent.
Figure 4-1 shows the architecture of what you’ll be implementing in this chapter. The Quarkus service receives an incoming request to validate whether a given transaction is fraudulent. The service gets the transaction information from the database and sends the data to the fraud detection service to check whether the transaction is fraudulent. Finally, the result is stored in the database and returned to the caller.
Figure 4-1. Overview of the architecture
Let’s remember the document format returned by the inference API, as it is important
to implement it correctly on the client side.
The Fraud Inference API
The Fraud Inference API developed in the previous chapter uses the HTTP POST
method, exposing the /inference endpoint and JSON documents as body requests
and responses.
An example of body content could be:
{
"txId": "5678",
"distanceFromLastTransaction": 0.3111400080477545,
"ratioToMedianPrice": 1.9459399775518593,
"usedChip": true,
"usedPinNumber": false,
"onlineOrder": false
}
And a response:
{
"txId":"5678",
"fraud":true
}
Let’s scaffold a Quarkus project to implement the consumer part.
Creating the Quarkus project
First, generate a simple Quarkus application with REST Jackson and REST Client
Jackson dependencies. You can use Code Quarkus to scaffold the project or start from
scratch.
With the basic layout of the project, let’s write the REST Client using the MicroProfile
REST Client spec.
REST Client interface
Create the org.acme.FraudDetectionService interface to interact with the Inference
API.
In this interface, you define the following information:
• The connection information using Jakarta EE annotations (@jakarta.ws.rs.Path for the endpoint and @jakarta.ws.rs.POST for the HTTP method).
• The classes used for body content and response.
• Annotate the class as a REST Client with the @org.eclipse.microprofile.rest.client.inject.RegisterRestClient annotation and set the client’s name.
@Path("/inference")
@RegisterRestClient(configKey = "fraud-model")
public interface FraudDetectionService {
@POST
FraudResponse isFraud(TransactionDetails transactionDetails);
}
Remote Path to connect
Sets the interface as REST client
The request uses the POST HTTP method
TransactionDetails is serialized to JSON as body message
FraudResponse is serialized to JSON as response
The host to connect to is set in the application.properties file with the quarkus.rest-client.<configKey>.url property. Open the src/main/resources/application.properties file and add the following line:
quarkus.rest-client.fraud-model.url=http://localhost:8080
configKey value was set to fraud-model in the RegisterRestClient annotation
The inference API is deployed locally
With a few lines, we’ve developed the REST Client and it’s ready for use. The next step
is creating the REST endpoint.
REST Resource
The next step is to create the REST endpoint, which will call the REST client created
earlier. The endpoint is set up to handle requests using the GET HTTP method, and
it is implemented with the @jakarta.ws.rs.GET annotation. The transaction ID is passed as a path parameter using the @jakarta.ws.rs.PathParam annotation.
To use the REST client, inject the interface using the @org.eclipse.microprofile.rest.client.inject.RestClient annotation.
Create a class called TransactionResource with the following content:
@Path("/fraud")
public class TransactionResource {
// ....
@RestClient
FraudDetectionService fraudDetectionService;
@GET
@Path("/{txId}")
public FraudResponse detectFraud(@PathParam("txId") String txId) {
final TransactionDetails transaction = findTransactionById(txId);
final FraudResponse fraudResponse = fraudDetectionService.isFraud(transaction);
markTransactionFraud(fraudResponse.txId(), fraudResponse.fraud());
return fraudResponse;
}
// ....
Interface is injected
Defines the path param
Injects the path param value as method parameter
Executes the remote call to Inference API
The service is ready to start using the inference model.
Set the quarkus.http.port=8000 property in the application.properties file to start this service on port 8000 so it doesn’t collide with the Spring Boot port.
Testing the example
To test the example, you need to start the Spring Boot service developed in the
previous chapter and the Quarkus service developed in this chapter.
In one terminal window, navigate to the Fraud Detection Inference directory and
start the Spring Boot service by running the following command:
./mvnw clean spring-boot:run
In another terminal window, start the Quarkus service running the following com‐
mand:
./mvnw quarkus:dev
With both services running, send the following request to the TransactionResource
endpoint:
curl localhost:8000/fraud/1234
{"txId":"1234","fraud":false}
You consumed an Inference API using Quarkus; in the next section, we’ll implement the same consumer using Spring Boot.
Connecting to an Inference API with Spring Boot
WebClient
Let’s implement a REST client but at this time using Spring Boot WebFlux classes.
WebClient is an interface serving as the primary entry point for executing web
requests, replacing the traditional RestTemplate classes. Furthermore, this new client
is a reactive, non-blocking solution that operates over the HTTP/1.1 protocol, but it is also suitable for synchronous operations.
Adding WebClient Dependency
We can use WebClient for synchronous and asynchronous operations, but the client is part of the reactive dependencies.
Add the following dependency if the project is not already a WebFlux service:
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-webflux</artifactId>
</dependency>
With the dependency registered, let’s implement the code to make REST calls.
Using the WebClient
To call remote REST services, instantiate the org.springframework.web.reactive.function.client.WebClient interface as a class attribute. Then, we'll use this interface to create the request call and retrieve the result.
private final WebClient webClient;
public TransactionController() {
webClient = WebClient.create("http://localhost:8080");
}
@GetMapping("/fraud/{txId}")
FraudResponse detectFraud(@org.springframework.web.bind.annotation.PathVariable
String txId) {
final TransactionDetails transaction = findTransactionById(txId);
final ResponseEntity<FraudResponse> fraudResponseResponseEntity = webClient.post()
.uri("/inference")
.body(Mono.just(transaction), TransactionDetails.class)
.retrieve()
.toEntity(FraudResponse.class)
.block();
return fraudResponseResponseEntity.getBody();
}
Creates and configures the WebClient
Instantiates a new instance to execute a POST
Sets the path part
Body content
Executes the call
Transforms the async call to sync
So far, you have used both frameworks to consume an Inference API with two differ‐
ent approaches: declarative and programmatic. You can integrate any Java (REST)
HTTP client without issues.
Let’s now implement the same logic for consuming a model but using gRPC protocol
instead of REST.
Connecting to the Inference API with Quarkus gRPC client
Let’s build a gRPC client with Quarkus to access the Fraud Detection model exposed
as the gRPC server built in the previous chapter.
As you did when implementing the server-side part, you need to generate the gRPC
Stub from the protobuf file.
Quarkus only requires you to register quarkus-grpc and quarkus-rest extensions.
Adding gRPC Dependencies
Open the pom.xml file of the Fraud Client project and, under the dependencies section, add the following dependency:
<dependency>
<groupId>io.quarkus</groupId>
<artifactId>quarkus-grpc</artifactId>
</dependency>
You should already have the quarkus-rest dependency registered
as you are reusing the project.
With the dependency registered, let’s implement the code to make gRPC calls.
Implementing the gRPC Client
Create the fraud.proto file under src/main/proto directory with the following
content:
syntax = "proto3";
option java_multiple_files = true;
option java_package = "org.acme.stub";
package fraud;
service FraudDetection {
rpc Predict (TxDetails) returns (FraudRes) {}
}
message TxDetails {
string tx_id = 1;
float distance_from_last_transaction = 2;
float ratio_to_median_price = 3;
bool used_chip = 4;
bool used_pin_number = 5;
bool online_order = 6;
}
message FraudRes {
string tx_id = 1;
bool fraud = 2;
}
It is the same file created in the server-side project, so you can copy
it or put it in a shared project and add the project as an external
dependency.
With this setup, you can place the protobuf file in the src/main/proto directory. The
quarkus-maven-plugin (already present in any Quarkus project) will then generate
Java files from the proto files.
Under the hood, the quarkus-maven-plugin fetches a compatible version of protoc
from Maven repositories based on your OS and CPU architecture.
At this point, every time you compile the project through Maven, the quarkus-
maven-plugin generates the required gRPC classes from the .proto file. These classes
are generated at the target/generated-sources/grpc directory, automatically added
to the classpath, and packaged in the final JAR file.
Some IDEs don’t recognize these directories as source code, giving
you compilation errors. To avoid these problems, register these
directories as source directories in IDE configuration or using
Maven.
In a terminal window, run the following command to generate these classes:
./mvnw clean compile
The final step is sending requests using the gRPC client, which uses the classes
generated in the previous step.
Inject the generated service interface with the name org.acme.stub.FraudDetection
into the TransactionResource class using the @io.quarkus.grpc.GrpcClient anno‐
tation.
@GrpcClient("fraud")
FraudDetection fraud;
Inject the service and configure its name.
Quarkus provides a runtime implementation for the interface that is similar to the
REST Client.
When implementing the server-side part, you have seen that gRPC applications are
inherently reactive. Quarkus uses the project Mutiny to implement reactive applica‐
tions, similar to Spring Boot WebFlux or ReactiveX, and it integrates smoothly with
gRPC.
Mutiny uses the io.smallrye.mutiny.Uni class to represent a lazy asynchronous
operation that generates a single item. Since the Fraud Detection service returns a
single result (fraud or not), the Uni class is used as the return type by the gRPC client.
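If you haven't used Mutiny before, here is a minimal, self-contained sketch of how a Uni pipeline is composed and consumed; the item value is just an illustrative placeholder and is unrelated to the fraud example:
import io.smallrye.mutiny.Uni;

public class UniExample {
    public static void main(String[] args) {
        // A Uni is lazy: nothing runs until a subscriber is attached
        Uni<String> pipeline = Uni.createFrom().item("fraud-check")
                .onItem().transform(String::toUpperCase);

        // Subscribing triggers the pipeline and prints FRAUD-CHECK
        pipeline.subscribe().with(System.out::println);
    }
}
In the gRPC client below, Quarkus creates the Uni for you; your code only transforms the item it emits.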
Let’s implement a new endpoint to verify if a transaction is fraudulent, but using
gRPC instead of REST.
@GET
@Path("/grpc/{txId}")
public Uni<FraudResponse> detectFraudGrpcClient(
@PathParam("txId") String txId) {
final TransactionDetails tx = findTransactionById(txId);
final TxDetails txDetails = TxDetails.newBuilder()
.setTxId(txId)
.setDistanceFromLastTransaction(tx.distanceFromLastTransaction())
.setRatioToMedianPrice(tx.ratioToMedianPrice())
.setOnlineOrder(tx.onlineOrder())
.setUsedChip(tx.usedChip())
.setUsedPinNumber(tx.usedPinNumber())
.build();
final Uni<FraudRes> predicted = fraud.predict(txDetails);
return predicted
.onItem()
.transform(fr -> new FraudResponse(fr.getTxId(), fr.getFraud()));
}
Reactive endpoint, not necessary to block for the result
gRPC input message
Makes the remote call
For the message item returned by the service
Transforms the gRPC message to required output type
The last step before running the example is configuring the remote service location in
the application.properties file:
quarkus.grpc.clients.fraud.host=localhost
quarkus.grpc.clients.fraud.port=9090
fraud is the name used in the @GrpcClient annotation
These are all the steps required for using a gRPC client in a Quarkus application.
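Assuming the gRPC inference server from the previous chapter is listening on localhost:9090, you can exercise the new endpoint just like the REST-backed one:
curl localhost:8000/fraud/grpc/1234
The response should be the same kind of JSON payload shown earlier for the REST client.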
Going Beyond
So far, we have looked at consuming Inference APIs over REST or gRPC using standard Java libraries that were not specifically designed for AI/ML. This approach works well
for cases where the model is stateless and can be used for a single purpose, such as
detecting fraud or calculating embeddings.
However, when using large language models like Llama3, OpenAI, and Mistral, a
plain REST client might not be sufficient to meet all the requirements. For instance:
• Models are stateless, but in some scenarios, it’s crucial to know what was asked
before in order to generate a correct answer. Generic clients do not have memory
features.
• Using RAG is not directly supported by clients.
• There is no agent support.
• You need to implement the specific Inference API for each model you use.
For these reasons, there are some projects in the Java ecosystem to address these
limitations. The most popular one is LangChain4J.
In the next chapter, we'll introduce you to the LangChain4J project and discuss how to use it when interacting with LLMs.
CHAPTER 5
Image Processing
A Note for Early Release Readers
With Early Release ebooks, you get books in their earliest form—the author’s raw and
unedited content as they write—so you can take advantage of these technologies long
before the official release of these titles.
This will be the 9th chapter of the final book. Please note that the GitHub repo will be
made active later on.
If you have comments about how we might improve the content and/or examples in
this book, or if you notice missing material within this chapter, please reach out to the
editor at mpotter@oreilly.com.
The previous chapters covered the basics of integrating Java with AI/ML projects. You
also learned how to load and infer models in Java using DJL and consume them with
LangChain4J. For the remaining part of this book, we will build upon this knowledge
to implement more advanced use cases closer to what you might encounter in a real
project.
One common use of AI in projects involves image processing for classification or
information extraction tasks. The input image can be a single photo provided by a
user or a stream of images from a device like a camera.
Here are some examples of image processing use cases:
• Detecting objects or people, such as in a security surveillance system.
• Classifying images by content, such as categorizing products.
• Extracting information from documents, like ID cards or passports.
• Reading vehicle license plates, for example, in speed cameras.
One common aspect of all these use cases is the need to prepare the image before it is
processed by the model. This can involve tasks such as resizing the image to meet the
model’s input size requirements, squaring the image for central cropping, or applying
other advanced algorithms like Gaussian filtering or the Canny algorithm to aid the
model in image detection, classification, or processing.
This chapter does not discuss image processing algorithms but instead provides a
basic understanding of when and how they can be applied in Java. After completing
this chapter, you’ll be able to effectively use these image-processing algorithms for
some use cases provided by data engineers or vision experts.
But before getting into image processing, let's understand what an image is and how it is stored in memory, since that is key to comprehending how image processing operates.
An image is made up of pixels, with each pixel representing a point in the image. The
total number of pixels depends on the image’s dimensions (width and height).
Each pixel contains information about that point, including color, opacity, and other
attributes based on the image format. For instance:
• In a grayscale image, a pixel is an integer between 0 and 255, where 0 represents
black and 255 represents white.
• In an RGB (Red, Green, Blue) image, a pixel is represented by a group of three
integers for the red, green, and blue color components. For example, the values 255, 0, and 255 produce magenta.
• In an RGBA (Red, Green, Blue, Alpha) image, a pixel is represented by four
integers, including RGB and opacity.
In the scenario of a 4x4 image (16 pixels in total) in RGB format, the image comprises three 4x4 matrices (one per color channel) of integers ranging from 0 to 255.
Figure 5-1 illustrates this decomposition:
Figure 5-1. Image decomposition
Image processing is applying changes to the matrix at the pixel level, for example,
changing a value close to zero to a strict zero.
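As a minimal, library-free sketch of this idea, the following snippet models a hypothetical 4x4 greyscale image as a plain integer matrix and clamps near-zero values to a strict zero; the rest of this chapter performs the same kind of pixel-level work through OpenCV matrices:
public class PixelThreshold {
    public static void main(String[] args) {
        // A hypothetical 4x4 greyscale image: each value is a pixel between 0 and 255
        int[][] pixels = {
                {0, 3, 120, 255},
                {7, 0, 200, 180},
                {2, 90, 60, 5},
                {10, 0, 1, 240}
        };

        // Pixel-level processing: values close to zero become a strict zero
        int threshold = 10;
        for (int y = 0; y < pixels.length; y++) {
            for (int x = 0; x < pixels[y].length; x++) {
                if (pixels[y][x] < threshold) {
                    pixels[y][x] = 0;
                }
            }
        }
    }
}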
Let’s explore how to do image processing in Java.
OpenCV
The Open Source Computer Vision Library (OpenCV, https://opencv.org/) is a C++ library released under the Apache 2 license for programming real-time computer vision algorithms and image manipulation. The library implements over 2500 algorithms.
OpenCV supports GPU acceleration, making it well suited for large-scale or real-time image processing.
The main operations supported by OpenCV for image processing are:
Image Acquisition
Obtain images by loading them from a disk or capturing them from a cam‐
era. This operation can include resizing, color conversion, cropping, and other
adjustments.
Image Enhancement
Modify image levels, such as brightness and contrast, to improve visual quality.
Image Restoration
Correct defects that degrade an image, such as noise or motion blur.
Color Image Processing
Adjust colors through color balancing, color correction, or auto-white balance.
Morphological Processing
Analyze shapes within images to extract useful information using algorithms like
dilation and erosion.
Segmentation
Divide an image into multiple regions for detailed scene analysis.
Even though OpenCV is written in C++, there is a Java binding project named OpenPnP OpenCV (https://github.com/openpnp/opencv) that uses the Java Native Interface (JNI) to load and use OpenCV natively in Java. The classes and method names used in the Java binding are similar (if not the same) to those in the OpenCV C++ project, facilitating the adoption of the Java library.
To get started with OpenPnP OpenCV for Java (from this point on, we'll refer to it simply as OpenCV), register the following dependency in your build tool:
<dependency>
<groupId>org.openpnp</groupId>
<artifactId>opencv</artifactId>
<version>4.9.0-0</version>
</dependency>
You can start using OpenCV for Java, as the JAR file bundles the OpenCV native
library for most platforms and architectures.
Initializing the Library
With the library present on the classpath, load the native library into memory. This can be done in two ways in OpenCV:
Manual Installation
The first involves manually installing the OpenCV native library on the system and
then calling System.loadLibrary(org.opencv.core.Core.NATIVE_LIBRARY_NAME),
usually in a static block.
Bundled Installation
The second one calls the nu.pattern.OpenCV.loadLocally() method.
This call will attempt to load the library exactly once per class loader. Initially, it will
try to load from the local installation, which is equivalent to the previous method.
If this attempt fails, the loader will copy the binary from its dependency JAR file to a
temporary directory and add that directory to java.library.path.
The library will remove these temporary files during a clean shutdown.
In this book, we advocate for the bundled installation to get started by calling OpenCV.loadLocally(), as no extra steps are required, like installing a library on your system.
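A common pattern is to trigger the bundled loading once from a static initializer of the class that uses OpenCV; a minimal sketch (the class name is hypothetical):
import nu.pattern.OpenCV;
import org.opencv.core.Mat;

public class OpenCvBootstrap {
    static {
        // Loads the bundled native library once per class loader
        OpenCV.loadLocally();
    }

    public static void main(String[] args) {
        // The native library is now loaded, so OpenCV classes can be used
        Mat empty = new Mat();
        System.out.println("Matrix size: " + empty.size());
    }
}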
With the library loaded, you can start using OpenCV classes.
The library has multiple classes as a point of entry; the most important ones are
org.opencv.imgproc.Imgproc and org.opencv.imgcodecs.Imgcodecs because they
contain the main methods and constants to do image processing.
Let’s explore the basic operations for loading and saving images.
Load and Save Images
To load an image, OpenCV offers the org.opencv.imgcodecs.Imgcodecs.imread
method.
protected org.opencv.core.Mat loadImage(Path image) {
return Imgcodecs.imread(
image.toAbsolutePath().toString()
);
}
imread returns the image as a matrix representation
The image location is a String
And the equivalent method for saving an image:
protected void saveImage(Mat mat, Path path) {
Imgcodecs.imwrite(
path.toAbsolutePath().toString(),
mat);
}
Materializes the given matrix to the path
The destination location is a String
This API is useful when transforming an image file to an image matrix or materializing a matrix to an image file. However, in some cases, the source or destination of the photo is not a file but a byte[]. For these cases, OpenCV has the org.opencv.core.MatOfByte class.
The following method shows how to transform a byte[]/java.io.InputStream to
an org.opencv.core.Mat:
private Mat fromStream(InputStream is) throws IOException {
final byte[] bytes = toByteArray(is);
return Imgcodecs.imdecode(
new MatOfByte(bytes),
Imgcodecs.IMREAD_UNCHANGED
);
}
Reads the InputStream
Use the imdecode method to decode from bytes to an image matrix
Creates a matrix from the byte[]
Similarly, you can transform an image matrix to a byte[]:
private InputStream toStream(Mat mat) {
MatOfByte output = new MatOfByte();
Imgcodecs.imencode(".jpg", mat, output);
return new ByteArrayInputStream(
output.toArray()
);
}
Encode image matrix into a MatOfByte object
Gets the content as byte[]
Now that you know how to load and save images in OpenCV, let’s explore actual
image processing with basic transformations.
Basic Transformations
It’s not uncommon for AI models that use images as input parameters to require
some image processing as a precondition for analyzing the image. This process can
affect the image size, requiring you to crop or resize images, or the number of color
layers, requiring you to remove the alpha channel or transform to greyscale.
To Greyscale
To convert an image to any color space, use the Imgproc.cvtColor method. The
typical conversion is to greyscale, running the following method:
private Mat toGreyScale(Mat original) {
Mat greyscale = new Mat();
Imgproc.cvtColor(original, greyscale, Imgproc.COLOR_RGB2GRAY);
return greyscale;
}
Original RGB photo
Creates the object to store the conversion
Converts to greyscale
Other possible conversions include COLOR_BGR2HLS to convert from BGR to HLS (Hue, Lightness, Saturation), COLOR_RGBA2GRAY to convert from RGBA to greyscale, or COLOR_GRAY2RGB to convert greyscale to RGB, to mention some of them. All constants starting with COLOR_ in the Imgproc class refer to color conversions.
Resize
To resize an image, use the Imgproc.resize method, setting the new size of the
image (or the resize ratio) and the interpolation method.
private Mat resize(Mat original, double ratio) {
Mat resized = new Mat();
Imgproc.resize(original, resized,
new Size(),
ratio,
ratio,
Imgproc.INTER_LINEAR
);
return resized;
}
Creates the object to store the resized image
Output image size, if not set, uses the ratio
Scale factor along the horizontal axis
Scale factor along the vertical axis
Interpolation method
Other possible interpolation methods include INTER_CUBIC for cubic interpolation, or
INTER_LANCZOS4 for Lanczos interpolation method.
Sometimes, it is not possible to resize an image without deforming it. For example,
the model requires a 1:1 aspect ratio, while the input image is in the 16:9 ratio.
Resizing is an option, but at the cost of deforming the image. Another option is
to crop the image, focusing on the important part of the image. There are many
different algorithms for finding the important part using vision algorithms, but in
most cases, a center crop of the image with the required aspect ratio works correctly.
Crop
Let’s crop the center of an image into a square, using the org.opencv.core.Rect to
define the valid rectangle of an image.
To implement this crop, you need to play a bit with math to calculate the exact
coordinates to find the starting cropping point, as the crop size is already set. Let’s
take an overview of the steps required to calculate the starting point:
1. Calculate the center of the image
2. Determine the starting point for cropping
3. Ensure the cropped image is within image boundaries
4. Crop the image with the crop size defined from the starting point.
Figure 5-2 shows each of these points in a photo of 1008 x 756px with a crop size of
400px:
Figure 5-2. Image with Points for Center Cropping
The following snippet shows the implementation of the center cropping algorithm
using OpenCV:
private Mat centerCrop(Mat original, int cropSize) {
int centerX = original.cols() / 2;
int centerY = original.rows() / 2;
int startX = centerX - (cropSize / 2);
int startY = centerY - (cropSize / 2);
startX = Math.max(0, startX);
startY = Math.max(0, startY);
int cropWidth = Math.min(cropSize, original.cols() - startX);
int cropHeight = Math.min(cropSize, original.rows() - startY);
Rect r = new Rect(startX, startY, cropWidth, cropHeight);
return new Mat(original, r);
}
Calculate the center of the image
Calculate the top-left corner of the crop area
Ensure the crop area is within the image boundaries
Generates a rectangle with the valid section of the image
Generates a new image matrix with only the part limited by the rectangle
The image processed with the cropping algorithm results in the following output:
Figure 5-3. Center Cropped Image
At this point you’re familiar with basic image manipulation algorithms. In the next
section, you’ll see how to overlay elements in an image, such as another image,
rectangles, or text.
Overlaying
When implementing AI/ML models involving an image, the model usually either
returns a string representing the categorization of the image (e.g., boots, sandals,
shoes, slippers) or a list of coordinates of what the model is detecting within the
image (e.g., cats, dogs, humans, etc.).
In this latter use case, drawing rectangles with labels in the image is useful to show
the viewer what and where the model has detected the points of interest.
Figure 5-4 shows an example of an output image with an overlay showing a detected
hand:
Figure 5-4. Image with the detected element
Let’s explore how to overlying elements in an image using Open CV:
Draw Boundaries
OpenCV provides two methods for drawing rectangles or overlaying texts on an
image: Imgproc.rectangle and Imgproc.putText.
Let’s implement a method to draw the boundaries of an object. The algorithm checks
the length of the text to adapt the width of the given rectangle in case the text is wider than the rectangle:
protected Mat drawRectangleWithText(Mat original, Rect rectangle,
Scalar color, String text) {
final double fontScale = 0.9d;
final int fontThickness = 3;
final int rectangleThickness = 3;
final int font = Imgproc.FONT_HERSHEY_SIMPLEX;
Mat destination = original.clone();
final Size textSize = Imgproc.getTextSize(text, font, fontScale,
fontThickness, null);
if (textSize.width > rectangle.width) {
rectangle.width = (int) textSize.width;
}
Imgproc.rectangle(destination, rectangle, color,
rectangleThickness);
Imgproc.putText(destination, text,
new Point(rectangle.x, rectangle.y - 10),
font, fontScale, color, fontThickness);
return destination;
}
Defines default values for font scale, font thickness, font and rectangle thickness
Copies the original image to not modify it
Gets the size of the text when materialized in the image
Checks if the label width is bigger than the rectangle width
Draws the rectangle
Moves the text coordinates 10 pixels above the rectangle not to overlap
Embeds the text into the given point
Another option is to not only draw the rectangle's border (or any other shape) but also fill it with some color, optionally with some transparency.
In the following example, you’ll create a rectangle filled with green color and a
transparent layer so the main image is partially visible. This is done using the
org.opencv.core.Core.addWeighted method.
protected Mat fillRectangle(Mat src, Rect rect, Scalar color, double alpha) {
final Mat overlay = src.clone();
Imgproc.rectangle(overlay, rect, color,
-1);
Mat output = new Mat();
Core.addWeighted(overlay, alpha, src, 1 - alpha, 0, output);
return output;
}
Creates a copy of the original image
Creates a rectangle
Fills the rectangle with the color
Output matrix
Blends the images with transparency
Figure 5-5 shows the result:
Figure 5-5. Image with rectangle
In addition to drawing lines, rectangles, polygons, or circles, you can also overlay a
transparent image onto another image. For example, this might be useful to hide any
detected element, such as the face of a minor or sensitive data.
Overlap Images
Let’s implement a method that overlays a foreground image onto a background image
at a specified position.
It handles transparency by blending the pixel values of the foreground and back‐
ground images based on the foreground image’s alpha channel (opacity).
The steps followed by the algorithm are:
1. Convert the background and foreground images to the RGBA color space so they
can handle the transparency.
2. Copies the pixel values from the foreground to the background image only if the
location of the foreground pixel is not outside the background boundaries.
3. For each channel, the foreground and the background pixel values are blended
based on the opacity value.
4. The composed image is returned as an image matrix.
private Mat overlayImage(Mat background, Mat foreground,
Point location) throws IOException {
Mat bg = new Mat();
Mat fg = new Mat();
Imgproc.cvtColor(background, bg, Imgproc.COLOR_RGB2RGBA);
Imgproc.cvtColor(foreground, fg, Imgproc.COLOR_RGB2RGBA);
for (int y = (int) Math.max(location.y , 0); y < bg.rows(); ++y) {
int fY = (int) (y - location.y);
if(fY >= fg.rows())
break;
for (int x = (int) Math.max(location.x, 0); x < bg.cols(); ++x) {
int fX = (int) (x - location.x);
if(fX >= fg.cols()){
break;
}
double opacity;
double[] finalPixelValue = new double[4];
opacity = fg.get(fY, fX)[3];
finalPixelValue[0] = bg.get(y, x)[0];
finalPixelValue[1] = bg.get(y, x)[1];
finalPixelValue[2] = bg.get(y, x)[2];
finalPixelValue[3] = bg.get(y, x)[3];
for(int c = 0; c < bg.channels(); ++c){
if(opacity > 0){
double foregroundPx = fg.get(fY, fX)[c];
double backgroundPx = bg.get(y, x)[c];
float fOpacity = (float) (opacity / 255);
finalPixelValue[c] = ((backgroundPx * ( 1.0 - fOpacity))
+ (foregroundPx * fOpacity));
if(c==3){
finalPixelValue[c] = fg.get(fY,fX)[3];
}
}
}
bg.put(y, x, finalPixelValue);
}
}
return bg;
}
Convert background and foreground images to RGBA
Iterate through each pixel of the background image starting from the specified
location
Pixel is not out of bounds
Get the alpha value (opacity) of the current foreground pixel
Get the initial pixel values from the background
Blend the foreground and background pixel values based on the opacity
Update the background image with the blended pixel values
Next, let’s use image processing algorithms like binarization, gaussian blur, or canny
to change the image.
Image Processing
Let’s explore some algorithms that change the image content, for example, to remove/
blur background objects (making them less obvious), reduce noise to increase accu‐
racy in image analysis, or correct image perspective to provide a more calibrated view
for processing.
Gaussian Blur
Gaussian Blur is the process of blurring an image using a Gaussian function. Gaus‐
sian Blur is used in image processing for multiple purposes:
Noise reduction
Smoothing the variations in pixel values helps remove small-scale noise.
Scale-Space Representation
Generate multiple blurred versions of the image. It is used in multi-scale analysis
and feature detection at various scales.
Preprocessing for Edge Detection
Blur helps obtain cleaner and more accurate edge maps when used to detect
edges of objects.
Reducing Aliasing
Blur helps prevent aliasing.
To apply Gaussian blur with OpenCV, use the Imgproc.GaussianBlur method.
Imgproc.GaussianBlur(mat, blurredMat,
new Size(7, 7),
1
);
Input and Output matrix
Kernel size for blurring
Gaussian kernel standard deviation in X direction (sigmaX)
Applying the Gaussian blur algorithm on the hands image results in Figure 5-6:
Figure 5-6. Gaussian Blur
After blurring, let’s see how to apply a binarization process to an image.
Binarization
Binarization is the process of iterating through all pixels and setting each one to black or white, depending on whether the pixel value is below or above a defined threshold.
It is useful to segment an image into foreground and background regions by separat‐
ing relevant elements from the background.
There are a lot of different binarization algorithms, like THRESH_BINARY, ADAPTIVE_THRESH_MEAN_C, which selects the threshold for a pixel based on a small region around it, or THRESH_OTSU for Otsu's binarization algorithm, where the optimal threshold is found automatically.
The next code shows applying the binarization process alone and with Otsu's method:
Imgproc.threshold(src, binary,
100,
255,
Imgproc.THRESH_BINARY);
Mat grey = toGreyScale(src);
Imgproc.threshold(grey, binary,
0,
255,
Imgproc.THRESH_BINARY + Imgproc.THRESH_OTSU
);
Input and Output matrix
Threshold value
Maximum value to use
Thresholding type
Otsu algorithm requires image in greyscale
Values are ignored as the Otsu algorithm automatically calculates the point
Otsu’s threshold algorithm
Applying the previous binarization algorithms on the hands image results in Figure 5-7.
Image 1
The original image with no transformation.
Image 2
The image after applying binary threshold.
Image 3
The image after applying Otsu threshold.
Figure 5-7. Binarization Image
Another important algorithm, used in low-light situations to improve the quality of the image, is noise reduction.
In the next part, you'll learn how to apply noise reduction to an image.
Noise Reduction
Besides blurring an image to reduce noise, OpenCV implements image-denoising
algorithms for greyscale and color images. The class implementing denoising algo‐
rithms is org.opencv.photo.Photo, which also implements other algorithms for
photo manipulation, such as texture flattening, illumination change, detail enhance‐
ment, or pencil sketch.
Let’s take a look at that in action:
Photo.fastNlMeansDenoising(src, dst,
10);
Filter strength. Big values remove noise aggressively but also remove image details, while smaller values preserve details but also preserve some noise.
So far, you have executed these algorithms as single units: you apply the algorithm and the image changes. The last technique we'll show you in this section combines multiple image processing algorithms to change the image.
Edge Detection
Edge detection is a crucial image processing technique for identifying and locating
the boundaries or edges of objects within an image.
Some of the use cases are correcting the perspective of an image for future process‐
ing, detecting different areas present in an image (segmentation), extracting the
foreground of an image, or identifying a concrete part of an image.
There are multiple algorithm combinations to implement edge detection in an image,
and we’ll show you the most common one:
protected Mat edgeProcessing(Mat mat) {
final Mat processed = new Mat();
Imgproc.GaussianBlur(mat, processed, new Size(7, 7), 1);
Imgproc.cvtColor(processed, processed, Imgproc.COLOR_RGB2GRAY);
Imgproc.Canny(processed, processed, 200, 25);
Imgproc.dilate(processed, processed,
new Mat(),
new Point(-1, -1),
1);
return processed;
}
Blurs using a Gaussian filter
Transforms image to greyscale
Finds edges using Canny edge detection algorithm.
Dilates the image to add pixels to the boundary of the input image, making the
object more visible and filling small gaps in the image
Structuring element used for dilation; matrix of 3x3 is used when empty
Position of the anchor, in this case, the center of the image
Number of times dilation is applied.
Applying the previous algorithm results in Figure 5-8.
Image 1
The original image with no transformation.
Image 2
The image after applying the Gaussian filter.
Image 3
The image after applying Canny.
Image 4
The image after dilation.
Figure 5-8. Transformation Image
The interesting step here is the dilation step, which thickens the borders to make them easier to detect or process.
Besides processing the image, OpenCV has the Imgproc.findContours method to
detect all the image contours and store them as points into a List:
List<MatOfPoint> allContours = new ArrayList<>();
Imgproc.findContours(edges, allContours,
new org.opencv.core.Mat(),
Imgproc.RETR_TREE,
Imgproc.CHAIN_APPROX_NONE);
Input matrix
List where contours are stored
Optional output vector containing information about the image topology
Contour retrieval mode, in this case, retrieves all of the contours and recon‐
structs a full hierarchy of nested contours
Contour approximation method, in this case, all points are used
The previous method returns a list of all detected contours. Each MatOfPoint object
contains a list of all the points that define a contour. A representation of one of
these objects might be [{433.0, 257.0}, {268.0, 1655.0}, {1271.0, 1823.0},
{1372.0, 274.0}]; joining all the points with a line would give the contour of an element detected in the image.
The problem is what’s happen if the algorithm detects more than one contour, how to
distinguish the contour of the required object from contours of other objects.
One way to solve this problem is filtering the results by setting some certain steps:
1. Get only the MatOfPoint objects that cover the most significant area, calling the
Imgproc.contourArea method, and remove the rest.
2. Approximate the resulting polygon with another polygon with fewer vertices
using the Imgproc.approxPolyDP method. It uses the Douglas-Peucker algo‐
rithm. With this change, the shape is smoothed, closer to what the human eye perceives.
3. Remove all polygons with fewer than 4 corners.
4. Limit the result to a certain number of elements. Depending on the domain, this might be 1 or more.
These steps are executed in the following code:
final List<MatOfPoint> matOfPoints = allContours.stream()
.sorted((o1, o2) ->
(int) (Imgproc.contourArea(o2, false) -
Imgproc.contourArea(o1, false)))
.map(cnt -> {
MatOfPoint2f points2d = new MatOfPoint2f(cnt.toArray());
final double peri = Imgproc.arcLength(points2d, true);
MatOfPoint2f approx = new MatOfPoint2f();
Imgproc.approxPolyDP(points2d, approx, 0.02 * peri, true);
return approx;
})
.filter(approx -> approx.total() >= 4)
.map(mt2f -> {
MatOfPoint approxf1 = new MatOfPoint();
mt2f.convertTo(approxf1, CvType.CV_32S);
return approxf1;
})
.limit(1)
.toList();
Iterates over all detected contours
Sorts contours by area in decreasing order
approxPolyDP method returns points in the float type
Approximates the polygon
Filters the detected contours to have at least 4 corners
Transforms the points from float to int
Limits to one result
At this point, there is only one element in the list, containing the points that make up the detected element; for this example, the card present in the photo.
To draw the contours to the original image use the Imgproc.drawContours method:
Mat copyOfOriginalImage = originalImage.clone();
Imgproc.drawContours(copyOfOriginalImage, matOfPoints,
-1,
GREEN,
5);
Copies the original image to keep the original unmodified
Draws the contours detected in the previous step in the given image
Draw all detected contours
Scalar representing green
Thickness of the lines
The final image is shown in Figure 5-9:
Figure 5-9. Card with contour
These steps are important not only for detecting elements in a photograph but also
for correcting the perspective of an image. Next, let's see how to use OpenCV to correct the image's perspective.
Perspective Correction
In some use cases, such as photographed documents, the document within the image
might be distorted, making it difficult to extract the information enclosed in it.
For this reason, when working with photographed documents like passports, ID cards, licenses, etc., where we might not have control over how the image is taken, it is important to include a perspective correction as a preprocessing step.
If you look closely at the previous image, the card borders are not parallel with the
image borders, so let’s see how to fix the image’s perspective.
OpenCV has the Imgproc.warpPerspective method to apply a perspective correction to a photo. This perspective transformation is applied by having a reference point or
element to correct. This transformation can fix perspective for some aspects of the
photo while distorting others; what is essential here is detecting which element needs
a correction.
The steps to follow are:
1. Edge detecting the element within the image
2. Find the contour of the image
3. Map the contour points to the desired locations (for example, using the L2 norm)
4. Compute the perspective transformation matrix calling the Imgproc.getPerspectiveTransform method.
5. Apply the perspective transformation matrix to the input image calling the
Imgproc.warpPerspective method.
Figure 5-10 shows the detection of the four corner points and the translation to
correct the perspective:
Figure 5-10. Perspective Correction
Let’s do the perspective correction of the previous card image. We’ll show you the
algorithm after calling edgeProcessing and finding and filtering the contours down to one with four corners.
protected Mat correctingPerspective(Mat img) {
Mat imgCopy = this.edgeProcessing(img);
final Optional<MatOfPoint> matOfPoints = allContours.stream()
...
.filter(approx -> approx.total() == 4)
.findFirst();
final MatOfPoint2f approxCorners = matOfPoints.get();
MatOfPoint2f corners = arrange(approxCorners);
MatOfPoint2f destCorners = getDestinationPoints(corners);
final Mat perspectiveTransform = Imgproc
.getPerspectiveTransform(corners, destCorners);
org.opencv.core.Mat dst = new org.opencv.core.Mat();
Imgproc.warpPerspective(img, dst, perspectiveTransform, img.size());
return dst;
}
Edge Detection steps
Gets the calculated corner points
The order of points in the MatOfPoint2f needs to be adjusted for the getPerspectiveTransform method
Calculate the destination points using the L2 norm
Compute the perspective transformation matrix to move image from original to
the destination corners
Apply the perspective transformation matrix
There are two methods not explained yet. One arranges the points in the correct
order to be consumed by getPerspectiveTransform:
private MatOfPoint2f arrange(MatOfPoint2f approxCorners){
Point[] pts = approxCorners.toArray();
return new MatOfPoint2f(pts[0], pts[3], pts[1], pts[2]);
}
Rearrange the list of points to new positions
The other method is getDestinationPoints, which calculates the destination points
of each corner to correct the image’s distortion.
In this case, we use the L2 norm (or Euclidean norm), which gives the distance between two points, to translate the original (yet distorted) coordinates to new coordinates where the image element is not distorted.
Figure 5-11 shows the L2 norm formula:
Figure 5-11. L2 Norm formula
Figure 5-12 helps you visualize this transformation.
The cross markers (+) are the original corner points of the element. You can see that they do not form a perfect rectangle, but they fit the element perfectly, so the element is distorted.
The star markers (*) are the points calculated using the Euclidean norm as the
element’s final coordinates.
Figure 5-12. Original vs New Coordinates
The code to calculate the new coordinates is shown in the following snippet:
private double calculateL2(Point p1, Point p2) {
double x1 = p1.x;
double x2 = p2.x;
double y1 = p1.y;
double y2 = p2.y;
double xDiff = Math.pow((x1 - x2), 2);
double yDiff = Math.pow((y1 - y2), 2);
return Math.sqrt(xDiff + yDiff);
}
private MatOfPoint2f getDestinationPoints(MatOfPoint2f approxCorners) {
Point[] pts = approxCorners.toArray();
double w1 = calculateL2(pts[0], pts[1]);
double w2 = calculateL2(pts[2], pts[3]);
double width = Math.max(w1, w2);
double h1 = calculateL2(pts[0], pts[2]);
double h2 = calculateL2(pts[1], pts[3]);
double height = Math.max(h1, h2);
Point p0 = new Point(0,0);
Point p1 = new Point(width -1,0);
Point p2 = new Point(0, height -1);
Point p3 = new Point(width -1, height -1);
return new MatOfPoint2f(p0, p1, p2, p3);
}
Figure 5-13 shows the original card image without distortion after applying the
correctingPerspective method. See how the card edges are now parallel to the image borders, so no distortion is apparent.
Figure 5-13. Image with no distortion
Let’s see another use case related to image processing in this chapter. The following
section uses the OpenCV library to read barcodes or QR codes from images.
Reading QR/BarCode
The OpenCV library implements two classes for QR code and barcode recognition.
OpenCV implements several algorithms to recognize codes, all implicitly called when using the org.opencv.objdetect.GraphicalCodeDetector class.
These algorithms are grouped into three categories:
Initialization
Constructs barcode detector.
Detect
Detects graphical code in an image and returns the quadrangle containing the
code. This step is important as the code can be in any part of the image, not a
specific position.
Decode
Reads the contents of the barcode. It returns a UTF8-encoded String or an
empty string if the code cannot be decoded. As a pre-step, it locally binarizes the
image to simplify the process.
Let’s explore using OpenCV for reading QR and Bar Codes:
Barcodes
The barcode’s content is decoded by matching it with various barcode encoding
methods. Currently, EAN-8, EAN-13, UPC-A and UPC-E standards are supported.
The class for recognizing barcodes is org.opencv.objdetect.BarcodeDetector,
which implements a method named detectAndDecode that calls both the detection and decoding parts, so the whole process is executed with a single call.
Given the barcode shown in Figure 5-14:
Figure 5-14. Image with a barcode
The following code shows how to get the barcode content as a String from the previous image:
protected String readBarcode(Mat img) {
BarcodeDetector barcodeDetector = new BarcodeDetector();
return barcodeDetector.detectAndDecode(img);
}
Initialize the class
Executes the detect and decode algorithms
Scanning a barcode is not difficult, and you scan a QR code in a similar way.
QR Code
OpenCV provides the org.opencv.objdetect.QRCodeDetector class for scanning
QR codes. In this case, you call the overloaded detectAndDecode method, where the second argument is an output Mat with the vertices of the found graphical code quadrangle.
QRCodeDetector qrCodeDetector = new QRCodeDetector();
Mat output = new Mat();
String qr = qrCodeDetector.detectAndDecode(img, output);
Initialize the class
Defines output Mat
Executes the detect and decode algorithms
Draw the detected marks on the image:
for (int i = 0; i < output.cols(); i++) {
Point p = new Point(output.get(0, i));
Imgproc.drawMarker(img, p, OpenCVMain.GREEN,
Imgproc.MARKER_CROSS, 5, 10);
}
It is a 1x4 matrix
Each column contains the coordinates of one point
Draws markers
After this brief and practical introduction to image processing, let's move on to how to process images when they come as a stream (e.g., a video or webcam).
Stream Processing
OpenCV provides some classes for reading, extracting information, and manipulat‐
ing videos. These videos can be a file, a sequence of images, a live video from a (network) webcam, or any capturing device that is addressable by a URL or a GStreamer pipeline.
The main class to manipulate videos, or get information like Frames Per Second
or the size of the video is org.opencv.videoio.VideoCapture. Moreover, this class
implements a method to read each frame as a matrix (Mat object) to process it, as
shown in the previous sections of this chapter.
The org.opencv.videoio.VideoWriter class provides a method to store the processed videos in several formats, the most common being MP4.
Let's dig into how to utilize that.
Processing Videos
Let’s develop a method that reads a video file and applies the binarization process to
all frames to finally store the processed video:
protected void processVideo(Path src, Path dst) {
VideoCapture capture = new VideoCapture();
if (!capture.isOpened()) {
capture.open(src.toAbsolutePath().toString());
}
double frmCount = capture.get(Videoio.CAP_PROP_FRAME_COUNT);
System.out.println("Frame Count: " + frmCount);
double fps = capture.get(Videoio.CAP_PROP_FPS);
Size size = new Size(capture.get(Videoio.CAP_PROP_FRAME_WIDTH),
capture.get(Videoio.CAP_PROP_FRAME_HEIGHT));
VideoWriter writer = new VideoWriter(dst.toAbsolutePath().toString(),
VideoWriter.fourcc('a', 'v', 'c', '1'),
fps, size, true);
Mat img = new Mat();
while (true) {
capture.read(img);
if (img.empty())
break;
writer.write(this.binaryBinarization(img));
}
capture.release();
writer.release();
}
Instantiates the main class for video capturing
Load the file
Gets the number of frames
Gets the frames per second
Gets the dimensions of the video
Creates the class to write video to disk
Reads a frame and decodes as a Mat
If there are no more frames, skip the loop
Process the matrix
Writes the processed matrix to the output stream
Closes the streams and writes the content
With these few lines of code, you process “offline” videos; in the next section, you’ll
explore processing video in real time.
Processing WebCam
Let’s implement a simple method of capturing a snapshot from the computer camera:
protected Mat takeSnapshot() {
VideoCapture capture = new VideoCapture(0);
Mat image = new Mat();
try {
TimeUnit.SECONDS.sleep(1);
} catch (InterruptedException e) {
throw new RuntimeException(e);
}
capture.read(image);
capture.release();
return image;
}
id of the video capturing device to open. The default one is 0
Wait till the device is ready
Captures the image
We suggest you use a library like Awaitility to implement waits. For
the sake of simplicity, we leave it as a sleep call.
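As a rough sketch of that suggestion, assuming the org.awaitility:awaitility dependency is on the classpath, the sleep could be replaced with a condition-based wait:
import static org.awaitility.Awaitility.await;

import java.util.concurrent.TimeUnit;
import org.opencv.core.Mat;
import org.opencv.videoio.VideoCapture;

public class SnapshotWithAwaitility {
    protected Mat takeSnapshot() {
        VideoCapture capture = new VideoCapture(0);
        // Poll until the device reports it is opened, instead of sleeping blindly
        await().atMost(5, TimeUnit.SECONDS).until(capture::isOpened);
        Mat image = new Mat();
        capture.read(image);
        capture.release();
        return image;
    }
}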
Capturing from a camera is similar to making a video, but the class is configured to
the device’s location instead of a file.
You’ve now gained a good understanding of the Open CV project’s capabilities for
manipulating images and videos, detecting objects, and reading Barcodes and QR
codes in Java. But before finishing this chapter, we have some final words about Open
CV, its integration with DJL, and a Java alternative to Open CV.
OpenCV and Java
So far, you’ve probably noticed that even though OpenCV is well integrated with Java,
the API mimics the C/C++ programming language. For example:
• Using parameters for output, inherited from pass-by-reference (pointers) in C++.
• Using integers instead of enums for configuration constants or parameter names.
• Not using exceptions to indicate errors, only booleans indicating whether the operation succeeded, or returning an empty value (a matrix of zeros, blank strings) for an unsuccessful operation.
• There is a need to call the release method to close the object and free resources.
In Java, you could use try-with-resources.
• The basic unit is a matrix (Mat) instead of an image.
When using OpenCV, we recommend creating a Java wrapper around the library,
addressing some of these “problems” and implementing them in Java. You can do this
by:
• Using return values instead of output parameters passed by reference. In case of multiple return values, create a Java record.
• Putting the configuration constants in an enum with an integer field instead of using raw integer constants.
• Making classes implement AutoCloseable to release the resources, as shown in the sketch after this list.
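As an illustration of the last point, this minimal sketch (the wrapper name and scope are hypothetical) wraps VideoCapture so it can participate in try-with-resources:
import org.opencv.core.Mat;
import org.opencv.videoio.VideoCapture;

// Hypothetical wrapper that releases the underlying capture automatically
public class ManagedCapture implements AutoCloseable {

    private final VideoCapture capture;

    public ManagedCapture(String source) {
        this.capture = new VideoCapture(source);
    }

    public boolean readInto(Mat frame) {
        return capture.read(frame);
    }

    @Override
    public void close() {
        capture.release();
    }
}
A caller can then write try (ManagedCapture capture = new ManagedCapture(path)) { ... } and the native resources are released even if processing fails.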
Other options for image processing exist, such as BoofCV, which offers capabilities
similar to those of OpenCV. For this book, we advocated for OpenCV because of its
tight integration with the DJL library.
DJL creates a layer of abstraction around OpenCV, fixing some of the problems
mentioned in the previous section. Moreover, it implements some image processing
algorithms, such as drawing boundaries in an image, so you don’t need to implement
them yourself.
To use this integration, register the following dependency:
<dependency>
<groupId>ai.djl.opencv</groupId>
<artifactId>opencv</artifactId>
<version>0.29.0</version>
</dependency>
To load an image, use any of the methods provided by the ai.djl.modality.cv.BufferedImageFactory class. You can load an image from various sources like URLs,
local files, or input streams.
When loading an image, it returns a class of type ai.djl.modality.cv.Image, which
provides a suite of image manipulation functions to pre- and post-process the images
and save the final result.
Let’s try cropping an image to get only the left half of the image using this integration:
Image img = BufferedImageFactory.getInstance().fromFile(pic);
int width = img.getWidth();
int height = img.getHeight();
Image croppedImg = img.getSubImage(0, 0, width / 2, height);
croppedImg.save(
new FileOutputStream("target/lresizedHands.jpg"),
"jpg");
Reads image from a file
Gets information about the size of the image
Crops the image, creating a new copy
Stores the image to disk
Any of these methods can throw an exception in case of an error instead of returning
an empty response or a null.
The Image class has the getWrappedImage method to get the underlying representation of the image as an OpenCV object (usually a Mat).
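For example, a minimal sketch that drops down to the raw matrix, assuming the image was produced by the OpenCV-backed engine so the wrapped object really is a Mat:
import ai.djl.modality.cv.Image;
import org.opencv.core.Mat;
import org.opencv.imgproc.Imgproc;

public class WrappedImageExample {
    // Falls back to OpenCV when the DJL abstraction is not enough
    static Mat toGreyscale(Image img) {
        Object wrapped = img.getWrappedImage();
        if (!(wrapped instanceof Mat mat)) {
            throw new IllegalArgumentException("Image is not backed by an OpenCV Mat");
        }
        Mat grey = new Mat();
        Imgproc.cvtColor(mat, grey, Imgproc.COLOR_RGB2GRAY);
        return grey;
    }
}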
We’ll utilize this library deeply in the following chapter.
With a good understanding of image and video processing, let’s move forward to the
last section of this chapter, where we’ll use Optical Character Recognition (or OCR)
to transform an image with text into machine-encoded text.
OCR
Optical Character Recognition (OCR) is a group of algorithms that convert docu‐
ments, such as scanned paper documents, PDFs, or images taken by a digital camera,
into text data. This is useful for processing text content (for example, extracting
important information or summarizing the text) or storing text in a database to make
it searchable.
OCR process usually has three phases to detect characters:
Pre-Processing
Apply some image processing algorithms to improve character recognition.
These algorithms usually adjust the perspective of the image, binarize, reduce
noise, or perform layout analysis to identify columns of text.
Text recognition
This phase detects text areas, recognizing and converting individual characters
into digital text.
Post-Processing
The accuracy of the process increases if, after the detection of words, these words are matched against a dictionary (this could be a general-purpose dictionary or a more technical one for a specific field) to detect which words are valid within the document. This process can also be more complex, not just detecting exact word matches but also similar words (see the sketch after this list). For example, "Regional Cooperation" is more common in English than "Regional Cupertino."
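As a toy illustration of that post-processing step (not part of Tesseract; the dictionary and the misread word are made up), a naive approach picks the dictionary word with the smallest Levenshtein edit distance:
import java.util.List;

public class OcrPostProcessor {

    // Picks the dictionary word with the smallest edit distance to the OCR output
    static String closestWord(String ocrWord, List<String> dictionary) {
        String best = ocrWord;
        int bestDistance = Integer.MAX_VALUE;
        for (String candidate : dictionary) {
            int distance = levenshtein(ocrWord.toLowerCase(), candidate.toLowerCase());
            if (distance < bestDistance) {
                bestDistance = distance;
                best = candidate;
            }
        }
        return best;
    }

    // Classic dynamic-programming Levenshtein edit distance
    static int levenshtein(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) {
            for (int j = 0; j <= b.length(); j++) {
                if (i == 0) {
                    dp[i][j] = j;
                } else if (j == 0) {
                    dp[i][j] = i;
                } else {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    dp[i][j] = Math.min(Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1),
                            dp[i - 1][j - 1] + cost);
                }
            }
        }
        return dp[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // "cooperatlon" is a plausible OCR misread of "cooperation"
        List<String> dictionary = List.of("cooperation", "corporation", "cupertino");
        System.out.println(closestWord("cooperatlon", dictionary));
    }
}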
Multiple OCR libraries exist, but the Tesseract library is one of the most used and
accurate.
Tesseract (https://github.com/tesseract-ocr/tesseract) is an optical character recognition
engine released as an open-source project under the Apache license. It can recognize
more than 116 languages, process right-to-left text, and perform layout analysis.
The library is written in C and C++, and, like the OpenCV library, a Java wrapper
surrounds it to make calls to the library transparently from Java classes.
To get started with Tesseract for Java (which we'll refer to simply as Tesseract from here on), register the following dependency in your build tool:
<dependency>
<groupId>org.bytedeco</groupId>
<artifactId>tesseract-platform</artifactId>
<version>5.3.4-1.5.10</version>
</dependency>
One important thing to do before using Tesseract is to download the tessdata files. These files are trained models for each supported language, and you can download them from the Tesseract GitHub repository (https://github.com/tesseract-ocr/tessdata). For example, the file for English is named eng.traineddata.
Download the file and store it at src/main/resources/eng.traineddata.
Now, let’s develop a simple application that scans and extracts PDF text.
The main class interacting with Tesseract is org.bytedeco.tesseract.TessBaseAPI.
This class is an interface layer on top of the Tesseract instance to make calls for
initializing the API, setting the content to scan, or getting the text from the given
image.
The first step is to instantiate the class and call the init method to initialize the OCR
engine, setting the path of the folder with all tessdata files and the language name to
load for the current instance.
For our example:
static {
api = new TessBaseAPI();
if (api.Init("src/main/resources", "eng") != 0) {
throw new RuntimeException("Could not initialize Tesseract.");
}
}
Executes this method only once as it takes a lot of time
After the class initialization, you can start scanning and processing images. The input
parameter must be of type org.bytedeco.leptonica.PIX, which is the image to scan.
To load an image into the PIX object, use the org.bytedeco.leptonica.global.leptonica.pixRead static method.
Finally, use the SetImage method to set the PIX and GetUTF8Text to get the text
representation of the image.
TessBaseAPI is not thread-safe. For this reason, it is really impor‐
tant to protect access to the read and scan methods with any of the
Java synchronization methods to avoid concurrent processing.
The next snippet reads an image containing text and returns the text as a String object.
private static final ReentrantLock reentrantLock = new ReentrantLock();
String content = "";
try (PIX image = pixRead(imagePath.toFile().getAbsolutePath())) {
reentrantLock.lock();
BytePointer bytePointer = null;
try {
api.SetImage(image);
bytePointer = api.GetUTF8Text();
content = bytePointer.getString();
} finally {
if (bytePointer != null) {
bytePointer.close();
}
reentrantLock.unlock();
}
} catch (Exception e) {
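// In real code, log or handle the OCR failure instead of swallowing it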
}
Using try-with-resources for automatic resource management
Loads image from file location
Lock the code that uses the shared resource
Sets the image to Tesseract
Gets the text scanned in the image as a pointer
Pointer to String
The last execution step is closing down Tesseract and freeing up all memory. This
step should be executed only when you no longer need to use Tesseract.
public static void cleanup() {
api.End();
}
Clean Up Resources
These are all the topics we cover in this book about image processing; in further
chapters, you’ll see examples of using the image processing algorithms applied to
AI/ML models.
Next steps
In this chapter, you learned the basics of image/video processing. You’ll typically need
to apply these algorithms when AI models use images as input parameters.
Image processing is a vast topic that would require a book of its own, but we think a brief introduction to the most commonly used algorithms, giving you a good understanding of image processing in Java, is important for the topic of this book.
Because of the direct relationship between OpenCV and OpenCV for Java, you can translate any tutorial, video, book, or example written for the OpenCV C++ version to Java.
By now you understand AI/ML and its integration with Java, as well as how to infer models in Java using DJL and consume these models with Java clients (REST or gRPC) or with LangChain4j. Moreover, this chapter showed how to preprocess images to make them suitable input parameters for a model.
However, there are pieces we haven’t covered yet such as:
• Model Context Protocol (MCP)
• Streaming models
• Security and guards
• Observability
In the book’s last chapter, we’ll cover these important topics, which in our opinion
don’t fit in any of the previous chapters.
About the Authors
Markus Eisele is a technical marketing manager in the Red Hat Application Devel‐
oper Business Unit. He has been working with Java EE servers from different vendors
for more than 14 years and gives presentations on his favorite topics at international
Java conferences. He is a Java Champion, former Java EE Expert Group member,
and founder of Germany’s number-one Java conference, JavaLand. He is excited to
educate developers about how microservices architectures can integrate and comple‐
ment existing platforms, as well as how to successfully build resilient applications
with Java and containers. He is also the author of Modern Java EE Design Patterns and
Developing Reactive Microservices (O’Reilly). You can follow more frequent updates
on Twitter and connect with him on LinkedIn.
Alex Soto Bueno is a director of developer experience at Red Hat. He is passionate
about the Java world, software automation, and he believes in the open source soft‐
ware model. Alex is the coauthor of Testing Java Microservices (Manning), Quarkus
Cookbook (O’Reilly), and the forthcoming Kubernetes Secrets Management (Man‐
ning), and is a contributor to several open source projects. A Java Champion since
2017, he is also an international speaker and teacher at Salle URL University. You can
follow more frequent updates on his Twitter feed and connect with him on LinkedIn.
Natale Vinto is a software engineer with more than 10 years of expertise on IT and
ICT technologies and a consolidated background on telecommunications and Linux
operating systems. As a solution architect with a Java development background, he
spent some years as EMEA specialist solution architect for OpenShift at Red Hat.
Today, Natale is a developer advocate for OpenShift at Red Hat, helping people within
communities and customers have success with their Kubernetes and cloud native
strategy. You can follow more frequent updates on Twitter and connect with him on
LinkedIn.