Testing software integrated with LLMs†
Testing methods, challenges and observations
Software integrated with LLMs

LLMs are seldom used in isolation. Whether performing tasks such as querying, content generation, summarization, classification or decision support, they are integrated with other software components.

Unlike traditional systems, LLM-based development relies on trial-and-error cycles, with developers modifying prompts, adjusting temperature and context parameters, and evaluating outputs through informal testing.
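As a rough illustration, this tuning cycle often looks like the sketch below. The complete() function is a hypothetical placeholder for whatever LLM client the project actually uses; the loop and the informal eyeball-check are the point, not the call itself.

```python
# Hypothetical sketch of the informal tuning loop described above.
# complete() is a placeholder for the real LLM client call.
def complete(prompt: str, temperature: float) -> str:
    # In a real system this would call the model API; stubbed here.
    return f"(model reply at temperature {temperature})"

PROMPT_VARIANTS = [
    "Summarize the support ticket in one sentence.",
    "Summarize the support ticket in one sentence of plain English.",
]

for prompt in PROMPT_VARIANTS:
    for temperature in (0.0, 0.3, 0.7):
        output = complete(prompt, temperature=temperature)
        # Evaluation is typically informal: print the reply and eyeball it.
        print(f"temp={temperature} | {prompt[:45]} -> {output}")
```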

This lack of determinism in the development process stems from the models' black-box behavior and unpredictable responses to even minor prompt changes. This, in turn, gets in the way of designing predictable and robust systems, especially where user trust and safety are involved.

There is a clear need for more structured engineering practices that address the unique challenges of integrating LLMs with software systems.

Testing methods

(1) Manual ad hoc testing, which is unplanned and informal, conducted without a structured procedure or predefined objectives. These sessions are largely spontaneous, triggered by new implementation changes or unexpected behaviors, and typically lack documentation or repeatable steps.

(2) Manual exploratory testing, done with the intent of discovering how the LLM responds to various inputs. These sessions follow a defined process: a test charter or a set of test goals, often focused on prompt variations, edge cases or behavioral boundaries.

(3) Manual scripted testing, using pre-defined test cases. This provides a more systematic approach to evaluating whether the LLM output meets design or domain requirements, allowing teams to track testing coverage and responsibility.

(4) Automated unit testing to validate the smaller, isolated components of the system that interact with the LLM. These are written for backend logic, parsing functions, or other modules involved in preparing or processing LLM data (see the first sketch after this list).

(5) Automated integration testing, used to verify the correct flow of data between the LLM, database, backend logic, and UI components. In other words, these tests ensure that the LLM's output can be processed, stored and presented as expected (see the second sketch after this list).
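A minimal sketch of what such a unit test might look like, assuming a hypothetical parse_quiz_response() helper that post-processes an LLM reply. Note that the model itself is never called; only the deterministic code around it is tested.

```python
import json
import unittest

def parse_quiz_response(raw: str) -> dict:
    """Hypothetical parser for an LLM reply expected to be a JSON quiz item."""
    data = json.loads(raw)
    if not isinstance(data, dict) or not {"question", "options", "answer"} <= data.keys():
        raise ValueError("missing required fields")
    return data

class ParseQuizResponseTest(unittest.TestCase):
    def test_valid_payload(self):
        raw = '{"question": "2+2?", "options": ["3", "4"], "answer": "4"}'
        self.assertEqual(parse_quiz_response(raw)["answer"], "4")

    def test_missing_field_raises(self):
        with self.assertRaises(ValueError):
            parse_quiz_response('{"question": "2+2?"}')

if __name__ == "__main__":
    unittest.main()
```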
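And a sketch of an integration test over the same kind of pipeline, with the LLM stubbed out so the data flow from model reply to store to UI record is deterministic. The QuizPipeline class and its method names are illustrative, not from the article.

```python
import json
import unittest
from unittest.mock import MagicMock

class QuizPipeline:
    """Hypothetical pipeline: call the LLM, parse, persist, return for the UI."""
    def __init__(self, llm_client, store):
        self.llm_client = llm_client
        self.store = store  # e.g. a database wrapper

    def generate_quiz(self, topic: str) -> dict:
        raw = self.llm_client.complete(f"Write one quiz question about {topic} as JSON.")
        quiz = json.loads(raw)   # parsing step under test
        self.store.save(quiz)    # persistence step under test
        return quiz              # what the UI layer would render

class QuizPipelineIntegrationTest(unittest.TestCase):
    def test_llm_output_flows_to_store_and_ui(self):
        # Stub the LLM with a canned reply so the flow is repeatable.
        llm = MagicMock()
        llm.complete.return_value = '{"question": "2+2?", "answer": "4"}'
        store = MagicMock()

        quiz = QuizPipeline(llm, store).generate_quiz("arithmetic")

        store.save.assert_called_once_with({"question": "2+2?", "answer": "4"})
        self.assertEqual(quiz["answer"], "4")

if __name__ == "__main__":
    unittest.main()
```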

Testing challenges

(1) Integration issues arise when attempting to connect LLM outputs with backend logic, databases, or frontend components, e.g. data format mismatches, API failures and unhandled exceptions when the model's output deviates from what is expected. Debugging and fixing these issues requires validating JSON payloads, parsing model responses, handling async requests and so on (see the first sketch after this list).

(2) Non-deterministic behavior of the model, which complicates the use of traditional test cases: the exact same prompt can produce different responses at different times. Validation logic needs to allow for variability in content while enforcing correctness of structure, which necessitates retry mechanisms or fallback behaviors for cases where responses fail to meet format expectations (see the second sketch after this list).

(3) Prompt engineering difficulties, since small changes to prompts can cause significant shifts in model behavior, especially when some of those prompts are generated or modified dynamically within the code (see the third sketch after this list).

(4) Hallucinations, where the model produces factually wrong, fabricated, or incoherent content, even when prompts are concise. These errors can rarely be traced back to a specific code issue; instead, they require manual exploratory testing, user reviews, or prompt refinement to detect and manage.

(5) Imbalanced or biased outcomes, wherein the model consistently favors certain patterns, such as always selecting the same correct answer position in multiple-choice quizzes, over-relying on popular items in recommendations, or mishandling edge cases in user inputs. Though not necessarily harmful in themselves, these biases may lead to misleading results (see the fourth sketch after this list).
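For the integration issues in (1), a common defensive pattern is to sanitize and validate the payload before it reaches downstream code. A minimal sketch, assuming the reply should be a JSON object; the fence-stripping heuristic is illustrative, and real systems usually add schema validation on top.

```python
import json
from typing import Optional

def safe_parse(raw: str) -> Optional[dict]:
    """Defensively parse an LLM reply that should be a JSON object.

    Models sometimes wrap JSON in markdown fences or prose, so strip the
    common failure modes before giving up.
    """
    candidate = raw.strip()
    if candidate.startswith("```"):
        # Drop a markdown code fence the model may have added.
        candidate = candidate.strip("`").removeprefix("json").strip()
    try:
        data = json.loads(candidate)
    except json.JSONDecodeError:
        return None  # caller decides: retry, fallback, or surface an error
    return data if isinstance(data, dict) else None
```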
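For the non-determinism in (2), validation can enforce structure while tolerating varying content, retrying on format failures and degrading gracefully. A sketch of that retry/fallback pattern; call_llm, the required keys and the fallback record are all assumptions for illustration.

```python
import json

REQUIRED_KEYS = {"summary", "tags"}  # structure is enforced; wording is not

def generate_with_retry(call_llm, prompt: str, max_attempts: int = 3) -> dict:
    """Retry until the reply has the right *structure*; content may vary.

    call_llm is any callable returning the raw model text (a hypothetical
    stand-in for the real client).
    """
    for _ in range(max_attempts):
        raw = call_llm(prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON: try again
        if isinstance(data, dict) and REQUIRED_KEYS <= data.keys():
            return data  # structurally valid; exact wording may differ per run
    # Fallback when every attempt failed format expectations.
    return {"summary": "", "tags": [], "degraded": True}
```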
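For the prompt fragility in (3), one common mitigation (not from the article, but widespread practice) is to snapshot-test the rendered prompt itself, so accidental edits to a dynamically built template at least show up as a failing test even though the model's behavior cannot be pinned. The build_prompt() helper below is hypothetical.

```python
import unittest

def build_prompt(user_name: str, items: list) -> str:
    """Hypothetical dynamic prompt builder of the kind described above."""
    bullet_list = "\n".join(f"- {item}" for item in items)
    return (
        f"You are a shopping assistant for {user_name}.\n"
        f"Recommend one product based on:\n{bullet_list}"
    )

class PromptRegressionTest(unittest.TestCase):
    def test_prompt_is_stable(self):
        # Pin the exact rendered prompt so template edits are caught early.
        expected = (
            "You are a shopping assistant for Ada.\n"
            "Recommend one product based on:\n- tea\n- books"
        )
        self.assertEqual(build_prompt("Ada", ["tea", "books"]), expected)

if __name__ == "__main__":
    unittest.main()
```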
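Finally, the bias in (5) can sometimes be surfaced statistically: sample many generations and check whether one pattern dominates. A heuristic sketch for the multiple-choice example; the tolerance threshold and helper name are illustrative.

```python
from collections import Counter

def answer_position_bias(samples: list, tolerance: float = 0.5) -> bool:
    """Flag if one multiple-choice position dominates across many runs.

    samples: the correct-option letter ("A".."D") observed per generated quiz.
    Returns True when the most common position exceeds its expected share
    (0.25 for four options) by more than `tolerance` of that share.
    """
    counts = Counter(samples)
    most_common_share = counts.most_common(1)[0][1] / len(samples)
    expected_share = 1 / 4
    return most_common_share > expected_share * (1 + tolerance)

# Example: 100 generated quizzes where "C" is the correct option 60 times.
observed = ["C"] * 60 + ["A"] * 15 + ["B"] * 15 + ["D"] * 10
assert answer_position_bias(observed)  # 0.60 > 0.375, so flagged as biased
```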

All of the above imply that testing cannot be a purely technical exercise; it needs to involve human judgement, contextual reasoning, and continuous refinement.

Notes

(1) Testing LLM-integrated software is not only about checking output accuracy but also about validating how LLM responses are handled and displayed by the rest of the system.

(2) Prompt engineering should be treated as part of software development, not as a one-time design activity. Prompts often require iteration, debugging and refinement during development and maintenance.

(3) Testing workflows should accommodate non-determinism. Traditional test cases may not be sufficient for LLM responses, especially when variability is expected or even desired.

(4) In many cases, it is hard to know whether the output is truly incorrect or has simply diverged from expectation. Accepting this uncertainty as part of the testing process, and creating space for manual review or collaborative evaluation, can help avoid over-testing or misclassifying useful behavior as failure.

Sudhir Shetty, Oct 14 2025.
† References