Harnessing the power of Gen AI for evaluating AI systems

Evaluating AI systems presents numerous challenges, but leveraging large language models (LLMs) can help streamline the process, offering innovative solutions to enhance accuracy and efficiency.

As a data scientist with hands-on experience in developing AI solutions in the financial industry, I can confidently say that building these solutions requires a significant investment of time, resources, and expertise. Customers demand top-notch services, and companies are increasingly using AI to enhance trust. This means we are rolling out more AI solutions than ever before, necessitating robust practices for rigorous testing and evaluation of these models, all while meeting the demands of a competitive market.

However, testing AI models has always presented several challenges, such as the scarcity of high-quality data and the time-consuming process of data labeling. In 2020, the Gen AI boom began with OpenAI's release of GPT-3, which pushed AI capabilities to new heights, and companies started exploring how to leverage Gen AI to address business problems. Here are four ways that LLMs can assist in evaluating the AI solutions we are so eager to deploy!

Test set generation

Data scarcity refers to the lack of sufficient high-quality data to effectively train AI systems and validate their behavior. This is a common challenge in modeling real-world problems and can delay the deployment of accurate and reliable AI systems. Traditionally, techniques like data augmentation (creating new data points from existing ones), synthetic data generation (fabricating artificial data such as simulated customer transactions), and transfer learning (using pre-trained models on similar tasks) have been employed. While these methods help, they often come with limitations, such as introducing biases or failing to capture the complexity of real-world scenarios.

Take, for example, the problem of detecting fake product reviews. It is particularly challenging because these reviews are rare, which makes it a classic anomaly detection problem: the goal is to identify rare and unusual patterns within a larger dataset. Anomalies such as fake reviews are often sparse and diverse, which complicates the creation of a comprehensive dataset.

This is where LLMs help. They can address issues like sparse events by generating the necessary data through prompt engineering. By instructing the LLM to produce text in various tones and on various topics, we can create diverse datasets that include these rare anomalies. This approach provides ample examples for training and testing, which can improve the detection model’s ability to recognize unusual patterns.
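As a rough illustration of the prompt engineering described above, the sketch below builds one prompt per tone and topic combination and tags each generated review as fake. The tone and topic lists, the prompt wording, and the `call_llm` function are all assumptions made for this example; `stub_llm` only stands in for a real model client so the code runs offline.

```python
from itertools import product
from typing import Callable

# Hypothetical tone and topic lists; a real project would agree these with SMEs.
TONES = ["enthusiastic", "angry", "neutral"]
TOPICS = ["running shoes", "winter jacket"]

# Assumed prompt wording, for illustration only.
PROMPT_TEMPLATE = (
    "Write a short product review. Tone: {tone}. Product: {topic}. "
    "The review should read like a fake review written by someone "
    "who never bought the product."
)


def build_prompts(tones: list[str], topics: list[str]) -> list[dict]:
    """Return one prompt per (tone, topic) pair, keeping the metadata for audits."""
    return [
        {"tone": tone, "topic": topic,
         "prompt": PROMPT_TEMPLATE.format(tone=tone, topic=topic)}
        for tone, topic in product(tones, topics)
    ]


def generate_reviews(prompts: list[dict], call_llm: Callable[[str], str]) -> list[dict]:
    """Send each prompt to the supplied LLM client and label the output as fake."""
    return [
        {**p, "text": call_llm(p["prompt"]), "label": "fake", "reviewed_by_sme": False}
        for p in prompts
    ]


def stub_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return f"[generated text for: {prompt[:40]}...]"


prompts = build_prompts(TONES, TOPICS)
reviews = generate_reviews(prompts, stub_llm)
```

Keeping the tone and topic next to each generated review makes it easier to check later whether the synthetic set really covers the variety that was intended, and the `reviewed_by_sme` flag leaves room for a human quality check.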

However, companies must exercise caution with this type of data generation, as LLMs can hallucinate. It is advisable to involve subject matter experts (SMEs) to help ensure data quality and adherence to the core concepts you are aiming to capture.

Automated data labeling

High-quality data is essential for any AI project, and AI teams often invest considerable time in collecting and labeling data in collaboration with SMEs. This process helps ensure that the SME’s domain knowledge is reflected in the data, resulting in a more accurate and relevant final product. For example, when labeling product reviews as genuine or fake, it is vital to have expert input to ensure accuracy. However, in the absence of SMEs, this can be a difficult endeavor, especially when the need to test innovative AI ideas quickly arises.

Here, LLMs can play a pivotal role. By using carefully crafted prompts, you can feed your data to the LLM for labeling. This approach can accelerate experimentation and allow for double-checking a subset of labels to ensure quality.

For instance, you could use an LLM to label a batch of product reviews and then verify samples with SMEs to confirm accuracy. Additionally, employing multiple LLMs for labeling and taking a majority vote can help enhance confidence in the final labels. An LLM can also serve as a tiebreaker when working with SMEs to reach a consensus.
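The majority vote idea can be sketched in a few lines. The example below assumes each review has already been labeled by several LLMs (the votes shown are made up) and routes ties and non-unanimous results to SMEs for a spot check; that routing rule is an assumption, not a prescribed policy.

```python
from collections import Counter
from typing import Optional


def majority_label(votes: list[str]) -> tuple[Optional[str], float]:
    """Return (winning label, agreement share); the label is None on a tie."""
    if not votes:
        raise ValueError("at least one vote is required")
    counts = Counter(votes).most_common()
    top_label, top_count = counts[0]
    tied = len(counts) > 1 and counts[1][1] == top_count
    return (None if tied else top_label), top_count / len(votes)


# Hypothetical labels from three different LLMs for four reviews.
votes_per_review = {
    "r1": ["fake", "fake", "genuine"],
    "r2": ["genuine", "genuine", "genuine"],
    "r3": ["fake", "genuine"],  # one labeler returned no answer
    "r4": ["fake", "fake", "fake"],
}

final_labels = {}
needs_sme_review = []
for review_id, votes in votes_per_review.items():
    label, agreement = majority_label(votes)
    final_labels[review_id] = label
    # Route ties and non-unanimous votes to a human for spot checking.
    if label is None or agreement < 1.0:
        needs_sme_review.append(review_id)
```

Using an odd number of labelers reduces the chance of ties, and the agreement share gives a simple way to prioritize which labels SMEs look at first.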

For labeling tasks, consider utilizing the most performant LLMs available to help ensure high-quality labels. Of course, be sure to check your company policy regarding the data input into non-open-source LLMs!

Stress testing

Adversarial attacks are designed to test a solution’s robustness by introducing subtle changes. Common techniques include word substitution, where synonyms or similar words are swapped to see if the model’s output remains consistent. A similar technique, character-level perturbations, involves small alterations like typos or misspellings to assess how well the model handles errors. For example, testing a model that detects fake reviews could involve altering key phrases in genuine reviews to see if the model mistakenly flags them as fake.
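To make the two perturbation techniques concrete, here is a minimal rule-based sketch of word substitution and a character-level typo, plus a consistency check. The synonym table and the `toy_predict` classifier are hypothetical placeholders; in practice the perturbations could come from a curated lexicon or an LLM, and `predict` would be the model under test.

```python
import random

# Hypothetical synonym table; a real test would use a curated lexicon or an LLM.
SYNONYMS = {"great": "excellent", "bad": "poor", "fast": "quick"}


def substitute_words(text: str, synonyms: dict[str, str]) -> str:
    """Replace each word found in the table with its synonym (casing is not kept)."""
    return " ".join(synonyms.get(word.lower(), word) for word in text.split())


def swap_adjacent_chars(text: str, rng: random.Random) -> str:
    """Introduce one typo by swapping two different neighboring letters."""
    candidates = [
        i for i in range(len(text) - 1)
        if text[i].isalpha() and text[i + 1].isalpha() and text[i] != text[i + 1]
    ]
    if not candidates:
        return text
    i = rng.choice(candidates)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def is_consistent(predict, original: str, perturbed: str) -> bool:
    """True when the model gives the same prediction before and after perturbation."""
    return predict(original) == predict(perturbed)


def toy_predict(text: str) -> str:
    """Stand-in classifier: flags a review as fake if it contains 'great'."""
    return "fake" if "great" in text.lower() else "genuine"


review = "Great shoes and fast delivery"
swapped = substitute_words(review, SYNONYMS)
typo = swap_adjacent_chars(review, random.Random(0))
robust_to_synonyms = is_consistent(toy_predict, review, swapped)
```

With this deliberately brittle toy classifier, the synonym swap changes the prediction, which is exactly the kind of weakness a stress test is meant to surface.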

LLMs can facilitate this process by generating challenging inputs that can push a model’s limits, helping to identify weaknesses and areas for improvement. They can create realistic typos or variations that mimic common user errors, allowing for a more comprehensive assessment of model robustness.

This automation not only speeds up testing but can also uncover vulnerabilities that might be overlooked with manual methods. Consider crafting your data generation prompts carefully to help ensure they remain focused on the main concepts you wish to detect with your model!

Data drift detection

Data drift in the field of AI refers to the phenomenon where the statistical properties of the input data that feed the AI system change over time, leading to a potential decline in model performance. Such changes, including shifts in the concepts we want our models to detect, can occur frequently.
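One way to put a number on a change in statistical properties is the Population Stability Index (PSI), which compares how a feature is distributed in a reference sample and in a newer sample. PSI is not mentioned in this article; it is used here only as one possible measure, and the feature (review length in words), the bin count, and the sample values are assumptions for illustration.

```python
import math


def psi(expected: list[float], actual: list[float], n_bins: int = 5) -> float:
    """Population Stability Index between a reference sample and a newer sample."""
    lo, hi = min(expected), max(expected)
    # Equal-width bins over the reference range; fall back to 1.0 if the range is zero.
    width = (hi - lo) / n_bins or 1.0

    def bin_shares(values: list[float]) -> list[float]:
        counts = [0] * n_bins
        for v in values:
            # Clamp values outside the reference range into the edge bins.
            idx = min(max(int((v - lo) / width), 0), n_bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = bin_shares(expected), bin_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical review lengths (in words) at training time and in production.
reference_lengths = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
current_lengths = [60, 65, 70, 75, 80, 85, 90, 95, 100, 105]

no_drift = psi(reference_lengths, reference_lengths)
drift = psi(reference_lengths, current_lengths)
```

A commonly quoted rule of thumb treats values above roughly 0.25 as a significant shift, but any threshold should be validated against the team's own data rather than taken as given.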

Assume we trained a model to identify fake product reviews from a clothing website and now wish to expand our scope and apply it to a drug store, yet we don't have labeled data. In such a case, we could use an LLM to generate synthetic reviews for drug store products and assess whether the model’s performance is degrading. This process can help inform decisions about model generalizability and expedite the retraining process, helping to ensure that our model is robust and effective across different domains.
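The comparison described above can be sketched as an accuracy check across the two domains. Everything in this example is a placeholder: `toy_predict` stands in for the trained model, the drug store set stands in for LLM-generated reviews, and the 0.10 tolerance is an assumed value that a team would need to set for itself.

```python
from typing import Callable


def accuracy(predict: Callable[[str], str], examples: list[tuple[str, str]]) -> float:
    """Share of (text, label) pairs the model labels correctly."""
    if not examples:
        raise ValueError("examples must not be empty")
    return sum(predict(text) == label for text, label in examples) / len(examples)


def performance_drop(predict, source_set, target_set) -> float:
    """Accuracy on the original domain minus accuracy on the new domain."""
    return accuracy(predict, source_set) - accuracy(predict, target_set)


def toy_predict(text: str) -> str:
    """Stand-in for the trained model: keys on a clothing-specific phrase."""
    return "fake" if "best fit ever" in text.lower() else "genuine"


# Hypothetical labeled sets; the drug store set stands in for LLM-generated reviews.
clothing_reviews = [
    ("Best fit ever, buy now!!!", "fake"),
    ("Sleeves run a little long but the fabric is soft.", "genuine"),
]
drugstore_reviews = [
    ("Miracle cure, works instantly, buy now!!!", "fake"),
    ("The lotion absorbs quickly and has no scent.", "genuine"),
]

drop = performance_drop(toy_predict, clothing_reviews, drugstore_reviews)
MAX_ACCEPTABLE_DROP = 0.10  # assumed tolerance, to be set by the team
needs_retraining = drop > MAX_ACCEPTABLE_DROP
```

Because the synthetic reviews are themselves model output, a drop measured this way is best treated as a signal to investigate, ideally confirmed on a small SME-labeled sample from the new domain.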

Conclusion

LLMs are an exciting frontier in AI, offering remarkable capabilities that can significantly enhance our workflows. By carefully leveraging these models, we can accelerate processes such as data generation, labeling, and stress testing, ultimately improving the efficiency and accuracy of our AI systems.

However, it’s crucial to approach their use with caution, ensuring that we guide the LLM to generate the right output and involve SMEs in the process as much as possible. This collaborative approach allows us to harness the power of LLMs while safeguarding the integrity and relevance of our outputs, paving the way for faster productization of innovative solutions that truly meet our clients' needs!

The opinions provided are those of the author and not necessarily those of Fidelity Investments or its affiliates. The information regarding AI tools provided herein is for informational purposes only and is not intended to constitute a recommendation, development, security assessment advice of any kind.

1170755.1.0

Forough Haskel
