
AQuA: Augmented Quality Assessment

As of April 2024, anyone in the Bible translation movement can apply for an AQuA account. Please email us at idx_aqua@sil.org to learn more.

What is AQuA?

AQuA is a copilot tool developed by SIL Global to build capacity for translation quality assurance and to increase its thoroughness. It harnesses artificial intelligence (AI) techniques to help translation teams objectively assess multiple facets of translation quality, enhancing and complementing their existing quality assurance work. AQuA runs several augmented quality assessment methods in the areas of accuracy, naturalness, and clarity; translation teams then analyze the results to make more informed decisions about the quality of their translations.

To this end, AQuA is creating tools for:

  1. translators, aiding them in identifying obvious problems so they can address them earlier in the translation process;
  2. church leaders and consultants, allowing them to do their checking work more thoroughly and more consistently;
  3. administrators and strategists, helping them to compare and evaluate different methodologies and products (including new AI-based translation methodologies proposed by TBTA, SIL, and Avodah).

By bringing attention to possible anomalous passages in a draft, AQuA accelerates the quality assessment process by helping:

  • translators make corrections (and produce an improved draft) prior to other quality checks;
  • church leaders and consultants gain an “at-a-glance” understanding of quality across a draft and quickly dig into the relevant, granular quality issues; and
  • administrators and strategists more easily track quality across projects to guide strategic allocations of checking resources.

Our hope is that AQuA will empower translation teams in the quality-checking process, helping them spot the obvious errors more easily so they can focus more of their precious time and attention on the parts of a draft that really need their expertise.

(Figure: Calibrated word correspondence scores for James)

AQuA as an AI Copilot

Recently, there has been an explosion of interest in AI-driven tools that are paired as "copilots" with human workers. These include copilots that help people write code, draft emails, diagnose patients, write recipes, and much more.

The use of the term copilot in these instances emphasizes two important points:

  1. A person is in control of the process (i.e., they are the pilot)
  2. The person's work is enhanced when it is assisted by the AI tool (in terms of efficiency, thoroughness, accuracy, creativity, etc.)

In Bible translation, individuals such as church leaders, translators, and consultants undergo specialty training to create high-quality translation drafts. AQuA, an AI quality-checking copilot, helps these translation experts locate anomalies and issues relating to accuracy, naturalness, and clarity within translation drafts that would likely not have been found otherwise.

It is important to note that AQuA is not designed to give insight into issues related to the other important qualities of a draft, such as the beauty or acceptability of a translation. AQuA is a tool which helps stakeholders in the translation process to more efficiently and thoroughly evaluate certain qualities of a draft translation; with the help of AQuA, teams can improve several qualities of their draft within the amount of time allotted to them.

(Figure: Missing words in Romans)

What does AQuA Measure?

The quality of a translation is not a singular entity; rather, it is best approached as a whole set of different sub-qualities. With AQuA, we are not attempting to come up with a singular estimate of quality, but only to assess a subset of these component qualities.

With AQuA, we do not assess these sub-qualities directly; we only deal with qualities that can be measured algorithmically and expressed as a number. These measurable qualities are never an exact match for one of the abstract qualities (such as accuracy, naturalness, or clarity); instead, they act as proxies for key aspects of those abstract qualities. For example, the AQuA metrics for Word Correspondence and Semantic Similarity are both quantifiable measures that relate to the accuracy of a translation, but neither gives a full picture of its accuracy.

(Figure: Acts scatterplot)

Methods of Assessment

A list of AQuA assessments is provided below, grouped under the abstract qualities of accuracy, naturalness, and clarity:

ACCURACY

Word Correspondence

The word correspondence assessment gives insight into how closely words align between two translations. Lower scores indicate less "word-for-wordness", while higher scores indicate more "word-for-wordness". Scores are calculated for each verse.

How it works:

Word correspondence scores are a measure of AQuA’s confidence in its ability to align functionally equivalent words between two translations. Although the user may select any two translations on which to run a word correspondence assessment, the most common use case is comparing a draft translation to a reference or model translation.

The word correspondence algorithm works in several steps. First, we use a variety of computational techniques in an attempt to align corresponding words between the two texts.

These alignment algorithms train only on the Bible texts being analyzed for the word correspondence assessment. In other words, no pretraining data is required.

Each of these algorithms yields a score between 0 and 1 for every possible word alignment pair in each verse, representing the algorithm's confidence in that pairing.

The word correspondence scores are calculated by averaging the scores produced by these algorithms for each word pair and then, for each word, keeping its most likely alignment (i.e., the alignment pair with the highest confidence score). Verse-level scores are calculated by averaging these highest-confidence word pairs across the entire verse. We then multiply the scores by 10 to produce a final score between 0 and 10.
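To make the aggregation concrete, here is a minimal Python sketch of how such per-algorithm confidences could be combined into a verse score. It assumes each algorithm returns a confidence in the range 0 to 1 for every candidate word pair; the function and data structures are illustrative only, not AQuA's actual implementation.

from statistics import mean

def verse_word_correspondence(pair_scores_by_algorithm):
    # pair_scores_by_algorithm maps (source_index, target_index) word pairs
    # to a list of confidence scores in [0, 1], one per alignment algorithm.
    averaged = {pair: mean(scores) for pair, scores in pair_scores_by_algorithm.items()}
    # For each source word, keep only its most likely alignment.
    best_per_word = {}
    for (src, tgt), score in averaged.items():
        if score > best_per_word.get(src, (None, -1.0))[1]:
            best_per_word[src] = (tgt, score)
    if not best_per_word:
        return 0.0
    # Average the best alignments across the verse and rescale to 0-10.
    return round(mean(score for _, score in best_per_word.values()) * 10, 2)

# Example: two algorithms scoring candidate alignments in a three-word verse.
scores = {(0, 0): [0.9, 0.8], (1, 1): [0.7, 0.6], (1, 2): [0.2, 0.3], (2, 2): [0.85, 0.9]}
print(verse_word_correspondence(scores))  # 7.92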

We tokenize words by whitespace, so this assessment currently does not work for non-whitespace languages, such as Thai. This assessment also yields more accurate results when used on isolating languages, as opposed to polysynthetic languages. Future research will focus on non-whitespace, subword, and phrase-level tokenization to improve performance on a wider variety of languages.

Semantic Similarity

The semantic similarity assessment gives insight into differences in meaning between two translations. Lower scores indicate more difference in meaning, while higher scores indicate less difference in meaning. Scores are calculated for each verse.

How it works:

Although the user may select any two translations on which to run a semantic similarity assessment, the most common use case is comparing a back translation of a draft to a reference or model translation. For best results, the two translations being compared should be in languages of wider communication.

We use Google's Language-Agnostic BERT Sentence Embeddings (LaBSE) model to assign a score from 0 to 10 to each verse pair. LaBSE is pretrained on a large dataset of sentence pairs in 109 languages of wider communication. More details are available in Google's description of the LaBSE model and in the accompanying model paper.
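As an illustration, the following Python sketch scores a verse pair with the publicly released LaBSE checkpoint in the sentence-transformers library, using cosine similarity rescaled to 0-10. The model name and the rescaling step are assumptions made for the example; AQuA's exact scoring pipeline may differ.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")  # public LaBSE checkpoint

def semantic_similarity(verse_a, verse_b):
    # Embed both verses and compare them with cosine similarity.
    embeddings = model.encode([verse_a, verse_b], normalize_embeddings=True)
    cosine = util.cos_sim(embeddings[0], embeddings[1]).item()
    # Cosine similarity lies in [-1, 1]; clamp at 0 and rescale to 0-10.
    return round(max(cosine, 0.0) * 10, 2)

print(semantic_similarity(
    "For God so loved the world that he gave his only Son.",
    "God loved the people of the world so much that he gave his one and only Son.",
))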

Missing and Added Words

The missing and added words assessment provides insight into words in one translation that have no likely equivalent in another translation.

How it works:

Although the user may select any two translations on which to run a missing and added words assessment, the most common use case is comparing a draft translation to a reference or model translation. We will refer to these two texts as the "project" and "comparison" texts, respectively.

To identify potential missing words, we build on the word alignments generated during the word correspondence assessment. After running the word correspondence algorithm between the two selected translations, we run it again between the user-selected comparison translation and several other high-quality reference translations. Finally, we compare results across these alignments.

Words in the comparison text that have no likely alignment in the project text, but do have a likely alignment in each of the other reference translations, are flagged as potentially missing from the project text.

The process for identifying potential added words in the project text is similar. We once again examine the alignments between the project and comparison texts. If a word in the project text has no likely equivalent in the comparison text, it is flagged as potentially added.
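The flagging logic can be sketched in Python as follows, assuming a helper has already recorded which word positions found a likely alignment in each text; the function names and data structures are illustrative only.

def flag_missing_words(comparison_words, aligned_in_project, aligned_in_references):
    # aligned_in_project: indices of comparison words with a likely alignment
    # in the project text; aligned_in_references: one such set of indices per
    # additional reference translation.
    missing = []
    for idx, word in enumerate(comparison_words):
        aligned_in_all_references = all(idx in ref for ref in aligned_in_references)
        # Flag only words that every reference translation could align
        # but the project translation could not.
        if idx not in aligned_in_project and aligned_in_all_references:
            missing.append(word)
    return missing

def flag_added_words(project_words, aligned_in_comparison):
    # Project words with no likely equivalent in the comparison text.
    return [word for idx, word in enumerate(project_words) if idx not in aligned_in_comparison]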

As with the word correspondence assessment, we tokenize words by whitespace, so this assessment currently does not work for non-whitespace languages, such as Thai, and it yields more accurate results on isolating languages than on polysynthetic ones. Future research will focus on non-whitespace, subword, and phrase-level tokenization to improve performance on a wider variety of languages.

NATURALNESS

Sentence and Word Length (Lix Readability)

The sentence and word length assessment uses a modified version of the Lix readability formula as a proxy for the readability of a text. Scores are calculated for each chapter.

How it works:

Lix readability is a formula based on the word and sentence lengths of a text. It works better for estimating the readability of non-English text than readability metrics like Flesch-Kincaid, which encode English-specific linguistic information.

The Lix readability formula is as follows:
(% of long words in the text) + (average number of words per sentence)
where a word is classified as "long" if it contains more than a set number of letters (traditionally, more than six).

In AQuA, this formula becomes:
(% of words in the chapter over 7 characters long) + (average number of words per sentence in the chapter)
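A minimal Python sketch of this modified formula, using whitespace tokenization and a naive sentence split on end punctuation, might look like the following (AQuA's actual tokenization and sentence segmentation may differ):

import re

def modified_lix(chapter_text):
    # % of words longer than 7 characters + average number of words per sentence.
    words = chapter_text.split()  # whitespace tokenization
    sentences = [s for s in re.split(r"[.!?]+", chapter_text) if s.strip()]
    if not words or not sentences:
        return 0.0
    long_words = sum(1 for w in words if len(w.strip(".,;:!?\"'")) > 7)
    return 100 * long_words / len(words) + len(words) / len(sentences)

print(modified_lix("In the beginning God created the heavens and the earth."))  # 20.0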

This assessment works best for languages written with an alphabetic script. We use whitespace to tokenize the text into words. This assessment does not take into account word choice or syntax, which also factor into a text's readability.

CLARITY

Comprehensibility (Coming Soon)

The comprehensibility assessment gives insight into key information that may have been omitted from a translation.

How it works:

The comprehensibility assessment draws from a dataset of question-and-answer pairs that ask about expected key information in a verse. This assessment can only be run on translations in languages of wider communication (including back translations).

We provide a verse from the selected translation as context to a question-answering model and ask the model a question from the dataset pertaining to that verse. We then compare the model's answer to the expected answer from the dataset. If the two differ significantly, this suggests that information within the verse may be missing or incomplete.
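As an illustration, here is a minimal Python sketch using an off-the-shelf extractive question-answering model from the Hugging Face transformers library and a simple string comparison; the model named here and the answer-matching rule are placeholders, not the model or comparison method AQuA will use.

from transformers import pipeline

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")  # placeholder QA model

def check_verse(verse_text, question, expected_answer):
    # Ask the model the dataset question, using the verse as context.
    predicted = qa(question=question, context=verse_text)["answer"].strip().lower()
    expected = expected_answer.strip().lower()
    # A production comparison would likely use fuzzy or semantic matching
    # rather than a simple substring test.
    matches = expected in predicted or predicted in expected
    return predicted, matches

print(check_verse(
    "In the beginning God created the heavens and the earth.",
    "Who created the heavens and the earth?",
    "God",
))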

AQuA is currently available via a standalone web application, with plans to integrate into Scripture Forge and platform.bible. If your team is interested in using AQuA, please email us at idx_aqua@sil.org.