Augmented Quality Assessment (AQuA)
What is AQuA?
AQuA is a tool developed by SIL International that seeks to develop capacity and increase the pace of translation quality assurance. It harnesses the latest artificial intelligence (AI) techniques, assisting human reviewers in objectively assessing multiple facets of translation quality by enhancing and complementing preexisting translation work. AQuA is able to produce an increasingly detailed suite of quality-related diagnostics, with several augmented quality assessment methods in the areas of accuracy, clarity, and naturalness.
To this end, AQuA is creating:
- power tools for translation consultants, which will allow them to do their checking work more thoroughly and more consistently;
- an early warning system for translators, which will alert them to obvious problems so they can address them earlier in the translation process;
- an equalizer for administrators and strategists, which will allow them to compare and evaluate methodologies and products on an equal footing (including new AI-based translation methodologies proposed by TBTA, SIL, and Avodah).
By identifying possibly problematic or anomalous passages in a draft, AQuA accelerates the quality assessment process by helping:
- translators make corrections (and produce an improved draft) prior to consultant checks;
- consultants gain an “at-a-glance” understanding of quality across a draft and quickly dig into the relevant, granular quality issues; and
- administrators and project managers track quality across projects to guide strategic allocations of checking resources.
Methods of Assessment
AQuA currently offers four methods of assessment in the areas of accuracy, naturalness, and clarity, with two more set to be released in the coming months:
Formal Equivalence
The formal equivalence assessment gives insight into how formally equivalent or dynamic a draft translation is in comparison to a reference.
How it works:
Formal equivalence scores are a measure of AQuA’s confidence in its ability to align functionally equivalent words between a draft and a reference. We call the act of aligning these words "word correspondence". Our word correspondence algorithm works in several steps. First, we use a variety of computational techniques to align corresponding words between the two texts. We employ three algorithms for word correspondence:
- Fast Align (the same algorithm used in the Paratext Interlinearizer)
- The "Match Algorithm", which aligns words based on their verse co-occurrences
- Comparison of high-dimensional word embeddings produced by an AI autoencoder model
Each of these techniques gives a score representing the confidence of a given word alignment. We now have a list of plausible word alignment pairs between the draft and the reference, and a confidence score for each alignment.
Correspondence scores are calculated by averaging the scores produced by the above algorithms for each word. We get the dynamicity score for a specific verse by averaging the scores for each word in the verse. Higher dynamicity scores mean that the algorithm was able to align draft words with reference words more easily, and therefore the two texts are likely more formally equivalent. Lower dynamicity scores mean that the words are more difficult to align, and therefore the translations are probably less formally equivalent.
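The two averaging steps described above can be sketched as follows. This is a minimal illustration, not AQuA's implementation: the function names and the example confidence values are hypothetical, and a plain unweighted mean is assumed for both steps.

```python
from statistics import mean

def word_correspondence(scores_by_algorithm):
    """Average the confidence scores the aligners gave to one word."""
    return mean(scores_by_algorithm)

def verse_score(word_scores):
    """Average the per-word scores into a single score for the verse."""
    return mean(word_scores) if word_scores else 0.0

# Hypothetical confidences for three draft words, one value per aligner
# (Fast Align, Match Algorithm, embedding comparison).
verse_alignments = {
    "god": [0.9, 0.8, 0.7],
    "loved": [0.6, 0.5, 0.4],
    "world": [0.3, 0.2, 0.1],
}

per_word = {word: word_correspondence(s) for word, s in verse_alignments.items()}
score = verse_score(list(per_word.values()))
print(per_word, score)
```

In this sketch a verse whose words all align with high confidence ends up with a score near 1, and a verse with many weak alignments ends up near 0.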
We tokenize words by whitespace, so the formal equivalence assessment currently does not work for non-whitespace languages, such as Thai. This assessment also yields more accurate results when used on isolating languages, as opposed to polysynthetic languages. Future research will focus on non-whitespace, subword, and phrase-level tokenization to improve performance on a wider variety of languages.
Missing Words
The missing words assessment identifies words from a source text that may be missing from a draft.
How it works:
To identify potential missing words, we build on the word alignments generated by the formal equivalence assessment. After running the word correspondence algorithm between the draft and the source translation, we run it again between the source translation and several other reference translations. Finally, we compare the scores across the assessments.
Words in the source translation that have a low correspondence score with regard to the draft but a high alignment score with at least one word in each of the other reference translations are flagged as potentially missing from the draft. In other words, if the algorithm can find a likely match for a given source word in a given verse in several references, but not in the draft, that source word likely doesn't have an equivalent in the draft.
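The flagging rule above could look roughly like this. It is a sketch under stated assumptions: the function name, the input shapes, and the `low` and `high` cutoffs are all hypothetical, since the text does not give the thresholds AQuA uses for this assessment.

```python
def flag_missing_words(source_to_draft, source_to_refs, low=0.1, high=0.5):
    """Return source words that align poorly with the draft but well with every reference.

    source_to_draft: {source_word: best correspondence score against the draft}
    source_to_refs:  list of {source_word: best score against one reference}
    low, high:       hypothetical cutoffs, not values taken from AQuA
    """
    flagged = []
    for word, draft_score in source_to_draft.items():
        # Require at least one reference, and a confident match in each of them.
        in_every_ref = bool(source_to_refs) and all(
            ref.get(word, 0.0) >= high for ref in source_to_refs
        )
        if draft_score < low and in_every_ref:
            flagged.append(word)
    return flagged

# Hypothetical scores for one verse.
draft_scores = {"grace": 0.05, "and": 0.02, "truth": 0.8}
reference_scores = [
    {"grace": 0.9, "and": 0.2, "truth": 0.9},
    {"grace": 0.7, "and": 0.6, "truth": 0.85},
]
missing = flag_missing_words(draft_scores, reference_scores)
print(missing)
```

Here "and" is not flagged even though it aligns poorly with the draft, because one of the references does not contain a confident match for it either.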
We tokenize words by whitespace, so the missing words assessment currently does not work for non-whitespace languages, such as Thai. This assessment also yields more accurate results when used on isolating languages, as opposed to polysynthetic languages. Future research will focus on non-whitespace, subword, and phrase-level tokenization to improve performance on a wider variety of languages.
Added Words (Coming Soon)
The added words assessment identifies words that exist in a draft but have no likely equivalent in a source.
How it works:
The process for identifying potential added words is very similar to identifying missing words. First, we run word correspondence between the draft and the source texts. Each word in each verse in the draft gets paired with its most likely equivalent word in the source by choosing the alignment with the highest correspondence score. If the correspondence score for one of these "most likely" word pairs falls below a specified threshold (e.g., 0.1), it is flagged as a potential added word, indicating that the corresponding word may not exist in the source text.
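A minimal sketch of this thresholding step is shown below. The function name and input shape are hypothetical; the default threshold of 0.1 is the example value mentioned above.

```python
def flag_added_words(alignments, threshold=0.1):
    """Return draft words whose best source alignment falls below the threshold.

    alignments: {draft_word: {source_word: correspondence score}}
    """
    flagged = []
    for draft_word, candidates in alignments.items():
        # The "most likely" pair is the candidate with the highest score;
        # a draft word with no candidates at all is treated as scoring 0.
        best_score = max(candidates.values(), default=0.0)
        if best_score < threshold:
            flagged.append(draft_word)
    return flagged

# Hypothetical alignment scores for one verse of a draft.
verse_alignments = {
    "light": {"lux": 0.82, "et": 0.03},
    "indeed": {"lux": 0.04, "et": 0.06},
    "shines": {"lucet": 0.7},
}
added = flag_added_words(verse_alignments)
print(added)
```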
We tokenize words by whitespace, so the added words assessment currently does not work for non-whitespace languages, such as Thai. This assessment also yields more accurate results when used on isolating languages, as opposed to polysynthetic languages. Future research will focus on non-whitespace, subword, and phrase-level tokenization to improve performance on a wider variety of languages.
Semantic Similarity
The semantic similarity assessment measures the difference in meaning between a back translation [1] of a draft and a reference translation. This assessment can show us where the meaning of the two passages is very similar, even if they use completely different words, as well as where the meanings are very different, even if they use very similar words.
How it works:
We use Google's Language Agnostic BERT Sentence Embeddings (LaBSE) model to assign a score from 0 (most different) to 10 (most similar) to each verse and chapter pair. You can read more about the LaBSE model here, or read the model paper here.
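As an illustration of how two sentence embeddings can be turned into a 0 to 10 score, the sketch below uses cosine similarity. This is an assumption, not a description of AQuA's scoring: the text does not say how LaBSE embeddings are mapped onto the scale, so the clamping and scaling here are hypothetical, and the short vectors stand in for real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def similarity_score(draft_embedding, reference_embedding):
    """Map cosine similarity onto a 0 to 10 scale (assumed mapping).

    Negative similarities are clamped to 0 so the score stays in range.
    """
    cosine = cosine_similarity(draft_embedding, reference_embedding)
    return round(10 * max(0.0, cosine), 2)

# Toy vectors standing in for the sentence embeddings of a verse pair.
same_meaning = similarity_score([1.0, 0.0], [1.0, 0.0])
unrelated = similarity_score([1.0, 0.0], [0.0, 1.0])
print(same_meaning, unrelated)
```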
Lix Readability
The Lix readability assessment uses a simple formula, the Lix readability metric, as a proxy for the readability of a text.
How it works:
Lix is a readability formula based on the word and sentence length of a text. Lix works better for estimating the readability of non-English text than other readability metrics like Flesch-Kincaid, which contain English-specific components. The Lix readability formula is as follows:
Lix = (% of long words in the text) + (average number of words per sentence in the text)
where a word is classified as "long" if it contains more than a certain number of letters (for our purposes, more than 7 letters).
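The formula can be computed directly, as in the sketch below. The function name is hypothetical, and two simplifications are assumed: sentences are split on ".", "!" and "?", and words are split on whitespace. The long-word cutoff follows the "more than 7 letters" rule given above.

```python
import re

def lix(text, long_word_letters=7):
    """Lix = percentage of long words + average words per sentence."""
    # Assumed sentence splitter: runs of '.', '!' or '?' end a sentence.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = text.split()
    if not words or not sentences:
        return 0.0
    # Count only letters, so trailing punctuation does not lengthen a word.
    long_words = sum(
        1 for w in words if sum(ch.isalpha() for ch in w) > long_word_letters
    )
    return 100 * long_words / len(words) + len(words) / len(sentences)

sample = "The shepherd watched his sheep. Everything was peaceful."
print(lix(sample))  # 3 long words out of 8, 2 sentences: 37.5 + 4.0 = 41.5
```

A higher value indicates longer words and sentences, and so a text that is likely harder to read.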
Lix readability works best for languages written with an alphabetic script, with whitespace at word boundaries. It does not take into account word choice or syntax, which also factor into a text's readability. We are exploring opportunities to enhance the assessment by developing a more comprehensive, multilingual measure of readability with deep learning. This approach will take into account the unique characteristics and readability standards of diverse languages, ensuring a more accurate and contextually relevant evaluation.
Comprehensibility (Coming Soon)
The comprehensibility assessment focuses on evaluating the understandability and completeness of information in a draft.
How it works:
This assessment incorporates machine-generated question-answer pairs that correspond to the expected information within each verse. It requires a back translation of the draft text into a language of wider communication.
Using a question-answering model, we provide the verse text as context and pose the corresponding question to the model. We then compare the model's answer to the expected answer from the question-answer pair. If the model's answer differs significantly from the expected answer, it suggests the possibility of missing or incomplete information within the verse.
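The comparison step might be sketched as follows. Everything here is an assumption for illustration: the text does not say how answers are compared, so a token-overlap F1 score is swapped in, the `min_f1` cutoff is hypothetical, and `answer_fn` is a stand-in for whatever question-answering model is used.

```python
from collections import Counter

def token_f1(predicted, expected):
    """Token-overlap F1 between two answers (an assumed comparison metric)."""
    pred = predicted.lower().split()
    exp = expected.lower().split()
    if not pred or not exp:
        return 0.0
    overlap = sum((Counter(pred) & Counter(exp)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(exp)
    return 2 * precision * recall / (precision + recall)

def flag_verse(verse_text, question, expected_answer, answer_fn, min_f1=0.5):
    """Flag a verse when the model's answer differs too much from the expected one.

    answer_fn(question, context) is a hypothetical wrapper around a QA model.
    """
    model_answer = answer_fn(question, verse_text)
    return token_f1(model_answer, expected_answer) < min_f1

# Stub answerers standing in for a real question-answering model.
good_model = lambda question, context: "in Bethlehem"
lost_model = lambda question, context: "unknown"

verse = "Jesus was born in Bethlehem of Judea."
ok = flag_verse(verse, "Where was Jesus born?", "Bethlehem", good_model)
flagged = flag_verse(verse, "Where was Jesus born?", "Bethlehem", lost_model)
print(ok, flagged)
```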
We are currently piloting AQuA in a small number of Bible Translation field trials. If you are interested in being a test user, please fill out this contact form.
[1] For best results, the back/reference translations should be in a language of wider communication (LWC). However, good results have been observed when assessing languages closely related to an LWC. For example, Modern Standard Arabic is an LWC, so an assessment of an Egyptian Arabic translation still yields useful results.