When AI Translations Fall Short
AI translation tools work well for common U.S. languages like Spanish, but are outperformed by human interpreters for less common languages.
Read Time: 3 minutes
- AI is increasingly being used in medical settings for real-time and high-speed translation, but studies show that these tools may introduce errors.
- AI translation models perform worse than professional interpreters in less common languages.
Although language is universal, true understanding depends on a shared mother tongue. In the United States, we speak over 500 languages. English may be the primary language for some, but it is the second or third language for a large part of the population. More than 22% of Americans reported speaking a language other than English at home. Individuals who are not fluent in English face barriers when accessing health care.
Studies have shown that people who expect to have to speak a language they are not comfortable with in a health care setting may delay necessary care, which can worsen their health conditions. Doctors are increasingly using artificial intelligence (AI) models as translators to make care easier for their non-English-speaking patients. However, without proper caution, these tools may introduce medical errors.
Melissa Martos and colleagues evaluated the effectiveness of AI translation tools compared to professional human interpreters. The researchers instructed an AI tool to translate discharge instructions given to families at Seattle Children’s Hospital from English into Chinese, Somali, Vietnamese, and Spanish.

Human and AI translations were scored by professional interpreters on a scale from 1 to 5, with higher scores indicating better performance. Scores were based on four measures: how natural the translations sounded (graph A), how accurate they were (graph B), how well they conveyed the original meaning and tone (graph C), and the expected impact of their errors on clinical care (graph D).
On average, human interpreters (indicated by the left bars in each pairing) performed better than the AI models (right bars) in each test. This was the case across all languages except Spanish (orange bars), where no notable differences were observed. The authors speculate that this gap between languages stems from differences in sentence structure (syntax): Spanish syntax is more similar to English syntax than that of the other languages evaluated.
Interpreters remain a centerpiece of health care for individuals who are not fluent in English. Ensuring translations are accurate across all languages is essential to providing equitable care. If a translator, AI or human, is unclear about medical instructions, the patient will suffer. The findings from Martos and colleagues support previous studies reporting that AI translation models perform similarly to human interpreters only in commonly spoken U.S. languages, such as Spanish. Patients who speak less commonly used languages continue to face limitations in interpretation. The authors recommend training computer models on a wider assortment of languages to ensure that all patients can access accurate translation technologies.