Translation Glossaries
Overview
Translation glossaries are specialized dictionaries that provide domain-specific or preferred translations for terms, ensuring consistency and accuracy in automated translations. This service uses the Google Cloud Translation API with custom glossaries to improve translation quality for specific use cases.
What is a Glossary?
A glossary is a collection of predefined term translations stored in a CSV file that helps the translation service provide more accurate, context-specific translations. Glossaries are particularly useful for:
- Technical terminology: Industry-specific terms that require precise translations
- Brand names: Company or product names that should remain consistent
- Domain-specific vocabulary: Medical, legal, or financial terms
- Preferred translations: When multiple valid translations exist, but you prefer specific ones
Creating a Glossary
Useful links
CSV File Format
Your glossary must be a CSV file with the following requirements:
Structure
- Header row: Language codes (e.g., en,fr,de)
- Data rows: Corresponding translations for each term
- Minimum columns: At least 2 language columns
- Encoding: UTF-8 (with or without BOM)
Example CSV Content
en,fr,de
hello,bonjour,hallo
goodbye,au revoir,auf wiedersehen
customer service,service client,kundendienst
artificial intelligence,intelligence artificielle,künstliche intelligenz
Language Code Requirements
- Must be valid Google Cloud Translation language codes
- Length: 2-10 characters
- Format: Alphanumeric characters, hyphens, and underscores allowed
- Examples: en, fr, de, en-US, zh-CN
Glossary code
A unique string that identifies the glossary. It cannot be edited and must have:
- Minimum 3 characters
- Only lowercase letters, numbers, and hyphens
- Examples: medical-terms, customer-support, legal-vocab
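The glossary code rules above can be checked with a short regular expression. This is a minimal sketch, not the service's actual validator: the function name is_valid_glossary_code and the exact pattern are assumptions derived only from the two rules listed here.

```python
import re

# Assumed pattern: at least 3 characters, using only lowercase letters,
# digits, and hyphens. The service's real validator may differ.
GLOSSARY_CODE_RE = re.compile(r"[a-z0-9-]{3,}")


def is_valid_glossary_code(code: str) -> bool:
    """Return True if the code satisfies the two documented rules."""
    return GLOSSARY_CODE_RE.fullmatch(code) is not None
```

Under these assumptions, medical-terms passes, while Medical (uppercase letter) and ab (too short) are rejected.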
Glossary case sensitivity
If this flag is set to true, glossary matching respects letter case. For example, in a case-sensitive glossary the pair Select,Select is not applied when translating a sentence that contains the lowercase word select.
Validation Rules
The system validates your CSV file and will reject it if:
- The file is empty
- It has only one language column (minimum 2 required)
- It contains invalid language code formats
- Language codes are empty or contain invalid characters
- File encoding is not UTF-8
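The rules above can be approximated before upload with a small local check. This is a sketch under stated assumptions, not the system's own validator: the function name validate_glossary_csv, the error messages, and the language code pattern (2-10 alphanumeric characters, hyphens, or underscores) are illustrative. It does not check whether a code is actually supported by Google Cloud Translation.

```python
import csv
import io
import re

# Assumed from the documented format rules: 2-10 characters,
# alphanumeric plus hyphens and underscores.
LANGUAGE_CODE_RE = re.compile(r"[A-Za-z0-9_-]{2,10}")


def validate_glossary_csv(raw: bytes) -> list[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    try:
        # "utf-8-sig" accepts UTF-8 both with and without a BOM.
        text = raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        return ["file encoding is not UTF-8"]

    # Ignore rows that contain only whitespace.
    rows = [
        row
        for row in csv.reader(io.StringIO(text, newline=""))
        if any(cell.strip() for cell in row)
    ]
    if not rows:
        return ["file is empty"]

    header = [cell.strip() for cell in rows[0]]
    errors = []
    if len(header) < 2:
        errors.append("at least 2 language columns are required")
    for code in header:
        if not code:
            errors.append("empty language code in header")
        elif not LANGUAGE_CODE_RE.fullmatch(code):
            errors.append(f"invalid language code: {code!r}")
    return errors
```

A typical use is to read the file in binary mode and pass the bytes in, for example validate_glossary_csv(open("glossary.csv", "rb").read()), where glossary.csv is a placeholder path.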
How Glossaries Are Used During Translation
Automatic Selection
When a translation request is made, the system automatically:
- Checks for available glossaries that support both the source and target language codes
- Selects the first matching glossary from the database
- Applies the glossary to the translation request if found
Selection Criteria
A glossary is selected for use when:
- The glossary's supported_language_codes contains both the source and target language codes
- The translation request includes a source language code (required for glossary usage)
Selection Priority
If multiple glossaries support the same language pair, the system selects the first one found in the database. There is currently no explicit prioritization beyond database ordering.
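The selection behavior described in this section can be sketched as a first-match search. The Glossary dataclass and the select_glossary function are hypothetical stand-ins for illustration. Only the supported_language_codes field name comes from this page, and the real system reads glossaries from the database rather than from an in-memory list.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Glossary:
    # Hypothetical stand-in for the stored glossary record.
    code: str
    supported_language_codes: Sequence[str]


def select_glossary(
    glossaries: Sequence[Glossary],
    source_language_code: Optional[str],
    target_language_code: str,
) -> Optional[Glossary]:
    """Return the first glossary covering both languages, or None."""
    # Without a source language code, no glossary can be applied.
    if not source_language_code:
        return None
    for glossary in glossaries:
        codes = glossary.supported_language_codes
        if source_language_code in codes and target_language_code in codes:
            # First match wins; there is no further prioritization.
            return glossary
    return None
```

Because the first match wins, two glossaries that both cover en and fr are resolved purely by their order in the input sequence, which mirrors the database-ordering behavior described above.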
Combining with Adaptive MT
When both an adaptive MT dataset and a glossary are available for a language pair, adaptive MT is used as the primary translation method and the glossary is applied on top for consistent terminology.
When Glossaries Are NOT Used
Glossaries will not be applied in the following cases:
1. Missing Source Language Code
{
"messages": ["Hello world"],
"target_language_code": "fr"
}
Reason: The request above does not include a source_language_code, and the Google Cloud Translation API requires a source language in order to apply a glossary.
2. No Matching Glossary
- No glossary exists that supports both the source and target language codes
- Example: A request for en → ja translation, but no glossary supports this language pair
3. Language Detection Mode
When the system auto-detects the source language, glossaries cannot be used because:
- The source language is unknown at request time
- Glossary selection requires knowing both languages in advance
4. Empty Content
- When the translation request contains no text to translate
- The system returns an empty result without glossary processing
Glossary Management
Creating/updating Glossaries
When you create or update a glossary:
- Create or edit the glossary in the admin interface
- Upload a new CSV file if needed
- The system automatically provisions the created or updated glossary to Google Cloud
- Old glossary data is replaced with the new version
Monitoring
- Glossary provisioning happens asynchronously
- Check system logs for provisioning status
- Failed provisioning attempts are logged as errors
API Usage
Translation Request Example
POST /api/translate
{
"messages": ["The customer service was excellent"],
"source_language_code": "en",
"target_language_code": "fr"
}
If a glossary exists supporting en and fr, it will automatically be applied to improve translation accuracy for terms like "customer service".
Response
The API returns the same format regardless of whether a glossary was used, but the translation quality should be improved for terms covered by the glossary.
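For callers working from Python, the request shown above can be assembled with the standard library alone. This is a minimal sketch: the helper name build_translate_request and the base URL are placeholders, and any authentication headers your deployment requires are not covered on this page and are left out.

```python
import json
import urllib.request
from typing import Optional, Sequence


def build_translate_request(
    base_url: str,
    messages: Sequence[str],
    target_language_code: str,
    source_language_code: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST request for /api/translate."""
    payload = {
        "messages": list(messages),
        "target_language_code": target_language_code,
    }
    # Include the source language so that a matching glossary can be applied.
    if source_language_code:
        payload["source_language_code"] = source_language_code
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/translate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The returned object can be sent with urllib.request.urlopen(request). Omitting source_language_code still produces a valid request, but, as described earlier on this page, no glossary will be applied to it.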
Troubleshooting
Common Issues
- Glossary not being used
  - Verify both source and target languages are in the glossary's supported languages
  - Ensure source_language_code is provided in the request
  - Check that the glossary was successfully provisioned
- CSV validation errors
  - Verify UTF-8 encoding
  - Check that language codes follow the correct format
  - Ensure at least 2 language columns exist
- Provisioning failures
  - Check system logs for Google Cloud API errors
  - Verify Google Cloud credentials and permissions
  - Ensure the CSV file is accessible in the configured storage bucket
Where to Find Glossaries in Admin
In the Django admin portal, use the left sidebar and open the Translation section. You will see Glossary listed there for managing glossaries.