Translation Glossaries

Overview​

Translation glossaries are specialized dictionaries that provide domain-specific or preferred translations for terms, ensuring consistency and accuracy in automated translations. This service uses Google Cloud Translation API with custom glossaries to enhance translation quality for specific use cases.

What is a Glossary?​

A glossary is a collection of predefined term translations stored in a CSV file that helps the translation service provide more accurate, context-specific translations. Glossaries are particularly useful for:

  • Technical terminology: Industry-specific terms that require precise translations
  • Brand names: Company or product names that should remain consistent
  • Domain-specific vocabulary: Medical, legal, or financial terms
  • Preferred translations: When multiple valid translations exist, but you prefer specific ones

Creating a Glossary​

CSV File Format​

Your glossary must be a CSV file with the following requirements:

Structure​

  • Header row: Language codes (e.g., en,fr,de)
  • Data rows: Corresponding translations for each term
  • Minimum columns: At least 2 language columns
  • Encoding: UTF-8 (with or without BOM)

Example CSV Content​

en,fr,de
hello,bonjour,hallo
goodbye,au revoir,auf wiedersehen
customer service,service client,kundendienst
artificial intelligence,intelligence artificielle,künstliche intelligenz

Language Code Requirements​

Glossary code​

A unique string that identifies the glossary. It cannot be edited and must meet the following rules:

  • Minimum 3 characters
  • Only lowercase letters, numbers, and hyphens
  • Examples: medical-terms, customer-support, legal-vocab
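
The rules above can be checked with a short regular expression. This is a minimal sketch rather than the service's actual validator; the function name is_valid_glossary_code is hypothetical, and the pattern assumes the rules are exactly as listed (no extra constraints such as a maximum length).

```python
import re

# Assumed pattern: 3 or more characters, each a lowercase letter, digit, or hyphen.
GLOSSARY_CODE_RE = re.compile(r"[a-z0-9-]{3,}")


def is_valid_glossary_code(code: str) -> bool:
    """Return True if the code satisfies the documented glossary code rules."""
    return GLOSSARY_CODE_RE.fullmatch(code) is not None


print(is_valid_glossary_code("medical-terms"))
print(is_valid_glossary_code("Legal_Vocab"))
```

The first call passes the check; the second fails it because of the uppercase letters and the underscore.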

Glossary case sensitivity​

If this flag is set to true, glossary terms are matched case-sensitively. For example, the pair Select,Select in a case-sensitive glossary will not be used when translating a sentence that contains the lowercase word select.
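
The effect of the flag can be illustrated with a simplified matcher. The real matching is performed by the translation backend, so the sketch below is only an approximation: the function term_matches is hypothetical, and whole-word matching is an assumption made for the illustration.

```python
import re


def term_matches(term: str, sentence: str, case_sensitive: bool) -> bool:
    """Illustrative check: does a glossary term occur in the sentence as a whole word?"""
    flags = 0 if case_sensitive else re.IGNORECASE
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, sentence, flags) is not None


sentence = "Please select an option"
print(term_matches("Select", sentence, case_sensitive=True))
print(term_matches("Select", sentence, case_sensitive=False))
```

With case sensitivity enabled the term Select does not match the lowercase select in the sentence; with it disabled, it does.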

Validation Rules​

The system validates your CSV file and will reject it if:

  • The file is empty
  • It has only one language column (minimum 2 required)
  • It contains invalid language code formats
  • Language codes are empty or contain invalid characters
  • File encoding is not UTF-8
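
To catch these problems before uploading, a file can be pre-checked locally. The following is a minimal sketch of the listed rules, not the system's own validator: validate_glossary_csv is a hypothetical helper, and the language code pattern is an assumption (a simple shape such as en, fr, or pt-BR), since the exact accepted format is not specified here.

```python
import csv
import io
import re

# Assumed shape for language codes, e.g. "en", "fr", "pt-BR".
LANGUAGE_CODE_RE = re.compile(r"[A-Za-z]{2,3}(-[A-Za-z0-9]{2,8})*")


def validate_glossary_csv(raw: bytes) -> list[str]:
    """Return a list of problems; an empty list means these checks passed."""
    if not raw.strip():
        return ["file is empty"]
    try:
        # "utf-8-sig" accepts UTF-8 input with or without a BOM.
        text = raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        return ["file encoding is not UTF-8"]

    rows = list(csv.reader(io.StringIO(text, newline="")))
    header = [cell.strip() for cell in rows[0]] if rows else []

    problems = []
    if len(header) < 2:
        problems.append("at least 2 language columns are required")
    for code in header:
        if not LANGUAGE_CODE_RE.fullmatch(code):
            problems.append(f"invalid language code: {code!r}")
    return problems


sample = "en,fr,de\nhello,bonjour,hallo\n".encode("utf-8")
print(validate_glossary_csv(sample))
```

Returning a list of problems rather than raising on the first one lets a caller report several header issues at once.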

How Glossaries Are Used During Translation​

Automatic Selection​

When a translation request is made, the system automatically:

  1. Checks for available glossaries that support both the source and target language codes
  2. Selects the first matching glossary from the database
  3. Applies the glossary to the translation request if found

Selection Criteria​

A glossary is selected for use when:

  • The glossary's supported_language_codes contains both the source and target language codes
  • The translation request includes a source language code (required for glossary usage)

Selection Priority​

If multiple glossaries support the same language pair, the system selects the first one found in the database. There is currently no explicit prioritization beyond database ordering.
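
The selection behavior described above can be sketched as a first-match search. This is an illustration under stated assumptions, not the service's code: the Glossary dataclass and select_glossary function are hypothetical, and the list order stands in for database ordering. Only the supported_language_codes field name comes from this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Glossary:
    code: str
    supported_language_codes: list[str]


def select_glossary(
    glossaries: list[Glossary], source: Optional[str], target: str
) -> Optional[Glossary]:
    """Return the first glossary that supports both languages, or None."""
    if not source:
        # Without a source language code, no glossary can be selected.
        return None
    for glossary in glossaries:
        codes = glossary.supported_language_codes
        if source in codes and target in codes:
            return glossary
    return None


glossaries = [
    Glossary("medical-terms", ["en", "de"]),
    Glossary("customer-support", ["en", "fr", "de"]),
    Glossary("legal-vocab", ["en", "fr"]),
]
match = select_glossary(glossaries, "en", "fr")
print(match.code if match else None)
```

Here both customer-support and legal-vocab support en and fr, and the earlier one in the list is returned, mirroring the "first one found" rule.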

Combining with Adaptive MT​

When both an adaptive MT dataset and a glossary are available for a language pair, adaptive MT is used as the primary translation method and the glossary is applied on top for consistent terminology.

When Glossaries Are NOT Used​

Glossaries will not be applied in the following cases:

1. Missing Source Language Code​

{
  "messages": ["Hello world"],
  "target_language_code": "fr"
}

Reason: The request above does not include source_language_code, and the Google Cloud Translation API requires a source language in order to apply a glossary.

2. No Matching Glossary​

  • No glossary exists that supports both the source and target language codes
  • Example: Request for en → ja translation, but no glossary supports this language pair

3. Language Detection Mode​

When the system auto-detects the source language, glossaries cannot be used because:

  • The source language is unknown at request time
  • Glossary selection requires knowing both languages in advance

4. Empty Content​

  • When the translation request contains no text to translate
  • The system returns an empty result without glossary processing

Glossary Management​

Creating/updating Glossaries​

When you create or update a glossary:

  1. Create or edit the glossary in the admin interface
  2. Upload a new CSV file if needed
  3. The system automatically provisions the created or updated glossary to Google Cloud
  4. Old glossary data is replaced with the new version

Monitoring​

  • Glossary provisioning happens asynchronously
  • Check system logs for provisioning status
  • Failed provisioning attempts are logged as errors

API Usage​

Translation Request Example​

POST /api/translate
{
  "messages": ["The customer service was excellent"],
  "source_language_code": "en",
  "target_language_code": "fr"
}

If a glossary exists supporting en and fr, it will automatically be applied to improve translation accuracy for terms like "customer service".
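
A client might build this request as follows. This is a minimal sketch using only the Python standard library; the helper build_translate_request and the base URL are placeholders, and authentication is omitted because it is not covered on this page. The sketch constructs the request without sending it.

```python
import json
import urllib.request
from typing import Optional


def build_translate_request(
    base_url: str, messages: list[str], target: str, source: Optional[str] = None
) -> urllib.request.Request:
    """Build a POST request for /api/translate (hypothetical helper)."""
    payload = {"messages": messages, "target_language_code": target}
    if source:
        # Include the source language so that a glossary can be applied.
        payload["source_language_code"] = source
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/translate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# "https://example.invalid" is a placeholder host, not a real endpoint.
request = build_translate_request(
    "https://example.invalid",
    ["The customer service was excellent"],
    target="fr",
    source="en",
)
print(request.get_method(), request.full_url)
```

Passing the resulting object to urllib.request.urlopen would send it; any credentials or headers your deployment requires would need to be added first.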

Response​

The API returns the same format regardless of whether a glossary was used, but the translation quality should be improved for terms covered by the glossary.

Troubleshooting​

Common Issues​

  1. Glossary not being used

    • Verify both source and target languages are in the glossary's supported languages
    • Ensure source_language_code is provided in the request
    • Check that the glossary was successfully provisioned
  2. CSV validation errors

    • Verify UTF-8 encoding
    • Check that language codes follow the correct format
    • Ensure at least 2 language columns exist
  3. Provisioning failures

    • Check system logs for Google Cloud API errors
    • Verify Google Cloud credentials and permissions
    • Ensure the CSV file is accessible in the configured storage bucket

Where to Find Glossaries in Admin​

In the Django admin portal, use the left sidebar and open the Translation section. You will see Glossary listed there for managing glossaries.

External Documentation​