
Adaptive MT Datasets

Overview

Adaptive Machine Translation (MT) datasets are specialized training data collections that improve translation quality by teaching Google Cloud Translation API domain-specific patterns and preferences. These datasets contain sentence pairs in source and target languages that help the translation service provide more accurate, context-aware translations for specific use cases.

What are Adaptive Datasets?

An Adaptive MT Dataset is a collection of translation examples stored as TSV files that helps the translation service learn domain-specific translation patterns. Adaptive datasets are particularly useful for:

  • Domain adaptation: Training on specific industry or context terminology
  • Consistent translations: Ensuring similar phrases are translated consistently
  • Quality improvement: Learning from high-quality human translations
  • Custom terminology: Teaching the model preferred translations for specific terms
  • Context-aware translation: Understanding how phrases should be translated in specific contexts

Creating an Adaptive Dataset

Dataset Configuration

An Adaptive MT Dataset requires:

Dataset Code

A unique identifier that cannot be edited and must have:

  • Minimum 1 character
  • Only alphanumeric characters and hyphens allowed
  • Examples: english-to-dutch, some-name
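The dataset-code rules above can be checked with a simple regular expression. This is an illustrative sketch, not the product's actual validator; the function name is invented for the example.

```python
import re

# Dataset codes: at least 1 character, alphanumeric characters and hyphens only.
DATASET_CODE_RE = re.compile(r"^[A-Za-z0-9-]+$")

def is_valid_dataset_code(code: str) -> bool:
    """Illustrative check mirroring the dataset-code rules above."""
    return bool(DATASET_CODE_RE.match(code))

print(is_valid_dataset_code("english-to-dutch"))  # True
print(is_valid_dataset_code("some name"))         # False (space not allowed)
```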

Language Pair

Each dataset supports exactly one language pair:

  • Source Language Code: The language code of the original text (e.g., en, fr, de)
  • Target Language Code: The language code for translations (e.g., fr, de, en)
  • Must be valid Google Cloud Translation language codes

Adding Training Files

TSV File Format

Your training data must be uploaded as TSV (Tab-Separated Values) files with the following requirements:

Structure
  • Two columns exactly: Source text and target text separated by a tab character
  • No header row: Each line contains a sentence pair
  • Encoding: UTF-8 (with or without BOM)
  • File extension: Must be .tsv (case-insensitive)
Example TSV Content

Hello world	Bonjour le monde
Good morning	Bonjour
How can I help you?	Comment puis-je vous aider?
Thank you for your patience	Merci pour votre patience
Customer service	Service client

Language Code Requirements

Both source and target language codes must follow the same format as glossaries:

  • Length: 2-10 characters
  • Format: Alphanumeric characters, hyphens, and underscores allowed
  • Examples: en, fr, de, en-US, zh-CN
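The language-code format above can be expressed the same way; again a sketch of the stated rules, not the service's own check.

```python
import re

# Language codes: 2-10 characters; letters, digits, hyphens, and underscores.
LANG_CODE_RE = re.compile(r"^[A-Za-z0-9_-]{2,10}$")

def is_valid_language_code(code: str) -> bool:
    """Illustrative check mirroring the language-code rules above."""
    return bool(LANG_CODE_RE.match(code))

print(is_valid_language_code("zh-CN"))  # True
print(is_valid_language_code("x"))      # False (shorter than 2 characters)
```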

Validation Rules

The system validates your TSV files and will reject them if:

  • The file doesn't have a .tsv extension
  • The file is empty or contains only whitespace
  • Any line doesn't have exactly 2 columns separated by a tab
  • Either the source text (first column) or target text (second column) is empty or contains only whitespace
  • File encoding is not UTF-8
  • The file contains invalid characters that can't be decoded
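The validation rules above can be sketched as a small checker. This is an illustrative reimplementation, not the service's actual code; the function name and error messages are invented for the example.

```python
def validate_tsv(path: str) -> list[str]:
    """Return a list of error messages; an empty list means the file passed.

    Mirrors the rules above: .tsv extension, UTF-8, non-empty file,
    and exactly two non-blank tab-separated columns on every line.
    """
    if not path.lower().endswith(".tsv"):
        return ["file must have a .tsv extension"]
    try:
        # utf-8-sig tolerates an optional BOM, per the encoding rule above.
        with open(path, encoding="utf-8-sig") as f:
            content = f.read()
    except UnicodeDecodeError:
        return ["file is not valid UTF-8"]
    if not content.strip():
        return ["file is empty or contains only whitespace"]
    errors = []
    for lineno, line in enumerate(content.splitlines(), start=1):
        columns = line.split("\t")
        if len(columns) != 2:
            # Blank lines also fail here: they have one (empty) column.
            errors.append(f"line {lineno}: expected exactly 2 tab-separated columns")
        elif not columns[0].strip() or not columns[1].strip():
            errors.append(f"line {lineno}: source and target text must be non-empty")
    return errors
```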

How Adaptive Datasets Are Used During Translation

Automatic Selection

When a translation request is made, the system automatically:

  1. Checks for available adaptive datasets that match both the exact source and target language codes
  2. Selects the first matching dataset from the database if found
  3. Uses adaptive MT translation instead of standard translation if a dataset is available
  4. Applies glossaries alongside adaptive datasets if both are available for the language pair
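The steps above amount to a first-match lookup keyed on the exact language pair. A minimal sketch, with the dataclass and field names invented for illustration:

```python
from dataclasses import dataclass
from typing import Iterable, Optional

@dataclass
class AdaptiveDataset:
    code: str
    source_language_code: str
    target_language_code: str

def select_dataset(
    datasets: Iterable[AdaptiveDataset],
    source_lang: Optional[str],
    target_lang: str,
) -> Optional[AdaptiveDataset]:
    """Return the first dataset matching the exact language pair, if any."""
    if not source_lang:
        # Adaptive MT requires an explicitly specified source language.
        return None
    for ds in datasets:  # "first found" order; no further prioritization
        if (ds.source_language_code == source_lang
                and ds.target_language_code == target_lang):
            return ds
    return None
```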

Selection Criteria

An adaptive dataset is selected for use when:

  • The dataset's source_language_code exactly matches the request's source language
  • The dataset's target_language_code exactly matches the request's target language
  • The translation request includes a source language code (required for adaptive MT usage)

Selection Priority

If multiple adaptive datasets support the same language pair, the system selects the first one found in the database. There is currently no explicit prioritization beyond database ordering.

Combining with Glossaries

When both an adaptive dataset and glossary are available for a language pair:

  • Adaptive MT translation is used as the primary translation method
  • Glossary is applied on top to ensure specific terms are translated according to the glossary
  • This provides the best of both: context-aware translation from the dataset plus consistent terminology from the glossary
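Conceptually, the combination can be sketched as a small planning step; the actual calls go through the Google Cloud client, and the names below are invented for illustration.

```python
def build_translation_plan(dataset, glossary) -> dict:
    """Decide which features a request will use (illustrative only)."""
    plan = {"method": "standard"}
    if dataset is not None:
        plan["method"] = "adaptive_mt"  # dataset takes over as primary method
        plan["dataset"] = dataset
    if glossary is not None:
        plan["glossary"] = glossary  # applied on top of whichever method runs
    return plan
```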

When Adaptive Datasets Are NOT Used

Adaptive datasets will not be applied in the following cases:

1. Missing Source Language Code

{
  "messages": ["Hello world"],
  "target_language_code": "fr"
}

Reason: Google Cloud Adaptive MT API requires both source and target languages to be explicitly specified. No source_language_code is provided in the example above.

2. No Matching Dataset

  • No adaptive dataset exists with the exact source and target language code combination
  • Example: Request for en → ja translation, but no dataset exists for this specific pair

3. Language Detection Mode

When the system auto-detects the source language, adaptive datasets cannot be used because:

  • The source language is unknown at request time
  • Dataset selection requires knowing both languages in advance
  • The system falls back to standard translation with glossaries (if available)

4. Empty Content

  • When the translation request contains no text to translate
  • The system returns an empty result without dataset processing
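The four cases above reduce to a short decision function; the names here are illustrative, not the system's actual code.

```python
from typing import Callable, List, Optional

def choose_translation_path(
    messages: List[str],
    source_lang: Optional[str],
    target_lang: str,
    find_dataset: Callable[[str, str], object],
) -> str:
    """Return which path a request takes, per the cases above (sketch)."""
    if not messages or all(not m.strip() for m in messages):
        return "empty_result"   # case 4: nothing to translate
    if not source_lang:
        return "standard"       # cases 1 and 3: source language unknown
    if find_dataset(source_lang, target_lang) is None:
        return "standard"       # case 2: no matching dataset
    return "adaptive_mt"
```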

Dataset and File Management

Creating Datasets

When you create an adaptive dataset:

  1. Create the dataset in the admin interface with a unique code and language pair
  2. The system automatically provisions the dataset to Google Cloud
  3. The dataset is ready to receive training files

Adding Training Files

When you upload TSV files:

  1. Upload TSV files through the admin interface
  2. Files are validated for proper format and content
  3. The system automatically imports the file data to Google Cloud
  4. Multiple files can be added to the same dataset for incremental training

Updating Files

When you update or replace TSV files:

  1. Delete the old file and upload a new one, or edit the existing file
  2. The system automatically deletes the old data from Google Cloud
  3. New file data is imported to replace the previous version
  4. Old training data is completely replaced with the new version

Monitoring

  • Dataset and file provisioning happens asynchronously
  • Check system logs for provisioning status
  • Failed provisioning attempts are logged as errors
  • File imports may take several minutes to complete on Google Cloud

API Usage

Translation Request Example

POST /api/translate
{
  "messages": ["The customer service was excellent"],
  "source_language_code": "en",
  "target_language_code": "fr"
}

If an adaptive dataset exists for en → fr, it will automatically be used to provide domain-aware translation. If a glossary also exists supporting both languages, it will be applied on top of the adaptive translation.
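A request like the one above can be sent with any HTTP client. This sketch only builds and prints the request body; the endpoint path comes from the example, and the deployment host is up to you.

```python
import json

# Body for POST /api/translate on your deployment.
payload = {
    "messages": ["The customer service was excellent"],
    "source_language_code": "en",  # required for adaptive MT to be considered
    "target_language_code": "fr",
}
print(json.dumps(payload, indent=2))
```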

Response

The API returns the same response format whether or not an adaptive dataset was used, but translation quality should improve for content that resembles the dataset's training data.

Troubleshooting

Common Issues

  1. Adaptive dataset not being used

    • Verify the exact source and target language codes match between the dataset and request
    • Ensure source_language_code is provided in the request
    • Check that the dataset was successfully provisioned to Google Cloud
    • Review system logs for any provisioning errors
  2. TSV validation errors

    • Verify the file has a .tsv extension
    • Check that each line has exactly 2 columns separated by a tab character
    • Ensure neither column is empty or contains only whitespace
    • Verify UTF-8 encoding
    • Remove any header rows if present
  3. File provisioning failures

    • Check system logs for Google Cloud API errors
    • Verify Google Cloud credentials and permissions
    • Ensure the TSV file is accessible in the configured storage bucket
    • Confirm the dataset exists on Google Cloud before adding files
  4. Poor translation quality

    • Ensure training data is high-quality and relevant to your use case
    • Add more diverse sentence pairs to cover different contexts
    • Review that source and target texts are properly aligned
    • Consider the size of your training dataset (more data generally improves quality)

Best Practices

Training Data Quality

  • Use high-quality translations: Ensure your sentence pairs are accurate and natural
  • Maintain consistency: Use consistent terminology and style across all training pairs
  • Cover diverse contexts: Include various sentence structures and contexts for your domain
  • Sufficient quantity: Provide enough training pairs to cover your specific use cases (typically hundreds to thousands of pairs)

File Organization

  • Logical grouping: Organize related sentence pairs in the same file
  • Incremental updates: Add new files rather than replacing existing ones when expanding training data
  • Regular updates: Keep training data current with your evolving terminology and style preferences

Language Pair Strategy

  • One dataset per direction: Create separate datasets for en → fr and fr → en if you need bidirectional translation
  • Specific language variants: Use specific language codes (e.g., en-US, fr-CA) when regional differences matter

Where to Find Adaptive Datasets in Admin

In the Django admin portal, use the left sidebar to open the Translation section. There you will find:

  • Adaptive MT datasets — for managing datasets
  • Adaptive MT files — for managing training files within datasets

External Documentation