
Adaptive MT Datasets

Overview

Adaptive Machine Translation (MT) datasets are specialized training data collections that improve translation quality by teaching Google Cloud Translation API domain-specific patterns and preferences. These datasets contain sentence pairs in source and target languages that help the translation service provide more accurate, context-aware translations for specific use cases.

What are Adaptive Datasets?

An Adaptive MT Dataset is a collection of translation examples stored as TSV files that helps the translation service learn domain-specific translation patterns. Adaptive datasets are particularly useful for:

  • Domain adaptation: Training on specific industry or context terminology
  • Consistent translations: Ensuring similar phrases are translated consistently
  • Quality improvement: Learning from high-quality human translations
  • Custom terminology: Teaching the model preferred translations for specific terms
  • Context-aware translation: Understanding how phrases should be translated in specific contexts

Creating an Adaptive Dataset

Dataset Configuration

An Adaptive MT Dataset requires:

Dataset Code

A unique identifier that cannot be edited and must have:

  • Minimum 1 character
  • Only alphanumeric characters and hyphens allowed
  • Examples: english-to-dutch, some-name
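The dataset-code rules above can be checked with a simple regular expression. This is an illustrative sketch, not the product's actual validator; the function name is invented for the example.

```python
import re

# Dataset codes: at least 1 character, alphanumeric characters and hyphens only.
DATASET_CODE_RE = re.compile(r"^[A-Za-z0-9-]+$")

def is_valid_dataset_code(code: str) -> bool:
    """Illustrative check mirroring the dataset-code rules above."""
    return bool(DATASET_CODE_RE.match(code))

print(is_valid_dataset_code("english-to-dutch"))  # True
print(is_valid_dataset_code("some name"))         # False (space not allowed)
```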

Language Pair

Each dataset supports exactly one language pair:

  • Source Language Code: The language code of the original text (e.g., en, fr, de)
  • Target Language Code: The language code for translations (e.g., fr, de, en)
  • Must be valid Google Cloud Translation language codes

Adding Training Files

TSV File Format

Your training data must be uploaded as TSV (Tab-Separated Values) files with the following requirements:

Structure
  • Two columns exactly: Source text and target text separated by a tab character
  • No header row: Each line contains a sentence pair
  • Encoding: UTF-8 (with or without BOM)
  • File extension: Must be .tsv (case-insensitive)
Example TSV Content

Hello world	Bonjour le monde
Good morning	Bonjour
How can I help you?	Comment puis-je vous aider?
Thank you for your patience	Merci pour votre patience
Customer service	Service client

Language Code Requirements

Both source and target language codes must follow the same format as glossaries:

  • Length: 2-10 characters
  • Format: Alphanumeric characters, hyphens, and underscores allowed
  • Examples: en, fr, de, en-US, zh-CN
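The language-code format above can be expressed the same way; again a sketch of the stated rules, not the service's own check.

```python
import re

# Language codes: 2-10 characters; letters, digits, hyphens, and underscores.
LANG_CODE_RE = re.compile(r"^[A-Za-z0-9_-]{2,10}$")

def is_valid_language_code(code: str) -> bool:
    """Illustrative check mirroring the language-code rules above."""
    return bool(LANG_CODE_RE.match(code))

print(is_valid_language_code("zh-CN"))  # True
print(is_valid_language_code("x"))      # False (shorter than 2 characters)
```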

Validation Rules

The system validates your TSV files and will reject them if:

  • The file doesn't have a .tsv extension
  • The file is empty or contains only whitespace
  • Any line doesn't have exactly 2 columns separated by a tab
  • Either the source text (first column) or target text (second column) is empty or contains only whitespace
  • File encoding is not UTF-8
  • The file contains invalid characters that can't be decoded
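The validation rules above can be sketched as a small checker. This is an illustrative reimplementation, not the service's actual code; the function name and error messages are invented for the example.

```python
def validate_tsv(path: str) -> list[str]:
    """Return a list of error messages; an empty list means the file passed.

    Mirrors the rules above: .tsv extension, UTF-8, non-empty file,
    and exactly two non-blank tab-separated columns on every line.
    """
    if not path.lower().endswith(".tsv"):
        return ["file must have a .tsv extension"]
    try:
        # utf-8-sig tolerates an optional BOM, per the encoding rule above.
        with open(path, encoding="utf-8-sig") as f:
            content = f.read()
    except UnicodeDecodeError:
        return ["file is not valid UTF-8"]
    if not content.strip():
        return ["file is empty or contains only whitespace"]
    errors = []
    for lineno, line in enumerate(content.splitlines(), start=1):
        columns = line.split("\t")
        if len(columns) != 2:
            # Blank lines also fail here: they have one (empty) column.
            errors.append(f"line {lineno}: expected exactly 2 tab-separated columns")
        elif not columns[0].strip() or not columns[1].strip():
            errors.append(f"line {lineno}: source and target text must be non-empty")
    return errors
```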

How Adaptive Datasets Are Used During Translation

Automatic Selection

When a translation request is made, the system automatically:

  1. Checks for available adaptive datasets that match both the exact source and target language codes
  2. Selects the first matching dataset from the database if found
  3. Uses adaptive MT translation instead of standard translation if a dataset is available
  4. Applies glossaries alongside adaptive datasets if both are available for the language pair
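The steps above amount to a first-match lookup keyed on the exact language pair. A minimal sketch, with the dataclass and field names invented for illustration:

```python
from dataclasses import dataclass
from typing import Iterable, Optional

@dataclass
class AdaptiveDataset:
    code: str
    source_language_code: str
    target_language_code: str

def select_dataset(
    datasets: Iterable[AdaptiveDataset],
    source_lang: Optional[str],
    target_lang: str,
) -> Optional[AdaptiveDataset]:
    """Return the first dataset matching the exact language pair, if any."""
    if not source_lang:
        # Adaptive MT requires an explicitly specified source language.
        return None
    for ds in datasets:  # "first found" order; no further prioritization
        if (ds.source_language_code == source_lang
                and ds.target_language_code == target_lang):
            return ds
    return None
```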

Selection Criteria

An adaptive dataset is selected for use when:

  • The dataset's source_language_code exactly matches the request's source language
  • The dataset's target_language_code exactly matches the request's target language
  • The translation request includes a source language code (required for adaptive MT usage)

Selection Priority

If multiple adaptive datasets support the same language pair, the system selects the first one found in the database. There is currently no explicit prioritization beyond database ordering.

Combining with Glossaries

When both an adaptive dataset and glossary are available for a language pair:

  • Adaptive MT translation is used as the primary translation method
  • Glossary is applied on top to ensure specific terms are translated according to the glossary
  • This provides the best of both: context-aware translation from the dataset plus consistent terminology from the glossary
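Conceptually, the combination can be sketched as a small planning step; the actual calls go through the Google Cloud client, and the names below are invented for illustration.

```python
def build_translation_plan(dataset, glossary) -> dict:
    """Decide which features a request will use (illustrative only)."""
    plan = {"method": "standard"}
    if dataset is not None:
        plan["method"] = "adaptive_mt"  # dataset takes over as primary method
        plan["dataset"] = dataset
    if glossary is not None:
        plan["glossary"] = glossary  # applied on top of whichever method runs
    return plan
```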

When Adaptive Datasets Are NOT Used

Adaptive datasets will not be applied in the following cases:

1. Missing Source Language Code

{
  "messages": ["Hello world"],
  "target_language_code": "fr"
}

Reason: Google Cloud Adaptive MT API requires both source and target languages to be explicitly specified. No source_language_code is provided in the example above.

2. No Matching Dataset

  • No adaptive dataset exists with the exact source and target language code combination
  • Example: Request for en → ja translation, but no dataset exists for this specific pair

3. Language Detection Mode

When the system auto-detects the source language, adaptive datasets cannot be used because:

  • The source language is unknown at request time
  • Dataset selection requires knowing both languages in advance
  • The system falls back to standard translation with glossaries (if available)

4. Empty Content

  • When the translation request contains no text to translate
  • The system returns an empty result without dataset processing
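The four cases above reduce to a short decision function; the names here are illustrative, not the system's actual code.

```python
from typing import Callable, List, Optional

def choose_translation_path(
    messages: List[str],
    source_lang: Optional[str],
    target_lang: str,
    find_dataset: Callable[[str, str], object],
) -> str:
    """Return which path a request takes, per the cases above (sketch)."""
    if not messages or all(not m.strip() for m in messages):
        return "empty_result"   # case 4: nothing to translate
    if not source_lang:
        return "standard"       # cases 1 and 3: source language unknown
    if find_dataset(source_lang, target_lang) is None:
        return "standard"       # case 2: no matching dataset
    return "adaptive_mt"
```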

Dataset and File Management

Creating Datasets

When you create an adaptive dataset:

  1. Create the dataset in the admin interface with a unique code and language pair
  2. The system automatically provisions the dataset to Google Cloud
  3. The dataset is ready to receive training files

Adding Training Files

When you upload TSV files:

  1. Upload TSV files through the admin interface
  2. Files are validated for proper format and content
  3. The system automatically imports the file data to Google Cloud
  4. Multiple files can be added to the same dataset for incremental training

Updating Files

When you update or replace TSV files:

  1. Delete the old file and upload a new one, or edit the existing file
  2. The system automatically deletes the old data from Google Cloud
  3. New file data is imported to replace the previous version
  4. Old training data is completely replaced with the new version

Monitoring

  • Dataset and file provisioning happens asynchronously
  • Check system logs for provisioning status
  • Failed provisioning attempts are logged as errors
  • File imports may take several minutes to complete on Google Cloud

API Usage

Translation Request Example

POST /api/translate
{
  "messages": ["The customer service was excellent"],
  "source_language_code": "en",
  "target_language_code": "fr"
}

If an adaptive dataset exists for en → fr, it will automatically be used to provide domain-aware translation. If a glossary also exists supporting both languages, it will be applied on top of the adaptive translation.
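A request like the one above can be sent with any HTTP client. This sketch only builds and prints the request body; the endpoint path comes from the example, and the deployment host is up to you.

```python
import json

# Body for POST /api/translate on your deployment.
payload = {
    "messages": ["The customer service was excellent"],
    "source_language_code": "en",  # required for adaptive MT to be considered
    "target_language_code": "fr",
}
print(json.dumps(payload, indent=2))
```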

Response

The API returns the same response format whether or not an adaptive dataset was used, but translation quality should improve for content that resembles the dataset's training data.

Troubleshooting

Common Issues

  1. Adaptive dataset not being used

    • Verify the exact source and target language codes match between the dataset and request
    • Ensure source_language_code is provided in the request
    • Check that the dataset was successfully provisioned to Google Cloud
    • Review system logs for any provisioning errors
  2. TSV validation errors

    • Verify the file has a .tsv extension
    • Check that each line has exactly 2 columns separated by a tab character
    • Ensure neither column is empty or contains only whitespace
    • Verify UTF-8 encoding
    • Remove any header rows if present
  3. File provisioning failures

    • Check system logs for Google Cloud API errors
    • Verify Google Cloud credentials and permissions
    • Ensure the TSV file is accessible in the configured storage bucket
    • Confirm the dataset exists on Google Cloud before adding files
  4. Poor translation quality

    • Ensure training data is high-quality and relevant to your use case
    • Add more diverse sentence pairs to cover different contexts
    • Review that source and target texts are properly aligned
    • Consider the size of your training dataset (more data generally improves quality)

Best Practices

Training Data Quality

  • Use high-quality translations: Ensure your sentence pairs are accurate and natural
  • Maintain consistency: Use consistent terminology and style across all training pairs
  • Cover diverse contexts: Include various sentence structures and contexts for your domain
  • Sufficient quantity: Provide enough training pairs to cover your specific use cases (typically hundreds to thousands of pairs)

File Organization

  • Logical grouping: Organize related sentence pairs in the same file
  • Incremental updates: Add new files rather than replacing existing ones when expanding training data
  • Regular updates: Keep training data current with your evolving terminology and style preferences

Language Pair Strategy

  • One dataset per direction: Create separate datasets for en → fr and fr → en if you need bidirectional translation
  • Specific language variants: Use specific language codes (e.g., en-US, fr-CA) when regional differences matter

Where to Find Adaptive Datasets in Admin

In the Django admin portal, use the left sidebar to open the Translation section. There you will find:

  • Adaptive MT datasets — for managing datasets
  • Adaptive MT files — for managing training files within datasets

External Documentation