Adaptive MT Datasets
Overview
Adaptive Machine Translation (MT) datasets are specialized training data collections that improve translation quality by teaching Google Cloud Translation API domain-specific patterns and preferences. These datasets contain sentence pairs in source and target languages that help the translation service provide more accurate, context-aware translations for specific use cases.
What are Adaptive Datasets?
An Adaptive MT Dataset is a collection of translation examples stored as TSV files that helps the translation service learn domain-specific translation patterns. Adaptive datasets are particularly useful for:
- Domain adaptation: Training on specific industry or context terminology
- Consistent translations: Ensuring similar phrases are translated consistently
- Quality improvement: Learning from high-quality human translations
- Custom terminology: Teaching the model preferred translations for specific terms
- Context-aware translation: Understanding how phrases should be translated in specific contexts
Creating an Adaptive Dataset
Dataset Configuration
An Adaptive MT Dataset requires:
Dataset Code
A unique identifier that cannot be edited and must have:
- Minimum 1 character
- Only alphanumeric characters and hyphens allowed
- Examples: `english-to-dutch`, `some-name`
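The dataset code rules above can be sketched as a small validator (a hypothetical helper for illustration, not part of the actual admin code):

```python
import re

# Dataset codes: at least 1 character, only alphanumeric
# characters and hyphens (per the rules above).
DATASET_CODE_RE = re.compile(r"^[A-Za-z0-9-]+$")

def is_valid_dataset_code(code: str) -> bool:
    """Return True if `code` is a usable adaptive dataset code."""
    return bool(DATASET_CODE_RE.fullmatch(code))
```

For example, `is_valid_dataset_code("english-to-dutch")` passes, while a code containing spaces or underscores does not.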
Language Pair
Each dataset supports exactly one language pair:
- Source Language Code: The language code of the original text (e.g., `en`, `fr`, `de`)
- Target Language Code: The language code for translations (e.g., `fr`, `de`, `en`)
- Must be valid Google Cloud Translation language codes
Adding Training Files
TSV File Format
Your training data must be uploaded as TSV (Tab-Separated Values) files with the following requirements:
Structure
- Two columns exactly: Source text and target text separated by a tab character
- No header row: Each line contains a sentence pair
- Encoding: UTF-8 (with or without BOM)
- File extension: Must be `.tsv` (case-insensitive)
Example TSV Content
```
Hello world	Bonjour le monde
Good morning	Bonjour
How can I help you?	Comment puis-je vous aider?
Thank you for your patience	Merci pour votre patience
Customer service	Service client
```
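A training file like the example above can be produced with Python's `csv` module, which handles the tab delimiter and UTF-8 encoding reliably (the file name here is illustrative):

```python
import csv

pairs = [
    ("Hello world", "Bonjour le monde"),
    ("Good morning", "Bonjour"),
    ("How can I help you?", "Comment puis-je vous aider?"),
]

# Write source/target pairs as tab-separated UTF-8, no header row.
with open("training.tsv", "w", encoding="utf-8", newline="") as fh:
    writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
    writer.writerows(pairs)
```

Using `csv.writer` instead of manual string joining avoids malformed rows if a segment ever contains quoting characters.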
Language Code Requirements
Both source and target language codes must follow the same format as glossaries:
- Length: 2-10 characters
- Format: Alphanumeric characters, hyphens, and underscores allowed
- Examples: `en`, `fr`, `de`, `en-US`, `zh-CN`
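These format rules translate directly into a regular expression (a hypothetical helper mirroring the constraints above):

```python
import re

# Language codes: 2-10 characters, alphanumeric plus hyphens and
# underscores (the same format the glossaries use).
LANG_CODE_RE = re.compile(r"^[A-Za-z0-9_-]{2,10}$")

def is_valid_language_code(code: str) -> bool:
    """Return True if `code` satisfies the glossary-style format rules."""
    return bool(LANG_CODE_RE.fullmatch(code))
```

Note this checks the format only; whether a code is actually supported is determined by Google Cloud Translation.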
Validation Rules
The system validates your TSV files and will reject them if:
- The file doesn't have a `.tsv` extension
- The file is empty or contains only whitespace
- Any line doesn't have exactly 2 columns separated by a tab
- Either the source text (first column) or target text (second column) is empty or contains only whitespace
- File encoding is not UTF-8
- The file contains invalid characters that can't be decoded
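The validation rules above can be sketched as a single function; this is an illustrative reimplementation, and the real system's checks may differ in detail:

```python
def validate_tsv(filename: str, data: bytes) -> list:
    """Return a list of validation errors for an uploaded TSV file."""
    errors = []
    if not filename.lower().endswith(".tsv"):
        errors.append("file must have a .tsv extension")
        return errors
    try:
        # utf-8-sig accepts UTF-8 both with and without a BOM.
        text = data.decode("utf-8-sig")
    except UnicodeDecodeError:
        errors.append("file is not valid UTF-8")
        return errors
    if not text.strip():
        errors.append("file is empty or contains only whitespace")
        return errors
    for lineno, line in enumerate(text.splitlines(), start=1):
        columns = line.split("\t")
        if len(columns) != 2:
            errors.append(f"line {lineno}: expected exactly 2 tab-separated columns")
        elif not columns[0].strip() or not columns[1].strip():
            errors.append(f"line {lineno}: source and target text must be non-empty")
    return errors
```

An empty returned list means the file would be accepted.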
How Adaptive Datasets Are Used During Translation
Automatic Selection
When a translation request is made, the system automatically:
- Checks for available adaptive datasets that match both the exact source and target language codes
- Selects the first matching dataset from the database if found
- Uses adaptive MT translation instead of standard translation if a dataset is available
- Applies glossaries alongside adaptive datasets if both are available for the language pair
Selection Criteria
An adaptive dataset is selected for use when:
- The dataset's `source_language_code` exactly matches the request's source language
- The dataset's `target_language_code` exactly matches the request's target language
- The translation request includes a source language code (required for adaptive MT usage)
Selection Priority
If multiple adaptive datasets support the same language pair, the system selects the first one found in the database. There is currently no explicit prioritization beyond database ordering.
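The selection behavior described above amounts to a first-match lookup. A minimal sketch, assuming datasets are plain dicts and list order stands in for database order:

```python
def select_adaptive_dataset(datasets, source_lang, target_lang):
    """Return the first dataset matching the exact language pair, else None.

    Adaptive MT requires an explicit source language, so None is
    returned when it is missing (detection mode).
    """
    if not source_lang:
        return None
    for ds in datasets:
        if (ds["source_language_code"] == source_lang
                and ds["target_language_code"] == target_lang):
            return ds
    return None
```

With two `en`→`fr` datasets present, the one that comes first wins, matching the "first one found in the database" rule.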
Combining with Glossaries
When both an adaptive dataset and glossary are available for a language pair:
- Adaptive MT translation is used as the primary translation method
- Glossary is applied on top to ensure specific terms are translated according to the glossary
- This provides the best of both: context-aware translation from the dataset plus consistent terminology from the glossary
When Adaptive Datasets Are NOT Used
Adaptive datasets will not be applied in the following cases:
1. Missing Source Language Code
```json
{
  "messages": ["Hello world"],
  "target_language_code": "fr"
}
```
Reason: The Google Cloud Adaptive MT API requires both source and target languages to be explicitly specified, and no `source_language_code` is provided in the example above.
2. No Matching Dataset
- No adaptive dataset exists with the exact source and target language code combination
- Example: Request for `en` → `ja` translation, but no dataset exists for this specific pair
3. Language Detection Mode
When the system auto-detects the source language, adaptive datasets cannot be used because:
- The source language is unknown at request time
- Dataset selection requires knowing both languages in advance
- The system falls back to standard translation with glossaries (if available)
4. Empty Content
- When the translation request contains no text to translate
- The system returns an empty result without dataset processing
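The four cases above can be summarized as a single decision function; this is a simplified sketch of the logic, not the system's actual code:

```python
def adaptive_mt_applies(request, datasets):
    """Classify a request against the cases above; returns (used, reason)."""
    if not request.get("messages"):
        return False, "empty content"
    source = request.get("source_language_code")
    if not source:
        return False, "missing source language code"
    target = request.get("target_language_code")
    matched = any(
        ds["source_language_code"] == source
        and ds["target_language_code"] == target
        for ds in datasets
    )
    if not matched:
        return False, "no matching dataset"
    return True, "adaptive MT"
```

In the real system the "missing source" and "no matching dataset" cases fall back to standard translation (with glossaries if available), while "empty content" short-circuits to an empty result.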
Dataset and File Management
Creating Datasets
When you create an adaptive dataset:
- Create the dataset in the admin interface with a unique code and language pair
- The system automatically provisions the dataset to Google Cloud
- The dataset is ready to receive training files
Adding Training Files
When you upload TSV files:
- Upload TSV files through the admin interface
- Files are validated for proper format and content
- The system automatically imports the file data to Google Cloud
- Multiple files can be added to the same dataset for incremental training
Updating Files
When you update or replace TSV files:
- Delete the old file and upload a new one, or edit the existing file
- The system automatically deletes the old data from Google Cloud
- New file data is imported to replace the previous version
- Old training data is completely replaced with the new version
Monitoring
- Dataset and file provisioning happens asynchronously
- Check system logs for provisioning status
- Failed provisioning attempts are logged as errors
- File imports may take several minutes to complete on Google Cloud
API Usage
Translation Request Example
POST /api/translate

```json
{
  "messages": ["The customer service was excellent"],
  "source_language_code": "en",
  "target_language_code": "fr"
}
```
If an adaptive dataset exists for en → fr, it will automatically be used to provide domain-aware translation. If a glossary also exists supporting both languages, it will be applied on top of the adaptive translation.
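The request above can be issued from Python's standard library; the host below is a placeholder for your own deployment:

```python
import json
import urllib.request

payload = {
    "messages": ["The customer service was excellent"],
    "source_language_code": "en",
    "target_language_code": "fr",
}

# Build the POST request; "example.com" is a placeholder host.
request = urllib.request.Request(
    "https://example.com/api/translate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(request) would send it; omitted here so the
# snippet stays self-contained.
```

Because `source_language_code` is present, an `en`→`fr` adaptive dataset (if one exists) will be used automatically.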
Response
The API returns the same response format regardless of whether an adaptive dataset was used, but translation quality should improve for content similar to the dataset's training data.
Troubleshooting
Common Issues
- Adaptive dataset not being used
  - Verify the exact source and target language codes match between the dataset and request
  - Ensure `source_language_code` is provided in the request
  - Check that the dataset was successfully provisioned to Google Cloud
  - Review system logs for any provisioning errors
- TSV validation errors
  - Verify the file has a `.tsv` extension
  - Check that each line has exactly 2 columns separated by a tab character
  - Ensure neither column is empty or contains only whitespace
  - Verify UTF-8 encoding
  - Remove any header rows if present
- File provisioning failures
  - Check system logs for Google Cloud API errors
  - Verify Google Cloud credentials and permissions
  - Ensure the TSV file is accessible in the configured storage bucket
  - Confirm the dataset exists on Google Cloud before adding files
- Poor translation quality
  - Ensure training data is high-quality and relevant to your use case
  - Add more diverse sentence pairs to cover different contexts
  - Review that source and target texts are properly aligned
  - Consider the size of your training dataset (more data generally improves quality)
Best Practices
Training Data Quality
- Use high-quality translations: Ensure your sentence pairs are accurate and natural
- Maintain consistency: Use consistent terminology and style across all training pairs
- Cover diverse contexts: Include various sentence structures and contexts for your domain
- Sufficient quantity: Provide enough training pairs to cover your specific use cases (typically hundreds to thousands of pairs)
File Organization
- Logical grouping: Organize related sentence pairs in the same file
- Incremental updates: Add new files rather than replacing existing ones when expanding training data
- Regular updates: Keep training data current with your evolving terminology and style preferences
Language Pair Strategy
- One dataset per direction: Create separate datasets for `en`→`fr` and `fr`→`en` if you need bidirectional translation
- Specific language variants: Use specific language codes (e.g., `en-US`, `fr-CA`) when regional differences matter
Where to Find Adaptive Datasets in Admin
In the Django admin portal, use the left sidebar to open the Translation section. There you will find:
- Adaptive MT datasets — for managing datasets
- Adaptive MT files — for managing training files within datasets
External Documentation
- Google Cloud Adaptive Translation — Google documentation for adaptive translation