GottBERT: a pure German language model
Introduction
GottBERT is a RoBERTa-based language model pretrained on 145GB of German text.
Example usage
fairseq
Load GottBERT from torch.hub (PyTorch >= 1.1):
import torch
gottbert = torch.hub.load('pytorch/fairseq', 'gottbert-base')
gottbert.eval() # disable dropout (or leave in train mode to finetune)
Load GottBERT (for PyTorch 1.0 or custom models):
# Download gottbert model
wget https://dl.gottbert.de/fairseq/models/gottbert-base.tar.gz
tar -xzvf gottbert-base.tar.gz
# Load the model in fairseq
from fairseq.models.roberta import GottbertModel
gottbert = GottbertModel.from_pretrained('/path/to/gottbert')
gottbert.eval() # disable dropout (or leave in train mode to finetune)
Filling masks:
masked_line = 'Gott ist <mask> ! :)'
gottbert.fill_mask(masked_line, topk=3)
# [('Gott ist gut ! :)', 0.3642110526561737, ' gut'),
# ('Gott ist überall ! :)', 0.06009674072265625, ' überall'),
# ('Gott ist großartig ! :)', 0.0370681993663311, ' großartig')]
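Each result above is a tuple of the filled sentence, its probability, and the predicted token (with a leading space). As a minimal sketch of post-processing that output, the helper below keeps only predictions above a probability threshold. The name `best_predictions` and the `min_prob` parameter are hypothetical and not part of fairseq; the sample data is copied from the output above with probabilities rounded, so the snippet runs without loading the model.

```python
from typing import List, Tuple

# One fill_mask result: (filled sentence, probability, predicted token),
# matching the tuples shown in the example output above.
Prediction = Tuple[str, float, str]


def best_predictions(
    results: List[Prediction], min_prob: float = 0.05
) -> List[Tuple[str, float]]:
    """Return (token, probability) pairs at or above min_prob, most likely first.

    Hypothetical helper for illustration; not part of the fairseq API.
    """
    kept = [(token.strip(), prob) for _, prob, token in results if prob >= min_prob]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Sample data copied from the example output above (probabilities rounded).
sample = [
    ('Gott ist gut ! :)', 0.3642, ' gut'),
    ('Gott ist überall ! :)', 0.0601, ' überall'),
    ('Gott ist großartig ! :)', 0.0371, ' großartig'),
]

print(best_predictions(sample))
```

With a loaded model, the same helper could presumably be applied directly to the return value of `gottbert.fill_mask(masked_line, topk=3)`.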
Extract features from GottBERT:
# Extract the last layer's features
line = "Der erste Schluck aus dem Becher der Naturwissenschaft macht atheistisch , aber auf dem Grunde des Bechers wartet Gott !"
tokens = gottbert.encode(line)
last_layer_features = gottbert.extract_features(tokens)
assert last_layer_features.size() == torch.Size([1, 27, 768])
# Extract all layers' features (layer 0 is the embedding layer)
all_layers = gottbert.extract_features(tokens, return_all_hiddens=True)
assert len(all_layers) == 13
assert torch.all(all_layers[-1] == last_layer_features)
Citation
If you use our work, please cite:
@misc{scheible2020gottbert,
title={GottBERT: a pure German Language Model},
author={Raphael Scheible and Fabian Thomczyk and Patric Tippmann and Victor Jaravine and Martin Boeker},
year={2020},
eprint={2012.02110},
archivePrefix={arXiv},
primaryClass={cs.CL}
}