Commit Graph

60 Commits

Author SHA1 Message Date
Changhan Wang
0ac3f3270c add TTS
Summary: [fairseq-py] add TTS

Reviewed By: wnhsu

Differential Revision: D30720666

fbshipit-source-id: b5288acec72bea1d3a9f3884a4ed51b616c7a403
2021-09-13 18:13:45 -07:00
Pierre Andrews
68a81202a3 Indexed Huffman Coded dataset (#2029)
Summary:
## What does this PR do?

Currently, binarized datasets are stored as a binary representation of int tensors. At best, each int is coded as a uint16 on disk.

When coding a fixed-size-vocabulary dataset where we know the frequency of each symbol, and where some symbols are more common than others, we can do better. This happens in particular when binarizing a dataset split into subword units, as the most common "tokenizers" like BPE and SPM will choose subwords with high frequencies over subwords with low frequencies.

In practice, if we know the frequency of all symbols (or a good estimate), we can use entropy encoding methods to compress the data. The idea is to assign a compressed representation where frequent symbols have shorter representations than infrequent symbols.

In this PR, we build a Huffman code from a frequency table and use this code to encode a dataset. The PR provides the Huffman coder implementation (using the single-queue approach, as we usually start with a sorted set of symbols) as well as a memory-mapped dataset implementation that stores the data compressed with a Huffman code and can return indexed tensors from it.

Over a whole dataset, depending on how many symbols we sample to evaluate the frequency, we can save between 25% and 30% of storage space.
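
For illustration, here is a minimal, self-contained sketch of building a prefix code from a frequency table (a heap-based toy, not the single-queue coder added in this PR):
```python
import heapq
from collections import Counter

def build_huffman_code(frequencies):
    """Toy Huffman construction: repeatedly merge the two least frequent subtrees."""
    heap = [(count, i, {sym: ""}) for i, (sym, count) in enumerate(frequencies.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, left = heapq.heappop(heap)
        c2, _, right = heapq.heappop(heap)
        # Prepend one bit to every code in each merged subtree.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2]

print(build_huffman_code(Counter("aaaabbbccd")))
# e.g. {'a': '0', 'b': '10', 'd': '110', 'c': '111'}: frequent symbols get shorter bit strings.
```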

## Follow Ups

Currently the binarizer/preprocess script makes too many assumptions about the dataset writers, so the Huffman dataset writer cannot be used straight out of the box with it. I will open follow-up PRs to provide easy-to-use scripts to build such datasets. But it's as simple as doing:
```
# Assumed import path for the classes introduced in this PR:
from fairseq.data.huffman import HuffmanCodeBuilder, HuffmanMMapIndexedDatasetBuilder

# Count symbol frequencies over a sample file (one space-separated sentence per line).
code_builder = HuffmanCodeBuilder()
with open(sample_file, 'r', encoding="utf-8") as input:
    for line in input:
        code_builder.add(*line.strip().split(" "))

# Build the Huffman code from the collected frequency table.
coder = code_builder.build_code()

# Encode the full dataset and write it as an indexed, memory-mappable file.
with HuffmanMMapIndexedDatasetBuilder('/tmp/testing_huffman', coder) as builder:
    with open(dataset_file, 'r', encoding="utf-8") as input:
        for line in input:
            builder.add_item(line.strip().split(' '))
```

A lot of the `HuffmanMMapIndexedDataset` code comes from the normal `MMapIndexedDataset`, and we could probably extract the commonalities into a base class.

The `HuffmanCoder` is also really a special kind of `Dictionary`; again, a common base class could be abstracted out of them.

Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/2029

Reviewed By: dianaml0

Differential Revision: D29557468

Pulled By: Mortimerp9

fbshipit-source-id: a01b6d98f38f937934cadebb3786133e257adefe
2021-08-31 01:12:35 -07:00
Omry Yadan
53802e7812 Compatibility fix with Hydra 1.1 (#3722)
Summary:
One of the changes in Hydra 1.1 is that the default composition order is changing.
This is documented [here](https://hydra.cc/docs/upgrades/1.0_to_1.1/default_composition_order).
In Hydra 1.1, a config overrides values introduced by its defaults list, while in Hydra 1.0 the defaults list overrode the values in the config.

fairseq currently depends on the previous behavior:
The class `FairseqConfig` defines config values and expects them to be overridden by the defaults list.
This results in a different config being created when running `fairseq_cli/hydra_train.py` with Hydra 1.0 versus 1.1.

Hydra 1.1 introduced the `_self_` keyword in the defaults list to control the composition order. In order to achieve the behavior of Hydra 1.0, `_self_` should be added as the first item in the defaults list.

To allow for a smoother migration, Hydra 1.0 ignores `_self_` starting from 1.0.7 (earlier versions will issue an error).

This diff adds `_self_` as the first item in the defaults list of the fairseq config, and introduces a dependency on Hydra 1.0.7 or newer.
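
As a rough illustration of what this composition-order change means (using plain OmegaConf merges with hypothetical keys, rather than Hydra's group/defaults machinery):
```python
from omegaconf import OmegaConf

primary = OmegaConf.create({"optimizer": {"lr": 0.5}})        # value defined in the config itself
from_defaults = OmegaConf.create({"optimizer": {"lr": 0.1}})  # value introduced by the defaults list

# Hydra 1.0 order (and Hydra 1.1 with `_self_` first): the defaults list overrides the config.
print(OmegaConf.merge(primary, from_defaults).optimizer.lr)   # 0.1
# Hydra 1.1 default order: the config overrides the defaults list.
print(OmegaConf.merge(from_defaults, primary).optimizer.lr)   # 0.5
```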

### Testing:
I ensured that the following yield the same composed config:
Default config with Hydra 1.0.6, 1.0.7 and 1.1.0

`examples/wav2vec/config/finetuning/base_10h.yaml` with Hydra 1.0.6, 1.0.7 and 1.1.0.

This can be achieved by outputting the generated config using `--cfg job` and comparing the outputs.

Pull Request resolved: https://github.com/pytorch/fairseq/pull/3722

Reviewed By: dianaml0

Differential Revision: D29917677

Pulled By: jieru-hu

fbshipit-source-id: 7e645b83cccb03fc80a6702e302c4643d2b14a78
2021-07-26 16:36:43 -07:00
Michael Lewis
7dafb05754 BASE layers (#1654)
Summary:
# Before submitting

- [ ] Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
- [ ] Did you read the [contributor guideline](https://github.com/pytorch/fairseq/blob/master/CONTRIBUTING.md)?
- [ ] Did you make sure to update the docs?
- [ ] Did you write any new necessary tests?

## What does this PR do?
Fixes # (issue).

## PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

## Did you have fun?
Make sure you had fun coding 🙃

Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1654

Reviewed By: myleott

Differential Revision: D27128074

Pulled By: shruti-bh

fbshipit-source-id: ac86d383cd53c9c9bdd946fea839a37b719d95e3
2021-03-29 18:02:50 -07:00
Frankie Robertson
61e46bb997 Fix attempt to unlink directory copied into source package (Python 3.9) (#3235)
Summary:
# Before submitting

- [ ] Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
- [x] Did you read the [contributor guideline](https://github.com/pytorch/fairseq/blob/master/CONTRIBUTING.md)?
- [N/A] Did you make sure to update the docs?
- [N/A] Did you write any new necessary tests?

## What does this PR do?

Currently when installing the newest source package from PyPI I get an error like so:

```
Collecting fairseq
  Using cached fairseq-0.10.2.tar.gz (938 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... error
  ERROR: Command errored out with exit status 1:
   command: /home/frankier/sources/datasets/.venv/bin/python3 /tmp/tmp_ujftsgi_in_process.py get_requires_for_build_wheel /tmp/tmpmn0eumq2
       cwd: /tmp/pip-install-dg5d6q9y/fairseq
  Complete output (31 lines):
  Traceback (most recent call last):
    File "setup.py", line 214, in <module>
      do_setup(package_data)
    File "setup.py", line 136, in do_setup
      setup(
    File "/tmp/pip-build-env-hag0sxvp/overlay/lib/python3.9/site-packages/setuptools/__init__.py", line 152, in setup
      _install_setup_requires(attrs)
    File "/tmp/pip-build-env-hag0sxvp/overlay/lib/python3.9/site-packages/setuptools/__init__.py", line 147, in _install_setup_requires
      dist.fetch_build_eggs(dist.setup_requires)
    File "/tmp/pip-build-env-hag0sxvp/overlay/lib/python3.9/site-packages/setuptools/build_meta.py", line 60, in fetch_build_eggs
      raise SetupRequirementsError(specifier_list)
  setuptools.build_meta.SetupRequirementsError: ['cython', 'numpy', 'setuptools>=18.0']

  During handling of the above exception, another exception occurred:

  Traceback (most recent call last):
    File "/tmp/tmp_ujftsgi_in_process.py", line 280, in <module>
      main()
    File "/tmp/tmp_ujftsgi_in_process.py", line 263, in main
      json_out['return_val'] = hook(**hook_input['kwargs'])
    File "/tmp/tmp_ujftsgi_in_process.py", line 114, in get_requires_for_build_wheel
      return hook(config_settings)
    File "/tmp/pip-build-env-hag0sxvp/overlay/lib/python3.9/site-packages/setuptools/build_meta.py", line 149, in get_requires_for_build_wheel
      return self._get_build_requires(
    File "/tmp/pip-build-env-hag0sxvp/overlay/lib/python3.9/site-packages/setuptools/build_meta.py", line 130, in _get_build_requires
      self.run_setup()
    File "/tmp/pip-build-env-hag0sxvp/overlay/lib/python3.9/site-packages/setuptools/build_meta.py", line 145, in run_setup
      exec(compile(code, __file__, 'exec'), locals())
    File "setup.py", line 217, in <module>
      os.unlink(fairseq_examples)
  IsADirectoryError: [Errno 21] Is a directory: 'fairseq/examples'
  ----------------------------------------
ERROR: Command errored out with exit status 1: /home/frankier/sources/datasets/.venv/bin/python3 /tmp/tmp_ujftsgi_in_process.py get_requires_for_build_wheel /tmp/tmpmn0eumq2 Check the logs for full command output.
```

I believe the reason for this is that the source package contains a real examples directory, which was put there during package creation (it seems the symlink became a directory). Now, when setup.py is run again, it attempts to unlink that directory, which fails because a directory cannot be unlinked. This PR therefore only attempts to unlink it if it is a symlink. I have not thoroughly tested whether my proposed cause is the true cause, but this should fix it in any case.
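
A minimal sketch of the guard described above (mirroring the `fairseq_examples` name from the traceback; not necessarily the exact diff):
```python
import os

fairseq_examples = os.path.join("fairseq", "examples")
# In a git checkout this path is a symlink created by setup.py; in the sdist it is a
# real directory copied in at package-creation time, which os.unlink() cannot remove.
if os.path.islink(fairseq_examples):
    os.unlink(fairseq_examples)
```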

Note that the source package is only fetched because there is no wheel for Python 3.9; most users will not see this error because they will install from a wheel.

## PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

## Did you have fun?
Make sure you had fun coding 🙃

Pull Request resolved: https://github.com/pytorch/fairseq/pull/3235

Reviewed By: alexeib

Differential Revision: D26513259

Pulled By: myleott

fbshipit-source-id: 775d6c636a5867b9983bb6419829f13ee414e2fd
2021-02-20 06:23:45 -08:00
Myle Ott
cfbf0dddbc Small changes to make tests more reliable (#1572)
Summary:
After this, `python setup.py test` should be more reliable (including when multiple GPUs are present)

Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1572

Reviewed By: alexeib

Differential Revision: D25984113

Pulled By: myleott

fbshipit-source-id: 7fef27ae90c079c07f592ed9fb350ccf8b56d23d
2021-01-21 07:33:54 -08:00
Sam Shleifer
bff7f85206 fastseq ngram blocking (#1509)
Summary:
Command:
```bash
fairseq-generate \
    ~myleott/data/data-bin/wmt16_en_de_bpe32k/ \
    --path /checkpoint/myleott/s3/models/wmt16.en-de.joined-dict.transformer/model.pt \
    --beam 4 --remove-bpe --lenpen 0.6 --batch-size 256 --no-repeat-ngram-size 3 \
    --gen-subset test --fp16
```

master/devfair: 297.8s (10.08 sentences/s, 286.47 tokens/s)
branch/devfair: 31.9s (94.27 sentences/s, 2678.66 tokens/s)

master/v100: 227.4s (13.21 sentences/s, 375.24 tokens/s)
branch/v100: 13.1s (228.68 sentences/s, 6497.99 tokens/s)
(all BLEU4=29.17)
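
For context, a minimal pure-Python sketch of the `--no-repeat-ngram-size` constraint being accelerated here (a reference implementation for a single hypothesis, not the fastseq/fairseq code):
```python
import torch

def ban_repeated_ngrams(tokens: torch.LongTensor, lprobs: torch.Tensor, n: int) -> torch.Tensor:
    """Set to -inf the log-prob of any token that would complete an n-gram already in `tokens`."""
    seq = tokens.tolist()
    if n <= 0 or len(seq) < n - 1:
        return lprobs
    prefix = tuple(seq[len(seq) - (n - 1):]) if n > 1 else tuple()
    banned = set()
    # Any token that followed this (n-1)-gram earlier in the hypothesis is banned.
    for i in range(len(seq) - n + 1):
        if tuple(seq[i:i + n - 1]) == prefix:
            banned.add(seq[i + n - 1])
    for tok in banned:
        lprobs[tok] = float("-inf")
    return lprobs
```
In beam search this runs once per hypothesis per decoding step, which is why a batched implementation pays off at large batch sizes.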

### ToDo:
- tests

### Future Work
- test other improvements proposed by fastseq.

Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1509

Reviewed By: myleott

Differential Revision: D25587857

Pulled By: sshleifer

fbshipit-source-id: d42af5c50e3f94c90e878f92da5ce5ef3fc8b988
2020-12-30 12:58:09 -08:00
Myle Ott
8d7ee5bf81 Fix hydra with Python 3.8 (#1511)
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1511

Test Plan: Imported from OSS

Reviewed By: alexeib

Differential Revision: D25570468

Pulled By: myleott

fbshipit-source-id: 98efc6983479e163e6cf0a7ef33decaa1bc429f1
2020-12-15 17:47:39 -08:00
Myle Ott
bc4ebcafb4 Fix tests (#1482)
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1482

Reviewed By: michaelauli

Differential Revision: D25318618

Pulled By: myleott

fbshipit-source-id: bed171ffe5ca10e8359be96a15d0fe9bb1a630ea
2020-12-03 18:18:11 -08:00
Myle Ott
3b77a61600 Add fairseq-hydra-train and update docs (#1449)
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1449

Test Plan: Imported from OSS

Reviewed By: alexeib

Differential Revision: D25094525

Pulled By: myleott

fbshipit-source-id: 430387d11196d3292933bb168cf09ea16ebc0d3b
2020-11-20 06:00:59 -08:00
Myle Ott
41a61bd4e2 Add GitHub Action to build Python wheels (+ minor cleanup in build scripts) (#1447)
Summary:
Here's an example run in a forked repo: https://github.com/fairseq/fairseq/runs/1419699104

We can upload the wheels to PyPI to make `pip install fairseq` easier for folks.

Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1447

Reviewed By: lematt1991

Differential Revision: D25060753

Pulled By: myleott

fbshipit-source-id: 9fdc28cc7c8a172daac668dd09684ec43e2ff11a
2020-11-18 14:31:23 -08:00
alexeib
09a5d864fc move configs into fairseq dir (#1403)
Summary:
this way they get shipped together with the fairseq package

Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1403

Reviewed By: myleott

Differential Revision: D24803076

Pulled By: alexeib

fbshipit-source-id: a9aa6e47a8ef26fae4d54691f1616a721b8f6112
2020-11-06 23:27:32 -08:00
Myle Ott
e0737c3c29 Dynamically generate versions based on commit hash (#2774)
Summary:
This will produce version strings like `1.0.0a0+3065963`, similar to PyTorch version strings.
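
A rough sketch of how such a version string can be derived at build time (not necessarily the exact setup.py logic):
```python
import subprocess

def get_version(base="1.0.0a0"):
    """Append the short git commit hash to the base version, PyTorch-style (e.g. 1.0.0a0+3065963)."""
    try:
        sha = subprocess.check_output(["git", "rev-parse", "--short", "HEAD"]).decode("ascii").strip()
        return f"{base}+{sha}"
    except Exception:
        # Not a git checkout (e.g. building from an sdist): fall back to the base version.
        return base
```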

Pull Request resolved: https://github.com/pytorch/fairseq/pull/2774

Reviewed By: alexeib

Differential Revision: D24453517

Pulled By: myleott

fbshipit-source-id: 03a0c324ed6124bbc513ba7edc954abd71d63a0f
2020-10-22 12:51:04 -07:00
Myle Ott
9b0611e678 Fix torch.hub (fixes #2756) (#2762)
Summary:
Typically `torch.hub.load(...)` doesn't call `pip install`, so our Cython components never get built. We have a hack in our hubconf that builds these components by running the equivalent of `python setup.py build_ext --inplace` using the setuptools sandbox: f6677b6755/hubconf.py (L52-L55).

Unfortunately, this sandbox gets mad if you modify the filesystem, which is what this recent change does: f6677b6755/setup.py (L203-L205). Combined, these two break torch.hub.

The solution is to not set up the symlinks when we're doing `build_ext`. This is fine since `build_ext` doesn't actually build a package, so we don't care about including the config or examples.
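
A minimal sketch of that guard (hypothetical paths, following the description above):
```python
import os
import sys

# Only create the packaging symlinks when actually building a package; skip them for
# `python setup.py build_ext --inplace`, which is what the torch.hub hack runs.
if "build_ext" not in sys.argv:
    if not os.path.exists("fairseq/examples"):
        os.symlink("../examples", "fairseq/examples")
```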

Pull Request resolved: https://github.com/pytorch/fairseq/pull/2762

Reviewed By: alexeib

Differential Revision: D24430228

Pulled By: myleott

fbshipit-source-id: e05d075a003ddfde196cb8a86b32882d73808015
2020-10-20 15:46:55 -07:00
Myle Ott
9b8b464070 Package config and examples with fairseq (#1356)
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1356

Reviewed By: alexeib

Differential Revision: D24385688

Pulled By: myleott

fbshipit-source-id: 72c4a702d93d2854a6409d42913d7413207cb61e
2020-10-19 09:24:04 -07:00
Myle Ott
a48f235636 Apply black+isort (#1357)
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1357

Reviewed By: alexeib

Differential Revision: D24377772

fbshipit-source-id: 51581af041d42d62166b33a35a1a4228b1a76f0c
2020-10-18 18:14:51 -07:00
Changhan Wang
1d1c145387 speech-to-text OSS
Summary:
Imported from https://github.com/fairinternal/fairseq-py/pull/1284. Updated according to PR comments.

Main changes:
* New task: `fairseq.tasks.speech_to_text`
  * Multilingual support: multiple train sub-splits, temperature-based sampling (see the sketch after this list), language ID tokens
* New dataset: `fairseq.data.audio.speech_to_text_dataset`
* Added accuracy metrics and BOS prefix removal to label smoothed cross entropy
* New models: Transformer (`fairseq.models.speech_to_text.s2t_transformer`) and BLSTM (`fairseq.models.speech_to_text.berard`)
* Extended scorers:
  * Added a base scorer class: `fairseq.scorers.BaseScorer` (the parent class for all scorers except the BLEU scorer in CPP)
  * Added an evaluation tokenizer: `fairseq.scorers.eval_tokenizer` which leverages sacreBLEU's built-in tokenizers and allows character-level tokenization as well as punctuation removal (for WER scoring).
  * Added chrF scorer: `fairseq.scorers.chrf`
* Online Mel-filter bank speech feature extraction (via CPP-based pyKaldi or Python-based TorchAudio): `fairseq.data.audio.audio_utils`
* Online speech feature transforms: `fairseq.data.audio.feature_transforms.*`
* Fixed the subsampled sequence lengths in VGGTransformer (`examples.speech_recognition.models.vggtransformer`)
* Examples under `examples/speech_to_text`:
  * LibriSpeech (ASR): better results than VGGTransformer with smaller Transformer-based models
  * MuST-C (ST): comparable to [SOTA results](https://arxiv.org/pdf/2004.10234.pdf) but with fewer tricks
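
As a small illustration of the temperature-based sampling mentioned above (a generic sketch with hypothetical numbers, not the exact fairseq code): each split's sampling probability is its share of the data raised to 1/T and renormalized, so T > 1 up-weights low-resource languages.
```python
def temperature_sampling_probs(sizes, temperature=1.5):
    """Dataset sizes -> sampling probabilities p_i proportional to (n_i / N) ** (1 / T)."""
    total = sum(sizes)
    weights = [(n / total) ** (1.0 / temperature) for n in sizes]
    z = sum(weights)
    return [w / z for w in weights]

print(temperature_sampling_probs([900, 90, 10]))  # roughly [0.79, 0.17, 0.04] vs. [0.90, 0.09, 0.01] at T=1
```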

Reviewed By: jmp84

Differential Revision: D24065273

fbshipit-source-id: 5f842ca9c826f92d4af660705611885fe440a9ab
2020-10-14 12:30:05 -07:00
Myle Ott
f902a363ab Small fixes (#1325)
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1325

Reviewed By: ngoyal2707

Differential Revision: D24024198

Pulled By: myleott

fbshipit-source-id: c3b776970d625eff21a26bf7c86cd28ef9e9d2ef
2020-10-02 10:51:09 -07:00
Mu Tian
e7f76c4481 hydra-fairseq - add dataclass
Summary: hydra fairseq - add main common dataclasses as structured config

Reviewed By: alexeib

Differential Revision: D23375458

fbshipit-source-id: 4cb2802e523990d4e2b1a87e3cf1bc4dc852bc5b
2020-09-04 17:08:30 -07:00
alexeib
621e834103 wav2vec 2.0 (#1220)
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/1220

Test Plan: Please see examples/wav2vec/README.md for instructions

Reviewed By: edunov

Differential Revision: D22707565

Pulled By: alexeib

fbshipit-source-id: 0c0d4ca7acc933ef7c0062f8dce550b94e414680
2020-08-04 14:19:56 -07:00
Myle Ott
e198482e71 Fix binaries in root dir (#995)
Summary:
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/995

The symlinks approach didn't work with `python train.py`.

Differential Revision: D19451900

fbshipit-source-id: 2988eb48077cf8e0e078b9fca527a675132187db
2020-01-17 13:09:09 -08:00
Elijah Rippeth
cec0da2927 add other platforms to CI. (#1595)
Summary:
# Before submitting

- [ ] Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
- [x] Did you read the [contributor guideline](https://github.com/pytorch/fairseq/blob/master/CONTRIBUTING.md)?
- [x] Did you make sure to update the docs?
- [x] Did you write any new necessary tests?

## What does this PR do?

Runs CI for `fairseq` on all major platforms provided by GitHub actions.

## PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/1595

Differential Revision: D19438282

Pulled By: myleott

fbshipit-source-id: a64db46d7785e6f583848f27699f6463c4dc3170
2020-01-17 00:15:10 -08:00
Jiatao Gu
a316bd99b7 CUDA implementation of Levenshtein distance for NAT training (#960)
Summary:
## What does this PR do?
CUDA implementation of Levenshtein distance for NAT training and other potential applications.
It will make training the Levenshtein Transformer slightly faster and clean up the functions.
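
For reference, the quantity being accelerated is the standard edit-distance dynamic program; a minimal CPU sketch is below (the PR provides a CUDA kernel, and NAT training also needs the edit operations, not just the distance):
```python
def levenshtein_distance(a, b):
    """Classic O(len(a) * len(b)) DP over insert/delete/substitute operations."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute (free if tokens match)
            ))
        prev = curr
    return prev[-1]

assert levenshtein_distance("kitten", "sitting") == 3
```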
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/960

Test Plan: Imported from GitHub. Tested locally.

Reviewed By: cndn

Differential Revision: D19207096

Pulled By: MultiPath

fbshipit-source-id: 4890bbaa851ffd302648c0d949173158dc3167e2
2019-12-21 02:45:15 -08:00
Myle Ott
05514f8a82 Update README to indicate we only support Python >= 3.6 (fixes #1317)
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/952

Differential Revision: D19133348

Pulled By: myleott

fbshipit-source-id: 51f96ddb13386143fe0088f19f7cb0674755811f
2019-12-16 19:46:53 -08:00
Myle Ott
df2f84ce61 v0.8.0 -> v0.9.0 (#1452)
Summary:
Possibly breaking changes:
- Set global numpy seed (4a7cd58)
- Split `in_proj_weight` into separate k, v, q projections in MultiheadAttention (fdf4c3e)
- TransformerEncoder returns namedtuples instead of dict (27568a7)

New features:
- Add `--fast-stat-sync` option (e1ba32a)
- Add `--empty-cache-freq` option (315c463)
- Support criterions with parameters (ba5f829)

New papers:
- Simple and Effective Noisy Channel Modeling for Neural Machine Translation (49177c9)
- Levenshtein Transformer (86857a5, ...)
- Cross+Self-Attention for Transformer Models (4ac2c5f)
- Jointly Learning to Align and Translate with Transformer Models (1c66792)
- Reducing Transformer Depth on Demand with Structured Dropout (dabbef4)
- Unsupervised Cross-lingual Representation Learning at Scale (XLM-RoBERTa) (e23e5ea)
- BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension (a92bcda)
- CamemBERT: a French BERT (b31849a)

Speed improvements:
- Add CUDA kernels for LightConv and DynamicConv (f840564)
- Cythonization of various dataloading components (4fc3953, ...)
- Don't project mask tokens for MLM training (718677e)
Pull Request resolved: https://github.com/pytorch/fairseq/pull/1452

Differential Revision: D18798409

Pulled By: myleott

fbshipit-source-id: 860a0d5aaf7377c8c9bd63cdb3b33d464f0e1727
2019-12-03 15:19:33 -08:00
Myle Ott
cb6c67bcdb Make torch.hub interface automatically apply tokenization and BPE
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/926

Differential Revision: D18685772

Pulled By: myleott

fbshipit-source-id: 0f99d79ed6ee72e9d3ced786d75ab9504d0dfcf0
2019-11-26 07:49:37 -08:00
Myle Ott
4d21c157ad Have setup.py clean remove compiled Cython files
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/907

Differential Revision: D18480215

Pulled By: myleott

fbshipit-source-id: b02002f631f6d47380f309d4f464bd135d623280
2019-11-13 10:51:22 -08:00
Myle Ott
a0f75996b1 Fix building of docs
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/1340

Differential Revision: D18289455

Pulled By: myleott

fbshipit-source-id: a1c8163a35273b6c646d300142701e8a317d7378
2019-11-02 16:52:50 -07:00
Changhan Wang
86857a58bf Levenshtein Transformer paper code
Summary:
Code for our NeurIPS paper [Levenshtein Transformer](https://arxiv.org/abs/1905.11006)
* Added Levenshtein Transformer model, task and criterion class
* Added iterative NAT Transformer, insertion Transformer and CMLM Transformer model class for baselines
* Add an option for prepending BOS to dictionary class and translation task class

Reviewed By: myleott

Differential Revision: D17297372

fbshipit-source-id: 54eca60831ae95dc721c2c34e882e1810ee575c7
2019-09-27 13:58:45 -07:00
Naman Goyal
1f0f7cd82c added cython to install_requires
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/856

Reviewed By: myleott

Differential Revision: D17162411

Pulled By: myleott

fbshipit-source-id: e70ecc802398bbba2b5326e9700f2121c422fd18
2019-09-03 09:08:38 -07:00
Myle Ott
8d4588b1ba Cleaner handling of numpy-based extensions in setup.py
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/853

Differential Revision: D17147879

Pulled By: myleott

fbshipit-source-id: b1f5e838533de62ade52fa82112ea5308734c70f
2019-08-31 16:53:34 -07:00
Myle Ott
746e59a262 Improve support for python setup.py build_ext --inplace
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/852

Differential Revision: D17147452

Pulled By: myleott

fbshipit-source-id: 5fd9c7da3cc019c7beec98d41db1aef1329ee57a
2019-08-31 13:44:22 -07:00
Myle Ott
d2410c4207 Minor cleanup for setup.py
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/1078

Differential Revision: D17072514

Pulled By: myleott

fbshipit-source-id: 69a8c8c9cc7caa7e04c414329a5d79e6e1a6621c
2019-08-27 10:07:40 -07:00
Naman Goyal
396ff7f59f installing numpy headers for cython
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/848

Differential Revision: D17060283

fbshipit-source-id: c7e61cae76a0566cc3e2ddc3ab4d48f8dec9d777
2019-08-27 07:11:34 -07:00
Naman Goyal
8a8c0691ba fix cython dependency in the setup (#847)
Summary:
Fixes the `pytext` build broken by 4fc39538ae

Earlier versions of setuptools required `cython` to be installed before even starting setup.py. This change fixes it.
More details: https://github.com/pypa/setuptools/blob/master/CHANGES.rst#180
and https://stackoverflow.com/questions/37471313/setup-requires-with-cython
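
For context, a minimal sketch of the pattern described in those links (an illustrative example, not the fairseq setup.py): with setuptools >= 18.0, Cython can go in `setup_requires` and `.pyx` sources can be passed straight to `Extension`, so Cython no longer has to be installed before setup.py starts:
```python
from setuptools import Extension, setup

setup(
    name="example-pkg",  # hypothetical package
    version="0.0.1",
    # setuptools >= 18.0 resolves setup_requires before building extensions,
    # so Cython does not need to be importable at the top of setup.py.
    setup_requires=["cython", "numpy", "setuptools>=18.0"],
    ext_modules=[Extension("example.fast", sources=["example/fast.pyx"])],
)
```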
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/847

Differential Revision: D16997450

fbshipit-source-id: 5f65026c228a1b94280ca73937078ee3e21ce4f8
2019-08-26 07:19:21 -07:00
Naman Goyal
4fc39538ae Cythonize token block dataset (#834)
Summary:
Cythonized the token block dataset code; it's `> 100x` faster. Building token blocks for the entire `bookwiki+CC+stories+openweb` corpus takes just ~`39.9` seconds.

TODO:
1) I think I can make it another 2x faster.
2) Cleanup.

EDIT History:
~~First pass at parallelizing `token_block_dataset`. The code feels somewhat complicated and cluttered.
This is 2-3x faster though in my tests on the `bookwiki` dataset with both `complete` and `complete_doc` modes.
myleott Can you take a look for correctness, as I am still not 100% sure that I am not missing corner cases.~~
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/834

Test Plan:
Imported from GitHub, without a `Test Plan:` line.

Test workflow: f133816198

Reviewed By: myleott

Differential Revision: D16970257

Pulled By: myleott

fbshipit-source-id: ec45a308193c9e9f3e7075336c15df4723228d6f
2019-08-23 07:32:36 -07:00
Myle Ott
ffffe04ea1 v0.7.2 -> v0.8.0 (#1017)
Summary:
Changelog:
- Relicensed under MIT license
- Add RoBERTa
- Add wav2vec
- Add WMT'19 models
- Add initial ASR code
- Changed torch.hub interface (`generate` renamed to `translate`)
- Add `--tokenizer` and `--bpe`
- f812e52: Renamed data.transforms -> data.encoders
- 654affc: New Dataset API (optional)
- `47fd985`: Deprecate old Masked LM components
- `5f78106`: Set mmap as default dataset format and infer format automatically
- Misc fixes for sampling
- Misc fixes to support PyTorch 1.2
Pull Request resolved: https://github.com/pytorch/fairseq/pull/1017

Differential Revision: D16799880

Pulled By: myleott

fbshipit-source-id: 45ad8bc531724a53063cbc24ca1c93f715cdc5a7
2019-08-14 05:02:45 -07:00
Myle Ott
d015d23a1f Add fairseq-validate
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/765

Differential Revision: D16763357

Pulled By: myleott

fbshipit-source-id: 758b03158e486ee82786e2d5bf4e46073b50c503
2019-08-13 13:07:04 -07:00
Myle Ott
abb7ed4c91 Update READMEs for torch.hub
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/795

Differential Revision: D16620488

Pulled By: myleott

fbshipit-source-id: 1998a9ccd8816fc7f590861fb4898f910a36bc1e
2019-08-02 06:24:17 -07:00
Myle Ott
e75cff5f2c Relicense fairseq under MIT license (#786)
Summary:
The previous BSD+PATENTS license was controversial. We have been
approved to relicense fairseq under the MIT license.
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/786

Differential Revision: D16560654

Pulled By: myleott

fbshipit-source-id: f78b1beb4f2895dd7b9bfc79f5f952a2bfb94034
2019-07-30 07:48:23 -07:00
Myle Ott
b002d0096e v0.7.1 -> v0.7.2 (#891)
Summary:
No major API changes since the last release. Cutting a new release since we'll be merging significant (possibly breaking) changes to logging, data loading and the masked LM implementation soon.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/891

Differential Revision: D16377132

Pulled By: myleott

fbshipit-source-id: f1cb88e671ccd510e53334d0f449fe18585268c7
2019-07-19 06:33:40 -07:00
Louis MARTIN
cc292afaed Add specific compile flags for macOS (#862)
Summary:
Fairseq wouldn't install on macOS.
A workaround was found here: https://github.com/pytorch/fairseq/issues/289
This is now automatic in setup.py; maybe there's a cleaner way to do it.

I checked that it compiles fine on Linux and macOS.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/862

Differential Revision: D16142105

Pulled By: myleott

fbshipit-source-id: 998ac7781d7a1ac047f4f9239c1fe16eab4be0dd
2019-07-06 12:31:55 -07:00
Myle Ott
881381cfc7 v0.7.1: fix PyPI setup and tests
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/818

Differential Revision: D15916265

Pulled By: myleott

fbshipit-source-id: c66c0bd988d3472c4150226952f34ee8d4c3db86
2019-06-20 06:28:37 -07:00
Myle Ott
bd710e75ae v0.7.0 (#817)
Summary:
Notable (possibly breaking) changes:
- d45db80: Remove checkpoint utility functions from utils.py into checkpoint_utils.py
- f2563c2: Move LM definitions into separate files
- dffb167: Updates to model API:
  - `FairseqModel` -> `FairseqEncoderDecoderModel`
  - add `FairseqDecoder.extract_features` and `FairseqDecoder.output_layer`
  - `encoder_out_dict` -> `encoder_out`
  - rm unused `remove_head` functions
- 34726d5: Move `distributed_init` into `DistributedFairseqModel`
- cf17068: Simplify distributed launch by automatically launching multiprocessing on each node for all visible GPUs (allows launching just one job per node instead of one per GPU)
- d45db80: Change default LR scheduler from `reduce_lr_on_plateau` to `fixed`
- 96ac28d: Rename `--sampling-temperature` -> `--temperature`
- fc1a19a: Deprecate dummy batches
- a1c997b: Add memory mapped datasets
- 0add50c: Allow cycling over multiple datasets, where each one becomes an "epoch"

Plus many additional features and bugfixes
Pull Request resolved: https://github.com/pytorch/fairseq/pull/817

Differential Revision: D15913844

Pulled By: myleott

fbshipit-source-id: d5b5d678efdd9dd3e4d7ca848ddcf1ec2b21bf6b
2019-06-19 19:08:50 -07:00
Bairen Yi
a8f28ecb63 Python3.5 compat (#794)
Summary:
See #467. Ping myleott to review.

This is a work-related contribution. Ping lark to review.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/794

Differential Revision: D15756816

Pulled By: myleott

fbshipit-source-id: 6dce3ff3a713bf5f60e5782bc260b2ca9d2c0a9b
2019-06-11 04:10:08 -07:00
Myle Ott
66f033e6a2 Update setup.py
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/580

Differential Revision: D14494390

Pulled By: myleott

fbshipit-source-id: 524cc16a106f2af630357e2ebdf7dde35fa7d494
2019-03-15 21:30:41 -07:00
Myle Ott
e6422528da 0.6.1 -> 0.6.2 (#577)
Summary:
Changelog:
- 998ba4f: Add language models from Baevski & Auli (2018)
- 4294c4f: Add mixture of experts code from Shen et al. (2019)
- 0049349: Add example for multilingual training
- 48d9afb: Speed improvements, including fused operators from apex
- 44d27e6: Add Tensorboard support
- d17fa85: Add Adadelta optimizer
- 9e1c880: Add `FairseqEncoderModel`
- b65c579: Add `FairseqTask.inference_step` to modularize generate.py
- 2ad1178: Add back `--curriculum`
- Misc bug fixes and other features

Pull Request resolved: https://github.com/pytorch/fairseq/pull/577

Differential Revision: D14481233

Pulled By: myleott

fbshipit-source-id: 4ff8625ef1c0b24273fc65df7c5658e3c932e8b7
2019-03-15 10:27:01 -07:00
Myle Ott
139e3a3c40 Add sacrebleu to requirements
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/542

Differential Revision: D14258895

Pulled By: myleott

fbshipit-source-id: 950a840e1d001a472be8d4737c9e4de5224137b3
2019-02-28 07:54:28 -08:00
Myle Ott
b65c579bed Modularize generate.py (#351)
Summary:
Pull Request resolved: https://github.com/pytorch/translate/pull/351

This makes it easier for tasks to plug into generate.py/interactive.py
Pull Request resolved: https://github.com/pytorch/fairseq/pull/520

Differential Revision: D14183881

Pulled By: myleott

fbshipit-source-id: ede5e53ddc1215ed3b12b8f1eba048c946913c33
2019-02-22 10:08:52 -08:00
Myle Ott
fbd4cef9a5 Add fairseq to PyPI (#495)
Summary:
- fairseq can now be installed via pip: `pip install fairseq`
- command-line tools are globally accessible: `fairseq-preprocess`, `fairseq-train`, `fairseq-generate`, etc.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/495

Differential Revision: D14017761

Pulled By: myleott

fbshipit-source-id: 10c9f6634a3056074eac2f33324b4f1f404d4235
2019-02-08 22:03:29 -08:00