51 Commits

Author SHA1 Message Date
bentrevett
5aee2c9197 update code with joblib 2021-07-14 12:56:29 +01:00
bentrevett
47bb3895bc add type hints 2021-07-13 22:47:54 +01:00
bentrevett
0ba2261d29 threaded the lm_dataformat.Reader 2021-07-13 22:26:49 +01:00
bentrevett
08bb88ad4b update readme 2021-07-13 17:55:50 +01:00
bentrevett
f0160ebbbe improve deduplication
- remove code_hash from DocumentID (not needed)
- can now write non-duplicate files to given `output_dir`
- number of examples per file given by `archive_commit_frequency`
2021-07-13 17:53:01 +01:00
bentrevett
c691ca9e51 add usage readme 2021-07-09 12:15:12 +01:00
bentrevett
284cfdd79f added code hash to document id 2021-07-09 12:14:49 +01:00
bentrevett
92d42747c5 added duplicate detector 2021-07-09 11:36:39 +01:00
arampacha
acfcfea185
dataset loading script v0 (#31) 2021-07-09 01:01:56 +03:00
arampacha
19462b5239
Grad accum (#30)
* adds gradient accumulation

* adds saving when training is finished
2021-07-08 22:16:49 +03:00
Arun Raja
2736967001
checkpoint functions added to script (#27)
* saves optimizer state together with model
* enables resuming from a saved checkpoint
* removes old checkpoints up to `save_total_limit`

Co-authored-by: arampacha <aruthart@gmail.com>
2021-07-08 14:34:16 +03:00
Nathan Cooper
9c4c40756f
Merge pull request #28 from ncoop57/datapreproc
Update data downloading scripts
2021-07-07 14:58:20 -04:00
ncoop57
1811dc3f14 Merge branch 'main' of https://github.com/ncoop57/gpt-code-clippy into datapreproc 2021-07-07 18:57:17 +00:00
ncoop57
c7d3719bf4 Update data scripts for downloading stuff without OOMing 2021-07-07 18:57:07 +00:00
Arun Raja
d128c921ef
Merge pull request #26 from ncoop57/checkpoint
nb with checkpoint functions
2021-07-08 00:23:47 +08:00
arunraja-hub
06ffddfca1 nb with checkpoint functions 2021-07-07 16:21:24 +00:00
arampacha
539dadc223
adds wandb tracking (#25) 2021-07-07 17:02:42 +03:00
arampacha
a7c5573eda
sync training script with official (#24) 2021-07-07 12:35:20 +03:00
arampacha
420ab78f56
fix for bf16 (#22) 2021-07-06 23:44:48 +03:00
Nathan Cooper
9615befa43
Merge pull request #21 from ncoop57/datapreproc
Helper scripts for downloading data and notifying/reading responses of repo owners
2021-07-06 15:25:46 -04:00
ncoop57
89a4b56475 Add helper scripts for downloading the data 2021-07-06 19:24:43 +00:00
ncoop57
aa6af3770c Fix up notification and reading repo reply scripts and add script for checking vulnerabilities 2021-07-06 18:23:45 +00:00
ncoop57
9f5240a0b9 Add splitting repo file into thirds for easier downloading 2021-07-06 18:23:02 +00:00
Arun Raja
5896f03912
New run_clm script (#19)
updates training script with fix from https://github.com/huggingface/transformers/pull/12514
2021-07-06 17:21:47 +03:00
ncoop57
ee44f989cc Update gitignore for vscode 2021-07-06 00:22:49 +00:00
ncoop57
a9ec346cdb Merge branch 'main' of https://github.com/ncoop57/gpt-code-clippy into datapreproc 2021-07-06 00:22:12 +00:00
arampacha
a53dfe2f03
Adds fine-tuning notebooks (#18) 2021-07-05 21:24:02 +03:00
ncoop57
a335891a49 Add scripts to notify repo owners/admins and to check replies from notifications 2021-07-05 01:12:10 +00:00
Nathan Cooper
ff58c4ae95
Merge pull request #11 from ncoop57/gptneo
Gptneo additional model configurations
2021-07-04 19:24:10 -04:00
arampacha
ceceef4ce4 gpt-neo-test nb clean-up 2021-07-04 21:59:52 +00:00
arampacha
1d0172d8b6 adds bash scripts for 13b and 27b models 2021-07-04 21:58:54 +00:00
Nathan Cooper
98991a3937
Merge pull request #10 from ncoop57/datapreproc
Add proper submodule
2021-07-04 16:52:28 -04:00
ncoop57
c82888a9a9 Add proper submodule 2021-07-04 20:51:44 +00:00
Nathan Cooper
bc96526fba
Merge pull request #9 from ncoop57/datapreproc
Data Processing
2021-07-04 16:50:10 -04:00
ncoop57
2c3905096f Add nb generating new combined dataset, converting it to a format for github-downloader from EleutherAI, and instructions for how to run github-downloader to extract the text from the files in the repos in a format for a language model 2021-07-04 20:49:16 +00:00
ncoop57
aed4b0f6a2 Remove absolute path for relative 2021-07-04 19:51:05 +00:00
ncoop57
62e5f43d90 Add downloading of data to folder via gdown to nb 2021-07-04 19:43:37 +00:00
ncoop57
19d3baee5b Merge branch 'main' of https://github.com/ncoop57/gpt-code-clippy into datapreproc 2021-07-04 19:31:21 +00:00
Nathan Cooper
73cf954d4a
Merge pull request #8 from ncoop57/gptneo
Add GPT Neo Trainer Code
2021-07-04 15:30:34 -04:00
ncoop57
677c95aa85 Remove unnecessary file and add shebang to bash script 2021-07-04 18:38:53 +00:00
arampacha
64204f47f0 bash script 2021-07-04 17:22:58 +00:00
arampacha
5408f06695 gpt-neo-training 2021-07-04 17:19:42 +00:00
arampacha
a92495b6f6 gpt-neo-train-script 2021-07-04 16:15:30 +00:00
reshinthadithyan
060f10fda5 BLEU-4 as Key 2021-07-04 09:44:45 +05:30
reshinthadithyan
4618e408a2 Adding extrinsic_eval.py where metrics are called. 2021-07-04 09:39:20 +05:30
reshinthadithyan
aa64977e92 adding init metrics - bleu 2021-07-04 09:08:15 +05:30
arampacha
5bd3cec769 nb upd 2021-07-02 21:49:50 +00:00
arampacha
02389437f1 exploratory nb 2021-07-02 20:43:38 +00:00
Graham Neubig
81b334ba8e Added github scraping scripts 2021-07-02 11:26:48 -04:00
Nathan Cooper
805a1726b3
Update README.md 2021-07-01 20:12:11 -04:00