Commit Graph

102 Commits

Author SHA1 Message Date
Mrinal18
39ded45239 Merge branch 'evaluation' of https://github.com/ncoop57/gpt-code-clippy into evaluation
adding codebleu evaluation
2021-07-15 08:58:28 +00:00
Mrinal18
8bb263223a adding codebleu evaluation 2021-07-15 08:33:50 +00:00
arampacha
30f2ee4127
reinstantiate opt_state recursively when there are nested lists (#47) 2021-07-15 03:09:40 +03:00
ncoop57
6d4c9bfa23 Convert variable names to hashes for smaller memory footprint 2021-07-14 23:58:09 +00:00
Ben Trevett
cb64f77b52
Merge pull request #46 from ncoop57/deduplication
add deduplication for streaming and fixed readme
2021-07-14 15:36:18 +01:00
bentrevett
8c7fb0ed99 added streaming duplicate remover (probably v. slow) 2021-07-14 15:34:35 +01:00
bentrevett
07aacc88da update readme 2021-07-14 15:13:04 +01:00
bentrevett
5502b19552 update readme 2021-07-14 15:12:16 +01:00
Ben Trevett
a4abfc85b9
Merge pull request #45 from ncoop57/deduplication
change deduplication to use huggingface dataset unique method
2021-07-14 15:09:27 +01:00
bentrevett
0eb1dd5960 remove streaming arg 2021-07-14 15:08:42 +01:00
bentrevett
706381e23d change deduplication to use HuggingFace datasets 2021-07-14 15:08:30 +01:00
Ben Trevett
4b2b7ca00c
Merge pull request #44 from ncoop57/fix_dedup_parallel
fix deduplication_parallel
2021-07-14 13:26:12 +01:00
bentrevett
7579ec9697 fix deduplication_parallel 2021-07-14 13:25:40 +01:00
ncoop57
b1354dd181 Add parallel, but buggy version of dedup script 2021-07-14 12:01:38 +00:00
arampacha
3aef752fce
upd clm script (#43)
* update resuming behaviour
* adds temporary hack to enable resuming w multisteps and adafactor
2021-07-14 00:52:59 +03:00
bentrevett
47bb3895bc add type hints 2021-07-13 22:47:54 +01:00
Ben Trevett
309c4095c0
Merge pull request #42 from ncoop57/deduplication
use threaded=True for the lm_dataformat.Reader
2021-07-13 22:27:34 +01:00
bentrevett
0ba2261d29 threaded the lm_dataformat.Reader 2021-07-13 22:26:49 +01:00
Ben Trevett
dac5b42c2d
Merge pull request #41 from ncoop57/deduplication
update deduplication code to write deduplicated data to folder
2021-07-13 17:56:50 +01:00
bentrevett
08bb88ad4b update readme 2021-07-13 17:55:50 +01:00
bentrevett
f0160ebbbe improve deduplication
- remove code_hash from DocumentID (not needed)
- can now write non-duplicate files to given `output_dir`
- number of examples per file given by `archive_commit_frequency`
2021-07-13 17:53:01 +01:00
arampacha
92539be4cb
Training script update (#40)
* adds cleaner streaming script
* resume from checkpoint w/ MultiSteps
* adds gradient clipping
* upd run_clm_flax.py
2021-07-13 18:38:04 +03:00
Santiago
cdbf1572d6 WIP: EDA
feat: add data preprocessing
TODO: implement keyword detection and deduplication stats
2021-07-13 13:26:10 +00:00
arunraja-hub
4aabbacbe9 bs=1 and corrected save checkpoint function 2021-07-12 17:28:31 +00:00
ncoop57
41a7423ca4 Add runner for 1.3b model 2021-07-12 00:00:04 +00:00
ncoop57
b110dd905b Increase saving lim 2021-07-11 23:58:12 +00:00
ncoop57
127caba3d8 Update requirements 2021-07-11 23:45:58 +00:00
ncoop57
1e29442203 Rename to make clearer 2021-07-11 23:39:20 +00:00
ncoop57
489e424169 Change weight decay to match gpt codex 2021-07-11 23:00:40 +00:00
ncoop57
acead42c66 Update model name 2021-07-11 22:51:57 +00:00
ncoop57
5e80c79576 Update hyperparams 2021-07-11 22:50:59 +00:00
ncoop57
a3b9f56a02 Add requirements file and run config script 2021-07-11 22:38:22 +00:00
ncoop57
8cb24cc0ea Merge branch 'main' of https://github.com/ncoop57/gpt-code-clippy into main 2021-07-11 22:18:32 +00:00
ncoop57
6c7ce1e8d4 Add script to add new tokens to model and save it 2021-07-11 22:18:25 +00:00
Arun Raja
93ab79859b
added save_model_checkpoint (#38) 2021-07-11 21:29:49 +03:00
ncoop57
d4a3a699aa Fix bug with prefetch data iterator 2021-07-11 18:06:11 +00:00
arampacha
f59296a80b
Training (#39) 2021-07-11 20:31:36 +03:00
arampacha
59cdcdf638
fix num steps passed to scheduler (#37) 2021-07-11 11:38:30 +03:00
ncoop57
22fcb3c5de Add initial evaluation code 2021-07-11 02:24:38 +00:00
ncoop57
3cb074353a Resolve conflicts 2021-07-10 23:48:56 +00:00
ncoop57
36bf7d245e Add additional repos and submodules 2021-07-10 23:31:55 +00:00
reshinthadithyan
5ce84e2bcc adding parse score to evaluator 2021-07-10 23:41:07 +05:30
reshinthadithyan
98fecd06b9 add parse check and tree-sitter 2021-07-10 23:36:19 +05:30
ncoop57
1382b4c907 Merge branch 'eval_metrics' of https://github.com/ncoop57/gpt-code-clippy into evaluation 2021-07-10 17:57:18 +00:00
ncoop57
6b8e77951b Merge branch 'main' of https://github.com/ncoop57/gpt-code-clippy into evaluation 2021-07-10 17:55:33 +00:00
arampacha
96c267b7c7
Streaming script update (#36) 2021-07-10 20:53:40 +03:00
arampacha
240816c55e
GPTNeo hypers (#35)
* store full train state
* close dl processes when done
* gpt3 scheduler
2021-07-10 17:52:50 +03:00
ncoop57
02dbf0d828 Merge branch 'main' of https://github.com/ncoop57/gpt-code-clippy into datapreproc 2021-07-10 12:55:21 +00:00
arampacha
9af5a7faaf dl hotfix 2021-07-10 10:50:57 +00:00
arampacha
5df86d34b3
Streaming upd (#34)
* adds mp version of prefetch
* adds script v2 w streaming eval ds
* quick fix for push_to_hub
2021-07-10 10:08:39 +03:00