Commit Graph

120 Commits

Author SHA1 Message Date
taisazero
d70d95082d Added file to download licenses 2021-07-17 07:40:33 +00:00
ncoop57
efcf5fcde5 Merge branch 'camera-ready' of github.com:ncoop57/gpt-code-clippy into camera-ready 2021-07-15 15:59:45 -04:00
ncoop57
a2e6c08a3c WIP on human eval evaluation 2021-07-15 15:59:42 -04:00
Mrinal Mathur
a3a90effed Merge pull request #49 from ncoop57/Mrinal18-camera-ready-1
Adding documentation on usage
2021-07-16 01:25:46 +05:30
ncoop57
f49fda52e2 Merge branch 'main' of github.com:ncoop57/gpt-code-clippy into main 2021-07-15 19:04:57 +00:00
ncoop57
4ffcfd5528 Add new downloading scripts 2021-07-15 19:04:54 +00:00
Arun Raja
d3b5b8f530 Merge pull request #52 from ncoop57/13b
added partitions.py for 1.3b parallel training
2021-07-15 22:59:22 +08:00
arunraja-hub
2521f896f4 added partitions.py for 1.3b parallel training 2021-07-15 14:58:29 +00:00
Arun Raja
9ba71b77d5 Merge pull request #51 from ncoop57/13b
shell script for 1.3b streaming [WIP]
2021-07-15 21:44:04 +08:00
arunraja-hub
65f6279f47 shell script for 1.3b streaming 2021-07-15 13:42:11 +00:00
Mrinal Mathur
0d63ea7fa9 Updated training file information 2021-07-15 17:46:38 +05:30
arampacha
d88e51f8a1 adds apps dataset loading and reindent scripts 2021-07-15 14:13:24 +03:00
arampacha
1cbe0edf1e updates run_clm_streaming_flax.py 2021-07-15 14:13:24 +03:00
Mrinal Mathur
c5966efd86 Adding documentation on usage 2021-07-15 15:12:58 +05:30
ncoop57
9c5c7aa4e1 Even more reorganizing 2021-07-14 20:36:03 -04:00
ncoop57
8855f0c40e More reorganization 2021-07-14 20:24:37 -04:00
ncoop57
51dc4c3852 Update readme with outline 2021-07-14 20:22:13 -04:00
ncoop57
6ccb72c0ad Merge branch 'docs' of github.com:ncoop57/gpt-code-clippy into camera-ready 2021-07-14 20:12:47 -04:00
ncoop57
745372a744 SOme initial clean up 2021-07-14 20:12:26 -04:00
arampacha
30f2ee4127 reinstantiate opt_state recursively when there are nested lists (#47) 2021-07-15 03:09:40 +03:00
ncoop57
6d4c9bfa23 Convert variable names to hashes for smaller memory footprint 2021-07-14 23:58:09 +00:00
Ben Trevett
cb64f77b52 Merge pull request #46 from ncoop57/deduplication
add deduplication for streaming and fixed readme
2021-07-14 15:36:18 +01:00
bentrevett
8c7fb0ed99 added streaming duplicate remover (probably v. slow) 2021-07-14 15:34:35 +01:00
bentrevett
07aacc88da update readme 2021-07-14 15:13:04 +01:00
bentrevett
5502b19552 update readme 2021-07-14 15:12:16 +01:00
Ben Trevett
a4abfc85b9 Merge pull request #45 from ncoop57/deduplication
change deduplication to use huggingface dataset unique method
2021-07-14 15:09:27 +01:00
bentrevett
0eb1dd5960 remove streaming arg 2021-07-14 15:08:42 +01:00
bentrevett
706381e23d change deduplication to use HuggingFace datasets 2021-07-14 15:08:30 +01:00
Ben Trevett
4b2b7ca00c Merge pull request #44 from ncoop57/fix_dedup_parallel
fix deduplication_parallel
2021-07-14 13:26:12 +01:00
bentrevett
7579ec9697 fix deduplication_parallel 2021-07-14 13:25:40 +01:00
ncoop57
b1354dd181 Add parallel, but buggy version of dedup script 2021-07-14 12:01:38 +00:00
arampacha
3aef752fce upd clm script (#43)
* update resuming behaviour

* adds temporary hack to enable resuming w multisteps and adafactor
2021-07-14 00:52:59 +03:00
bentrevett
47bb3895bc add type hints 2021-07-13 22:47:54 +01:00
Ben Trevett
309c4095c0 Merge pull request #42 from ncoop57/deduplication
use threaded=True for the lm_dataformat.Reader
2021-07-13 22:27:34 +01:00
bentrevett
0ba2261d29 threaded the lm_dataformat.Reader 2021-07-13 22:26:49 +01:00
Ben Trevett
dac5b42c2d Merge pull request #41 from ncoop57/deduplication
update deduplication code to write deduplicated data to folder
2021-07-13 17:56:50 +01:00
bentrevett
08bb88ad4b update readme 2021-07-13 17:55:50 +01:00
bentrevett
f0160ebbbe improve deduplication
- remove code_hash from DocumentID (not needed)
- can now write non-duplicate files to given `output_dir`
- number of examples per file given by `archive_commit_frequency`
2021-07-13 17:53:01 +01:00
arampacha
92539be4cb Training script update (#40)
* adds cleaner streaming script

* resume from checkpoint w/ MultiSteps

* adds gradient clipping

* upd run_clm_flax.py
2021-07-13 18:38:04 +03:00
Santiago
cdbf1572d6 WIP: EDA
feat: add data preprocessing
TODO: implement keyword detection and deduplication stats
2021-07-13 13:26:10 +00:00
arunraja-hub
4aabbacbe9 bs=1 and corrected save checkpoint function 2021-07-12 17:28:31 +00:00
ncoop57
73aa110de9 Add model card and data sheet templates 2021-07-12 13:15:22 -04:00
ncoop57
41a7423ca4 Add runner for 1.3b model 2021-07-12 00:00:04 +00:00
ncoop57
b110dd905b Increase saving lim 2021-07-11 23:58:12 +00:00
ncoop57
127caba3d8 Update requirments 2021-07-11 23:45:58 +00:00
ncoop57
1e29442203 Rename to make clearer 2021-07-11 23:39:20 +00:00
ncoop57
489e424169 Change weight decay to match gpt codex 2021-07-11 23:00:40 +00:00
ncoop57
acead42c66 Update model name 2021-07-11 22:51:57 +00:00
ncoop57
5e80c79576 Update hyperparams 2021-07-11 22:50:59 +00:00
ncoop57
a3b9f56a02 Add requirements file and run config script 2021-07-11 22:38:22 +00:00