GPT-Code-Clippy (GPT-CC)

Open source GitHub Copilot for auto-generating code

I would like to train an open-source version of the new, awesome GitHub Copilot AI tool, which is based on GPT-3. Similar to the awesome people behind GPT-Neo, having such an open-source model would greatly help researchers understand what biases and limitations this kind of code-autocompletion model might have, such as generating insecure code (I do research in this area, and I know my team would love an open-sourced version to run experiments on, i.e. try and break it 🤓).

2. Language

The model will be trained on different programming languages such as C, C++, Java, Python, etc.

3. Model

GPT-Neo

4. Datasets

Datasets that will hopefully contain high-quality source code.

Possible links to publicly available datasets include:

Some additional datasets may need creating that are not just method level.

5. Training scripts

I believe the standard causal language modeling (CLM) training script would do for this.

We can make use of https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling/run_clm_flax.py
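The core preprocessing step in that CLM script is concatenating tokenized examples and splitting them into fixed-length blocks, where the labels are simply the inputs (the model handles the shift internally). A minimal sketch of that step, with illustrative token IDs and a helper name of my own choosing rather than the script's exact API:

```python
# Sketch of the block-grouping step used in CLM preprocessing
# (as in run_clm_flax.py). Token IDs below are illustrative.

def group_texts(token_lists, block_size):
    """Concatenate tokenized examples and split into fixed-length blocks.

    A trailing remainder shorter than block_size is dropped, so every
    training example has exactly block_size tokens. For CLM, the labels
    are the inputs themselves (the model shifts them by one position).
    """
    concatenated = [tok for tokens in token_lists for tok in tokens]
    total_length = (len(concatenated) // block_size) * block_size
    blocks = [
        concatenated[i : i + block_size]
        for i in range(0, total_length, block_size)
    ]
    return {"input_ids": blocks, "labels": [b[:] for b in blocks]}

# Example: two "documents" of token IDs, block size 4
batch = group_texts([[1, 2, 3, 4, 5], [6, 7, 8, 9]], block_size=4)
# batch["input_ids"] == [[1, 2, 3, 4], [5, 6, 7, 8]] (trailing 9 dropped)
```

Because blocks span document boundaries, no padding is needed and no tokens are wasted, which is why the Hugging Face CLM examples preprocess this way.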

6. (Optional) Challenges

The additional data may be a challenge. From what I can see of Copilot, it looks to be trained on entire files, not code snippets. File-level datasets exist, but they are a few years old, and I don't think they cover many programming languages. The ones I listed above cover multiple languages but contain only methods.

However, GitHub's API is pretty easy to use, so it would be straightforward to create a dataset from scratch, especially if we get some insights into how the Copilot dataset was generated 🤓
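As a rough sketch of what file-level collection could look like: GitHub's REST API exposes a repository's full file tree via `GET /repos/{owner}/{repo}/git/trees/{branch}?recursive=1`, and we would then keep only blob entries with source-code extensions. The helper name and extension list below are illustrative choices, not part of any existing script in this repo:

```python
import os

# Illustrative extension whitelist for a file-level dataset.
SOURCE_EXTS = {".py", ".c", ".cpp", ".java", ".js", ".go"}

def source_paths(tree_entries, exts=SOURCE_EXTS):
    """Filter a GitHub git-tree API response down to source files.

    Each entry in the API response has a "path" and a "type"
    ("blob" for files, "tree" for directories); we keep only blobs
    whose extension is in the whitelist.
    """
    return [
        e["path"]
        for e in tree_entries
        if e.get("type") == "blob"
        and os.path.splitext(e["path"])[1] in exts
    ]

# Example with a mocked API response (no network call):
tree = [
    {"path": "README.md", "type": "blob"},
    {"path": "src/main.py", "type": "blob"},
    {"path": "src", "type": "tree"},
    {"path": "lib/util.c", "type": "blob"},
]
print(source_paths(tree))  # ['src/main.py', 'lib/util.c']
```

In practice we would also want license filtering and deduplication before training, but the tree endpoint gives us whole files, which matches what Copilot appears to train on.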

7. (Optional) Desired project outcome

I'd love to have this open-source model set up in a Visual Studio Code extension similar to the GitHub Copilot one. I've actually made a tutorial on doing this with the GPT-Neo model, so we could easily clean it up and release it free of charge forever, since from what I've seen on Twitter, GitHub Copilot might eventually be put behind a paywall 😢.

8. (Optional) Reads

The following links may be useful for better understanding the project and what has previously been done.