
GPT-Code-Clippy (GPT-CC)

An open-source GitHub Copilot for automatically generating code

I would like to train an open-source version of the new, awesome GitHub Copilot AI tool, which is based on GPT-3. Similar to the work of the awesome people behind GPT-Neo, having such an open-source model would greatly help researchers understand what biases and limitations this kind of code-autocompletion model might have, such as generating insecure code. (I do research in this area, and I know my team would love an open-sourced version to run experiments on, i.e. try to break it 🤓)

2. Language

The model will be trained on various programming languages, such as C, C++, Java, Python, etc.

3. Model

GPT-Neo

4. Datasets

Datasets that will hopefully contain high-quality source code.

Possible links to publicly available datasets include:

Some additional datasets that are not just method-level may need to be created.

5. Training scripts

I believe the standard CLM language model script would do for this.

We can make use of https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling/run_clm_flax.py
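The core preprocessing step that such a CLM script performs can be sketched in plain Python: tokenized examples are concatenated and chopped into fixed-length blocks, with the labels set equal to the inputs (the one-token shift happens inside the model). This is a minimal sketch; the function name `group_texts` and the dropping of the short tail follow the Hugging Face example scripts, and the token IDs here are toy values.

```python
def group_texts(examples, block_size):
    """Concatenate token-ID lists and split them into equal-size blocks.

    `examples` is a dict like {"input_ids": [[...], [...], ...]}; any
    trailing tokens shorter than `block_size` are dropped.
    """
    # Flatten each field's list of lists into one long list.
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = len(concatenated["input_ids"])
    # Truncate to a multiple of block_size so every block is full.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal language modeling, labels are a copy of the inputs.
    result["labels"] = result["input_ids"].copy()
    return result

batch = {"input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9]]}
blocks = group_texts(batch, block_size=4)
print(blocks["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

In the real script this runs over a tokenized dataset via `datasets.Dataset.map(..., batched=True)`, but the block-grouping logic is the same.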

6. (Optional) Challenges

The additional data may be a challenge. From what I can see of Copilot, it looks to be trained on entire files, not code snippets. File-level datasets do exist, but they are a few years old, and I don't think they cover many programming languages. The ones I listed above include multiple languages but contain only methods.

However, GitHub's API is pretty easy to use, so it would be fairly straightforward to create a dataset from scratch, especially if we get some insights into how the Copilot dataset was generated 🤓
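As a hedged sketch of what using the GitHub API might look like: the REST endpoint `/search/repositories` lets you find candidate repositories by language and popularity before downloading their files. The helper name `build_search_url` and the specific query parameters (star threshold, sort order) are illustrative choices, not something from this repo.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.github.com"

def build_search_url(language, min_stars=100, page=1, per_page=100):
    """Build a GitHub repository-search URL filtered by language and stars.

    Uses the real /search/repositories endpoint; the filter values are
    just example defaults for collecting candidate repos to scrape.
    """
    params = {
        "q": f"language:{language} stars:>={min_stars}",
        "sort": "stars",
        "order": "desc",
        "page": page,
        "per_page": per_page,
    }
    return f"{API_ROOT}/search/repositories?{urlencode(params)}"

url = build_search_url("python", min_stars=500)
print(url)
```

Fetching the resulting URL (e.g. with `requests.get`) returns JSON whose `items` list contains repository metadata, including clone URLs; note that unauthenticated search requests are heavily rate-limited, so a token would be needed for any serious scraping run.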

7. (Optional) Desired project outcome

I'd love to have this open-source model set up in a Visual Studio Code extension similar to the GitHub Copilot one. I've actually made a tutorial on doing this with the GPT-Neo model, so we could easily clean it up and release it free of charge forever, because from what I've seen on Twitter, GitHub Copilot might eventually be put behind a paywall 😢.

8. (Optional) Reads

The following links may be useful for better understanding the project and what has previously been done.