
GPT-Code-Clippy (GPT-CC)



Logo courtesy of the awesome Aimee Trevett!

Introduction

GPT-Code-Clippy (GPT-CC) is an open source version of GitHub Copilot, a language model (based on GPT-3 and called Codex) that is fine-tuned on publicly available code from GitHub.

Datasets

The dataset used to train GPT-CC is obtained from SEART GitHub Search using the following criteria (a sketch of this filter follows the list):

  • >10 GitHub stars
  • >2 commits
  • Must have a license
  • Exclude forks
  • Size < 70708 bytes
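
A minimal sketch of this selection filter. The metadata field names below are assumptions about the shape of a SEART GitHub Search export, not its real schema:

```python
def keep_repo(repo: dict) -> bool:
    """Return True if a repository record passes the selection criteria.
    Field names (stars, commits, license, fork, size) are illustrative."""
    return (
        repo["stars"] > 10
        and repo["commits"] > 2
        and repo.get("license") is not None  # must have a license
        and not repo["fork"]                 # exclude forks
        and repo["size"] < 70708             # size in bytes
    )

repos = [
    {"stars": 42, "commits": 100, "license": "MIT", "fork": False, "size": 12345},
    {"stars": 5, "commits": 1, "license": None, "fork": True, "size": 99999},
]
selected = [r for r in repos if keep_repo(r)]  # keeps only the first record
```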

These repositories are then combined with all of the GitHub repositories contained in The Pile.

The repositories are then filtered for duplicate files. Filtering is performed by regexing each file in each repository to obtain a list of "variables" (the tokens which only contain alphanumeric characters) and then filtering out any files which contain the same sequence of "variables". The deduplication script is available here.
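
A rough sketch of that idea; the exact regex and the hashing scheme here are assumptions, and the linked deduplication script is authoritative:

```python
import re
from pathlib import Path

# Tokens containing only alphanumeric characters, i.e. the "variables".
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")

def variable_signature(text: str) -> int:
    """Hash the ordered sequence of alphanumeric tokens in a file."""
    return hash(tuple(TOKEN_RE.findall(text)))

def deduplicate(paths):
    """Keep only the first file seen for each variable-sequence signature."""
    seen, unique = set(), []
    for path in paths:
        sig = variable_signature(Path(path).read_text(errors="ignore"))
        if sig not in seen:
            seen.add(sig)
            unique.append(path)
    return unique
```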

The final dataset is available here. The dataset without the duplicates filtered out is also available here.

TODO: link to the dataset available on the HuggingFace datasets hub, see: https://github.com/huggingface/datasets/pull/2666

Models

The GPT-CC models are fine-tuned versions of GPT-2 and GPT-Neo.

The available models are:

TODO: which is the recommended model?
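
Any of the checkpoints should load with the standard transformers API. A minimal sketch, where the model ID is a placeholder for whichever GPT-CC checkpoint ends up being recommended:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model ID, for illustration only.
model_id = "flax-community/gpt-neo-125M-code-clippy"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_length=64, do_sample=True, temperature=0.2)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))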

Training

Training is done using the training scripts available here.

TODO: which is the recommended way to train GPT-CC?
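
For orientation, here is a minimal causal-LM fine-tuning sketch using the PyTorch Trainer. It is not the project's own training script, and the dataset file, base model, and hyperparameters are illustrative only:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Illustrative base model and data file.
model_id = "EleutherAI/gpt-neo-125M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # GPT-Neo has no pad token
model = AutoModelForCausalLM.from_pretrained(model_id)

# One source file per line; replace with the real code corpus.
dataset = load_dataset("text", data_files={"train": "code_corpus.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt-cc-finetuned",
                           per_device_train_batch_size=2,
                           num_train_epochs=1, learning_rate=5e-5),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```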

Evaluation

The models are evaluated on the APPS and HumanEval datasets.

HumanEval Results

| Model                           | pass@1 | pass@2 | pass@5 | pass@10 |
|---------------------------------|--------|--------|--------|---------|
| EleutherAI/gpt-neo              | 0.12%  | 0.24%  | 0.61%  | 1.22%   |
| dedup-filtered-no-resize-2048bs | 0.00%  | 0.00%  | 0.00%  | 0.00%   |
| 1024-filtered                   | 0.00%  | 0.00%  | 0.00%  | 0.00%   |
| dedup-2048                      | 0.00%  | 0.00%  | 0.00%  | 0.00%   |
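
The pass@k numbers above are presumably computed with the unbiased estimator from the Codex paper (Chen et al., 2021): draw n samples per problem, count the c samples that pass the unit tests, and estimate pass@k = 1 - C(n-c, k) / C(n, k). A minimal implementation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).
    n: samples generated per problem; c: samples that pass the tests."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples per problem, 1 of which passes.
print(f"{pass_at_k(10, 1, 1):.2%}")   # 10.00%
print(f"{pass_at_k(10, 1, 5):.2%}")   # 50.00%
```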

TODO: evaluation results.

Demo

A Visual Studio Code extension which uses the HuggingFace Inference API is available.
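
A completion request of the kind such an extension might send to the Inference API, sketched below; the model ID is a placeholder and the token handling is simplified:

```python
import requests

# Placeholder model ID; HF_API_TOKEN stands in for a real API token.
API_URL = "https://api-inference.huggingface.co/models/flax-community/gpt-neo-125M-code-clippy"
headers = {"Authorization": "Bearer HF_API_TOKEN"}

def complete(prompt: str) -> str:
    """Ask the hosted model to extend a code prompt."""
    payload = {"inputs": prompt,
               "parameters": {"max_new_tokens": 32, "temperature": 0.2}}
    response = requests.post(API_URL, headers=headers, json=payload)
    response.raise_for_status()
    return response.json()[0]["generated_text"]

print(complete("def hello_world():"))
```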

TODO: more information about this when complete.

Further Reading

For more information about GPT-CC, GitHub Copilot, etc., see:

TODO: add more further reading.

Acknowledgements

TODO: everyone to add their names here!