Deduplication

A tool that detects duplicate code files by assuming two files with the same sequence of variables are duplicates.

Usage:

```
deduplication.py --data_dir <data_dir> --output_dir <output_dir>
```

`data_dir` should be a directory containing `.zst`-compressed files, each in jsonl format. Each jsonl entry should have a `text` field containing the code as a string, and a `meta` field, a dictionary containing `repo_name` and `file_name`. `output_dir` will be the directory containing the deduplicated data, in the same format as the input data.
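For reference, a single input record might look like the following (a hypothetical entry; the repo and file names are made up):

```python
import json

# Hypothetical jsonl entry illustrating the expected input schema.
entry = {
    "text": "def add(a, b):\n    return a + b\n",
    "meta": {"repo_name": "example/repo", "file_name": "add.py"},
}

# Each record is serialized as one line of a .jsonl file.
line = json.dumps(entry)
print(line)
```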

The deduplication tool loads each file and tokenizes the code to obtain the list of variables it contains. The variables are extracted with a regex that discards all tokens not made up solely of alphanumeric characters. We then collect the unique variable sequences, filter the dataset so that only one file per sequence remains, and write the filtered dataset to `output_dir` in the same format as the input data in `data_dir`.
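The steps above can be sketched as follows (a simplified illustration of the idea, not the tool's actual code; hashing the token sequence mirrors the streaming variant's smaller memory footprint):

```python
import re

# Assumption: a "variable" is any maximal run of alphanumeric characters.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")

def variable_sequence(code):
    # All alphanumeric tokens, in order of appearance in the source.
    return tuple(TOKEN_RE.findall(code))

def deduplicate(entries):
    # Keep only the first entry seen for each unique variable sequence,
    # storing hashes rather than the sequences themselves to save memory.
    seen = set()
    kept = []
    for entry in entries:
        key = hash(variable_sequence(entry["text"]))
        if key not in seen:
            seen.add(key)
            kept.append(entry)
    return kept

# Two files with identical token sequences but different formatting
# count as duplicates; only the first survives.
entries = [
    {"text": "x = 1\ny = x + 1", "meta": {"repo_name": "r1", "file_name": "a.py"}},
    {"text": "x=1; y=x+1",       "meta": {"repo_name": "r2", "file_name": "b.py"}},
    {"text": "a = 2",            "meta": {"repo_name": "r3", "file_name": "c.py"}},
]
print(len(deduplicate(entries)))  # → 2
```

Note that this notion of duplication is deliberately loose: whitespace, punctuation, and operator differences are ignored, so reformatted copies of the same file are caught.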