Deduplication

A tool that detects duplicate code files by assuming two files with the same sequence of variables are duplicates.

Usage:

```
deduplication.py --data_dir <data_dir> --output_dir <output_dir>
```

`data_dir` should be a directory containing `.zst`-compressed files, each in jsonl format. Each jsonl entry should have a `text` field containing the code as a string, and a `meta` field, a dictionary containing `repo_name` and `file_name`. `output_dir` will be the directory containing the deduplicated data, in the same format as the input data.
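For reference, a single input record might look like the following (a hypothetical entry; the repo and file names are made up):

```python
import json

# Hypothetical jsonl entry illustrating the expected input schema.
entry = {
    "text": "def add(a, b):\n    return a + b\n",
    "meta": {"repo_name": "example/repo", "file_name": "add.py"},
}

# Each record is serialized as one line of a .jsonl file.
line = json.dumps(entry)
print(line)
```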

The deduplication tool loads each file and tokenizes the code to obtain the list of variables it contains. The variables are extracted with a regex that discards all tokens not made up solely of alphanumeric characters. We then collect the unique variable sequences, filter the dataset so that only one file per sequence remains, and write the filtered dataset to `output_dir` in the same format as the input data in `data_dir`.
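The steps above can be sketched as follows (a simplified illustration of the idea, not the tool's actual code; hashing the token sequence mirrors the streaming variant's smaller memory footprint):

```python
import re

# Assumption: a "variable" is any maximal run of alphanumeric characters.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")

def variable_sequence(code):
    # All alphanumeric tokens, in order of appearance in the source.
    return tuple(TOKEN_RE.findall(code))

def deduplicate(entries):
    # Keep only the first entry seen for each unique variable sequence,
    # storing hashes rather than the sequences themselves to save memory.
    seen = set()
    kept = []
    for entry in entries:
        key = hash(variable_sequence(entry["text"]))
        if key not in seen:
            seen.add(key)
            kept.append(entry)
    return kept

# Two files with identical token sequences but different formatting
# count as duplicates; only the first survives.
entries = [
    {"text": "x = 1\ny = x + 1", "meta": {"repo_name": "r1", "file_name": "a.py"}},
    {"text": "x=1; y=x+1",       "meta": {"repo_name": "r2", "file_name": "b.py"}},
    {"text": "a = 2",            "meta": {"repo_name": "r3", "file_name": "c.py"}},
]
print(len(deduplicate(entries)))  # → 2
```

Note that this notion of duplication is deliberately loose: whitespace, punctuation, and operator differences are ignored, so reformatted copies of the same file are caught.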