Skip to content

IsNoobgrammer/MinHash-LSH

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

MinHash-LSH

Batched Minhash-LSH deduplication for large datasets This Repo is Highly Adopted from datasketch and text-dedup by default uses SHA-64 for hash signature

to get started

git clone https://github.com/IsNoobgrammer/MinHash-LSH.git
cd MinHash-LSH
pip install scipy datasets torch -qU

To use

from datasets import load_dataaset
from LSH import deduplicate_dataset

ds=load_dataset("fhai50032/HINGLISH-LIMA",split="train")
column="Hinglish"

dedup_ds=deduplicate_dataset(ds,column)

args

    ds: Dataset,
    column,
    threshold=0.8, 
    num_perm=256,
    batch_size=10_000,
    num_proc= 1 if os.name == "nt" else os.cpu_count() ,
    ngram_size=5,
    min_length=5,
    bands_rows=(None,None)

About

Batched Minhash-LSH deduplication for large datasets

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  NODES
COMMUNITY 1
Project 3
USERS 1