Commit

add more documentation
duhaime committed Mar 11, 2018
1 parent 70f75a4 commit 6e98658
Showing 2 changed files with 25 additions and 9 deletions.
34 changes: 25 additions & 9 deletions README.md
@@ -1,6 +1,13 @@
# minhash.js
```
_ _ _ _
_ __ ___ (_) _ __ | |__ __ _ ___ | |__ (_) ___
| '_ ` _ \ | | | '_ \ | '_ \ / _` | / __| | '_ \ | | / __|
| | | | | | | | | | | | | | | | | (_| | \__ \ | | | | _ | | \__ \
|_| |_| |_| |_| |_| |_| |_| |_| \__,_| |___/ |_| |_| (_) _/ | |___/
|__/
```

The [Minhash algorithm](https://en.wikipedia.org/wiki/MinHash) is a similarity estimation technique that is often used to identify near-duplicate documents in large text collections. This package offers a JavaScript implementation of the algorithm for use in Node.js or web applications.
[Minhashing](https://en.wikipedia.org/wiki/MinHash) is an efficient similarity estimation technique that is often used to identify near-duplicate documents in large text collections. This package offers a JavaScript implementation of the minhash algorithm and an efficient [Locality Sensitive Hashing Index](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) for finding similar minhashes in Node.js or web applications.

## Installation

@@ -18,6 +25,8 @@ If you prefer, you can instead load the package directly in a browser:
#### Minhash Usage
Minhashes are hash representations of the contents within a set. The following example minhashes and then estimates the [Jaccard similarity](https://en.wikipedia.org/wiki/Jaccard_index) between two sets:
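For reference, the quantity being estimated is the exact Jaccard similarity, |A ∩ B| / |A ∪ B|. A minimal sketch of computing it directly in plain JavaScript is shown here; the `exactJaccard` helper and the two small word lists are illustrative assumptions and are not part of minhash.js:

```javascript
// Hypothetical helper, not part of minhash.js: exact Jaccard similarity
// of two arrays treated as sets, i.e. |A ∩ B| / |A ∪ B|.
function exactJaccard(a, b) {
  var setA = new Set(a);
  var setB = new Set(b);
  var shared = 0;
  // Count the items that appear in both sets.
  setA.forEach(function(item) {
    if (setB.has(item)) shared += 1;
  });
  var unionSize = setA.size + setB.size - shared;
  // Two empty sets are treated as identical here (an arbitrary convention).
  return unionSize === 0 ? 1 : shared / unionSize;
}

// Illustrative word lists (assumed for this sketch).
var a = ['minhash', 'is', 'a', 'data', 'structure'];
var b = ['minhash', 'is', 'a', 'probabilistic', 'structure'];
console.log(exactJaccard(a, b)); // 4 shared words / 6 distinct words, about 0.667
```

The exact computation needs both full sets in memory, whereas the package example that follows estimates the same value from compact minhash signatures: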
```javascript
import { Minhash } from 'minhash'; // If using Node.js
@@ -26,9 +35,11 @@ var s1 = ['minhash', 'is', 'a', 'probabilistic', 'data', 'structure', 'for',
var s2 = ['minhash', 'is', 'a', 'probability', 'data', 'structure', 'for',
'estimating', 'the', 'similarity', 'between', 'documents'];
// create a hash for each set of words to compare
var m1 = new Minhash();
var m2 = new Minhash();
// update each hash
s1.map(function(w) { m1.update(w) });
s2.map(function(w) { m2.update(w) });
@@ -38,6 +49,10 @@ m1.jaccard(m2);
#### LshIndex Usage
While one can compare the Jaccard similarity between a minhash and all others in a collection, the complexity of doing so is O(n), as one needs to compare the query set to every other set.
To estimate the results of the same comparison in sub-linear time, one can instead build a [Locality Sensitive Hash Index](http://infolab.stanford.edu/~ullman/mmds/ch3.pdf), which maps hash sequences from a minhash signature to the list of document identifiers that contain the given hash sequence. Using this indexing technique, one can effectively find sets similar to a query set:
```javascript
import { Minhash, LshIndex } from 'minhash'; // If using Node.js
@@ -53,12 +68,12 @@ var m1 = new Minhash();
var m2 = new Minhash();
var m3 = new Minhash();
// write a function that updates each hash
// update each hash
s1.map(function(w) { m1.update(w) });
s2.map(function(w) { m2.update(w) });
s3.map(function(w) { m3.update(w) });
// add each document to an LSH index
// add each document to a Locality Sensitive Hashing index
var index = new LshIndex();
index.insert('m1', m1);
index.insert('m2', m2);
@@ -71,12 +86,13 @@ console.log('Jaccard similarity >= 0.5 to m1:', matches);
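The indexing idea described above can be sketched without the package. The following is a minimal illustration of banding, assuming a signature is a plain array of integers; `BandIndex`, its methods, and the sample signatures are hypothetical and do not reflect how `LshIndex` is actually implemented:

```javascript
// Hypothetical sketch of LSH banding; NOT the package's LshIndex.
function BandIndex(bandSize) {
  this.bandSize = bandSize;  // number of signature values per band
  this.buckets = new Map();  // band key -> array of document ids
}

// Split a signature into consecutive bands and build one string key per band.
// Trailing values that do not fill a whole band are ignored.
BandIndex.prototype.keys = function(signature) {
  var keys = [];
  for (var i = 0; i + this.bandSize <= signature.length; i += this.bandSize) {
    var band = signature.slice(i, i + this.bandSize);
    // Prefix with the band position so equal values in different bands stay apart.
    keys.push((i / this.bandSize) + ':' + band.join('.'));
  }
  return keys;
};

// Record the document id under every band key of its signature.
BandIndex.prototype.insert = function(id, signature) {
  var buckets = this.buckets;
  this.keys(signature).forEach(function(key) {
    if (!buckets.has(key)) buckets.set(key, []);
    buckets.get(key).push(id);
  });
};

// Return the ids that share at least one band with the query signature.
BandIndex.prototype.query = function(signature) {
  var buckets = this.buckets;
  var found = new Set();
  this.keys(signature).forEach(function(key) {
    (buckets.get(key) || []).forEach(function(id) { found.add(id); });
  });
  return Array.from(found);
};

// Made-up signatures for illustration only.
var sigA = [1, 2, 3, 4, 5, 6];
var sigB = [1, 2, 9, 9, 5, 6];
var sigC = [7, 7, 8, 8, 9, 9];

var index = new BandIndex(2);
index.insert('a', sigA);
index.insert('b', sigB);
index.insert('c', sigC);
console.log(index.query(sigA)); // [ 'a', 'b' ]
```

Signatures that share at least one identical band land in the same bucket, so a query only inspects those buckets rather than every stored signature. Candidates found this way are approximate and would normally be checked afterwards with a similarity comparison such as `m1.jaccard(m2)`.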
### Example
To execute the sample Node.js script, you can run `node examples/index.js`.
The [sample application](https://duhaime.github.io/minhash/) uses minhash.js to compute the similarity between several [sample documents](https://duhaime.github.io/minhash/texts):
### Development
![app preview](https://github.com/duhaime/nmf/tree/master/images/preview.png?raw=true)
To run the development utilities, you'll need to install the dependencies: `npm install`.
There is also a sample Node.js script that can be run with `node examples/index.js`.
To run the test suite, run `npm run test`.
### Development
To compile and minify minhash.min.js, run `npm run build`.
To run the test suite — `npm run test`.
To compile and minify minhash.min.js — `npm run build`.
Binary file added images/preview.png
