-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathindex.html
More file actions
439 lines (370 loc) · 18.3 KB
/
index.html
File metadata and controls
439 lines (370 loc) · 18.3 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Text2Label: A Quick and Dirty Client-Side Text "Classifier"</title>
<script src="https://code.jquery.com/jquery-1.11.1.min.js"></script>
<script src="https://d3js.org/d3-dsv.v1.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/axios/0.16.2/axios.min.js"></script>
</head>
<body style="margin:0;padding:0;">
<div id="msg" style="position:absolute;top:0;z-index:100;width:100%;margin:0;">
<div style="margin:0 auto;max-width:400px;text-align:center;background:yellow;padding:15px;">
<b> Loading... This could take a moment or three. ;) </b>
</div>
</div>
<div style="margin:20px auto 100px auto;max-width:900px;">
<center>
<h1>Text2Label: A Quick and Dirty Client-Side Text "Classifier"</h1>
By <a href="https://twitter.com/Colarusso" target=_blank>@colarusso</a>
</center>
</br>
<hr>
<p>
This is not a true text classifier, hence the quotes. It is a JavaScript-only hack that mostly does the job. I needed a solution that would run on something like GitHub Pages (where you only can do client-side scripting). Anyhow, you start with a two-column CSV file. The first column contains example texts, and the second column contains labels for each text item. Below, I've used single words, but it can support sentences. Using these we build vectors based on the texts and associate them with their labels. Instead of using this as training data for an honest-to-goodness classifier, we just match the vectors of novel texts with existing examples, looking for the closest match.
</p>
<h2>Build Vectors</h2>
<p>Did you notice how this page took a while to load? Well that's because it loaded a bunch of word vectors (used to vectorize your text). You may be able to get away with fewer words. So feel free to play around with the selection below. If you change this selection, however, you'll need to rebuild your vectors to see the results. Also, this page loaded almost 115 MB of word vectors. You'll be using only a subset of these regardless of what option you choose, no more than about 60%. Note: I originally got these vectors from <a href="https://github.com/turbomaze/word2vecjson" target=_blank>word2vecjson</a>.</p>
<select id="vcount" disabled=TRUE>
<option value=25000>25,000 word vectors (~70 MB)</option>
<option value=10000 SELECTED>10,000 word vectors (~28 MB)</option>
<option value=5000>5,000 word vectors (~14 MB)</option>
<option value=1000>1,000 word vectors (~3 MB)</option>
</select>
<p>
That being said, if you have words that aren't found in the list of word vectors, we'll make a placeholder vector for them. That is, we'll take the 300 dimensions of a word vector and set all of them to zero. The first unknown word will have the first dimension set to one, the second will have the second dimension set to one, and so on through 300. We do the inverse for the next 300 unknown words, with all but a single dimension being one. This doesn't capture semantic info about these words, but it makes sure they are considered during matching. We added this feature mostly to catch terms of art and proper names.
</p>
<p>
You can type your "training data" into the textarea below, or click "Choose File" to upload a CSV. After that, click "Build Vectors," and it will construct a set of vectors based on your texts and associate them with their labels. If you change this content, you'll need to rebuild your vectors to see the results.
</p>
<p>
<textarea id="markup" name="markup" style="width:100%;height:200px;border: solid 1px #aaa;"wrap="off" disabled=TRUE>mammal, animal
mouse, animal
insect, animal
fish, animal
bird, animal
plant, vegetable
tree, vegetable
grass, vegetable
weed, vegetable
flower, vegetable
apple, vegetable
stuff, mineral
rock, mineral
sand, mineral
block, mineral
table, mineral
cup, mineral
house, mineral
toy, mineral
machine, mineral
keyboard, mineral
rocket, mineral
jewelry, mineral
smoke, mineral</textarea>
<input id="upload" type="file" disabled=TRUE/>
</p>
<p>
<input type="button" id="build" value="Build Vectors" style="width:100%" onclick="make_vectors('markup');" disabled=TRUE/>
</p>
<p>
<div id="thinking"></div>
<div id="notes"></div>
</p>
<h2>Data Collection (Optional)</h2>
<p>It might be nice to see what text people are trying to classify, along with the answers they are getting, but since we're trying to run this on a server without server-side scripting or a database, we need to get creative. So I added this <a href="https://airtable.com/" target=_blank>AirTable</a> hack. Basically, you create an AirTable and use there API to add a row for every user text. Note: your table must contain two columns, one named "text" and one named "label". If you leave any of the following three fields blank, this functionality will not be added to your code. If you want to use AirTable in this way, just add the relevant info. Keep in mind these will all be visible to your end users should they inspect the code. So don't use credentials that you don't want out in public.</p>
<a href="https://github.com/colarusso/flashcards#your-api-key" target=_blank>API Key</a>: <input id="api_key" style="width:150px;" disabled=TRUE value=""/>
<a href="https://github.com/colarusso/flashcards#your-base-id" target=_blank>Base ID</a>: <input id="base_id" style="width:150px;" disabled=TRUE value=""/>
<a href="https://github.com/colarusso/flashcards#your-table-name" target=_blank>Table Name</a>: <input id="table_name" style="width:100px;" disabled=TRUE value=""/>
<p>
Of course, the idea here is that you can take this data, remove those rows that were right, correct those that were wrong, and add these fixed data to your "training" set. You can then rebuild your vectors and fill in these blind spots.
</p>
<h2>Test "Understanding"</h2>
<p>
Test things out and see if you need more examples. Once you're happy, download the two JavaScript files below, and use their functions to find the best match. Again, to find this match, the files vectorize some string of text and try to find the most similar item in the list of vectors you built above. It does this using cosine similarity.
</p>
<p>
<input id="test" name="test" style="width:400px;" onkeypress="if (event.keyCode==13){test_understanding()}" disabled=TRUE/> <input id="test_button" type="button" value='Test "Understanding"' onclick="test_understanding()" disabled=TRUE/>
</p>
<p>
<div id="answers"></div>
</p>
<h2>Download Your Custom Code</h2>
<p>Note: Not all browsers support the download feature. You may want to use <a href="https://www.google.com/chrome/" target=_blank>Chrome</a>. This is only an issue for the download feature on this page, not the code you are downloading. It should work across browsers.</p>
<p>
<input id="file1" type="button" value='Download word2vec.js File' onclick="download_word_vectors()" disabled=TRUE/>
<input id="file2" type="button" value='Download text2label.js File' onclick="download_question_vectors()" disabled=TRUE/>
</p>
<h2>Demo Code</h2>
<p>
To use your vectors and run a match like that above, you'll need both the "word2vec.js" and "text2label.js" files. For an example of how to use these, look at the code below. Then check it out <a href="demo/" target=_blank>in action</a>. FYI, the demo uses vectors built on the default Animal, Vegetable, or Mineral data you saw when this page first loaded. Also, you may want to move the src calls to the end of the page so they don't hold up loading.
</p>
<pre style="background:black;color:white;padding:5px;overflow-x:scroll;"><code><html>
<head>
<span style="color:red"><!-- Include the following line only if you're writing data to AitTable -->
<script src="https://cdnjs.cloudflare.com/ajax/libs/axios/0.16.2/axios.min.js"></script> </span>
<script src="js/word2vec.js"></script>
<script src="js/text2label.js"></script>
<script>
function test_understanding(string) {
// This function makes use of text2lable to find the best answer
answers = getNClosestAnswer(1, vectorize(string))
// getNClosestAnswer allows for the return of multiple labels
// here we've limited it to one. Additionally, we're filtering by
// QLabels to apply consistent labels. To allow for multiple instances
// of the same labels we append a #n to the label. This removes that.
document.getElementById('answer').innerHTML = QLabels[answers[0][0]]
<span style="color:red">// Include the following line only if you're writing data to AitTable
write_to_table(string,QLabels[answers[0][0]])</span>
}
</script>
</head>
<body>
<input id="test" style="width:300px;" onkeypress="if (event.keyCode==13){test_understanding('test')}"/>
<input type="button" value='Get label' onclick="test_understanding(document.getElementById('test').value)"/>
<p id="answer"></p>
</body>
</html>
</code></pre>
<hr>
<font size=-1><a href="https://github.com/colarusso/text2label" target=_blank>GitHub Repo</a></font>
<textarea id="code" name="code" style="display:none;" wrap="off">
/*-----------------------------------------------------------------
FUNCTIONS FROM https://github.com/turbomaze/word2vecjson
-----------------------------------------------------------------*/
function getCosSim(f1, f2) {
return Math.abs(f1.reduce(function(sum, a, idx) {
return sum + a*f2[idx];
}, 0)/(mag(f1)*mag(f2))); //magnitude is 1 for all feature vectors
}
function mag(a) {
return Math.sqrt(a.reduce(function(sum, val) {
return sum + val*val;
}, 0));
}
function norm(a) {
var a_mag = mag(a);
return a.map(function(val) {
return val/a_mag;
});
}
/*-----------------------------------------------------------------
NEW FUNCTIONS
-----------------------------------------------------------------*/
var QVecs;
var QLabels;
function vectorize(string){
// get a list of words
words = string.replace(/[^a-zA-Z\s]+/g,'').replace(/\s{2,}/g,' ').toLowerCase().trim().split(" ");
var vector = [];
var usable_words = 0;
var found = []
var notfound = []
// get the vector for each word
for (i = 0, len = words.length, text = ""; i < len; i++) {
// only do this if the word is in our list
if (wordVecs.hasOwnProperty(words[i])) {
found.push(words[i]);
//console.log(wordVecs[words[i]]);
if (usable_words==0) {
vector = wordVecs[words[i]]
} else {
// average the word vectors
for (j = 0; j < 300; j++) {
vector[j] = vector[j] + wordVecs[words[i]][j];
}
}
usable_words += 1
} else {
notfound.push(words[i]);
}
}
for (j = 0; j < 300; j++) {
vector[j] = vector[j]/usable_words;
}
// return a best-guess vector for the string
return norm(vector);
}
function getNClosestAnswer(n, vec) {
var sims = [];
for (var ans in QVecs) {
var sim = getCosSim(vec, QVecs[ans]);
sims.push([ans, sim]);
}
sims.sort(function(a, b) {
return b[1] - a[1];
});
return sims.slice(0, n);
}
</textarea>
<textarea id="code_w_airtable" name="code_w_airtable" style="display:none;" wrap="off">
function write_to_table(text,label) {
apicall = "https://api.airtable.com/v0/"+base_id+"/"+table_name+"/";
axios.post(
apicall,
{"fields":{"text":text,"label":label}},
{headers: { Authorization: "Bearer "+ api_key}}
).then(function(response){
console.log("Wrote to AirTable")
}).catch(function(error){
console.log(error)
alert("There was a problem writing to your Airtable ("+error+"). Check your credenials et al.");
})
}
</textarea>
</div>
</body>
<script src="assets/js/text2label.js"></script>
<script type="text/javascript">
// h/t http://answers.splunk.com/answers/125819/fill-textarea-from-a-file.html
//External data file handling starts here
var control = document.getElementById("upload");
control.addEventListener("change", function(event){
if (window.File && window.FileReader && window.FileList && window.Blob) {
var reader = new FileReader();
reader.onload = function(event){
var contents = event.target.result;
document.getElementById('markup').value = contents;
};
reader.onerror = function(event){
console.error("File could not be read! Code " + event.target.error.code);
};
console.log("Filename: " + control.files[0].name);
reader.readAsText(control.files[0]);
} else {
alert('This feature is not supported by your browser.');
}
}, false);
// h/t http://runnable.com/U5HC9xtufQpsu5aj/use-javascript-to-save-textarea-as-a-txt-file
function saveTextAsFile(textToWrite,name)
{
// I'm using file system support as a proxy for support for this feature.
// Check based on one found at: http://blog.teamtreehouse.com/building-an-html5-text-editor-with-the-filesystem-apis
// Handle vendor prefixes.
window.requestFileSystem = window.requestFileSystem ||
window.webkitRequestFileSystem;
// Check for support.
if (window.requestFileSystem) {
// create a new Blob (html5 magic) that conatins the data from your form feild
var textFileAsBlob = new Blob([textToWrite], {type:'text/plain'});
// Specify the name of the file to be saved
var fileNameToSaveAs = name;
// Optionally allow the user to choose a file name by providing
// an imput field in the HTML and using the collected data here
// var fileNameToSaveAs = txtFileName.text;
// create a link for our script to 'click'
var downloadLink = document.createElement("a");
// supply the name of the file (from the var above).
// you could create the name here but using a var
// allows more flexability later.
downloadLink.download = fileNameToSaveAs;
// provide text for the link. This will be hidden so you
// can actually use anything you want.
downloadLink.innerHTML = "My Hidden Link";
// allow our code to work in webkit & Gecko based browsers
// without the need for a if / else block.
window.URL = window.URL || window.webkitURL;
// Create the link Object.
downloadLink.href = window.URL.createObjectURL(textFileAsBlob);
// when link is clicked call a function to remove it from
// the DOM in case user wants to save a second file.
downloadLink.onclick = destroyClickedElement;
// make sure the link is hidden.
downloadLink.style.display = "none";
// add the link to the DOM
document.body.appendChild(downloadLink);
// click the new link
downloadLink.click();
} else {
alert('This feature is not supported by your browser. You could, however, cut-and-paste your text into an editor to save it.');
}
}
function destroyClickedElement(event)
{
// remove the link from the DOM
document.body.removeChild(event.target);
}
// h/t https://www.bennadel.com/blog/1504-ask-ben-parsing-csv-strings-with-javascript-exec-regular-expression-command.htm
// This will parse a delimited string into an array of
// arrays. The default delimiter is the comma, but this
// can be overriden in the second argument.
function CSVToArray( strData, strDelimiter ){
// Check to see if the delimiter is defined. If not,
// then default to comma.
strDelimiter = (strDelimiter || ",");
// Create a regular expression to parse the CSV values.
var objPattern = new RegExp(
(
// Delimiters.
"(\\" + strDelimiter + "|\\r?\\n|\\r|^)" +
// Quoted fields.
"(?:\"([^\"]*(?:\"\"[^\"]*)*)\"|" +
// Standard fields.
"([^\"\\" + strDelimiter + "\\r\\n]*))"
),
"gi"
);
// Create an array to hold our data. Give the array
// a default empty first row.
var arrData = [[]];
// Create an array to hold our individual pattern
// matching groups.
var arrMatches = null;
// Keep looping over the regular expression matches
// until we can no longer find a match.
while (arrMatches = objPattern.exec( strData )){
// Get the delimiter that was found.
var strMatchedDelimiter = arrMatches[ 1 ];
// Check to see if the given delimiter has a length
// (is not the start of string) and if it matches
// field delimiter. If id does not, then we know
// that this delimiter is a row delimiter.
if (
strMatchedDelimiter.length &&
(strMatchedDelimiter != strDelimiter)
){
// Since we have reached a new row of data,
// add an empty row to our data array.
arrData.push( [] );
}
// Now that we have our delimiter out of the way,
// let's check to see which kind of value we
// captured (quoted or unquoted).
if (arrMatches[ 2 ]){
// We found a quoted value. When we capture
// this value, unescape any double quotes.
var strMatchedValue = arrMatches[ 2 ].replace(
new RegExp( "\"\"", "g" ),
"\""
);
} else {
// We found a non-quoted value.
var strMatchedValue = arrMatches[ 3 ];
}
// Now that we have our value string, let's add
// it to the data array.
arrData[ arrData.length - 1 ].push( strMatchedValue );
}
// Return the parsed data.
return( arrData );
}
</script>
<script src="data/wordvecs25000.js"></script>
<script src="data/wordvecs10000.js"></script>
<script src="data/wordvecs5000.js"></script>
<script src="data/wordvecs1000.js"></script>
<script>
$( "#vcount" ).prop( "disabled", false );
$( "#markup" ).prop( "disabled", false );
$( "#upload" ).prop( "disabled", false );
$( "#build" ).prop( "disabled", false );
$( "#test" ).prop( "disabled", false );
$( "#test_button" ).prop( "disabled", false );
$( "#api_key" ).prop( "disabled", false );
$( "#base_id" ).prop( "disabled", false );
$( "#table_name" ).prop( "disabled", false );
$( "#file1" ).prop( "disabled", false );
$( "#file2" ).prop( "disabled", false );
$('#msg').hide();
</script>
</html>