streamLDA/documentation.txt at master · bjzu/streamLDA · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57

=== Algorithm Overview ===

The primary method that the user will call is the update_lambda() method.

olda = onlineLDA(params) (`params' are the various user-specified params)
olda.update_lambda(new_docs)
|
|___self.do_e_step(new_docs)
|   |___ parse_doc_list(new_docs)
|        * calculates the word counts for each word for each document.
|        * expands the dimensions as necessary for new vocabulary terms
|        * returns wordids (wordids[i][j] is the jth unique token in doc i)
|        * returns wordcts (wordcts[i][j] is the frequency of token j in doc i)
|
|   * calculates gamma and phi for this batch of new_docs (in keeping with the `online'
|     algorithm, gamma and phi are not a function of previous batches)
|   * returns gamma
|   * returns something called sstats which is the second term in the lambda value:
|     \sum_d n_{dw} * phi_{dwk}
|
|___self.approx_bound(docs, gamma)
|   * calculates the likelihood of the new docs given the old value (from the
|     previous round) of lambda. this is sort of like saying, if we stuck with
|     these values, how good a predictor of the data would they be?
|     essentially, what is the quality of the classifier at this stage?
|   * returns bound
|
|___updates self._lambda
|___updates self._Elogbeta (the expectation of log(Beta)
|___updates self._expElogbeta
|   note that this Beta term, which is used in calculating phi, is updated here because
|   it is a model-level parameter, whereas the theta term of the expression for phi is
|   calculated for each document for each topic, in the do_e_step() function.
|___updates self._updatct (the number of batches seen so far; e.g. the number
    of times update_lambda has been called.

=== Initialization ===

Phi
Gamma
Lambda
Vocabulary


=== Model Parameters ===

=== Classification ===


=== Unseen Vocabulary Terms ===

In a fixed vocabulary problem, we can determine a set of word counts a priori.

In an online setting with a fixed vocabulary, it is also possible to see a
"new" word in the sense that it might be in the vocabulary but not have been
seen yet.