ML-CGS¶
You can see that Online CGS is also a hybrid algorithm: it infers the topic indicators z at each token of an individual document by Gibbs sampling, and then forms approximate sufficient statistics to update the global variable \(\lambda\). Borrowing the idea of ML-FW and ML-OPE, ML-CGS estimates the topics \(\beta\) directly instead of \(\lambda\)
First, ML-CGS estimates \(\theta\) for each document from the S saved samples of topic indicators \({z}^{1,2,...,S}\) in the mini-batch. Writing \(n_{dk}^{(s)}\) for the number of tokens in document d assigned to topic k in sample s, the standard CGS estimate is
\[\hat{\theta}_{dk} \propto \alpha + \frac{1}{S}\sum_{s=1}^{S} n_{dk}^{(s)}\]
Then a sufficient statistic \(\hat{\beta}\) is formed from the same samples: \(\hat{\beta}_{kw}\) is proportional to the number of times term w is assigned to topic k across the mini-batch. Finally, the topics are updated with the step-size \(\rho_t\), following the same online scheme as ML-FW and ML-OPE:
\[\beta := (1 - \rho_t)\,\beta + \rho_t\,\hat{\beta}\]
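The two estimation steps above can be sketched in NumPy. Everything below is illustrative pseudocode under assumed data layouts, not the library's implementation: `mlcgs_minibatch_update` is a hypothetical helper, `beta` is a (K, V) topic-term matrix, and each document is an array of token ids.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlcgs_minibatch_update(beta, docs, alpha=0.01, burn_in=25, samples=25, rho_t=0.1):
    """One illustrative ML-CGS step on a mini-batch (a sketch, not the library code)."""
    K, V = beta.shape
    beta_hat = np.zeros((K, V))                # sufficient statistics for the topics
    for tokens in docs:
        z = rng.integers(K, size=len(tokens))  # random initial topic indicators
        n_k = np.bincount(z, minlength=K).astype(float)  # per-topic counts in this doc
        for sweep in range(burn_in + samples):
            for j, w in enumerate(tokens):
                n_k[z[j]] -= 1                       # remove token j from its topic
                p = (n_k + alpha) * beta[:, w]       # collapsed Gibbs conditional
                z[j] = rng.choice(K, p=p / p.sum())  # resample its topic indicator
                n_k[z[j]] += 1
            if sweep >= burn_in:               # keep only post-burn-in samples
                for j, w in enumerate(tokens):
                    beta_hat[z[j], w] += 1.0
    # normalize the statistics row-wise and blend them with the old topics
    row_sums = np.maximum(beta_hat.sum(axis=1, keepdims=True), 1e-100)
    return (1.0 - rho_t) * beta + rho_t * beta_hat / row_sums
```

The per-token resampling loop is the Gibbs-sampling E-step; the final line is the step-size-weighted update of \(\beta\).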
class tmlib.lda.MLCGS¶
tmlib.lda.MLCGS(data=None, num_topics=100, alpha=0.01, eta=0.01, tau0=1.0, kappa=0.9, burn_in=25, samples=25, lda_model=None)
Parameters¶
data: object
DataSet object used for loading mini-batches of data to analyze
num_topics: int, default: 100
number of topics of the model.
alpha: float, default: 0.01
hyperparameter of the LDA model that affects the sparsity of topic proportions \(\theta\)
eta (\(\eta\)): float, default: 0.01
hyperparameter of the LDA model that affects the sparsity of topics \(\beta\)
tau0 (\(\tau_{0}\)): float, default: 1.0
In the step that updates \(\lambda\), the algorithm uses a step-size \(\rho\) (similar to the learning rate in gradient-descent optimization). The step-size changes after each training iteration t:
\[\rho_t = (t + \tau_0)^{-\kappa}\]Here, the delay tau0 (\(\tau_{0}\)) \(\ge\) 0 down-weights early iterations
kappa (\(\kappa\)): float, default: 0.9
kappa (\(\kappa\)) \(\in\) (0.5, 1] is the forgetting rate, which controls how quickly old information is forgotten
burn_in: int, default: 25
The topic indicator at each token in an individual document is sampled many times, but the samples from the first several iterations are discarded. The parameter burn_in is the number of initial iterations whose samples are discarded
samples: int, default: 25
After the burn-in sweeps, the sampled topic indicators are saved, yielding S samples \({z}^{1,...,S}\) (by default, S = 25)
lda_model: object of class LdaModel
If this is None, a new object of class LdaModel will be created. Otherwise, it is the model learned previously
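The interplay of tau0 and kappa can be seen by computing the step-size schedule directly. This is a minimal sketch of the formula above, not library code:

```python
def step_size(t, tau0=1.0, kappa=0.9):
    """rho_t = (t + tau0)^(-kappa): the weight given to the new mini-batch statistics."""
    return (t + tau0) ** (-kappa)

# a larger tau0 down-weights early iterations; kappa controls how fast rho decays
schedule = [round(step_size(t), 4) for t in range(5)]
```

Because \(\kappa > 0.5\), the step sizes shrink fast enough for the online updates to converge, while \(\kappa \le 1\) keeps them from vanishing too quickly.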
Attributes¶
num_terms: int,
size of the vocabulary set of the training corpus
num_topics: int,
alpha: float,
eta (\(\eta\)): float,
tau0 (\(\tau_{0}\)): float,
kappa (\(\kappa\)): float,
burn_in: int,
samples: int,
lda_model: object of class LdaModel
Methods¶
__init__ (data=None, num_topics=100, alpha=0.01, eta=0.01, tau0=1.0, kappa=0.9, burn_in=25, samples=25, lda_model=None)
static_online (wordtks, lengths)
Execute the learning algorithm, including inference for each individual document and the update of \(\lambda\). The two parameters wordtks and lengths represent the term-sequence data of a mini-batch; they are the values of the two attributes word_ids_tks and cts_lens in class Corpus
Return: tuple (time of E-step, time of M-step, statistic_theta). statistic_theta is a statistic estimated from the sampled topic indicators \({z}^{1,...,S}\). It plays a role similar to that of \(\gamma\) in VB
learn_model (save_model_every=0, compute_sparsity_every=0, save_statistic=False, save_top_words_every=0, num_top_words=10, model_folder=None, save_topic_proportions=None)
This method is used to learn the model and to save the model and its statistics.
Parameters:
- save_model_every: int, default: 0. If it is set to 2, the model will be saved to a file at iterations 0, 2, 4, 6, …. With the default value, the model won't be saved.
- compute_sparsity_every: int, default: 0. Compute sparsity and store it in the attribute statistics. The word "every" here has the same meaning as in save_model_every
- save_statistic: boolean, default: False. Whether to save statistics. The statistics here are the time of the E-step, the time of the M-step, and the sparsity of documents in the corpus
- save_top_words_every: int, default: 0. Used for saving the top words of each topic (those with highest probability). The number of words saved is given by the num_top_words parameter.
- num_top_words: int, default: 10. By default, the number of words displayed is 10.
- model_folder: string, default: None. The folder in which the model file and statistics files are saved.
- save_topic_proportions: string, default: None. This is used to save the topic proportions \(\theta\) of each document in the training corpus. Its value is the path of a .h5 file
Return: the learned model (object of class LdaModel)
infer_new_docs (new_corpus)
This method is used to do inference for new documents. new_corpus is an object of class Corpus. It returns a statistic used for estimating the topic proportions \(\theta\)
Example¶
from tmlib.lda import MLCGS
from tmlib.datasets import DataSet

# data preparation
data = DataSet(data_path='data/ap_train_raw.txt', batch_size=100, passes=5, shuffle_every=2)
# learning and saving the model and statistics in folder 'models-ml-cgs'
ml_cgs = MLCGS(data=data, num_topics=20, alpha=0.2)
model = ml_cgs.learn_model(save_model_every=1, compute_sparsity_every=1, save_statistic=True, save_top_words_every=1, num_top_words=10, model_folder='models-ml-cgs')
# inference for new documents
vocab_file = data.vocab_file
# create object ``Corpus`` to store new documents
new_corpus = data.load_new_documents('data/ap_infer_raw.txt', vocab_file=vocab_file)
statistic_theta = ml_cgs.infer_new_docs(new_corpus)
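Because statistic_theta plays a role similar to \(\gamma\) in VB, a point estimate of the per-document topic proportions can be obtained by normalizing each row. This sketch assumes statistic_theta is a (num_documents, num_topics) nonnegative array, which is an assumption about its layout rather than documented API behavior:

```python
import numpy as np

# hypothetical stand-in for the array returned by infer_new_docs
statistic_theta = np.array([[2.0, 1.0, 1.0],
                            [0.5, 0.5, 3.0]])

# normalize each row so the topic proportions of every document sum to 1
theta = statistic_theta / statistic_theta.sum(axis=1, keepdims=True)
```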