Online CVB0

CVB0 [2] is an improved version of CVB (Collapsed Variational Bayes) [1]. Like VB, it applies variational inference to estimate the latent variables, but instead of estimating both \(\theta\) and z, CVB0 estimates only the topic assignments z. The distribution \(P(z, \theta, \beta \mid C)\) is approximated as follows:

\[Q(z, \theta, \beta) = Q(\theta, \beta \mid z, \gamma, \lambda) \prod_{d \in C} Q(z_d \mid \phi_d)\]

Online CVB0 [3] is an online version of CVB0. For each mini-batch it infers the variational parameter \(\phi\) and then updates the global variable with a stochastic algorithm. The global variable here is the topic statistic \(N^{\beta}\), updated from \(\phi\); it plays a role similar to that of \(\lambda\) in VB, and the topics can be estimated from it. To estimate the topic proportions, Online CVB0 also maintains a document statistic \(N^{\theta}\), which is likewise updated from \(\phi\) in a stochastic way.
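
For concreteness, the updates of [3] have roughly the following form in this page's notation (a sketch that omits CVB0's leave-one-out corrections; V is the vocabulary size, \(\ell_d\) the length of document d, \(\hat{N}^{\beta}\) the mini-batch estimate of the topic statistic, and \(\rho^{\theta}_t\), \(\rho^{\beta}_t\) the step sizes described under tau_phi below):

\[\phi_{dwk} \propto (N^{\theta}_{dk} + \alpha)\,\frac{N^{\beta}_{wk} + \eta}{\sum_{w'} N^{\beta}_{w'k} + V\eta}\]

\[N^{\theta}_{d} \leftarrow (1 - \rho^{\theta}_t)\,N^{\theta}_{d} + \rho^{\theta}_t\,\ell_d\,\phi_{dw}, \qquad N^{\beta} \leftarrow (1 - \rho^{\beta}_t)\,N^{\beta} + \rho^{\beta}_t\,\hat{N}^{\beta}\]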

class OnlineCVB0

tmlib.lda.OnlineCVB0(data=None, num_topics=100, alpha=0.01, eta=0.01, tau_phi=1.0, kappa_phi=0.9, s_phi=1.0, tau_theta=10.0, kappa_theta=0.9, s_theta=1.0, burn_in=25, lda_model=None)

Parameters

  • data: object DataSet

    object used for loading mini-batches of data to analyze

  • num_topics: int, default: 100

    number of topics in the model.

  • alpha: float, default: 0.01

    hyperparameter of the LDA model that affects the sparsity of the topic proportions \(\theta\)

  • eta (\(\eta\)): float, default: 0.01

    hyperparameter of the LDA model that affects the sparsity of the topics \(\beta\)

  • tau_phi : float, default: 1.0

    The global-variable update step uses a step size \(\rho\), similar to the learning rate in gradient-descent optimization. The step size changes after each training iteration t:

    \[\rho_t = s * (t + \tau_0)^{-\kappa}\]

    The learning parameters s, \(\tau_0\) and \(\kappa\), together called the step-size schedule, can be set manually. They must satisfy the constraints \(\tau_0 \geq 0\) and \(\kappa \in (0.5, 1]\).

    The step-size schedule (s_phi, tau_phi, kappa_phi) is used to update \(N^{\beta}\), and (s_theta, tau_theta, kappa_theta) is used to update \(N^{\theta}\); a short sketch of the schedule follows this parameter list.

  • kappa_phi: float, default: 0.9

    decay rate \(\kappa\) in the step-size schedule used to update \(N^{\beta}\)

  • s_phi: float, default: 1.0

    scale s in the step-size schedule used to update \(N^{\beta}\)

  • tau_theta: float, default: 10.0

    delay \(\tau_0\) in the step-size schedule used to update \(N^{\theta}\)

  • kappa_theta: float, default: 0.9

    decay rate \(\kappa\) in the step-size schedule used to update \(N^{\theta}\)

  • s_theta: float, default: 1.0

    scale s in the step-size schedule used to update \(N^{\theta}\)

  • burn_in: int, default: 25

    Online CVB0 performs a small number of extra passes per document to learn the document statistics before updating the topic statistics. The burn_in parameter is the number of these extra passes.

  • lda_model: object of class LdaModel, default: None.

    If None, a new LdaModel object will be created; otherwise, training continues from the previously learned model
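
As a minimal illustration of the step-size schedule described under tau_phi (a sketch only; step_size is a hypothetical helper, not part of tmlib):

def step_size(t, s=1.0, tau0=1.0, kappa=0.9):
    """Step size rho_t = s * (t + tau0)**(-kappa) at training iteration t."""
    # constraints from the text: tau0 >= 0 and kappa in (0.5, 1]
    assert tau0 >= 0 and 0.5 < kappa <= 1
    return s * (t + tau0) ** (-kappa)

# with the default N^beta schedule (s_phi=1.0, tau_phi=1.0, kappa_phi=0.9),
# the step size starts at 1.0 and decays toward 0 as t grows
print([round(step_size(t), 3) for t in range(5)])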

Attributes

  • data: object DataSet

  • num_terms: int,

    size of the vocabulary set of the training corpus

  • num_topics: int,

  • alpha: float,

  • eta (\(\eta\)): float,

  • tau_phi: float,

  • kappa_phi: float,

  • s_phi: float,

  • tau_theta: float,

  • kappa_theta: float,

  • s_theta: float,

  • burn_in: int,

  • lda_model: object of class LdaModel

Methods

  • __init__ (data=None, num_topics=100, alpha=0.01, eta=0.01, tau_phi=1.0, kappa_phi=0.9, s_phi=1.0, tau_theta=10.0, kappa_theta=0.9, s_theta=1.0, burn_in=25, lda_model=None)

  • static_online (wordtks, lengths)

    Executes one step of Online CVB0 on a mini-batch. The two parameters wordtks and lengths hold the term-sequence data of the mini-batch; they are the values of the two attributes word_ids_tks and cts_lens of class Corpus. A usage sketch follows this method list.

    Return: tuple (time of E-step, time of M-step, N_theta).

  • learn_model (save_model_every=0, compute_sparsity_every=0, save_statistic=False, save_top_words_every=0, num_top_words=10, model_folder=None, save_topic_proportions=None)

    This method is used to learn the model and to save the model and its statistics.

    Parameters:

    • save_model_every: int, default: 0. If set to 2, the model is saved at iterations 0, 2, 4, 6, …. With the default value 0, the model is not saved.
    • compute_sparsity_every: int, default: 0. Compute sparsity and store it in the attribute statistics; "every" has the same meaning as in save_model_every.
    • save_statistic: boolean, default: False. Whether to save the statistics: the time of the E-step, the time of the M-step, and the sparsity of documents in the corpus.
    • save_top_words_every: int, default: 0. Used for saving the top words (highest probability) of each topic. The number of words saved is set by num_top_words.
    • num_top_words: int, default: 10. Number of top words per topic to save.
    • model_folder: string, default: None. The folder in which the model and statistics files are saved.
    • save_topic_proportions: string, default: None. Used to save the topic proportions \(\theta\) of each document in the training corpus; its value is the path of a .h5 file.

    Return: the learned model (object of class LdaModel)

  • infer_new_docs (new_corpus)

    Used to do inference for new documents; new_corpus is an object of class Corpus. Returns the document statistics \(N^{\theta}\) of the new corpus.
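
For example, a single training step with static_online could look like this (a sketch: onl_cvb0 is an OnlineCVB0 object as constructed in the example below, and minibatch is assumed to be a mini-batch object exposing the two Corpus attributes named above):

# run one Online CVB0 step on a mini-batch (hypothetical `minibatch` object)
e_time, m_time, N_theta = onl_cvb0.static_online(minibatch.word_ids_tks, minibatch.cts_lens)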

Example

from tmlib.lda import OnlineCVB0
from tmlib.datasets import DataSet

# data preparation
data = DataSet(data_path='data/ap_train_raw.txt', batch_size=100, passes=5, shuffle_every=2)
# learning and save the model, statistics in folder 'models-online-cvb0'
onl_cvb0 = OnlineCVB0(data=data, num_topics=20, alpha=0.2)
model = onl_cvb0.learn_model(save_model_every=1, compute_sparsity_every=1, save_statistic=True, save_top_words_every=1, model_folder='models-online-cvb0')


# inference for new documents
vocab_file = data.vocab_file
# create a Corpus object to store the new documents
new_corpus = data.load_new_documents('data/ap_infer_raw.txt', vocab_file=vocab_file)
N_theta = onl_cvb0.infer_new_docs(new_corpus)
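
infer_new_docs returns the raw document statistics rather than normalized proportions; one way to turn them into estimated topic proportions is to smooth with alpha and normalize each row (a sketch, assuming N_theta is a documents-by-topics numpy array):

import numpy as np

# theta[d, k] is proportional to N_theta[d, k] + alpha (sketch, not a tmlib API)
theta = np.asarray(N_theta) + onl_cvb0.alpha
theta = theta / theta.sum(axis=1, keepdims=True)
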
[1] Y. W. Teh, D. Newman, and M. Welling, “A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation,” in Advances in Neural Information Processing Systems, vol. 19, 2007, p. 1353.
[2] A. Asuncion, M. Welling, P. Smyth, and Y. W. Teh, “On smoothing and inference for topic models,” in Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, 2009, pp. 27–34.
[3] J. Foulds, L. Boyles, C. DuBois, P. Smyth, and M. Welling, “Stochastic collapsed variational Bayesian inference for latent Dirichlet allocation,” in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2013, pp. 446–454.