ML-FW

If you have read Online FW and Streaming FW, you can see that they are hybrid algorithms which combine FW inference with variational Bayes for estimating the posterior of the global variables. They have to maintain variational parameters (\(\lambda\)) for the Dirichlet distribution over topics, instead of the topics themselves. Nonetheless, the combination is not very natural, since we have to compute \(\phi\) (the variational parameter of the topic indicators z) from the topic proportions \(\theta\) in order to update the model. Such a conversion might incur some information loss.

It is more natural if we can use \(\theta\) directly in the update of the model at each mini-batch. To this end, we use an idea from [1]: instead of following the Bayesian approach of estimating a distribution over topics (P(\(\beta\) | \(\lambda\))), one can consider the topics as parameters and estimate them directly from data. That is, we can estimate \(\beta\) directly. This is the idea of ML-FW for learning LDA.
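To make this concrete, below is a minimal sketch (not the library's actual code) of how such a maximum-likelihood style online update of \(\beta\) could look; the function name, the blending rule with step-size \(\rho_t\), and the numerical floor are illustrative assumptions.

import numpy as np

def ml_update_topics(beta, wordids, wordcts, theta, rho_t):
    # Conceptual online ML update of the topics from one mini-batch.
    # beta:  current topics, shape (num_topics, num_terms)
    # theta: topic mixtures inferred by FW, shape (batch_size, num_topics)
    # rho_t: step-size at iteration t
    num_topics, num_terms = beta.shape
    # Accumulate expected word-topic counts directly from theta
    # (no conversion to the variational parameter phi is needed).
    stats = np.zeros((num_topics, num_terms))
    for d, (ids, cts) in enumerate(zip(wordids, wordcts)):
        stats[:, ids] += np.outer(theta[d], cts)
    # Normalize rows to get the mini-batch estimate of the topics ...
    unit = stats / np.maximum(stats.sum(axis=1, keepdims=True), 1e-100)
    # ... and blend it with the current topics using the step-size rho_t.
    return (1.0 - rho_t) * beta + rho_t * unit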

Another advantage of ML-FW is that it enables us to learn LDA from either large corpora or data streams (i.e., in both online and streaming environments).

For more details on ML-FW, see [2].

class tmlib.lda.MLFW

tmlib.lda.MLFW(data=None, num_topics=100, tau0=1.0, kappa=0.9, iter_infer=50, lda_model=None)

Parameters

  • data: object DataSet

Object used for loading mini-batches of data to analyze.

  • num_topics: int, default: 100

Number of topics of the model.

  • eta (\(\eta\)): float, default: 0.01

Hyperparameter of the LDA model that affects the sparsity of the topics \(\beta\).

  • tau0 (\(\tau_{0}\)): float, default: 1.0

In the step that updates the topics, a step-size \(\rho_t\) is used (it is similar to the learning rate in gradient descent optimization). The step-size changes after each training iteration t:

    \[\rho_t = (t + \tau_0)^{-\kappa}\]

Here, the delay tau0 (\(\tau_{0}\)) >= 0 down-weights early iterations (see the short illustration after this parameter list).

  • kappa (\(\kappa\)): float, default: 0.9

    kappa (\(\kappa\)) \(\in\) (0.5, 1] is the forgetting rate which controls how quickly old information is forgotten

  • iter_infer: int, default: 50.

Number of iterations of the FW algorithm in the inference step.

  • lda_model: object of class LdaModel.

If this is None, a new LdaModel object will be created. Otherwise, learning continues from the previously learned model.
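As a quick illustration of how tau0 and kappa interact, the snippet below (illustrative only) prints the step-size \(\rho_t\) at a few iterations for different settings, using the formula \(\rho_t = (t + \tau_0)^{-\kappa}\) given above.

# Print the step-size rho_t for a few (tau0, kappa) settings and iterations t.
for tau0, kappa in [(1.0, 0.9), (64.0, 0.9), (1.0, 0.6)]:
    rhos = [(t + tau0) ** (-kappa) for t in (1, 10, 100, 1000)]
    print(tau0, kappa, ['%.4f' % r for r in rhos])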

Attributes

  • num_docs: int,

    Number of documents in the corpus.

  • num_terms: int,

Size of the vocabulary set of the training corpus.

  • num_topics: int,

  • eta (\(\eta\)): float,

  • tau0 (\(\tau_{0}\)): float,

  • kappa (\(\kappa\)): float,

  • iter_infer: int,

Number of iterations of the FW algorithm in the inference step.

  • lda_model: object of class LdaModel

Methods

  • __init__ (data=None, num_topics=100, tau0=1.0, kappa=0.9, iter_infer=50, lda_model=None)

  • static_online (wordids, wordcts)

First does an E step on the mini-batch given in wordids and wordcts, then uses the result of that E step to update the topics in the M step.

    Parameters:

    • wordids: A list in which each element is an array of term indices corresponding to one document; each entry is the index (in the vocabulary) of a unique term appearing in that document. A small example of this format is given after the method list.
    • wordcts: A list in which each element is an array of term frequencies corresponding to one document; each entry says how many times the corresponding term in wordids appears in that document.

    Return: tuple (time of E-step, time of M-step, theta): time the E and M steps have taken and the list of topic mixtures of all documents in the mini-batch.

  • e_step (wordids, wordcts)

Does the E step.

Note that FW can provide a sparse solution (the topic mixture theta) when doing inference for each document. That is, theta has only a few non-zero elements, whose indices are stored in the list of lists ‘index’.

Return: tuple (theta, index): topic mixtures and the indices of their non-zero elements for all documents in the mini-batch.

  • m_step (wordids, wordcts, theta, index)

Does the M step, i.e., updates the topics using the result of the E step.

  • learn_model (save_model_every=0, compute_sparsity_every=0, save_statistic=False, save_top_words_every=0, num_top_words=10, model_folder=None, save_topic_proportions=None)

This method is used to learn the model and to save the model and its statistics.

    Parameters:

    • save_model_every: int, default: 0. If set to 2, the model is saved at iterations 0, 2, 4, 6, …. With the default value 0, the model is not saved.
    • compute_sparsity_every: int, default: 0. Compute the sparsity of document representations and store it in the statistics attribute. “Every” has the same meaning as in save_model_every.
    • save_statistic: boolean, default: False. Whether or not to save statistics. The statistics here are the time of the E step, the time of the M step, and the sparsity of documents in the corpus.
    • save_top_words_every: int, default: 0. Used for saving the top words (highest probability) of each topic. The number of words saved is given by the num_top_words parameter.
    • num_top_words: int, default: 10. Number of top words per topic to save.
    • model_folder: string, default: None. The folder in which the model and statistics files are saved.
    • save_topic_proportions: string, default: None. Used to save the topic proportions \(\theta\) of each document in the training corpus. Its value is the path of an .h5 file.

    Return: the learned model (object of class LdaModel)

  • infer_new_docs (new_corpus)

This method is used to do inference for new documents. new_corpus is a Corpus object. It returns the topic proportions \(\theta\) for each document in the new corpus.
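The snippet below (with made-up term indices and counts) shows the mini-batch format expected by static_online and e_step, as referenced in the parameter descriptions above.

import numpy as np

# A tiny mini-batch of two documents: term indices and their counts.
wordids = [np.array([0, 5, 42]), np.array([3, 5])]
wordcts = [np.array([2, 1, 1]), np.array([4, 1])]

# Assuming ml_fw is an already-created MLFW object (see the example below):
# e_time, m_time, theta = ml_fw.static_online(wordids, wordcts)
# theta[d] would then be the topic mixture of document d in the mini-batch.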

Example

from tmlib.lda import MLFW
from tmlib.datasets import DataSet

# data preparation
data = DataSet(data_path='data/ap_train_raw.txt', batch_size=100, passes=5, shuffle_every=2)
# learning and save the model, statistics in folder 'models-ml-fw'
ml_fw = MLFW(data=data, num_topics=20)
model = ml_fw.learn_model(save_model_every=1, compute_sparsity_every=1, save_statistic=True, save_top_words_every=1, num_top_words=10, model_folder='models-ml-fw')


# inference for new documents
vocab_file = data.vocab_file
# create object ``Corpus`` to store new documents
new_corpus = data.load_new_documents('data/ap_infer_raw.txt', vocab_file=vocab_file)
theta = ml_fw.infer_new_docs(new_corpus)
[1] K. Than and T. B. Ho, “Fully sparse topic models,” in Machine Learning and Knowledge Discovery in Databases, ser. Lecture Notes in Computer Science, P. Flach, T. De Bie, and N. Cristianini, Eds. Springer, 2012, vol. 7523, pp. 490–505.

[2] K. Than and T. B. Ho, “Inference in topic models: sparsity and trade-off.” [Online]. Available: https://arxiv.org/abs/1512.03300

[3] K. L. Clarkson, “Coresets, sparse greedy approximation, and the frank-wolfe algorithm,” ACM Trans. Algorithms, vol. 6, pp. 63:1–63:30, 2010. [Online]. Available: http://doi.acm.org/10.1145/1824777.1824783