Module tmlib.datasets

This module includes classes and utility functions that help us work with datasets (corpora)

class DataSet

This is the main class for storing information about your corpus, such as the number of documents and the size of the vocabulary. It also lets you load mini-batches of data to implement your learning algorithm.

tmlib.datasets.DataSet(data_path=None, batch_size=None, passes=1, shuffle_every=None, vocab_file=None)

Parameters

  • data_path: string

    Path of the input (corpus) file

  • batch_size: int

    Size of each mini-batch sampled from the corpus.

  • passes: int, default: 1

    passes controls how often we train the model on the entire corpus; another word for passes might be “epochs” (as in training a neural network). It is important to set the number of passes high enough for the model to converge.

    For example, if you set passes = 5 and assume that batch_size = 100 and the corpus contains 10000 documents, then the number of training iterations is 10000/100*5 = 500

  • shuffle_every: int

    This parameter helps us shuffle the samples (documents) of the corpus between passes (epochs)

    If you set shuffle_every=2, the corpus will be shuffled after every 2 passes over it

  • vocab_file: string, default: None

    Vocabulary file of the corpus

    If the corpus is in raw text format, the vocabulary file is unnecessary. Otherwise, if the corpus is in tf or sq format, the user must supply it

Attributes

  • batch_size: int

  • vocab_file: string

  • num_docs: int

    Number of documents in the corpus

  • data_path: string

    Path of the corpus file, which is in term-frequency or term-sequence format

  • data_format: attribute of class DataFormat

    The class DataFormat stores the names of the data formats: DataFormat.RAW_TEXT, DataFormat.TERM_SEQUENCE or DataFormat.TERM_FREQUENCY

  • output_format: attribute of class DataFormat, default: DataFormat.TERM_FREQUENCY

    Format of the loaded mini-batch. The user can change the format with the method set_output_format

  • passes: int

  • shuffle_every: int

  • work_path: string

    This path differs from data_path: if the corpus has been shuffled, work_path is the path of the shuffled file, not of the original file

Methods

  • __init__(data_path=None, batch_size=None, passes=1, shuffle_every=None, vocab_file=None)

  • load_mini_batch()

    Load a mini-batch from the corpus in the format specified by output_format.

    Return: an object of class Corpus storing the mini-batch

  • load_new_document(path_file, vocab_file=None)

    Load new documents from path_file. If the file is in raw text format, you also need to supply vocab_file.

    Return: an object of class Corpus

  • check_end_of_data()

    Check whether we have visited the last mini-batch or not.

    Return: True if the last mini-batch has been loaded and training is done

  • set_output_format(output_format)

    Set the format of the loaded mini-batch

    • Parameters: output_format (DataFormat.TERM_SEQUENCE or DataFormat.TERM_FREQUENCY)
  • get_total_docs()

    Return the number of documents which have been analyzed so far

  • get_num_tokens()

    Return the number of tokens in the corpus

  • get_num_terms()

    Return the number of unique terms in the corpus (the size of the vocabulary)

Example

  • Load mini-batch with term-frequency format
from tmlib.datasets import DataSet

#AP corpus in folder examples/ap/data
data = DataSet(data_path='data/ap_train_raw.txt', batch_size=100, passes=4, shuffle_every=2)
minibatch = data.load_mini_batch()  # The format is term-frequency by default
  • Load mini-batch with term-sequence format
from tmlib.datasets import DataSet
from tmlib.datasets.utilities import DataFormat

#AP corpus in folder examples/ap/data
data = DataSet(data_path='data/ap_train_raw.txt', batch_size=100, passes=4, shuffle_every=2)
data.set_output_format(DataFormat.TERM_SEQUENCE)
minibatch = data.load_mini_batch()

In these examples, we set passes=4 and shuffle_every=2, meaning that we pass over the data 4 times and that after every 2 passes the corpus is shuffled again. Assuming the corpus contains 5000 documents and batch_size = 100, the number of iterations is 5000/100*4 = 200. We can detect the last iteration by using the method check_end_of_data(), as in the sketch below.
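
Since check_end_of_data() returns True only after the last mini-batch has been loaded, a typical training loop can be sketched as follows. Note that update_model here is a hypothetical stand-in for your own learning step, not a function of the library.

from tmlib.datasets import DataSet

data = DataSet(data_path='data/ap_train_raw.txt', batch_size=100, passes=4, shuffle_every=2)
while not data.check_end_of_data():
    minibatch = data.load_mini_batch()  # object of class Corpus
    update_model(minibatch)  # hypothetical: replace with your learning algorithm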

class DataFormat

This class defines the 3 data formats used by the library: raw text, term-sequence and term-frequency

tmlib.datasets.utilities.DataFormat

Static Attributes

  • RAW_TEXT: string, value is ‘txt’
  • TERM_FREQUENCY: string, value is ‘tf’
  • TERM_SEQUENCE: string, value is ‘sq’

Example

This example checks the data format of the corpus examples/ap/ap_train_raw.txt

from tmlib.datasets.utilities import DataFormat, check_input_format

input_format = check_input_format('examples/ap/ap_train_raw.txt')
print(input_format)
if input_format == DataFormat.RAW_TEXT:
    print('Corpus is raw text')
elif input_format == DataFormat.TERM_SEQUENCE:
    print('Corpus is term-sequence format')
else:
    print('Corpus is term-frequency format')

Output:

txt
Corpus is raw text

class Corpus

This class is used to store a corpus in one of 2 formats: term-frequency or term-sequence

tmlib.datasets.utilities.Corpus(format_type)

Parameters

  • format_type: DataFormat.TERM_SEQUENCE or DataFormat.TERM_FREQUENCY

Attributes

  • format_type: format of the corpus

  • word_ids_tks: list of lists

    Each element of this list is itself a list holding the words of one document in the corpus (the unique terms of the document if the format is term-frequency, or the list of tokens if the format is term-sequence)

  • cts_lens: list

    If the format is term-frequency, each element is the list of frequencies of the unique terms in the corresponding document. If the format is term-sequence, each element is the number of tokens in the corresponding document.

Methods

  • append_doc(ids_tks, cts_lens)

    Add a document to the corpus, as in the example below. If the document is in term-frequency format, this method appends the list of unique terms to word_ids_tks and the list of frequencies to cts_lens. If the format is term-sequence, the list of tokens and the number of tokens are appended, respectively.

    • Parameters: ids_tks and cts_lens give the added document in tf or sq format

      ids_tks: list of unique terms (term-frequency format) or list of tokens (term-sequence format)

      cts_lens: list of frequencies of the unique terms (term-frequency format) or number of tokens in the document (term-sequence format)
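
Example

A minimal sketch that builds a tiny term-frequency corpus by hand (the term ids and counts here are made up for illustration):

from tmlib.datasets.utilities import Corpus, DataFormat

corpus = Corpus(DataFormat.TERM_FREQUENCY)
# document whose unique terms 4, 10, 25 occur 2, 1, 3 times
corpus.append_doc([4, 10, 25], [2, 1, 3])
# document whose unique terms 7, 10 occur 1, 4 times
corpus.append_doc([7, 10], [1, 4])
# after these calls, word_ids_tks holds the two term lists
# and cts_lens holds the two frequency lists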

Utility functions

The functions below belong to the module tmlib.datasets.utilities

get_data_home

tmlib.datasets.utilities.get_data_home(data_home=None)

This folder is used by some large dataset loaders to avoid downloading the data several times.

By default the data dir is set to a folder named ‘tmlib_data’ in the user home folder. We can change it by changing the value of the data_home parameter. The ‘~’ symbol is expanded to the user home folder.

If the folder does not already exist, it is automatically created.

  • Return: path of the tmlib data dir.
>>> from tmlib.datasets import utilities
>>> print utilities.get_data_home()
/home/kde/tmlib_data

clear_data_home

tmlib.datasets.utilities.clear_data_home(data_home=None)

Delete all the content of the data home cache.
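
For example, to delete the default data home folder:

>>> from tmlib.datasets import utilities
>>> utilities.clear_data_home()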

check_input_format

tmlib.datasets.utilities.check_input_format(file_path)
  • Check the format of the input file (formatted text or raw text)

  • Parameters: file_path (string)

    Path of file input

  • Return: format of input (DataFormat.RAW_TEXT, DataFormat.TERM_FREQUENCY or DataFormat.TERM_SEQUENCE)

>>> from tmlib.datasets import utilities
>>> file_path = '/home/kde/Desktop/topicmodel-lib/examples/ap/ap_train.txt'
>>> print utilities.check_input_format(file_path)
tf
>>> file_path = '/home/kde/Desktop/topicmodel-lib/examples/ap/ap_train_raw.txt'
>>> print utilities.check_input_format(file_path)
txt

load_batch_raw_text

tmlib.datasets.utilities.load_batch_raw_text(file_raw_text_path)
  • Load all documents of the file and store them as a list. Each element of the list is a document in raw text format (a string)

  • Parameters: file_raw_text_path (string)

    Path of file input

  • Return: list; each element is a string holding the text of one document

>>> from tmlib.datasets import utilities
>>> path_file_raw_text = '/home/kde/Desktop/topicmodel-lib/examples/ap/ap_infer_raw.txt'
>>> list_docs = utilities.load_batch_raw_text(path_file_raw_text)
>>> print 'number of documents: ', len(list_docs)
number of documents:  50
>>> print list_docs[8]
 Here is a summary of developments in forest and brush fires in Western states:

pre_process

tmlib.datasets.utilities.pre_process(file_path)
  • Preprocess the input file if the data is in raw text format

  • Parameter: file_path (string)

    Path of the input file

  • Return: list containing, respectively, the paths of the vocabulary file, the term-frequency file and the term-sequence file created by the preprocessing

>>> from tmlib.datasets import utilities
>>> path_file = '/home/kde/Desktop/topicmodel-lib/examples/ap/ap_train_raw.txt'
>>> path_vocab, path_tf, path_sq = utilities.pre_process(path_file)
Waiting...
>>> print 'path to file vocabulary extracted: ', path_vocab
path to file vocabulary extracted:  /home/kde/tmlib_data/ap_train_raw/vocab.txt
>>> print 'path to file with term-frequency format: ', path_tf
path to file with term-frequency format:  /home/kde/tmlib_data/ap_train_raw/ap_train_raw.tf
>>> print 'path to file with term-sequence format: ', path_sq
path to file with term-sequence format:  /home/kde/tmlib_data/ap_train_raw/ap_train_raw.sq

load_batch_formatted_from_file

tmlib.datasets.utilities.load_batch_formatted_from_file(data_path, output_format=DataFormat.TERM_FREQUENCY)
  • Load all documents from a file in term-frequency or term-sequence format, and return a corpus in the format given by output_format
  • Parameters:
    • data_path: path of the formatted input data file
    • output_format: format of the output data, default: term-frequency format
  • Return: a Corpus object which is the input data for learning
>>> from tmlib.datasets import utilities
>>> path_file_tf = '/home/kde/Desktop/topicmodel-lib/examples/ap/ap_train.txt'
>>> corpus_tf = utilities.load_batch_formatted_from_file(path_file_tf)
>>> print 'Unique terms in the 9th document: ', corpus_tf.word_ids_tks[8]
Unique terms in the 9th document:  [5829 4040 2891   14 1783  381 2693]
>>> print 'Frequency of unique terms in the 9th document: ', corpus_tf.cts_lens[8]
Frequency of unique terms in the 9th document:  [1 1 1 1 1 1 1]
>>> corpus_sq = utilities.load_batch_formatted_from_file(path_file_tf, output_format=utilities.DataFormat.TERM_SEQUENCE)
>>> print 'List of tokens in the 9th document: ', corpus_sq.word_ids_tks[8]
List of tokens in the 9th document:  [5829 4040 2891   14 1783  381 2693]
>>> print 'Number of tokens in the 9th document: ', corpus_sq.cts_lens[8]
Number of tokens in the 9th document:  7

reformat_file_to_term_sequence

tmlib.datasets.utilities.reformat_file_to_term_sequence(file_path)
  • Convert a formatted input file (tf or sq) to a file in term-sequence format

  • Parameter: file_path (string)

    Path of the input file

  • Return: path of the file converted to term-sequence format

>>> from tmlib.datasets import utilities
>>> path_file_tf = '/home/kde/Desktop/topicmodel-lib/examples/ap/ap_train.txt'
>>> path_file_sq = utilities.reformat_file_to_term_sequence(path_file_tf)
>>> print 'path to file term-sequence: ', path_file_sq
path to file term-sequence:  /home/kde/tmlib_data/ap_train/ap_train.sq

reformat_file_to_term_frequency

tmlib.datasets.utilities.reformat_file_to_term_frequency(file_path)
  • Convert a formatted input file (tf or sq) to a file in term-frequency format

  • Parameter: file_path (string)

    Path of the input file

  • Return: path of the file converted to term-frequency format

>>> from tmlib.datasets import utilities
>>> path_file = '/home/kde/Desktop/topicmodel-lib/examples/ap/ap_train.txt'
>>> path_file_tf = utilities.reformat_file_to_term_frequency(path_file)
>>> print 'path to file term-frequency: ', path_file_tf
path to file term-frequency:  /home/kde/tmlib_data/ap_train/ap_train.tf

convert_corpus_format

tmlib.datasets.utilities.convert_corpus_format(corpus, data_format)
  • Convert a corpus (object of class Corpus) to the desired format
  • Parameters:
    • corpus: object of class Corpus
    • data_format: the desired format (DataFormat.TERM_SEQUENCE or DataFormat.TERM_FREQUENCY)
  • Return: a Corpus object in the desired format
>>> from tmlib.datasets import utilities
>>> path_file_tf = '/home/kde/Desktop/topicmodel-lib/examples/ap/ap_train.txt'
>>> corpus = utilities.load_batch_formatted_from_file(path_file_tf)
>>> corpus_sq = utilities.convert_corpus_format(corpus, utilities.DataFormat.TERM_SEQUENCE)
>>> print 'Unique terms in the 22nd document: ', corpus.word_ids_tks[21]
Unique terms in the 22nd document:  [  32  396  246   87  824 3259  316  285]
>>> print 'Frequency of unique terms in the 22nd document: ', corpus.cts_lens[21]
Frequency of unique terms in the 22nd document:  [1 1 1 2 1 1 2 1]
>>> print 'List of tokens in the 22nd document: ', corpus_sq.word_ids_tks[21]
List of tokens in the 22nd document:  [32, 396, 246, 87, 87, 824, 3259, 316, 316, 285]
>>> print 'Number of tokens in the 22nd document: ', corpus_sq.cts_lens[21]
Number of tokens in the 22nd document:  10

compute_sparsity

tmlib.datasets.utilities.compute_sparsity(doc_tp, num_docs, num_topics, _type)
  • Compute the sparsity of documents.
  • Parameters:
    • doc_tp: numpy.array, 2-dimensional, the estimated topic mixtures of all documents in the corpus
    • num_docs: int, the number of documents in the corpus
    • num_topics: int, the number of requested latent topics to be extracted from the training corpus
    • _type: string; if the value is ‘z’, the topic mixtures were estimated by a sampling method such as CGS or CVB0, which has its own calculation. Otherwise, if the value isn’t ‘z’, the calculation applies to methods such as VB, OPE or FW
  • Return: float, the sparsity of the documents
>>> import numpy as np
>>> from tmlib.datasets import utilities
>>> theta = np.array([[0.1, 0.3, 0.2, 0.2, 0.1, 0.1], [0.02, 0.05, 0.03, 0.5, 0.2, 0.2]], dtype='float32')
>>> utilities.compute_sparsity(theta, theta.shape[0], theta.shape[1], _type='t')
1.0

write_topic_proportions

tmlib.datasets.utilities.write_topic_proportions(theta, file_name)

  • Save topic mixtures (theta) to a file
  • Parameters:
    • theta: numpy.array, 2-dimensional
    • file_name: name (path) of the file to be written
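
A minimal usage sketch (the theta values and the file name are arbitrary):

>>> import numpy as np
>>> from tmlib.datasets import utilities
>>> theta = np.array([[0.1, 0.3, 0.6], [0.5, 0.25, 0.25]])
>>> utilities.write_topic_proportions(theta, 'topic_proportions.txt')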