Structural Topic Models

Author

Deon Roos

Hey Grace,

I’ve written this page for you and your honours project. I’ve aimed to keep it accessible, covering both the core theory behind the stats you’ll use (Structural Topic Models, STMs) and how to actually implement them. That said, the theory can get complex at times, and I’m still learning it myself! (It’s also surprisingly hard to find good, clear explanations online.) It took me a while to make sense of the stats, so don’t worry if it also takes you some time. Be patient with yourself when learning this material.

To be clear, this page isn’t meant to replace our meetings or turn me into a hands-off “supervisor.” It’s a resource you can return to whenever you need a refresher or some help getting unstuck.

Hopefully, this helps you get a handle on the method, but let me know if anything is confusing. It probably won’t answer every question, and I won’t be offended if you want to toss it in a bonfire. Just flag anything that’s unclear, and we’ll work through it together.


Why analyse text?

Let’s say we want to understand how movie reviews have changed over time. How would we do that?

“Well,” you might say, “maybe we could read thousands of reviews from the last 50 years and summarise them.” That’s technically possible but massively time-consuming. Even if we narrowed it down to the “best” reviews, we’d still face a huge pile of reading, plus we’d now have the problem of deciding which reviews count as the “best” in the first place.

In fact, that’s exactly what many arts PhD students do: read hundreds (or thousands) of documents and distill them into themes. But I’d argue I’m not smart enough to do an arts PhD so instead, I’d like to cheat and let computers do the hard work for me.

But how do we cheat?

This is where your thesis steps into a murky (but exciting) space between statistics and machine learning.

One option is supervised learning. This is where you tell the model what it should learn. For example, in BI3010 you learned:

\[ y_i \sim Normal(\mu_i, \sigma) \\ \mu_i = \beta_0 + \beta_1 \times x_i \]

Here, the goal was to estimate $\beta_0$ and $\beta_1$. It’s supervised because we’ve defined the task: “Fit this straight line.”
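
If it helps to see that as code, here’s a tiny self-contained sketch of exactly that kind of supervised model, using simulated data (so the numbers are invented):

Code
# A tiny simulated example of the supervised model above (all numbers invented)
set.seed(1)
x <- runif(50, 0, 10)                       # a covariate
y <- rnorm(50, mean = 2 + 0.5 * x, sd = 1)  # data simulated with beta_0 = 2 and beta_1 = 0.5
coef(lm(y ~ x))                             # estimates of beta_0 and beta_1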

Doing this with text means labelling each document manually. For example, marking them “Pro-Lynx” or “Anti-Lynx”, and then fitting a model. Doable? Yes. But incredibly time-intensive, and not much easier than just summarising the documents yourself.

The other option is unsupervised learning. Here, you don’t tell the model what to look for, you let it discover patterns on its own. The model identifies clusters or themes in the text, and your job afterward is simply to interpret those themes.

That’s way more manageable.

There are many ways to do this (Large Language Models like ChatGPT use similar ideas), but for your project, we’ll use a method called Structural Topic Modelling (STM).

What is a Structural Topic Model?

A Structural Topic Model (STM) is a type of analysis that allows us to explore large sets of text documents and identify the common themes within them, called topics. Just like you did in BI3010, you can also include covariates (also called explanatory variables) to see whether they make a topic more or less prevalent.

Grace, in your case, this might include things like date (i.e. how coverage of lynx has changed over time) and the illegal release of lynx (and maybe their interaction), to see whether these changed which topics are more or less prevalent. For example, did negative topics become more common in the media after the illegal release compared to before?

STMs are a fairly complex beast, with lots of new ideas. One of these new ideas that I won’t explain in this document is Bayesian statistics. Luckily, I have written another set of documents that explain the general theory of this, which you are welcome and encouraged to read through. Although STMs, as implemented in the R package stm, use the Bayesian statistical framework, you can’t actually interact with it, so it’s not crucial to understand in this case. However, I would recommend trying to wrap your head around it, as it’s a piece of knowledge that may make you highly employable.

With that said, let’s go over, conceptually, what Structural Topic Models are doing.

  1. We begin by gathering a corpus. This is a collection of documents, like newspaper articles. Our objective is to learn something about the corpus.
  2. We assume that within each document, there can exist multiple topics. Topics are the “themes” that the document covers; things like “Lynx are bad and we shouldn’t release them” or “Lynx are good and we should release them”.
  3. These topics are latent, which means “hidden” or “unobserved” (newspapers don’t add a sticker on each article to say what the theme is afterall) and we want to use the STM to identify them and to see which are most prevalent.
  4. We state how many themes we think there are. This might be 5 or it might be 200. This is a choice we make. (There are some tools that can help make this choice.)
  5. Within each document we consider each word. We assume that each word is associated with one of these topics with differing probabilities. For example, if there is a topic for “Lynx are bad”, then we might expect that “livestock” has a 90% chance of belonging to this topic, while “rewilding” only has a 0.5% chance.
  6. Each topic will have a distribution of words associated with it, each with their own probability to belong to that topic.

So our objective then, is to identify different topics, and which words tend to categorise those topics. This is what we’re, fundamentally, trying to do in an STM.

How do STMs work?

From the conceptual description above, you may notice something: STMs have a hierarchical structure. At the top level is the document, within which we have topics, within which we have words.

The data we have are the documents and the words within them, and we use these to estimate the topics.

This hierarchical structure is common in many of the more advanced statistical methods, especially in ecology. Occupancy models are one example, as are Cormack-Jolly-Seber models, which estimate the survival of individual animals. For your heritage, know that Cormack and Jolly worked in Aberdeen when they developed the method. That’s something to be super proud of! The CJS is such an important model in ecology that there are entire conferences dedicated to people using it.

What does an STM look like?

Let’s start with a simple way to visualise the data and output of an STM (apologies for the generative AI image):

The things to take away from this are that \(d\) is just the current document you are looking at, and \(\theta_k\) describes the relative proportion of a document that is dedicated to topic \(k\) (e.g. in the above figure we have three topics, called A, B and C). These topics are determined by the words within them (\(w\), out of all \(n\) words used in document \(d\)), each of which appears in topic \(k\) with probability \(\beta\).

That’s the simplified version. If we dive into the details, things get a bit more complex.

The equations

We’ll go through this step-by-step because the estimation process in structural topic modeling is complex (but powerful).

1. Document Topic Model

Our first objective is to understand, within a document, what proportion is given to each topic, where we can have multiple topics (up to \(K\) topics, e.g. 20). For example, an article may dedicate 50% to the topic “lynx are bad” and 50% to “protect livestock”. Why are these the topics that we see? Well, maybe the newspaper has a particular political leaning, and newspapers with that stance tend to include these topics. Maybe this was a few days after the illegal lynx release.

This first “sub-model” attempts to resolve that. The complication is that we will have multiple topics, and we need a proportion for each. In BI3010 the distributions you worked with, for example the Normal distribution, only had a single average, whereas here we need to estimate a value for every topic at once. Before we can turn these into proportions, we first create a set of ‘topic scores’, one for each topic. These scores are drawn from a distribution that lets topics vary together (some may be more likely to appear together), and for that, we use a multivariate normal distribution, which is just a version of the normal distribution that handles multiple, possibly correlated, values at once. This model produces multiple topic scores, which we’ll label with \(\theta\) (note that because there are multiple values in \(\theta\), we label it as \(\vec{\theta}\)).

So this first sub-model is a type of Generalised Linear Model, except this GLM does not use a Poisson, like you did in BI3010, but the Multivariate Normal.

Here’s how we describe it “formally”:

For each document \(d\) that we have, with covariates \(\vec{X}_d\), we work with a vocabulary of size \(V\) (that is, \(V\) unique words appear across the documents), given \(K\) topics, which we’re going to fit using a GLM that uses a multivariate normal distribution (with a \(\text{softmax}\) transformation to turn the scores into proportions) to estimate how much of each topic is present in a document (note that the “\(\vec{}\)” symbol is shorthand for vector, or a “column of data”):

\[ \vec{\theta} \sim \text{MVNorm}(\vec{\mu}, \boldsymbol\Sigma) \]

\[ \vec{\mu} = \vec{X}_d \boldsymbol{\Gamma} \]

A multivariate normal distribution is like several normal distributions stacked together. So, instead of having just one mean and one variance, you have a mean and variance for each topic. The covariance matrix \(\boldsymbol\Sigma\) (which holds the variances and covariances) also tells you how topics tend to co-occur, for example, maybe “Lynx are bad” often appears alongside “Predator control”.

where \(\vec{X}_d\) is a 1-by-\(p\) vector (your covariates, with \(p\) being how many covariates you have), \(\boldsymbol{\Gamma}\) is a \(p\)-by-(\(K-1\)) matrix of coefficients, and \(\boldsymbol{\Sigma}\) is a (\(K-1\))-by-(\(K-1\)) covariance matrix.

Keep in mind, these \(\theta\) values are like scores that indicate how much each topic is ‘preferred’ in a document. We turn these scores into probabilities using a transformation called the \(\text{softmax}\), which ensures they sum to 1; like proportions should.
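
To make that concrete, here’s a minimal sketch of this sub-model for a single document with three topics. All of the \(\mu\) and \(\boldsymbol\Sigma\) values are invented, and for simplicity I’m ignoring the \(K-1\) detail mentioned above:

Code
# Minimal sketch of the document-topic step for one document (all numbers invented)
library(mvtnorm)
set.seed(1)

mu    <- c(0.5, -0.2, 0.0)                       # mean topic score for each of 3 topics
Sigma <- matrix(c( 1.0,  0.4, -0.3,
                   0.4,  1.0,  0.1,
                  -0.3,  0.1,  1.0), nrow = 3)   # how the topic scores co-vary

theta_scores <- as.numeric(rmvnorm(1, mean = mu, sigma = Sigma))  # draw the topic "scores"
theta_props  <- exp(theta_scores) / sum(exp(theta_scores))        # softmax: exponentiate, then normalise

round(theta_props, 2)  # topic proportions for this document
sum(theta_props)       # sums to 1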

Here’s what the \(\text{MVNorm}\) distribution looks like when we have two averages with differing correlation (the \(\rho\) values) between them:

Code
library(mvtnorm)
library(ggplot2)
library(dplyr)

make_density_df <- function(rho) {
  mu <- c(0, 0)
  sigma <- matrix(c(1, rho, rho, 1), nrow = 2)
  
  x <- seq(-3, 3, length.out = 100)
  y <- seq(-3, 3, length.out = 100)
  grid <- expand.grid(X1 = x, X2 = y)
  
  grid$z <- mvtnorm::dmvnorm(grid, mean = mu, sigma = sigma)
  grid$corr <- rho
  
  return(grid)
}

df_all <- bind_rows(
  make_density_df(-0.8),
  make_density_df(0),
  make_density_df(0.8)
)

ggplot(df_all, aes(x = X1, y = X2, z = z)) +
  geom_contour_filled(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~ corr, labeller = labeller(corr = function(x) paste0("ρ = ", x))) +
  coord_equal() +
  labs(
    x = expression(Y[1]),
    y = expression(Y[2])
  ) +
  theme_minimal(base_size = 14)

2. Topic-Word Model

At the same time, we want to determine the probability for each word to be associated with any of our topics. For instance, what is the probability that “kill”, “predator”, “rewilding” and “nature” belong to the topic “lynx are good”? That’s what the Topic-Word model is tasked with solving.

Assuming you included a document-level content covariate \(y_d\) (e.g. Politically Left versus Politically Right newspaper), we can form a document-specific distribution of words (as a vector, or “column”, of numbers), called \(\boldsymbol{\beta}\), which represents each topic (\(k\)).

We model the probability of each word as a combination of the baseline frequency of the word, how much more or less it appears in each topic, and how that might change depending on document metadata like political leaning. We combine these effects additively on the log scale, and then exponentiate to turn them back into probabilities. The pieces are:

  • The baseline word distribution (\(m\), i.e. how common is this word across all documents),

  • The topic specific deviation \(\boldsymbol{\kappa}^{(t)}_k\) (i.e. is that word more or less common in topic \(k\))

  • The covariate group deviation \(\boldsymbol{\kappa}^{(c)}_{y_d}\) (i.e. is that word more or less common in Politically Left or Right newspapers),

  • And the interaction between the two \(\boldsymbol{\kappa}^{(i)}_{y_d,k}\) if we want one

which we can estimate by doing:

\[ \vec{\beta}_{d,k} \propto \exp(\vec{m} + \vec{\kappa}^{(t)}_k + \vec{\kappa}^{(c)}_{y_d} + \vec{\kappa}^{(i)}_{y_d,k}) \]

Read this as saying “the probability, \(\beta\), to see a given unique word in document \(d\), in topic \(k\) is proportional to (the \(\propto\) symbol) how common it is in general, as well as how common it is in the given topic and/or group”

This results in:

\[ \vec{\beta}_{d,k} = [\beta_{d,k,1}, \beta_{d,k,2}, ..., \beta_{d,k,V}] \]

where \(\vec{\beta}_{d,k}\) is a vector that contains the probability to see a given unique word (the \(_{1,2,...,V}\) bit) in a topic (\(k\)), in a document (\(d\)).

3. Estimating \(\vec{\beta}_{d,k}\)

Keep in mind that \(\vec{\beta}_{d,k}\) should be a probability. But to figure it out we start by estimating the rate at which we see each unique word (\(v\)) across the entire corpus using multiple Poisson GLMs; one for each unique word:

\[ w_v \sim Poisson(\lambda_v) \]

\[ log(\lambda_v) = m_v + \kappa^{(t)}_{k,v} + \kappa^{(c)}_{y_d,v} + \kappa^{(i)}_{y_d,k,v} \]

Here, \(w_v\) is the observed count of word \(v\). Remember from BI3010 that a Poisson GLM estimates a rate, but here we need a probability. To do that, we take the estimated rate (\(\lambda\)) for word \(v\) and divide it by the sum of the \(\lambda\)s from all the other Poisson GLMs to get a probability (e.g. if we see the word lynx 100 times but we see all words a total of 500 times, then the probability to see the word lynx is \(\frac{100}{500} = 0.2 = 20\%\)). We do this by:

\[ \beta_{d,k,v} = \frac{\lambda_v}{\sum_{v'}\lambda_{v'}} \]

A small note here. Normally you’d want to estimate this using a multinomial GLM, which estimates the probability of an event happening (like seeing the word lynx) when there are lots of possible outcomes. The problem occurs when you have hundreds of thousands of unique words; in that case a multinomial model can take far too long to fit. That’s why stm uses a Poisson model for each unique word and then converts the estimated rates into probabilities.
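
To see how the log-scale pieces turn into word probabilities, here’s a toy sketch with a five-word vocabulary. All the numbers for \(m\) and the \(\kappa\)s are invented purely for illustration:

Code
# Toy sketch: combining log-scale word effects and normalising to probabilities
# (every number here is made up for illustration)
vocab <- c("lynx", "livestock", "kill", "rewilding", "nature")

m       <- c(2.0, 1.0, 0.5, 0.8, 1.2)    # baseline log-frequency of each word in the corpus
kappa_t <- c(0.3, 1.5, 1.2, -2.0, -1.0)  # deviation for one topic (say, "lynx are bad")
kappa_c <- c(0.0, 0.4, 0.2, -0.3, -0.2)  # deviation for one covariate group (say, one type of newspaper)

lambda  <- exp(m + kappa_t + kappa_c)    # add on the log scale, then exponentiate to get rates
beta_dk <- lambda / sum(lambda)          # divide by the total so the values become probabilities

round(beta_dk, 3)
sum(beta_dk)                             # 1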

4. Which words and which topics?

Now that we’ve estimated the topic proportions \(\vec{\theta}_d\) and the topic-word distributions \(\vec{\beta}_{d,k}\), we can estimate the latent variables that explain how each word was chosen (there’s a small toy sketch of this after the two steps below).

For each word in the document (which we can write as \(n \in \{1,...,N_d\}\), or “for each word that is in all words from the first to the last”) :

  • Estimate the topic by fitting a multinomial GLM, based on the probabilities in the vector \(\vec{\theta}_d\):

    \[ z_{d,n} \sim Multinomial(\vec{\theta_d}) \]

  • Then conditional on the topic, we fit another multinomial GLM to estimate which word is most likely to appear in that topic:

    \[ w_{d,n} \sim Multinomial(\vec{\beta}_{d,k=z_{d,n}}) \]
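
Here’s a toy sketch of that generative step, with invented values for \(\vec{\theta}_d\) and \(\boldsymbol{\beta}\) (in a real STM these are estimated, not handed over):

Code
# Toy sketch: pick a topic for each word, then pick a word from that topic
# (theta_d and beta are invented; in a real STM they are estimated)
set.seed(42)

theta_d <- c(0.6, 0.3, 0.1)                        # topic proportions for one document (K = 3)
vocab   <- c("lynx", "livestock", "kill", "rewilding", "nature")
beta    <- rbind(c(0.40, 0.30, 0.20, 0.05, 0.05),  # word probabilities for topic 1
                 c(0.10, 0.05, 0.05, 0.40, 0.40),  # topic 2
                 c(0.20, 0.20, 0.20, 0.20, 0.20))  # topic 3

N_d <- 10                                                       # number of words in this document
z   <- sample(1:3, size = N_d, replace = TRUE, prob = theta_d)  # topic assignment for each word
w   <- sapply(z, function(k) sample(vocab, size = 1, prob = beta[k, ]))  # word drawn from that topic

data.frame(word = w, topic = z)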

And that’s it. Suuuuuper simple, right? For transparency, I spent about three days going over material trying to make sense of the literature, in part because quantitative social scientists use very different terminology and a lot of the material I found glossed over the details, making it frustratingly hard to understand what an STM is actually doing. (But also a hell of a lot of fun).

Plate notation

If the above equations were too much, there’s another way to describe how the model works; more visual and less algebraic. It doesn’t give the nuts and bolts, but it might help to build an intuition.

To do so, we can use plate notation. These are diagrams that describe how different parts of the model relate to each other.

Code
library(DiagrammeR)

grViz("
digraph stm {
  graph [layout = dot, rankdir = LR]

  # Nodes
  Σ [shape=circle, label='Σ', style=dashed]
  μ [shape=circle, label='μ', style=dashed]
  X [shape=circle, label='X']
  κ [shape=circle, label='κ']
  θ [shape=circle, label='θ', style=dashed]
  z [shape=circle, label='z', style=dashed]
  w [shape=circle, label='w']
  β [shape=circle, label='β', style=dashed]

  # Edges
  Σ -> θ
  μ -> θ
  X -> θ
  θ -> z
  z -> w
  β -> w
  κ -> β

  # Outer plate: D
  subgraph cluster_D {
    label = 'D'
    style = 'solid'
    X; θ; β; κ;

    # Nested plate: N
    subgraph cluster_N {
      label = 'N'
      style = 'solid'
      z; w;
    }
  }
}
")

Where:

  • Nodes: Circles represent variables. Dashed circles mean they are latent (a variable we have to estimate), while solid circles means they are observed data.

  • Plates: Rectangles indicate repetition:

    • \(D\): Each node is relevant for each document

    • \(N\): Each node is relevant for each word (and because \(N\) is within \(D\), also for each document)

And where the variables in the plate notation are:

  • \(X\) - Document level covariates (e.g. date of publication, political leaning)

  • \(\mu\) - Mean score for each topic

  • \(\Sigma\) - The covariance matrix between topics (models topic co-occurrence)

  • \(\theta\) - The estimated topic proportion (which sums to 1)

  • \(z\) - Estimated topic assignment for word \(n\) in document \(d\)

  • \(w\) - The actual observed word (e.g. lynx)

  • \(\beta\) - The estimated word distribution for topic \(k\)

  • \(\kappa\) - Document level content covariate (e.g. political group)

What’s in the box?

Now that we’ve seen the equations and model structure, let’s connect them to the key data structures that the model learns. These are the basis for most interpretations, visualizations, and inferences.

Specifically, STM produces two major matrices that summarise the relationship between documents, topics, and words:

  • The topic-word matrix \(\beta\) tells us, for each topic, how likely each word is to appear; this is how we’re able to interpret what each topic is “about.”

  • The document-topic matrix \(\theta\) tells us, for each document, how much it draws on each topic; this is how we understand which topics are emphasized in which texts.

These matrices are useful because they translate the complex statistical machinery into something that might be more understandable: which documents talk about which topics, and what those topics consist of.

For \(\beta\), it’s a topic-word matrix of dimension \(K \times V\). For example, if we have 10 topics (\(K = 10\)) and 100 unique words (\(V = 100\)), then we would have a matrix with 10 rows and 100 columns.

Within the matrix, each row (\(\beta_k\)) is the distribution of probabilities of seeing each unique word in that topic.

For example, \(\beta\) might look like:

\[ \begin{array}{c|ccccc} & \text{predator} & \text{policy} & \text{illegal} & \cdots & \text{Word V} \\ \hline \text{Topic } 1 & 0.35 & 0.10 & 0.06 & \cdots & \beta_{1,V} \\ \text{Topic } 2 & 0.05 & 0.01 & 0.32 & \cdots & \beta_{2,V} \\ \text{Topic } 3 & 0.10 & 0.02 & 0.08 & \cdots & \beta_{3,V} \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ \text{Topic } K & \beta_{K,1} & \beta_{K,2} & \beta_{K,3} & \cdots & \beta_{K,V} \end{array} \]

For \(\theta\), it’s a document-topic matrix of dimension \(D \times K\). For example, if we have 5 documents (\(D = 5\)) and 10 topics (\(K = 10\)), then \(\theta\) would be a matrix with 5 rows and 10 columns.

Each row \(\theta_d\) is a probability distribution over topics for that document — i.e., which topics are present, and to what degree. In the Document-Topic Model above, we were estimating each row! And as a result of \(\text{softmax}\) each row sums to 1.

For example, \(\theta\) might look like:

\[ \begin{array}{c|ccccc} & \text{Topic 1} & \text{Topic 2} & \text{Topic 3} & \cdots & \text{Topic 10} \\ \hline \text{Doc 1} & 0.40 & 0.10 & 0.25 & \cdots & \theta_{1,K} \\ \text{Doc 2} & 0.05 & 0.60 & 0.10 & \cdots & \theta_{2,K} \\ \text{Doc 3} & 0.15 & 0.15 & 0.10 & \cdots & \theta_{3,K} \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ \text{Doc D} & \theta_{D,1} & \theta_{D,2} & \theta_{D,3} & \cdots & \theta_{D,K} \end{array} \]
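
To make this less abstract, here’s where these two matrices live once you eventually fit a model with stm(). The object name stm_fit is just a placeholder, and the slot names are my reading of the stm package documentation, so double-check with ?stm:

Code
# Assuming a fitted model, e.g. stm_fit <- stm(documents = docs, vocab = vocab, K = 10, data = meta)
# (stm_fit is a placeholder name; the slots below are my reading of the stm documentation)

theta <- stm_fit$theta                  # D x K document-topic matrix; each row sums to 1
dim(theta)

beta <- exp(stm_fit$beta$logbeta[[1]])  # K x V topic-word matrix (stored as logs, so exponentiate)
dim(beta)

labelTopics(stm_fit, n = 5)             # a quicker way to see the top words in each topic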

Structural Topic Models Explained (Video)

This is one of the better videos I found that explains STMs. It keeps things fairly light and doesn’t dive into the kind of details I included above, so have a watch to see if this helps things make sense.


Implementing STM

Now that we’ve covered the theory, let’s have a look at how we actually implement the method. To do so, I’ll make use of a dataset that contains text of political blogs from 2008 (when Obama and McCain were running for president of the USA). Have a look at the data because the data you collect will need to be stored in the same way.

Load Packages and Data

To do the analysis we’re going to use stm to run the actual STMs, as well as tm and ggplot2 to help with text processing and visualisations.

Here’s what our data looks like (note that I’ve shortened the articles so the screen isn’t filled with text).

Code
library(stm)
library(tm)
library(ggplot2)

data <- read.csv("data/poliblogs2008.csv", stringsAsFactors = FALSE)

library(DT)
library(dplyr)
library(stringr)

preview_data <- data |> 
  mutate(documents = str_trim(documents)) |>   # trim leading/trailing whitespace
  mutate(documents = word(documents, 1, 20, sep = " ")) |> 
  mutate(documents = paste0(documents, " […]"))

DT::datatable(preview_data, 
              options = list(
                pageLength = 5,
                scrollX = TRUE
          ))

Preprocess Text

The first important stage in the analysis, before we get to the modelling, is to process the text. There’s apparently a lot of debate in the social sciences over whether or not to do some of these steps. I won’t lie. I’m no expert so I can’t give any meaningful advice about which ones are sensible and which aren’t, other than to suggest you do some of your own research and decide which you think are appropriate to use.

Imagine we have this sentence:

100 lynx were released today in Edinburgh. Locals are said to have fed them Whiskers cat food and offered them some buckfast.

After text processing, this becomes:

100 lynx were released today in edinburgh

locals are said to have fed them whiskers cat food and offered them some buckfast

Here are the types of text processing that can be done, and what they do:

  • Lowercasing: convert all words to lowercase to avoid treating “Lynx” and “lynx” as different (e.g. Edinburgh → edinburgh).

  • Stopword removal: remove very common words that don’t carry meaning (e.g. “was”, “to”, “and” are removed).

  • Stemming: reduce words to their root form to group variants (e.g. offered, feeding → offer, feed).

  • Punctuation removal: strip punctuation to avoid treating tokens like “cat.” and “cat” as different (e.g. buckfast. → buckfast).

  • Removing numbers: remove tokens that are just numbers, which are often meaningless in context (e.g. 100 is removed).

  • Filtering by word length: remove words that are too short or too long, e.g. length < 3 or > 20 (so “in” and “supercalifragilisticexpialidocious” are removed).

  • Tokenization: break text into individual words, or “tokens” (e.g. “fed it Whiskers” becomes fed, it, whiskers).

  • Removing rare/common terms: remove words that are too rare or too frequent across documents (e.g. buckfast or the, if a threshold is met).

To do this, we use the functions textProcessor() and prepDocuments(). These carry out the processing mentioned above and get the documents and data ready for analysis.
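
Most of the steps in the list above map onto arguments of textProcessor(), so you can switch individual steps on or off. Here’s a sketch; the argument names are my reading of the stm documentation, so confirm them with ?textProcessor before relying on this:

Code
# Sketch only: switching individual processing steps on or off in textProcessor()
# (argument names as I understand them from ?textProcessor -- double-check before use)
processed_alt <- textProcessor(
  documents         = data$documents,
  metadata          = data,
  lowercase         = TRUE,          # lowercasing
  removestopwords   = TRUE,          # stopword removal
  removenumbers     = TRUE,          # removing numbers
  removepunctuation = TRUE,          # punctuation removal
  stem              = TRUE,          # stemming
  wordLengths       = c(3, Inf)      # filtering by word length
)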

Code
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #
# I'm not entirely convinced by stm's text processing, so these are my additions:
data$documents_clean <- tolower(data$documents)

# expand or remove contraction suffixes
data$documents_clean <- gsub("'re\\b", " are", data$documents_clean)
data$documents_clean <- gsub("'ll\\b", " will", data$documents_clean)
data$documents_clean <- gsub("'ve\\b", " have", data$documents_clean)
data$documents_clean <- gsub("n't\\b", " not", data$documents_clean)
data$documents_clean <- gsub("'s\\b", "", data$documents_clean)
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #

processed <- textProcessor(data$documents_clean, metadata = data)
Building corpus... 
Converting to Lower Case... 
Removing punctuation... 
Removing stopwords... 
Removing numbers... 
Stemming... 
Creating Output... 
Code
out <- prepDocuments(processed$documents, processed$vocab, processed$meta)
Removing 83198 of 123990 terms (83198 of 2298953 tokens) due to frequency 
Your corpus now has 13246 documents, 40792 terms and 2215755 tokens.
Code
docs <- out$documents
vocab <- out$vocab
meta <- out$meta

After this you can see the result that the processing has had on text within the documents:

Code
# Code below just for website, don't worry about including
docs_readable <- lapply(docs, function(doc) {
  words <- vocab[doc[1, ]]
  rep(words, doc[2, ])
})
preview_data <- data.frame(
  documents = sapply(docs_readable, function(words) {
    snippet <- paste(head(words, 20), collapse = " ")
    paste0(snippet, " […]")
  })
)
datatable(preview_data,
          options = list(
            pageLength = 5,
            scrollX = TRUE
          ))

In the table where I described how the text is processed, the final option was to remove words that are too rare or too frequent. To do so, we need a threshold; how rare is too rare? To help us with this, we can produce the figure below. This shows how many documents, words or tokens would be removed if we increased lower.thresh in prepDocuments().

  • Panel 1: Documents removed by threshold. At all threshold values considered (requiring a word to appear in at least 1 to 200 documents), no documents would ever be removed, as prepDocuments() only removes a document if none of its words survive the threshold.

  • Panel 2: Words removed by threshold. This shows how many unique words would be removed from the vocabulary due to being too rare.

  • Panel 3: Tokens removed by threshold. This shows how many individual words (tokens) would be removed.

What we’re looking for here is where the relationships largely flatten out. When the relationships are steep this implies that there are a lot of rare words that won’t help us understand the topics, so we can remove these without much risk. But when they get flatter, then any additional pruning isn’t worth it.

Code
plotRemoved(processed$documents, lower.thresh = seq(1, 200, by = 5))

I think a threshold of somewhere between 30 and 60 seems reasonable across the three panels, so I’ll rerun the prepDocuments() function with lower.thresh = 50.

Code
out <- prepDocuments(processed$documents, processed$vocab, processed$meta, lower.thresh = 50)
Removing 119243 of 123990 terms (359560 of 2298953 tokens) due to frequency 
Your corpus now has 13246 documents, 4747 terms and 1939393 tokens.
Code
docs <- out$documents
vocab <- out$vocab
meta <- out$meta

And with that, we’re done with the data processing.

Choosing K

When we run STMs, we need to specify how many topics there are. If we choose too many, the topics become overly specific and often blur together. Too few, and they’re so broad that it’s hard to identify meaningful themes. We want a \(K\) value that’s just right.

To help with this, the function searchK() runs an STM for each value of \(K\) that we specify, but with a twist: it holds out some documents during training, then checks how well the model predicts them. This helps evaluate predictive performance, which we see in the held-out likelihood plot below.

When we plot the result of searchK() across the different \(K\) values we specified, we’re looking for a balance between:

  • Held-out Likelihood: how well the model generalizes to unseen data (higher is better)

  • Residuals: how much structure is unexplained (lower is better)

  • Semantic Coherence: how well topic words cluster together in real documents (higher is better)

  • Lower Bound: technical measure of model fit (can be ignored for model selection)

What are we looking for?

We’re trying to balance multiple criteria:

  • High held-out likelihood

  • Low residuals

  • High semantic coherence

  • High exclusivity (evaluated separately after modelling)

Note: While exclusivity is important, it is not reported by searchK(). It must be evaluated separately after fitting a model using stm() and can only be done for models that do not include any content covariates (variables that influence \(\beta\)).

Some rules of thumb for what you might see, what it means, and what to do:

  • Coherence ↑ but exclusivity ↓: topics are interpretable but overlapping. Try a slightly lower \(K\).

  • Exclusivity ↑ but coherence ↓: topics are distinct but fragmented or incoherent. Try a slightly lower or slightly higher \(K\).

  • Held-out likelihood levels off: predictive gains are saturated. This is a good point to stop increasing \(K\).

  • Residuals flatten: the model has explained most of the structure. Increasing \(K\) adds noise, not insight.

Note that this might take a long time to run.
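
Once the searchK() call below has finished, you can compare these diagnostics across the candidate \(K\) values. As far as I know, searchK objects have a built-in plot method and keep the raw numbers in $results, but check ?searchK to confirm:

Code
# After res_K <- searchK(...) below has run, compare the diagnostics across K values
# (the plot method and $results slot are my understanding of the stm package)
plot(res_K)    # held-out likelihood, residuals, semantic coherence and lower bound by K
res_K$results  # the raw numbers, if you'd rather build your own ggplot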

Code
res_K <- searchK(docs, vocab, K = c(10, 15, 20, 25), data = meta)
Beginning Spectral Initialization 
     Calculating the gram matrix...
     Finding anchor words...
    ..........
     Recovering initialization...
    ...............................................
Initialization complete.
....................................................................................................
Completed E-Step (4 seconds). 
Completed M-Step. 
Completing Iteration 1 (approx. per word bound = -7.455) 
....................................................................................................
Completed E-Step (4 seconds). 
Completed M-Step. 
Completing Iteration 2 (approx. per word bound = -7.352, relative change = 1.390e-02) 
....................................................................................................
Completed E-Step (4 seconds). 
Completed M-Step. 
Completing Iteration 3 (approx. per word bound = -7.320, relative change = 4.261e-03) 
....................................................................................................
Completed E-Step (3 seconds). 
Completed M-Step. 
Completing Iteration 4 (approx. per word bound = -7.307, relative change = 1.754e-03) 
....................................................................................................
Completed E-Step (3 seconds). 
Completed M-Step. 
Completing Iteration 5 (approx. per word bound = -7.301, relative change = 8.952e-04) 
Topic 1: obama, mccain, campaign, hillari, barack 
 Topic 2: bill, legisl, hous, congress, senat 
 Topic 3: bush, said, presid, report, administr 
 Topic 4: democrat, obama, senat, republican, will 
 Topic 5: one, like, get, just, time 
 Topic 6: will, govern, year, can, new 
 Topic 7: mccain, tax, john, said, will 
 Topic 8: vote, state, elect, voter, democrat 
 Topic 9: iraq, will, war, iran, militari 
 Topic 10: peopl, one, will, american, right 
....................................................................................................
Completed E-Step (3 seconds). 
Completed M-Step. 
Completing Iteration 6 (approx. per word bound = -7.297, relative change = 5.289e-04) 
....................................................................................................
Completed E-Step (3 seconds). 
Completed M-Step. 
Completing Iteration 7 (approx. per word bound = -7.294, relative change = 3.437e-04) 
....................................................................................................
Completed E-Step (3 seconds). 
Completed M-Step. 
Completing Iteration 8 (approx. per word bound = -7.293, relative change = 2.434e-04) 
....................................................................................................
Completed E-Step (3 seconds). 
Completed M-Step. 
Completing Iteration 9 (approx. per word bound = -7.291, relative change = 1.829e-04) 
....................................................................................................
Completed E-Step (3 seconds). 
Completed M-Step. 
Completing Iteration 10 (approx. per word bound = -7.290, relative change = 1.440e-04) 
Topic 1: obama, mccain, campaign, hillari, barack 
 Topic 2: bill, hous, legisl, congress, fund 
 Topic 3: bush, said, report, presid, administr 
 Topic 4: democrat, republican, senat, obama, will 
 Topic 5: one, like, get, just, time 
 Topic 6: will, oil, govern, year, energi 
 Topic 7: mccain, tax, john, palin, said 
 Topic 8: vote, state, voter, elect, poll 
 Topic 9: iraq, war, will, militari, iran 
 Topic 10: peopl, one, american, will, right 
....................................................................................................
Completed E-Step (3 seconds). 
Completed M-Step. 
Completing Iteration 11 (approx. per word bound = -7.289, relative change = 1.179e-04) 
....................................................................................................
Completed E-Step (3 seconds). 
Completed M-Step. 
Completing Iteration 12 (approx. per word bound = -7.289, relative change = 9.910e-05) 
....................................................................................................
Completed E-Step (3 seconds). 
Completed M-Step. 
Completing Iteration 13 (approx. per word bound = -7.288, relative change = 8.634e-05) 
....................................................................................................
Completed E-Step (3 seconds). 
Completed M-Step. 
Completing Iteration 14 (approx. per word bound = -7.288, relative change = 7.801e-05) 
....................................................................................................
Completed E-Step (3 seconds). 
Completed M-Step. 
Completing Iteration 15 (approx. per word bound = -7.287, relative change = 7.245e-05) 
Topic 1: obama, mccain, campaign, barack, hillari 
 Topic 2: bill, hous, legisl, congress, fund 
 Topic 3: bush, said, report, presid, administr 
 Topic 4: democrat, republican, senat, will, polit 
 Topic 5: one, like, get, time, just 
 Topic 6: will, oil, year, govern, energi 
 Topic 7: mccain, palin, john, tax, said 
 Topic 8: vote, state, poll, voter, elect 
 Topic 9: iraq, war, will, militari, iran 
 Topic 10: peopl, one, american, will, right 
....................................................................................................
Completed E-Step (3 seconds). 
Completed M-Step. 
Completing Iteration 16 (approx. per word bound = -7.287, relative change = 6.814e-05) 
....................................................................................................
Completed E-Step (3 seconds). 
Completed M-Step. 
Completing Iteration 17 (approx. per word bound = -7.286, relative change = 6.421e-05) 
....................................................................................................
Completed E-Step (3 seconds). 
Completed M-Step. 
Completing Iteration 18 (approx. per word bound = -7.286, relative change = 6.101e-05) 
....................................................................................................
Completed E-Step (3 seconds). 
Completed M-Step. 
Completing Iteration 19 (approx. per word bound = -7.285, relative change = 5.867e-05) 
....................................................................................................
Completed E-Step (3 seconds). 
Completed M-Step. 
Completing Iteration 20 (approx. per word bound = -7.285, relative change = 5.722e-05) 
Topic 1: obama, campaign, mccain, barack, hillari 
 Topic 2: bill, hous, legisl, congress, money 
 Topic 3: bush, said, report, presid, administr 
 Topic 4: democrat, republican, will, senat, polit 
 Topic 5: one, like, get, time, just 
 Topic 6: will, oil, year, govern, new 
 Topic 7: mccain, palin, john, tax, said 
 Topic 8: vote, state, poll, voter, democrat 
 Topic 9: iraq, war, will, militari, iran 
 Topic 10: peopl, american, one, will, right 
....................................................................................................
Completed E-Step (3 seconds). 
Completed M-Step. 
Completing Iteration 21 (approx. per word bound = -7.284, relative change = 5.625e-05) 
....................................................................................................
Completed E-Step (3 seconds). 
Completed M-Step. 
Completing Iteration 22 (approx. per word bound = -7.284, relative change = 5.558e-05) 
....................................................................................................
Completed E-Step (3 seconds). 
Completed M-Step. 
Completing Iteration 23 (approx. per word bound = -7.284, relative change = 5.505e-05) 
....................................................................................................
Completed E-Step (3 seconds). 
Completed M-Step. 
Completing Iteration 24 (approx. per word bound = -7.283, relative change = 5.459e-05) 
....................................................................................................
Completed E-Step (3 seconds). 
Completed M-Step. 
Completing Iteration 25 (approx. per word bound = -7.283, relative change = 5.381e-05) 
Topic 1: obama, campaign, barack, mccain, hillari 
 Topic 2: hous, bill, legisl, congress, money 
 Topic 3: bush, said, report, presid, administr 
 Topic 4: democrat, will, republican, think, presid 
 Topic 5: one, like, get, time, just 
 Topic 6: will, oil, year, econom, govern 
 Topic 7: mccain, palin, john, said, sen 
 Topic 8: vote, state, poll, voter, democrat 
 Topic 9: iraq, war, will, militari, iran 
 Topic 10: peopl, american, one, will, america 
....................................................................................................
Completed E-Step (3 seconds). 
Completed M-Step. 
Completing Iteration 26 (approx. per word bound = -7.282, relative change = 5.239e-05) 
....................................................................................................
Completed E-Step (3 seconds). 
Completed M-Step. 
Completing Iteration 27 (approx. per word bound = -7.282, relative change = 5.009e-05) 
....................................................................................................
Completed E-Step (3 seconds). 
Completed M-Step. 
Completing Iteration 28 (approx. per word bound = -7.282, relative change = 4.722e-05) 
....................................................................................................
Completed E-Step (3 seconds). 
Completed M-Step. 
Completing Iteration 29 (approx. per word bound = -7.281, relative change = 4.415e-05) 
....................................................................................................
Completed E-Step (3 seconds). 
Completed M-Step. 
Completing Iteration 30 (approx. per word bound = -7.281, relative change = 4.086e-05) 
Topic 1: obama, campaign, barack, hillari, clinton 
 Topic 2: hous, bill, legisl, senat, congress 
 Topic 3: bush, said, report, administr, presid 
 Topic 4: democrat, will, think, republican, presid 
 Topic 5: one, like, get, time, media 
 Topic 6: will, tax, year, oil, econom 
 Topic 7: mccain, john, palin, said, sen 
 Topic 8: vote, state, democrat, poll, voter 
 Topic 9: iraq, war, will, militari, iran 
 Topic 10: peopl, american, one, will, america 
....................................................................................................
Completed E-Step (3 seconds). 
Completed M-Step. 
Completing Iteration 31 (approx. per word bound = -7.281, relative change = 3.714e-05) 
....................................................................................................
Completed E-Step (3 seconds). 
Completed M-Step. 
Completing Iteration 32 (approx. per word bound = -7.281, relative change = 3.300e-05) 
....................................................................................................
Completed E-Step (3 seconds). 
Completed M-Step. 
Completing Iteration 33 (approx. per word bound = -7.280, relative change = 2.842e-05) 
....................................................................................................
Completed E-Step (3 seconds). 
Completed M-Step. 
Completing Iteration 34 (approx. per word bound = -7.280, relative change = 2.430e-05) 
....................................................................................................
Completed E-Step (3 seconds). 
Completed M-Step. 
Completing Iteration 35 (approx. per word bound = -7.280, relative change = 2.064e-05) 
Topic 1: obama, campaign, barack, hillari, clinton 
 Topic 2: hous, bill, senat, legisl, congress 
 Topic 3: bush, said, report, administr, presid 
 Topic 4: think, will, peopl, democrat, presid 
 Topic 5: one, like, get, time, media 
 Topic 6: will, tax, american, year, econom 
 Topic 7: mccain, john, palin, said, sen 
 Topic 8: vote, democrat, state, poll, voter 
 Topic 9: iraq, war, will, militari, iran 
 Topic 10: peopl, american, one, will, america 
....................................................................................................
Completed E-Step (3 seconds). 
Completed M-Step. 
Completing Iteration 36 (approx. per word bound = -7.280, relative change = 1.772e-05) 
....................................................................................................
Completed E-Step (3 seconds). 
Completed M-Step. 
Completing Iteration 37 (approx. per word bound = -7.280, relative change = 1.543e-05) 
....................................................................................................
Completed E-Step (3 seconds). 
Completed M-Step. 
Completing Iteration 38 (approx. per word bound = -7.280, relative change = 1.364e-05) 
....................................................................................................
Completed E-Step (3 seconds). 
Completed M-Step. 
Completing Iteration 39 (approx. per word bound = -7.280, relative change = 1.194e-05) 
....................................................................................................
Completed E-Step (3 seconds). 
Completed M-Step. 
Completing Iteration 40 (approx. per word bound = -7.279, relative change = 1.056e-05) 
Topic 1: obama, campaign, barack, hillari, clinton 
 Topic 2: hous, bill, senat, legisl, congress 
 Topic 3: bush, said, report, administr, presid 
 Topic 4: think, peopl, will, like, presid 
 Topic 5: one, like, get, time, media 
 Topic 6: will, tax, american, econom, year 
 Topic 7: mccain, john, palin, said, sen 
 Topic 8: vote, democrat, state, poll, voter 
 Topic 9: iraq, war, will, militari, iran 
 Topic 10: peopl, american, one, will, america 
....................................................................................................
Completed E-Step (3 seconds). 
Completed M-Step. 
Model Converged 
Beginning Spectral Initialization 
     Calculating the gram matrix...
     Finding anchor words...
    ...............
     Recovering initialization...
    ...............................................
Initialization complete.
....................................................................................................
Completed E-Step (5 seconds). 
Completed M-Step. 
Completing Iteration 1 (approx. per word bound = -7.421) 
....................................................................................................
Completed E-Step (5 seconds). 
Completed M-Step. 
Completing Iteration 2 (approx. per word bound = -7.311, relative change = 1.471e-02) 
....................................................................................................
Completed E-Step (5 seconds). 
Completed M-Step. 
Completing Iteration 3 (approx. per word bound = -7.280, relative change = 4.276e-03) 
....................................................................................................
Completed E-Step (5 seconds). 
Completed M-Step. 
Completing Iteration 4 (approx. per word bound = -7.266, relative change = 1.897e-03) 
....................................................................................................
Completed E-Step (4 seconds). 
Completed M-Step. 
Completing Iteration 5 (approx. per word bound = -7.259, relative change = 1.026e-03) 
Topic 1: obama, mccain, campaign, barack, john 
 Topic 2: bill, legisl, vote, congress, hous 
 Topic 3: bush, presid, said, report, hous 
 Topic 4: democrat, senat, republican, obama, will 
 Topic 5: get, one, like, ’re, ’ll 
 Topic 6: will, govern, can, peopl, american 
 Topic 7: tax, mccain, economi, will, econom 
 Topic 8: vote, elect, state, voter, court 
 Topic 9: will, oil, attack, russia, govern 
 Topic 10: one, will, peopl, like, can 
 Topic 11: school, citi, said, report, polic 
 Topic 12: palin, think, say, like, mccain 
 Topic 13: obama, hillari, clinton, mccain, democrat 
 Topic 14: iran, israel, bush, presid, nuclear 
 Topic 15: iraq, iraqi, war, troop, militari 
....................................................................................................
Completed E-Step (4 seconds). 
Completed M-Step. 
Completing Iteration 6 (approx. per word bound = -7.254, relative change = 6.118e-04) 
....................................................................................................
Completed E-Step (4 seconds). 
Completed M-Step. 
Completing Iteration 7 (approx. per word bound = -7.252, relative change = 3.915e-04) 
....................................................................................................
Completed E-Step (4 seconds). 
Completed M-Step. 
Completing Iteration 8 (approx. per word bound = -7.250, relative change = 2.670e-04) 
....................................................................................................
Completed E-Step (4 seconds). 
Completed M-Step. 
Completing Iteration 9 (approx. per word bound = -7.248, relative change = 1.936e-04) 
....................................................................................................
Completed E-Step (4 seconds). 
Completed M-Step. 
Completing Iteration 10 (approx. per word bound = -7.247, relative change = 1.485e-04) 
Topic 1: obama, mccain, campaign, barack, john 
 Topic 2: bill, vote, legisl, congress, senat 
 Topic 3: bush, presid, said, administr, report 
 Topic 4: democrat, republican, senat, parti, will 
 Topic 5: get, one, ’re, like, don’t 
 Topic 6: will, govern, can, need, american 
 Topic 7: tax, economi, econom, plan, health 
 Topic 8: vote, elect, state, voter, court 
 Topic 9: will, attack, govern, russia, pakistan 
 Topic 10: one, will, peopl, world, time 
 Topic 11: school, citi, report, offic, state 
 Topic 12: think, peopl, like, say, know 
 Topic 13: obama, hillari, clinton, poll, democrat 
 Topic 14: iran, israel, nuclear, bush, state 
 Topic 15: iraq, war, iraqi, troop, militari 
....................................................................................................
Completed E-Step (4 seconds). 
Completed M-Step. 
Completing Iteration 11 (approx. per word bound = -7.246, relative change = 1.196e-04) 
....................................................................................................
Completed E-Step (4 seconds). 
Completed M-Step. 
Completing Iteration 12 (approx. per word bound = -7.246, relative change = 1.009e-04) 
....................................................................................................
Completed E-Step (4 seconds). 
Completed M-Step. 
Completing Iteration 13 (approx. per word bound = -7.245, relative change = 8.767e-05) 
....................................................................................................
Completed E-Step (4 seconds). 
Completed M-Step. 
Completing Iteration 14 (approx. per word bound = -7.244, relative change = 7.691e-05) 
....................................................................................................
Completed E-Step (4 seconds). 
Completed M-Step. 
Completing Iteration 15 (approx. per word bound = -7.244, relative change = 6.649e-05) 
Topic 1: obama, mccain, campaign, barack, john 
 Topic 2: bill, vote, congress, legisl, senat 
 Topic 3: bush, presid, said, administr, report 
 Topic 4: democrat, republican, senat, parti, will 
 Topic 5: get, one, like, ’re, don’t 
 Topic 6: will, can, govern, american, need 
 Topic 7: tax, econom, plan, economi, financi 
 Topic 8: vote, elect, state, voter, campaign 
 Topic 9: attack, govern, will, russia, pakistan 
 Topic 10: one, will, life, peopl, world 
 Topic 11: school, report, citi, group, new 
 Topic 12: think, peopl, like, say, know 
 Topic 13: obama, hillari, clinton, poll, democrat 
 Topic 14: iran, israel, nuclear, state, presid 
 Topic 15: iraq, war, iraqi, militari, troop 
....................................................................................................
Completed E-Step (4 seconds). 
Completed M-Step. 
Completing Iteration 16 (approx. per word bound = -7.243, relative change = 5.662e-05) 
....................................................................................................
Completed E-Step (4 seconds). 
Completed M-Step. 
Completing Iteration 17 (approx. per word bound = -7.243, relative change = 4.779e-05) 
....................................................................................................
Completed E-Step (4 seconds). 
Completed M-Step. 
Completing Iteration 18 (approx. per word bound = -7.243, relative change = 4.055e-05) 
....................................................................................................
Completed E-Step (4 seconds). 
Completed M-Step. 
Completing Iteration 19 (approx. per word bound = -7.243, relative change = 3.469e-05) 
....................................................................................................
Completed E-Step (4 seconds). 
Completed M-Step. 
Completing Iteration 20 (approx. per word bound = -7.242, relative change = 3.029e-05) 
Topic 1: obama, mccain, campaign, john, barack 
 Topic 2: bill, vote, congress, legisl, senat 
 Topic 3: bush, presid, said, administr, report 
 Topic 4: democrat, republican, senat, parti, will 
 Topic 5: get, one, like, ’re, don’t 
 Topic 6: will, can, american, need, make 
 Topic 7: tax, econom, plan, economi, financi 
 Topic 8: vote, elect, voter, state, campaign 
 Topic 9: attack, govern, will, russia, terrorist 
 Topic 10: one, will, life, world, time 
 Topic 11: school, report, group, citi, new 
 Topic 12: think, peopl, like, dont, know 
 Topic 13: obama, hillari, clinton, poll, democrat 
 Topic 14: iran, israel, nuclear, state, terrorist 
 Topic 15: iraq, war, iraqi, militari, troop 
....................................................................................................
Completed E-Step (4 seconds). 
Completed M-Step. 
Completing Iteration 21 (approx. per word bound = -7.242, relative change = 2.703e-05) 
....................................................................................................
Completed E-Step (4 seconds). 
Completed M-Step. 
Completing Iteration 22 (approx. per word bound = -7.242, relative change = 2.446e-05) 
....................................................................................................
Completed E-Step (4 seconds). 
Completed M-Step. 
Completing Iteration 23 (approx. per word bound = -7.242, relative change = 2.274e-05) 
....................................................................................................
Completed E-Step (4 seconds). 
Completed M-Step. 
Completing Iteration 24 (approx. per word bound = -7.242, relative change = 2.077e-05) 
....................................................................................................
Completed E-Step (4 seconds). 
Completed M-Step. 
Completing Iteration 25 (approx. per word bound = -7.241, relative change = 1.961e-05) 
Topic 1: obama, mccain, campaign, john, barack 
 Topic 2: bill, vote, congress, legisl, senat 
 Topic 3: bush, presid, said, administr, hous 
 Topic 4: democrat, republican, senat, parti, obama 
 Topic 5: get, one, like, ’re, don’t 
 Topic 6: will, american, can, need, chang 
 Topic 7: tax, econom, plan, money, million 
 Topic 8: vote, elect, voter, state, campaign 
 Topic 9: attack, govern, will, terrorist, russia 
 Topic 10: one, will, life, world, day 
 Topic 11: report, school, new, group, citi 
 Topic 12: think, peopl, like, dont, say 
 Topic 13: obama, hillari, clinton, poll, democrat 
 Topic 14: iran, israel, nuclear, state, foreign 
 Topic 15: iraq, war, iraqi, militari, troop 
....................................................................................................
Completed E-Step (4 seconds). 
Completed M-Step. 
Completing Iteration 26 (approx. per word bound = -7.241, relative change = 1.764e-05) 
....................................................................................................
Completed E-Step (4 seconds). 
Completed M-Step. 
Completing Iteration 27 (approx. per word bound = -7.241, relative change = 1.613e-05) 
....................................................................................................
Completed E-Step (4 seconds). 
Completed M-Step. 
[Remaining model-search output truncated: the runs for K = 15, 20 and 25 topics carry on in the same way, reporting the approximate per-word bound at each EM iteration and printing the top words per topic every five iterations, before each ends with "Model Converged".]

Code
plot(res_K)

From the above, my feeling is that \(K\) somewhere between 15 and 25 is the range worth considering. However, coherence drops quite rapidly as \(K\) increases, so I might opt for \(K = 15\) in this case. You may well disagree. It’s a subjective choice in the end, so it’s fine if you think it should be 25, or 10, or whatever you prefer.
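
If you’d rather see the numbers behind that plot than eyeball the panels, they’re stored in the results element (this assumes res_K came from searchK(), which I believe is how it was created above):

Code
res_K$results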

Fit STM with no Covariates

We can now fit our first STM. Here we won’t use any covariates on topic prevalence. Just a very simple model to start off with.

Code
model1 <- stm(documents = docs,
              vocab = vocab,
              K = 15,
              data = meta,
              init.type = "Spectral", 
              verbose = FALSE)

Explore

We can start by doing some checks of the model, beginning with topic coherence. This gives us a measure of how often the top words of each topic co-occur in the documents (e.g. how often do “lynx” and “rewilding” appear together in the same document?). In stm, semantic coherence is measured on a negative scale, so closer to 0 means higher coherence. That is, a topic with a score of -40 is more coherent than one with -80.

Each value corresponds to one of the topics.

Code
semanticCoherence(model1, docs)
 [1] -54.01384 -46.18649 -52.19125 -73.64842 -45.21362 -57.09193 -79.88974
 [8] -64.34934 -64.93084 -89.65010 -80.10995 -54.54471 -39.24183 -72.45012
[15] -63.94915

The output is organised from topic 1 to topic 15. So here, it seems like topic 13 has the best coherence, while topic 10 has the worst.
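
If you’d rather not pick these out by eye, a quick sketch like this does it for you (coh is just a temporary vector):

Code
coh <- semanticCoherence(model1, docs)
which.max(coh)  # closest to zero, i.e. most coherent (topic 13 here)
which.min(coh)  # most negative, i.e. least coherent (topic 10 here)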

We can pair this with exclusivity. As a reminder, this is a measure of how distinct a topic’s top words are compared to other topics.

Code
exclusivity(model1)
 [1] 9.828369 9.693257 9.583936 9.675580 9.268534 9.707071 9.663757 9.709532
 [9] 8.841380 9.434878 9.455630 9.673962 9.451958 9.566483 9.691479

Here, we see that topic 1 has the highest exclusivity.

These outputs are useful, but they’re easier to read when plotted together. We can extract both sets of values and plot one against the other. The “ideal” topics will sit in the top right: topics with both high coherence and high exclusivity.

Code
coh <- semanticCoherence(model1, docs)
exc <- exclusivity(model1)

df <- data.frame(Topic = 1:length(coh), Coherence = coh, Exclusivity = exc)

ggplot(df, aes(x = Coherence, y = Exclusivity, label = Topic)) +
  geom_point() +
  geom_text(nudge_y = 0.05, size = 3, check_overlap = TRUE) +
  labs(title = "Semantic Coherence vs. Exclusivity",
       x = "Semantic Coherence", y = "Exclusivity") +
  theme_minimal()

STM provides different ways to list the top words in each topic. Each highlights different characteristics of a topic:

Label Type           | What It Does                                                    | Good For…
Highest Probability  | Ranks words by how often they appear in the topic               | General sense of what dominates the topic
FREX                 | Balances frequency and exclusivity (shared control parameter)  | Clear topic interpretation with distinct words
Lift                 | Highlights words that are rare overall but frequent in topic   | Spotting unique, niche vocabulary
Score                | Bayesian log odds of word being in this topic vs. others       | Emphasizing topic-distinguishing words
Code
labelTopics(model1, n = 5)
Topic 1 Top Words:
     Highest Prob: mccain, obama, campaign, john, palin 
     FREX: palin, mccain, biden, mccain’, sarah 
     Lift: oct, palin, biden, “mccain, mccain” 
     Score: mccain, obama, palin, oct, campaign 
Topic 2 Top Words:
     Highest Prob: bill, vote, democrat, senat, congress 
     FREX: legisl, pelosi, bill, amend, congress 
     Lift: legisl, telecom, co-sponsor, pelosi, fisa 
     Score: legisl, vote, congress, bill, republican 
Topic 3 Top Words:
     Highest Prob: obama, senat, will, polit, barack 
     FREX: lieberman, blagojevich, governor, illinoi, rezko 
     Lift: blagojevich, chairmanship, rahm, blago, jindal 
     Score: chairmanship, obama, blagojevich, lieberman, rezko 
Topic 4 Top Words:
     Highest Prob: bush, said, presid, administr, hous 
     FREX: tortur, cheney, justic, depart, interrog 
     Lift: mukasey, waterboard, gonzal, interrog, mcclellan 
     Score: mcclellan, tortur, interrog, detaine, attorney 
Topic 5 Top Words:
     Highest Prob: get, one, like, ’re, don’t 
     FREX: ’ll, doesn’t, ’re, didn’t, don’t 
     Lift: widget, see-dubya, ingraham, beck, vis-avi 
     Score: ’re, widget, ’ll, obama’, don’t 
Topic 6 Top Words:
     Highest Prob: elect, vote, voter, state, campaign 
     FREX: franken, ballot, coleman, acorn, registr 
     Lift: absente, chambliss, nrsc, registr, canvass 
     Score: canvass, franken, ballot, coleman, vote 
Topic 7 Top Words:
     Highest Prob: oil, energi, will, price, global 
     FREX: energi, oil, drill, global, warm 
     Lift: mugab, carbon, gallon, gasolin, emiss 
     Score: mugab, oil, energi, drill, price 
Topic 8 Top Words:
     Highest Prob: tax, will, econom, economi, govern 
     FREX: tax, health, financi, mortgag, economi 
     Lift: gramm, lender, mortgag, aig, fanni 
     Score: gramm, tax, mortgag, billion, bailout 
Topic 9 Top Words:
     Highest Prob: one, will, women, life, peopl 
     FREX: film, life, allah, movi, god 
     Lift: vers, muhammad, allah, film, allah’ 
     Score: muhammad, allah, film, vers, women 
Topic 10 Top Words:
     Highest Prob: attack, govern, will, terrorist, kill 
     FREX: russian, pakistan, russia, pakistani, taliban 
     Lift: pakistani, georgian, putin, russian, ukrain 
     Score: ossetia, russian, pakistan, russia, taliban 
Topic 11 Top Words:
     Highest Prob: report, time, new, group, york 
     FREX: student, school, univers, newspap, publish 
     Lift: berkeley, campus, dohrn, copyright, annenberg 
     Score: berkeley, ayer, school, polic, student 
Topic 12 Top Words:
     Highest Prob: obama, hillari, clinton, democrat, poll 
     FREX: hillari, clinton, romney, primari, deleg 
     Lift: zogbi, super-deleg, superdeleg, uncommit, hillari 
     Score: hillari, zogbi, obama, poll, clinton 
Topic 13 Top Words:
     Highest Prob: think, peopl, like, dont, polit 
     FREX: dont, linktocommentspostcount, postcounttb, that, wright 
     Lift: hage, gasbag, digbyi, wingnut, youd 
     Score: hage, wright, dont, hes, linktocommentspostcount 
Topic 14 Top Words:
     Highest Prob: iran, israel, nuclear, state, polici 
     FREX: israel, iran, hama, isra, iranian 
     Lift: palestinian, ahmadinejad, hama, israel, bolton 
     Score: bolton, iran, israel, iranian, hama 
Topic 15 Top Words:
     Highest Prob: iraq, war, iraqi, militari, troop 
     FREX: iraqi, iraq, troop, withdraw, petraeus 
     Lift: basra, maliki, maliki’, nouri, al-maliki 
     Score: sadr, iraq, iraqi, troop, maliki 
Code
plot(model1, type = "summary")

We can also extract passages that exemplify specific topics. This is how we learn what each topic is probably describing: we know the passage is strongly associated with topic \(k\), so what is it actually about? For example, here’s a passage specific to topic 1.

Code
findThoughts(model1, texts = meta$documents, n = 1, topics = 1)

 Topic 1: 
     Oh, this is fun. Today the McCain campaign held a conference call unveiling a new "truth squad" Web site designed to defend McCain from attacks on his military record.   This was in response to Wes Clark's  claim yesterday that McCain lacks the necessary experience to be President, which wasn't an attack on McCain's military record at all.  Be that as it may,  on the call, the McCain camp rolled out a leading surrogate named Bud Day -- who was described merely as a fellow POW of McCain -- who blasted such attacks. "John was slandered and reviled in the 2000 campaign in a way that denigrated his service enormously...it was absolutely important to face this issue right off the bat."  But guess what -- it turns out that this very same Bud Day was  featured in the Swift Boat Vets ads attacking John Kerry in 2004!  To make matters even better, recall that McCain himself  condemned the Swift Boat Vets. Yet now the McCain campaign is cheerfully enlisting someone who did what McCain claimed to decry -- attacks on Kerry's credentials -- and using him to defend McCain against the same sort of attacks.  That's a good one.  Late Update: As  Ben Smith notes, on the call Day defended his Swift Boat Vets work as being "about laying out the truth."  Late Update: Here's the audio from the conference call:                                                                                                      Click To Play

This returns the document(s) that best represent a given topic. In this example, we ask for the most representative document (n = 1) for topic 1. Reading this helps us assign a human label to the topic.

With all of these summaries, I can start to get the sense that topic 1 has a theme related to Obama, McCain and their presidential campaigns. Therefore I might label this as Obama & McCain Campaigns.
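
Once you settle on a label, it’s worth keeping a record you can reuse later (e.g. when labelling figures). A minimal sketch; the object name my_labels is just a placeholder:

Code
my_labels <- c("Topic 1" = "Obama & McCain Campaigns")
# ...add the rest as you interpret each topic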

Cleaner figure

The above figure is fine, but it’s a bit bland. Here’s a slightly jazzed up version:

Code
# theta is a documents x topics matrix: mean and SD of each topic's proportion
topic_props <- colMeans(model1$theta)
topic_sd <- apply(model1$theta, 2, sd)

topic_df <- data.frame(
  Topic = factor(1:length(topic_props)),
  Mean = topic_props,
  SD = topic_sd
)

# Label each topic with its top three FREX words
top_frex <- labelTopics(model1, n = 3)$frex
topic_labels <- apply(top_frex, 1, function(words) paste(words, collapse = ", "))

topic_df$Label <- paste0("T", topic_df$Topic, ": ", topic_labels)

ggplot(topic_df, aes(x = Label, y = Mean)) +
  geom_col(fill = "#00A68A") +
  geom_errorbar(aes(ymin = Mean - SD, ymax = Mean + SD), width = 0.2, color = "white") +
  labs(x = "Topic",
       y = "Expected %\nof Corpus") +
  scale_y_continuous(labels = scales::percent) +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

We can also use the results to create word clouds, showing the words that best capture the topic.

Code
library(wordcloud)
cloud(model1, topic = 1)

Or the jazzier version:

Code
topic_id <- 1
top_words <- labelTopics(model1, n = 50)$frex[topic_id, ]

# Word probabilities for this topic (logbeta is on the log scale, so exponentiate)
beta_matrix <- exp(model1$beta$logbeta[[1]])
word_probs <- beta_matrix[topic_id, ]

library(dplyr)  # for filter()
vocab <- model1$vocab
df <- data.frame(
  word = vocab,
  prob = word_probs
) |> 
  filter(word %in% top_words)

library(ggwordcloud)
ggplot(df, aes(label = word, size = prob)) +
  geom_text_wordcloud(area_corr = TRUE, color = "#00A68A") +
  scale_size_area(max_size = 50) +
  theme_minimal()

These tools allow us to evaluate the structure and interpretability of our STM model. From here, we can either adjust the number of topics, clean the vocabulary further, or move on to analyzing topic prevalence with covariates.

Fit STM with Covariates

Now that we’ve explored the basic STM model, we can move on to a more powerful feature: including covariates. This lets us ask questions like:

  • Does political leaning affect which topics are used?

  • Do certain topics become more or less common over time?

We specify covariates in the same way as you did in BI3010, using formulas like ~ rating + day. But here we’re going to do something slightly different with time: we assume that the effect of day isn’t a straight line. Instead, it might wiggle: topic prevalence might rise, dip, and rise again over time (yes, the technical term for this really is wiggly).

To model this, we use a smooth term: s(day, df = 5).

  • s() means “fit a smooth curve”

  • df stands for degrees of freedom, which controls how wiggly the curve can be

  • A larger df allows more bends; a smaller one keeps the curve gentle

There’s no exact science to choosing df. In this case, I picked 5 because I have 5 fingers, and because the 2008 presidential campaign probably had some changes in topic focus, but not pure chaos every day.
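
To make that concrete, here is roughly what the prevalence formula would look like with a gentler versus a wigglier smooth (illustrative only; we’ll use df = 5 in the model below):

Code
~ rating + s(day, df = 3)   # gentler curve, fewer bends allowed
~ rating + s(day, df = 10)  # wigglier curve, more bends allowed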

We can now include these covariates in the model, using the same formula syntax as in BI3010 (e.g. y ~ x + z). Here we’ll fit a model where prevalence (\(\theta\)) depends on rating and on a smooth function of day.

Code
model2 <- stm(documents = docs,
              vocab = vocab,
              K = 15,
              prevalence = ~ rating + s(day, df = 5),
              data = meta,
              init.type = "Spectral",
              verbose = FALSE)

Explore

We can begin by checking the coherence of each topic, just as we did before.

Code
semanticCoherence(model2, docs)
 [1] -58.27322 -48.64337 -50.42340 -75.48279 -44.66076 -59.90875 -78.18520
 [8] -65.36588 -64.93084 -89.65010 -72.62134 -54.95509 -39.24183 -72.45012
[15] -63.94915

These values are again negative, so closer to 0 means higher coherence. Here, topic 13 (roughly -39) is the most coherent, while topic 10 (roughly -90) is the least.
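
If you want a rough sense of whether adding covariates shifted overall coherence, you can compare the two models’ averages (a quick sketch):

Code
mean(semanticCoherence(model1, docs))
mean(semanticCoherence(model2, docs))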

We can pair this with exclusivity to see how distinct each topic is.

Code
exclusivity(model2)
 [1] 9.774044 9.716968 9.544814 9.705059 9.271672 9.712446 9.656976 9.647156
 [9] 8.853296 9.437992 9.354623 9.666774 9.475512 9.552169 9.689942

Let’s visualise these two together to spot “ideal” topics, the ones that are both coherent and exclusive.

Code
coh <- semanticCoherence(model2, docs)
exc <- exclusivity(model2)

df <- data.frame(Topic = 1:length(coh), Coherence = coh, Exclusivity = exc)

ggplot(df, aes(x = Coherence, y = Exclusivity, label = Topic)) +
  geom_point() +
  geom_text(nudge_y = 0.05, size = 3, check_overlap = TRUE) +
  labs(x = "Semantic Coherence (closer to 0 is better)", 
       y = "Exclusivity (higher is better)") +
  theme_minimal()

And we can check the top words per topic for our new model:

Code
labelTopics(model2, n = 5)
Topic 1 Top Words:
     Highest Prob: mccain, campaign, obama, john, palin 
     FREX: palin, mccain, biden, mccain’, sarah 
     Lift: oct, palin, “mccain, mccain”, sarah 
     Score: mccain, palin, oct, obama, mccain’ 
Topic 2 Top Words:
     Highest Prob: democrat, bill, vote, senat, republican 
     FREX: legisl, pelosi, bill, reid, amend 
     Lift: legisl, telecom, pelosi, co-sponsor, demint 
     Score: legisl, vote, congress, republican, pelosi 
Topic 3 Top Words:
     Highest Prob: obama, barack, campaign, senat, will 
     FREX: illinoi, barack, blagojevich, chicago, lieberman 
     Lift: chairmanship, blagojevich, rahm, blago, rezko 
     Score: obama, chairmanship, blagojevich, barack, lieberman 
Topic 4 Top Words:
     Highest Prob: bush, presid, said, administr, hous 
     FREX: tortur, cheney, justic, attorney, interrog 
     Lift: mukasey, waterboard, gonzal, interrog, mcclellan 
     Score: mcclellan, tortur, detaine, interrog, attorney 
Topic 5 Top Words:
     Highest Prob: get, one, like, ’re, don’t 
     FREX: doesn’t, ’ll, didn’t, ’re, don’t 
     Lift: widget, see-dubya, ingraham, vis-avi, media’ 
     Score: ’re, widget, ’ll, obama’, don’t 
Topic 6 Top Words:
     Highest Prob: elect, vote, voter, state, republican 
     FREX: franken, ballot, coleman, acorn, registr 
     Lift: absente, chambliss, nrsc, registr, canvass 
     Score: canvass, franken, ballot, coleman, vote 
Topic 7 Top Words:
     Highest Prob: oil, energi, will, price, global 
     FREX: energi, oil, drill, climat, warm 
     Lift: carbon, mugab, gallon, gasolin, anwr 
     Score: mugab, oil, energi, drill, price 
Topic 8 Top Words:
     Highest Prob: tax, will, econom, economi, govern 
     FREX: tax, health, financi, mortgag, economi 
     Lift: fanni, gramm, lender, mortgag, aig 
     Score: gramm, tax, mortgag, billion, bailout 
Topic 9 Top Words:
     Highest Prob: one, will, women, life, peopl 
     FREX: film, life, allah, god, women 
     Lift: vers, allah, muhammad, film, allah’ 
     Score: muhammad, allah, film, vers, women 
Topic 10 Top Words:
     Highest Prob: govern, attack, will, terrorist, kill 
     FREX: russian, pakistan, russia, pakistani, taliban 
     Lift: pakistani, bhutto, moscow, musharraf, putin 
     Score: ossetia, russian, pakistan, russia, taliban 
Topic 11 Top Words:
     Highest Prob: report, time, new, stori, group 
     FREX: student, ayer, school, newspap, univers 
     Lift: berkeley, dohrn, campus, annenberg, copyright 
     Score: berkeley, ayer, school, polic, student 
Topic 12 Top Words:
     Highest Prob: obama, hillari, clinton, democrat, poll 
     FREX: hillari, romney, clinton, primari, deleg 
     Lift: zogbi, super-deleg, superdeleg, uncommit, romney 
     Score: hillari, zogbi, obama, poll, clinton 
Topic 13 Top Words:
     Highest Prob: think, peopl, like, dont, say 
     FREX: dont, linktocommentspostcount, postcounttb, that, didnt 
     Lift: hage, gasbag, digbyi, youd, tristero 
     Score: hage, wright, dont, linktocommentspostcount, postcounttb 
Topic 14 Top Words:
     Highest Prob: iran, israel, nuclear, state, polici 
     FREX: israel, iran, isra, hama, iranian 
     Lift: palestinian, ahmadinejad, gaza, hama, isra 
     Score: bolton, iran, israel, iranian, hama 
Topic 15 Top Words:
     Highest Prob: iraq, war, iraqi, militari, troop 
     FREX: iraqi, iraq, troop, withdraw, petraeus 
     Lift: basra, maliki, maliki’, al-maliki, al-sadr 
     Score: sadr, iraq, iraqi, troop, maliki 

And, as before, we can extract text that exemplifies each topic — a helpful way to figure out what a topic might mean.

Code
findThoughts(model2, texts = meta$documents, n = 1, topics = 1)

 Topic 1: 
     Infuriated About Tough CNN Interview, McCain Cancels Larry King Appearance                     Yesterday, Sen. John McCain’s (R-AZ) campaign spokesman Tucker Bounds appeared on CNN for an interview with Campbell Brown. Brown was tough on Bounds, refusing to let him spout typical campaign talking points. She repeatedly pressed him on Palin’s foreign policy experience and qualifications, asking him to name one decision that she made as commander-in-chief of the Alaskan National Guard. Bounds was unable to do so. Today, CNN’s Wolf Blitzer revealed that because of that tough interview, the McCain campaign has canceled the senator’s appearance on Larry King Live tonight: The McCain campaign said it believed that exchange was over the line and as a result the interview scheduled for Larry King Live with Sen. McCain was pulled. CNN does not believe that Campbell’s interview was over the line. We are committed to fair coverage of both sides of this historic election. CNN also replayed the interview between Brown and Bounds. Watch Blitzer’s announcement and the interview: The McCain campaign has repeatedly tried to intimidate the press. It is now angry about media coverage of Bristol Palin’s pregnancy, calling NBC’s reporting on it “irresponsible journalism.” Campaign staffers “even considered pulling out of one of the three presidential debates because it would be moderated by Tom Brokaw, a former NBC News anchorman.” When Newsweek wrote a cover story in May examining the hardball tactics conservatives might use in the general election, the McCain campaign “threatened to throw the magazine’s reporters off the campaign bus and airplane.” Digg It!

This returns the most representative document for the selected topic. Reading it helps guide our interpretation. In this new model, Topic 1 seems to focus more narrowly on McCain’s campaign (and attacks on it), rather than on both Obama and McCain.

Cleaner figure

We can recreate the topic prevalence plot using ggplot2, just like before.

Code
topic_props <- colMeans(model2$theta)
topic_sd <- apply(model2$theta, 2, sd)

topic_df <- data.frame(
  Topic = factor(1:length(topic_props)),
  Mean = topic_props,
  SD = topic_sd
)

top_frex <- labelTopics(model2, n = 3)$frex
topic_labels <- apply(top_frex, 1, function(words) paste(words, collapse = ", "))

topic_df$Label <- paste0("T", topic_df$Topic, ": ", topic_labels)

ggplot(topic_df, aes(x = Label, y = Mean)) +
  geom_col(fill = "#00A68A") +
  geom_errorbar(aes(ymin = Mean - SD, ymax = Mean + SD), width = 0.2, color = "white") +
  labs(x = "Topic", y = "Expected %\nof Corpus") +
  scale_y_continuous(labels = scales::percent) +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

Word Clouds

Let’s also generate word clouds for a topic in the model with covariates.

Code
topic_id <- 1
top_words <- labelTopics(model2, n = 50)$frex[topic_id, ]

beta_matrix <- exp(model2$beta$logbeta[[1]])
word_probs <- beta_matrix[topic_id, ]

vocab <- model2$vocab
df <- data.frame(
  word = vocab,
  prob = word_probs
) |> 
  filter(word %in% top_words)

ggplot(df, aes(label = word, size = prob)) +
  geom_text_wordcloud(area_corr = TRUE, color = "#00A68A") +
  scale_size_area(max_size = 50) +
  theme_minimal()

Estimate and Plot Covariate Effects

Now that we’ve included covariates, we can test and plot how topic prevalence varies by metadata. This is where STM becomes really powerful.

Code
effect_model <- estimateEffect(1:15 ~ rating + s(day), model2,
                               metadata = meta, uncertainty = "Global")
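
Before plotting, it can be worth glancing at the regression table for a particular topic, say Topic 3. A quick sketch (I believe summary() on an estimateEffect object accepts a topics argument):

Code
summary(effect_model, topics = 3)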

We can use this to plot how a topic’s prevalence changes over time. For example, how did Topic 7 (the oil and energy topic) change across 2008?

Code
plot(effect_model, "day", method = "continuous", topics = 7,
     printlegend = FALSE, xlab = "Time (2008)", xaxt = "n")

# Relabel the numeric day axis with month names for 2008
monthseq <- seq(from = as.Date("2008-01-01"), to = as.Date("2008-12-01"), by = "month")
monthnames <- months(monthseq)
axis(1, at = as.numeric(monthseq) - min(as.numeric(monthseq)), labels = monthnames)

We can also ask whether a topic is more or less prevalent in Liberal versus Conservative documents, using method = "difference". For example, for Topic 3:

Code
plot(effect_model, covariate = "rating", topics = 3, method = "difference",
     cov.value1 = "Liberal", cov.value2 = "Conservative",
     xlab = "<- More Liberal | More Conservative ->",
     main = "Topic 3: Difference by Political Rating")

These figures are… ok but I think we can do better. Here’s my attempt:
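
First we need the estimates in a data frame (plot_data below); estimateEffect() doesn’t hand us one directly. Here’s a minimal sketch of one way to build it, assuming the tidystm package (GitHub-only: mikajoh/tidystm) and dplyr; the topics I pull out (7, 12 and 15) are just an illustrative pick, so swap in whichever you want to show.

Code
# install with: remotes::install_github("mikajoh/tidystm")
library(tidystm)
library(dplyr)

plot_data <- extract.estimateEffect(
  x = effect_model,
  covariate = "day",
  model = model2,
  method = "continuous"
) |>
  filter(topic %in% c(7, 12, 15)) |>             # illustrative choice of topics
  mutate(topic = factor(topic)) |>
  rename(day = covariate.value, est = estimate)  # match the column names used below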

Code
ggplot(plot_data, aes(x = day, y = est, color = topic)) +
  geom_line(linewidth = 1) +
  geom_ribbon(aes(ymin = ci.lower, ymax = ci.upper, fill = topic), alpha = 0.15, colour = NA) +
  labs(x = "Time",
       y = "Estimated Topic Proportion",
       color = "Topic",
       fill = "Topic") +
  theme_minimal(base_size = 13)

Summary

Well done. You’ve just worked through a lot of material. Structural Topic Modelling (STM) is no small thing. It combines ideas from text analysis, statistics, and machine learning, and asks you to think about how documents are structured and what they’re saying. That’s a big ask, and if your brain feels a little full right now, that’s completely normal.

To recap the essentials:

  • STMs help uncover the hidden themes that exist across a collection of texts, what we call topics.

  • Each document is treated as a mix of topics, and each topic is a mix of words. That’s the heart of it.

  • Metadata (like time or political leaning) can be used to explain why certain topics appear more in some documents than others — which turns STM from an exploratory tool into an analytical one.

  • The model is hierarchical and probabilistic, but you don’t need to master every equation to use it well.

  • Topic interpretation is a human task. The model helps reveal structure, but it’s your judgement that assigns meaning.

If things still feel murky, that’s okay. The best way to solidify your understanding is to try it out with your own data. Poke it, explore it, get confused, get curious. You’ll get there.

And remember: I’m just a message or meeting away. You’re not in this alone.