Topic modeling, and specifically latent Dirichlet allocation (LDA), is a fancy machine learning method used in computational social science and the digital humanities to explore large collections of documents. I’ve used it a bit myself.
Benjamin Schmidt, in an article that’s already five years old, has some great points about the caveats of using LDA:
The idea that topics are meaningful rests, in large part, on assumptions of their coherence derived from checking the list of the most frequent words. But the top few words in a topic only give a small sense of the thousands of the words that constitute the whole probability distribution.
He demonstrates this with a clever example, using LDA to cluster ship voyages by treating ship logs as documents and locations derived from them as a vocabulary. It would be interesting if similar problems could somehow be demonstrated with more conventional corpora as well.
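The point about top words can be illustrated with a toy sketch (my own, not Schmidt’s example): in LDA, a topic is a full probability distribution over the vocabulary, typically drawn from a symmetric Dirichlet prior. The vocabulary size and concentration parameter below are illustrative assumptions, but they show how little of a topic’s probability mass the top ten words can account for.

```python
import random

random.seed(42)
VOCAB_SIZE = 10_000
ALPHA = 0.01  # small alpha gives sparse-looking topics, as in typical LDA priors

# Sample Dirichlet(ALPHA, ..., ALPHA) by normalizing independent Gamma draws.
gammas = [random.gammavariate(ALPHA, 1.0) for _ in range(VOCAB_SIZE)]
total = sum(gammas)
topic = sorted((g / total for g in gammas), reverse=True)

# Mass covered by the ten most probable words.
top10_mass = sum(topic[:10])

# How many words are needed to cover 99% of the topic's probability mass?
cumulative, words_for_99 = 0.0, 0
for p in topic:
    cumulative += p
    words_for_99 += 1
    if cumulative >= 0.99:
        break

print(f"top-10 words cover {top10_mass:.1%} of the topic's mass")
print(f"{words_for_99} words are needed to cover 99% of it")
```

Even with a sparsity-inducing prior, the ten most frequent words carry only a modest share of the mass, and hundreds of words are needed before the distribution is mostly accounted for — which is the gap between the word list and the topic itself.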
Schmidt also has a few things to say about using machine learning in the humanities more broadly:
Perhaps humanists who only apply one algorithm to their texts should be using LDA. — But “one” is a funny number to choose. Most humanists are better off applying zero computer programs, and most of the remainder should not be limiting themselves to a single method.
And:
Although quantification of textual data offers benefits to scholars, there is a great deal to be said for the sort of quantification humanists do being simple.
I think LDA is a promising method and hope to be able to explore what it’s actually useful for in the near future. But I also think Schmidt makes a good point that we should aim to work with simple methods whenever possible. The idea that one should turn to more complicated methods (which tend to produce less interpretable results and to be more prone to overfitting) only once the possibilities of simpler methods have been exhausted seems to come rather naturally to computer scientists, but perhaps less so to those from other disciplines.
The article also accidentally invents a clever new term. Along with supervised and unsupervised machine learning, we now have ‘poorly supervised’ machine learning as well. Better be careful with that.
Schmidt, B. M. (2012). “Words Alone: Dismantling Topic Models in the Humanities”. Journal of Digital Humanities, 2(1). Retrieved from http://journalofdigitalhumanities.org/2-1/words-alone-by-benjamin-m-schmidt/.