Getting started

This site provides an easy way to create machine learning text models from a number of scientific and humanities databases. We provide four visualization tools to analyze these models and a search tool to explore and filter the corpus.

The tool was created with funding provided by the Andrew W. Mellon Foundation. We continue to update and improve this site as we move into Phase 2 of the tool’s development. Additional documentation and resources are forthcoming and will be added during our second grant’s award period.

Before you get started, create an account. Next, read up on the types of models we’re creating so you have a basic understanding of how they are created and how they can be used:

Here’s a good description of our main model type, Latent Dirichlet Allocation (LDA).

Here’s a description of Word2Vec, another, more word-centered model we can create.

Search Texts

To begin, you need to fetch the set of documents from which the model will be created. Select the corpus you would like to query.

search box

The corpora cover a number of subjects described below:

Pubmed Abstract: a database of the abstracts of every article on PubMed before ?2018?

Pubmed Central: the set of open-access articles available on PubMed

Jstor Life Sciences: all articles from journals related to the life sciences on JSTOR??


Two Archeology Journals:

Iowa Latin Canon:

EHealth Alzheimer’s:

Text Creation Partnership:

AC Justice Project:

Next, choose a word, set of words (separated by spaces), or regular expression that the documents in the model must contain. A space between words acts as OR, as does an explicit 'OR': one or the other term must be present. 'AND' between words means both terms must be in the document. Multi-word phrases must be enclosed in double quotes (""). In this case we are searching for 1000 documents that contain the word maize.
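The query semantics above can be sketched as a tiny matcher. This is a simplified illustration, not the site's actual query engine: it ignores regular expressions and mixed AND/OR precedence, and all names are our own:

```python
import re

def matches(query: str, text: str) -> bool:
    """Simplified sketch of the search semantics described above:
    a space or 'OR' between terms means any term may match, 'AND'
    means every term must be present, and "double quotes" group a
    multi-word phrase into a single term."""
    text = text.lower()
    # Pull out quoted phrases and bare words as separate tokens.
    pairs = re.findall(r'"([^"]+)"|(\S+)', query)
    tokens = [phrase or word for phrase, word in pairs]
    if "AND" in tokens:
        return all(t.lower() in text for t in tokens if t != "AND")
    return any(t.lower() in text for t in tokens if t != "OR")
```

So `maize corn` matches any document containing either word, while `maize AND corn` requires both.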

search box

Now choose the number of documents you would like returned. The smaller the number of documents, the faster the model will run, but the results may be subject to more variability.

Optionally, you can also set a year range for documents in the From and To text inputs, but bear in mind that certain databases have incomplete year metadata.

Now click that search button!

Explore Text

Next you will see time and term-count graphs. You can click and drag on these graphs to filter the documents by date, allowing a more granular way to refine your dataset by time period. If this filter is applied, the model will be created using only the documents in the filtered range. To reset, click the reset button above the graph.

search table

Once the database has been searched and filtered, you will see the top documents (sorted by relevance to your search term(s)) in the table at the left of the page.

search table

Select one of these documents to see its full text. This allows you to survey some of the documents and confirm their validity before running a model.

Once you feel good about your search parameters, it’s time to choose which type of model and visualization you want to run.

Choose Model/Visualization Type

Here are descriptions of what each model/visualization type creates and shows:

Topic Browser

This visualization type was created by Andrew Goldstone (github). The main visualization presents the model as a list of topics and the top words in those topics, providing ways to explore how topics occur over time as well as an interface showing words in topics, topics in documents, and other helpful information. We think of this as the best visualization for exploratory work, as it provides an easily readable overview of the topics and words in the model.

dfr main page

This visualization type comes from the pyLDAvis package (github). The left panel projects each of the topics onto two principal components, providing a great way to see how similar topics are and whether there are any topic clusters or particularly outlying topics in that space. The top right introduces the lambda relevance metric, which changes the top words shown for a topic by weighting how much of a word’s total frequency that topic contributes, so a lower relevance metric will surface words that appear relatively more frequently in that topic than in the corpus overall. Studies have shown that a metric around 0.5 returns the most interpretable results. Selecting a word in the right-side table shows how strongly it appears in each of the topics on the left projection, and selecting a topic on the left populates the table with the words most relevant to that topic (based on the lambda cutoff).

dfr main page
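The relevance metric referred to above comes from Sievert and Shirley’s pyLDAvis work: relevance(w, t) = λ·log p(w|t) + (1 − λ)·log(p(w|t)/p(w)). A small illustration with made-up probabilities (not the site's code):

```python
import math

def relevance(p_w_given_t: float, p_w: float, lam: float = 0.5) -> float:
    """Sievert & Shirley's relevance: lam * log p(w|t)
    + (1 - lam) * log(p(w|t) / p(w)). Lower lam favors words that
    are distinctive to the topic over globally frequent words."""
    return lam * math.log(p_w_given_t) + (1 - lam) * math.log(p_w_given_t / p_w)

# Made-up probabilities: a globally common word vs. a topic-specific one.
common = relevance(p_w_given_t=0.05, p_w=0.04, lam=0.1)     # frequent everywhere
specific = relevance(p_w_given_t=0.03, p_w=0.001, lam=0.1)  # mostly in this topic
```

At λ = 1 the metric reduces to the plain topic-word probability, so lowering the slider re-ranks the word lists toward distinctive terms.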

This visualization is based on TensorFlow’s TensorBoard projector. It takes a word2vec model and projects the word vectors into a 3-D space, providing a way to explore the word similarities and meanings learned from the set of documents. Selecting a word in the cloud shows the 20 nearest words in the original space in the left-hand table, along with the documents in the corpus where that word appears most frequently in the right-hand table. Selecting a word in the left-hand table selects that word in the cloud, and selecting a document from the right-side table shows the contents of that document at the lower right.

w2v example
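The "nearest words" lookup ranks word vectors by cosine similarity. A minimal sketch with toy 3-d vectors (the real model uses higher-dimensional word2vec vectors; all names here are illustrative):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors: 1 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(word, vectors, k=2):
    """Rank the other words by cosine similarity to `word`'s vector."""
    others = [(w, cosine(vectors[word], v)) for w, v in vectors.items() if w != word]
    return [w for w, _ in sorted(others, key=lambda p: p[1], reverse=True)[:k]]

# Toy 3-d vectors standing in for a trained word2vec model.
vectors = {
    "maize":   [0.9, 0.1, 0.0],
    "corn":    [0.8, 0.2, 0.1],
    "pottery": [0.0, 0.1, 0.9],
}
```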
Multilevel Model of Models

This visualization method incorporates two new ideas: LDA models at multiple document sizes and the combination of multiple LDA models. Both of these methods help address the instability of LDA across multiple runs with identical parameters, so the groups in these visualizations are more reliably stable, because they are created from multiple topics at multiple scales across multiple runs.

We build this visualization by creating two corpora from the documents: one at the normal level, treating each document as a document-unit for LDA, and another at a more close-up level, splitting each paragraph of the corpus documents into its own document-unit and running LDA on that. These two ‘levels’ will highlight different words in their topics, but if the topics at both levels look similar, we gain additional confidence in the stability of these models. We then train three models at each level (three document-level and three paragraph-level) and visualize all six models.
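The paragraph-level corpus construction can be sketched as follows (a minimal illustration with hypothetical names, assuming paragraphs are separated by blank lines):

```python
def paragraph_units(documents):
    """Split each document on blank lines so every paragraph becomes
    its own document-unit for the paragraph-level LDA run."""
    units = []
    for doc_id, text in enumerate(documents):
        for para in text.split("\n\n"):
            para = para.strip()
            if para:
                # Keep the source document id so topics can be traced back.
                units.append((doc_id, para))
    return units

docs = ["First para.\n\nSecond para.", "Only one paragraph."]
```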

We then take the topics in these six models and cluster them using spectral clustering, giving an estimate of the stable topics in the corpus. These clusters are shown in the table on the right side, with each cluster assigned a color that is consistent across the page. The table lists the number of documents and paragraphs within each cluster’s topics and shows the top words in the cluster. Selecting a row in this table populates the upper-left table with the top documents in that cluster.

model of model cluster table
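As a rough stand-in for the clustering step, the sketch below groups topics greedily by the Jaccard overlap of their top-word lists. This is a simplification for illustration only; the site actually clusters the six models' topics with spectral clustering, and all names here are hypothetical:

```python
def jaccard(a, b):
    """Overlap of two top-word lists: |intersection| / |union|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def group_topics(topics, threshold=0.5):
    """Greedy grouping by top-word overlap: a topic joins the first
    existing cluster whose seed topic it overlaps enough with,
    otherwise it starts a new cluster."""
    clusters = []
    for topic in topics:
        for cluster in clusters:
            if jaccard(topic, cluster[0]) >= threshold:
                cluster.append(topic)
                break
        else:
            clusters.append([topic])
    return clusters

# Toy top-word lists from different model runs.
topics = [
    ["maize", "crop", "yield"],
    ["crop", "maize", "harvest"],
    ["pottery", "shard", "kiln"],
]
```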

The groups and topics are visualized with three related panels:

  • Model tree: provides a hierarchical overview of the top words and the scores associated with each node, starting with the ‘model’ as the root, splitting into the generated clusters, then the topics within each cluster, and finally the words within each topic.

    model of model trees
  • Word web circle: shows the top words in every topic (colored by cluster) and draws a link between identical words, giving an overview of the semantic similarity of all topics. Mouse over a word to highlight its links, and click a word to highlight its topic in red.

    model of model circle
  • Topic similarity graph network: this panel projects the topics onto a two-dimensional space using 5000 iterations of t-SNE, and calculates the KL-divergence between the word distributions of each pair of topics. The location of each topic is its t-SNE projection, and the links between topics are based on the KL-divergence. The slider at the top adjusts the KL-divergence cutoff for drawing a link: closer to 0 is more similar, closer to 1 is less similar, with a fully connected graph at a cutoff of 1.

    model of model trees

Selecting a topic on any of these panels highlights that topic on the other two panels in red, and it populates the document table with the top documents in that topic (not cluster).
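The KL-divergence used for the topic links above can be sketched in a few lines. This is a generic illustration with toy distributions and hypothetical names, not the site's implementation (note that raw KL-divergence is unbounded above, unlike the 0-to-1 slider scale):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) between two topic-word distributions:
    0 means identical, larger values mean less similar."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy word distributions over a 4-word vocabulary.
uniform = [0.25, 0.25, 0.25, 0.25]
skewed  = [0.70, 0.10, 0.10, 0.10]
```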

Tweaking Model Parameters

Before you choose a visualization type to create, you can optionally review and tweak the model parameters under the advanced parameters tab.

advanced params
  • Tfidf: If this box is checked, we perform tf-idf weighting on the corpus before running the model. Tf-idf weights less common words more heavily, accounting for some words occurring more frequently in general.

  • Remove digits: removes numbers from the corpus before the model is run.

  • Number of Topics: If automatic, the number of topics will be the number of documents in the corpus divided by 100, with a minimum of 5 and a maximum of 50 topics.

  • Passes: The number of passes over the corpus that you would like LDA to make; more passes produce a more stable model. If not chosen, the model will run 25 passes. (Warning!) The more passes, the longer the model will take to run.

  • Stop word list: A comma-separated list of words that will be removed from the corpus before the model is run. This can be used to remove common, uninformative words (e.g. removing the word dna from a corpus you created by searching ‘dna’).

  • Phrases: Like the stop word list, this removes entire phrases from the corpus before the model is run.

  • Word Replacement: This can be used to combine terms that have identical meanings into one term for more easily interpretable models. Entries should be comma separated, in the format ‘word(s)-to-be-replaced’ -> ‘word-to-replace-with’.
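As a rough illustration, the automatic topic count and the corpus-cleaning options above could be sketched like this (a hypothetical reimplementation, not the site's actual code; all function and parameter names are our own):

```python
def auto_num_topics(num_docs):
    """Automatic topic count described above:
    documents / 100, clamped to the range 5..50."""
    return max(5, min(50, num_docs // 100))

def preprocess(tokens, stop_words=(), replacements=None, remove_digits=False):
    """Sketch of the advanced-parameter cleaning steps: drop stop
    words, optionally drop digit tokens, and apply word replacements."""
    replacements = replacements or {}
    out = []
    for tok in tokens:
        if tok in stop_words:
            continue
        if remove_digits and tok.isdigit():
            continue
        out.append(replacements.get(tok, tok))
    return out
```

For example, a 2,500-document corpus would get 25 topics automatically, and a stop word entry of "dna" would drop that token before training.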