FeatureLens
We present ‘FeatureLens,’ an application for visualizing collections of text documents based on text features. Our goal is to integrate the results of text-mining algorithms into a meaningful representation of a text collection. We select parts of the text according to the distribution of features, such as word frequency, across and within the chapters of a book. Trends, gaps, patterns and outliers in these distributions are used to select ‘interesting’ patterns in the documents.
Researchers
Anthony Don, HCIL
Machon Gregory
Elena Zheleva
Sureyya Tarkan
Catherine Plaisant, HCIL
Ben Shneiderman, HCIL
In collaboration with Tanya Clement (UMd Dept of English) and Loretta Auvil (NCSA)
FeatureLens Live!
Start by ‘loading’ Gamer Theory using the [LOAD] button in the upper left of the application.
Text Mining
We extracted one set of words and one set of frequent patterns of trigrams from ‘Gamer Theory 2.0’ using the Text-to-Knowledge framework (http://alg.ncsa.uiuc.edu/do/tools/t2k).
Stop words such as “a,” “the,” and “of” were filtered out from the set of words, and the frequency (number of occurrences) of each word was computed for each paragraph of the book.
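The word-counting step can be sketched as follows. This is a minimal illustration, not the Text-to-Knowledge implementation: the stop-word list, the tokenizing regular expression, the function name and the sample paragraphs are all assumptions made for the example.

```python
import re
from collections import Counter

# Illustrative stop-word list (assumption); the list actually used is not given.
STOP_WORDS = {"a", "the", "of"}

def word_frequencies(paragraphs):
    """Return one Counter per paragraph, counting the non-stop words."""
    counts = []
    for text in paragraphs:
        # Lowercase the text and keep runs of letters and apostrophes as words.
        words = re.findall(r"[a-z']+", text.lower())
        counts.append(Counter(w for w in words if w not in STOP_WORDS))
    return counts

# Made-up sample paragraphs, not taken from the book.
paragraphs = [
    "The gamer plays the game.",
    "A game of theory, a theory of the game.",
]
freqs = word_frequencies(paragraphs)
```

Each entry of `freqs` maps a word to its frequency in the corresponding paragraph, which is the per-paragraph distribution the rest of the page relies on.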
A trigram is a set of three consecutive words. For example, in the text “the quick brown fox,” there are two trigrams: “the quick brown” and “quick brown fox”.
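As a small sketch of this definition, assuming words are separated by whitespace (the function name is ours, chosen for the example):

```python
def trigrams(text):
    """Return every run of three consecutive words in the text, in order."""
    words = text.split()
    # A text of n words has n - 2 trigrams (none if n < 3).
    return [" ".join(words[i:i + 3]) for i in range(len(words) - 2)]

print(trigrams("the quick brown fox"))
# ['the quick brown', 'quick brown fox']
```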
The text was preprocessed to support detection of trigrams that frequently co-occur in the same paragraphs. In our setting, a set of trigrams that occurs in at least three paragraphs is considered to be a frequent pattern of trigrams. Stop words were not filtered out for this analysis.
Frequent patterns of trigrams may appear when a sentence of more than three words is repeated exactly in different paragraphs, or when the repeated sentence occurs with slight variations. The frequency of each frequent pattern was computed for each paragraph.
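One simple way to approximate this idea is to group trigrams that occur in exactly the same paragraphs and keep the groups seen in at least three paragraphs. This is a simplified sketch under that assumption, not the mining algorithm actually used, and it does not compute the per-paragraph frequency of each pattern. Function names and sample paragraphs are hypothetical.

```python
from collections import defaultdict

MIN_PARAGRAPHS = 3  # threshold stated in the text

def trigram_set(text):
    """Return the set of trigrams in a paragraph (stop words are kept)."""
    words = text.lower().split()
    return {" ".join(words[i:i + 3]) for i in range(len(words) - 2)}

def frequent_trigram_patterns(paragraphs, min_paragraphs=MIN_PARAGRAPHS):
    """Map each frequent set of paragraph indices to the trigrams sharing it."""
    # For each trigram, record the paragraphs that contain it.
    occurrences = defaultdict(set)
    for index, text in enumerate(paragraphs):
        for tri in trigram_set(text):
            occurrences[tri].add(index)
    # Trigrams found in exactly the same paragraphs form one pattern.
    groups = defaultdict(set)
    for tri, where in occurrences.items():
        if len(where) >= min_paragraphs:
            groups[frozenset(where)].add(tri)
    return dict(groups)

# Made-up paragraphs: a phrase is repeated in paragraphs 0, 2 and 3.
paragraphs = [
    "we play the game again today",
    "nothing repeats here at all",
    "again we play the game again",
    "so we play the game again now",
]
patterns = frequent_trigram_patterns(paragraphs)
```

Here the three trigrams of the repeated phrase “we play the game again” end up in a single pattern attached to paragraphs 0, 2 and 3.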
We will use the term “pattern” to refer to either words or frequent patterns of trigrams.
FeatureLens
Figure 1 shows the graphical interface of FeatureLens. The interface is divided vertically into three parts.
In the leftmost part, the “Frequent Pattern” section contains a selection of words and frequent patterns. Patterns can be filtered by minimum size (number of trigrams in a pattern) and minimum frequency within the whole book. A text query can also be used to load patterns.
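The filtering described here can be sketched as a simple predicate over patterns. The `Pattern` class, its field names and the substring-based text query are assumptions for illustration; the page does not describe how FeatureLens stores or queries patterns. The sketch covers trigram patterns only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Pattern:
    trigrams: tuple        # trigrams that make up the pattern
    total_frequency: int   # occurrences within the whole book

def filter_patterns(patterns, min_size=1, min_frequency=1, query=""):
    """Keep patterns that are large enough, frequent enough, and match the query."""
    query = query.lower()
    return [
        p for p in patterns
        if len(p.trigrams) >= min_size
        and p.total_frequency >= min_frequency
        and any(query in t.lower() for t in p.trigrams)
    ]

# Made-up patterns and frequencies, for illustration only.
patterns = [
    Pattern(("gamer as theorist",), 12),
    Pattern(("we play the", "play the game", "the game again"), 4),
    Pattern(("in the cave", "the cave of"), 2),
]
large_and_frequent = filter_patterns(patterns, min_size=2, min_frequency=3)
about_caves = filter_patterns(patterns, query="cave")
```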
In the middle of the screen, the “Collection Overview” displays:
- the distribution of the selected patterns across the chapters of the book
- an overview of the chapters, using one gray line per paragraph
- the legend with the selected patterns and their associated color
In the rightmost part, the “Document View” displays the currently selected paragraph and its context on a blue background. The text is colored to show the position of the selected patterns inside a paragraph.
Sorting the distributions of pattern frequencies by trends
We use the distribution of individual pattern frequencies in each chapter to build different orderings for patterns. The user can retrieve patterns that:
- remain constant: retrieves patterns with the same frequency in each chapter. Patterns are ordered by decreasing total frequency.
- increase or decrease: retrieves patterns whose frequency steadily increases or decreases along the book. These patterns may represent topics that get more and more (resp. less and less) emphasis along the book.
- contain sink or spike: retrieves patterns that are used less (resp. more) in one chapter. These patterns may characterize a particular idea that is developed in one chapter.
- drop or float: retrieves patterns that have a high then low frequency, or the other way around.
- contain gaps: retrieves patterns that start and finish with a high frequency but have a low frequency in the middle of the book.
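The page does not say how these orderings are computed. As one possible sketch, three of the trends above could be scored from a pattern's per-chapter frequencies: a least-squares slope for increase or decrease, the peak minus the mean of the other chapters for spikes, and an equality test for constant patterns. The scoring functions and the sample frequencies are assumptions, not FeatureLens's actual measures or data.

```python
def slope(freqs):
    """Least-squares slope of frequency against chapter index."""
    n = len(freqs)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(freqs) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(freqs))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den

def spike_score(freqs):
    """Peak frequency minus the mean frequency of the other chapters."""
    if len(freqs) < 2:
        return 0.0
    rest = list(freqs)
    peak = max(rest)
    rest.remove(peak)  # drop one occurrence of the peak
    return peak - sum(rest) / len(rest)

def is_constant(freqs):
    """True when the pattern has the same frequency in every chapter."""
    return len(set(freqs)) == 1

# Made-up per-chapter frequencies, for illustration only.
chapter_freqs = {
    "problem": [1, 2, 4, 5],
    "real": [6, 4, 3, 1],
    "game": [3, 3, 3, 3],
    "katamari": [0, 0, 9, 0],
}
by_increase = sorted(chapter_freqs, key=lambda p: slope(chapter_freqs[p]), reverse=True)
by_decrease = sorted(chapter_freqs, key=lambda p: slope(chapter_freqs[p]))
by_spike = sorted(chapter_freqs, key=lambda p: spike_score(chapter_freqs[p]), reverse=True)
constant = sorted(
    (p for p in chapter_freqs if is_constant(chapter_freqs[p])),
    key=lambda p: sum(chapter_freqs[p]),
    reverse=True,
)
```

Sinks, drops, floats and gaps could be scored in a similar spirit, for instance by negating the frequencies before computing the spike score to find sinks.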
Patterns can also be sorted according to their trend within a chapter. We provide four preselected trends for:
- patterns that have a low frequency in the beginning and at the end of the chapter but a high frequency in the middle.
- patterns that have a low frequency in the beginning and in the middle of the chapter but a high frequency at the end.
- patterns that have a high frequency in the beginning and at the end of the chapter but a low frequency in the middle.
- patterns that have a low frequency in the middle and at the end of the chapter but a high frequency in the beginning.
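A minimal sketch of how the four within-chapter trends could be detected, assuming the chapter's paragraphs are split into three equal parts and a part counts as “high” when its mean frequency exceeds the average of the three parts. The split into thirds, the threshold and the labels are assumptions; the page does not specify them.

```python
def thirds_means(freqs):
    """Mean frequency in the beginning, middle and end thirds of a chapter."""
    n = len(freqs)
    bounds = [0, n // 3, 2 * n // 3, n]
    parts = [freqs[bounds[i]:bounds[i + 1]] for i in range(3)]
    return [sum(p) / len(p) if p else 0.0 for p in parts]

# (beginning high?, middle high?, end high?) -> trend label
SHAPES = {
    (False, True, False): "high in the middle",
    (False, False, True): "high at the end",
    (True, False, True): "low in the middle",
    (True, False, False): "high at the beginning",
}

def chapter_trend(freqs):
    """Return the matching trend label, or None if no preselected trend fits."""
    means = thirds_means(freqs)
    overall = sum(means) / 3
    shape = tuple(m > overall for m in means)
    return SHAPES.get(shape)

# Made-up per-paragraph frequencies of one pattern inside one chapter.
print(chapter_trend([3, 2, 0, 0, 0, 0]))  # high at the beginning
print(chapter_trend([0, 0, 0, 1, 2, 3]))  # high at the end
```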
FeatureLens Examples
The following five examples demonstrate how an analysis with FeatureLens can lead to insights into the structure of the text.
Frequent patterns form “a line of a certain type”
Figure 2: The patterns are sorted by increasing trend. The word “problem” is selected and the graph shows a steadily increasing frequency starting from the “Atopia” chapter. Is this trend meaningful? Does the author want to ask new questions and underline new problems as the book progresses? New hypotheses are being provoked and can be verified by reading the corresponding paragraphs.
Video (avi, no sound)
The rise of problems
The patterns are sorted by increasing trend. The word “problem” is selected and the graph shows a steadily increasing frequency starting from the “Atopia” chapter. Is this trend meaningful? Does the author want to ask new questions and underline new problems as the book progresses? New hypotheses are being provoked and can be verified by reading the corresponding paragraphs.
Video (avi, no sound)
The real world disappears
The patterns are sorted by decreasing trend. Two words appear at the top of the list: “real” and “world.” The graph of these patterns shows a decreasing trend along chapters for the word “real.” Reading the corresponding paragraphs shows that “real” mainly appears in “real world.” Does this fading-out of the real world follow the author’s ideas about the transition from the analog to the digital world?
Video (avi, no sound)
The steep hill
The patterns are sorted by “spikiness.” At the top of the list, four patterns form a spike in the “Analog” chapter. These words are “sisyphus,” “prince,” “katamari,” and “ball,” and they occur only in this chapter. Why is this chapter so distinctive? What is its topic? Reading the chapter shows that it contains an analysis of “Katamari Damacy,” a game in which the player controls a ball through various levels. It also draws a parallel between “The Myth of Sisyphus” and the never-ending nature of the game.
Video (avi, no sound)
From “gamers” to “gamer as theorist”
Figure 6: The patterns are sorted by trends inside the chapter entitled “Agony.” First, the patterns with a high frequency at the beginning are retrieved: the words “gamers” and “screen” have a high value. Then, the patterns with a high frequency at the end of the chapter are retrieved: the trigram “gamer as theorist” and the word “utopia” have a high value. It seems that in this section, the author switches between two concepts, “gamer” and “gamer as theorist.”
Video (avi, no sound)
Conclusion
Selecting trends in the distribution of pattern frequencies makes it possible to collect meaningful pieces of information about the text. FeatureLens is a “provocational” tool: it gives rise to new questions and hypotheses, as well as insights about the text.
The current version of FeatureLens is being developed at HCIL, University of Maryland, USA. We are using OpenLaszlo, Ruby and MySQL. An online version will be available soon.