1: The Metadata

2: Analysis

Introduction

I chose to explore the WhatEvery1Says¹ (WE1S) dataset “comparison-not-humanities.” The reason for my choice was at first quite superficial: the dataset provides information about mainstream news articles that use the terms “person,” “say,” and “good,” and I am generally interested in the language attached to “person”/“people”/“human.” The WE1S team, however, chose these words simply because they are the three most common words in the English language according to the Oxford English Corpus and are therefore reliably searchable in LexisNexis; when “humanities” is excluded from the search, collecting articles by the occurrence of each term yields what can serve as a control dataset, giving “humanities” media coverage a gauge of relation, size, and scope (“WE1S Datasets,” par. 6). In other words, the “comparison-not-humanities” dataset is valuable to the larger goals of WhatEvery1Says because it provides large-scale information about media that does not engage with “humanities.” In what follows, I will investigate this dataset (1) on its own and (2) in relation to the other datasets collected by WE1S and to the project’s key findings. Finally, I will (3) address the boundaries of this dataset.

The Data Alone

What does the data contain and how is it useful? Comparison-not-humanities data is non-consumptive; its files list the words in each news article alphabetically along with their attributes. It seems the strength of this dataset lies not in what any single extracted document indicates on its own, but in the sheer volume of information from which more specific questions might be asked. For example, the dataset seemingly allows for further mining that locates the most frequently used terms in each news article or across many of them. As the news articles in this dataset come from a range of geographical regions within the United States, one might also ask whether the most commonly used terms (aside from “person,” “say,” and “good”) vary to a significant degree across regions or publishing entities. This data might then be compared against word frequencies from other keyword searches. In sum, the dataset can add to data collections and/or provide broad metrics of comparison (hence the name).
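As a minimal sketch of what that mining could look like (assuming, hypothetically, that each JSON file stores its per-article word counts in a field called "bag_of_words"; the real field names would need to be checked against the WE1S schema documentation):

    # Tally word frequencies across a folder of non-consumptive JSON files.
    # "bag_of_words" is a hypothetical field mapping each token to its count.
    import json
    from collections import Counter
    from pathlib import Path

    totals = Counter()
    for path in Path("comparison-not-humanities").glob("*.json"):
        with open(path, encoding="utf-8") as f:
            totals.update(json.load(f).get("bag_of_words", {}))

    # Set aside the three search terms, then list the next most frequent tokens.
    for term in ("person", "say", "good"):
        totals.pop(term, None)
    print(totals.most_common(20))

Grouping the same tallies by a source or region field would answer the cross-regional version of the question.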

What about the form? The dataset is formatted in JSON, which makes its files easy to transfer and read across different platforms. Because these files only opened in TextEdit (a Mac-specific editor) on my computer, which is not an especially appealing way to view them, I downloaded a program called Atom, where the files were much more visually digestible. That said, had I not had any particular program to read these files, I would still have been met with exactly the same content that appeared in Atom. In other words, this dataset is stable in that it can be stored in plain-text format.
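For illustration, a document in this style might look something like the following (a made-up sketch, not the actual WE1S schema; every field name here is my own stand-in):

    {
      "title": "Example article",
      "pub": "Example newspaper",
      "pub_date": "2017-03-14",
      "bag_of_words": { "economy": 3, "good": 4, "person": 2, "say": 7 }
    }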

While JSON is a valuable and fairly interpretable format for text data, it is worth noting that its readability is contingent upon understanding its structure. Insofar as the humanities, viewed pejoratively, are often thought to train students only in soft skills, working with this dataset felt like the opposite experience: it required learning a new data language (JSON) and making sense of its patterns. It wasn’t until I understood the data fields employed within the WE1S-specific JSON schema that I was able to see how the actual content embedded in each document could contribute to larger meaning. In “Data Modeling in a Digital Humanities Context: An Introduction,” Julia Flanders and Fotis Jannidis echo Michael Sperberg-McQueen to argue that “modeling is a way to make explicit our assumptions about the nature of a text/artefact” (3). Flanders and Jannidis refer largely to TEI and XML, but a similar concept applies here: I could not make sense of how the data contained within the comparison-not-humanities dataset was valuable until I looked more carefully at how each field was actually defined by both LexisNexis and the WE1S team. At a larger scale, then, this work of interpretation carries a time “cost” (to use Flanders and Jannidis’s term), but it also allows for ease in further computational analysis of the data (8).

The Data in Comparison

As stated clearly within the WE1S datasets directory, “WE1S researchers use the data for context to better understand the place of documents in public discourse that do mention the word ‘humanities’” (“WE1S Datasets,” par. 5). The most obvious point of comparison afforded by the comparison-not-humanities data is its sheer size: 1,380,465 documents extracted from 18 randomly selected months across six years (or 21 months across seven years, depending on whether 2019 counts within the search range or as its cut-off). In contrast, the “humanities-keywords” dataset contains 474,930 documents published from 1989 to 2019 across 1,287 sources (though over 400 of those sources are international). This data does indeed provide a sense of the limited space occupied by humanities conversation in the media. More exact numbers might be obtained by isolating the “humanities-keywords” data that matches the date ranges used in the comparison-not-humanities data and restricting the region to the United States. In fact, in something akin to this process, the WE1S team found that the proportion of “humanities”-specified articles to non-“humanities”-specified articles within the overlapping date range was 1:40 (“Collection 32,” par. 2).
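That isolation step might be sketched as follows (purely illustrative: the "pub_date" and "country" fields are hypothetical stand-ins, and the actual date boundaries would have to be taken from the collection documentation):

    # Count humanities-keywords documents that fall inside a comparison
    # date range and were published in the United States (hypothetical fields).
    import json
    from datetime import date
    from pathlib import Path

    start, end = date(2014, 1, 1), date(2019, 12, 31)  # placeholder range

    matching = 0
    for path in Path("humanities-keywords").glob("*.json"):
        with open(path, encoding="utf-8") as f:
            doc = json.load(f)
        published = date.fromisoformat(doc.get("pub_date", "1900-01-01"))
        if start <= published <= end and doc.get("country") == "US":
            matching += 1

    print(matching)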

At a more specific level, incorporating comparison-not-humanities allows content from humanities data to sit next to that of “everything else” (“Collection 20,” par. 2). That is, it allows WE1S researchers to ask, first: What topics appear most frequently across humanities and non-humanities articles together? Is there overlap between the most frequently mentioned terms and topics in the “humanities”-specific articles and those that do not mention “humanities”? These questions then give way to both answers and more specific questions. For example: In which sources is there the most overlap? The least? Do the concerns of humanists actually apply to frequently mentioned topics where the “humanities” are not mentioned? Several of these questions and their answers helped WE1S arrive at its key findings and produce intricate data visualizations.
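One rough way to operationalize the overlap questions (a sketch with invented term lists; real inputs would come from frequency tallies like the one sketched earlier):

    # Proportion of top humanities terms that also rank among the top
    # comparison terms; rank order within each list is ignored.
    def overlap(top_humanities, top_comparison):
        shared = set(top_humanities) & set(top_comparison)
        return len(shared) / len(top_humanities)

    # Toy illustration: two of the four terms are shared, so overlap is 0.5.
    print(overlap(["education", "arts", "funding", "students"],
                  ["funding", "economy", "education", "elections"]))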

Limitations of the Data

The comparison-not-humanities dataset, along with several other datasets included in the WE1S initiative, is inherently non-consumptive because of copyright law: neither WE1S nor LexisNexis has any claim of ownership over substantial content from news sources. Keyword searches and topic and exploratory mining are therefore restricted to token words and are thus inherently decontextualized; it is not possible to analyze the position of the most frequently used terms within sentences and phrases. This seems to be simply how working with data from contemporary media (particularly news, and to a lesser extent social media) goes. To avoid the issue, the project would have had to be drastically scaled down, so the limitation is effectively inevitable. The researchers note this limitation and incorporate consumptive data, as it were, where they can in other components of the project. That said, it might be possible to replicate this study at a smaller scale in a way that grants researchers full access to the target media.
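A tiny illustration of the decontextualization point: two statements with opposite meanings reduce to the same bag of words, so position and phrasing are unrecoverable.

    # Reversed word order yields identical token counts.
    from collections import Counter

    a = Counter("the critics praised the humanities".split())
    b = Counter("the humanities praised the critics".split())
    print(a == b)  # True: word order, and with it meaning, is gone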

Brief Summary

As a product of a series of comparative analyses, often incorporating data from comparison-not-humanities, WE1S was able to pinpoint when and where the humanities are left out of public discourse and the ways in which they can be cast pejoratively when they are included. In response, the team ultimately provides several action and communication steps for humanists and for students in/of the humanities to help change the perception of the humanities. At a moment when humanities work is undervalued and obscured but always needed, this seems urgently useful and important.

Sources Mentioned

“Collection 20: Top U.S. Newspapers.” WE1S, 4Humanities, https://we1s.ucsb.edu/wp-content/uploads/C-HS-1.pdf.

“Collection 32: Top U.S. Newspapers.” WE1S, 4Humanities, https://we1s.ucsb.edu/wp-content/uploads/C-HS-1.pdf.

Flanders, Julia, and Fotis Jannidis. “Data Modeling in a Digital Humanities Context: An Introduction.” The Shape of Data in Digital Humanities: Modeling Texts and Text-Based Resources, Routledge, 2019.

“WE1S Datasets.” WE1S, 4Humanities, https://we1s.ucsb.edu/research/we1s-materials/datasets/.

1. WhatEvery1Says (WE1S) is a collaborative initiative based at 4humanities.org, involving researchers from various universities, that looks at how the humanities are perceived in the media and by the public. WE1S directors include Alan Liu, Jeremy Douglass, Scott Kleinman, and Lindsay Thomas. To create its various datasets, researchers pulled data from mainstream and local news sources, Reddit, and Twitter that did and did not talk about the humanities. Researchers also collected survey data from university students. See https://we1s.ucsb.edu/research/we1s-findings/key-findings/ for key findings.