Data will talk to you if you’re willing to listen.

-Jim Bergeson

Bergeson’s words, accurate as they may be, raise but do not answer an unavoidable question: should we listen?

I’m guessing the president and CEO of Bridgz Marketing Group would chuckle in annoyance at such a question; considering that Bergeson is the author of “Doc, What Can You Do for My Marketing Headache? A Data Prescription” and “Five Ways to Create a Data-Driven Marketing Culture,” we don’t need a PhD to infer what his response would be. While such an attitude may be a winning one in the world of Wall Street, Big Business, the Nasdaq and the Dow Jones, we’re historians here.

Thus, I ask again: should we listen?

On one hand, our job is to investigate the collective story of humankind—so, why wouldn’t we listen to another storyteller? If data holds the keys to unlock the secrets of yesterday, we’d surely be fools to ignore it. If leveraging census data to analyze the correlation between 19th-century urban growth and the spread of slavery teaches us something about the complex relationship between America’s greatest sin and fundamental demographic shifts with long-lasting implications, it’s time to open our ears—if text-mining a corpus of North American slave narratives can cast the multidimensional role of religion in the lived experience of North American victims of slavery in a new hue, it’s time to open our eyes. If data can teach us about the past, why wouldn’t we sit in on its lectures and take notes?

Word frequency trend visualization investigating the chronology of religion and sentiment in a corpus of 25 religiously-themed slave narratives from the University of North Carolina’s North American Slave Narrative Collection.

The answer is simple: what if that storyteller is a liar? How can we verify the credibility of ledger entries, survey forms, and inventory receipts? What can we learn from the numerical remains of a species who has mastered the art of cover-ups, corruption, and misinformation? How can we trust an abstracted model of the past, manifested in a collection of binary digits?

George Washington’s List of Enslaved People, 1799. [source: George Washington’s Mount Vernon]
It might seem self-defeating to take a step back and consider these meta-analytical questions after 10+ weeks of investing my trust in data and advocating for the process along the way; however, I’d argue that now’s the perfect time to reconsider such a process, given that I’ve dirtied my hands with it. Credibility is built upon experience, and experience comes with trial, error, and reflection.

Should we listen?

Yes…with caution. Where we have data we can learn, so long as we think like historians.

Before I elaborate on this point, it’s worth defining the following terms to make sure we’re on the same page:

Data: information

Structured Data: “spreadsheet-like,” “rectangular” data with clearly-defined variables and observations/instances (also encompasses Tidy Data, in which “each variable forms a column, each observation forms a row, [and] each type of observational unit forms a table.”)

Unstructured Data: unprocessed, raw data: textual primary sources, images, etc.

History has long relied on unstructured data, synthesizing narratives, recollections, and other prose to reconstruct a model of the past; as such, I want to be clear that I’m talking about structured data in this post. If we couldn’t trust data in general, history as a field of research would cease to exist; data at the most general level is information. Rather, I’m focused on the extent to which we may leverage structured data as historians: I claim that we can and should leverage such data, so long as we tread carefully.
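To ground these definitions, here’s a minimal sketch of a tidy table in plain Python; the counties, year, and population figures are invented purely for illustration:

```python
# A tidy dataset: each variable is a column (a dict key), each observation
# is a row (one dict), and each observational unit forms its own table
# (the list). All names and numbers here are invented for illustration.
census_rows = [
    {"county": "Henrico",   "year": 1800, "total_pop": 5000, "enslaved_pop": 2100},
    {"county": "Albemarle", "year": 1800, "total_pop": 4200, "enslaved_pop": 1900},
]

# Every row answers the same questions about a different observation,
# which is exactly what makes the data "rectangular."
columns = set(census_rows[0])
assert all(set(row) == columns for row in census_rows)
```

The same rectangularity is what lets spreadsheet software, databases, and statistical tools operate on the data at all.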

In discussing the role of structured data, it’s useful to further categorize our data as:

Received Data: structured data in its original form; data which was created in “rectangular” form as a ledger entry, census entry, etc.

Derived Data: structured data derived from analysis and abstraction of non-“rectangular” primary sources, whether calculated or reasonably inferred

Excerpt from the 1800 US Census. Such data would be considered received-structured data, in contrast to unstructured data in the form of long-form text/images and derived-structured data, computed after original observation. [source: Medium]
Time-series chart showing frequency of fugitive slave advertisements in the Richmond Daily Dispatch from 1860-1865; such data would be considered derived-structured data. [source: University of Richmond’s Mining the Dispatch]
Naturally, it’s easiest to trust received data, and perhaps derived data as well. Before discussing trustworthiness, though, a side note on why we ought to care about structured data in the first place: if historians have relied on unstructured data for so long, why should we take on structured data with all of its risks? What’s so great about structured data?

1) Structured data allows us to ask novel questions

Without some computational firepower, we simply couldn’t ask how the spread of slavery in the United States depended on time and place; nor could we investigate the geographic dynamics of emancipation. Answering such questions requires the investigation and manipulation of thousands upon thousands of observations across tens of variables: something that is best handled by a spreadsheet, not a human being.

2) Structured data allows us to think bigger

Implicit in the above point is that structured data enables generalization, abstraction, and higher-order thinking; by leveraging structured data, we may think in trends, patterns, and global characteristics without having to infer such trends, patterns, and global characteristics from a small sample of specific literature. We may directly compute the percentage of a county’s, state’s, or nation’s population that was enslaved with the click of a button if we have the census data, while unstructured textual data may only hint at the prevalence of slavery as a side note in an account of another topic.
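As a hedged illustration of that “click of a button” (the counts below are invented stand-ins, not real census figures), rolling county-level rows up into a regional share takes only a few lines:

```python
# Hypothetical county-level census rows; all numbers are invented.
counties = [
    {"county": "A", "total_pop": 5000, "enslaved_pop": 2100},
    {"county": "B", "total_pop": 4200, "enslaved_pop": 1900},
    {"county": "C", "total_pop": 3000, "enslaved_pop": 400},
]

# Aggregate the counties into a regional total: structured data lets us
# compute the trend directly rather than infer it from scattered prose.
total = sum(row["total_pop"] for row in counties)
enslaved = sum(row["enslaved_pop"] for row in counties)
share = enslaved / total
print(f"{share:.1%} of the region's recorded population was enslaved")
```

Swapping the county list for a state or national table changes nothing about the computation—that is the “thinking bigger” at work.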

3) Structured data allows us to think faster

By leveraging structured data, we may form, test, and confirm hypotheses more quickly as historians; with a few commands, we may digest information en masse to find a curious path of insight or a dead end, while finding such paths or dead ends might otherwise take days, weeks, months, or years of painstaking synthesis of unstructured primary sources.

4) Structured data enables us to think from a different perspective

Thinking numerically forces our thoughts to the left hemisphere of our brain, away from the typical historical machinery housed in our right hemisphere. Put simply, we think differently when looking at the abstraction numbers provides, which often leads to new explanations and observations otherwise unrealized.

5) Structured data allows us to communicate more effectively

Just as scientists and engineers leverage graphs, charts, tables, and visualizations in presentations, posters, and advertisements to communicate a complex phenomenon more simply, structured data enables us to do the same as historians. Trying to explain the multivariate equation governing America’s Manifest Destiny is more difficult in words than it is in a time-series GIF:

America’s Westward Expansion, as visualized in a time-series GIF. [source: Reddit]
Back to the main point of contention: should we listen? More precisely, given the utility of structured data, what questions must we ask before working with it? How must we treat received and derived data differently from unstructured data?

Ironically, I’ve arrived at this point to say that we must treat received and derived (structured) data just as we treat unstructured data: skeptically. Such a mindset is not natural, however; in abstracting the lives of our ancestors into rows of a CSV file, we often unknowingly abstract away our skepticism, too. It’s much easier to trust a number than a story, which is simultaneously the bane and promise of structured data’s role in history: it allows us to explain phenomena more convincingly, regardless of the truth of that phenomenon.

Hence, it’s in our best interest to use structured data, so long as we use it carefully: we must remember that each cell in our spreadsheet amounts to much, much more than a mere cell in our spreadsheet. We must take each and every number with a grain of salt, just as we would a journal entry from a (knowingly or unknowingly) biased author or a letter from a (knowingly or unknowingly) incentivized actor in our more established research process for unstructured data.

Given this precarious landscape, I outline below the process through which we may become more critical consumers of structured historical data. As tempting as it may be to dive straight in and dig through the data for answers, start by asking a few questions of the data itself:

1) What is the nature of the dataset?

If the data is structured, is it received or derived? Did the historical actor originally create the dataset in rectangular form, or has a scholar parsed it into such a form in transit? Structured data is a model of the past, and all models are wrong; hence, it’s worth considering who made the decision to reflect the past in such a model, when that decision was made, and why that decision was made. Asking such questions isn’t designed to limit the utility of the data; rather, they help us determine how large of a grain of salt we should digest along with the data.

2) What is the context and motivation for creation of the dataset?

Implied by 1) is perhaps the most important question, which is why it deserves its own entry: why was this structured dataset manufactured? What is the intended purpose of the data? What does the dataset seek to communicate? What does the dataset not seek to communicate? If the data were produced by an individual, how credible is that individual; what motivation drove the individual to document the world around him/her in the form of columns, rows, and numbers? (The same questions apply if the data were produced by an organization or institution as well.) If George Washington’s List of Enslaved People from 1799 were created for taxation documentation purposes, what might that tell us about the level of detail included in the dataset? If the Berry Slave Value database were synthesized and derived with the ultimate goal of enabling “scholars to … understand how enslaved people were valued and appraised,” what does this tell us about the data itself?

The nature of these questions demands a case-by-case analysis, which is precisely why I leave them as questions: if I could answer them, I would. No one-size-fits-all interpretation of structured data will ever apply.

3) What assumptions and implicit biases may exist?

Question 2) naturally raises a red flag for implicit bias and inherent partiality—because both received and derived structured data are the products of human actors, they ingrain human prejudices under a disguise of numbers. Was the received data you’re working with collected by a neutral third party, or was it collected by an institution on a mission to push an agenda? Likewise, was the derived data under consideration computed for the sake of scholarship and knowledge, or for the sake of making an argument? Considering its origin, setting, and process of creation, what assumptions and subtle prejudices lurk between the lines of your dataset?

4) How complete is this dataset?

There’s a reason witnesses must take an oath to “tell the truth, the whole truth, and nothing but the truth”; it’s one thing for structured data to be accurate, and another thing for it to be complete. Might George Washington’s List of Enslaved People from 1799 omit a few names, if each additional name on the list brought with it additional taxation? Credible structured data sources will quantify the degree to which a dataset’s missing rows affect the conclusions we can draw; use these insights to your advantage.
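A first-pass completeness check can be sketched in a few lines; the fields and rows below are hypothetical, with None marking values the source omits:

```python
# Hypothetical ledger rows; None marks a value the original source omits.
rows = [
    {"name": "A", "age": 38, "occupation": "carpenter"},
    {"name": "B", "age": None, "occupation": None},
    {"name": "C", "age": None, "occupation": "field hand"},
]

# Count the missing values per column before drawing any conclusions.
missing = {field: sum(1 for row in rows if row[field] is None)
           for field in rows[0]}
print(missing)  # {'name': 0, 'age': 2, 'occupation': 1}
```

A tally like this won’t tell you *why* the values are missing—that’s question 2)’s territory—but it does tell you where to direct your skepticism.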

5) How representative is this dataset?

Similar to 4) is the issue of representation: if our dataset is designed to be incomplete—i.e., if it’s meant to be a small sample standing in for the whole when direct observation of the whole is impossible—then we must ask: how well does this sample represent the whole? Some observations generalize well, while others, prone to high variance, may tell different stories when examined at the micro and macro scales.

6) How many degrees of abstraction are embodied in this dataset?

Models are useful, but again, all models are wrong; the rows in our CSV don’t do the lives of our ancestors justice when it comes to embodying what they stood for. Humans are not numbers, and life does not have a dollar value, despite the attempts of racists past to claim otherwise. As such, it’s worth considering: how far removed is this structured data from the lived experience of those it seeks to model? Received data is by nature less abstract than derived data, but that’s not to say one is more useful than the other; rather, the two complement the story told by unstructured data in the form of narrative and prose, together forming a spectrum from general to specific which at each level enhances our understanding of the past.

7) What is missing from this dataset?

Following from 4), 5), and 6) is the notion that no structured dataset will ever tell the full story; thus, it’s critical to ask what’s missing from the story, and to seek out sources which fill those gaps. If the 1800 census merely recorded the number of enslaved persons in each household, while the 1820 census further prodded for the age ranges of a household’s enslaved persons, how might this affect a comparison between the two datasets? Moreover, if the 1850 census asked 13 questions of free inhabitants and only 8 of enslaved inhabitants, which 5 dimensions are missing between the two datasets, and how might they change the conclusions we may draw? Oftentimes, what’s not included in a dataset is just as important as what is included.
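Asking which dimensions are missing amounts to a set difference over the two schedules’ fields; the field lists below are illustrative stand-ins, not the actual census schemas:

```python
# Illustrative (not actual) question lists for two hypothetical schedules.
fields_free = {"name", "age", "sex", "color", "occupation",
               "birthplace", "literacy", "real_estate"}
fields_enslaved = {"age", "sex", "color", "fugitive", "manumitted"}

# Dimensions recorded for free inhabitants but never for enslaved ones.
missing_dimensions = fields_free - fields_enslaved
print(sorted(missing_dimensions))
# ['birthplace', 'literacy', 'name', 'occupation', 'real_estate']
```

Each element of that difference is a question the dataset can never answer, no matter how carefully we analyze it.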

Excerpt from George Washington’s Mount Vernon Slavery Database, a massive collection of derived-structured data encapsulating an abstraction of the lives of Washington’s bondspersons. [source: George Washington’s Mount Vernon Slavery Database]
I’ll end with a disclaimer that I’m still an amateur in the process of becoming a more critical consumer of historical data myself—take my words with a grain of salt, just as you should take your data. The list of questions outlined above is by no means complete, but instead is meant to serve as a foundation upon which deeper, broader questions may be asked of structured data.

I’ve talked enough; the best way to learn is through exercise. Dive into the treasure trove of data in Emory University’s Slave Voyages database, asking yourself the above questions (and more!); then, compare your notes with those of Dr. David Eltis (whose notes are sure to be much more thorough) to get a veteran’s perspective on the database’s strengths and limitations.

Excerpt from Emory University’s Slave Voyages Database, a well-documented derived-structured data source. [source: Emory University’s Slave Voyages Database]
Structured data is willing to talk, and yes, you should listen.

Perhaps more important, though, is to ask: “what should I listen for?”