All models are wrong, but some are useful.

-George E.P. Box

Perhaps it’s unwise of me to preface my discussion of data-driven historical visualizations and graphical models with a quote dismissing their accuracy—regardless of the irony, British statistician George Box’s words are worth an acknowledgement before I get too far.

British statistician George E.P. Box.

By abstracting reality in the formation of a model, we sacrifice detail in exchange for understanding; we trim outliers, straighten curves, and smooth corners in order to reduce complex systems into more comprehensible mechanisms. We can’t hope to take in the whole picture at any given time, so we instead seek to understand a reduction that yields actionable insight as a compromise. A map is not the territory, and absolute space is not representational space, but a one-to-one map would be useless; likewise, the theory of relativity is known to break down at quantum scales, and yet it is considered one of the greatest intellectual products of the 20th century.

Instead of asking for a perfect model, “a better approach is to ask ‘Is it useful?’ and, if yes, ‘To what extent?’” History is big, and we’d be fools to hope to digest the entire collective story of the 108.2 billion humans born before March 29th, 2019—rather, we’d be better off selectively slicing and dicing those stories to get a sense of the human experience as witnessed through the eyes of a few. Or—better yet—we could encapsulate those collective stories with numerical snapshots characteristic of the human experience in a frame-by-frame sequence, building an intuition around the general by amassing and amalgamating the individual.

As an analogy, consider the curve (x^2+y^2-1)^3-x^2y^3=0. Excuse me, what?

Case in point. Trying to stretch one’s mind back to precalculus to imagine, on a guess-and-check basis, which points (x, y) in the plane satisfy the above equation will likely induce a headache—visualizing the curve’s holistic structure, on the other hand, is sure to cause an aneurysm (and that’s coming from a math major, mind you). As soon as we plot our dear friend (x^2+y^2-1)^3-x^2y^3=0, however, we see that math isn’t so scary after all; in fact, one might argue there’s reason to love it.

While the graph of (x^2+y^2-1)^3-x^2y^3=0 omits various details of the generating equation—namely, its values at particular points, its derivatives and their continuity, its critical points, its maxima and minima, and so on—it serves as a particularly useful tool, offering a new perspective on the curve’s complex nature.
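For the curious, here is a minimal sketch of how one might produce such a plot (assuming Python with numpy and matplotlib; the window bounds and grid resolution are arbitrary choices of mine). It simply samples the equation’s left-hand side on a grid and traces its zero contour:

```python
# A minimal sketch: plot the implicit curve (x^2 + y^2 - 1)^3 - x^2*y^3 = 0
# by evaluating its left-hand side on a grid and tracing the zero contour.
import numpy as np
import matplotlib.pyplot as plt

# Sample a window of the plane around the origin (bounds chosen by eye).
x = np.linspace(-1.5, 1.5, 1000)
y = np.linspace(-1.5, 1.5, 1000)
X, Y = np.meshgrid(x, y)

# The curve is exactly the set of points where F equals zero.
F = (X**2 + Y**2 - 1)**3 - X**2 * Y**3

plt.contour(X, Y, F, levels=[0], colors="red")
plt.gca().set_aspect("equal")
plt.title(r"$(x^2+y^2-1)^3 - x^2y^3 = 0$")
plt.show()
```

Run it, and the reason one might love this particular curve should become clear.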

In the same manner, graphical visualizations of historical datasets afford us the opportunity to identify previously unrecognized correlations, trends, patterns, and dependencies over a given time period, abstracting away concrete narrations, descriptions, and lived experiences in exchange for a new perspective on the mechanisms that drove a series of events. While such visualizations may be leveraged to test historical hypotheses, they often serve as the basis for hypothesis generation in the first place, enabling us to ask new questions by sparking new domains of historical curiosity in our visual cortex. As Richard White, Professor of History and past director of the Stanford University Spatial History Project, puts it,

…visualization and spatial history are not about producing illustrations or maps to communicate things that you have discovered by other means. It is a means of doing research; it generates questions that might otherwise go unasked, it reveals historical relations that might otherwise go unnoticed, and it undermines, or substantiates, stories upon which we build our own versions of the past.

This brings me to the first reason we, as historians, ought to embrace visualizations: they inspire new perspectives and ideas.

Hence, while visual models of history inherently oversimplify by projecting a space-time of infinite dimension onto a two-dimensional arrangement of pixels, there’s no denying that those two dimensions carry value. Lincoln Mullen’s “The Spread of U.S. Slavery, 1790–1860” and the University of Richmond’s “Visualizing Emancipation” exemplify such value, exposing a new side of slavery’s history in the United States by examining the institution’s role at various scales.

Lincoln Mullen’s “The Spread of U.S. Slavery, 1790–1860”
The University of Richmond’s “Visualizing Emancipation”

This issue of scale is arguably one of the most precarious themes dealt with in the study of history, and yet each project above faces the challenge head-on, enabling investigation at varying levels of granularity. Mullen’s map allows one to analyze the spread of slavery on a nationwide, statewide, or county-by-county basis, across eleven variables, in ten-year increments over a span of eighty years. Similarly, the University of Richmond’s project illustrates macro-scale movements by way of heatmap and animation functions while preserving micro-scale detail in individually geo-referenced data points and associated metadata entries. As such, visual models enable us to better grasp the role of deep contingency in shaping history, defined by Edward Ayers¹ and Scott Nesbit² in “Seeing Emancipation: Scale and Freedom in the American South” as

…a way to think about social action across scales; it argues that different aspects of social life connect with others in unpredictable ways in the flow of time, creating important shifts in structures and self-understanding … Deep contingency means that change at one scale can trigger change at other scales, with systemic change resonating at all scales.

Herein lies the second reason data-driven visualizations add value to more conventional historical approaches: they contextualize scale and the deep contingency of history. Mullen’s own words echo this point:

…one of the main problems for the historians’ method today is the problem of scale. How can we understand the past at different chronological and geographical scales? How can we move intelligibly between looking at individuals and looking at the Atlantic World, between studying a moment and studying several centuries? Maps can help, especially interactive web maps that make it possible to zoom in and out, to represent more than one subject of interest, and to set representations of the past in motion in order to show change over time.

The final strength of data-driven visualizations in the historical space arises from their very nature: they are simultaneously objective and subjective. Data-driven visualizations are objective in the sense that numbers are objective; numbers don’t lie (provided that they are the right numbers…more on that in a minute). At the same time, being platforms for research at heart, they admit a unique interpretation for each lens through which they’re examined. By way of this oxymoron, I hope to emphasize that the duality of data-driven visualizations enables us to stretch our interpretations without stretching the truth—to put it another way, data-driven visualizations settle the purely quantitative questions of “how much?” and “to what degree?” so that we may instead focus our energy on debating “why?” After all, if we’re stuck arguing over the first two questions, we’ll never get to the third—and yet, that third question is precisely why we study history.

• • •

Despite my unqualified support for graphical modeling and data-driven visualization up to this point, I’ll be the first to admit that they’re not all sunshine and rainbows. I’ve been sidestepping the weaknesses that accompany their strengths, but I won’t leave without a few warnings—after all, understanding these tools’ shortcomings makes us better historians.

The first and foremost fact to consider when doing data science in any context is that

There are three kinds of lies—lies, damned lies, and statistics.

-Mark Twain (who attributed the saying to Benjamin Disraeli)

More precisely,

Figures don’t lie; liars figure.

-Mark Twain

American writer Mark Twain.

In other words, the simultaneous objectivity and subjectivity of data act as a double-edged sword in the creation of data-driven visualizations; when not wielded carefully, that sword presents a grave danger. The numbers recorded in the 1850 Alabama census are numbers, after all, and haven’t changed since 1850—but why should we trust those numbers in the first place? What ulterior motives and incentives were at play in the original curation of the dataset under consideration? Is the source of our dataset credible? Even if our numbers tell the truth, do they tell the whole truth? Or do they tell the story of a small, non-representative group within a fundamentally different population? Have we lost key pieces of data over time? How might such missing data skew our reconstructed interpretation? These are the questions we must begin to ask when relying on pure, numeric historical data; after all, there’s cause for concern in the fact that Darrell Huff’s How to Lie with Statistics remains one of the best-selling statistics texts of all time.

The next point to consider when working in the space of data-driven visualizations is the first half of Box’s quote with which I opened:

All models are wrong, but some are useful.

-George E.P. Box

The universe of models.

While modeling enables our understanding of complex systems, it too is a double-edged sword. In the creation of historical visualizations, we cannot help but leave out some level of detail, some chain of causality, some state of contingency, or some factor of agency—as such, we must remember that our models are models at the end of the day. We must not forget that correlation does not imply causation, that looks may be deceiving, and that the average is not the truth—rather, we are all unique, with our own histories, personalities, and principles. Hence, while visualizations are useful for making generalized conclusions in a historical context, such generalizations must remain general; they tell an accurate story, but not a precise one.

Models may be accurate, but they are inherently imprecise.
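To make the correlation trap concrete, consider a quick sketch (again assuming Python with numpy; the seed and series length are arbitrary choices of mine): two random walks generated with no causal connection to one another will nonetheless often exhibit a sizable correlation.

```python
# A toy demonstration that correlation does not imply causation:
# two independent random walks often correlate strongly anyway.
import numpy as np

rng = np.random.default_rng(seed=0)

# Two time series generated with no causal connection to one another.
walk_a = np.cumsum(rng.normal(size=500))
walk_b = np.cumsum(rng.normal(size=500))

# Pearson correlation between the two independent series.
r = np.corrcoef(walk_a, walk_b)[0, 1]
print(f"Correlation between two unrelated random walks: {r:.2f}")
```

Any historical narrative built atop such a coincidental correlation would be a story about noise, not about the past.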

Finally, in the space of historical data science and visualizations, we must be sensitive to what our data truly represents, and adapt accordingly. Jessica Marie Johnson warns of this in “Markup Bodies: Black [Life] Studies and Slavery [Death] Studies at the Digital Crossroads”:

the role[s] spectacular black death and commodification [played] in slavery’s eighteenth- and nineteenth-century Atlantic archive …, [if] left unattended, … reproduce themselves in digital architecture, even when and where digital humanists believe they advocate for social justice.

Out of respect for those robbed of their natural right to freedom and subjected to the grossly unjust institution of slavery, we must not reduce their struggle to a number, shape, color, region, line, or point in an effort to make the perfect model. Out of respect for those killed in the line of duty fighting for the progressive ideology they wished to spread, we must not abstract away their courage. Hence, as an antidote to the fundamentally dehumanizing nature of data science, we must supplement our creations with content that honors the souls represented within our visualizations, and continually advocate for social justice in their memory.

• • •

Given that data-driven historical visualizations 1) inspire new perspectives and ideas, 2) contextualize scale, 3) contextualize deep contingency, 4) are simultaneously objective and subjective [for better or for worse], 5) exchange detail for ease of understanding [for better or for worse], and 6) abstract away the humanistic element of history [for better or for worse], there’s no shortage of variables to consider when deciding whether or not a data-driven visualization provides the best platform for a given project.

While I can’t make that decision for you, I hope this discussion helps.

  1. University of Richmond
  2. University of Georgia