What is Data Literacy?

This March I’ll be presenting at the ACRL 2015 conference with Christine Murray (Bates College) on teaching data literacy in the library. To help me prepare and perhaps preview our discussion, I thought I’d post a few thoughts on the blog to get the juices flowing. Let’s begin with some definitions as they appear in both the library literature and the scholarship of statistics education in order to answer the question: what is data literacy?

In Libraryland, “data literacy” seems to be the most popular term (over statistical literacy, quantitative literacy, and numeracy), and consists of two aspects: information literacy and data management. From an information literacy perspective, the emphasis is on statistics, which are considered a special form of information but one that still falls under the information literacy umbrella. For example, Schield (2004:6) describes statistical literacy as the critical consumption of statistical information when used as evidence in arguments. Similarly, Stephenson and Caravello (2007) advocate for librarians to promote statistical literacy by assisting learners to locate and evaluate authoritative statistical sources, recalling Standards 2 and 3 of the 2000 ACRL Information Literacy Standards, as well as reference classics like the annual Statistical Abstract of the United States.

From the data management perspective, the emphasis is on data rather than statistics, and focuses on the organizational skills needed to create, process, and preserve original data sets. Returning to Schield (2004:7), he defines data literacy as the ability to obtain and manipulate data, but reserves these skills for certain fields of study such as business or the social sciences. Carlson et al. (2011:631), based on interviews with faculty and GIS students, emphasize the importance of data management and curation skills required to “store, describe, organize, track, preserve, and interoperate data.” There is plenty of literature on data management, a hot topic in Libraryland fueled by interest in e-science initiatives, new data requirements for federal grants, and the creation of institutional repositories. In my experience, though, discussion of data management is often divorced from statistical literacy, perhaps due to its focus on faculty and other experts rather than data novices. Calzada Prado and Marzal (2013) do attempt to unify the information literacy and data management aspects under one rubric, although their five proposed data literacy standards are largely derivative of the soon-to-be-sunsetted 2000 ACRL Information Literacy Standards, which doesn’t bode well for their wider adoption.

Turning away from librarianship, we find that statisticians and statistics educators typically use the term “statistical literacy” to describe the knowledge, skills, and dispositions surrounding their field. One widely cited exposition of statistical literacy is that of Iddo Gal (2002:2-3), who identifies two interrelated components: the ability to interpret and critically evaluate statistical information, as well as the ability to discuss and communicate one’s understanding, opinions, and concerns regarding such statistical information. Gal (2002:4) further describes a model of interrelated knowledge elements and dispositions that together enable statistically literate behavior. Gal’s definition will no doubt look familiar to information literacy librarians, incorporating the evaluative and communicative aspects of information literacy along with the dispositions and affective components we find highlighted under the new ACRL Framework.

But what is the nature of statistical information, the object of Gal’s model for statistical literacy? It may be helpful to consider this in terms put forth by George Cobb and David S. Moore (1997). In their oft-cited article on statistics pedagogy, they break down statistical analysis into three interrelated phases: data production, data analysis, and formal inference. Each of these phases produces statistical information requiring varying levels of contextual and mathematical knowledge.

Data production includes aspects of the research process related to designing a study, creating a data set, and preparing the data for short-term and long-term analysis. Viewed from the library, the data production phase is most closely associated with data management skills. Data analysis, next in Cobb and Moore’s schema, consists of the exploratory and descriptive phase of data-driven research. This includes examining the data set to discover trends or outliers, and using descriptive statistics to reduce large amounts of data into summary information such as measures of central tendency and variability (e.g. mean, median, mode, range, percentiles, standard deviation). Through this analysis, researchers can make hypotheses or predictions about phenomena revealed by the data. Finally, formal inference can be used to draw conclusions about a population from findings in sample data. Here we find the notorious formulas full of Greek letters behind procedures such as Student’s t-test, the chi-square test, ANOVA, and regression models. I’ll return to Cobb and Moore’s pedagogical advice in a future post.
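
To make the descriptive phase a little more concrete, here is a minimal sketch in Python using only the standard library's statistics module. The gate counts are made-up numbers for illustration, not data from any real library, and the describe helper is my own hypothetical name rather than anything from Cobb and Moore.

```python
import statistics

# Hypothetical data: daily gate counts for a library over two weeks.
# One day (610) is an obvious outlier, e.g. a campus event.
gate_counts = [212, 198, 240, 225, 198, 610, 205,
               230, 219, 198, 244, 236, 201, 227]


def describe(values):
    """Reduce a list of numbers to common descriptive statistics."""
    ordered = sorted(values)
    # quantiles(n=4) returns the three cut points that split the data into quarters.
    q1, q2, q3 = statistics.quantiles(ordered, n=4)
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "mode": statistics.mode(ordered),
        "range": ordered[-1] - ordered[0],
        "quartiles": (q1, q2, q3),
        "stdev": statistics.stdev(ordered),  # sample standard deviation
    }


summary = describe(gate_counts)
for name, value in summary.items():
    print(name, value)
```

Notice that the single outlier pulls the mean well above the median, which is exactly the kind of pattern the exploratory phase is meant to surface before anyone reaches for an inference test.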

So back to the original question: what is data literacy?

I suggest librarians borrow heavily from statistics educators when trying to answer this question. To paraphrase Gal and apply his definition to Cobb and Moore’s three phases of statistical analysis, the simplest definition of data literacy is the ability to interpret, evaluate, and communicate statistical information. Central to this ability is an understanding of how statistical information is created, encompassing data production, data analysis, and formal inference. In other words, data literacy includes the ability to evaluate the modes of data production, including the underlying research design and means of sampling, and how this impacts the possible findings. Data literacy also includes the ability to interpret the results of formal inference tests, including confidence intervals and the probability that findings are representative of a population rather than coincidental to the given sample. And finally, data literacy includes the ability to interpret and communicate about the descriptive statistics learners and citizens encounter every day, from unemployment rates to political polling.
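
As a small illustration of what interpreting a confidence interval involves, here is a sketch in Python of the margin of error for a poll result. It assumes a simple random sample and uses the common normal-approximation formula with z = 1.96 for roughly 95% confidence; the poll numbers and the proportion_ci function name are hypothetical, and real polls often apply weighting that this ignores.

```python
import math


def proportion_ci(p_hat, n, z=1.96):
    """Normal-approximation confidence interval for a sample proportion.

    p_hat: observed proportion (0 to 1); n: sample size;
    z: critical value (1.96 corresponds to roughly 95% confidence).
    """
    if not 0 <= p_hat <= 1 or n <= 0:
        raise ValueError("p_hat must be in [0, 1] and n must be positive")
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    # Clamp to the valid range for a proportion.
    return max(0.0, p_hat - margin), min(1.0, p_hat + margin)


# Hypothetical poll: 52% of 1,000 respondents favor a candidate.
low, high = proportion_ci(0.52, 1000)
print(f"{low:.3f} to {high:.3f}")
```

For this made-up poll the interval runs from roughly 49% to 55%, so a data-literate reader would hesitate to call the race on a 52% headline alone.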

And what about data management? Ultimately it belongs to the data production phase of Cobb and Moore’s schema, and is perhaps one aspect of data literacy that, as Schield intimated, can be reserved for the specialists. While the data-literate person can identify and evaluate the soundness of a research design and data collection methods, perhaps only trained practitioners need the specialized skills to carry out a full-fledged project involving data curation and advanced tools. And in most instances, teaching these skills is beyond the purview of librarians. Stay tuned for more on this and data literacy instruction in the library.

Calzada Prado, Javier and Miguel Ángel Marzal. 2013. “Incorporating Data Literacy into Information Literacy Programs: Core Competencies and Contents.” Libri: International Journal of Libraries & Information Services 63(2):123–34.
Carlson, Jacob, Michael Fosmire, C. C. Miller, and Megan Sapp Nelson. 2011. “Determining Data Information Literacy Needs: A Study of Students and Research Faculty.” portal: Libraries and the Academy 11(2):629–57.
Cobb, George W. and David S. Moore. 1997. “Mathematics, Statistics, and Teaching.” The American Mathematical Monthly 104(9):801–23.
Gal, Iddo. 2002. “Adults’ Statistical Literacy: Meanings, Components, Responsibilities.” International Statistical Review 70(1):1–25.
Schield, Milo. 2004. “Information Literacy, Statistical Literacy and Data Literacy.” IASSIST Quarterly 28(2):6–11.
Stephenson, Elizabeth and Patti Schifter Caravello. 2007. “Incorporating Data Literacy into Undergraduate Information Literacy Programs in the Social Sciences: A Pilot Project.” Reference Services Review 35(4):525–40.

Correlation Coefficients, or Applying What I Learned at LOEX 2014

LOEX is one of my favorite conferences. Its smallness makes it more intimate than ALA or ACRL. It’s “all inclusive,” which promotes those in-between-sessions conversations that are often the most fruitful. And everything is about instruction. Win win win.

And sometimes one of the most helpful takeaways appears in an unexpected format. All the sessions I attended at LOEX 2014 in Grand Rapids, Michigan, were great, but the one that paid the most immediate dividends for what’s happening right now at my library was the lightning talk by Chantelle Swaren of the University of Tennessee at Chattanooga. In a strictly timed seven-minute presentation to all LOEX attendees after lunch, Chantelle explained the statistical concept of correlation, and demonstrated how to use Microsoft Excel to generate correlation coefficients. This is a statistical method for revealing relationships among the piles of data we have lying around: circulation stats, survey results, instruction session attendance, etc.
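
For anyone who would rather see the arithmetic than click through a spreadsheet, here is a minimal sketch in Python of the Pearson correlation coefficient, which is the statistic Excel's CORREL function returns. This is my own illustration, not Chantelle's method; the pearson_r function name and the numbers are hypothetical, invented only to show the calculation.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: instruction sessions taught per month
# and checkouts from the related collection in the same month.
sessions = [2, 4, 5, 7, 9, 12]
checkouts = [110, 150, 145, 190, 230, 260]

r = pearson_r(sessions, checkouts)
print(round(r, 3))
```

The result always falls between -1 and 1: values near 1 mean the two measures rise together, values near -1 mean one rises as the other falls, and values near 0 suggest no linear relationship.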

As it happened, my library received the results of our local Ithaka S+R Survey of faculty right before I departed for Grand Rapids. While Chantelle was giving her presentation, I had the Excel spreadsheet of the survey responses (scrubbed of identifying information, of course) in my e-mail inbox. How fortuitous! I have to admit that I did test out the Excel correlation function before dinner that same day, but I still had some things to learn about correlation and the survey data before I could make these numbers meaningful.

Since then, I’ve done a little reading (thanks Wikipedia!) to better understand how statistical correlation works and what its limitations are. And now that my entire library is focused on digesting and interpreting our Ithaka Survey results, I’ve been putting my new Excel skills to good use. The folks at Ithaka sold us an analytic report of the survey findings, but that mostly included comparisons of the Tulane faculty responses to those of the 2012 national faculty survey. I could have done that myself since the data set for the national survey is in the ICPSR data archive. I wanted to know more. For example, does a respondent’s perception of librarians’ impact on student success have any relationship to how much they value librarians overall? (Answer: It does.) Or does a respondent’s heavy use of the library collections have any relationship to their willingness to divert funds away from the library building and staff? (Answer: It doesn’t.) Correlation among these survey responses does not imply causation, but it does generate some interesting questions about how the library is perceived and valued by our faculty, and what aspects of our work with them may have the most impact.

As a library we’re still digesting the survey results, but my work with correlation coefficients has quantified what most librarians in public services have known for a long time: the more visible we make our work to our users, the more they will value us as partners in their research and teaching. When we make ourselves invisible (a common side effect of making discovery and access easier for users), the work of librarians becomes devalued simply because our users don’t know about it. How my library will respond to these findings is still being discussed, but to me the solutions are obvious: put more resources into the visible services that generate value for the library as a whole, and find ways to make the inherently invisible work of librarians (collection development, technical services, electronic resources management) more visible outside the library walls. It’s up to all of us to demonstrate value to our communities, and the numbers suggest that even at a research institution, just having a great collection is not enough.