Big data and public health: New scenes and a new state of mind

By Eun Kyong Shin

The 2017 International Conference on Social Computing, Behavioral-Cultural Modeling, & Prediction and Behavior Representation in Modeling and Simulation (SBP-BRiMS 2017) was held in Washington, DC, in July. Public health and healthcare are among the prominent fields that apply social computing techniques. In early modern epidemiology, data collection relied heavily on painstaking manual labor. Large-scale data was hard to obtain and resulted from careful observation and intensive recording. Since the introduction of the internet and advances in digital communication, massive amounts of dynamic data have accumulated exponentially. Along with the digitization of medical practices and other social data collection processes, the nature of scientific discovery has fundamentally changed. How has the scene been so transformed? In the era of big data, medical and non-medical data have expanded in quantity and variety beyond what was once available or even imaginable. Moreover, this quantitative expansion changes the landscape for medical research and clinical treatment. The traditional assumption of the large N has lost its implied, if not granted, power as we encounter an ever-expanding denominator. However, accumulating large quantities of data doesn’t automatically solve the problems we face in healthcare and public health. Digital data and computation are among the unavoidable realities in health sciences, just as they are in any empirical science dealing with human affairs. Ubiquitous data makes the question “Why do we need to care about big data in public health?” obsolete. Instead, we need to ask how we can utilize all this data, how we can make the most of the rapidly changing research and clinical realities that big data creates, and how we can change our mindsets to address the challenges of a new kind of empirics. An abundance of data does not inherently help us arrive at solutions easily.
The data collection process has become more efficient, but the same is not necessarily true of analysis. As an example, consider the classic success story of modern epidemiology. In 1854, John Snow dramatically reduced the spread of cholera in London because he was able to find out that affected households were clustered around the water pump on Broad Street. Dr. Snow merged two simple data sets: one that described where the infected patients lived and one that identified the main built environments located near their dwellings. Transplanting this example into the contemporary context, we also have multiple types of data available to us. Patient data cover a wide array of information, including where patients live, where and when they have traveled, their medical histories, and the hospitals with which they are initially identified. Similarly, geographic information spans not only landmarks and facilities but also public transportation systems, traffic flow, air pollution, and exposure to hazards and toxic substances. It is true that we now have more comprehensive data available. However, having lots of data does not guarantee that the most relevant data will be examined and used to make the best choices in a timely manner. Moreover, abundant data can bring new challenges related to validity. As the amount of available data has increased, it has become possible to garner data that supports false arguments, and false positives can flourish. Almost any hypothesis, valid or false, can be supported with some empirical evidence. Statistical significance doesn’t speak as loudly as it once did. Therefore, semantic and logical analysis of the data should be taken into account to reach better judgments. A tighter link between data and knowledge is one of the most urgent quests in scientific research in general and in the health sciences specifically.
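To make the point about false positives concrete, here is a minimal arithmetic sketch. It assumes a simplified setting that the discussion above does not spell out: every hypothesis tested is actually false (the null is true), the tests are independent, and each uses the conventional 0.05 significance threshold. The function names are illustrative only.

```python
def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive among independent tests.

    Assumes every null hypothesis is true and the tests are independent,
    which is a simplification for illustration.
    """
    return 1.0 - (1.0 - alpha) ** num_tests


def expected_false_positives(num_tests: int, alpha: float = 0.05) -> float:
    """Expected number of 'significant' results produced by chance alone."""
    return num_tests * alpha


if __name__ == "__main__":
    # The more hypotheses a large data set lets us test, the more
    # spurious "findings" we should expect to see.
    for m in (1, 20, 1000):
        rate = familywise_error_rate(m)
        count = expected_false_positives(m)
        print(f"{m:>5} tests: P(at least one false positive) = {rate:.3f}, "
              f"expected false positives = {count:.1f}")
```

Under these assumptions, the chance of at least one spurious result grows quickly with the number of hypotheses examined, which is one reason a single significant p-value carries less weight when data are abundant.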
Whether one’s work is about quantifying unstructured text from medical charts and online communications, overcoming the scarcity of empirical data in rare medical cases, or merging multiple data sources on a large scale, advanced computing will enable us to unravel health complexities that are otherwise impossible to understand. Digitization of medical practices and of the data collection process often leads to disenchantment for both researchers and the public. This dissonance informs and encourages big data investigators to see what information is critical and fundamental to enhancing our understanding of the dynamics related to health.

Eun Kyong Shin, PhD, is a postdoctoral fellow at the University of Tennessee Health Science Center – Oak Ridge National Lab (UTHSC-ORNL), Center for Biomedical Informatics, Department of Pediatrics (affiliated with Le Bonheur Children's Hospital). She attended the 2017 International Conference on Social Computing, Behavioral-Cultural Modeling, & Prediction and Behavior Representation in Modeling and Simulation (SBP-BRiMS) with support from the South Big Data Hub.