SPSP Students Working with Big Data

By Hasagani Tissera & Lucy Zheng

Researchers may employ numerous approaches to answering research questions. One relatively novel approach is to leverage big data, that is, to analyze large amounts of publicly or privately available data to test research questions. Big data can be obtained from a variety of sources, such as social media sites (e.g., Facebook, Twitter). In this month’s issue, we interviewed three SPSP graduate student members who have experience working with big data: Jana Berkessel (Ph.D. student, University of Mannheim), Nina Wang (Ph.D. student, University of Toronto), and Babak Hemmatian (Ph.D. student, Brown University).

Big data is also the focus of SPSP’s first Summer Psychology Forum being held August 2-4 in St. Louis. Don’t miss this chance to join a lively conversation and gain hands-on experience working with big data.

1. Can you describe a little bit of your research with big data?

Jana: Since I am a first-year Ph.D. student, I am still learning how to find, retrieve, and use large-scale datasets and non-traditional data sources to answer my research questions. My main research interests revolve around the cross-cultural variability of social psychological phenomena. For example, I am interested in the effects of social class on well-being. Over the last year, I have come to realize that some of my research questions cannot be answered using college student populations, especially when researching social class. Additionally, some psychological constructs are hard to manipulate or measure in the lab. I have therefore started to apply methods beyond traditional lab studies and to look into big data, meaning both large-scale survey datasets and more creative, novel sources such as data extracted from websites. For example, I have recently started to use web scraping to extract process-generated data, meaning data that were generated “on the go,” i.e., without any intention of being used in a study. These data included bodies of text, and since text is at the heart of how we communicate online, I have looked into applying semantic analyses to extract meaningful topics from these texts.

Nina: I use techniques from natural language processing to study moral language on social media platforms like Twitter and in political discussions.

Babak: I study how discourse around controversial topics has evolved on Reddit over the past fifteen years. I am especially interested in how changes in online discourse correspond to concepts from psychological theory on the one hand, and to changes in public attitudes as reflected in opinion polls on the other, to see which psychological theory can better explain changes in societal attitudes. For instance, a recently published article of mine uses topic modelling to show that discussion of individuals’ values increased sharply before same-sex marriage gained majority support in the United States and declined rapidly soon after, while discussion of the consequences of same-sex marriage showed the opposite pattern. These trends are observed well in advance of the Supreme Court ruling that legalized same-sex marriage in the United States and may have contributed to the growing popularity of same-sex marriage. I am currently using artificial neural networks to study similar trends in discourse surrounding marijuana legalization, another topic on which a societal consensus has been emerging over the past fifteen years.

2. What are the benefits of working with big data?

Jana: Working with big data has many upsides. Harvesting existing, sometimes process-generated data has some obvious advantages: For one, we are able to retrieve data that capture human behavior in an entirely natural setting. For example, when asked to rate an object, a participant in a lab study might use different words from those they would write in an online review.

In addition, using big data helps to diversify the population under study. While lab studies are usually conducted with college students, the internet is used by a much more diverse population than college campuses can offer. Especially when studying hard-to-reach populations and cultural specificities, being able to obtain data from these populations is extremely helpful.

Nina: The ability to test psychological theories in an ecologically valid manner, and to access populations that are otherwise difficult to test (e.g., I would have a hard time getting politicians to participate in a survey or complete an experiment, but all their tweets and speeches are publicly available).

Babak: As the name “big data” implies, the most obvious benefit of this type of research is the vast amount of data that can be brought to bear on one’s hypotheses. My Reddit dataset, for example, contains billions of posts consisting of tens of billions of words. Despite known biases in the demographics of social media users, the fact that these posts reflect real everyday interactions lends research in this domain a level of external validity that is rarely matched by more traditional behavioral paradigms. Many of the databases are openly available, and a wealth of tools and packages developed for different purposes offers researchers like me flexibility in examining a variety of cross-sectional or longitudinal hypotheses.

3. What are the challenges of working with big data?

Jana: There are few established methods for measuring certain psychological concepts through big data, so you need to get creative: Which concept am I interested in, and how might this concept manifest itself in digital footprints? However, most of us (including me) are not trained in computer science, so finding out how to retrieve the data can be an adventure in itself. Luckily, others have been there before, and more and more tutorials on harvesting data from Twitter and other social media platforms, as well as on semantic analysis and several other big data methods, are available online. Another consideration is that, while using big data may increase ecological validity, data generated in natural settings are far less controlled than data from the lab. We do not have any information about the situation people were in or what goal they had in mind. This reduces internal validity and makes causal claims much more complicated.

Nina: The computational power and programming knowledge required to wrangle large datasets; the limitations and messiness of working with real-world data like social media posts.

Babak: As someone with a background in psychology who only started working with big data in graduate school, I found the biggest challenge to be the high upfront cost in time and effort of developing the broad set of skills I needed to succeed in this area. Rapid changes in machine learning methods, in the kinds of packages that developers use, and in the datasets themselves mean that the average half-life of this painstakingly acquired knowledge is also short.

On a more data-focused note, with real-life data comes real-life messiness. In the absence of experimental controls, various sources of noise can contaminate the results. Therefore, much of my time is spent striking a careful balance between gathering comprehensive information on my topic of interest and not allowing too much noise to enter my analyses. Due to the same lack of experimental control, even the cleanest results with big data are often just correlational, and while hints of causality can be observed in the temporal order of events, causal interpretations are often less straightforward than in experimental psychology.

4. Can you describe the process of working and publishing with big data?

Babak: A significant amount of my time at the beginning of projects is spent on theory-guided exploratory analysis of raw textual data. This is followed by conversations with experts from both the psychological sciences and computer science to find the right resources and come up with the architecture that best answers the theoretical question I care about with as much technical ease and accuracy as possible. This step is followed by many hours of programming to get the intended results. One of the challenges in implementing architectures is the difficulty of predicting how much time each step of the work will take: Seemingly straightforward steps can turn into technical hurdles, while other, apparently difficult, parts of the project may go surprisingly smoothly.

Publishing with big data can look very different depending on the kind of venue where one intends to present one’s work. For instance, computer science conferences may put more emphasis on algorithm comparisons based on concrete accuracy and processing speed, while more psychologically oriented communities may pay more attention to the theoretical potential of the question and how soundly it maps onto the analyses, even if outdated algorithms are used. These criteria can have a strong impact on which parts of the procedures are emphasized or de-emphasized and on what the products must look like in order to get published. Nevertheless, since experimental data are gathered less often, adapting one’s work to new analyses and standards can be faster and more straightforward than in more traditional psychological research.