Finding Hidden Relationships between Right Skewed (Log-Normal) Variables

Oftentimes in research we’re advised to plot out the variables we’re interested in to check for extreme right skew in our data. Right skew is indicated when most of the data is clustered on the left side of the distribution, with a few extreme values out on the right. Common advice when encountering skewed data like this is to apply a log transform to render it normal. A standard log transform replaces your original variable with the log to base 10 of that variable. This means that the values 10, 100, and 1000 in your original data will become 1, 2, and 3 in the transformed data. Thus, where you may have had data that clustered around 10, with outliers several orders of magnitude higher than average at 1000, the transformed data will mostly be clustered around 1, with a few 2’s and 3’s. The resulting distribution will be much closer to normal.
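In R, for example, this is a one-line operation:

```r
# A base-10 log maps powers of ten onto evenly spaced values
log10(c(10, 100, 1000))
#> [1] 1 2 3
```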

This procedure is advised because many statistical procedures assume that we’re working with normal variables. But that’s a fairly abstract reason to transform a variable, and it can often feel unclear what it is you’re actually doing. A visual demonstration can make the reason for log transforms much more obvious. One example of data in need of a log transform is the physical size of United States census tracts. Census tracts are areas defined by the Census Bureau, and are designed to contain between 1200 and 8000 individuals in order to provide the basic unit of aggregation for the Census. However, while the population of census tracts is constrained within a certain range, the physical areas they cover vary widely. High population density areas such as major cities contain many census tracts about the size of a city block, but some low population density areas like the desert stretches of Arizona contain census tracts thousands of square miles across. We can see that the land area of census tracts has a right-skewed distribution:

[Figure: distribution of census tract land areas, showing strong right skew]

And we can see that after a log transform, the data looks pretty close to normal. But it’s not immediately clear how that benefits us, besides the general knowledge that normal variables are good to have.

[Figure: distribution of log-transformed census tract land areas, roughly normal]

To get a clearer demonstration of the benefits of log transforms, we need to look at the relationship between two right skewed variables. Fortunately, the census tract data set we’re looking at also has a record of the area of water in each census tract, a variable that is skewed in a similar fashion to the land area. If we plot the relationship between the two of them without transformations, we get the following plot, which is fairly unenlightening. It’s unclear whether there’s a pattern to the relationship between the two variables.

[Figure: water area plotted against land area, untransformed]

However, when we look at the relationship between the two variables after a log transform, the picture is very different.

[Figure: water area plotted against land area after log-transforming both]

For one thing, we can now clearly see the clusters of census tracts falling along one axis or the other – these are tracts that are pure land, or pure water. But, more importantly, we can now see a very clear relationship between the land area and water area of each census tract – unsurprisingly, census tracts with larger land areas tend to have larger bodies of water within them, and the relationship is very clean. So why was this hidden before we transformed the data? It’s a matter of caring about relatives rather than absolutes.

Let’s say that, on average, each census tract has a water area 10% the size of its land area – usually made up of lakes. This is the linear relationship we’re looking for. But of course, there’s error in this relationship – some tracts will have lakes that are relatively smaller or larger, resulting in compositions of 8% water, or 12%, and so on. The problem is, when a census tract the size of Texas deviates from that expectation by two percentage points, that’s a difference of thousands of square miles of water – bigger than the total size of the majority of tracts. What ends up happening is that, as we move right on the graph, our data points deviate more and more in absolute terms from the ‘true’ relationship – despite having the same relative margin of error. This often results in a graph like the first one, where the true relationship between the variables is obscured.
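To make that concrete, here’s a quick simulation – not the census data itself, just made-up numbers following the 10%-of-land-area rule described above – showing how a purely relative error hides the relationship on the raw scale but not on the log scale:

```r
set.seed(1)
n <- 2000

# Made-up land areas spanning several orders of magnitude (right skewed, like tract areas)
land <- 10^runif(n, min = 0, max = 4)

# Water area is about 10% of land area, with purely *relative* error:
# most tracts sit somewhere between roughly 8% and 12% water, whatever their size
# (pmax just guards against the vanishingly rare negative draw)
water <- land * pmax(rnorm(n, mean = 0.10, sd = 0.02), 0.01)

par(mfrow = c(1, 2))
plot(land, water, main = "Raw scale")                    # a handful of huge tracts swamp the picture
plot(log10(land), log10(water), main = "Log-log scale")  # the 10% rule is now obvious

# On the log scale the fit is clean: slope near 1, intercept near log10(0.10)
lm(log10(water) ~ log10(land))
```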

It’s important to remember that this makes the most sense when there is a good case to be made that the size of an observation is important in determining what a meaningful change in that observation might be. For instance, for someone with 100 Twitter followers, a change of 100 extra followers is highly meaningful – the same might not be the case for someone with 10 000 followers. If an experimental stimulus is usually responded to in 100 ms, a 10 ms difference is likely meaningful – the same couldn’t be said for an anagram that takes 2 minutes to complete, on average.

So there you go – a (hopefully) practical demonstration of why it’s useful to log-transform your skewed variables, and of the effects it actually has.

Book review: Structural Equation Modelling for Social and Personality Psychology

I enjoy reading statistics and methods textbooks in my spare time. That’s probably not a trait I share with a lot of people, but I continue nonetheless. I also enjoy endless searches through Amazon, forum threads, and other review websites to make sure I’m picking the best book on any given subject. Together, those traits mean that over the past year or so I’ve read through a decent number of very good textbooks on useful research topics. I thought I’d start writing reviews of some of those texts in the (somewhat optimistic) hope that others might find them useful.

I recently finished ‘Structural Equation Modelling for Social and Personality Psychology’, part of a series of books aimed at giving an accessible introduction to a research method, pitched directly at personality and social psychologists. The author, Rick Hoyle, states in the foreword that the book’s purpose is to take researchers with a knowledge of analysis of variance and regression, and instill in them the basics of structural equation modelling (SEM). I found the book to be an excellent introduction to the topic, and felt that its description as a ‘nontechnical overview’ is a bit of an undersell. The book is certainly easy reading and not overly bogged down in formulae, but for all of that Hoyle manages to convey a fairly deep understanding of the mathematical underpinnings of SEM.

I feel that this book is ideal for researchers who, like myself, may have been using SEM for some time without a strong understanding of its underpinnings. Modern programs with friendly point-and-click interfaces let us specify and report structural equation models without truly understanding, for instance, that the measures of fit the program spits out reflect how well the implied covariance matrix reproduces the actual covariance matrix, or what is required for full model identification (or even what identification means!). Reading this book has deepened my knowledge of the topic, and I feel much more confident in using SEM as a result. I would strongly recommend this book to anyone interested in gaining a firm grasp of structural equation modelling without delving into a formula-heavy textbook. I also feel that the “for Social and Personality Psychology” series as a whole is well-aimed. I found John Nezlek’s text on multilevel modelling in the same series to be equally high quality, and will probably review that text in the future.

Examining Publication Bias and the Decline Effect using Registered Replication Reports

Are psychological findings reliable? Do published studies mostly represent true effects, or are non-effects that happen to look significant through random sampling error overrepresented, thanks to publication bias – the tendency to publish only significant results? An increasing number of projects are being undertaken to answer these questions by meticulously replicating previously published research to see whether the same results are obtained. One of these projects is the Registered Replication Report, recently unveiled at the journal Perspectives on Psychological Science. A Registered Replication Report, or RRR, allows psychologists to specify their intention to replicate a published finding and to lay out their analysis plan in advance, so we can be confident that what they report is a confirmatory analysis (designed to test an established hypothesis) rather than an exploratory one (designed to test multiple possible results to generate hypotheses).

Recently the first RRR was published at Perspectives. In this report, 31 labs collaborated to replicate an experiment by Schooler and Engstler-Schooler (1990) on verbal overshadowing – the finding that describing something verbally (in this case, a suspect in a criminal case) impairs subsequent visual identification of that same thing. The extent of this multi-lab effort is unusual, and likely reflects an initial enthusiasm for the idea of the RRR, but it does provide a mountain of evidence on verbal overshadowing. Due to an error in the initial protocol, only the second of the two studies in the replication report is a direct replication of one of the studies in the original paper. That still leaves data from 22 labs for this experiment, which found a reliable drop in identification accuracy of 16% when the suspect was verbally described. This is substantial, if somewhat smaller than in the initial study, which found an effect size of 25%. Given that the original published effect size falls outside the confidence interval generated by the replication effort, I thought this data might make a good case study in the inflation of effect sizes that can result from publication bias.

When experiments don’t run enough participants to have a good chance of finding their effect (which is common in psychology), studies that find significant effects will tend to be those where random variation around the true effect errs on the side of making it larger, and thus easier to detect. Therefore, if publication bias leads to only significant studies getting published, the average effect size of published studies will be substantially larger than the true effect. Since a little less than half (9 of 22) of the studies in this replication effort are significant individually, it’s possible to directly compare the true effect size (at least as measured by the full set of 22 studies) with the effect size that would result from publication bias.
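To see how much of a difference that selection makes, here’s a toy simulation (nothing to do with the verbal overshadowing data itself): a modest true effect, studied repeatedly with too few participants, where only the significant results get ‘published’:

```r
set.seed(42)

true_d <- 0.3    # a modest true effect
n      <- 50     # participants per group: underpowered for d = 0.3
n_sims <- 5000

obs_d <- numeric(n_sims)
sig   <- logical(n_sims)

for (i in seq_len(n_sims)) {
  control   <- rnorm(n, mean = 0,      sd = 1)
  treatment <- rnorm(n, mean = true_d, sd = 1)
  # Observed standardized effect: mean difference over the pooled SD
  obs_d[i] <- (mean(treatment) - mean(control)) /
              sqrt((var(treatment) + var(control)) / 2)
  sig[i]   <- t.test(treatment, control)$p.value < .05
}

mean(obs_d)        # all studies: close to the true effect of 0.3
mean(obs_d[sig])   # 'published' (significant) studies only: noticeably inflated
mean(sig)          # the power these studies actually had
```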

[Figure: meta-analysis of the Study 2 verbal overshadowing effect, with an added estimate based only on the individually significant studies]

The above is a graph of the meta-analysis of the verbal overshadowing effect in Study 2. The meat of the graph is generated from the same code that generated Figure 3 in the paper (found here). In addition, I’ve added a final horizontal line, on which I’ve placed a marker in blue for the meta-analytic effect of only those studies that found a significant effect individually. At the bottom right, you can see how the confidence interval for this new meta-analytic effect compares to the overall effect.
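For anyone wanting to do something similar, here is a sketch of how such a ‘significant studies only’ estimate could be computed with the metafor package. This isn’t the code used for the figure (that’s linked at the end of the post); it assumes a data frame dat with one row per lab and hypothetical column names yi (the lab’s effect estimate), vi (its sampling variance), and pval (that lab’s own p value):

```r
library(metafor)

# dat is assumed to have one row per lab, with hypothetical columns:
#   yi   - that lab's effect estimate (difference in identification accuracy)
#   vi   - the sampling variance of that estimate
#   pval - the p value from that lab's own analysis

rma(yi, vi, data = dat)                       # meta-analysis of all 22 labs
rma(yi, vi, data = dat, subset = pval < .05)  # only the individually significant labs
```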

As expected, publication bias yields a larger effect size estimate. How much larger? 23% compared to 16% – the biased estimate is roughly 144% of the size of the full-sample estimate. This difference speaks to the so-called ‘decline effect’, where the first published finding of any given effect tends to be the largest, and follow-up studies average smaller effects. We can see that the original study by Schooler and Engstler-Schooler (represented at the top of the figure) yielded an effect size that falls outside the confidence interval of the larger meta-analysis. However, that initial effect size falls almost perfectly in the middle of the confidence interval generated by a population of only significant studies. Since counter-intuitive findings like verbal overshadowing tend only to be published when they are significant, we can assume that initial publications of such findings are drawn from a sample of only significant studies, which would completely account for the ‘decline effect’ in this particular case.

Of course, this doesn’t indicate that there was anything suspect about the Schooler and Engstler-Schooler (1990) study. Looking at the graph above, several of the replication efforts found effects that almost perfectly mirrored the initial study. If we imagine these 22 replication efforts to be 22 original studies, each of the 9 authors of a significant effect could publish their effect with completely honest reporting, while the other 13 conclude that they failed to find a significant result and go back to the drawing board. Thus, each published paper would be an over-estimate of the effect it reports on, while being perfectly valid, statistically, in its own right.

Given all of this, it seems reasonable to weight effect size estimates downward to some degree when reading a paper that attempts to verify a counter-intuitive hypothesis for the first time. The size of the publication bias effect will depend on the power of published studies to find a real effect, so you can adjust your estimates less when the published study has a substantial number of participants for the size of the effect it’s looking to test. If you want to get a better idea about power estimates, programs like G*Power can be useful. One bright note on this front is that, with the sudden ease of collecting large numbers of participants through sites like MTurk, many psychology publications going forward can be more than adequately powered, reducing the effect of publication bias substantially.
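If you’d rather do a quick version of that calculation without leaving R, base R’s power.t.test covers the common cases (the effect size and sample size below are just illustrative):

```r
# Power of a two-group design with 50 participants per group to detect d = 0.3
power.t.test(n = 50, delta = 0.3, sd = 1, sig.level = .05)

# Participants per group needed to reach 80% power for the same effect
power.t.test(delta = 0.3, sd = 1, sig.level = .05, power = .80)
```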

Replication efforts like this one go a long way toward helping us understand not only how reliable specific effects are but, as more replications stack up, how reliable the published body of studies in psychology is as a whole. The relative ease with which I was able to further analyse this published data also speaks to the value of making data and analysis code publicly available on platforms like the Open Science Framework. In that spirit, the code used to generate my extended plot can be found here.

The Ethics of Using Existing Data in Research

In a few weeks I’ll be presenting a brief workshop on using existing sources of data in research as part of a post-graduate seminar series being run this semester in the UQ School of Psychology. It’s a topic I’m pretty keen on, so I thought I’d write a few posts on some aspects of using existing data that I’ve been thinking about lately. The first topic I wanted to write about was the ethical concerns involved in the use of existing data.

Research ethics has come under heightened scrutiny lately with the revelation that PNAS, a leading academic journal, published a study conducted by Facebook in what seems to be a fairly unethical manner. Some of the discussion around that study focused on whether it was appropriate for the Cornell researchers who worked with Facebook on the study to get ethical approval under the lax guidelines surrounding the use of existing data. The main argument seems to be that because these authors had a hand in designing the study (at least according to the attributions in the paper), it wasn’t appropriate to claim it was an ‘existing dataset’.

While I agree that the use of Cornell’s existing data ethics policy was probably inappropriate in this case, what I’m more interested in talking about is why it’s acceptable to have lower standards for ethics in the use of existing data than when we’re actually collecting it. The 2007 National Statement on Ethical Conduct in Human Research, which guides Australian human research practices, states that institutions may exempt research involving existing datasets from ethical review if they contain no identifying information about participants and involve negligible risk to participants. It’s unclear whether the negligible risk clause refers to the original research or to your particular use of the data, which leaves some room for interpretation. A quick search shows several institutions that waive the need for ethical review for the use of any publicly available data.

Intuitively, this doesn’t seem too unreasonable. It seems difficult to do harm to participants from whom data has already been collected – if they did experience any discomfort or mistreatment, it has already occurred, and your analysis of the data will not change that. But this ignores the fact that the collection of data is at least partially demand-driven. To make an admittedly dramatic comparison: while the elephant may have been shot months before you buy an ivory souvenir, it’s still arguably unethical to increase the demand for such products by purchasing them, leading to increased poaching.

I don’t think the comparison is completely inappropriate. We’re seeing the rise of private companies and organisations with access to data that is incredibly valuable from a scientific standpoint. Dating websites like OkCupid, social network sites like Facebook and Twitter, and search companies like Google have all begun collecting large amounts of detailed data on their users. This is the kind of real-world, detailed data that academic researchers haven’t had a lot of access to until recently, and so I can imagine demand for this data will be high, as it may be highly publishable. The concern then is that if academics are continually willing to analyse and publish data that private companies collect under ethically questionable circumstances, this may drive a push for more unethical research to be conducted.

The take-home message here is not particularly earth-shattering, but it bears saying: just because you didn’t collect the data yourself doesn’t make using it ethical. I think many university ethics boards will adopt higher scrutiny of this kind of data use in light of this latest controversy, but I also think as researchers we need to hold ourselves to a certain standard. I would question heavily whether it’s worth publishing using existing data if you wouldn’t feel comfortable running the research that generated it.

Automatically generating large Qualtrics surveys in R


Qualtrics is an online survey platform widely used by researchers in the social sciences for data collection. While it offers a great deal of depth in survey customization and has a lot of powerful tools, one thing it’s really lacking is scalability. If you need to generate (or delete, or modify) large numbers of similar question blocks, you’re faced with a tedious process of manually copying, clicking, and dragging.

I do a lot of research that broadly falls under ‘person perception’, which means I generally collect responses from a large number of participants in the form of video or text, and then need to have each of those responses rated, usually by MTurk workers. This leaves me faced with manually creating hundreds of near-identical Qualtrics blocks each time I run a study.

Thankfully, there is an alternative to the tedium, and that’s to use Qualtrics’ advanced text importer (as seen here). It’s not widely used (at least, according to a call to Qualtrics support about the possibility of extending the features) but it can save a lot of time.

The way the tool works is that you generate a .txt file with the content you want in your survey, annotated with tags that tell Qualtrics how to format that content into questions. For example, when the text below is imported into Qualtrics, a new survey will be created with a multiple choice block that displays an image (hosted on Imgur) and allows participants a same/different response choice.
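A minimal version of such a file, using Qualtrics’ Advanced Format tags, looks roughly like the one below. The ‘Mental rotation block’ name is kept from the example; the Imgur URL and question wording are placeholders of my own, so check the Qualtrics import documentation for the full tag set.

```
[[AdvancedFormat]]

[[Block:Mental rotation block]]

[[Question:MC]]
<img src="http://i.imgur.com/XXXXXXX.jpg" /><br />
Are these two images the same or different?

[[Choices]]
Same
Different
```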
Since formatting such a file by hand would be just as tedious as creating the survey block-by-block, this is where some automation comes in handy. There are a few steps you’ll need to take to get this running:

  1. First, you need the set of stimuli you’re using organized in a single column of a spreadsheet. Save the spreadsheet as Stimuli.csv, and take note of the file location.
  2. Next, you’ll need to download and install R and RStudio. R is an open source programming language that will be doing the heavy lifting here.
  3. Once both are installed, open RStudio and you’re ready to go.

We’ll be using the cat function in R to generate our text files. Cat, short for concatenate and print, allows you to flexibly set up commands in R that will take your stimuli and write them to a text file in the order you specify. For example, imagine that your Stimuli.csv file contains a list of image links that you need embedded into Same/Different questions, each with their own block. If you replace “/Users/Yourname/Directory” with the directory you saved your file in, the code below will compile all of your links and save them in the correct format to Choiceblocks.txt, in the same directory. Windows users note that you’ll need to change the backslashes in your directory name to double backslashes to make this work.

[Code screenshot: generating lots of blocks]
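The original code is shown as a screenshot; the sketch below is along the same lines, assuming Stimuli.csv holds a single, header-less column of image links (the block names and question wording are placeholders of mine):

```r
# Read the single-column list of image links (assumed here to have no header row)
setwd("/Users/Yourname/Directory")
stimuli <- read.csv("Stimuli.csv", header = FALSE, stringsAsFactors = FALSE)

# The first line of the file tells Qualtrics to use the advanced importer
cat("[[AdvancedFormat]]\n\n", file = "Choiceblocks.txt")

# One Same/Different block per stimulus, appended to the file in turn
for (i in seq_len(nrow(stimuli))) {
  cat("[[Block:Block ", i, "]]\n\n",
      "[[Question:MC]]\n",
      "<img src=\"", stimuli[i, 1], "\" /><br />\n",
      "Are these two images the same or different?\n\n",
      "[[Choices]]\nSame\nDifferent\n\n",
      file = "Choiceblocks.txt", append = TRUE, sep = "")
}
```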

Once you’ve run the code and have the Choiceblocks.txt file, simply create a new survey in Qualtrics, click “Advanced Options” in the upper-right section of the screen, and then click “Import Survey.” It may take some time if you’ve generated a lot of blocks, but you should soon see your survey appear. If the page hangs for a while, click the red cancel button and then refresh – this often reveals a completed import.

This technique is highly modifiable. With some slight adjustments, for instance, you can generate blocks where the potential responses vary across questions, not just the stimuli. For example, you might want to generate a block of general knowledge questions, each with four possible answers. If you create a Stimuli.csv file with your questions in column one, and possible answers in columns two through five, then the following code will generate the desired survey blocks:

[Code screenshot: generating knowledge blocks]
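Again, a sketch rather than the original screenshot – this version assumes the question text sits in column one and the four answers in columns two through five of Stimuli.csv, and writes to Knowledgeblocks.txt (a file name I’ve picked here):

```r
stimuli <- read.csv("Stimuli.csv", header = FALSE, stringsAsFactors = FALSE)

cat("[[AdvancedFormat]]\n\n", file = "Knowledgeblocks.txt")

# One block per question; columns 2 to 5 hold that question's four answer options
for (i in seq_len(nrow(stimuli))) {
  cat("[[Block:Block ", i, "]]\n\n",
      "[[Question:MC]]\n",
      stimuli[i, 1], "\n\n",
      "[[Choices]]\n",
      paste(unlist(stimuli[i, 2:5]), collapse = "\n"), "\n\n",
      file = "Knowledgeblocks.txt", append = TRUE, sep = "")
}
```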

And if you wanted all your questions in a single block, just move the Block identifier into the first cat command (outside the for loop) as shown below. Here, I’ve added in ID notation for each question within the block, so it will be named according to its row in your spreadsheet.

[Code screenshot: generating knowledge questions in a single block]
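A sketch of that variation – the block name is a placeholder of mine, and each question gets an [[ID:]] tag based on its row number:

```r
stimuli <- read.csv("Stimuli.csv", header = FALSE, stringsAsFactors = FALSE)

# The block tag is written once, before the loop, so every question lands in one block
cat("[[AdvancedFormat]]\n\n[[Block:Knowledge block]]\n\n",
    file = "Knowledgeoneblock.txt")

for (i in seq_len(nrow(stimuli))) {
  cat("[[Question:MC]]\n",
      "[[ID:Q", i, "]]\n",   # each question is named after its row in the spreadsheet
      stimuli[i, 1], "\n\n",
      "[[Choices]]\n",
      paste(unlist(stimuli[i, 2:5]), collapse = "\n"), "\n\n",
      file = "Knowledgeoneblock.txt", append = TRUE, sep = "")
}
```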

There are a few things this method sadly can’t do – it can’t add randomization, alter survey flow, or add advanced questions (like timers). But I still find it quite a time saver. The Qualtrics page has tags and formatting for a variety of different questions that you can substitute into the code presented here, and the help files in RStudio should point you in the right direction if you find yourself needing to change things in a more advanced way.

All of the code in this post can be found at my Github account here.