In research we’re often advised to plot the variables we’re interested in to check for extreme right skew. Right skew is indicated when most of the data is clustered on the left side of the distribution, with a few extreme values out on the right. Common advice when encountering skewed data like this is to apply a log transform to bring it closer to normal. A standard log transform replaces each value of your original variable with its logarithm, here to base 10. This means that the values 10, 100, and 1000 in your original data become 1, 2, and 3 in the transformed data. Thus, where you may have had data clustered around 10, with outliers up to two orders of magnitude higher at 1000, the transformed data will mostly be clustered around 1, with a few 2’s and 3’s. The resulting distribution will be much closer to normal.
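To make the arithmetic concrete, here is a minimal sketch of the base-10 transform on the three values mentioned above. The variable names are just illustrative.

```python
import math

# The three illustrative values from the text.
raw = [10, 100, 1000]

# Base-10 log: every tenfold increase in the raw value adds 1.
logged = [math.log10(x) for x in raw]
print(logged)
```

Equal ratios in the raw data (10 to 100, 100 to 1000) become equal steps on the log scale, which is what pulls the long right tail back in.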
This procedure is advised because many statistical procedures assume that we’re working with normally distributed variables. But that’s a fairly abstract reason to transform a variable, and it can often feel unclear what it is you’re actually doing. A visual demonstration can make the reason for log transforms much more obvious.
One example of data in need of a log transform is the physical size of United States census tracts. Census tracts are areas defined by the Census Bureau, and are designed to contain between 1200 and 8000 individuals in order to provide the basic unit of aggregation for the Census. However, while the population of census tracts is constrained within a certain range, the physical areas they cover vary widely. High population density areas such as major cities contain many census tracts about the size of a city block, but some low population density areas like the desert stretches of Arizona contain census tracts covering thousands of square miles. We can see that the land area of census tracts has a right-skewed distribution:
And we can see that after a log transform, the data looks pretty close to normal. But it’s not immediately clear how that benefits us, besides the general knowledge that normal variables are good to have.
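If you want to check the same effect numerically rather than by eye, the sketch below measures skewness before and after the transform. It uses simulated log-normal values as a stand-in for tract land areas, not the actual census data, and the `skewness` helper is a hypothetical name defined here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in for tract land areas (NOT real census data).
# Log-normal values are right-skewed by construction.
area = rng.lognormal(mean=0.0, sigma=1.5, size=10_000)


def skewness(x):
    """Sample skewness: the mean cubed z-score (about 0 when symmetric)."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))


raw_skew = skewness(area)
log_skew = skewness(np.log10(area))
print(f"skewness before: {raw_skew:.2f}, after log10: {log_skew:.2f}")
```

A large positive value for the raw data and a value near zero after the transform is the numerical counterpart of the two histograms.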
To get a clearer demonstration of the benefits of log transforms, we need to look at the relationship between two right-skewed variables. Fortunately, the census tract data set we’re looking at also records the area of water in each census tract, a variable that is skewed in a similar fashion to the land area. If we plot the two against each other without transformations, we get the following plot, which is fairly unenlightening. It’s unclear whether there’s any pattern to the relationship between the two variables.
However, when we look at the relationship between the two variables after a log transform, the picture is very different.
For one thing, we can now clearly see the clusters of census tracts falling along one or the other axis: these are tracts that are pure land, or pure water. But, more importantly, we can now see a very clear relationship between the land area and water area of each census tract. Unsurprisingly, census tracts with larger land areas tend to have larger bodies of water within them, and the relationship is very clean. So why was this hidden before we transformed the data? It’s a matter of caring about relative differences rather than absolute ones.
Let’s say that, on average, each census tract has a water area 10% the size of its land area, usually made up of lakes. This is the linear relationship we’re looking for. But of course, there’s error in this relationship: some tracts will have lakes that are relatively smaller or larger, resulting in compositions of 8% water, or 12%, etc. The problem is, when a census tract the size of Texas deviates from expectations by 2 percentage points, that’s a difference of thousands of square miles of water, bigger than the total size of the majority of tracts. What ends up happening is that, as we move right on the graph, our data points deviate more and more in absolute terms from the ‘true’ relationship, despite having the same relative margin of error. This often results in a graph like the first one, where the true relationship between variables is obscured.
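The argument above can be sketched with simulated data. The numbers here are assumptions chosen for illustration (water is typically about 10% of land, with a relative error of roughly 20%), not estimates from the census data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Simulated, right-skewed land areas (NOT real census data).
land = rng.lognormal(mean=0.0, sigma=2.0, size=n)

# Water is typically about 10% of land, with a multiplicative
# (relative) error that has the same spread for every tract.
rel_error = rng.lognormal(mean=0.0, sigma=0.2, size=n)
water = 0.10 * land * rel_error

# Deviation from the 'true' 10% line, on the raw and log scales.
resid_abs = water - 0.10 * land
resid_log = np.log10(water) - np.log10(0.10 * land)

# Compare the spread of the deviations for small and large tracts.
small = land < np.median(land)
large = ~small
abs_small, abs_large = resid_abs[small].std(), resid_abs[large].std()
log_small, log_large = resid_log[small].std(), resid_log[large].std()

print(f"raw scale: small {abs_small:.3f}, large {abs_large:.3f}")
print(f"log scale: small {log_small:.3f}, large {log_large:.3f}")
```

On the raw scale the large tracts scatter far more widely around the line than the small ones, while on the log scale the scatter is about the same for both groups, which is why the relationship only becomes visible after the transform.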
It’s important to remember that a log transform makes the most sense when there is a good case to be made that the size of an observation determines what a meaningful change in that observation might be. For instance, for someone with 100 Twitter followers, gaining 100 extra followers is highly meaningful; the same might not be the case for someone with 10,000 followers. If an experimental stimulus is usually responded to in 100 ms, a 10 ms difference is likely meaningful; the same couldn’t be said for an anagram that takes 2 minutes to complete, on average.
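The follower example can be worked through in a few lines. The `log_change` helper is a hypothetical name introduced here, and the follower counts are the ones from the paragraph above.

```python
import math


def log_change(before, after):
    """Change on the log10 scale; depends only on the ratio after/before."""
    return math.log10(after) - math.log10(before)


small_account = log_change(100, 200)        # +100 followers, starting at 100
large_account = log_change(10_000, 10_100)  # +100 followers, starting at 10,000
print(f"{small_account:.3f} vs {large_account:.4f}")
```

The same absolute gain of 100 followers is a large step on the log scale for the small account (a doubling) and a tiny one for the large account (a 1% increase), matching the intuition about which change is meaningful.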
So there you go: a (hopefully) practical demonstration of why it’s useful to log-transform your skewed variables, and of the effect it actually has.