Update
See below for more information now that Ethan Brown has weighed in with some great code.
A recent post I came across on r-bloggers asked for input on visualizing ranked Likert-scale data.
I happen to be working on a substantial project using very similarly structured data, so I thought I would share some code. In my effort to be as generic as possible, I decided to generate some fake data from scratch. As I peeled away the context-specific layers of my nearing-production-level code, I ran into all kinds of trouble. So I apologize for the somewhat sloppy and unfinished code.¹
My preferred method for visualizing Likert-scale survey data is net stacked distribution graphs. These graphs have two major benefits. First, they immediately draw attention to how strongly respondents feel about a question, particularly when multiple questions are visualized at once: the total width of any bar equals the total number of respondents who gave a non-neutral answer. Second, these graphs make it very easy to distinguish positive from negative responses. In some cases it is critical to view the full distribution of responses to a question. Most of the time, however, it is informative enough simply to know how positive or negative the responses are. I find this is particularly true with 3-, 4-, and 5-point Likert scales, the most common I come across in education research.
Anyway, without further ado, some starter code for producing net stacked distribution graphs.
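A minimal sketch of that idea, assuming fake 5-point Likert data and ggplot2 (all names and numbers here are illustrative, not the original snippet):

```r
# Sketch only: fake 5-point Likert data and a basic net stacked
# distribution chart with ggplot2. Names and values are illustrative.
library(ggplot2)

set.seed(42)
likert <- c("Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Agree")
responses <- factor(sample(likert, 200, replace = TRUE), levels = likert)

# Percent of responses in each category, with the neutral category dropped
pct <- prop.table(table(responses)) * 100
plot_df <- data.frame(
  question = "Q1",
  response = names(pct),
  value    = as.numeric(pct)
)
plot_df <- plot_df[plot_df$response != "Neutral", ]
plot_df$response <- factor(plot_df$response, levels = likert)

# Negative responses plot to the left of zero
neg <- plot_df$response %in% c("Strongly Disagree", "Disagree")
plot_df$value[neg] <- -plot_df$value[neg]

p <- ggplot(plot_df, aes(x = question, y = value, fill = response)) +
  geom_col() +
  geom_hline(yintercept = 0) +
  coord_flip() +
  labs(x = NULL, y = "Percent of responses")
```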
And the results of all that?
UPDATE:
So now that Ethan has weighed in with his code, I thought I would add some things to make this post better reflect my production code. Below, I have included my comment on his blog as well as a copy of my current production code (which is definitely not sufficiently refactored for easy use across multiple projects). Again, excuse what I consider incomplete work on my part. I intend to refactor this code and eventually include it in my broader set of custom functions available across all of my projects. I suspect that along the way I will be "stealing" some of Ethan's ideas.
Comment
Hi Ethan! Super excited to see this post. This is exactly why I put up my code: so others could run with it. There are a few things you do here that I really like, some of which I had actually implemented in my own code and then removed in an attempt to be more scale-neutral.
For starters, in my actual production code I also separate out the positive and negative responses. In my code, I have a parameter called scaleName that allows me to switch between all of the scales available in my survey data. These include Strongly Disagree -> Strongly Agree (scaleName == 'sdsa'), Never -> Always (scaleName == 'na'), and even simple yes/no (scaleName == 'ny'). This is not ideal, because it requires 1) knowing all possible scales and doing some work in the function to treat them differently, and 2) including an additional parameter. However, because I use this work to analyze just a few surveys, the upfront work of including this as a parameter has made it very flexible in dealing with multiple scales. As a result, I do not need to require that the columns are ordered in any particular way, just that the titles match existing scales. So I have a long set of if / else if statements keyed on scaleName.
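A sketch of the shape of those if / else if branches (the level labels here are guesses based on the scales mentioned above, not the production values):

```r
# Sketch: map scaleName to the ordered response labels for that scale.
# The labels are illustrative; only the if / else if shape matters.
likert_levels <- function(scaleName) {
  if (scaleName == 'sdsa') {
    c('Strongly Disagree', 'Disagree', 'Neutral', 'Agree', 'Strongly Agree')
  } else if (scaleName == 'na') {
    c('Never', 'Rarely', 'Sometimes', 'Often', 'Always')
  } else if (scaleName == 'ny') {
    c('No', 'Yes')
  } else {
    stop('Unknown scaleName: ', scaleName)
  }
}
```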
This is actually really helpful for producing negative values, and for including scales in my function that do not have any negative values (so it can be used for general stacked charts instead of just net-stacked ones).
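The sign-flipping step might look roughly like this (the column names and values are made up for illustration):

```r
# Sketch: flip the sign of the negative-response columns so they plot
# to the left of zero. 'quest' is the data frame of percentages; for
# all-positive scales, neg_cols would simply be empty.
neg_cols <- c('Strongly Disagree', 'Disagree')
quest <- data.frame(
  `Strongly Disagree` = c(10, 5), Disagree = c(20, 15),
  Agree = c(40, 50), `Strongly Agree` = c(30, 30),
  check.names = FALSE
)
quest[neg_cols] <- -quest[neg_cols]
```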
(Recall that quest is what I call the data frame; it is equivalent to x in your code.)
Another neat trick I have instituted is dynamic x-axis limits, rather than always going from -100 to 100. I generally like to keep my scales representing the full logical range of the data (0 to 100 for percentages, etc.), so I might consider this a manipulation. However, after getting many charts with stubby centers, I found I was not really seeing sufficient variation by sticking to my -100 to 100 setup. So I added a small computation of dynamic limits.
It determines the value furthest from 0 in either direction across the data frame. (I have to admit, this code looks a bit like magic reading it back; my comments are actually quite helpful.)
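A base-R sketch of that furthest-from-zero computation (the quest values here are made up):

```r
# Sketch: find the value furthest from 0 in either direction across
# the (already sign-flipped) data frame of percentages, so the axis
# limits can hug the data instead of always spanning -100 to 100.
quest <- data.frame(sd = c(-10, -5), d = c(-20, -15),
                    a = c(40, 50), sa = c(30, 30))

# largest absolute value anywhere in the data frame
x_limits <- max(abs(as.matrix(quest)))
```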
To make this more generalizable (my production code always compares two bars at once), it would be fairly trivial to loop over all the rows (or use the apply functions, which I am still trying to get the hang of).
I then pad the x_limits value by some percent inside the limits attribute.
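Concretely, that padding might look like this (the 10% factor and names are illustrative, not the production values):

```r
# Sketch: pad the furthest-from-zero value by some percent before
# handing it to scale_x_continuous(limits = ...). The 10% padding
# factor is an arbitrary choice here.
x_limits <- 50     # e.g. the furthest value from 0 computed earlier
pad      <- 1.10   # 10% padding
limits   <- c(-x_limits, x_limits) * pad
# ... + scale_x_continuous(limits = limits)
```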
In my production code I also have the scale_fill_manual attribute added separately to the ggplot object. Rather than adding it after the fact at the point of rendering, though, I include it in my function, again set by scaleName. I think the best organization is probably a separate function that makes it easy to select the color scheme you want and apply it, so that your final call could be something like colorNetStacked(net_stacked(x), 'blues').
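A hedged sketch of what such a colorNetStacked() helper could look like (the palette lookup and scheme names are assumptions, not production code):

```r
# Sketch: take a finished plot and apply a named color scheme.
# The scheme-to-palette mapping here is purely illustrative.
library(ggplot2)
library(RColorBrewer)

colorNetStacked <- function(plot, scheme) {
  palette_name <- switch(scheme,
                         blues  = 'Blues',
                         greens = 'Greens',
                         stop('Unknown scheme: ', scheme))
  plot + scale_fill_manual(values = brewer.pal(name = palette_name, n = 7)[3:7])
}
# usage: colorNetStacked(net_stacked(x), 'blues')
```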
My actual final return is the ggplot object with the scale_fill_manual layer added, where colors is set by a line like colors <- brewer.pal(name = 'Blues', n = 7)[3:7].
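Pieced together, that return might look roughly like the following (a sketch, not the actual production line; the plot data is made up):

```r
# Sketch: the kind of final return described above, with a manual
# Brewer palette. 'p' stands in for the ggplot object built earlier.
library(ggplot2)
library(RColorBrewer)

colors <- brewer.pal(name = 'Blues', n = 7)[3:7]

p <- ggplot(data.frame(question = 'Q1',
                       value    = c(-20, 60),
                       response = c('Disagree', 'Agree')),
            aes(x = question, y = value, fill = response)) +
  geom_col() +
  coord_flip()

final <- p + scale_fill_manual(values = colors)
```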
Seriously though, I am super excited you found my post and thought it was useful and improved what I presented!
Current production code:
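As a stand-in for the full listing, here is an illustrative end-to-end sketch that assembles the pieces described above: scaleName handling, sign-flipping of negative responses, dynamic padded limits, and a Brewer palette. It is emphatically not the actual production code; every name and constant is an assumption.

```r
# Illustrative sketch only. Assumes 'quest' holds percentages, one row
# per question, with columns named by response label.
library(ggplot2)
library(RColorBrewer)

net_stacked <- function(quest, scaleName = 'sdsa') {
  # Response labels and which of them count as "negative" (illustrative)
  if (scaleName == 'sdsa') {
    lvls <- c('Strongly Disagree', 'Disagree', 'Agree', 'Strongly Agree')
    neg  <- c('Strongly Disagree', 'Disagree')
  } else if (scaleName == 'ny') {
    lvls <- c('No', 'Yes')
    neg  <- 'No'
  } else {
    stop('Unknown scaleName: ', scaleName)
  }

  quest <- quest[lvls]        # select columns by name, in scale order
  quest[neg] <- -quest[neg]   # negative responses plot left of zero

  # Long format for ggplot (column-major unlist matches rep(..., each))
  long <- data.frame(
    question = rep(rownames(quest), times = length(lvls)),
    response = factor(rep(lvls, each = nrow(quest)), levels = lvls),
    value    = unlist(quest, use.names = FALSE)
  )

  # Dynamic, padded limits based on the widest stacked bar
  pos_max  <- max(rowSums(quest[setdiff(lvls, neg)]))
  neg_min  <- min(rowSums(quest[neg]))
  x_limits <- max(abs(c(pos_max, neg_min))) * 1.1

  ggplot(long, aes(x = question, y = value, fill = response)) +
    geom_col() +
    geom_hline(yintercept = 0) +
    coord_flip() +
    scale_y_continuous(limits = c(-x_limits, x_limits)) +
    scale_fill_manual(values = brewer.pal(name = 'Blues',
                                          n = 7)[3:(2 + length(lvls))]) +
    labs(x = NULL, y = 'Percent of responses')
}
```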
¹ Mainly, I would like to abstract this code further. I am only about halfway to ensuring that I can use Likert-scale data of any size. I would also like to take in more than one question simultaneously with the visualize function. The latter is already possible in my production code and is particularly high-impact for these kinds of graphics. ↩︎