Wonderful and concise presentation, I really enjoyed the amount of effort that went into helping the viewer follow along and this helped me with my term project. Thanks!
I'm glad the video was helpful.
Hi Alan, can you explain how to include the values in your box plot? That is, it would be nice to see the actual values for the median, Q1-Q4...
+Marvin Aliaga Marvin, I always like to look at these numbers. My go-to code for doing this is shown in this example:
sysuse auto, clear
tabstat mpg, statistics(n mean sd min max q)
Now, if you mean how can these numbers be included in the visualization, that is a different answer.
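If it helps, here is a minimal sketch of the same idea broken out by group. It assumes the auto data and uses rep78 purely as a stand-in grouping variable; the particular list of statistics is just one reasonable choice:

```stata
sysuse auto, clear
* n, mean, sd, min, the three quartiles, and max of mpg for each rep78 group
tabstat mpg, by(rep78) statistics(n mean sd min p25 p50 p75 max) columns(statistics)
```

Here p25, p50, and p75 are the same three quartiles that the q shorthand requests.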
+Alan Neustadtl Yes, I meant how these numbers can be included in the visualization, at least the mean. Thank you!
One solution is to calculate a statistic, say a mean, for each group in your boxplot. Then, you can display them on the plot. An additional complication is that technically, the boxplot does not have an X-axis so you cannot use the added text option. Here is a solution that puts the means into a legend:
sysuse auto, clear
foreach num of numlist 1(1)5 {
quietly summarize mpg if rep78==`num'
local mean`num' : di %4.2f =`r(mean)'
}
#d ;
graph box mpg, over(rep78)
legend(on order(- "1=`mean1'"
"2=`mean2'"
"3=`mean3'"
"4=`mean4'"
"5=`mean5'")
title("Means") pos(11) ring(0));
#d cr
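Another possibility, sketched under the same assumptions (auto data, rep78 groups 1 through 5), is to fold each mean into its category label with the relabel() suboption of over(). The macro names m and relabel are only illustrative:

```stata
sysuse auto, clear
* Start with an empty list of relabel() arguments
local relabel
forvalues i = 1/5 {
    quietly summarize mpg if rep78 == `i'
    * Formatted mean of mpg for this group
    local m : display %4.1f r(mean)
    * Append the box position followed by its new label text
    local relabel `relabel' `i' "`i' (mean `m')"
}
graph box mpg, over(rep78, relabel(`relabel'))
```

Note that relabel() refers to the position of each box, not the value of rep78. The two happen to coincide here because rep78 takes the values 1 to 5 with none missing in between.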
very helpful!
Great video there, thanks. How do I include a bar for the mean please?
George, Stata does not provide that capability, since boxplots are built from quantile-based location measures, that is, the median (2nd quartile) rather than the mean. But you can produce an acceptable result using a user-written program called stripplot. Use the command "findit stripplot" to find and install the program. Here is some Stata code to get you going on your problem:
sysuse auto, clear
egen mean = mean(mpg) , by(foreign)
gen foreign2 = foreign + 0.3
stripplot mpg, over(foreign) stack height(0.2) ///
box(barw(0.15)) boffset(0.3) vertical ///
addplot(scatter mean foreign2, ms(D))
egen loq = pctile(mpg), p(25) by(foreign)
egen upq = pctile(mpg), p(75) by(foreign)
stripplot mpg, over(foreign) ///
box(barw(0.15)) vertical ms(none) ///
addplot(scatter mean foreign, ms(D) || ///
scatter mpg foreign if !inrange(mpg, loq, upq))
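Because stripplot is user-written, the code above stops with an unrecognized-command error until the program is installed. A small guard like the following could go at the top of the do-file. It assumes you have internet access and that the package is available from the SSC archive:

```stata
* Install stripplot from SSC only if Stata cannot already find it
capture which stripplot
if _rc ssc install stripplot
```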
Thanks very much.
What is the best way to make a 5-variable box plot on a 5-year panel?
Hi Roseani, It is difficult to know exactly what comparisons are important for you to highlight. One method would be to use the "by()" option to graph box. Maybe something like this:
sysuse auto, clear
graph box length displacement, by(rep78)
In this example, length and displacement are two (of your five) variables and rep78 stands in for the variable that identifies your panels. This produces one graph for each value of rep78 (or for each panel in your case). You can reorganize the graph to put all the graphs in a single column or row, which might help.
This might get you a little further along in solving your problem.
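To illustrate the reorganizing step, here is a rough sketch with the same stand-in variables. The rows() and cols() suboptions of by() control the layout; the graph names byrow and bycol are arbitrary:

```stata
sysuse auto, clear
* All panels side by side in a single row
graph box length displacement, by(rep78, rows(1)) name(byrow, replace)
* All panels stacked in a single column
graph box length displacement, by(rep78, cols(1)) name(bycol, replace)
```

If separate panels make the comparison hard to read, using over(rep78) in place of by(rep78) draws every group on one shared axis instead.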
Hi Alan, could you share the link to the code and data set you used in your illustrations? Thanks.
Hi Abdulahad, you can find the latest General Social Survey at gss.norc.org/get-the-data. When you download the data, you can save some space by compressing it using the Stata command compress.
I don't have an archive of the Stata code, so pause the videos and copy from there.
Best,
Alan
How do I determine the order of the boxes in the box plot?
Hi Roseani, there are a few ways. What I typically do is recode my "over" variable into the order I want. For example, 1=married, 2=widowed, 3=separated, 4=divorced, and 5=never married could be recoded to be 1=married, 2=never married, 3=divorced, 4=separated, and 5=widowed. The boxes will be drawn in order from low to high, 1 to 5. Alternatively, you can create a new variable that defines the order and specify it as an option to "over". Finally, there is a sort option that, while limited, may be exactly what you want. These last two are suboptions of the "over" option, and you can find help by typing the following into the command window: help graph box##over_subopts.
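A hedged sketch of those two suboptions, using the auto data as a stand-in for the marital-status example (myorder is a made-up helper variable, and the graph names are arbitrary):

```stata
sysuse auto, clear
* sort(varname): draw the boxes in the order given by a helper variable,
* here reversing the groups so that rep78 == 5 comes first
recode rep78 (1=5) (2=4) (3=3) (4=2) (5=1), generate(myorder)
graph box mpg, over(rep78, sort(myorder)) name(custom, replace)
* sort(1): order the boxes by the median of the first y variable;
* descending flips that order
graph box mpg, over(rep78, sort(1) descending) name(bymedian, replace)
```

The helper variable leaves the original categories and their labels untouched, which can be convenient if the recoded order is only needed for one graph.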
/*
Using Stata to Create Boxplots - The middle part
*/
/* Basic boxplot examples */
graph box realrinc, name(box1, replace)
graph box realrinc, nooutsides name(box1, replace) // Box regions cover the middle 50%.
graph box realrinc, over(marital) name(box3, replace) // One box plot for each category of marital status.
graph box realrinc, nooutsides over(marital) name(box3, replace)
/* Using multiple -over()- options */
graph hbox agewed, over(divorce) over(sex) name(box1, replace) nooutsides // hbox: horizontal box plot
graph hbox agewed, over(sex) over(divorce) name(box2, replace) nooutsides // Choose the graph that best fits your story.
graph hbox agewed, over(sex) over(divorce) name(box2, replace) nooutsides asy // asy = asyvars: treat the groups of the first over() as if they were y variables.
/* Relabeling */
#delimit ;
graph hbox agewed, over(sex, relabel(1 "Men" 2 "Women")) /* Put -relabel- within -over-. */
over(divorce, relabel(1 "Divorced" 2 "Not divorced"))
nooutsides
name(box4, replace);
#delimit cr
/* Example of a polished graphic */
#delimit ;
graph hbox agewed, over(sex, relabel(1 "Men" 2 "Women"))
over(divorce, relabel(1 "Divorced" 2 "Not divorced") axis(noline))
nooutsides asy
title("Age When First Married by Gender and Ever Divorced", span)
ytitle("Age")
note("Source: General Social Survey, 2006")
ymtick(10(5)40)
legend(col(1) ring(0) position(1))
graphregion(color(white))
name(box4, replace);
#delimit cr