TL;DR: A proposed further adaptation, the violin plot, pools the best statistical features of alternative graphical representations of batches of data and adds the information available from local density estimates to the basic summary statistics inherent in box plots.
Abstract: Many modifications build on Tukey's original box plot. A proposed further adaptation, the violin plot, pools the best statistical features of alternative graphical representations of batches of data. It adds the information available from local density estimates to the basic summary statistics inherent in box plots. This marriage of summary statistics and density shape into a single plot provides a useful tool for data analysis and exploration.
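The two ingredients the abstract combines, box-plot summary statistics and a local density estimate, can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation. The function names, the sample batch, the Gaussian kernel, and the bandwidth of 0.3 are all assumptions chosen for the example.

```python
import math
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the statistics a box plot draws."""
    data = sorted(values)
    q1, med, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, med, q3, data[-1]


def gaussian_kde(values, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate at each grid point.

    The kernel and the fixed bandwidth are assumptions of this sketch.
    """
    norm = 1.0 / (len(values) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
        for x in grid
    ]


# Hypothetical bimodal batch: two clusters, near 1 and near 5.
batch = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]

summary = five_number_summary(batch)
grid = [i * 0.5 for i in range(13)]  # evaluation points 0.0, 0.5, ..., 6.0
density = gaussian_kde(batch, grid, bandwidth=0.3)

# A violin plot would draw `density` mirrored about a central axis and
# overlay the box-plot statistics held in `summary`.
for x, d in zip(grid, density):
    print(f"{x:4.1f}  {'#' * round(40 * d)}")
```

In this batch the median falls in the gap between the two clusters, where the estimated density is close to zero. The summary statistics alone would not reveal that, which is the case for adding the density trace.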
TL;DR: An open-source application, called BoxPlotR, and an associated web portal that allow rapid generation of customized box plots, which represent both the summary statistics and the distribution of the primary data in biomedical research.
Abstract: To the Editor
In biomedical research, it is often necessary to compare multiple data sets with different distributions. The bar plot, or histogram, is typically used to compare data sets on the basis of simple statistical measures, usually the mean with s.d. or s.e.m. However, summary statistics alone may fail to convey underlying differences in the structure of the primary data (Fig. 1a), which may in turn lead to erroneous conclusions. The box plot, also known as the box-and-whisker plot, represents both the summary statistics and the distribution of the primary data. The box plot thus enables visualization of the minimum, lower quartile, median, upper quartile and maximum of any data set (Fig. 1b). The first documented description of a box plot–like graph by Spear (ref. 1) defined a range bar to show the median and interquartile range (IQR, or middle 50%) of a data set, with whiskers extended to minimum and maximum values. The most common implementation of the box plot, as defined by Tukey (ref. 2), has a box that represents the IQR, with whiskers that extend 1.5 times the IQR from the box edges; it also allows for identification of outliers in the data set. Whiskers can also be defined to span the 95% central range of the data (ref. 3). Other variations, including bean plots (ref. 4) and violin plots, reveal additional details of the data distribution. These latter variants are less statistically informative but allow better visualization of the data distribution, such as bimodality (Fig. 1b), that may be hidden in a standard box plot.
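The three whisker definitions described above can be compared with a short sketch. It uses only the Python standard library and is not BoxPlotR's code. The function names and the sample data are hypothetical, the quartile method ("inclusive" interpolation) is an assumption, and the Tukey whiskers are drawn to the most extreme data points inside the 1.5 × IQR fences, a common convention that the text does not spell out.

```python
import statistics


def quartiles(data):
    """Return (Q1, median, Q3) using inclusive interpolation (an assumption)."""
    return tuple(statistics.quantiles(sorted(data), n=4, method="inclusive"))


def whiskers_min_max(data):
    """Range-bar style: whiskers reach the minimum and maximum values."""
    return min(data), max(data)


def whiskers_tukey(data):
    """Tukey style: whiskers stay within 1.5 x IQR of the box edges.

    Returns (lower whisker, upper whisker, outliers beyond the fences).
    """
    q1, _, q3 = quartiles(data)
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in data if lo_fence <= x <= hi_fence]
    outliers = [x for x in data if x < lo_fence or x > hi_fence]
    return min(inside), max(inside), outliers


def whiskers_central_95(data):
    """Whiskers span the central 95%: the 2.5th and 97.5th percentiles."""
    cuts = statistics.quantiles(sorted(data), n=40, method="inclusive")
    return cuts[0], cuts[-1]


# Hypothetical data set with one extreme value.
sample = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0]

print("min/max    :", whiskers_min_max(sample))
print("Tukey      :", whiskers_tukey(sample))
print("central 95%:", whiskers_central_95(sample))
```

On this sample only the Tukey rule separates the value 30.0 from the rest as an outlier. The min/max rule stretches the upper whisker to include it.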
Figure 1: Data visualization with box plots.
Despite the obvious advantages of the box plot for simultaneous representation of data set and statistical parameters, this method is not in common use, in part because few available software tools allow the facile generation of box plots. For example, the standard spreadsheet tool Excel is unable to generate box plots. Here we describe an open-source application, called BoxPlotR, and an associated web portal that allow rapid generation of customized box plots. A user-defined data matrix is uploaded as a file or pasted directly into the application to generate a basic box plot with options for additional features. Sample size may be represented by the width of each box in proportion to the square root of the number of observations (ref. 5). Whiskers may be defined according to the criteria of Spear (ref. 1), Tukey (ref. 2) or Altman (ref. 3). The underlying data distribution may be visualized as a violin or bean plot or, alternatively, the actual data may be displayed as overlapping or nonoverlapping points. The 95% confidence interval that two medians are different may be illustrated as notches defined as ±(1.58 × IQR/√n) (ref. 5). There is also an option to plot the sample means and their confidence intervals. More complex statistical comparisons may be required to ascertain significance according to the specific experimental design (ref. 6). The output plots may be labeled; customized by color, dimensions and orientation; and exported as publication-quality .eps, .pdf or .svg files. To help ensure that generated plots are accurately described in publications, the application generates a description of the plot for incorporation into a figure legend.
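The notch formula and the square-root width rule mentioned above can be written out as a small sketch. It uses only the Python standard library and is not taken from BoxPlotR. The function names, the sample groups, the quartile method, and the choice to scale widths relative to the largest group are assumptions made for illustration.

```python
import math
import statistics


def notch_interval(data):
    """Return the notch limits, median -/+ 1.58 * IQR / sqrt(n)."""
    q1, med, q3 = statistics.quantiles(sorted(data), n=4, method="inclusive")
    half_width = 1.58 * (q3 - q1) / math.sqrt(len(data))
    return med - half_width, med + half_width


def notches_overlap(a, b):
    """True if the notch intervals of two groups overlap."""
    lo_a, hi_a = notch_interval(a)
    lo_b, hi_b = notch_interval(b)
    return lo_a <= hi_b and lo_b <= hi_a


def relative_box_widths(groups, max_width=1.0):
    """Box widths proportional to sqrt(n), scaled so the largest is max_width."""
    roots = [math.sqrt(len(g)) for g in groups]
    largest = max(roots)
    return [max_width * r / largest for r in roots]


# Hypothetical groups of observations.
group_a = [float(x) for x in range(1, 10)]    # 1.0 .. 9.0
group_b = [float(x) for x in range(11, 20)]   # 11.0 .. 19.0

print(notch_interval(group_a))
print(notches_overlap(group_a, group_b))
print(relative_box_widths([[0.0] * 4, [0.0] * 16]))
```

Non-overlapping notches are usually read as informal evidence that two medians differ. As the text notes, a formal comparison may still be needed, depending on the experimental design.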
The interactive web application is written in R (ref. 7) with the R packages shiny, beanplot (ref. 4), vioplot, beeswarm and RColorBrewer, and it is hosted on a shiny server to allow for interactive data analysis. User data are held only temporarily and discarded as soon as the session terminates. BoxPlotR is available at http://boxplot.tyerslab.com/ and may be downloaded to run locally or as a virtual machine for VMware and VirtualBox.
TL;DR: A new visualization tool called the mirrored density plot (MD plot), which is specifically designed to discover interesting structures in continuous features, is proposed, and it is shown that when statistical testing poses a great difficulty, only the MD plots can identify the structure of their PDFs.
Abstract: One aim of data mining is the identification of interesting structures in data. For better analytical results, the basic properties of an empirical distribution, such as skewness and eventual clipping, i.e. hard limits in value ranges, need to be assessed. Of particular interest is the question of whether the data originate from one process or contain subsets related to different states of the data producing process. Data visualization tools should deliver a clear picture of the univariate probability density distribution (PDF) for each feature. Visualization tools for PDFs typically use kernel density estimates and include both the classical histogram, as well as the modern tools like ridgeline plots, bean plots and violin plots. If density estimation parameters remain in a default setting, conventional methods pose several problems when visualizing the PDF of uniform, multimodal, skewed distributions and distributions with clipped data. For that reason, a new visualization tool called the mirrored density plot (MD plot), which is specifically designed to discover interesting structures in continuous features, is proposed. The MD plot does not require adjusting any parameters of density estimation, which is what may make the use of this plot compelling particularly to non-experts. The visualization tools in question are evaluated against statistical tests with regard to typical challenges of explorative distribution analysis. The results of the evaluation are presented using bimodal Gaussian, skewed distributions and several features with already published PDFs. In an exploratory data analysis of 12 features describing quarterly financial statements, when statistical testing poses a great difficulty, only the MD plots can identify the structure of their PDFs. In sum, the MD plot outperforms the above mentioned methods.
TL;DR: This note introduces the rangefinder box plot, an extension of the familiar box plot.
Abstract: This note introduces the rangefinder box plot, an extension of the familiar box plot.
TL;DR: The main advantage of the BLiP plot is that it provides users with basic graphical elements in a friendly and flexible environment so that users can, according to their needs, construct anything from a simple, standard plot to a complex, customized plot to best present their data.
Abstract: A versatile graphical tool, the BLiP plot, was developed for displaying one-dimensional data. The basic building blocks are boxes, lines, and points. Like many standard one-dimensional distribution plots, the BLiP plot is capable of displaying individual data values in points or lines and grouped information in lines or boxes. In addition, the BLiP plot includes many new features such as variable-width plots and several choices of point patterns. The main advantage of the BLiP plot is that it provides users with basic graphical elements in a friendly and flexible environment so that users can, according to their needs, construct anything from a simple, standard plot to a complex, customized plot to best present their data.