Here C and D are improved in that they look very similar, but if they were truly accurate plots they would be identical, since the distributions are identical. If you compare the two plots closely, you can still see a few locations with oversaturation, a problem that will occur when more than 10 points overlap. In this example the oversaturated points are located near the middle of the plot, but the only way to know whether they are there would be to plot both versions and compare, or to examine the pixel values to see if any have reached full saturation (a necessary but not sufficient condition for oversaturation). Locations where saturation has been reached have problems similar to overplotting, because only the last 10 points plotted will affect the final color (for alpha of 0.1).

Worse, even if one has set the alpha value to approximately or usually avoid oversaturation, as in the plot above, the correct value depends on the dataset. If there are more points overlapping in a particular region, a manually adjusted alpha setting that worked well for a previous dataset will systematically misrepresent the new dataset:

    import numpy as np
    import holoviews as hv
    from holoviews import opts
    # Assumes the matplotlib plotting backend has already been loaded,
    # e.g. with hv.extension('matplotlib'), since `s` is a matplotlib option.

    # NOTE: the default value of `specs` and the function body did not survive
    # extraction; the values and body below are assumptions chosen to be
    # consistent with the docstring.
    def gaussians(specs=((1.5, 0, 1.0), (-1.5, 0, 1.0)), num=100):
        """
        A concatenated list of points taken from 2D Gaussian distributions.
        Each distribution is specified as a tuple (x,y,s), where x,y is the mean
        and s is the standard deviation. Defaults to two horizontally
        offset unit-mean Gaussians.
        """
        np.random.seed(1)  # assumed: fixed seed so repeated calls are reproducible
        dists = [(np.random.normal(x, s, num), np.random.normal(y, s, num))
                 for x, y, s in specs]
        return np.hstack([d[0] for d in dists]), np.hstack([d[1] for d in dists])

    # Four panels: two dataset sizes, each drawn with two dot settings
    points = (hv.Points(gaussians(num=600),   label="600 points",   group="Small dots") +
              hv.Points(gaussians(num=60000), label="60000 points", group="Small dots") +
              hv.Points(gaussians(num=600),   label="600 points",   group="Tiny dots") +
              hv.Points(gaussians(num=60000), label="60000 points", group="Tiny dots"))

    points.opts(
        opts.Points('Small_dots', s=1, alpha=1),
        # The second option group was truncated in the source; these values are
        # inferred from the prose ("10 times smaller dots, alpha 0.1").
        opts.Points('Tiny_dots', s=0.1, alpha=0.1))

Just as shown for the multiple-category case above, finding settings to avoid overplotting and oversaturation is difficult. The “Small dots” setting (size 0.1, full alpha) works fairly well for a sample of 600 points A, but it has serious overplotting issues for larger datasets, obscuring the shape and density of the distribution B. Using the “Tiny dots” setting (10 times smaller dots, alpha 0.1) works well for the larger dataset D, but not at all for the 600-point dataset C. Clearly, not all of these settings are accurately conveying the underlying distribution, as they all appear quite different from one another. Similar problems occur for the same size of dataset, but with greater or lesser levels of overlap between points, which of course varies with every new dataset.

In any case, as dataset size increases, at some point plotting a full scatterplot like any of these will become impractical with current plotting software. At this point, people often simply subsample their dataset, plotting 10,000 or perhaps 100,000 randomly selected datapoints. But as panel A shows, the shape of an undersampled distribution can be very difficult or impossible to make out, leading to incorrect conclusions about the distribution. Such problems can occur even when taking very large numbers of samples, if examining sparsely populated regions of the space, which will approximate panel A for some plot settings and panel C for others. The actual shape of the distribution is only visible if sufficient datapoints are available in that region and appropriate plot settings are used, as in D, but ensuring that both conditions are true is a quite difficult process of trial and error, making it very likely that important features of the dataset will be missed.

To avoid undersampling large datasets, researchers often use 2D histograms visualized as heatmaps, rather than scatterplots showing individual points. A heatmap has a fixed-size grid regardless of the dataset size, so it can make use of all the data. Heatmaps effectively approximate a probability density function over the specified space, with coarser heatmaps averaging out noise or irrelevant variations to reveal an underlying distribution, and finer heatmaps able to represent more details in the distribution.

Let’s look at some heatmaps with different numbers of bins for the same two-Gaussians distribution. As you can see, a too-coarse binning grid A cannot represent this distribution faithfully, but with enough bins C, the heatmap will approximate a tiny-dot scatterplot like plot D in the previous figure.
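To make the effect of bin count on a heatmap concrete, here is a minimal sketch that bins one two-Gaussians sample on grids of different resolution using NumPy's `np.histogram2d`. It is an illustration under stated assumptions, not the code behind the figures: the helper names `two_gaussians` and `heatmap_counts`, the means of ±1.5, the unit standard deviation, the plotting range, and the three bin counts are all chosen here purely for demonstration.

```python
import numpy as np

def two_gaussians(num=60000, offset=1.5, std=1.0, seed=1):
    """Hypothetical helper: sample two horizontally offset 2D Gaussians."""
    rng = np.random.default_rng(seed)
    x = np.concatenate([rng.normal(offset, std, num),
                        rng.normal(-offset, std, num)])
    y = rng.normal(0.0, std, 2 * num)
    return x, y

def heatmap_counts(x, y, bins, extent=(-5.0, 5.0)):
    """Count points on a fixed bins-by-bins grid covering `extent` on both axes."""
    counts, _, _ = np.histogram2d(x, y, bins=bins, range=[extent, extent])
    return counts

x, y = two_gaussians()

# Assumed bin counts: a coarse, a medium, and a fine grid over the same data
grids = {bins: heatmap_counts(x, y, bins) for bins in (5, 20, 200)}

for bins, counts in grids.items():
    # The grid shape depends only on `bins`, never on the number of points,
    # and every point inside the extent contributes to exactly one cell.
    print(bins, counts.shape, int(counts.sum()), int(counts.max()))
```

Each grid accounts for the same in-range points, so nothing is discarded as the dataset grows. What changes is the trade-off: the coarse grid averages over large cells and loses detail, while the fine grid resolves more structure but spreads the points over many more cells, so each cell's count is noisier.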