Data Visualization
In one of my previous posts, I highlighted the power of data visualization; it is essential to understanding data. A single post is not enough to convey how necessary data visualization is, from data exploration and analysis through to the point where a model trained on the data is operational, and in post-production as well.
Quoting from that post:
“.. a scatter plot helps us understand which measure to utilize for the data. The plot helps identify linear, nonlinear relationships between variables and spot outliers (if any) which may influence the correlation.
While a box plot visually represents inter-quartile range of data, a violin plot shows the shape (density distribution) of data. A violin plot must be used to explore skewed data.”
A histogram shows frequencies as bars over binned data. It is useful for a detailed view of a single distribution. A histogram with a fitted kernel density estimate (KDE) is shown above.
A boxplot summarizes data with quartiles (median, IQR) in a box and outliers outside it. It is useful for simple, clear comparisons of central tendencies across multiple group samples.
A violin plot uses a KDE (a smoothed histogram) to show probability density, revealing the underlying density and shape of the data. This makes it excellent for comparing distributions across multiple groups and for showing nuances missed by a histogram or a box plot.
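To make the comparison of the three plot types concrete, here is a minimal sketch that draws one skewed sample as a histogram, a box plot, and a violin plot side by side. The sample is synthetic (a log-normal draw with parameters I picked purely for illustration), and matplotlib is just one possible choice of plotting library.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script also runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic, right-skewed sample (assumed parameters, for illustration only)
rng = np.random.default_rng(42)
sample = rng.lognormal(mean=0.0, sigma=0.6, size=500)

fig, (ax_hist, ax_box, ax_violin) = plt.subplots(1, 3, figsize=(12, 4))

# Histogram: frequency of binned values
ax_hist.hist(sample, bins=30)
ax_hist.set_title("Histogram")

# Box plot: median, quartiles, and outliers beyond the whiskers
ax_box.boxplot(sample)
ax_box.set_title("Box plot")

# Violin plot: KDE-based density shape, with the median marked
ax_violin.violinplot(sample, showmedians=True)
ax_violin.set_title("Violin plot")

fig.tight_layout()
```

Calling fig.savefig("distribution_views.png") afterwards writes the figure to disk. On skewed data like this, the long right tail appears as a run of outlier points in the box plot and as a thin, stretched neck in the violin plot.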
Anscombe’s quartet illustrates the importance of graphing data before analysing it. A visual explanation can be mightier than explaining data through statistics: the quartet shows why we should look at a dataset graphically and not rely solely on its summary statistics.
The quartet comprises four datasets with nearly identical descriptive statistics, yet four very different distributions when plotted and visualized. It only reiterates how powerful exploratory data analysis by visualization is.
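The near-identical statistics can be checked directly. The sketch below hard-codes the published values of Anscombe’s quartet and computes the mean, sample variance, correlation, and least-squares line for each dataset with NumPy. The variable names (quartet, summary) are my own.

```python
import numpy as np

# Anscombe's quartet: datasets I-III share the same x values
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

summary = {}
for name, (x, y) in quartet.items():
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Least-squares line y = slope * x + intercept
    slope, intercept = np.polyfit(x, y, deg=1)
    summary[name] = {
        "mean_x": x.mean(),
        "mean_y": y.mean(),
        "var_x": x.var(ddof=1),  # sample variance
        "var_y": y.var(ddof=1),
        "corr": np.corrcoef(x, y)[0, 1],
        "slope": slope,
        "intercept": intercept,
    }

for name, stats in summary.items():
    print(name, {key: round(float(value), 2) for key, value in stats.items()})
```

The printed rows agree closely across all four datasets, which is exactly why the numbers alone cannot tell them apart and a scatter plot of each dataset can.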
When it comes to visualizing data, less is more. Keeping only what is necessary in the visual not only helps the developer understand what information the data yields, but also helps the audience (the business user) draw meaning from the visual.
There are caveats to data visualization as well. A collection of such caveats is nicely captured by Yan Holtz. A few of my favorites are “too many distributions in one chart”, “annotate your chart”, and “Simpson’s paradox”.
Simpson’s paradox is a statistical phenomenon in which a trend appears in several different groups of data when graphed but disappears or reverses when these groups are combined.
It is fascinating how there are different ways to look at the same dataset. Mathematician Jordan Ellenberg said that the lesson of Simpson’s paradox “is not really to tell us which viewpoint to take, but to insist that we keep both the parts and the whole in mind at once”.
Look at the scatterplot of the whole data, then look at the same plot for the 3 groups of the dataset. We understand that the positive correlation is due to a difference between groups. In fact, the correlation is negative when each group is considered separately.
The fact that the trend between two different variables reverses when a third variable is included is Simpson’s paradox.
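The reversal is easy to reproduce numerically. The following sketch builds a small synthetic dataset (not the one in the plot discussed here; the group centers and offsets are assumptions chosen for illustration) in which y falls as x rises inside every group, while the group centers themselves rise together.

```python
import numpy as np

# Within each group, y decreases as x increases (slope of -1 around the group center)
offsets = np.linspace(-1.0, 1.0, 9)

# Group centers rise together, which creates the positive trend in the pooled data
centers = [2.0, 5.0, 8.0]
groups = {
    f"group_{i + 1}": (center + offsets, center - offsets)
    for i, center in enumerate(centers)
}

# Correlation over the pooled data (the "whole")
all_x = np.concatenate([x for x, _ in groups.values()])
all_y = np.concatenate([y for _, y in groups.values()])
overall_corr = np.corrcoef(all_x, all_y)[0, 1]

# Correlation inside each group (the "parts")
within_corr = {name: np.corrcoef(x, y)[0, 1] for name, (x, y) in groups.items()}

print("overall:", round(float(overall_corr), 2))
for name, r in within_corr.items():
    print(name, round(float(r), 2))
```

The pooled correlation comes out positive while every within-group correlation is negative, so the sign of the trend depends on whether the grouping variable is taken into account.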
We have Anscombe’s quintet in 2025. Thanks to Carl McBride Ellis for his work on these datasets. The new, fifth dataset serves as an illustration of Simpson’s paradox.
The new (#5) dataset is in complete concordance with the summary statistics of the original Anscombe’s quartet.
Data visualization is a core analytical skill required for problem-solving. As long as the visuals are clean and effective, it does not matter whether the data is graphed with Python or R, Tableau or Power BI.