Circling Around a Solution
Intro
Came across a fun little problem on the subreddit /r/theydidthemath that asked about the accuracy of a joke pie chart.
The top post by the time I showed up took a quick crack at checking the pixel diameter and-- per the subreddit name-- "doing the math."
But having done something very similar to this in my last post, I figured this is as good an excuse as any to recycle some old code and write for the first time in a couple months.
Using a similar idea as the last post, we'll group all of the colors of the image into like colors. Then we'll simply divide the blue, my birthday, by the area of the circle.
from IPython import display
display.Image('orig.png')
Image to Data
We'll start by importing the Image module from PIL (the Pillow imaging library).
from PIL import Image, ImageDraw
I went ahead and downloaded the image from the post and trimmed it down to just include the circle.
im = Image.open('circle.png')
im
And we'll stuff that into numpy to get its per-pixel, numerical representation
import numpy as np
arr = np.array(im)
arr.shape
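For a sense of what that shape means, here is a minimal sketch on a made-up image rather than the real circle.png. The names demo and demo_arr are just for illustration. A PNG with an alpha channel would show 4 channels instead of 3.

```python
import numpy as np
from PIL import Image

# Hypothetical stand-in for circle.png: a solid red RGB image.
# Note that PIL takes size as (width, height).
demo = Image.new("RGB", (6, 4), color=(255, 0, 0))
demo_arr = np.array(demo)

print(demo_arr.shape)   # (height, width, channels)
print(demo_arr[0, 0])   # channel values of the top-left pixel
```

So each pixel ends up as one short row of channel values, which is exactly what a clustering model can chew on.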
Bit of Color Finagling
The legend in our original image was two-tone (red/blue). But we've got a bit of a hiccup when we zoom in on locations where two colors meet. Our eyes don't notice it looking at the regular-sized image, but whatever produced this graphic did so with a bit of color fuzziness.
For instance, there are a ton of different purple-y shades at the boundary where blue meets red.
Image.open('sliver_zoom.PNG')
And pinks where the red meets the white.
Image.open('edge_zoom.PNG')
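One rough way to put a number on that fuzziness is to count the distinct colors in the pixel array. The sketch below does it on a made-up strip of pixels (the blended shades are invented for illustration, not sampled from the actual image).

```python
import numpy as np

# Hypothetical 1x5 strip: red, two blended in-between shades, then two blue pixels
strip = np.array(
    [[[255, 0, 0], [170, 0, 85], [85, 0, 170], [0, 0, 255], [0, 0, 255]]],
    dtype=np.uint8,
)

# Flatten to one row per pixel, then find the unique rows and how often each occurs
pixels = strip.reshape(-1, strip.shape[-1])
colors, counts = np.unique(pixels, axis=0, return_counts=True)

print(len(colors))  # number of distinct colors in the strip
print(counts)       # how many pixels share each distinct color
```

Run on a "two-tone" chart, a count well above two is a decent hint that the edges were blended.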
Distilling
So as mentioned up top, we'll employ the same cheeky KMeans application as before to find clusters of "like colors."
By my count, we should expect to see a:
- Red group
- Blue group
- Pink group
- White group
- Purple group
So let's load up a blank KMeans model that anticipates finding 5 color groupings
from sklearn.cluster import KMeans
model = KMeans(5)
And run it on our data
# Flatten to one row per pixel, keeping however many channels the image has
arr = arr.reshape(-1, arr.shape[-1])
model.fit(arr);
We can then inspect which colors it picked
np.set_printoptions(precision=3, suppress=True)
print(model.cluster_centers_[:, :3])
But this isn't terribly helpful, so we'll borrow some helper code we stashed in this notebook.
from helper import draw_rectangle
draw_rectangle(model.cluster_centers_)
Much better.
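The helper itself isn't shown here, so the following is only a guess at what something like it could look like: a hypothetical draw_swatches function (my name, not the notebook's) that paints one square per cluster center using ImageDraw.

```python
import numpy as np
from PIL import Image, ImageDraw

def draw_swatches(centers, size=50):
    """Hypothetical stand-in for the helper: one colored square per center, left to right."""
    centers = np.asarray(centers)[:, :3]  # keep RGB, drop an alpha column if present
    canvas = Image.new("RGB", (size * len(centers), size), "white")
    draw = ImageDraw.Draw(canvas)
    for i, center in enumerate(centers):
        # Cluster centers are floats, so round them to valid integer channel values
        rgb = tuple(int(round(c)) for c in center)
        draw.rectangle([i * size, 0, (i + 1) * size - 1, size - 1], fill=rgb)
    return canvas

# Made-up centers, just to exercise the function
swatches = draw_swatches([[254.6, 0.2, 0.1], [0.0, 0.0, 255.0]])
```

In a notebook, leaving swatches as the last line of a cell would display the strip.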
Then we can identify each of our points by which "Mean Color" they're closest to-- the index on the left corresponds to the order of the colors above.
import pandas as pd
res = pd.Series(model.predict(arr)).value_counts()
res.sort_index()
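As a tiny illustration of what that Series holds, here is the same call on a handful of made-up labels. The point worth noticing is that the index is the cluster label, so looking up res[0] fetches the count for cluster 0 rather than the first row.

```python
import pandas as pd

# Hypothetical cluster labels for six pixels
labels = [0, 0, 1, 2, 0, 1]

res = pd.Series(labels).value_counts()
by_label = res.sort_index()   # counts ordered by cluster label

print(by_label)
print(res[0])                 # lookup by label: pixel count for cluster 0
```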
So if we wanted to describe "blue divided by everything not white", we'd have
image = res[2] / (res.sum() - res[1])
Which works out to be about a quarter of a percent of the area of the circle
image * 100
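One caveat with res[2] and res[1]: which index means "blue" or "white" depends on the order the clusters came out in on that particular run. A possible way around it, sketched here with invented centers and counts (none of these numbers come from the real image), is to look up each cluster by its nearest reference color.

```python
import numpy as np

# Hypothetical stand-ins for model.cluster_centers_[:, :3] and the per-cluster pixel counts
centers = np.array([
    [250.0,  10.0,  10.0],   # red
    [255.0, 255.0, 255.0],   # white
    [ 10.0,  10.0, 250.0],   # blue
    [255.0, 180.0, 180.0],   # pink
    [130.0,  10.0, 130.0],   # purple
])
counts = np.array([9000, 5000, 25, 300, 10])

def nearest(centers, target):
    """Index of the center closest (Euclidean distance) to a target RGB color."""
    return int(np.argmin(np.linalg.norm(centers - np.asarray(target, dtype=float), axis=1)))

blue_idx = nearest(centers, (0, 0, 255))
white_idx = nearest(centers, (255, 255, 255))

# Same "blue divided by everything not white" ratio, without hard-coded indices
blue_share = counts[blue_idx] / (counts.sum() - counts[white_idx])
print(blue_idx, white_idx, blue_share)
```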
Going back to the original question, the author wanted to know how this stacked up against the actual ratio of birthdays to not birthdays in a year.
birthday = 1 / 365
birthday * 100
Not bad, yeah? The result that you get when running KMeans is somewhat random: it depends on where the algorithm happens to place its initial cluster centers on a given run.
Had a few runs that were nearly identical. A good number that weren't. All told, though, I'd say that this image is pretty accurate.
(birthday - image) / birthday
Though someone more patient than me might consider averaging the "pct blue in the circle" over many, many images to say for certain. But I think I've rabbit-holed on this plenty long enough already :)
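For the patient, a rough sketch of that averaging idea might look like the following: refit KMeans under several seeds and average the blue share. It runs on a made-up three-color pixel set with no blended edges, so here every seed agrees. The pixel counts, the blue_share function and the seed range are all assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy "image": 700 red, 100 blue and 200 white pixels, no blending
pixels = np.array(
    [[255, 0, 0]] * 700 + [[0, 0, 255]] * 100 + [[255, 255, 255]] * 200,
    dtype=float,
)

def nearest(centers, target):
    """Index of the center closest (Euclidean distance) to a target RGB color."""
    return int(np.argmin(np.linalg.norm(centers - np.asarray(target, dtype=float), axis=1)))

def blue_share(pixels, seed):
    """Fit KMeans with a fixed seed and return blue / (everything not white)."""
    model = KMeans(n_clusters=3, n_init=10, random_state=seed).fit(pixels)
    counts = np.bincount(model.labels_, minlength=3)
    blue = nearest(model.cluster_centers_, (0, 0, 255))
    white = nearest(model.cluster_centers_, (255, 255, 255))
    return counts[blue] / (counts.sum() - counts[white])

# One fit per seed, then average the resulting shares
shares = [blue_share(pixels, seed) for seed in range(5)]
mean_share = float(np.mean(shares))
print(shares, mean_share)
```

Passing random_state also makes any single run repeatable, which is handy when the numbers need to match from one session to the next.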
Cheers, -Nick
I hope reading my solution was at least half as amusing as coming up with it was.
As always, link to my code can be found here. Feel free to badger me on the Internet if anything looks awry!