SLML Part 2 - Assembling the Dataset

Using visual embeddings to filter thousands of images.

Andrew Look


November 1, 2023

SLML Part 2 - Assembling the Dataset

This post is part 2 of “SLML” - Single-Line Machine Learning.

Check out the previous post, part 1, for some background on why I got started training SketchRNN on my own drawings.

Next in the series is part 3, where I convert my JPEGs into stroke-3 format in preparation for model training.

For years, I’ve maintained a daily practice of making single-line drawings in my sketchbooks.

sb48p072 sb67p077
sb67p004 sb38p073-gown2

When I finish a sketchbook, I scan its pages, number them, and put them into a similar folder structure. I marked which pages were diagrams/mindmaps/etc and put them in a subfolder called notes - the drawings (and watercolors) are in a folder called art. Unfortunately, I didn’t tag watercolors in advance.

Some detail on the subfolders:

  • art subfolders contain drawings / watercolors.
  • notes subfolders contain diagrams, mindmaps, and anything else technical I’m working on. I used to have a separate notebook for these, but I got tired of carrying around an extra sketchbook everywhere I go, so at some point they just ended up merging with my art practice.
  • xtra contains any bad scans or funny bloopers from my scanning process. I kind of like the scans with my hands in them and thought they might be useful for a future project, so I kept them in their own folder.
  • cover contains a scan of the notebook cover. Each cover at least has a post-it with the sketchbook ID number and the start/end dates of when I used that sketchbook. Often I’ll make a little collage on the front of the notebook with airplane tickets / stickers / scraps of paper relevant to somewhere I traveled when I used that sketchbook.

The Problem: Filtering

My first problem is that the “art” sections of my sketchbooks don’t just contain single-line drawings. They also contain watercolor paintings, since I don’t exclusively fill my sketchbooks with single-line drawings.

sb69p086-wc sb67p051-sparky sb69p006 sb35p013

Measuring Visual Similarity with Embeddings

Looking for a scalable way to filter the drawings out of my art scans, I decided to try computing visual embeddings.

I decided to fine-tune a vision model on my dataset to distinguish “art” pages from “notes” pages. My hope was that the model would learn to distinguish some features within the visual domain of my scanned sketchbook pages. Even though I don’t intend to use the predictions of an algorithm trained to classify art-vs-notes, this will produce more meaningful embeddings.

Ultimately, what I want to see is that when I take the embeddings of my sketchbook pages and cluster them, the notes and drawings should neatly separate into separate clusters. Also, I want the watercolors to cleanly separate into their own clusters, away from the single-line drawings.

Inspecting the Embeddings

After I fine-tuned my vision model, I computed the embeddings for all the sketchbook pages I had on hand. As a sanity check, I want to verify that the embedding space has cleanly separated regions for the labels is was trained on.

It’s hard to visualize 512 dimensions, so the T-SNE dimensionality reduction is a good visualization technique to view the data. T-SNE projects the higher number of dimensions into 2D or 3D, with the goal that any 2 points close together in high-dimensional space are close together in lower-dimensional space.

I colorized the points in the resulting graph based on their label, so it’s easy to see at a glance that embedding space has learned something useful about this dataset, as I hoped.

T-SNE representation of sketchbook pages. art pages are blue, notes pages are orange, and cover pages are brown.

Clustering to Identify Sub-Groups

Now I’m hoping to find the watercolors in the dataset. Given that the embedding space captures visual similarity, I want to identify groups of pages that are similar to each other. This lets me label data in bulk, rather than going through the images one-by-one. K-Means Clustering was a quick solution for this. I can decide to ask for K distinct “clusters”, and the algorithm picks K “centroids”.

Given a request for K clusters, pick K random points as the starting centroids. Then, repeat these steps for N iterations: For each point, select the nearest centroid point and compute the distance between these two. Adjusts the centroids with the goal of minimizing the total distance from all points to their nearest centroids.

Using Clusters to Classify

After some experimentation, I found that using K=16 produced groupings that would be useful to me. Each row in the image below consists of the top images from one cluster.

Since each cluster’s centroid is a vector in the same space as the image embeddings, I can “classify” any image embedding by doing a nearest-neighbor search of comparing the embedding of the query image to each of the cluster centroid vectors, and choosing the cluster centroid with the smallest distance. This is also called a K-Nearest Neighbor classifier.

When I applied this process to each of the images, I glanced at a t-SNE colorized by these clusters:

T-SNE representation of sketchbook pages, colored by which of the 16 clusters they were classified by K-Means using the pages’ visual embeddings.

Making Sense of the Clusters

I went through and grouped these clusters based on what I wanted to identify automatically. In particular, I wanted to take only the drawings and separate them from the watercolors. After inspecting the clusters, I also realized that there were a number of bad scans / incorrectly cropped images that I wanted to exclude as well.

I went through the dataset again, and based on the cluster assigned to each image, I assigned it a higher-level group based on the hand-mapping of clusters above.

When I looked at the t-SNE colorized by this higher-level cluster mapping, it was encouraging to see that my embedding space had neatly separated the groups I cared about:

T-SNE representation of pages, colored by cluster group. drawings are purple, notes are dark green, watercolors are yellow, bad scans are blue, covers are light green.

Since my images were tagged with these higher-level groups, I was able to select only the drawings, and begin converting them into usable training data.

Next in my SLML series is part 3, where I convert my JPEGs into stroke-3 format in preparation for model training.