Alright, let’s talk about this COCO heads thing I was messing with recently. I needed head data, specifically bounding boxes for heads, from the COCO dataset for a little side project.

Getting Started
First off, I had to grab the COCO dataset itself. You know the drill – download the images, download the annotations. The 2017 set, I think it was. The images took a while, they’re pretty big. And that annotation file, the JSON one, it’s huge and packed with info.
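For reference, the 2017 downloads can be scripted; here’s a minimal sketch in Python using the standard COCO URLs (I’m showing the smaller val split; swap in train2017.zip if you want the full thing):

```python
import urllib.request

# Canonical COCO 2017 URLs: val images plus the train/val annotation bundle.
# train2017.zip is the big one (~18 GB); val2017.zip is ~1 GB.
urls = [
    "http://images.cocodataset.org/zips/val2017.zip",
    "http://images.cocodataset.org/annotations/annotations_trainval2017.zip",
]
for url in urls:
    filename = url.rsplit("/", 1)[-1]
    print("downloading", filename)
    urllib.request.urlretrieve(url, filename)  # unzip these by hand afterwards
```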
Digging into the Annotations
So, I popped open that JSON file. Man, it’s a maze. It’s got categories, image info, bounding boxes, segmentation masks, keypoints… everything but the kitchen sink. My first job was just figuring out how this thing is structured. I needed ‘person’ annotations, obviously. Found the category ID for ‘person’ (it’s 1), easy enough.
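If you want to poke at it yourself, something like this works (assuming the annotations are unzipped into an annotations/ folder; I’m using the person keypoints file since that’s what I needed later anyway):

```python
import json

# person_keypoints_*.json has the same top-level layout as instances_*.json,
# with a 'keypoints' array added to each person annotation.
with open("annotations/person_keypoints_val2017.json") as f:
    coco = json.load(f)

print(coco.keys())
# dict_keys(['info', 'licenses', 'images', 'annotations', 'categories'])

# Look up the category ID for 'person' (in COCO it comes out as 1).
person_id = next(c["id"] for c in coco["categories"] if c["name"] == "person")
```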
But then, the heads. I looked through the object detection annotations. You get a bounding box for the whole person, but nothing for just the head. That was the first snag: COCO doesn’t hand you ‘head’ boxes in the standard detection annotations.
Trying the Keypoint Angle
I remembered COCO has keypoints for people – nose, eyes, ears, shoulders, etc. My thought was, maybe I can use the head keypoints to figure out where the head is?
- I wrote some code to load the big JSON file.
- Filtered it down to only the annotations for the ‘person’ category.
- For each person, I looked at their keypoints data. It’s a flat array of 17 (x, y, visibility) triples, one per keypoint.
- I picked out the keypoints for the head: nose, left eye, right eye, left ear, right ear.
This seemed promising. If I have the locations of eyes, ears, and nose, I should be able to draw a box around them, right?
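Here’s a minimal sketch of those steps; the file path and helper name are mine, but the keypoint order is the standard COCO one, where the first five entries are nose, left eye, right eye, left ear, right ear:

```python
import json

with open("annotations/person_keypoints_val2017.json") as f:
    coco = json.load(f)

person_id = next(c["id"] for c in coco["categories"] if c["name"] == "person")

# Keep only person annotations that have at least one labeled keypoint.
people = [a for a in coco["annotations"]
          if a["category_id"] == person_id and a.get("num_keypoints", 0) > 0]

def head_points(ann):
    """Return the five head keypoints of one annotation as (x, y, v) triples.
    The flat 'keypoints' list is [x0, y0, v0, x1, y1, v1, ...] with 17 triples;
    indices 0-4 are nose, left eye, right eye, left ear, right ear."""
    kps = ann["keypoints"]
    return [(kps[3 * i], kps[3 * i + 1], kps[3 * i + 2]) for i in range(5)]
```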

Making Boxes from Points
My first attempt was basic. For a person, find all the visible head keypoints (checking that visibility flag). Then find the minimum x, minimum y, maximum x, and maximum y among those points. That defines a box.
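In code, that min/max step looks something like this (building on the head_points helper above; keypoint_box is my name for it):

```python
def keypoint_box(triples):
    """Naive head box: min/max over the labeled head keypoints, or None.
    Visibility flags: v=0 not labeled, v=1 labeled but occluded, v=2 visible."""
    pts = [(x, y) for x, y, v in triples if v > 0]
    if not pts:
        return None  # no usable head keypoints for this person
    xs, ys = zip(*pts)
    return (min(xs), min(ys), max(xs), max(ys))  # (x1, y1, x2, y2)

box = keypoint_box(head_points(people[0]))
```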
Well, it sort of worked. For faces looking straight on, it was okay-ish, maybe a bit tight. Part of the problem is that all five head keypoints sit on the face, so a min/max box never includes the top of the skull or the chin. And if the head was turned or tilted, or if some keypoints weren’t labeled (visibility flag of 0), the boxes looked really weird. Sometimes tiny, sometimes cutting off part of the head. Not really robust.
Trying to Improve the Boxes
Okay, plan B. Maybe add some padding? I tried taking the min/max box and just expanding it by a fixed number of pixels, or by a percentage. That helped make them less tight, but didn’t fix the weird shapes from missing keypoints or odd poses.
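The padding itself is simple enough; a sketch, with the percentage as a knob:

```python
def pad_box(box, pad=0.2):
    """Grow an (x1, y1, x2, y2) box by a fraction of its size on each side.
    Doesn't clamp to image bounds; clip against the image width/height if needed."""
    x1, y1, x2, y2 = box
    dx, dy = (x2 - x1) * pad, (y2 - y1) * pad
    return (x1 - dx, y1 - dy, x2 + dx, y2 + dy)
```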
I also thought about using just the nose position and maybe guessing a standard head size relative to the person’s overall bounding box? That started feeling way too complicated and full of guesswork. Too many edge cases.
The Reality Check
I spent a good chunk of time fiddling with this keypoint approach. Reading around online, it seemed like other people had hit the same wall. The standard COCO detection stuff just isn’t designed to give you clean head bounding boxes easily. It gives you person boxes and person keypoints.

Turns out, there are other datasets, sometimes derived from COCO or similar sources, that are specifically focused on heads, like CrowdHuman and others mentioned in research papers. People have done the hard work of creating dedicated head annotations. If I needed really good head boxes, I probably should have looked for one of those specialized datasets first.
What I Ended Up With
For my little project, super high accuracy wasn’t the absolute top priority. So, I stuck with my script based on keypoints, but made peace with it being imperfect.
Here’s roughly what my script did in the end (full sketch after the list):
- Load COCO annotations.
- Find ‘person’ annotations.
- For each person, grab the head keypoints (nose, eyes, ears).
- Important: Filter out people where most head keypoints were missing (v=0 means the point wasn’t labeled at all). If only an ear was missing, maybe okay, but if the eyes and nose were gone, I skipped that person.
- Calculate a bounding box using the min/max coordinates of the visible head keypoints.
- Add some percentage-based padding (like 15-20% maybe) to this box.
- Save these generated ‘head’ coordinates, linked to the image ID, into my own simpler file format.
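Pulled together, the whole thing is short. All the names here are mine, min_visible=3 is my stand-in for the ‘eyes and nose gone’ rule, and the output is just whatever JSON shape I found convenient:

```python
import json

def extract_head_boxes(ann_file, out_file, min_visible=3, pad=0.15):
    """Rough head boxes estimated from COCO person keypoints (not ground truth)."""
    with open(ann_file) as f:
        coco = json.load(f)
    person_id = next(c["id"] for c in coco["categories"] if c["name"] == "person")

    results = []
    for ann in coco["annotations"]:
        if ann["category_id"] != person_id:
            continue
        kps = ann["keypoints"]
        # First five keypoints: nose, left eye, right eye, left ear, right ear.
        head = [(kps[3 * i], kps[3 * i + 1], kps[3 * i + 2]) for i in range(5)]
        pts = [(x, y) for x, y, v in head if v > 0]
        if len(pts) < min_visible:
            continue  # too few head points labeled; skip this person
        xs, ys = zip(*pts)
        x1, y1, x2, y2 = min(xs), min(ys), max(xs), max(ys)
        dx, dy = (x2 - x1) * pad, (y2 - y1) * pad  # percentage padding
        results.append({"image_id": ann["image_id"],
                        "head_box": [x1 - dx, y1 - dy, x2 + dx, y2 + dy]})

    with open(out_file, "w") as f:
        json.dump(results, f)

extract_head_boxes("annotations/person_keypoints_val2017.json", "head_boxes.json")
```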
So, yeah. I got some head boxes out of COCO, but it wasn’t straightforward. It meant digging into the keypoints and accepting a less-than-perfect result based on estimation. If you need proper head detection data, the standard COCO object annotations require some extra work, or you can go look for one of those specialized head datasets. Just sharing what I went through!