Human Pose Estimation from Images

By Karan Anand, PhD


Motivation

Human pose estimation is a fundamental task in computer vision, aiming to predict key joint positions (like head, shoulders, wrists, etc.) from an image. This project explores a bottom-up keypoint detection approach using heatmaps and builds the training pipeline from scratch using COCO keypoints data.


Project Objectives

  • Predict 17 body keypoints per person using the COCO format
  • Generate ground truth heatmaps as Gaussian blobs
  • Train a CNN-based pose estimation model using dual loss:

    • Heatmap BCE loss
    • Coordinate regression loss (via soft-argmax)
  • Evaluate using the PCK@0.2 metric

Dataset Preparation

  • Subset extracted from COCO Keypoints 2017
  • Images filtered to include at least one visible person with valid keypoints
  • Used subsets of 200 and 2000 images for experiments
  • Input image size: 256x192
  • Heatmap resolutions: 96x72 and 192x256 (depending on decoder)
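The ground-truth heatmaps mentioned above can be rendered as Gaussian blobs centered on each annotated joint. Below is a minimal sketch of that step; the helper names (`make_heatmap`, `make_target`) are assumptions for illustration, using the 96x72 target resolution. Joints with COCO visibility flag `v == 0` are left as all-zero channels.

```python
import numpy as np

def make_heatmap(h, w, cx, cy, sigma=2.0):
    """Render one heatmap as a 2D Gaussian blob centered at (cx, cy)."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

def make_target(keypoints, h=96, w=72, sigma=2.0):
    """Stack 17 heatmaps, one per COCO keypoint; unlabeled joints stay zero.
    keypoints: iterable of (x, y, v) in heatmap coordinates, v = COCO visibility."""
    target = np.zeros((17, h, w), dtype=np.float32)
    for k, (x, y, v) in enumerate(keypoints):
        if v > 0:
            target[k] = make_heatmap(h, w, x, y, sigma)
    return target
```

Keypoint coordinates must be scaled from image space (256x192) into heatmap space before rendering, which is exactly the resolution-matching concern discussed later in this post.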

Architecture

1. Encoder

  • Pretrained ResNet18 used to extract deep visual features

2. Decoders

  • SimplePoseNet: 3-layer conv-transpose decoder
  • UNetPoseNet: U-Net style decoder with skip connections

3. Heatmap Decoder Output

  • Predicts 17-channel heatmaps
  • Soft-argmax used to decode continuous keypoint coordinates
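Soft-argmax decodes a coordinate as the expected position under a softmax distribution over each heatmap, so the whole decoding step stays differentiable. A minimal sketch (the function name and tensor layout are assumptions):

```python
import torch

def soft_argmax(heatmaps):
    """Differentiable expected (x, y) per joint from heatmaps of shape (B, K, H, W)."""
    b, k, h, w = heatmaps.shape
    # Normalize each heatmap into a probability distribution over pixels.
    probs = torch.softmax(heatmaps.reshape(b, k, -1), dim=-1).reshape(b, k, h, w)
    xs = torch.arange(w, dtype=probs.dtype, device=probs.device)
    ys = torch.arange(h, dtype=probs.dtype, device=probs.device)
    # Marginalize over rows/columns, then take the expectation.
    x = (probs.sum(dim=2) * xs).sum(dim=-1)
    y = (probs.sum(dim=3) * ys).sum(dim=-1)
    return torch.stack([x, y], dim=-1)  # (B, K, 2)
```

Because the output is an expectation rather than a hard argmax, gradients from a coordinate loss flow back into the heatmap logits, which is what makes the dual-loss setup below possible.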

Loss Functions

  • Heatmap Loss: BCEWithLogitsLoss
  • Coordinate Loss: L1Loss (between soft-argmax output and ground truth)
  • Dual Loss: Total = heatmap_loss + 0.1 * coord_loss
  • Joint-wise weighting added to emphasize harder joints (wrists, ankles)
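The dual loss described above can be sketched as follows. The function signature, the optional per-joint weight tensor, and passing predicted coordinates in separately (e.g. from a soft-argmax over the logits) are assumptions about the interface, not the project's actual code.

```python
import math
import torch
import torch.nn.functional as F

def dual_loss(pred_logits, pred_coords, gt_heatmaps, gt_coords,
              joint_weights=None, coord_weight=0.1):
    """Total = heatmap BCE + 0.1 * L1 on decoded coordinates.
    pred_logits, gt_heatmaps: (B, K, H, W); pred_coords, gt_coords: (B, K, 2)."""
    # Per-joint BCE so hard joints (wrists, ankles) can be up-weighted.
    bce = F.binary_cross_entropy_with_logits(
        pred_logits, gt_heatmaps, reduction="none").mean(dim=(2, 3))  # (B, K)
    if joint_weights is not None:
        bce = bce * joint_weights  # joint_weights: (K,), broadcast over batch
    heatmap_loss = bce.mean()
    coord_loss = F.l1_loss(pred_coords, gt_coords)
    return heatmap_loss + coord_weight * coord_loss
```

Keeping the coordinate term small (0.1) lets the heatmap loss dominate early training while the L1 term nudges ambiguous blobs toward the correct position.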

Evaluation Metrics

  • PCK@0.2: Percentage of Correct Keypoints within 20% of torso size
  • Visualizations: Ground truth vs predicted keypoints overlaid on images
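PCK@0.2 can be computed as below. The array layout, `(N, K, 2)` keypoints with a per-person torso length and visibility mask, is an assumption about how predictions are stored.

```python
import numpy as np

def pck(pred, gt, visible, torso, alpha=0.2):
    """PCK@alpha: fraction of visible keypoints within alpha * torso size of GT.
    pred, gt: (N, K, 2) coordinates; visible: (N, K) bool; torso: (N,) lengths."""
    dist = np.linalg.norm(pred - gt, axis=-1)   # (N, K) Euclidean errors
    correct = dist <= alpha * torso[:, None]    # per-person threshold
    return correct[visible].mean()              # average over visible joints only
```

Tracking this per joint (averaging `correct` over `N` separately for each of the 17 channels) is what surfaces the weak wrists and ankles mentioned later.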

Results

  • Best performance: PCK@0.2 ≈ 0.51 (on 2000 images, UNet decoder)
  • Training for 30 epochs yielded visible improvement in heatmap localization
  • Soft-argmax significantly improved convergence and keypoint sharpness

Key Observations

  • Ground truth heatmaps must match model output resolution to avoid misalignment
  • SimplePoseNet was faster but underfit difficult joints
  • UNet decoder provided sharper localization but trained slower
  • Coordinate loss helps guide ambiguous blobs to correct position

Known Issues

  • Multi-person ambiguity: Only the first visible person is supervised; keypoints may sometimes be projected onto other individuals in crowded scenes
  • Pose structure mismatch: Even when keypoints are in the right area, the relative arrangement often does not match the skeleton
  • Heatmap upsampling artifacts: Early experiments with mismatched GT and output resolutions caused training instability

What I Learned

  • Combining heatmap supervision with coordinate decoding improves spatial precision
  • U-Net style decoders can enhance weak joints but require careful tuning
  • Soft-argmax is a differentiable, effective way to extract coordinates
  • Visual debugging and per-joint metric tracking are essential

Next Steps

  • Add structural priors or pose refinement modules
  • Experiment with hourglass or transformer decoders
  • Extend to multi-person estimation using associative embedding
  • Deploy trained model on webcam or video stream for real-time inference

📁 Repository

GitHub Link


This post is licensed under CC BY 4.0 by the author.