HUMAN POSE ESTIMATION

Aashay Patel · 10 min read · May 8, 2021

Introduction

Computer vision is a hot area of data science because of the breadth of its applications. From traffic surveillance systems to self-driving cars to medical diagnosis, the possibilities are endless. For our project, we decided to employ computer vision and neural networks to build a human pose joint estimation system. Our network takes in images and predicts where the human joints are within each photo. The system is trained on the MPII dataset, a large collection of images of people performing activities, labeled by activity and annotated with body joints.

Background

A Convolutional Neural Network (CNN) is a deep learning architecture that takes visual input and, using a set of learned weights and biases, makes predictions about the image. Each neuron’s behavior is determined by its weights.

https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53

Given the pixel values, the neurons pick out different features. The first layer of a CNN usually detects basic features such as horizontal, vertical, and diagonal edges.

The output of this layer is used as input to the next layer, which picks out more complex features, such as combinations of corners and edges. As you add more layers, the model begins to detect higher-level features such as patterns, gestures, etc. The word “convolution” comes from the operation of multiplying pixel values by weights and summing them. A CNN contains multiple convolutional layers, but it also has other components, for example the classification layer, which is the final layer of a CNN. It takes the output of the final convolutional layer as its input and produces the likelihood of each class. If you have a CNN that distinguishes hot dogs, pizza, and cake, the output of the final layer is the likelihood that the input image contains each of them.
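As a concrete illustration (not from our project code), here is a minimal sketch of such a network in PyTorch: two convolution/pooling stages feeding a classification layer. The layer sizes, input resolution, and the three food classes are purely hypothetical.

```python
import torch
import torch.nn as nn

# A minimal illustrative CNN: two convolution + pooling stages followed by a
# classification layer. All sizes here are hypothetical, chosen for clarity.
class TinyCNN(nn.Module):
    def __init__(self, num_classes=3):  # e.g. hot dog, pizza, cake
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # low-level edges
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # combinations of edges
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 56 * 56, num_classes)  # assumes 224x224 input

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        logits = self.classifier(x)
        return torch.softmax(logits, dim=1)  # per-class likelihoods

# usage: probs = TinyCNN()(torch.randn(1, 3, 224, 224))
```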

Silicon Valley

AlexNet is an extension of the basic convolutional neural network that was built to address the drawbacks of the traditional approach. Some of the drawbacks of a regular CNN are that input images must be a fixed size such as 256 x 256, which makes higher-resolution or differently cropped images awkward to handle; images must be in RGB format, so grayscale images have to be converted; training is limited to a single GPU; and more. AlexNet alleviated many of these problems and improved performance drastically.

Illustration of AlexNet’s architecture. Credit to Krizhevsky et al., the authors of the AlexNet paper.

AlexNet uses 8 layers: 5 convolutional and 3 fully connected. It also allows the neurons to be split across multiple GPUs for faster training and larger networks. Usually there are several same-size kernels in one convolutional layer. As we can see in the graph above, the first convolutional layer has 48 filters (per GPU stream) and is followed by an overlapping max pooling layer. The second convolutional layer has 128 filters and is also followed by an overlapping max pooling layer. The third, fourth, and fifth convolutional layers are connected directly to each other, without any max pooling layers between them. There is one more max pooling layer after the fifth convolutional layer, whose output then goes through two fully connected layers. Finally, there is a 1000-class softmax classifier.
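For reference, the stock AlexNet that ships with torchvision can be instantiated and inspected directly. Note that this is the single-GPU variant, so its filter counts (64, 192, 384, 256, 256) differ from the per-GPU counts in the original two-GPU diagram, and with older torchvision versions the flag is `pretrained=False` rather than `weights`.

```python
import torchvision.models as models

# Instantiate the single-GPU AlexNet from torchvision and print its layers:
# 5 convolutional layers with interleaved max pooling, followed by
# 3 fully connected layers ending in a 1000-class output.
alexnet = models.alexnet(weights=None)  # random weights; use weights="DEFAULT" for pretrained
print(alexnet.features)    # convolution/pooling stack
print(alexnet.classifier)  # fully connected stack
```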

For AlexNet, the ReLU activation is used to train the model faster. Comparing the two functions in the figures below, the tanh function has very small gradients when z is very large or very small, while the ReLU function helps convergence (its slope is 1) when z is positive. Although ReLU has zero slope when z is negative, activations are mostly positive in practice, so ReLU provides a large advantage in training speed. ReLU is also a better choice in AlexNet than the sigmoid function, which has a shape similar to tanh.

https://learnopencv.com/understanding-alexnet

In a traditional CNN, pooling regions do not overlap: groups of neurons are simply pooled together with their neighbours. When the researchers added overlapping pooling layers, they saw about a 0.5% decrease in error. They also found that overlapping pooling helps avoid overfitting, which is an important problem that can easily occur with AlexNet, since it has 60 million parameters. There are two basic remedies. The first is data augmentation. One form of augmentation is mirroring pictures, which doubles the training set; we can also take random crops of pictures and use them as new inputs. The second remedy for overfitting is dropout. A neuron is dropped from the network with a probability of 0.5; when it is dropped, it does not contribute to propagation, so every input effectively goes through a different network architecture. This makes the learned weight parameters more robust and less prone to overfitting. There is no dropout at test time, but the outputs are scaled by a factor of 0.5 to account for the neurons that were dropped during training.
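A hedged sketch of how these two remedies are typically expressed in PyTorch is shown below. The specific transforms, image sizes, and layer widths are illustrative rather than the exact AlexNet training code; PyTorch's `nn.Dropout` uses inverted dropout (rescaling during training), which has the same effect as the test-time scaling described above.

```python
import torch.nn as nn
from torchvision import transforms

# Data augmentation: mirroring and random crops effectively enlarge the training set.
augment = transforms.Compose([
    transforms.Resize(256),                  # illustrative; match the network's expected input
    transforms.RandomHorizontalFlip(p=0.5),  # mirroring
    transforms.RandomCrop(224),              # random crops act as "new" training images
    transforms.ToTensor(),
])

# Dropout: each neuron is zeroed with probability 0.5 during training, so every
# forward pass samples a different sub-network. Layer sizes are illustrative.
fc = nn.Sequential(
    nn.Linear(9216, 4096),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(4096, 4096),
)
```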

Dataset

The MPII Human Pose Dataset is composed of around 25,000 images containing over 40,000 people, all with annotated body joints. The images are categorized by human activity, and each is labeled with the activity shown.

We used this dataset because the people have labeled joints, which is exactly what we wanted to work with. Each image was extracted from a YouTube video, and multiple neighboring frames are provided alongside the single annotated one. Every image in the dataset contains annotations for each person; the labels of concern for our project were the person's scale (height), position, 16 joint coordinates, and the visibility of those joints. The head rectangle, activity, and video source were discarded, since we were only interested in joint placement.

Preparing the dataset

The biggest difficulty in handling the dataset was extracting the labels. Everything was stored in a MATLAB struct, and each struct differed depending on the number of people in the image. For simplicity, only the first labeled person in each image was used. These labels were then stored as a list of dictionaries that could be loaded from a .json file. With the appropriate labels in hand, the next step was to build a dataset around them.
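A minimal sketch of that extraction step is below, assuming the official MPII release file `mpii_human_pose_v1_u12_1.mat` and scipy for reading the MATLAB struct. The field names (`annolist`, `annorect`, `objpos`, `scale`, `annopoints`) follow the MPII documentation, but the exact indexing and visibility handling vary with scipy's struct conversion, so treat this as an approximation of our script rather than the script itself.

```python
import json
import scipy.io as sio

# Read the MPII MATLAB struct, keep only the first annotated person per image,
# and dump the labels to JSON. Indexing may need adjusting for your scipy version.
mat = sio.loadmat("mpii_human_pose_v1_u12_1.mat",
                  struct_as_record=False, squeeze_me=True)
release = mat["RELEASE"]

labels = []
for anno in release.annolist:
    try:
        rects = anno.annorect
        rect = rects.flat[0] if hasattr(rects, "flat") else rects  # first labeled person only
        points = rect.annopoints.point
        labels.append({
            "image": anno.image.name,
            "center": [int(rect.objpos.x), int(rect.objpos.y)],
            "scale": float(rect.scale),  # person height in units of 200 px
            "joints": [[float(p.x), float(p.y)] for p in points],
            # is_visible may be missing or empty; treat anything other than 1 as not visible
            "visible": [1 if str(getattr(p, "is_visible", 1)) == "1" else 0 for p in points],
        })
    except (AttributeError, IndexError, TypeError):
        continue  # image without usable joint annotations

with open("labels.json", "w") as f:
    json.dump(labels, f)
```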

The images and labels were loaded into a PyTorch Dataset module, where more image preprocessing was done. The first step was building a bounding box around the person of interest. Thanks to the provided object center and scale, this was not as complicated as it could have been. Starting with the original image, the person's center was located, and, using the scale (which gives the person's height in units of 200 px), a square box was drawn around the person as shown below. Using those indices, the image, represented as a NumPy array, was sliced to extract the crop shown on the right. The crop was then put through a PyTorch transform that resized it and converted it into a tensor. Special attention had to be paid to people close to the edges of the picture; one could potentially add padding, but our group opted not to try that method.
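The "Code to extract person" caption below originally pointed to a screenshot; a minimal sketch of that cropping logic, assuming the image is already loaded as a NumPy array and that the box is simply clipped at the image border (an assumption, since padding was not used), could look like this:

```python
import numpy as np

def extract_person(img: np.ndarray, center, scale):
    """Crop a square box of side scale * 200 px around the person, clipping at the edges."""
    half = int(scale * 200 / 2)
    cx, cy = int(center[0]), int(center[1])
    h, w = img.shape[:2]
    x0, x1 = max(cx - half, 0), min(cx + half, w)
    y0, y1 = max(cy - half, 0), min(cy + half, h)
    return img[y0:y1, x0:x1]
```

The crop would then go through a torchvision transform (e.g. `ToPILImage`, `Resize`, `ToTensor`) before being handed to the network.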

The next preprocessing step was normalizing the labels. Neural networks tend to work better with normalized data centered around 0. To do this, the person's center was subtracted from each joint coordinate and the result was divided by the box width/height (which were equal in our case). With the labels ready, the data could be placed in a DataLoader for training.
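A short sketch of that normalization, with `joints` as a (16, 2) array in original image coordinates and `box_size` the side length of the crop (whether the coordinates are additionally rescaled to the resized image is an assumption not spelled out above):

```python
import numpy as np

def normalize_joints(joints: np.ndarray, center, box_size: float) -> np.ndarray:
    """Shift joint coordinates so the person's center maps to 0, then scale by the box size."""
    return (joints - np.asarray(center)) / box_size
```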

Full image
Extracted image
Code to extract person

First model:

The first model used was the AlexNet discussed in the course lectures/assignments. After copying the PyTorch module from GitHub, a few modifications were made. First, the number of outputs was changed to 32, 2 for each joint. In the forward() method, for compatibility, the resulting vector was reshaped to dimensions (batch_size, 16, 2). The network was then trained with an MSELoss criterion and stochastic gradient descent as the optimizer. Due to hardware and time constraints, only 15 epochs were run, but it is worth noting that the training set had nearly 23k images. The training loss was recorded over the iterations of each epoch; as seen below, it improved after the first epoch but stagnated from there. During training, it occurred to us that many joints were not visible in some images, yet those joints still contributed to the training loss, so the next step was to address that.
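A hedged sketch of the modification is shown below, built on torchvision's AlexNet rather than the exact GitHub module we copied; the only changes are the 32-unit output layer and the reshape to (batch_size, 16, 2). The learning rate and momentum are assumptions.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class PoseAlexNet(nn.Module):
    """AlexNet backbone with a 32-dimensional output, reshaped to 16 (x, y) joints."""
    def __init__(self):
        super().__init__()
        self.backbone = models.alexnet(weights=None)
        # Replace the 1000-class output layer with 32 outputs (2 per joint).
        self.backbone.classifier[6] = nn.Linear(4096, 32)

    def forward(self, x):
        out = self.backbone(x)
        return out.view(-1, 16, 2)  # (batch_size, 16, 2)

model = PoseAlexNet()
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)  # hyperparameters assumed
```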

Loss vs Epoch

Second model:

The second model was the same as the first, except that it used a custom loss function. The loss was still an MSE, but only the estimates of the visible joints counted toward the error. This was implemented as a torch Module. The loss function took the difference between the ground truth and the prediction and squared it. The next step was rather crude: the visibility flags were stacked along the last dimension so the mask had the same shape as the error tensor. The two were then multiplied, so that every non-visible joint had its error zeroed, and the tensor was reduced by summing all of its elements.
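A sketch of that loss as a torch Module, assuming `visible` is a (batch_size, 16) tensor of 0/1 flags supplied alongside the targets; the stacking step mirrors the admittedly crude approach described above.

```python
import torch
import torch.nn as nn

class VisibleJointMSE(nn.Module):
    """Summed squared error over visible joints only; non-visible joints are zeroed out."""
    def forward(self, pred, target, visible):
        sq_err = (pred - target) ** 2                   # (batch, 16, 2)
        mask = torch.stack([visible, visible], dim=-1)  # (batch, 16, 2), matches sq_err's shape
        return (sq_err * mask).sum()
```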

In training, this appeared to be a slight improvement over the original model. The errors from the first 5 epochs were lost due to a connection issue with the cloud service, but from the saved errors of the next 15 epochs, the model's error was stuck around ~6900. Converting that back to the scale of the original error, by dividing by 32 and taking the square root (roughly √(6900/32) ≈ 15), showed that it was slightly better by that metric.

An example of the pose estimation is seen below.

Third model:

The only difference from the second model was the optimizer: Adam instead of SGD. For this model, the batch size was also doubled from 64 to 128 to speed up training, and 40 epochs were run. The change of optimizer proved to have a significant impact on the error; it had already beaten the second model's error by the second epoch. The plot of the error is below.
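The change itself is essentially a two-line swap, sketched here with PyTorch's default Adam hyperparameters, since the exact learning rate is not recorded in this write-up; `model` and `train_dataset` are assumed from the earlier steps.

```python
import torch

# Same model and loss as before; only the optimizer and batch size change.
optimizer = torch.optim.Adam(model.parameters())  # default lr=1e-3; actual value is an assumption
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=128, shuffle=True)
```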

Evaluating the models:

To evaluate the models, no formal metric was used. Some metrics are described on the MPII website, but we were not able to implement them successfully. Instead, our group opted to eyeball the results. To do so, a method was created to take a set of joints and draw them on the input image. The order for each example, left to right, is ground truth, model 1, model 2, and model 3.
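A minimal sketch of such a visualization helper using matplotlib; `joints` is assumed to be a (16, 2) array already mapped back to pixel coordinates of the cropped image.

```python
import matplotlib.pyplot as plt

def draw_joints(img, joints, title="Pose estimate"):
    """Overlay the 16 joint positions on the (cropped) input image."""
    plt.imshow(img)
    plt.scatter(joints[:, 0], joints[:, 1], c="red", s=20)
    plt.title(title)
    plt.axis("off")
    plt.show()
```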

Pose estimation vs true values
Pose estimation vs true values
Pose estimation vs true values
Pose estimation vs true values

As seen from the estimated poses, they are rather inaccurate. The first appears to estimate that the person is sitting, and the second gives the pose of a person walking across the street in a completely different direction than in the actual image. After a few more tests, it turned out that the first two models always gave the same estimates. We hoped this would be fixed in the third model, but the vast majority of the time it also gave very similar estimates, with only small differences here and there, as seen in the diagrams.

Conclusion:

Overall, pose estimation is an incredibly complicated problem, and AlexNet by itself was unable to provide a useful estimate, possibly because it did not have enough layers or because it simply learned an “optimal” average guess over all the images.

Potential future improvements:

There are a few things that could be done to further improve the model. The first is to pad images in which the person is near the edge and truncated, like the volleyball example above. The second is to limit the model to images where all the joints are visible and all people are separable by bounding boxes. The third would be to use a different architecture, such as ResNet or another more complicated architecture that our group was not sure how to modify to suit our needs. Finally, more epochs could be run, which we did not do due to time constraints.

Reference

Nayak, S. (2021, May 4). Understanding AlexNet. LearnOpenCV. Retrieved from https://learnopencv.com/understanding-alexnet

Dickson, B. (2020, January 6). What are convolutional neural networks (CNN)? TechTalks. Retrieved from https://bdtechtalks.com/2020/01/06/convolutional-neural-networks-cnn-convnets/
