Design a computer vision solution that can assess the crowd density and behavior in a venue and inform the operators of the typical points of crowd convergence and the current crowd situation.
1. To estimate the number of people in a crowd.
2. To use the count to understand the crowd density of specific areas.
The first thing we have to do to approach the challenge is to break it down into simpler logical steps.
From here we quickly understand that the first step is to estimate the number of people and how it is distributed in a given camera frame. For this we use a Deep Learning Crowd Counting model that receives an image as input and outputs a matrix with the estimated number of persons per pixel of that image.
The second step is to determine the density of specific areas. Having the count of persons per pixel, we apply a grid to the output matrix and determine the density of each cell by summing the values of all of its points. The new matrix shows us how many people are currently accumulated in each area. Knowing this, we can create rules to determine the status of a zone in terms of its capacity and alert the event organizers in real time. We can also build a historical dataset of area occupancy to be studied and used in the preparation of other events.
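As a minimal sketch of the grid step, assuming the model output is available as a plain list of rows, the per-cell count can be obtained by summing the pixel values that fall inside each cell. The function name `grid_counts` and the 2x2 grid are illustrative choices, not part of the original solution.

```python
def grid_counts(density, rows, cols):
    """Sum a per-pixel density map into a rows x cols grid of people counts."""
    height, width = len(density), len(density[0])
    grid = [[0.0] * cols for _ in range(rows)]
    for y in range(height):
        # Map pixel row y to its grid row; integer math keeps edge pixels in range.
        gy = y * rows // height
        for x in range(width):
            gx = x * cols // width
            grid[gy][gx] += density[y][x]
    return grid


# Toy 4x4 density map whose values sum to 3 people.
density = [
    [0.5, 0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.25, 0.25],
    [0.0, 0.0, 0.25, 0.25],
]
cells = grid_counts(density, rows=2, cols=2)
print(cells)
```

Because every pixel is assigned to exactly one cell, the sum over the grid equals the sum over the original matrix, so the total count is preserved.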
Our proposed solution utilizes a Density Estimation Model. The model receives the frames from the cameras and produces the present count of people per pixel on that frame in the form of a 2D matrix.
Type of model
The crowd counting model used is a custom Convolutional Neural Network developed in PyTorch that takes an image as input and outputs a matrix with the estimated number of people per pixel.
How the model works
The model is composed of a sequence of layers, each followed by an activation function.
Each layer holds one or more small matrices, called filters, whose values are called weights. The model receives the image as three equal-sized matrices, corresponding to the red, green and blue (RGB) values of its pixels, slides each filter over these matrices and passes the resulting values to the activation function. The activation function then decides which of the values obtained from the layer are “acceptable” to the model and should be passed along to the next layer.
The logic of the model is similar to a regression function, where the values of the image are multiplied by a series of coefficients to output an estimated value.
In the end, after going through all the activated layers, the model outputs a 2D matrix with its estimates for each pixel of the input image.
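To make the filter and activation idea concrete, here is a small pure-Python sketch of one filter sliding over one channel, followed by an activation. ReLU is assumed as the activation function, since the text does not name the one actually used. A real layer would apply many filters across all three channels.

```python
def conv2d_valid(channel, kernel):
    """Slide one filter over one channel (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(channel) - kh + 1
    out_w = len(channel[0]) - kw + 1
    out = []
    for y in range(out_h):
        row = []
        for x in range(out_w):
            total = 0.0
            # Multiply the filter weights by the pixels they cover and add them up.
            for dy in range(kh):
                for dx in range(kw):
                    total += channel[y + dy][x + dx] * kernel[dy][dx]
            row.append(total)
        out.append(row)
    return out


def relu(feature_map):
    """Keep positive values and replace negative ones with zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]


channel = [[3, 2, 0], [0, 1, 3], [4, 0, 5]]
kernel = [[1, 0], [0, -1]]  # illustrative weights, not learned ones
features = conv2d_valid(channel, kernel)
activated = relu(features)
```

In training, the weights inside `kernel` are the values that get adjusted. Here they are fixed by hand only to show the arithmetic.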
Object of focus
Observing a crowd scenario from the typical viewpoint of a surveillance camera (diagonally above the scene), the head is the part of the human body most likely to remain visible. Legs, arms and torso are predictably occluded from view due to how close people stand in the crowd. Given this, the object of focus for the model is the human head.
The data used to train the model is a set of 1300 images of crowds of varying density. Each image is first run through annotation software in which the user manually marks every head. The software then outputs a file for each image containing a list of dots, which are the center coordinates of the heads present in the image.
The expected output of the model is a 2D matrix whose values are its estimates of the number of people at each pixel. To train the model, we therefore first have to convert the list of dots obtained from the annotation software into a matrix in which a head dot is valued 1 and a non-head point is valued 0.
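A minimal sketch of this conversion, assuming the annotation file has already been parsed into a list of integer `(x, y)` pixel coordinates (the actual file format is not described here):

```python
def dots_to_matrix(dots, height, width):
    """Build a height x width matrix with 1 at every annotated head center."""
    matrix = [[0] * width for _ in range(height)]
    for x, y in dots:
        # x is the column and y is the row; skip dots outside the image.
        if 0 <= y < height and 0 <= x < width:
            matrix[y][x] += 1
    return matrix


dots = [(1, 0), (3, 2)]  # two hypothetical head centers
matrix = dots_to_matrix(dots, height=3, width=4)
```

Using `+= 1` rather than `= 1` is a design choice in this sketch: if two annotations land on the same pixel, the sum of the matrix still equals the number of heads.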
The final step in the data preparation process is to define a radius where a head exists. Currently, we only have the center point, and if we used the matrix as it is, the model would only consider one pixel to identify a head, which is too little information. To overcome this, we apply a Gaussian filter, a function commonly used to blur images. This process takes as input a value for the radius and the coordinates of a point in the matrix, and “softens” our dot by blurring the area in a circular pattern, with the maximum value at the original dot and decreasing values around it. In the end, we have a new matrix in which the sum of all the values is the same as before, but the distribution of the values is intended to resemble the distribution of the heads across the image pixels. Naturally, in areas of dense agglomeration of people the values will be higher, and in areas where there is nothing the value will be 0.
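The sketch below shows one way to spread each head dot with a Gaussian while keeping the total count unchanged. The `sigma` and `radius` values are assumptions chosen for illustration, and the kernel is renormalized per head so that heads near the image border still contribute exactly 1.

```python
import math


def density_map(dots, height, width, sigma=1.0, radius=3):
    """Spread each (x, y) head dot as a Gaussian blob that sums to 1."""
    density = [[0.0] * width for _ in range(height)]
    for cx, cy in dots:
        weights = {}
        # Only visit pixels inside the image and within `radius` of the dot.
        for y in range(max(0, cy - radius), min(height, cy + radius + 1)):
            for x in range(max(0, cx - radius), min(width, cx + radius + 1)):
                dist_sq = (x - cx) ** 2 + (y - cy) ** 2
                weights[(y, x)] = math.exp(-dist_sq / (2 * sigma ** 2))
        total = sum(weights.values())
        if total == 0:
            continue  # the dot lies entirely outside the image
        for (y, x), w in weights.items():
            density[y][x] += w / total  # normalize so this head adds up to 1
    return density


heads = [(2, 2), (0, 0)]  # hypothetical head centers
density = density_map(heads, height=5, width=5)
people = sum(sum(row) for row in density)  # approximately 2.0
```

In practice an image library's Gaussian blur would typically replace the explicit loops. The loops are kept here only to make the normalization visible.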
After these steps we consider the whole dataset of images and reserve a part for validation and another for after-training testing.
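A simple way to reserve these parts is a seeded random split. The 15% fractions below are assumptions for illustration; the proportions actually used are not stated.

```python
import random


def split_dataset(items, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle once with a fixed seed, then cut into train, validation and test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    n_test = int(len(shuffled) * test_fraction)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test


image_ids = list(range(1300))  # stand-ins for the 1300 annotated images
train, val, test = split_dataset(image_ids)
```

Fixing the seed keeps the test set identical between runs, so that it stays unseen during training.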
Now that we have the ground-truth matrix file prepared for each image we can start the training.
The training process has a “simple” trial/error logic:
- The model reads the image and produces a matrix output with its estimates per pixel;
- The model reads the ground-truth file with the correct matrix and compares it with its output matrix to determine the value of its errors;
- The model uses the errors to calculate adjustments to its weights and propagates these adjustments through the layers back to the beginning.
Once the model has done this for all the images in the training set, it runs the first two steps on the validation set. We call this full pass an Epoch.
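The loop below illustrates the same three steps and the notion of an epoch on a deliberately tiny stand-in: a model with a single weight trained by gradient descent on squared error. It is not the CNN described above, and the learning rate and data are made up for the example.

```python
def train(train_set, val_set, epochs=50, lr=0.01):
    """Toy training loop: one weight, squared-error loss, per-sample updates."""
    w = 0.0  # stands in for all the filter weights of the real model
    history = []
    for _ in range(epochs):
        for x, target in train_set:
            pred = w * x             # step 1: produce an estimate
            error = pred - target    # step 2: compare with the ground truth
            w -= lr * 2 * error * x  # step 3: adjust the weight using the error
        # After the training pass, only steps 1 and 2 run on the validation set.
        val_loss = sum((w * x - t) ** 2 for x, t in val_set) / len(val_set)
        history.append(val_loss)
    return w, history


train_set = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # targets follow y = 2x
val_set = [(4.0, 8.0)]
w, history = train(train_set, val_set)
```

Tracking `history` per epoch is what allows training and validation curves to be compared when judging convergence.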
After each Epoch, we verify the model’s loss and accuracy on the training and validation sets to determine the evolution of its training. We finish the training when the model has reached a desired accuracy score.
Considering that the accuracy scores of the model on training and validation sets are convergent, we assume that the model is neither underfitting nor overfitting and run it through the unseen third set of data, the test set, to confirm the accuracy.
Crowd Counting and Grid Map
Below we have a global view of the stages that the input image goes through.
Here we can see how the dot annotation is converted into a ground-truth heatmap that represents the density of people per pixel of the image. After this transformation, the number of dots remains equal to the sum of the values of all the points on the heatmap.
The raw output provided by the model has the same nature as the ground-truth heatmap. As we are going to apply a grid with its own color code to identify the areas of higher concentration, we use the model output only to overlay a light red shade with the purpose of highlighting the areas where there are people.
To allow for integration in other systems, our solution also outputs the inference values as JSON or CSV files (here with added headers for readability), or in any other format.
For example, having the coordinates on this file mapped to specific facilities or locations, one can instantly update their occupancy and automatically take action and reorganize resources.
This file shows the number of people on each cell of the grid, at different times of capture.
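As a sketch of what such an export could look like, the snippet below flattens one grid into rows and serializes them as CSV and JSON using only the standard library. The column names and the timestamp are hypothetical, not the solution's actual schema.

```python
import csv
import io
import json


def grid_to_rows(timestamp, grid):
    """Flatten a grid of counts into one record per cell."""
    return [
        {"timestamp": timestamp, "row": r, "col": c, "count": round(value, 2)}
        for r, line in enumerate(grid)
        for c, value in enumerate(line)
    ]


def rows_to_csv(rows):
    """Serialize the records as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["timestamp", "row", "col", "count"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = grid_to_rows("2021-01-01T12:00:00", [[2.0, 0.0], [0.0, 1.0]])
csv_text = rows_to_csv(rows)
json_text = json.dumps(rows)
```

Keeping the row and column indices in every record is what makes it possible to map a cell to a facility or location downstream.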
Summary of how the solution works
1. The crowd counting step is a classic density estimation using people’s heads as the target for the model.
2. To assess the density of specific regions we apply a grid to the image and aggregate the count of people in each cell, obtaining an n*m matrix in which each entry corresponds to the number of people in a certain cell. This matrix serves as a base to output the current density of each area in a tabular format and to apply a traffic light system that colors the edges of the cells containing a certain number of people, according to predefined thresholds.
3. The solution is ready to be customized, trained and deployed in specific scenarios with varying lighting conditions and camera perspectives, even allowing for the use of focus areas of detection within each frame.
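The traffic light rule from point 2 can be sketched as a simple threshold lookup per cell. The thresholds of 20 and 40 people are placeholders. In a real deployment they would come from the capacity of each zone.

```python
def cell_status(count, amber_at=20, red_at=40):
    """Map a cell's people count to a traffic light color."""
    if count >= red_at:
        return "red"
    if count >= amber_at:
        return "amber"
    return "green"


def grid_status(grid, amber_at=20, red_at=40):
    """Apply the same thresholds to every cell of the grid."""
    return [[cell_status(v, amber_at, red_at) for v in row] for row in grid]


statuses = grid_status([[5.0, 25.0], [40.0, 19.9]])
```

Passing the thresholds as parameters leaves room for per-venue tuning, or for a per-cell threshold table if zones differ in capacity.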
Density estimation vs object detection
When deciding how the count of people on the images was going to be performed, we considered two types of Deep Neural Network models: Density Estimation and Object Detection.
Even though an Object Detection model had in its favor the possibility of getting the coordinates of every detected person automatically during the inference process, we felt that a Density Estimation model was more appropriate to the problem for two main reasons:
1. Annotation – the density estimation model relies on single-dot object annotation, which makes creating the ground-truth files faster and more accurate for images with thousands of objects. The object detection model would have us draw boxes around each object to be able to correctly train the model.
2. Occlusion – crowd scenarios are typically distant from the camera and prone to very high occlusion, especially as the crowd gets denser. The density estimation model allows us to train for both sparse and very dense scenarios, even using very occluded heads (our target object).
Video by Timo Volz from Pexels – https://www.pexels.com/video/people-crossing-on-a-busy-pedestrian-5544073/
A Survey of Recent Advances in CNN-based Single Image Crowd Counting and Density Estimation arXiv:1707.01202v1 [cs.CV]