Object Tracking for Image Annotations

This week we had Matt present his Part II project to the group.

A big part of machine learning research at the moment is creating and training algorithms to recognise objects. To feed these algorithms training data, you have traditionally had to go out and take thousands of pictures of your desired object, say a human, and then ‘annotate’ each image, letting the trainer know where in the picture the human actually is. As you can imagine, annotating all of these images is a long and very tedious task that is difficult to automate. This is where Matt’s project comes in.

His idea is to take a video of the target object in various positions, say a human walking down a road, instead of separate photographs. Matt’s project would then track the object through the video automatically, finding its position in each frame. This position data gives the training set hundreds of annotated images, one per frame of the video, which can then be used for training.

To teach everyone how an algorithm may go about tracking an object, Matt first took us through some of the basic techniques currently used, such as finding the gradients of an image by applying different kernels. For example, the image below shows the calculated gradients of an image, which can be seen to be a face. These gradients give us a higher-level view of the image, which is much better to work with than raw pixel data.
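
To make the kernel idea a little more concrete, here is a minimal sketch of computing image gradients in Python with NumPy. It is not Matt's code: the choice of Sobel kernels, the helper names `filter2d` and `gradient_magnitude`, and the toy image are all assumptions made for illustration.

```python
import numpy as np

# Sobel kernels (an assumed choice; other gradient kernels work similarly).
SOBEL_X = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])
SOBEL_Y = SOBEL_X.T


def filter2d(image, kernel):
    """Slide the kernel over the image and sum the products at each position.

    Only positions where the kernel fits entirely inside the image are kept,
    so the output is slightly smaller than the input.
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for dy in range(kh):
        for dx in range(kw):
            out += kernel[dy, dx] * image[dy:dy + out_h, dx:dx + out_w]
    return out


def gradient_magnitude(image):
    """Combine the horizontal and vertical gradients into one edge-strength map."""
    gx = filter2d(image, SOBEL_X)
    gy = filter2d(image, SOBEL_Y)
    return np.hypot(gx, gy)


# Toy greyscale image: dark left half, bright right half.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

magnitude = gradient_magnitude(image)
print(magnitude)
```

The output is large only around the vertical edge between the two halves and zero in the flat regions, which is the sense in which gradients pick out structure rather than raw brightness.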


However, there are better methods that can be used, such as cross correlation, the method Matt is using in his project. Here, the algorithm gives us a ‘mask’ that we can apply to an image, and this mask will produce a sharp, centred white spot if the applied image matches the object, and a faded white image if not. This is an accurate algorithm, but it is extremely expensive to use, as it has O(p*p) complexity, where p is the pixel count. To improve this time complexity, Matt then showed us a mathematical method for finding the result of the cross correlation much more quickly using the Fourier transform, which with a fast Fourier transform typically brings the cost down to around O(p log p). We then had a demo where Matt showed the result of his work on this algorithm, which successfully recognised a face in various positions.
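
As a rough illustration of the Fourier transform speed-up, the sketch below compares a brute-force circular cross correlation with the same result computed through NumPy's FFT. This is a sketch under stated assumptions, not Matt's implementation: the function names, the tiny synthetic image, and the use of plain (unnormalised, wrap-around) correlation are all illustrative choices.

```python
import numpy as np


def pad_to(template, shape):
    """Place the template in the top-left corner of a zero array of the given shape."""
    padded = np.zeros(shape)
    padded[:template.shape[0], :template.shape[1]] = template
    return padded


def cross_correlate_direct(image, template):
    """Brute force: try every offset and sum the products (about p*p work)."""
    padded = pad_to(template, image.shape)
    out = np.zeros(image.shape)
    for dy in range(image.shape[0]):
        for dx in range(image.shape[1]):
            # Shift the image so that offset (dy, dx) lines up with the template.
            shifted = np.roll(image, shift=(-dy, -dx), axis=(0, 1))
            out[dy, dx] = np.sum(shifted * padded)
    return out


def cross_correlate_fft(image, template):
    """Same result via the correlation theorem: multiply by the conjugate spectrum."""
    padded = pad_to(template, image.shape)
    spectrum = np.fft.fft2(image) * np.conj(np.fft.fft2(padded))
    return np.real(np.fft.ifft2(spectrum))


# Synthetic example: a small bright pattern hidden in an otherwise black image.
template = np.array([[1.0, 2.0, 1.0],
                     [2.0, 4.0, 2.0],
                     [1.0, 2.0, 1.0]])
image = np.zeros((16, 16))
image[5:8, 7:10] = template

direct = cross_correlate_direct(image, template)
fast = cross_correlate_fft(image, template)

# The brightest spot in the response marks where the template was found.
peak = tuple(int(i) for i in np.unravel_index(np.argmax(fast), fast.shape))
print(peak)
```

Both functions produce the same response map, with its peak at the top-left corner of the hidden pattern. A real tracker would usually also normalise the response so that bright but unrelated regions do not outscore a true match; that step is left out here to keep the comparison simple.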