Lance C. Pérez
Date of this Version
The goal of Multiple Object Tracking (MOT) is to locate multiple objects and keep track of their individual identities and trajectories given a sequence of (video) frames. A popular approach to MOT is tracking by detection consisting of two processing components: detection (identification of objects of interest in individual frames) and data association (connecting data from multiple frames). This work addresses the detection component by introducing a method based on semantic instance segmentation, i.e., assigning labels to all visible pixels such that they are unique among different instances. Modern tracking methods often built around Convolutional Neural Networks (CNNs) and additional, explicitly-defined post-processing steps.
This work introduces two detection methods that incorporate multi-dimensional embeddings. We train deep CNNs to produce easily-clusterable embeddings for semantic instance segmentation and to enable object detection through pose estimation. The use of embeddings allows the method to identify per-pixel instance membership for both tasks.
Our method specifically targets applications that require long-term tracking of homogeneous targets using a stationary camera. Furthermore, this method was developed and evaluated on a livestock tracking application which presents exceptional challenges that generalized tracking methods are not equipped to solve. This is largely because contemporary datasets for multiple object tracking lack properties that are specific to livestock environments. These include a high degree of visual similarity between targets, complex physical interactions, long-term inter-object occlusions, and a fixed-cardinality set of targets.
For the reasons stated above, our method is developed and tested with the livestock application in mind and, specifically, group-housed pigs are evaluated in this work. Our method reliably detects pigs in a group housed environment based on the publicly available dataset with 99% precision and 95% using pose estimation and achieves 80% accuracy when using semantic instance segmentation at 50% IoU threshold.
Results demonstrate our method's ability to achieve consistent identification and tracking of group-housed livestock, even in cases where the targets are occluded and despite the fact that they lack uniquely identifying features. The pixel-level embeddings used by the proposed method are thoroughly evaluated in order to demonstrate their properties and behaviors when applied to real data.
Adivser: Lance C. Pérez