Michael Karnes
Department of Civil,
Environmental and Geodetic Engineering
The Ohio State University
Columbus, Ohio 43210
Email: karnes.30@osu.edu

Alper Yilmaz
Department of Civil,
Environmental and Geodetic Engineering
The Ohio State University
Columbus, Ohio 43210
Email: yilmaz.15@osu.edu

arXiv:2202.03695v1 [cs.CV] 8 Feb 2022

Abstract—Feature extraction has always been a critical component of the computer vision field. More recently, state-of-the-art computer vision algorithms have incorporated Deep Neural Networks (DNN) in feature extracting roles, creating Deep Convolutional Activation Features (DeCAF). The transferability of DNN knowledge domains has enabled the wide use of pretrained DNN feature extraction for applications with novel object classes, especially those with limited training data. This study analyzes the general discriminability of novel object visual appearances encoded into the DeCAF space of six of the leading visual recognition DNN architectures. The results of this study characterize the Mahalanobis distances and cosine similarities between DeCAF object manifolds across two visual object tracking benchmark data sets. The background surrounding each object is also included as an object class in the manifold analysis, providing a wider range of novel classes. This study found that different network architectures lead to different network feature focuses that must be considered in the network selection process. These results are generated from the VOT2015 and UAV123 benchmark data sets; however, the proposed methods can be applied to efficiently compare estimated network performance characteristics for any labeled visual data set.
I. INTRODUCTION
Deep neural networks (DNN) provide flexible function structures for modeling high dimensional patterns, making them highly effective for image processing applications [1], [2], [3], [4]. Since their origination with LeNet [5], DNN have been viewed as nested feature embedding functions that condense the high dimensional image space into a lower-dimensional deep convolutional activation feature (DeCAF) space [6], [7], on which the final classification decision is made.
y = f_n( ... f_3(f_2(f_1(x))) ... )
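This nested-embedding view can be sketched in a few lines. The layer sizes and random weights below are illustrative stand-ins, not any published architecture; the point is only that the DeCAF vector is the output of the composed layer stack.

```python
import numpy as np

def compose(*layers):
    """Return the composition f_n(...f_2(f_1(x))) of the given layers."""
    def embed(x):
        for f in layers:
            x = f(x)
        return x
    return embed

rng = np.random.default_rng(0)
# Hypothetical fixed weight matrices standing in for trained layers.
W1, W2, W3 = (rng.standard_normal(s) for s in [(64, 128), (16, 64), (4, 16)])

f1 = lambda x: np.maximum(W1 @ x, 0)   # 128 -> 64, ReLU
f2 = lambda x: np.maximum(W2 @ x, 0)   # 64  -> 16, ReLU
f3 = lambda x: W3 @ x                  # 16  -> 4: the low-dimensional embedding

network = compose(f1, f2, f3)
decaf = network(rng.standard_normal(128))  # flattened "image" -> 4-D feature
```

In a real pipeline the composed layers would come from a pretrained recognition network, with the classifier head removed so the intermediate activations serve as the feature vector.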
Other feature extraction methods, such as HoG, SIFT, and ORB, are also used for a similar information condensation process, representing image regions as descriptors in the feature space [8], [9], [10], [11], [12], [13]. First developed for keypoint matching in visual mapping [14], these descriptors have since spread to a variety of visual tasks including image classification [15], object recognition [16], [17], and object localization [18]. There have been continual developments in feature design to improve computational efficiency, view invariance, and object discriminability [19].
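The keypoint-matching use of such descriptors reduces to nearest-neighbour search in the feature space. A minimal sketch, assuming only that descriptors are fixed-length vectors (the 32-dimensional descriptors below are synthetic, not actual ORB or SIFT output):

```python
import numpy as np

def match_descriptors(desc_a, desc_b):
    """Nearest-neighbour matching of descriptors by Euclidean distance:
    for each row of desc_a, return the index of the closest row of desc_b."""
    dists = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    return dists.argmin(axis=1)

rng = np.random.default_rng(2)
desc_b = rng.standard_normal((10, 32))        # descriptors from image B
perm = rng.permutation(10)
# Image A sees the same keypoints, shuffled and slightly perturbed.
desc_a = desc_b[perm] + 0.01 * rng.standard_normal((10, 32))

matches = match_descriptors(desc_a, desc_b)   # recovers the permutation
```

Production matchers add a ratio test or cross-check to reject ambiguous matches, but the core operation is this distance minimization.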

The primary advantage of using pretrained DNN is the reduction in training requirements. This approach assumes that the novel task has a knowledge domain similar to the trained task, as seen in incremental learning, network fine tuning, transfer learning, and feature encoding [20], [21]. Without loss of
generality, our work focuses on feature encoding for long-term
tracking scenarios. Long-term tracking is a strong choice for
studying the DNN activation manifolds of novel targets for two
reasons. The first is the availability of trusted benchmark data
sets of novel objects. The second is the object's variable appearance through a sequence. On local time scales, such as 10 frames, the changes in the appearance of the object are minor. As the sequence progresses, the range of object appearances increases, providing samples of the object in different positions and from different camera perspectives.

The DNN ability to generate highly descriptive complex
filters affords DeCAF encoding a distinct advantage. This
work aims to characterize the encoded spaces of the most
prevalent image recognition network architectures and provide
a methodology for quantitatively measuring the relative difficulty in learning a set of novel targets. Better understanding of
DeCAF behaviors enables more informed network selections.
We developed this DeCAF characterization methodology to
help our team select the best network for DeCAF encoding for
a particular data set. This is precisely where our contributions
are aimed. In this study:
(1) We propose a novel methodology for characterizing the object manifold discriminability across any annotated custom data set.
(2) We present the first generalized DeCAF manifold discriminability characterization. This comparative DeCAF survey analyzed the DeCAF manifolds of 294 novel classes over six top image recognition network architectures.
(3) We provide novel results that demonstrate the differences in learned features between network architectures.
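The manifold comparison described above rests on two standard measures: the Mahalanobis distance of a point from a class manifold and the cosine similarity between manifold directions. A minimal sketch with synthetic DeCAF vectors (the two Gaussian "classes" below are placeholders for real per-object activation sets):

```python
import numpy as np

def mahalanobis(u, mean, cov_inv):
    """Mahalanobis distance of vector u from a manifold with the given
    mean and inverse covariance."""
    d = u - mean
    return float(np.sqrt(d @ cov_inv @ d))

def cosine_similarity(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(1)
# Synthetic DeCAF samples for two well-separated novel classes (n x d).
class_a = rng.standard_normal((200, 8)) + 2.0
class_b = rng.standard_normal((200, 8)) - 2.0

mean_a = class_a.mean(axis=0)
mean_b = class_b.mean(axis=0)
cov_inv_a = np.linalg.inv(np.cov(class_a, rowvar=False))

d_ab = mahalanobis(mean_b, mean_a, cov_inv_a)   # B's centroid vs. A's manifold
sim = cosine_similarity(mean_a, mean_b)          # near -1 for opposed centroids
```

Larger Mahalanobis distances and lower cosine similarities between class manifolds indicate a DeCAF space in which the novel classes are easier to discriminate, which is the quantity the proposed methodology compares across network architectures.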
