Instance Annotation for Multi-Instance Multi-Label Learning
FORREST BRIGGS, XIAOLI Z. FERN, RAVIV RAICH, and QI LOU, Oregon State University
Multi-instance multi-label learning (MIML) is a framework for supervised classification where the objects to be classified are bags of instances associated with multiple labels. For example, an image can be represented as a bag of segments and associated with a list of objects it contains. Prior work on MIML has focused on predicting label sets for previously unseen bags. We instead consider the problem of predicting instance labels while learning from data labeled only at the bag level. We propose a regularized rank-loss objective designed for instance annotation, which can be instantiated with different aggregation models connecting instance-level labels with bag-level label sets. The aggregation models that we consider can be factored as a linear function of a "support instance" for each class, which is a single feature vector representing a whole bag. Hence we name our proposed methods rank-loss Support Instance Machines (SIM). We propose two optimization methods for the rank-loss objective, which is nonconvex. One is a heuristic method that alternates between updating support instances and solving a convex problem in which the support instances are treated as constant. The other is to apply the constrained concave-convex procedure (CCCP), which can also be interpreted as iteratively updating support instances and solving a convex problem. To solve the convex problem, we employ the Pegasos framework of primal subgradient descent, and prove that it finds an $\epsilon$-suboptimal solution in runtime that is linear in the number of bags, instances, and $\frac{1}{\epsilon}$. Additionally, we suggest a method of extending the linear learning algorithm to nonlinear classification, without increasing the runtime asymptotically. Experiments on artificial and real-world datasets including images and audio show that the proposed methods achieve higher accuracy than other loss functions used in prior work, e.g., Hamming loss, and recent work in ambiguous label classification.
Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval; I.5.2 [Pattern Recognition]: Design Methodology—Classifier design and evaluation
General Terms: Algorithms, Performance, Experimentation
Additional Key Words and Phrases: Instance annotation, image annotation, multi-instance, multi-label, support vector machine, subgradient, bioacoustics
ACM Reference Format:
Briggs, F., Fern, X. Z., Raich, R., and Lou, Q. 2013. Instance annotation for multi-instance multi-label learning. ACM Trans. Knowl. Discov. Data 7, 3, Article 14 (September 2013), 30 pages.
DOI:http://dx.doi.org/10.1145/2500491
1. INTRODUCTION
Many problems in supervised classification have a certain structure, where the objects of interest (e.g., images or text documents) can naturally be decomposed into a collection of parts called a bag-of-instances representation. For example, in image classification, an image is typically a bag, and the pixels or segments in it are instances. This structure motivates multiple-instance learning (MIL) [Dietterich et al. 1997]. The
A preliminary version of this work appeared in Briggs et al. [2010].
This work is partially funded by the Ecosystems Informatics IGERT program via NSF grant DGE 0333257,
NSF grant 1055113 to Xiaoli Z. Fern, and the College of Engineering, Oregon State University.
Authors' address: F. Briggs, X. Z. Fern, R. Raich, and Q. Lou, Oregon State University, School of EECS, Corvallis, Oregon; email: fbriggs@gmail.com.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or permissions@acm.org.
© 2013 ACM 1556-4681/2013/09-ART14 $15.00
DOI:http://dx.doi.org/10.1145/2500491
ACM Transactions on Knowledge Discovery from Data, Vol. 7, No. 3, Article 14, Publication date: September 2013.
original formulation of MIL concerns problems where bags are associated with a single
binary label. Zhou and Zhang [2007] introduced multi-instance multi-label learning
(MIML), where bags are instead associated with a set of labels. For example, an image
might be associated with a list of the objects it contains.
MIML arises in situations where the cost of labeling individual instances becomes
prohibitive and consequently multiple instances are grouped and associated with a
set of labels. For example, it is much faster to label an image with a few words than
to individually label pixels. In MIML, the training dataset consists of a collection of
bags of instances, where each bag is associated with multiple labels. The standard
goal in MIML is to learn a classifier that predicts the label set for a previously unseen
bag. Numerous algorithms for MIML have been proposed and applied to image, text
[Li et al. 2009; Shen et al. 2009; Zha et al. 2008; Zhou and Zhang 2007; Zhou et al.
2012], and video [Xu et al. 2011] domains. Recently, Wang and Zhou [2012] analyzed
PAC-learnability in MIML.
Learning to predict instance labels from MIML training data is a useful problem
that has received little study [Zhou 2004]. For example, one might train a classifier
on a collection of images paired with lists of object names in each image, then make
predictions about the label for each region in an image. This problem is called the
instance annotation problem for MIML. The key issue in instance annotation is how to learn an instance-level classifier from a MIML dataset, which presents only bag-level labels.
A common strategy in designing MIML algorithms for bag-level prediction is to learn an instance-level model by minimizing a loss function defined at the bag level. For example, several previous bag-level MIML algorithms minimize bag-level Hamming loss, which captures the disagreement between a ground-truth label set and the predicted label set. Practically speaking, these instance-level models could be directly used for instance-level prediction. However, the loss functions (e.g., Hamming loss) used by such bag-level MIML algorithms are designed to optimize the prediction performance at the bag level and can be inappropriate for the purpose of making instance-level predictions.
For instance annotation, typically the instance-level classifier outputs a score for each class, and the instance label is predicted as the highest-scoring class. Therefore the predicted label depends on the ranking of class scores. Hamming loss is not appropriate in this context, despite its success at bag-level prediction, because it lacks a mechanism to calibrate the scores between different classes. This observation motivates us to introduce a rank-loss objective for instance annotation, which directly optimizes the ranking of classes.
In order to learn an instance-level classifier using a bag-level loss function, it is necessary to define an aggregation model that connects instance-level predictions with bag-level predictions. In this article, we examine two different aggregation models, each of which is equivalent to defining a "support instance" for each bag. Therefore, we name our methods Support Instance Machines (SIM).
In this article, we make the following contributions.
—We propose a regularized rank-loss objective for instance annotation that can be instantiated with different aggregation models (Section 4.1).
— The rank-loss objective is nonconvex. To address this challenge, we offer two optimization methods that are effective in practice. One is a heuristic that linearizes the aggregation model, and the other casts the objective as a difference of convex functions and monotonically decreases the objective using the constrained concave-convex procedure (CCCP) [Yuille and Rangarajan 2002].
—In either optimization method, the core of the algorithm is to alternate between updating support instances and solving a convex problem using the Pegasos framework for primal subgradient descent. We prove that the convex optimization has linear runtime in the number of bags, instances, and $\frac{1}{\epsilon}$ to find an $\epsilon$-suboptimal solution (Section 4.3.4).
— Experiments show that SIM with rank loss achieves higher accuracy than Hamming loss or ambiguous loss (a comparable state-of-the-art approach [Cour et al. 2009, 2011]). A novel softmax aggregation model generally achieves higher accuracy than the max model, which has been used in prior multi-instance or MIML algorithms, and CCCP improves accuracy over the heuristic with the same aggregation model (Section 5.2.4).
— We suggest a method of extending our proposed classifiers from linear to nonlinear
using a fast approximate kernel trick [Rahimi and Recht 2007]. Experiments show
that this method often improves accuracy significantly, while still achieving linear
runtime in the number of bags and instances (Section 4.4).
— We introduce a real-world MIML dataset for instance annotation derived from over 90 minutes of bird song recordings collected in the field, containing multiple simultaneously vocalizing birds of different species. Several SIM algorithms achieve over 80% accuracy in predicting the species of bird responsible for each sound in the recordings, given the list of species present in the recording (Section 5.1.1).
2. PROBLEM STATEMENT
In this section we formalize the instance annotation problem and contrast it with several related problems. Table I summarizes notation.
We are given a training set of $n$ bags $(X_1, Y_1), \ldots, (X_n, Y_n)$. Each $X_i$ is a bag of $n_i$ instances, i.e., $X_i = \{x_{i1}, \cdots, x_{in_i}\}$, with $x_{iq} \in \mathcal{X}$, where $\mathcal{X} = \mathbb{R}^d$ is a $d$-dimensional feature space. Each bag $X_i$ is associated with a label set $Y_i \subseteq \mathcal{Y}$, where $\mathcal{Y} = \{1, \cdots, c\}$ and $c$ is the total number of classes. We will assume that each instance $x_{iq}$ has a hidden label $y_{iq} \in \mathcal{Y}$ and that the bag label set is equal to the union of the instance labels, i.e., $Y_i = \bigcup_{q=1,\ldots,n_i} y_{iq}$. The goal of instance annotation in MIML is to learn an instance-level classifier $f_{IA} : \mathcal{X} \to \mathcal{Y}$ that maps an element of the input space $\mathcal{X}$ to its corresponding class label.

We focus on classifiers for instance annotation that use one instance-level model $f_j(x) = w_j \cdot x$ for each class $j$, and predict a specific label via $f(x) = \arg\max_j f_j(x)$. The goal is to learn the weights $W = [w_1, \ldots, w_c]$ (note $x \in \mathbb{R}^d$, $w_j \in \mathbb{R}^d$, and $W \in \mathbb{R}^{cd}$).
The instance annotation problem is different from the traditional MIML learning problem studied by Zhou and Zhang [2007] and many others, where the goal is to learn a bag-level classifier $F_{MIML} : 2^{\mathcal{X}} \to 2^{\mathcal{Y}}$. Nonetheless, the design principles of many traditional MIML algorithms can be used to learn instance-level classification models, which is the approach we take in this article.
Instance annotation is closely related to the classic supervised classification problem, where the goal is to learn an instance classifier, but using training data that does not have a bag structure and has each instance labeled individually. In contrast to MIML, this classic setting is often referred to as single-instance single-label (SISL) learning.
Ambiguous label classification (ALC) [Cour et al. 2011; Hüllermeier and Beringer 2006] is another related framework. In ALC, there are no bags; instead, instances are paired with a set of possible labels, only one of which is correct. An ALC dataset is $(x_1, Y_1), \ldots, (x_m, Y_m)$, where $x_i \in \mathcal{X}$ and $Y_i \subseteq \mathcal{Y}$. In ALC the goal is to learn a classifier that predicts instance labels, hence an ALC classifier is a function $f_{ALC} : \mathcal{X} \to \mathcal{Y}$. A MIML instance annotation problem can be transformed into an ALC problem by
Table I. Summary of Notation

$n$ : number of bags
$n_i$ : number of instances in bag $i$
$m$ : number of instances in all bags, $m = \sum_i n_i$
$c$ : number of classes
$x_{iq}$ : instance $q$ in bag $i$, a feature vector in $\mathbb{R}^d$
$X_i$ : bag $i$, a set of instances
$Y_i$ : label set for bag $i$, a subset of $\{1, \ldots, c\}$
$\bar{Y}_i$ : complement of $Y_i$
$Y_{ij}$ : $+1$ if $j \in Y_i$ and $-1$ otherwise
$f_j(x)$ : instance-level model for class $j$
$F_j(X)$ : bag-level model for class $j$
$w_j$ : instance-level weights for class $j$
$W$ : concatenation of $w_1, \ldots, w_c$ as a vector
$W^{(t)}$ : weights at the start of iteration $t$ of CCCP or the heuristic
$W^{(t,\tau)}$ : weights at iteration $\tau$ of subgradient descent, iteration $t$ of CCCP or the heuristic
$\hat{x}_{ij}(W)$ : support instance for bag $i$, class $j$, as a function of $W$
$\hat{x}^{(t)}_{ij}$ : support instance for bag $i$, class $j$, at iteration $t$ of CCCP or the heuristic
$\hat{x}^{(t,\tau)}_{ij}$ : support instance for bag $i$, class $j$, at iteration $\tau$ of subgradient descent, iteration $t$ of CCCP
$\alpha_i$ : $\frac{1}{n\,|Y_i|\,|\bar{Y}_i|}$
$\sum_{ijk}$ : $\sum_{i=1}^{n} \sum_{j \in Y_i} \sum_{k \notin Y_i}$
creating one ALC instance for each instance in a MIML bag, paired with all of the labels from the bag. Hence ALC algorithms can be applied to MIML instance annotation problems. However, this reduction may discard useful bag-level structure in the MIML data. The bag-level structure in MIML implies that each label in a bag label set must be explained by at least one of the instances in that bag, whereas this constraint is lost in the reduction to ALC.
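As a concrete illustration of this reduction, the following sketch converts a MIML dataset into ALC examples. The function and variable names (`miml_to_alc`, `bags`, `label_sets`) are illustrative assumptions, not from the article:

```python
# Sketch of the MIML-to-ALC reduction described above: each MIML bag
# (X_i, Y_i) yields one ALC example (x, Y_i) per instance x in X_i.
# Names are illustrative assumptions, not from the article.

def miml_to_alc(bags, label_sets):
    """bags: list of lists of feature vectors; label_sets: list of label sets."""
    alc_examples = []
    for instances, labels in zip(bags, label_sets):
        for x in instances:
            # Every instance inherits the full (ambiguous) bag label set.
            alc_examples.append((x, set(labels)))
    return alc_examples

# Example: two bags with label sets {1, 2} and {3}.
bags = [[[0.0, 1.0], [1.0, 0.0]], [[0.5, 0.5]]]
label_sets = [{1, 2}, {3}]
alc = miml_to_alc(bags, label_sets)
# Three ALC examples result; the bag-level constraint (each label explained
# by at least one instance in its bag) is lost, as noted above.
```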
3. BACKGROUND
Here we discuss some design patterns that contribute to our proposed methods.
3.1. Connecting Instance Labels with Bag Labels
One common approach in MIML algorithms is to make bag-level predictions based on the outputs of instance-level models. The connection from instance labels to bag labels is made via the assumption that the bag label set is the union of instance labels. This assumption is used in several MIML algorithms including M3MIML [Zhang and Zhou 2008] and D-MimlSvm [Zhou et al. 2012]. The following formulation approximates this assumption. Let $f_j(x) : \mathcal{X} \to \mathbb{R}$ be a function that takes an instance and returns a real-valued score for class $j$. The output at the bag level for class $j$ is defined to be $F_j(X) = \max_{x \in X} f_j(x)$. A bag-level classifier can be obtained by applying a threshold (e.g., 0) to the bag-level scores, i.e., $F(X) = \{y \in \mathcal{Y} : F_y(X) > 0\}$. Note that if an instance $x^*$ within bag $X$ is predicted to belong to class $j$, i.e., $f_j(x^*) > 0$, the predicted label set for bag $X$ will necessarily contain $j$ because $F_j(X) = \max_{x \in X} f_j(x) \ge f_j(x^*) > 0$. Hence, connecting instance- and bag-level scores via the max function is related to defining the bag label set as the union of the instance labels.
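The max connection above can be sketched in a few lines, assuming a linear instance model $f_j(x) = w_j \cdot x$; the names `bag_scores_max` and `bag_label_set` are illustrative, not from the article:

```python
import numpy as np

# Sketch of the max aggregation from Section 3.1 with a linear instance
# model f_j(x) = w_j . x; names and shapes are assumptions.

def bag_scores_max(W, bag):
    """W: (c, d) weight matrix; bag: (n_i, d) array of instances.
    Returns F_j(X) = max over instances of f_j(x), one score per class."""
    scores = bag @ W.T            # (n_i, c) instance-level scores
    return scores.max(axis=0)     # bag-level score per class

def bag_label_set(W, bag, threshold=0.0):
    """Predicted label set: classes whose bag-level score exceeds the threshold."""
    F = bag_scores_max(W, bag)
    return {j for j in range(len(F)) if F[j] > threshold}

W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])  # c = 3 classes, d = 2
bag = np.array([[2.0, -1.0], [0.0, 3.0]])             # n_i = 2 instances
# Class 0 is supported by instance 0, class 1 by instance 1, class 2 by neither,
# so the predicted bag label set is {0, 1}.
```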
3.2. Bag-Level Loss Functions
In contrast with SISL or instance annotation, which are evaluated based on instance-level accuracy, existing MIML algorithms are typically evaluated based on their label set predictions. Two common performance measures are Hamming loss and rank loss [Zhou et al. 2012]. Hamming loss is the number of false positives and false negatives, averaged over all classes and bags,

\[ \frac{1}{nc} \sum_{i=1}^{n} \sum_{j=1}^{c} I[j \in F(X_i), j \notin Y_i] + I[j \notin F(X_i), j \in Y_i]. \tag{1} \]
Rank loss captures the number of label pairs that are incorrectly ordered by the scores of the MIML classifier. For a given bag, classes in its true label set should receive higher scores than classes that are not. Let $\bar{Y}$ denote the complement of $Y$. Rank loss is defined as

\[ \frac{1}{n} \sum_{i=1}^{n} \frac{1}{|Y_i|\,|\bar{Y}_i|} \sum_{j \in Y_i,\, k \in \bar{Y}_i} I[F_j(X_i) \le F_k(X_i)]. \tag{2} \]
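Rank loss (2) translates directly into code. This sketch assumes precomputed bag-level scores and label sets; the names `F`, `Y`, and `rank_loss` are illustrative:

```python
import numpy as np

# Direct transcription of rank loss (2), assuming bag-level scores F[i]
# (length-c arrays) and label sets Y[i] are already computed.

def rank_loss(F, Y, c):
    total = 0.0
    n = len(F)
    for i in range(n):
        pos = Y[i]
        neg = set(range(c)) - pos
        # Count mis-ordered (true label, non-label) pairs for this bag.
        miss = sum(1 for j in pos for k in neg if F[i][j] <= F[i][k])
        total += miss / (len(pos) * len(neg))
    return total / n

# Example: c = 3, one bag with Y = {0} and scores [0.2, 0.5, -1.0]:
# pair (j=0, k=1) is mis-ordered, pair (j=0, k=2) is not.
F = [np.array([0.2, 0.5, -1.0])]
Y = [{0}]
print(rank_loss(F, Y, 3))  # 0.5
```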
These objectives are difficult to optimize directly because they are not continuous. Several prior algorithms for MIML can be viewed as optimizing a surrogate for Hamming loss. For example, D-MimlSvm [Zhou et al. 2012] and M3MIML [Zhang and Zhou 2008] optimize variations of the following loss function (with different regularization terms):

\[ \frac{1}{nc} \sum_{i=1}^{n} \sum_{j=1}^{c} \max\{0, 1 - Y_{ij} F_j(X_i)\}, \tag{3} \]

where $Y_{ij} = +1$ if $j \in Y_i$ and $-1$ if $j \notin Y_i$.
The hinged Hamming-loss objective (3) can be decomposed into a collection of independent MIL problems, one for each class. As such, it does not calibrate the scores between classes, which could make predicting an instance label based on the highest-scoring class unreliable. To overcome this limitation, we consider rank loss instead. Rank loss has been used as an objective for single-instance multi-label SVMs [Elisseeff and Weston 2001]. However, we are not aware of any MIML algorithms that learn an instance-level model by minimizing rank loss (rank loss has only been used as a performance measure in MIML, not as an objective).
4. PROPOSED METHODS
Our proposed methods are based on the observation that the predicted instance label depends on the ranking of scores for each class. Hence, we propose a rank-loss objective that optimizes this ranking. The rank-loss objective can be instantiated with different aggregation models that connect instance labels with bag label sets. We consider aggregation models that can be factored as a linear function of support instances, which are feature vectors that summarize a bag. We propose two optimization methods for the rank-loss objective, which is nonconvex. One is a heuristic, and the other is based on CCCP; however, both exploit the support instance structure in the optimization problem.
4.1. Rank-Loss Objective
Because we are learning from an MIML dataset, we can only measure loss at the bag level. A bag-level loss function measures the agreement between a bag label set and
the bag-level scores $F_j(X_i)$. We propose the following bag-level loss function, which is a regularized surrogate for rank loss (2):

\[ h_{RL}(W) = \frac{\lambda}{2} \|W\|^2 + \sum_{i=1}^{n} \sum_{j \in Y_i} \sum_{k \notin Y_i} \frac{1}{n\,|Y_i|\,|\bar{Y}_i|} \max\{0, 1 - (F_j(X_i) - F_k(X_i))\}. \tag{4} \]
To shorten our notation, let $\alpha_i = \frac{1}{n\,|Y_i|\,|\bar{Y}_i|}$ and $\sum_{ijk} \equiv \sum_{i=1}^{n} \sum_{j \in Y_i} \sum_{k \notin Y_i}$. Then we can rewrite the objective as

\[ h_{RL}(W) = \frac{\lambda}{2} \|W\|^2 + \sum_{ijk} \alpha_i \max\{0, 1 - (F_j(X_i) - F_k(X_i))\}. \tag{5} \]
This objective is designed to encourage a correct ranking of the bag-level scores for each class. For a bag $X_i$ with corresponding label set $Y_i$, if $j \in Y_i$ and $k \in \bar{Y}_i$, then the loss is zero only if $F_j(X_i) \ge F_k(X_i) + 1$ (requiring a difference of at least 1 promotes a large-margin solution). The objective is also designed to facilitate primal subgradient descent and CCCP optimization methods, which we discuss in Section 4.3.
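The surrogate objective (5) can be evaluated as follows for any aggregation model. This is an illustrative sketch, not the article's implementation; the names `surrogate_rank_loss` and `F_max` are assumptions:

```python
import numpy as np

# Sketch of the regularized rank-loss surrogate (5). F is any aggregation
# model mapping (W, bag) -> length-c bag scores; names are assumptions.

def surrogate_rank_loss(W, bags, Y, F, lam):
    c = W.shape[0]
    n = len(bags)
    obj = 0.5 * lam * np.sum(W * W)          # (lambda / 2) ||W||^2
    for i in range(n):
        scores = F(W, bags[i])
        pos = Y[i]
        neg = set(range(c)) - pos
        alpha_i = 1.0 / (n * len(pos) * len(neg))
        for j in pos:
            for k in neg:
                # Hinge on the margin between positive and negative classes.
                obj += alpha_i * max(0.0, 1.0 - (scores[j] - scores[k]))
    return obj

def F_max(W, bag):
    """The max aggregation model (6)."""
    return (bag @ W.T).max(axis=0)

W = np.zeros((3, 2))
bag = np.array([[2.0, -1.0], [0.0, 3.0]])
print(surrogate_rank_loss(W, [bag], [{0}], F_max, lam=1.0))  # 1.0 at W = 0
```

At $W = 0$ every hinge term equals 1 and the weighted pair terms sum to 1, matching the normalization by $\alpha_i$; the hinge vanishes only once each positive class outscores each negative class by a margin of 1, as discussed above.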
4.2. Aggregation Models and Support Instances
Objective (4) can be instantiated with various aggregation models $F_j(\cdot)$ that compute bag-level scores from instance-level scores. For example, the max model, which has been used in prior work [Zhang and Zhou 2008; Zhou et al. 2012], with a linear instance classifier is

\[ F_j(X_i) = \max_{x_{iq} \in X_i} f_j(x_{iq}) = \max_{x_{iq} \in X_i} w_j \cdot x_{iq}. \tag{6} \]

It is equivalent to write $F_j(X_i) = w_j \cdot \hat{x}_{ij}(W)$, where

\[ \hat{x}_{ij}(W) = \arg\max_{x_{iq} \in X_i} w_j \cdot x_{iq}. \tag{7} \]
We refer to $\hat{x}_{ij}(W)$ as the support instance for bag $X_i$, class $j$, because the bag-level output for $X_i$ depends only on the support instance for each class (analogous to a support vector).
The max model represents each bag with the most characteristic instance of each class. This approach can ignore other instances that are also useful for learning, and may not be appropriate when the assumption that the bag label set is equal to the union of instance labels does not hold. We propose an alternative softmax model, which can also be expressed in terms of support instances, but has the advantage of basing the support instances on more than one instance per class for each bag. The softmax model represents each bag as a weighted average of the instances, with weights specific to each class:

\[ F_j(X_i) = \sum_{x_{iq} \in X_i} \beta^j_{iq} f_j(x_{iq}) = \sum_{x_{iq} \in X_i} \beta^j_{iq}\, w_j \cdot x_{iq}. \tag{8} \]

The weights are defined according to a softmax rule,

\[ \beta^j_{iq} = \frac{e^{w_j \cdot x_{iq}}}{\sum_{x' \in X_i} e^{w_j \cdot x'}}. \tag{9} \]
We can also write the softmax model as $F_j(X_i) = w_j \cdot \hat{x}_{ij}(W)$, with the support instances defined as

\[ \hat{x}_{ij}(W) = \sum_{x_{iq} \in X_i} \beta^j_{iq}\, x_{iq}. \tag{10} \]
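The softmax support instances can be computed as a small matrix operation. This sketch adds the standard max-subtraction trick for numerical stability, an implementation detail not discussed in the article; the name `softmax_support_instances` is an assumption:

```python
import numpy as np

# Sketch of the softmax aggregation model: support instances are per-class
# softmax-weighted averages of a bag's instances. Names are assumptions.

def softmax_support_instances(W, bag):
    """W: (c, d) weights; bag: (n_i, d) instances.
    Returns a (c, d) array of support instances, one per class."""
    scores = bag @ W.T                          # (n_i, c): w_j . x_iq
    s = scores - scores.max(axis=0)             # stabilize the exponentials
    beta = np.exp(s) / np.exp(s).sum(axis=0)    # (n_i, c) softmax weights
    return beta.T @ bag                         # (c, d) weighted averages

W = np.array([[1.0, 0.0], [0.0, 1.0]])
bag = np.array([[2.0, -1.0], [0.0, 3.0]])
x_hat = softmax_support_instances(W, bag)
# Bag-level score F_j(X) = w_j . x_hat_j, as in the factored form above.
F = np.einsum('cd,cd->c', W, x_hat)
```

Subtracting the per-class maximum before exponentiating leaves the weights unchanged but avoids overflow; the resulting bag-level scores are always at most the max-model scores.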
For either model, the rank-loss objective in terms of support instances is

\[ \hat{h}_{RL}(W) = \frac{\lambda}{2} \|W\|^2 + \sum_{ijk} \alpha_i \max\{0, 1 + w_k \cdot \hat{x}_{ik}(W) - w_j \cdot \hat{x}_{ij}(W)\}. \tag{11} \]
4.3. Optimization Methods for Rank-Loss Support Instance Machines
SIM could be instantiated with any aggregation model that fits the pattern of (11). However, depending on the aggregation model, different optimization techniques are required. Using the max and softmax models, $\hat{x}_{ij}(W)$ is a function of $w_j$, and the objective is nonconvex. We propose two optimization strategies to handle this nonconvex objective: (1) if $F_j(\cdot)$ is a convex function of $W$, we can apply CCCP, or (2) as a heuristic, treat $\hat{x}_{ij}$ as constant, which makes the objective convex.
4.3.1. Constrained Concave-Convex Procedure (CCCP). The bag-level surrogate loss functions that arise in multi-instance learning are often nonconvex, but can sometimes be manipulated into a difference-of-convex (DC) form, which allows the use of CCCP [Sriperumbudur and Lanckriet 2009; Yuille and Rangarajan 2002]. The advantage of CCCP is that it decreases the objective monotonically and converges to a local optimum or a saddle point. CCCP can be applied to optimization problems of the form

\[ \min_x \; f_0(x) - g_0(x) \tag{12} \]
\[ \text{s.t.} \;\; f_i(x) - g_i(x) \le 0, \quad i = 1, \ldots, k, \tag{13} \]

where all $f_i$ and $g_i$ are convex. CCCP solves a sequence of convex upper bounds to the original problem by constructing a linear upper bound of the concave part of the objective and constraints at the current point. Let $x^{(t)}$ be the current point; then the next point $x^{(t+1)}$ is obtained by solving the following convex problem:

\[ \min_x \; f_0(x) - \left[ g_0(x^{(t)}) + \nabla g_0(x^{(t)}) \cdot (x - x^{(t)}) \right] \tag{14} \]
\[ \text{s.t.} \;\; f_i(x) - \left[ g_i(x^{(t)}) + \nabla g_i(x^{(t)}) \cdot (x - x^{(t)}) \right] \le 0, \quad i = 1, \ldots, k. \tag{15} \]

If $g_i$ is nondifferentiable, a subgradient $u \in \partial g_i(x^{(t)})$ can be used in place of the gradient $\nabla g_i(x^{(t)})$. This is the case for our derivations using CCCP, as $g_i$ involves the max function.
4.3.2. CCCP for Rank-Loss Support Instance Machines. When the aggregation model $F_j(\cdot)$ is a convex function of $W$, CCCP can be applied to the rank-loss objective. The max model is convex, but the softmax model is not. The objective (11) is not a DC function, but we can still use CCCP with a simple transformation of the problem. The unconstrained problem (11) is equivalent to a constrained problem

\[ \min_{W, \xi} \; \frac{\lambda}{2}\|W\|^2 + \sum_{ijk} \alpha_i \xi_{ijk} \tag{16} \]
\[ \text{s.t.} \;\; F_j(X_i) - F_k(X_i) \ge 1 - \xi_{ijk}, \quad \text{for } i = 1, \ldots, n, \; j \in Y_i, \; k \notin Y_i, \tag{17} \]
\[ \xi_{ijk} \ge 0. \tag{18} \]
The constraint (17) is not convex, but when the aggregation model $F_j(\cdot)$ is convex, the constraint is DC. As an example of applying the CCCP framework to a convex aggregation model, we will use the max model. Substituting in the max model and rearranging, the constraint becomes

\[ \underbrace{1 - \xi_{ijk} + \max_{x \in X_i} w_k \cdot x}_{f_{ijk}} - \underbrace{\max_{x \in X_i} w_j \cdot x}_{g_{ijk}} \le 0. \tag{19} \]
We define the support instances at iteration $t$ of CCCP as

\[ \hat{x}^{(t)}_{ij} = \hat{x}_{ij}(W^{(t)}), \tag{20} \]

where $W^{(t)}$ are the weights at iteration $t$.
Following CCCP and constructing the linear upper bound for the concave part $-g_{ijk}$, we get

\[ 1 - \xi_{ijk} + \max_{x \in X_i} w_k \cdot x - w_j \cdot \hat{x}^{(t)}_{ij} \le 0. \tag{21} \]
Hence CCCP solves the following sequence of convex problems:

\[ \min_{W, \xi} \; \frac{\lambda}{2}\|W\|^2 + \sum_{ijk} \alpha_i \xi_{ijk} \tag{22} \]
\[ \text{s.t.} \;\; 1 - \xi_{ijk} + \max_{x \in X_i} w_k \cdot x - w_j \cdot \hat{x}^{(t)}_{ij} \le 0, \quad \text{for } i = 1, \ldots, n, \; j \in Y_i, \; k \notin Y_i, \tag{23} \]
\[ \xi_{ijk} \ge 0. \tag{24} \]
This problem could be further simplified to a convex QP, but we instead solve the equivalent unconstrained convex problem (because it is amenable to efficient primal subgradient descent):

\[ h^{(t)}_{RL,cccp}(W) = \frac{\lambda}{2}\|W\|^2 + \sum_{ijk} \alpha_i \max\left\{0, 1 + \max_{x \in X_i} w_k \cdot x - w_j \cdot \hat{x}^{(t)}_{ij}\right\}. \tag{25} \]
We discuss methods for solving (25) in Section 4.3.4. In summary, the trick we use to efficiently solve the original nonconvex rank-loss SIM objective with the max model (or potentially any convex model) is to convert it to an equivalent constrained problem, construct the CCCP upper bound, then convert the bound back to an unconstrained problem.
4.3.3. Heuristic for Nonconvex Aggregation Models. If $F_j(X_i)$ is not a convex function of $W$, then the constraint (17) does not naturally decompose into a DC function, and CCCP cannot be applied. This is the situation with the softmax model. In this case, we propose a heuristic of alternating between computing the support instances and solving a convex problem where the support instances are treated as constant. The heuristic is similar to CCCP, except the convex objective solved in each step changes to

\[ h^{(t)}_{RL,h}(W) = \frac{\lambda}{2}\|W\|^2 + \sum_{ijk} \alpha_i \max\left\{0, 1 + w_k \cdot \hat{x}^{(t)}_{ik} - w_j \cdot \hat{x}^{(t)}_{ij}\right\}. \tag{26} \]
In contrast with the CCCP algorithm, this heuristic is not proven to decrease objective (4) monotonically. The difference between the convex problems solved by CCCP and the heuristic is whether the aggregation model applied to bags with negative labels, $F_k(X_i)$, is linearized in terms of the support instance or not. The heuristic can also be applied when the aggregation model is convex.
ALGORITHM 1: Projected Subgradient with Pegasos Learning Rate
Input: Initial point $W^{(0)} \in S$, number of iterations $K$
for $\tau = 1, \ldots, K$ do
    Compute a subgradient $\nabla \in \partial h(W^{(\tau-1)})$
    $W^{(\tau)} \leftarrow P\!\left[W^{(\tau-1)} - \frac{1}{\lambda\tau} \nabla\right]$
end
return $W_{best} = \arg\min_{W^{(\tau)}} h(W^{(\tau)})$
4.3.4. Subgradient Descent. To solve the convex problem in each step of CCCP or the heuristic [(25) or (26)], we use a subgradient descent method similar to Pegasos, an algorithm for training linear two-class SVMs [Shalev-Shwartz et al. 2007]. The Pegasos algorithm is based on a general framework for optimizing regularized convex objectives [Shalev-Shwartz and Singer 2007]. This framework can be applied to convex optimization problems of the form

\[ \min_{W \in S} h(W) \quad \text{where} \quad h(W) = \frac{\lambda}{2}\|W\|^2 + \text{loss}(W). \]

For such problems, Algorithm 1 is efficient. Let a convex set $S$ be the feasible space; $P[W] = \arg\min_{W' \in S} \|W - W'\|^2$ is a projection back into the feasible space. Note that when $S$ is a ball of radius $r$, i.e., $S = \{W : \|W\| \le r\}$, the projection simplifies to $P[W] = \min\{1, \frac{r}{\|W\|}\}\, W$ (for reasons that become evident in the runtime analysis, we will introduce such a constraint into our optimization problems).
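A minimal sketch of one step of Algorithm 1 with the ball projection described above (function names are assumptions, not from the article):

```python
import numpy as np

# One projected-subgradient step with the Pegasos learning rate 1/(lambda * tau)
# and the closed-form projection onto the ball S = {W : ||W|| <= r}.
# Names are illustrative assumptions.

def project_to_ball(W, r):
    """P[W] = min{1, r / ||W||} W, the Euclidean projection onto the ball."""
    norm = np.linalg.norm(W)
    return W if norm <= r else (r / norm) * W

def pegasos_step(W, subgrad, lam, tau, r):
    """Step against the subgradient with rate 1/(lam * tau), then project."""
    return project_to_ball(W - subgrad / (lam * tau), r)

W = np.zeros(4)
g = np.array([10.0, 0.0, 0.0, 0.0])
# Step to [-10, 0, 0, 0], then project back to the unit ball: [-1, 0, 0, 0].
W1 = pegasos_step(W, g, lam=1.0, tau=1, r=1.0)
```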
Consider iteration $t$ of CCCP or the heuristic, where the convex objective is $h^{(t)}_{RL,cccp}(W)$ (25) or $h^{(t)}_{RL,h}(W)$ (26), respectively. We will show how to apply subgradient descent to these objectives. We will start by summarizing relevant notation and stating a subgradient of each objective.
First consider $h_{RL,h}$ as an example. Recall that the decision variable $W$ is the concatenation of components $w_q$ for each class $q = 1, \ldots, c$. We denote the component of the subgradient of $h^{(t)}_{RL,h}$ corresponding to $w_q$ as $v^{(t,\tau),q}_{RL,h}$, where the superscript $t$ indicates the outer iteration of the heuristic, $\tau$ indicates the inner iteration of subgradient descent, and $q$ indicates the class. The full subgradient is $\nabla^{(t,\tau)}_{RL,h} = [v^{(t,\tau),1}_{RL,h}, \ldots, v^{(t,\tau),c}_{RL,h}]$. The components of the subgradient of $h^{(t)}_{RL,cccp}$ are defined similarly, e.g., $v^{(t,\tau),q}_{RL,cccp}$.
For the heuristic, it is sufficient to compute the support instances $\hat{x}^{(t)}_{ij}$ once at the beginning of each outer iteration $t$; then at every inner iteration $\tau$ of subgradient descent, the subgradients depend only on $\hat{x}^{(t)}_{ij}$. The subgradient for the heuristic is

\[ v^{(t,\tau),q}_{RL,h} = \lambda w^{(t,\tau)}_q + \sum_{ijk} \alpha_i\, I\!\left[1 + w^{(t,\tau)}_k \cdot \hat{x}^{(t)}_{ik} - w^{(t,\tau)}_j \cdot \hat{x}^{(t)}_{ij} > 0\right] \begin{cases} \hat{x}^{(t)}_{iq} & \text{if } q = k \\ -\hat{x}^{(t)}_{iq} & \text{if } q = j \\ 0 & \text{otherwise} \end{cases}. \tag{27} \]
For CCCP, the subgradients depend on both $\hat{x}^{(t)}_{ij}$ and a different set of support instances that change with each inner iteration $\tau$ of subgradient descent. We introduce the notation

\[ \hat{x}^{(t,\tau)}_{ik} = \hat{x}_{ik}(W^{(t,\tau)}). \tag{28} \]
This notation indicates the support instance computed from the weights at iteration $\tau$ of subgradient descent within iteration $t$ of CCCP, in contrast with $\hat{x}^{(t)}_{ik}$, which indicates the support instance computed from the weights at the start of iteration $t$ of CCCP. The subgradient for CCCP at the inner iteration $\tau$ is

\[ v^{(t,\tau),q}_{RL,cccp} = \lambda w^{(t,\tau)}_q + \sum_{ijk} \alpha_i\, I\!\left[1 + w^{(t,\tau)}_k \cdot \hat{x}^{(t,\tau)}_{ik} - w^{(t,\tau)}_j \cdot \hat{x}^{(t)}_{ij} > 0\right] \begin{cases} \hat{x}^{(t,\tau)}_{iq} & \text{if } q = k \\ -\hat{x}^{(t)}_{iq} & \text{if } q = j \\ 0 & \text{otherwise} \end{cases}. \tag{29} \]
We have not discussed how to compute the subgradients efficiently, or explained why a ball constraint is introduced. These subjects fit naturally into a discussion of the runtime of subgradient descent.
Runtime of Subgradient Descent. Let $W^*$ be the solution, i.e., $W^* = \arg\min_{W \in S} h(W)$. Shalev-Shwartz and Singer [2007] showed the following convergence rate for Algorithm 1:

\[ \min_{\tau} h(W^{(\tau)}) \le h(W^*) + O\!\left(\frac{\log K}{K} \cdot \frac{L}{\lambda}\right), \]

where $L$ is a constant bounding the magnitude of the subgradient, i.e., $\forall \tau, \|\nabla\|^2 \le L$. For practical purposes, the number of iterations of subgradient descent $K$ is small enough to treat $\log K$ as constant. Hence, to obtain a solution that is within $\epsilon$ of optimal, it suffices to run $K \ge O(\frac{L}{\lambda\epsilon})$ iterations of the preceding algorithm [Shalev-Shwartz and Singer 2007].

To prove a rate of convergence for subgradient descent, we must establish $L$, the bound on the subgradient square magnitude. We begin with the following lemma.
LEMMA 4.1. Consider any objective of the form $h(W) = \frac{\lambda}{2}\|W\|^2 + \text{loss}(W)$, such that $\text{loss}(W) \ge 0$ and $\text{loss}(\mathbf{0}) = 1$. Let the solution be $W^* = \arg\min_W h(W)$. Then $\|W^*\|^2 \le \frac{2}{\lambda}$.
PROOF. The solution must be at least as good as $W = \mathbf{0}$, therefore $h(W^*) \le 1$. Furthermore, $\text{loss}(W^*) \le 1$ (assuming the contrary implies $h(W^*) > 1$, which is a contradiction). Because the loss is nonnegative, $0 \le \text{loss}(W^*) \le 1$. We will use this property to finish the proof:

\[ h(W^*) = \frac{\lambda}{2}\|W^*\|^2 + \text{loss}(W^*) \le 1 \;\Rightarrow\; \frac{\lambda}{2}\|W^*\|^2 \le 1 - \text{loss}(W^*) \le 1 \;\Rightarrow\; \|W^*\|^2 \le \frac{2}{\lambda}. \]
Now we will briefly derive a bound $L = c\left(\sqrt{2\lambda} + R\right)^2$. The $h^{(t)}_{RL,cccp}(W)$ and $h^{(t)}_{RL,h}(W)$ objectives satisfy the criteria of Lemma 4.1, so we can replace unconstrained minimization with minimization restricted to the set $S = \{W : \|W\|^2 \le \frac{2}{\lambda}\}$ without changing the solution. It is also necessary to bound the magnitude of an instance feature vector, $\|x\| \le R$. Note that $W \in S$ implies $\|\lambda w_q\| \le \lambda\sqrt{\frac{2}{\lambda}} = \sqrt{2\lambda}$. By the triangle inequality,
ALGORITHM 2: Subgradient $V^{(t,\tau)}_{RL,h}$
Input: $\hat{x}^{(t)}_{ij}$, $\mathbf{W}^{(t,\tau)}$, $\{(X_i, Y_i)\}_{i=1}^n$
for $q = 1, \ldots, c$ do
    $v_q \leftarrow \lambda \mathbf{w}^{(t,\tau)}_q$
end
for $i = 1, \ldots, n$; $j \in Y_i$, $k \in \bar{Y}_i$ do
    if $1 + \mathbf{w}^{(t,\tau)}_k \cdot \hat{x}^{(t)}_{ik} - \mathbf{w}^{(t,\tau)}_j \cdot \hat{x}^{(t)}_{ij} > 0$ then
        $v_j \leftarrow v_j - \beta_i \hat{x}^{(t)}_{ij}$
        $v_k \leftarrow v_k + \beta_i \hat{x}^{(t)}_{ik}$
    end
end
return $V^{(t,\tau)}_{RL,h} = [v_1, \ldots, v_c]$
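For concreteness, the double loop of Algorithm 2 can be sketched in NumPy. This is an illustrative re-implementation, not the authors' code; the per-bag weights `beta[i]` stand in for the rank-loss normalization constants (written $\beta_i$ above) and are assumed to be supplied by the caller.

```python
import numpy as np

def rank_loss_subgradient(W, Xhat, Y, beta, lam):
    """Subgradient of the regularized rank-loss objective (Algorithm 2 sketch).

    W    : (c, d) array, one weight vector per class
    Xhat : (n, c, d) array, support instance x_hat[i, j] for bag i, class j
    Y    : list of n label sets (each a set of class indices)
    beta : (n,) array of per-bag weights (assumed normalization constants)
    lam  : regularization parameter lambda
    """
    c = W.shape[0]
    V = lam * W.copy()                      # gradient of the L2 regularizer
    for i, Yi in enumerate(Y):
        neg = [k for k in range(c) if k not in Yi]
        for j in Yi:
            for k in neg:
                # hinge on the pairwise rank margin F_j(X_i) - F_k(X_i)
                if 1 + W[k] @ Xhat[i, k] - W[j] @ Xhat[i, j] > 0:
                    V[j] -= beta[i] * Xhat[i, j]
                    V[k] += beta[i] * Xhat[i, k]
    return V
```

The nested loop over $(j, k)$ pairs is what gives the $O(nc^2)$ cost per subgradient evaluation discussed below.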
ALGORITHM 3: Subgradient $V^{(t,\tau)}_{RL,cccp}$
Input: $\hat{x}^{(t)}_{ij}$, $\hat{x}^{(t,\tau)}_{ij}$, $\mathbf{W}^{(t,\tau)}$, $\{(X_i, Y_i)\}_{i=1}^n$
for $q = 1, \ldots, c$ do
    $v_q \leftarrow \lambda \mathbf{w}^{(t,\tau)}_q$
end
for $i = 1, \ldots, n$; $j \in Y_i$, $k \in \bar{Y}_i$ do
    if $1 + \mathbf{w}^{(t,\tau)}_k \cdot \hat{x}^{(t,\tau)}_{ik} - \mathbf{w}^{(t,\tau)}_j \cdot \hat{x}^{(t)}_{ij} > 0$ then
        $v_j \leftarrow v_j - \beta_i \hat{x}^{(t)}_{ij}$
        $v_k \leftarrow v_k + \beta_i \hat{x}^{(t,\tau)}_{ik}$
    end
end
return $V^{(t,\tau)}_{RL,cccp} = [v_1, \ldots, v_c]$
$\|v^{(t,\tau),q}_{RL,cccp}\| \le \sqrt{2\lambda} + R$. Summing over classes, $\|V^{(t,\tau)}_{RL,cccp}\|^2 = \sum_{q=1}^{c} \|v^{(t,\tau),q}_{RL,cccp}\|^2 \le c(\sqrt{2\lambda} + R)^2$.
With a more elaborate derivation, we obtain a tighter bound of $L = (\sqrt{2\lambda} + 2R)^2$ (see Appendix 1), which gives a shorter runtime. The bound $L$ is applicable to both the heuristic and CCCP.
To compute $V^{(t,\tau)}_{RL,h}$ or $V^{(t,\tau)}_{RL,cccp}$, one could compute each component according to Equation (27) or (29), but this takes $O(nc^3)$ time. Algorithms 2 and 3 compute the subgradients in $O(nc^2)$ time (assuming the support instances have already been calculated). Note that updating the support instances takes $O(m)$ time, where $m$ is the number of instances in all bags. For the heuristic method, the support instances are computed once at the beginning of each outer iteration, so this cost is not part of the runtime of a single iteration of subgradient descent. In contrast, the CCCP method must recompute the support instances in each iteration of subgradient descent, so the runtime of one iteration of subgradient descent is $O(m + nc^2)$. Running $T = O(\frac{L}{\lambda\epsilon})$ iterations of subgradient descent gives a runtime of $O(m + \frac{nc^2 R^2}{\epsilon\lambda})$ to find an $\epsilon$-suboptimal solution for the heuristic method, or $O(\frac{(m + nc^2) R^2}{\epsilon\lambda})$ for the CCCP method.
4.3.5. Summary of Differences Between CCCP and the Heuristic. Algorithms 4 and 5 list the heuristic and CCCP optimization methods for the rank-loss SIM objective. We refer to these algorithms as SIM-Heuristic and SIM-CCCP. A major difference between SIM-CCCP and SIM-Heuristic is that in SIM-CCCP, the support instances must be recomputed in each iteration of subgradient descent, and the subgradient depends on these different support instances. The heuristic also runs a constant number of iterations of subgradient descent, whereas the CCCP version runs enough to ensure monotonic decrease of the objective (see Appendix 2 for further discussion of monotonicity). The heuristic can be applied to either the max or softmax model; the CCCP algorithm can only be applied to the max model because softmax is non-convex.

Because the general rank-loss objective is non-convex, the starting weights may affect the optimum found by SIM-CCCP or SIM-Heuristic. We discuss starting weight construction in Appendix 2.
ALGORITHM 4: SIM-Heuristic with rank loss and max or softmax model
Input: $T$, $K$, $\lambda$, MIML dataset $\{(X_i, Y_i)\}_{i=1}^n$
for $t = 1, \ldots, T$ do
    if $t = 1$ then
        $\mathbf{W}^{(t)} = \mathbf{0}$
        $\forall (i,j): \hat{x}^{(t)}_{ij} = \frac{1}{n_i}\sum_{x_{iq} \in X_i} x_{iq}$
    else
        $\forall (i,j):$ compute $\hat{x}^{(t)}_{ij}$ using the max or softmax model
    end
    $\mathbf{W}^{(t,1)} = \mathbf{W}^{(t)}$
    for $\tau = 1, \ldots, K$ do
        Compute $V = V^{(t,\tau)}_{RL,h}$ with Algorithm 2
        $\mathbf{W}' = \mathbf{W}^{(t,\tau)} - \frac{1}{\lambda\tau} V$
        $\mathbf{W}^{(t,\tau+1)} = \min\{1, \sqrt{2/\lambda}\,/\,\|\mathbf{W}'\|\}\, \mathbf{W}'$
    end
    $\tau^* = \arg\min_{\tau} h^{(t)}_{RL,h}(\mathbf{W}^{(t,\tau)})$
    $\mathbf{W}^{(t+1)} = \mathbf{W}^{(t,\tau^*)}$
end
return $\mathbf{W}^{(T+1)}$
4.4. Nonlinear Classification via Kernels
So far we have only considered linear classifiers. There are several ways to extend our proposed methods to nonlinear classification via kernels. One way is to use the standard dual kernel trick. Shalev-Shwartz et al. [2011] suggest a different trick for Pegasos that could be applied to our proposed methods as well. The key idea in kernelizing Pegasos is to observe that all changes to the weights in the primal algorithm either add a linear multiple of instance features or multiply by a scalar. We could redefine $f_j(x) = \sum_{i=1}^{m} \alpha_{ij} K(x_i, x)$ and proceed with the primal algorithm, but make the equivalent changes to $\alpha$ rather than $\mathbf{W}$. Both of these methods have the disadvantage of increasing the asymptotic complexity of solving the optimization problem.

We instead use the random Fourier feature method of Rahimi and Recht [2007] (Algorithm 6), which approximates a kernel while preserving linear runtime. The main idea of this method is to transform the feature vectors first, then apply exactly the same linear training algorithm. The feature vector $x$ is transformed to a feature vector $z(x)$ such that with high probability $z(x) \cdot z(y) \approx k(x, y)$, where $k(x, y) = k(x - y)$ is a shift-invariant kernel. This method can be applied with the radial basis function (RBF) kernel $k(x, y) = \exp\{-\gamma\|x - y\|^2\}$. To use this algorithm with the RBF kernel $k$, we need to compute its Fourier transform $p(\omega)$ and draw $D$ independent, identically distributed samples $\omega_1, \ldots, \omega_D \in \mathbb{R}^d$ from $p(\omega)$, where $D$ is a parameter that can be adjusted to control approximation accuracy. It can be shown that for the RBF kernel with parameter $\gamma$, this is equivalent to sampling each of the $d$ components of $\omega_i$ from a normal distribution with zero mean and variance $2\gamma$.
5. EXPERIMENTS
Our experiments compare different loss functions, aggregation models, and optimization methods. We also investigate the effectiveness of random Fourier kernel features for nonlinear classification.
ALGORITHM 5: SIM-CCCP with rank loss and max model
Input: $T$, $K$, $K_{max}$, $\lambda$, MIML dataset $\{(X_i, Y_i)\}_{i=1}^n$
for $t = 1, \ldots, T$ do
    if $t = 1$ then
        $\mathbf{W}^{(t)} = \mathbf{0}$
        $\forall (i,j): \hat{x}^{(t)}_{ij} = \frac{1}{n_i}\sum_{x_{iq} \in X_i} x_{iq}$
    else
        $\forall (i,j): \hat{x}^{(t)}_{ij} = \arg\max_{x_{iq} \in X_i} \mathbf{w}_j \cdot x_{iq}$
    end
    $\mathbf{W}^{(t,1)} = \mathbf{W}^{(t)}$
    improved = False
    $\tau_{total} = 0$
    while $(\neg\,\text{improved}) \wedge \tau_{total} < K_{max}$ do
        for $\tau = 1, \ldots, K$ do
            $\tau_{total} = \tau_{total} + 1$
            $\forall (i,j): \hat{x}^{(t,\tau)}_{ij} = \arg\max_{x_{iq} \in X_i} \mathbf{w}_j \cdot x_{iq}$
            Compute $V = V^{(t,\tau)}_{RL,cccp}$ with Algorithm 3
            $\mathbf{W}' = \mathbf{W}^{(t,\tau)} - \frac{1}{\lambda\tau_{total}} V$
            $\mathbf{W}^{(t,\tau+1)} = \min\{1, \sqrt{2/\lambda}\,/\,\|\mathbf{W}'\|\}\, \mathbf{W}'$
        end
        $\tau^* = \arg\min_{\tau} h^{(t)}_{RL,cccp}(\mathbf{W}^{(t,\tau)})$
        $\mathbf{W}^{(t+1)} = \mathbf{W}^{(t,\tau^*)}$
        if $h^{(t)}_{RL,cccp}(\mathbf{W}^{(t+1)}) < h^{(t)}_{RL,cccp}(\mathbf{W}^{(t)})$ then
            improved = True
        end
    end
end
return $\mathbf{W}^{(T+1)}$
ALGORITHM 6: Random Fourier Kernel Features [Rahimi and Recht 2007]
Input: positive definite shift-invariant kernel $k(x, y) = k(x - y)$, feature dimension $d$, parameter $D$
Compute the Fourier transform $p$ of $k$: $p(\omega) = \frac{1}{2\pi} \int e^{-j\omega\cdot\delta}\, k(\delta)\, d\delta$
Draw $D$ i.i.d. samples $\omega_1, \ldots, \omega_D \in \mathbb{R}^d$ from $p(\omega)$
Let $z(x) = \sqrt{\tfrac{1}{D}}\,[\cos(\omega_1 \cdot x), \sin(\omega_1 \cdot x), \ldots, \cos(\omega_D \cdot x), \sin(\omega_D \cdot x)]$
5.1. Experimental Setup
This section discusses the datasets, baseline methods, parameter optimization, and the transductive and inductive modes used in our experiments.

5.1.1. Datasets. Table II summarizes the properties of each dataset used in our experiments. All of these datasets are available online.¹
¹http://web.engr.oregonstate.edu/~briggsf/kdd2012datasets
Table II. MIML Datasets

Dataset          Classes  Dimensions  Bags   Instances
HJA Birdsong     13       38          548    10,232
MSRC v2          23       48          591    1,758
Letter-Carroll   26       16          166    717
Letter-Frost     26       16          144    565
Pascal VOC 2012  20       48          1,053  4,143
HJA Bird Song. Our collaborators have collected audio recordings of bird song at the H. J. Andrews (HJA) Long Term Ecological Research Site, using unattended omnidirectional microphones. Our goal is to automatically identify the species of bird responsible for each utterance in these recordings, thereby generating an automatic acoustic survey of bird populations. This problem is a natural fit for the MIML instance annotation framework. We treat a 10-second audio recording as a bag with labels corresponding to the set of species present in the recording. The instances are segments in a spectrogram. A spectrogram is a graph of the spectrum of a signal as a function of time (computed by applying the Fast Fourier Transform to successive overlapping frames of the audio signal). Figure 1 shows an example spectrogram for a 10-second audio recording containing several species of birds.

Starting with a 10-second audio recording, we first convert it to a spectrogram. A series of preprocessing steps is then applied to the spectrogram to reduce noise and to identify bird song segments in the audio. Each segment is considered an instance and described by a 38-dimensional feature vector characterizing the shape of the segment, its time and frequency profile statistics, and a histogram of gradients.

This dataset contains 548 10-second recordings (bags) and a total of 10,232 segments (instances), of which 4,998 are labeled and the rest are unlabeled. The available instance labels were provided by a human expert. The spectrograms were originally labeled by manually drawing bounding boxes (not shown) around regions of the spectrogram corresponding to bird sound, and giving a single species label to each box. The bag-level label sets were formed by taking the union of these bounding box labels. The instances (segments) are produced by an automatic segmentation/detection algorithm, which does not necessarily match the manually drawn boxes. The labels for instances are obtained by matching each segment with the bounding box that overlaps with it most. If a segment does not overlap with any bounding box, it is designated as unlabeled. Further details on the collection of audio and feature extraction for this dataset are available in Briggs et al. [2012b].
This dataset presents some deviations from the standard MIML assumption that a bag's label set is the union of its instance labels. First, there are unlabeled instances that may not be accounted for in the bag label set. Second, when multiple birds make sounds that directly overlap in the spectrogram, the segmentation algorithm may fail to separate those sounds and create a segment that is given a single label although it actually represents two species. In rare cases, this can lead to labels in a bag label set that do not correspond to any instance (according to the instance-level labels).

One of the goals of our experiments is to evaluate how various instance annotation algorithms handle the issue of unlabeled instances. Toward this goal, we consider two variants of this dataset: "filtered" and "unfiltered". In the filtered variant, all of the unlabeled instances are removed; in the unfiltered variant they are left in during the training process. In both variants, accuracy is measured only on the labeled instances.
Fig. 1. An example spectrogram from the HJA Birdsong dataset. This spectrogram corresponds to one bag. Each outlined region is an instance.

Fig. 2. An image from MSRC v2 and the corresponding pixel-level labeling. The classes in this image are "sky", "trees", "grass", "body", and "car". The black regions are "void"; we discard void regions.
Image Dataset: MSRC v2. A subset of the Microsoft Research Cambridge (MSRC) image dataset² [Winn et al. 2005] named "v2" contains 591 images and 23 classes. The MSRC v2 dataset is useful for the instance annotation problem because pixel-level labels are included (Figure 2). Several authors have used MSRC v2 in MIML experiments [Vijayanarasimhan and Grauman 2009; Yakhnenko 2009; Zha et al. 2008].

We construct a MIML dataset from MSRC v2 as follows. We treat each image as a bag. The bag label set is the list of all classes present in the ground-truth segmentation (the union of the instance labels). The instances correspond to the contiguous regions of the ground-truth segmentation (to simplify the experiment, we use the ground-truth segmentation rather than automatic segmentation). Each instance is described by a 16-dimensional histogram of gradients [Dalal and Triggs 2005] and a 32-dimensional histogram of colors.
²http://research.microsoft.com/en-us/projects/objectclassrecognition/default.htm
Fig. 3. An image from PASCAL VOC 2012, the corresponding segmentation into instances, and the instance labels.

PASCAL Visual Object Challenge 2012. The PASCAL Visual Object Challenge (VOC) 2012 Segmentation dataset [Everingham et al. 2010] consists of images with per-pixel ground-truth labeling into 20 classes such as "car", "boat", "bird", "person", "cow", and "chair". Each image has a corresponding segmentation into objects and a class label for each region (Figure 3). We construct an MIML dataset by treating each image as a bag and each object as an instance. In the ground-truth segmentation, each object is denoted with a different color; in some cases multiple disconnected regions are grouped as the same object. We construct the same histogram of gradients and histogram of color features as used for the MSRC v2 dataset to describe the collection of pixels in each object.

Many images in VOC contain only one type of object, but multiple instances. Therefore all of the instances in such images are labeled unambiguously. To make the problem more challenging, we discard all images with a single label. After discarding single-label images, we are left with 1,053 images.
Synthetic MIML Datasets. Limited availability of MIML datasets with instance labels has been a barrier to studying instance annotation (because instance labels are needed to evaluate accuracy). Using the Letter Recognition dataset [Frey and Slate 1991] from the UCI Machine Learning Repository, we construct two synthetic MIML datasets. The Letter Recognition dataset consists of 20,000 instances with 16-dimensional features and 26 classes. Note that randomly forming bags would not be realistic, because real-world MIML problems often have correlations between labels. Instead, we generate datasets derived from the words in two poems, "Jabberwocky" [Carroll 1896] and "The Road Not Taken" [Frost 1916]. We call these datasets Letter-Carroll and Letter-Frost. For each word in these poems, we create a bag, with instances corresponding to the letters in the word. For each instance, we sample (without replacement) an example from the Letter Recognition dataset with the corresponding letter. The bag-level labels are the union of the instance labels. For example, the word "diverged" is transformed into a bag with 8 instances and the label set {d, i, v, e, r, g}.
5.1.2. Transductive vs. Inductive. We consider instance annotation in two different modes: transductive and inductive. In the transductive mode, the goal is to predict the instance labels for bags with known label sets. In this mode, the instance-level classifier can only predict labels that appear in the bag label set. Formally, the instance classifier in the transductive mode is $f(x_{iq}) = \arg\max_{j \in Y_i} f_j(x_{iq})$. In the inductive mode, the goal is to predict instance labels in previously unseen bags (with unknown label sets). There is no restriction on which label an instance may be given in this case.
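The two prediction rules can be sketched as a single function; this is an illustrative sketch, and `bag_labels` is a hypothetical argument name for the known bag label set $Y_i$.

```python
import numpy as np

def predict_instance(W, x, bag_labels=None):
    """Instance-label prediction in both evaluation modes.

    W: (c, d) weight matrix (one row per class), x: (d,) instance features.
    Transductive mode: bag_labels is the known bag label set, and the
    argmax is restricted to it. Inductive mode: bag_labels is None and
    all c classes compete.
    """
    scores = W @ x                           # f_j(x) = w_j . x for each class j
    if bag_labels is None:                   # inductive: unrestricted argmax
        return int(np.argmax(scores))
    labels = sorted(bag_labels)              # transductive: argmax over Y_i only
    return labels[int(np.argmax(scores[labels]))]
```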
For both modes, we compute classifier accuracy as the fraction of instances correctly classified. For the inductive mode, we run 10-fold cross-validation and report average accuracy ± the standard deviation in accuracy between folds. We did not use 10-fold cross-validation for the VOC dataset, but instead used a partition of the dataset into "train"
and "val" subsets, which was included with the dataset. For the transductive mode, we disregard this partition and use all 1,053 images; for the inductive mode, we learn from the "train" subset and evaluate accuracy on the "val" subset. This is why the VOC columns in Tables III(c) and (d) do not list a standard deviation.
5.1.3. Baseline Methods. To directly compare our proposed rank-loss objective with Hamming loss, we modify the SIM-Heuristic algorithm to use Hamming loss instead of rank loss. We also consider another baseline, M3MIML, which is designed to make predictions at the bag level but learns an instance-level model using Hamming loss. We also compare rank loss to the ambiguous loss function used by Cour et al. [2011]. Finally, as a reference, we evaluate a SISL SVM classifier, which has the unfair advantage of learning directly from instance labels. In the following we summarize these baseline methods.

Hamming-Loss SIM. To compare Hamming loss to rank loss, we use the following Hamming-loss objective, with $F_j(\cdot)$ instantiated with either the max or softmax aggregation model:
$$h_{Ham}(\mathbf{W}) = \frac{\lambda}{2}\|\mathbf{W}\|^2 + \frac{1}{nc}\sum_{i=1}^{n}\sum_{j=1}^{c} \max\{0,\, 1 - Y_{ij}F_j(X_i)\}, \qquad (30)$$
or, in terms of support instances,
$$\hat{h}_{Ham}(\mathbf{W}) = \frac{\lambda}{2}\|\mathbf{W}\|^2 + \frac{1}{nc}\sum_{i=1}^{n}\sum_{j=1}^{c} \max\{0,\, 1 - Y_{ij}\,\mathbf{w}_j \cdot \hat{x}_{ij}\}. \qquad (31)$$
To optimize this objective, we use a variation of the SIM-Heuristic algorithm with the subgradient³
$$v^{(t,\tau),q}_{Ham} = \lambda \mathbf{w}^{(t,\tau)}_q - \frac{1}{nc}\sum_{i=1}^{n} \mathbb{I}\big[Y_{iq}\, \mathbf{w}^{(t,\tau)}_q \cdot \hat{x}^{(t)}_{iq} < 1\big]\, Y_{iq}\, \hat{x}^{(t)}_{iq}. \qquad (32)$$
M3MIML Algorithm. M3MIML is a MIML algorithm designed to make predictions at the bag level; however, it learns an instance-level model, which can be used for instance annotation. Similar to our approach, M3MIML learns one linear model per class and uses the max bag-level model for $F_j$. The optimization problem for M3MIML is formulated as minimizing $\|\mathbf{W}\|^2$, where $\mathbf{W} = [\mathbf{w}_1, \ldots, \mathbf{w}_c]$, subject to the constraints of correct output at the bag level,
$$Y_{ij}F_j(X_i) \ge 1 \quad \text{for } i = 1, \ldots, n,\; j = 1, \ldots, c. \qquad (33)$$
Note that the equivalent unconstrained objective is regularized Hamming loss. If $Y_{ij} = -1$, the constraint is convex and is converted to a set of linear constraints:
$$-\max_{x \in X_i} \mathbf{w}_j \cdot x \ge 1 \qquad (34)$$
$$\forall x \in X_i : -\mathbf{w}_j \cdot x \ge 1. \qquad (35)$$
³Note that (30) satisfies the conditions of Lemma 4.1, and a bound on the magnitude of the subgradient of $L = c(\sqrt{2\lambda} + \frac{R}{c})^2$ suffices. The proof of this bound is similar to the short derivation in Section 4.3.4.
If $Y_{ij} = +1$, the constraint is non-convex, and M3MIML's formulation replaces it with a linear upper bound:
$$\max_{x \in X_i} \mathbf{w}_j \cdot x \ge \frac{1}{n_i}\sum_{x \in X_i} \mathbf{w}_j \cdot x \ge 1. \qquad (36)$$
After several more steps (e.g., adding slack variables and switching to the dual), M3MIML is formulated as a QP with linear constraints. Unlike the CCCP approach, only a single QP is solved rather than a sequence.
Ambiguous Label Classification (ALC). Cour et al. [2009] proposed an SVM formulation for the ALC problem. We refer to their algorithm as CLPL because it is implemented in the Convex Learning from Partial Labels Toolbox [Cour et al. 2011]. We compare our proposed method to CLPL because both learn one linear model per class, $f_j(x) = \mathbf{w}_j \cdot x$, and predict the instance label as $\arg\max_j f_j(x)$, and both use an L2-regularized loss function. The primary difference in Cour's ALC method is that the loss function is designed for use with ALC data (instead of the bag-level loss functions we use for MIML data). The ALC loss function is a convex upper bound on the 0/1 loss with respect to the true (unknown) instance labels,
$$L(f, x, Y) = \max\Big\{0,\; 1 - \frac{1}{|Y|}\sum_{j \in Y} f_j(x)\Big\}^2 + \sum_{j \notin Y} \max\{0,\; 1 + f_j(x)\}^2.$$
Minimizing this loss function can be accomplished by a reduction to a SISL SVM problem with squared hinge loss, solved using an off-the-shelf linear SVM. In our experiments, we use the L2-regularized square-loss dual solver in LIBLINEAR [Fan et al. 2008] to implement CLPL.
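The ALC loss above can be sketched for a single instance as follows. This is an illustrative computation of the loss value only, not the reduction to a SISL SVM used in practice.

```python
import numpy as np

def ambiguous_loss(scores, Y, c):
    """ALC loss of Cour et al. for one instance (sketch).

    scores : (c,) array of classifier scores f_j(x)
    Y      : set of candidate (ambiguous) labels for this instance
    c      : total number of classes
    """
    Y = sorted(Y)
    # squared hinge on the averaged score of the candidate set
    pos = max(0.0, 1.0 - np.mean(scores[Y])) ** 2
    # squared hinge pushing every non-candidate score below -1
    neg = sum(max(0.0, 1.0 + scores[j]) ** 2
              for j in range(c) if j not in Y)
    return pos + neg
```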
Single Instance Single Label SVM. We also run a standard SISL SVM in the inductive mode, whose performance can be interpreted as an empirical upper bound for inductive instance annotation because it is trained using unambiguously labeled instances. For this experiment, we use LIBSVM [Chang and Lin 2001] with a linear kernel. Note that LIBSVM uses one linear model for each pair of classes rather than one for each class.
5.1.4. Parameter Optimization. The SIM algorithms have a regularization parameter $\lambda$. Similarly, CLPL and M3MIML have a regularization parameter $C$. We repeat all experiments for each value of $\lambda \in \{10^{-6}, \ldots, 10^{-9}\}$, $C \in \{10^1, \ldots, 10^4\}$ for CLPL, and $C \in \{10^{-2}, \ldots, 10^6\}$ for M3MIML, then report the maximum accuracy achieved by each method over all parameters (we refer to this process as post hoc parameter selection). We expect these parameter ranges to be sufficient based on preliminary experiments in which we used larger ranges [Briggs et al. 2012a].
Where the random Fourier kernel features are used, the parameter $D = 50$ (as in Rahimi and Recht [2007]), and we jointly optimize the regularization parameter and the RBF kernel parameter over a grid with the aforementioned values of $\lambda$ or $C$, and $\gamma \in \{10^3, 10^4, 10^5\}$. This range of values for $\gamma$ was selected by manual tuning.

For the SISL SVM, the regularization parameter $C$ is optimized by nested 10-fold cross-validation (within each fold of 10-fold cross-validation, we run 10-fold cross-validation on the training set to select the parameter). We search over the range $C \in \{10^1, \ldots, 10^7\}$. With values of $C$ larger than $10^7$, LIBSVM becomes prohibitively slow.
The parameters $T$, $K$, and $K_{max}$ control the tradeoff between convergence and runtime. These parameters are manually selected by empirically observing the typical convergence behavior of the different algorithms. In particular, for the SIM-Heuristic algorithms, we use $T = 10$ outer iterations with $K = 100$ iterations of subgradient descent. For the CCCP algorithm, we use $T = 10$, $K = 100$, and $K_{max} = 1000$. Figure 4 shows that most of the improvement in the rank-loss objective occurs within $T = 10$ outer iterations. We presented empirical results in Briggs et al. [2012a] showing that $K = 100$ is a reasonable number of iterations of subgradient descent. $K_{max} = 1000$ is chosen rather conservatively, based on empirical observation, to ensure monotonicity in CCCP. Note that further increasing the $T$, $K$, and $K_{max}$ values may lead to slightly better performance due to better convergence; however, we consider these values sufficient for problems with a similar number of classes.
5.2. Results
Accuracy results are listed in Table III. In particular, Tables III(a) and (b) present the accuracy results for the transductive mode (with no kernel and the RBF kernel, respectively). Tables III(c) and (d) present the results for the inductive mode.

Following the recommendations of Demšar [2006] for statistical comparison of multiple classifiers on multiple datasets, we summarize comparisons between two methods using win-tie-loss counts (and do not discard some wins or losses based on a pairwise significance test). Counts are aggregated over all datasets including Birdsong*, but not including the unfiltered variant.
5.2.1. Comparison of Loss Functions. We first compare the rank-loss and Hamming-loss versions of SIM-Heuristic. This is a direct comparison between the two loss functions because all other aspects are kept the same, i.e., the aggregation model and optimization method. Considering five datasets (all but the unfiltered Birdsong dataset, to avoid the compounding factor of noise instances), two different aggregation models (max and softmax), and two different settings (transductive and inductive), there are a total of 20 direct comparisons between the two loss functions. The win-tie-loss count for rank loss vs. Hamming loss is 20-0-0, a decisive win for rank loss.

We next compare rank-loss SIM-Heuristic and SIM-CCCP with M3MIML. Since M3MIML uses the max aggregation model, we compare it with SIM-Heuristic with max, resulting in 10 total comparisons and a win-tie-loss count of 9-0-1 in favor of SIM-Heuristic. For SIM-CCCP vs. M3MIML, the win-tie-loss count is also 9-0-1. Finally, comparing SIM-Heuristic with rank loss and max or softmax to CLPL (ambiguous loss), the win-tie-loss count is 18-0-2. Overall, these results suggest that rank loss consistently outperforms Hamming or ambiguous loss.
5.2.2. Comparison of Aggregation Models. Next we compare the max and softmax aggregation models. Focusing on rank loss, we count wins, ties, and losses using the SIM-Heuristic algorithm across five datasets, in transductive and inductive modes, with or without a kernel, resulting in a total of 20 comparisons. The overall win-tie-loss count for softmax vs. max is 13-1-6 in favor of softmax. Interestingly, if we focus on only linear models (no kernel features), the win-tie-loss count is 8-1-1, suggesting a dominant win for softmax. In contrast, when using kernel features, the count is 5-0-5, indicating a tie between the two aggregation models. We suggest that softmax achieves higher accuracy than max when using linear models because softmax is less sensitive to outliers and noise. We speculate that the difference between max and softmax is smaller when a kernel is used because outliers and noise have a local effect on the decision boundaries of the classifier, rather than skewing the boundaries globally as with a linear classifier.
5.2.3. The Effect of Unlabeled Instances. Recall that the Birdsong dataset has a filtered and an unfiltered variant (Birdsong* is the filtered variant). The unfiltered variant contains instances that were left unlabeled and are not necessarily accounted for in the
Table III. Instance Annotation Accuracy Results

(a) Transductive accuracy, no kernel. * - filtered variant

Algorithm      Loss       Model    Carroll  Frost  Birdsong*  Birdsong  MSRC v2  VOC
SIM-Heuristic  Rank       max      .719     .780   .815       .801      .699     .633
SIM-CCCP       Rank       max      .744     .805   .816       .803      .720     .634
SIM-Heuristic  Rank       softmax  .721     .814   .815       .810      .718     .630
SIM-Heuristic  Hamming    max      .415     .495   .707       .599      .581     .541
SIM-Heuristic  Hamming    softmax  .500     .548   .781       .603      .603     .557
CLPL           Ambiguous  -        .672     .688   .742       .678      .678     .598
M3MIML         Hamming    max      .454     .532   .651       -         .547     .533

(b) Transductive accuracy, random Fourier kernel features

SIM-Heuristic  Rank       max      .817     .792   .822       -         .756     .642
SIM-CCCP       Rank       max      .807     .780   .829       -         .798     .623
SIM-Heuristic  Rank       softmax  .794     .819   .833       -         .766     .634

(c) Inductive accuracy ± standard deviation over 10-fold cross-validation, no kernel

SIM-Heuristic  Rank       max      .531 ± .054  .562 ± .057  .602 ± .033  .522 ± .032  .442 ± .044  .354
SIM-CCCP       Rank       max      .551 ± .038  .555 ± .065  .607 ± .038  .530 ± .021  .473 ± .029  .350
SIM-Heuristic  Rank       softmax  .540 ± .049  .573 ± .052  .618 ± .041  .556 ± .062  .460 ± .042  .357
SIM-Heuristic  Hamming    max      .114 ± .030  .141 ± .045  .239 ± .051  .133 ± .062  .152 ± .049  .143
SIM-Heuristic  Hamming    softmax  .150 ± .049  .166 ± .062  .342 ± .044  .197 ± .052  .223 ± .034  .215
CLPL           Ambiguous  -        .464 ± .058  .506 ± .063  .620 ± .038  .535 ± .033  .431 ± .036  .345
M3MIML         Hamming    max      .288 ± .041  .313 ± .041  .433 ± .073  -            .317 ± .055  .396
SISL SVM       Hinge      -        .772 ± .049  .753 ± .038  .772 ± .032  -            .638 ± .045  .440

(d) Inductive accuracy ± standard deviation over 10-fold cross-validation, random Fourier kernel features

SIM-Heuristic  Rank       max      .565 ± .060  .590 ± .051  .645 ± .039  -            .499 ± .044  .339
SIM-CCCP       Rank       max      .618 ± .042  .576 ± .065  .630 ± .040  -            .519 ± .044  .343
SIM-Heuristic  Rank       softmax  .596 ± .041  .587 ± .066  .642 ± .039  -            .506 ± .038  .337
bag-level label sets. Not surprisingly, all algorithms suffer some degradation in accuracy in the presence of these unlabeled instances. There are a few interesting points to note. First, the performance degradation for rank loss is often substantially smaller than for the other loss functions in the comparison, including both Hamming loss and CLPL. This suggests that rank loss is more noise-robust than the other loss functions. We also note that the accuracy of the max model decreases by more than the accuracy of the softmax model; max is known to be sensitive to outliers and noise. In contrast, the softmax model is more robust in the presence of noise instances.
5.2.4. Comparison of CCCP and Heuristic Optimization. Because SIM-CCCP is not applicable to the softmax model, we focus on the max model in a comparison of SIM-Heuristic and SIM-CCCP. One point of interest is the convergence of these two algorithms on the non-convex rank-loss objective. Figure 4 shows the objective $h_{RL}$ vs. the number of outer iterations for SIM-Heuristic and SIM-CCCP (recall that one outer iteration consists of updating support instances and solving a convex problem). Observe that CCCP monotonically decreases the objective, while the heuristic occasionally increases the objective [e.g., Figure 4(e)]. CCCP generally achieves lower (better) objective values than the heuristic.

Comparing the accuracy of the two methods over all results in the transductive and inductive modes, with or without a kernel, the count for SIM-CCCP vs. SIM-Heuristic
Fig. 4. The objective $h_{RL}$ vs. the number of iterations of SIM-CCCP and SIM-Heuristic. SIM-CCCP is the dotted line. Transductive mode, max model, $\lambda = 10^{-7}$.
is 13-0-7, slightly in favor of CCCP. These results suggest that the lower-objective solutions obtained by CCCP often translate to improved accuracy.
5.2.5. Random Fourier Kernel Features. We now examine the accuracy improvements achieved by using random Fourier kernel features [Rahimi and Recht 2007]. Results using this method are listed in Tables III(b) and (d). From these results we see that the random Fourier kernel features often improve accuracy, sometimes by as much as 10% (e.g., transductive SIM-Heuristic with max on the Letter-Carroll dataset). More specifically, we compare the results obtained using kernel features (Tables III(b) and (d)) against the results of the corresponding linear methods (the first three rows of Tables III(a) and (c), respectively), resulting in a total of 30 comparisons. The aggregated win-tie-loss count for kernel vs. no kernel features is 25-0-5 in favor of kernel features.
5.2.6. Parameter Selection by Cross-Validation. In all of the results discussed so far, we use post hoc parameter selection. In other words, we run the whole experiment multiple times with different parameters and report the result with the best instance-level accuracy. This parameter selection is done the same way for all algorithms (including CLPL and M3MIML), so no algorithm gains an unfair advantage. However, this method cannot be used in practice unless some instance labels are known. One might label a relatively small number of instances specifically for this purpose.
In a scenario where no instance labels are known, cross-validation cannot be used to select the regularization parameter $\lambda$ that gives the best instance-level accuracy. We propose instead to use cross-validated rank loss as a selection criterion for $\lambda$, as follows:

— Apply $\ell$-fold cross-validation. For each fold $l = 1, \ldots, \ell$, partition the dataset into two sets $Train(l)$ and $Test(l)$.
Fig. 5. (a–e) Solid line: accuracy vs. $\lambda$. Dotted line: $RL(\lambda)$, i.e., cross-validated bag-level rank loss. (f) Scatter plot of accuracy and $RL(\lambda)$ for the Letter-Frost dataset.
— In each fold l, train an SIM on Train(l), then compute the unregularized rank loss on Test(l). Compute the average rank loss over all folds as a function of λ:

RL(λ) = (1/L) Σ_{l=1}^L Σ_{i ∈ Test(l)} Σ_{j ∈ Y_i} Σ_{k ∉ Y_i} α_i max{0, 1 − (F_j(X_i) − F_k(X_i))}.    (37)
— Select the λ that gives the lowest loss averaged over all folds:

λ̂ = arg min_λ RL(λ).    (38)
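The selection procedure of Eqs. (37)–(38) can be sketched as follows. This is a minimal illustration rather than the paper's code: `train_fn`, `score_fn`, and the per-bag weight α_i = 1/(|Y_i||Ȳ_i|) are our assumptions.

```python
import numpy as np

def bag_rank_loss(F, Y, n_classes):
    """Inner term of Eq. (37) for one bag: hinge penalty whenever a label j in Y
    is not ranked above a label k outside Y. The 1/(|Y||Ybar|) weight is assumed."""
    Y = set(Y)
    Ybar = [k for k in range(n_classes) if k not in Y]
    if not Y or not Ybar:
        return 0.0
    w = 1.0 / (len(Y) * len(Ybar))
    return w * sum(max(0.0, 1.0 - (F[j] - F[k])) for j in Y for k in Ybar)

def select_lambda(train_fn, score_fn, bags, label_sets, lambdas, n_folds=10, n_classes=None):
    """Pick the lambda minimizing cross-validated rank loss, Eq. (38).
    train_fn(bags, label_sets, lam) -> model; score_fn(model, bag) -> per-class scores."""
    folds = np.array_split(np.arange(len(bags)), n_folds)
    avg_loss = []
    for lam in lambdas:
        total = 0.0
        for fold in folds:
            test = set(fold.tolist())
            tr = [i for i in range(len(bags)) if i not in test]
            model = train_fn([bags[i] for i in tr], [label_sets[i] for i in tr], lam)
            total += sum(bag_rank_loss(score_fn(model, bags[i]), label_sets[i], n_classes)
                         for i in test)
        avg_loss.append(total / n_folds)
    return lambdas[int(np.argmin(avg_loss))]
```

The trainer is treated as a black box, so the same loop works for any of the SIM variants.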
Note that we did not use this method in other experiments because it increases runtime by an order of magnitude compared to post hoc selection (applying this method in the transductive mode involves cross-validation; in the inductive mode it would be nested cross-validation, e.g., 10-fold CV within each fold of 10-fold CV). Cour et al. [2011] proposed a similar parameter selection scheme for ALC, wherein the regularization parameter is selected by cross-validated ambiguous loss.
To test our proposed parameter selection method within a reasonable amount of time, we focus on the transductive mode. Using the SIM-Heuristic algorithm with softmax, we vary λ ∈ {10^-9, ..., 10^0} and at each value of λ, compute the instance-level accuracy and cross-validated rank loss. Figure 5(a–e) shows the instance-level accuracy and cross-validated rank loss as a function of λ for each dataset. From these results, we see that the value of λ selected by minimizing cross-validated rank loss is generally close to the value of λ that maximizes instance-level accuracy. Figure 5(f) shows an example scatter plot of accuracy and cross-validated rank loss from the Letter-Frost dataset. Each point corresponds to one value of λ. The line is a linear least-squares fit. This plot shows that lower cross-validated rank loss generally corresponds with higher instance accuracy. The scatter plots are similar for the other
Table IV. Empirical Runtime (Seconds), Transductive, No Kernel, λ = 10^-7

Algorithm       Loss   Model     Carroll   Frost   Birdsong*   MSRC v2   VOC
SIM-Heuristic   Rank   max       12.6      9.8     21.0        64.1      139.1
SIM-Heuristic   Rank   softmax   13.6      10.2    23.3        68.9      147.1
SIM-CCCP        Rank   max       53.7      59.8    132.8       269.9     443.3
datasets. We find experimentally that the proposed method is reasonably effective for
selecting the regularization parameter in the absence of any instance labels.
5.2.7. Empirical Runtime. Table IV gives empirical runtimes in seconds [4] for our proposed methods. These results are obtained in the transductive mode, with a fixed regularization parameter λ = 10^-7, and no kernel. In all cases, the runtime is on the order of minutes or seconds. The runtime using the softmax model is slightly more than using the max model. The runtime for CCCP is longer than the heuristic because it recomputes support instances in every iteration of subgradient descent, and runs more iterations of subgradient descent to ensure monotonicity.
5.2.8. Overall Summary. Comparing our proposed methods SIM-CCCP with max and SIM-Heuristic with softmax, neither is best all the time. Generally, we observe that SIM-CCCP with max achieves higher accuracy when bag label sets are exactly equal to the union of instance labels. When this assumption does not hold, SIM-Heuristic with softmax tends to achieve higher accuracy. SIM-Heuristic with softmax provides the best balance of runtime and accuracy.
SISL SVM in the inductive mode [Table III(c)] achieves a much higher accuracy than
any of the MIML algorithms. This result is expected, as the SISL SVM has access to
labeled instances when training, while the MIML algorithms do not. However, this
accuracy gap suggests that there is still a lot of room for improvement in the instance
annotation algorithms.
6. RELATED WORK
MIML algorithms are developed under multiple frameworks, some of which naturally lend themselves to instance annotation. One such framework is graphical models, which have been previously used to perform bag- and instance-level classification. Such models often treat instance labels as latent variables. Inference over such models allows the classification of instances. While a variety of algorithms exists, we highlight some representative examples of recent work. Dirichlet-Bernoulli Alignment [Yang et al. 2009] and the Exponential Multinomial Mixture model [Yang et al. 2010] are topic models for MIML datasets and use variational inference to perform instance labeling. Zha et al. [2008] proposed the ML-MIL algorithm, a conditional random field model for MIML image annotation that uses Gibbs sampling to infer instance labels. Du et al. [2009] proposed another application of graphical models to simultaneous image annotation and segmentation.
While graphical models offer an intuitive probabilistic interpretation, the computational complexity of inference in such models is one of the standing challenges. In this work, we focus on another class of MIML approaches based on regularized loss minimization. Hamming-Loss SIM can be viewed as an alternative approach for optimizing a similar objective to the M3MIML algorithm [Zhang and Zhou 2008] (which learns an instance-level model, but was not designed for instance annotation). Also note that
[4] These runtimes are measured on a 2010 Mac Pro with a 2.4 GHz Intel Xeon processor and 16 GB of 1066 MHz DDR3 memory. The algorithms are implemented in C++ and compiled with GCC 4.0.
CLPL follows the same framework of regularized loss minimization (but using a loss
function designed for ALC).
There are a small number of other works that address MIML instance annotation. Vijayanarasimhan and Grauman [2009] developed an MIML SVM that learns a bag-level model with a set kernel (it does not learn a model of the instance feature space). Their algorithm makes predictions at either the bag or instance level by treating an instance as a bag of one instance. Vezhnevets et al. [2010] proposed an algorithm for instance annotation in MIML data, where images are represented as a bag of pixels. Their algorithm alternates between sampling instance labels from an estimated distribution, and training an ensemble of decision trees on the sampled labels. Similarly, Nguyen [2010] proposed an MIML SVM algorithm that alternates between assigning instance labels and maximizing the margin, given the assigned labels (although they did not conduct experiments on instance annotation).
Similar to our approach, the MI-SVM algorithm [Andrews et al. 2002] for multi-instance learning (MIL) uses CCCP to handle nonconvexity (note that the interpretation of MI-SVM as an instance of CCCP is due to Cheung and Kwok [2006]). D-MimlSvm [Zhou et al. 2012] also uses CCCP to optimize Hamming loss.
In recent work, Li et al. [2012] consider the problem of identifying the "key instances" that trigger labels. This problem is similar to instance annotation, but differs in that the goal is not to label all instances, but instead to select a set of instances explaining each label.
7. CONCLUSION AND FUTURE WORK
In this work, we proposed rank-loss support instance machines for instance annotation. The goal of instance annotation is to learn an instance-level classifier using an MIML dataset for training data, which does not directly associate instances with labels. We explained why, and empirically showed that, rank loss is superior to other loss functions (e.g., Hamming or ambiguous loss) for the instance annotation problem. The SIM algorithm is based on the observation that for the commonly used max model connecting bag-level labels and instance-level labels, the bag-level output can be represented as a linear function of a "support instance", which summarizes the instances in a bag. The SIM model can also be used with the softmax model, which is less sensitive to noise. The SIM model poses a nonconvex optimization problem; we offer a heuristic solution that is applicable to either max or softmax, and a CCCP method that can be applied to the max model and guarantees a monotonic decrease in the nonconvex objective. In either optimization method, the basic process is to alternate between updating support instances and solving a convex problem to minimize rank loss. We give a primal subgradient descent algorithm for solving these convex problems with runtime linear in the number of bags and instances. We also demonstrate that an approximate kernel method often improves accuracy, while retaining linear runtime.
In future work, we will examine deviations from the basic assumptions of MIML instance annotation. For example, the assumption that a bag label set is the union of instance labels is often violated. This may be the case when the labeler does not provide a complete labeling of the data, and only labels a subset of relevant classes. Another interesting problem is recognizing when an instance does not belong to any of the known classes. This is a common issue in machine vision datasets, where there is often background scenery or there are novel objects that are not labeled. Finally, one should remember that when MIML instance annotation is applied, the classic SISL approach is also an option, and typically gives higher accuracy at the cost of increased effort to label instances. Learning from a mixed-granularity dataset consisting of mostly bag-level label sets and a few unambiguously labeled instances is one potential way to achieve
the higher accuracy of SISL with the reduced labeling effort of MIML. However, this
idea has received little study so far [Vijayanarasimhan and Grauman 2009].
APPENDIXES
APPENDIX 1. Proof of Bound on Subgradient
We follow the approach of Shalev-Shwartz and Singer [2007] to establish the convergence of subgradient descent. Because the number of iterations to reach an ϵ-suboptimal solution is O(L/(λϵ)), where L is a bound on the subgradient, ‖∇‖² ≤ L, we proceed with a derivation of L. The bound proved in this appendix applies to subgradients (27) and (29). Suppressing all sub- and superscripts except for the class index q, recall that the complete subgradient is ∇ = [v_1, ..., v_q, ..., v_c].
We will start with a bound on the qth component of ∇; using the triangle inequality:

‖v_q‖ ≤ λ‖w_q‖ + Σ_{ijk} α_i I[·] ( ‖x̂‖ I[q = k] + ‖x̂‖ I[q = j] ).    (39)
Note that I[·] ≤ 1 and, for any support instance, ‖x̂‖ ≤ R. Using these properties,

‖v_q‖ ≤ λ‖w_q‖ + R Σ_{ijk} α_i ( I[q = k] + I[q = j] )    (40)

‖v_q‖ ≤ λ‖w_q‖ + R ( Σ_{ijk} α_i I[q = k] + Σ_{ijk} α_i I[q = j] ).    (41)
The first sum on the RHS of (41) can be computed as

Σ_{ijk} α_i I[q = k] = Σ_{i=1}^n α_i Σ_{j ∈ Y_i} Σ_{k ∉ Y_i} I[q = k]    (42)

= Σ_{i=1}^n α_i Σ_{j=1}^c Σ_{k=1}^c I[q = k] I[j ∈ Y_i] I[k ∉ Y_i]    (43)

= Σ_{i=1}^n α_i I[q ∉ Y_i] Σ_{j=1}^c I[j ∈ Y_i] = Σ_{i=1}^n α_i I[q ∉ Y_i] |Y_i|,    (44)

where the inner sum Σ_{j=1}^c I[j ∈ Y_i] equals |Y_i|.
Similarly, the second sum on the RHS of (41) can be simplified as

Σ_{ijk} α_i I[q = j] = Σ_{i=1}^n α_i I[q ∈ Y_i] |Ȳ_i|.    (45)
Substituting these terms back into (41),

‖v_q‖ ≤ λ‖w_q‖ + R ( Σ_{i=1}^n α_i I[q ∉ Y_i] |Y_i| + Σ_{i=1}^n α_i I[q ∈ Y_i] |Ȳ_i| ).    (46)

To condense notation, we will rewrite this statement as ‖v_q‖ ≤ a_q + b_q, where

a_q = λ‖w_q‖    (47)

b_q = R Σ_{i=1}^n α_i ( I[q ∉ Y_i] |Y_i| + I[q ∈ Y_i] |Ȳ_i| ).    (48)
LEMMA A.1.  ‖∇‖² ≤ ( √(Σ_{q=1}^c a_q²) + √(Σ_{q=1}^c b_q²) )².

PROOF.

‖v_q‖² ≤ (a_q + b_q)² = a_q² + b_q² + 2 a_q b_q    (49)

‖∇‖² = Σ_{q=1}^c ‖v_q‖² ≤ Σ_{q=1}^c a_q² + Σ_{q=1}^c b_q² + Σ_{q=1}^c 2 a_q b_q    (50)

≤ Σ_{q=1}^c a_q² + Σ_{q=1}^c b_q² + 2 √( (Σ_{q=1}^c a_q²)(Σ_{q=1}^c b_q²) )    by Cauchy-Schwarz    (51)

= ( √(Σ_{q=1}^c a_q²) + √(Σ_{q=1}^c b_q²) )².    (52)
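As a quick numerical sanity check (an illustration, not part of the proof), we can draw random nonnegative a_q, b_q and vectors v_q with ‖v_q‖ ≤ a_q + b_q, and confirm that the inequality of Lemma A.1 holds; the function is ours:

```python
import numpy as np

def lemma_a1_check(seed=0, c=5, dim=8):
    """Draw a_q, b_q >= 0 and vectors v_q with ||v_q|| <= a_q + b_q,
    then return (||V||^2, Lemma A.1's bound) for comparison."""
    rng = np.random.default_rng(seed)
    a = rng.uniform(0.1, 2.0, size=c)
    b = rng.uniform(0.1, 2.0, size=c)
    lhs = 0.0
    for q in range(c):
        v = rng.normal(size=dim)
        v *= rng.uniform(0.0, a[q] + b[q]) / np.linalg.norm(v)  # enforce norm bound
        lhs += v @ v
    rhs = (np.sqrt((a ** 2).sum()) + np.sqrt((b ** 2).sum())) ** 2
    return lhs, rhs
```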
LEMMA A.2.  Σ_{q=1}^c a_q² ≤ 2λ.

PROOF.  Σ_{q=1}^c a_q² = Σ_{q=1}^c λ²‖w_q‖² = λ² Σ_{q=1}^c ‖w_q‖² = λ²‖W‖² ≤ 2λ, because W ∈ S.
LEMMA A.3.  Σ_{q=1}^c b_q² ≤ 4R².

PROOF.

Σ_{q=1}^c b_q² = R² Σ_{q=1}^c ( Σ_{i=1}^n α_i ( I[q ∉ Y_i] |Y_i| + I[q ∈ Y_i] |Ȳ_i| ) )²    (53)

= R² Σ_{q=1}^c Σ_{i=1}^n Σ_{i'=1}^n α_i α_{i'} ( I[q ∉ Y_i] |Y_i| + I[q ∈ Y_i] |Ȳ_i| ) ( I[q ∉ Y_{i'}] |Y_{i'}| + I[q ∈ Y_{i'}] |Ȳ_{i'}| ).    (54)
Note that this result is obtained from the identity ( Σ_{i=1}^n x_i )² = Σ_{i=1}^n Σ_{i'=1}^n x_i x_{i'}. Expanding the right side of the expression, we will obtain four terms, e.g.,

R² Σ_{q=1}^c Σ_{i=1}^n Σ_{i'=1}^n α_i α_{i'} I[q ∉ Y_i] I[q ∉ Y_{i'}] |Y_i| |Y_{i'}|    (55)

= R² Σ_{i=1}^n Σ_{i'=1}^n α_i α_{i'} |Y_i| |Y_{i'}| Σ_{q=1}^c I[q ∉ Y_i] I[q ∉ Y_{i'}]    (56)

≤ R² Σ_{i=1}^n Σ_{i'=1}^n α_i α_{i'} |Y_i| |Y_{i'}| √( (Σ_{q=1}^c I[q ∉ Y_i]²)(Σ_{q=1}^c I[q ∉ Y_{i'}]²) )    by Cauchy-Schwarz    (57)

= R² Σ_{i=1}^n Σ_{i'=1}^n α_i α_{i'} |Y_i| |Y_{i'}| √( |Ȳ_i| |Ȳ_{i'}| )    (58)

= R² Σ_{i=1}^n Σ_{i'=1}^n ( |Y_i| |Y_{i'}| √( |Ȳ_i| |Ȳ_{i'}| ) ) / ( n² |Y_i| |Ȳ_i| |Y_{i'}| |Ȳ_{i'}| ) = (R²/n²) Σ_{i=1}^n Σ_{i'=1}^n 1/√( |Ȳ_i| |Ȳ_{i'}| ) = (R²/n²) ( Σ_{i=1}^n 1/√|Ȳ_i| ) ( Σ_{i'=1}^n 1/√|Ȳ_{i'}| ),    (59)

where the second line of (59) substitutes α_i = 1/(n |Y_i| |Ȳ_i|).
The other three terms can be derived similarly. It follows that

Σ_{q=1}^c b_q² ≤ (R²/n²) [ ( Σ_{i=1}^n 1/√|Ȳ_i| )² + ( Σ_{i=1}^n 1/√|Y_i| )² + 2 ( Σ_{i=1}^n 1/√|Ȳ_i| ) ( Σ_{i=1}^n 1/√|Y_i| ) ]    (60)

= (R²/n²) ( Σ_{i=1}^n ( 1/√|Y_i| + 1/√|Ȳ_i| ) )² ≤ R² ( 1 + 1/√(c − 1) )².    (61)

To obtain the last inequality, observe that a label set Y_i may not be empty or contain all c classes; therefore 1 ≤ |Y_i| ≤ c − 1 and 1 ≤ |Ȳ_i| ≤ c − 1. Finally, note that 1 + 1/√(c − 1) ≤ 2; therefore Σ_{q=1}^c b_q² ≤ 4R².
Combining Lemmas A.1, A.2, and A.3, we get

‖∇‖² ≤ ( √(Σ_{q=1}^c a_q²) + √(Σ_{q=1}^c b_q²) )²    (62)

≤ ( √(2λ) + 2R )².    (63)

Therefore, L = ( √(2λ) + 2R )² is a suitable bound for the Pegasos analysis. This result applies to either SIM-CCCP or SIM-Heuristic with rank loss.
APPENDIX 2. Implementation Details
In this appendix, we further discuss the design and implementation of SIM.

Average Support Instance Initialization. The rank loss objective is nonconvex, and hence the optimum found may depend on the starting point. In preliminary experiments, we tried starting with random weights, then computing the first support instances with max or softmax based on the random weights. However, a problem with this approach is that the support instance x̂_ij computed from random weights is unlikely to be a good representation of class j for bag i. This approach is prone to getting stuck in bad local optima. Also note that meaningful support instances cannot be computed from W = 0. Our proposed SIM algorithms instead use an average model, x̂_ij = (1/n_i) Σ_{x ∈ X_i} x, which does not depend on the weights. After solving the convex problem once with these average support instances, the weights are good enough to compute support instances with max or softmax in subsequent iterations.
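The three support-instance computations (average for initialization; max and softmax thereafter) can be sketched as follows. The function name and layout are ours, not the paper's implementation:

```python
import numpy as np

def support_instance(X, w, model="average"):
    """Compute the support instance x_hat for one bag X (rows are instances)
    and one class weight vector w, under three aggregation models."""
    if model == "average":
        # Initialization that needs no weights: the mean instance of the bag.
        return X.mean(axis=0)
    scores = X @ w
    if model == "max":
        # The single instance with the highest response for this class.
        return X[int(np.argmax(scores))]
    if model == "softmax":
        # Convex combination of instances weighted by softmax of their scores
        # (shifting by the max score keeps the exponentials from overflowing).
        e = np.exp(scores - scores.max())
        return (e / e.sum()) @ X
    raise ValueError(model)
```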
Warm Start. Going from one outer iteration of CCCP or the heuristic to the next,
we start at the weights with the lowest objective evaluation on the convex problem
from the previous iteration. We do this because projected subgradient descent is not
guaranteed to monotonically decrease the convex objective (an upper bound on the
objective is guaranteed to decrease monotonically). Hence the best solution may occur
in some iteration of subgradient descent other than the last.
Using an Approximate Solver within CCCP. To show that CCCP monotonically decreases the objective, it is assumed that the convex problem in each iteration is solved exactly [Sriperumbudur and Lanckriet 2009]. However, we are using an approximate solver in each iteration. We can still ensure a monotonic decrease in the original DC problem by running a sufficient number of iterations of subgradient descent. Recall that the convex problem solved in each iteration of CCCP uses a linear upper bound of the concave part of the objective at the current point, which touches the original DC objective at the current point. Hence at iteration t of CCCP, we have h_RL(W^(t)) = h^(t)_{RL,cccp}(W^(t)). At iteration τ of subgradient descent within iteration t of CCCP, the objective value for the convex problem is h^(t)_{RL,cccp}(W^(t,τ)). If this objective is less than the DC objective at the start of the iteration of CCCP, i.e., h^(t)_{RL,cccp}(W^(t,τ)) ≤ h^(t)_{RL,cccp}(W^(t)), then τ is a sufficient number of iterations to ensure that the DC objective is monotonically decreasing, because

h_RL(W^(t+1)) ≤ h^(t)_{RL,cccp}(W^(t,τ)) ≤ h^(t)_{RL,cccp}(W^(t)) = h_RL(W^(t)).    (64)
In our implementation of CCCP, we start by running K iterations of subgradient descent. If K iterations are enough to decrease the h_RL objective, we move on to the next iteration of CCCP. If not, we repeatedly run K more iterations until either the h_RL objective decreases, or the total number of iterations reaches K_max. In the latter case, we terminate CCCP and return the best solution found so far. We do not use this scheme of increasing the number of iterations of subgradient descent for the heuristic optimization method (it always uses exactly K iterations).
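The safeguarded outer loop described above can be sketched as follows. Here `h_RL` and `solve_convex_K_steps` are stand-ins for the paper's rank-loss objective and Pegasos-style subproblem solver, not its actual code:

```python
def cccp_outer_loop(W0, h_RL, solve_convex_K_steps, K=50, K_max=400, max_outer=20):
    """Outer CCCP loop with the monotonicity safeguard described above.
    solve_convex_K_steps(W, steps) runs `steps` subgradient iterations on the
    current convex upper bound and returns the best iterate found."""
    W, best = W0, W0
    for _ in range(max_outer):
        steps = K
        cand = solve_convex_K_steps(W, steps)
        # Keep adding K more iterations until the DC objective actually decreases.
        while h_RL(cand) >= h_RL(W) and steps < K_max:
            steps += K
            cand = solve_convex_K_steps(W, steps)
        if h_RL(cand) >= h_RL(W):
            break  # hit K_max without a decrease: return the best solution so far
        W = cand
        if h_RL(W) < h_RL(best):
            best = W
    return best
```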
Feature Rescaling. We apply the following preprocessing to the instance features. First, we transform each feature to the range [0, 1]. Next, we apply the same feature rescaling process used in the Convex Learning from Partial Labels Toolbox (for ambiguous label classification [Cour et al. 2011]), which centers the data and scales each feature by 1/√( Σ_{i=1}^m ‖x_i‖² ). When the random Fourier kernel features are used, we first apply the preceding process, then apply the transform z, then repeat the process from
the CLPL toolbox a second time. We have observed that the proposed methods are sensitive to feature scaling, and have found these processing steps effective.
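A sketch of this preprocessing chain; the function name is ours, and our reading of the CLPL scaling constant as a single factor 1/√(Σ_i ‖x_i‖²) applied after centering is an assumption:

```python
import numpy as np

def rescale_features(X):
    """Preprocess an (m, d) instance-feature matrix: map each feature to [0, 1],
    then center and scale by 1/sqrt(sum_i ||x_i||^2), mirroring the CLPL toolbox
    step (the exact scaling constant is our assumption)."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # guard against constant features
    X01 = (X - lo) / span                    # each feature in [0, 1]
    Xc = X01 - X01.mean(axis=0)              # center
    scale = np.sqrt((Xc ** 2).sum())         # sqrt of total squared norm
    return Xc / scale if scale > 0 else Xc
```

For the kernelized variant, this would be applied once to the raw features, then again to the random Fourier features z(x).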
Numerical Issues with softmax. Due to large exponents, numerical overflow may occur when computing the softmax weights as

σ^j_iq = e^{w_j · x_iq} / Σ_{x′ ∈ X_i} e^{w_j · x′}.

This problem is more likely to occur when the regularization parameter λ is small, because the weights are constrained to a ball with a large radius. An equivalent formula for calculating the softmax weights that avoids numerical issues is

σ^j_iq = e^{w_j · x_iq − b} / Σ_{x′ ∈ X_i} e^{w_j · x′ − b},  where b = max_{x ∈ X_i} w_j · x.    (65)
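Eq. (65) translates directly into a numerically stable implementation (the function name is ours):

```python
import numpy as np

def softmax_weights(X, w_j):
    """Per-instance softmax weights sigma^j_iq for one bag X (rows are instances)
    and class weight vector w_j, using the shifted form of Eq. (65)."""
    s = X @ w_j
    e = np.exp(s - s.max())   # subtract b = max_q w_j . x_iq before exponentiating
    return e / e.sum()
```

Subtracting b makes every exponent nonpositive, so no term can overflow, while the common factor e^{-b} cancels in the ratio.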
ACKNOWLEDGMENTS
We would like to thank Matthew Betts, Sarah Frey, Adam Hadley, and Jay Sexsmith for their help in collecting the HJA data, Iris Koski for labeling the data, Katie Wolf for her work on noise reduction, and Lawrence Neal for his work on segmentation.
REFERENCES
Andrews, S., Tsochantaridis, I., and Hofmann, T. 2002. Support vector machines for multiple-instance learning. In Proceedings of Advances in Neural Information Processing Systems 15, 561–568.
Briggs, F., Fern, X., and Raich, R. 2012a. Rank-loss support instance machines for MIML instance annotation. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 534–542.
Briggs, F., Lakshminarayanan, B., Neal, L., Fern, X., Raich, R., Hadley, S., Hadley, A., and Betts, M. 2012b. Acoustic classification of multiple simultaneous bird species: A multi-instance multi-label approach. J. Acoust. Soc. Amer. 131, 4640.
Carroll, L. 1896. Through the Looking-Glass: And What Alice Found There. Macmillan UK.
Chang, C.-C. and Lin, C.-J. 2001. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol. 2, 3.
Cheung, P. and Kwok, J. 2006. A regularization framework for multiple-instance learning. In Proceedings of the 23rd International Conference on Machine Learning. 193–200.
Cour, T., Sapp, B., Jordan, C., and Taskar, B. 2009. Learning from ambiguously labeled images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 919–926.
Cour, T., Sapp, B., and Taskar, B. 2011. Learning from partial labels. J. Mach. Learn. Res. 12, 1225–1261.
Dalal, N. and Triggs, B. 2005. Histograms of oriented gradients for human detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Vol. 1. 886–893.
Demšar, J. 2006. Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7, 1–30.
Dietterich, T., Lathrop, R., and Lozano-Pérez, T. 1997. Solving the multiple instance problem with axis-parallel rectangles. Artif. Intell. 89, 1–2, 31–71.
Du, L., Ren, L., Dunson, D., and Carin, L. 2009. A Bayesian model for simultaneous image clustering, annotation and object segmentation. In Proceedings of Advances in Neural Information Processing Systems 22, 486–494.
Elisseeff, A. and Weston, J. 2001. A kernel method for multi-labelled classification. In Proceedings of Advances in Neural Information Processing Systems 14, 681–687.
Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., and Zisserman, A. 2010. The Pascal visual object classes (VOC) challenge. Intern. J. Comput. Vis. 88, 2, 303–338.
Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., and Lin, C.-J. 2008. LIBLINEAR: A library for large linear classification. J. Mach. Learn. Res. 9, 1871–1874.
Frey, P. W. and Slate, D. J. 1991. Letter recognition using Holland-style adaptive classifiers. Mach. Learn. 6, 161.
Frost, R. 1916. Mountain Interval. Henry Holt and Company.
Hüllermeier, E. and Beringer, J. 2006. Learning from ambiguously labeled examples. Intell. Data Anal. 10, 5, 419–439.
Li, Y., Ji, S., Kumar, S., Ye, J., and Zhou, Z. 2009. Drosophila gene expression pattern annotation through multi-instance multi-label learning. In Proceedings of the 21st International Joint Conference on Artificial Intelligence. 1445–1450.
Li, Y., Hu, J., Jiang, Y., and Zhou, Z. 2012. Towards discovering what patterns trigger what labels. In Proceedings of the 26th AAAI Conference on Artificial Intelligence.
Nguyen, N. 2010. A new SVM approach to multi-instance multi-label learning. In Proceedings of the 10th IEEE International Conference on Data Mining (ICDM). 384–392.
Rahimi, A. and Recht, B. 2007. Random features for large-scale kernel machines. In Proceedings of Advances in Neural Information Processing Systems 20. 1177–1184.
Shalev-Shwartz, S. and Singer, Y. 2007. Logarithmic regret algorithms for strongly convex repeated games. Tech. rep., The Hebrew University.
Shalev-Shwartz, S., Singer, Y., and Srebro, N. 2007. Pegasos: Primal estimated subgradient solver for SVM. In Proceedings of the 24th International Conference on Machine Learning. 807–814.
Shalev-Shwartz, S., Singer, Y., Srebro, N., and Cotter, A. 2011. Pegasos: Primal estimated subgradient solver for SVM. Math. Prog. 127, 1, 3–30.
Shen, C., Jiao, J., Wang, B., and Yang, Y. 2009. Multi-instance multi-label learning for automatic tag recommendation. In Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics (SMC).
Sriperumbudur, B. and Lanckriet, G. 2009. On the convergence of the concave-convex procedure. In Proceedings of Advances in Neural Information Processing Systems 22, 1759–1767.
Vezhnevets, A., Buhmann, J., and Zurich, E. 2010. Towards weakly supervised semantic segmentation by means of multiple instance and multitask learning. In Proceedings of the Conference on Computer Vision and Pattern Recognition.
Vijayanarasimhan, S. and Grauman, K. 2009. What's it going to cost you? Predicting effort vs. informativeness for multi-label image annotations. In Proceedings of CVPR.
Wang, W. and Zhou, Z. 2012. Learnability of multi-instance multi-label learning. Chinese Sci. Bull., 1–4.
Winn, J., Criminisi, A., and Minka, T. 2005. Object categorization by learned universal visual dictionary. In Proceedings of the 10th IEEE International Conference on Computer Vision (ICCV). 1800–1807.
Xu, X., Xue, X., and Zhou, Z. 2011. Ensemble multi-instance multi-label learning approach for video annotation task. In Proceedings of the 19th ACM International Conference on Multimedia. 1153–1156.
Yakhnenko, O. 2009. Learning from text and images: Generative and discriminative models for partially labeled data. Ph.D. thesis, Iowa State University.
Yang, S., Zha, H., and Hu, B. 2009. Dirichlet-Bernoulli alignment: A generative model for multi-class multi-label multi-instance corpora. In Proceedings of Neural Information Processing Systems. 2143–2150.
Yang, S., Bian, J., and Zha, H. 2010. Hybrid generative/discriminative learning for automatic image annotation. In Proceedings of the Conference on Uncertainty in Artificial Intelligence.
Yuille, A. and Rangarajan, A. 2002. The concave-convex procedure (CCCP). In Proceedings of Advances in Neural Information Processing Systems 2, 1033–1040.
Zha, Z., Hua, X., Mei, T., Wang, J., Qi, G., and Wang, Z. 2008. Joint multi-label multi-instance learning for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1–8.
Zhang, M. and Zhou, Z. 2008. M3MIML: A maximum margin method for multi-instance multi-label learning. In Proceedings of the 8th IEEE International Conference on Data Mining (ICDM). 688–697.
Zhou, Z. 2004. Multi-instance learning: A survey. Tech. rep., AI Lab, Department of Computer Science and Technology, Nanjing University.
Zhou, Z. and Zhang, M. 2007. Multi-instance multi-label learning with application to scene classification. In Proceedings of Advances in Neural Information Processing Systems.
Zhou, Z., Zhang, M., Huang, S., and Li, Y. 2012. Multi-instance multi-label learning. Artif. Intell. 176, 1, 2291–2320.
Received September 2012; revised December 2012; accepted March 2013