Representation and Recognition of Compound Spatio-Temporal Entities
Josiah Wang
M.Sc. in Cognitive Systems
Session 2006/2007

The candidate confirms that the work submitted is their own and the appropriate credit has been given where reference has been made to the work of others. I understand that failure to attribute material which is obtained from another source may be considered as plagiarism. (Signature of student)

Abstract
A compound spatio-temporal entity can be described as a group of objects interacting as a coherent entity, such as a queue. This thesis concerns the representation and recognition of compound spatio-temporal entities, and presents a method to learn these entities from example videos. This work is novel in the sense that there is no known work dealing with compound spatio-temporal entities. Various approaches to tackling the problem are discussed. These include representing a compound spatio-temporal entity as a sequence of occurring events, and using grammar parsing to localise the entity. The proposed solution is also extended to facilitate learning of representations from example videos. In this respect, event sequences are represented as bags of n-grams as a substitute for the grammar-driven approach, owing to data sparseness. Schemes for temporal localisation and classification using the n-gram model are also presented. The proposed solutions are successfully evaluated on temporal localisation and classification tasks.

Acknowledgements
I would like to extend my deepest gratitude to:

• God, who always works in mysterious ways, and who constantly provided me with inspiration when I needed it the most.
• My parents, for their endless support and love.
• Prof. David Hogg, for always answering my questions patiently no matter how insignificant they may seem, and for always guiding me towards the right path.
• The Project LAVID team – Prof. Tony Cohn, Roberto Fraile, Hannah Dee and Krishna – for their insightful comments, suggestions, ideas, feedback and discussions during their (almost) weekly meetings.
• Prof. Roger Boyle, for all the useful feedback and suggestions.
• The various talented actors and actresses who volunteered to participate in my video shoot. I truly appreciate your time and effort.
• The M.Sc. class of 2006/07, for their stimulating conversations, advice and casual talk throughout the year.
• Academic staff of the School of Computing – Prof. Ken Brodlie, Mr. Eric Atwell, Dr. Martyn Clark, Dr. Katja Markert, Dr. Marc de Kamps and Dr. Nick Efford – from whom I have gained invaluable knowledge throughout the programme.
• Anybody else whom I have failed to mention, but who has helped me in one way or another.

Contents
1 Introduction
  1.1 Overview
    1.1.1 Compound Spatio-Temporal Entities
    1.1.2 Aim and Objective
  1.2 Research Methodology and Schedule
  1.3 Thesis Outline

2 Background and Previous Work
  2.1 Background
    2.1.1 Object Recognition
    2.1.2 Queues
  2.2 Representing Spatio-temporal Scenes
  2.3 Representing Compound Spatio-temporal Entities
    2.3.1 First Order Logic
    2.3.2 String Grammars
    2.3.3 Graph Grammars
  2.4 Recognising Compound Spatio-temporal Entities from Visual Data
  2.5 Object Tracking
  2.6 Learning from Data

3 Representing and Recognising Compound Spatio-Temporal Entities
  3.1 Overview
  3.2 Preprocessing
    3.2.1 Dataset
    3.2.2 Tracking Objects in Video
    3.2.3 Mapping of Image Coordinates to Ground Plane Coordinates
  3.3 Qualitative Scene Representation
    3.3.1 Spatial Relationships
    3.3.2 Temporal Relationships
    3.3.3 Combining Spatial and Temporal Relationships
  3.4 General Representation
    3.4.1 Graph Grammar Representation
    3.4.2 Queue as a Sequence of Events
      3.4.2.1 Inferring Events from Changes in State
      3.4.2.2 Representing Sequences of Events as a Grammar
      3.4.2.3 Addition of "meet" Event
  3.5 Recognition
    3.5.1 Recognising Complete Sequences
    3.5.2 Localising Subsequences
  3.6 Implementation
    3.6.1 Preprocessing and Scene Representation
    3.6.2 Generation of Sequence of Events
    3.6.3 Localisation

4 Learning from Examples
  4.1 Overview
  4.2 Graph Grammar Induction
  4.3 Learning Events
    4.3.1 Feature Vector
    4.3.2 Clustering
  4.4 Learning Sequences of Events
    4.4.1 String Grammar Induction
    4.4.2 N-gram Modelling
  4.5 Recognition
  4.6 Reduction of Noise in Event Sequences
  4.7 Implementation
    4.7.1 Generation of Feature Vectors
    4.7.2 Clustering
    4.7.3 N-gram Modelling

5 Evaluation
  5.1 Overview
  5.2 Datasets
  5.3 Temporal Localisation of Queues
    5.3.1 Evaluation Measures
    5.3.2 Rule-based Approach
    5.3.3 Machine Learning Approach
  5.4 Classification of Queues and Non-Queues
    5.4.1 Evaluation Measures
    5.4.2 Evaluation of N-gram Based Classifier
  5.5 Summary

6 Conclusions
  6.1 Summary
  6.2 Future Work

Bibliography

A Reflection
B Interim Report
C Dataset
D Detailed Results for Nearest Neighbour Classifier

Chapter 1
Introduction
1.1 Overview
1.1.1 Compound Spatio-Temporal Entities
This thesis involves representing and recognising compound spatio-temporal entities. For the purpose of this work, we define a compound spatio-temporal entity as a group of objects — animate or inanimate — interacting directly or indirectly with each other over time as an emergent, coherent entity. The entity is "compound" because it involves more than a single object. It is "spatio-temporal" in the sense that each component extends over both space and time. Examples of compound spatio-temporal entities include the formation of queues at a checkout counter or the stacking of containers onto a cargo ship.
1.1.2 Aim and Objective
This work aims to propose a suitable representation for a compound spatio-temporal entity, with queues as an example application. It also aims to recognise the represented compound entity from a video sequence. An extension to this is to learn the concept of a compound entity from training data (i.e. videos), whether in a supervised or unsupervised manner.

1.2 Research Methodology and Schedule
Since this work is strongly research-oriented, conventional software development methodologies are not applicable. Instead, a customised rapid-prototyping model is more appropriate. Figure 1.1 illustrates the custom methodology used for this work. The methodology involves four primary stages:

Background Research. This stage involves identifying and understanding the problem to be tackled. A review of existing literature is required to assess its suitability as a solution for this research-based work.

Proposal of Solution. This stage involves proposing a solution to the problem based on relevant literature. Potential problems with the proposed solution are also identified at this stage.

Implementation. This stage involves implementing the proposed solution as a proof of concept. In most cases, existing software or libraries are used. Many data conversion tasks may require coding new programs from scratch.

Evaluation. This stage involves evaluating the implementation of the proposed solution using suitable evaluation criteria. The results are compared with ground truth.

This methodology is not merely a linear pipeline comprising these four stages; it is a continuous process. Many iterations are required to arrive at the final solution(s). Background research is constantly done and reviewed, and new solutions may be proposed at any time. Implementations are also often refined after evaluation to achieve better results. In addition, a completely different solution can be proposed and implemented after evaluating a prior solution. The results of this new solution can then be compared with the results of previous solutions.

Each proposed solution in this work is made up of two primary components: representation and recognition. These two components are strongly dependent on each other. An additional component involves learning the representation from training data. Background research and preliminary work were done throughout the second semester, and full-time work was done during the summer. The first solution was proposed, implemented and evaluated in June. The following months were used primarily to extend the solution to include learning from data, to carry out evaluation, and to propose alternative solutions.

Figure 1.1: The custom rapid-prototyping model used throughout this research.
1.3 Thesis Outline
This thesis aims to propose suitable high-level representations for compound spatio-temporal entities and to recognise them from video sequences. This chapter serves as an introduction to the thesis by defining the term "compound spatio-temporal entity", and by giving an overview of the methodology used for this research. The remainder of this thesis is organised into the following chapters:

Chapter 2 presents a literature review of related work. This includes qualitative representation of scenes, representations for spatio-temporal entities, recognition of various representations, object tracking and learning from training data.

Chapter 3 proposes several solutions with respect to representing and recognising compound spatio-temporal entities. It describes the pre-processing work required to obtain coordinates of entities from video sequences, and to convert these image coordinates to ground-plane coordinates. It then presents various solutions to tackle the problem. It also gives a critical evaluation of each proposed solution. Finally, it describes the rule-based implementation of the proposed solutions.

Chapter 4 describes the extension to facilitate learning from training data. Besides building on the solution from Chapter 3, some changes and alternative representations are also proposed to cope with various problems. The implementation is also discussed.

Chapter 5 describes the evaluation results of the implementations from Chapters 3 and 4. Evaluation measures are defined, and the results of each implementation are compared and evaluated on temporal localisation and classification tasks.

Chapter 6 concludes this thesis by presenting a summary. It also makes recommendations for future work.

Chapter 2
Background and Previous Work
2.1 Background
2.1.1 Object Recognition
Object recognition is a classical problem in Computer Vision. The main idea behind object recognition is to locate an object in an image, a video frame or a video sequence. Object recognition often involves extracting low-level features from training images and videos, and representing these features as a descriptor, e.g. a SIFT descriptor [24]. These descriptors are then used to describe the object. Various machine learning techniques such as support vector machines (SVM) are used to recognise instances of these objects.

This project aims at tackling a slightly different problem. As opposed to classical object recognition, which recognises an individual object, we are more interested in recognising a group of objects as an emergent, coherent entity. More specifically, this work aims at recognising the concept of a chosen compound entity within a video sequence. The problem is complicated in the sense that the individual components of the compound entity need to first be identified. Spatial relationships between these components should also be taken into account. In addition, compound entities are not static entities; spatial relationships between components evolve over time, and not all components are constantly present throughout the whole period of activity. Hence there is also a need to consider temporal relations between static relationships within different time intervals.

This problem is in fact a novel take on the classical recognition problem, and there has not been much prior research tackling it. This project could prove useful in many real-world applications such as security surveillance. However, these real-world applications are beyond the scope of this project. This work will instead concentrate on the recognition of one example compound spatio-temporal entity, namely queues.
2.1.2 Queues
Not much literature can be found on the representation of queues. Most information on queues involves the abstract data structure used primarily in algorithms and programming. This is unrelated to our task, as we are more interested in the "waiting line" definition of a queue. Alternatively, a branch of Operations Research known as Queueing Theory [32] seems more relevant to this thesis. Queueing Theory involves theoretical analysis and simulation of waiting lines. Since this work is about neither operations nor simulations, it serves only as an inspiration.

For the purpose of this thesis, we define a queue as a compound spatio-temporal entity involving at least three people, and exhibiting two kinds of behaviour: the formation of a queue and the serving of a queue. The formation of a queue occurs when a person joins the end of a line comprising at least two people. The serving of a queue, on the other hand, is characterised by the detachment of the first person in the waiting line, followed by an "update" sequence in which the remaining members of the queue step forward to fill the gap left by the person before them. The occurrence of at least one of these behaviours is sufficient for a queue, although a combination of them is desirable. With this definition, a row of people moving forward in a uniform manner (e.g. soldiers marching) is not considered to be a queue.
2.2 Representing Spatio-temporal Scenes
Qualitative representation concerns the use of qualitative descriptions, as opposed to quantitative descriptions, to deal with knowledge. Both qualitative spatial and temporal representations are driven by the fact that everyday knowledge about the physical world is often described in qualitative terms rather than in numerical terms [6] (e.g. she is beside the boy, he arrived before her). The choice of a qualitative representation over a quantitative representation is due to its intuitive nature: it is more natural, for example, to describe a queue as something where "a person is in front of another person" as opposed to "a person is 2.54 metres apart from another person".

The domain of qualitative temporal reasoning has long contributed encouraging results. One of the most well-known qualitative temporal representations is Allen's Interval Calculus [1], which is used to reason about relations between two time intervals. It comprises a total of thirteen binary relations between time intervals, i.e. before, meets, overlaps, during, starts, finishes, each of their inverses, and equals. Figure 2.1 shows an illustration of these interval relationships.

Figure 2.1: Thirteen relations of Allen's Interval Calculus, and an illustration of these relations. Adapted from [1].

In contrast, qualitative spatial reasoning is more complex because of the multi-dimensionality of space: it can be used to describe distances, topologies, sizes, shapes, etc. Most work on spatial reasoning in the literature deals with only a single aspect. For example, the Cardinal Direction Calculus [23] deals only with directionality. Another well-known spatial calculus is the Region Connection Calculus (RCC, or RCC-8) proposed by Randell et al. [35], which is an interval logic for reasoning about topological relations in space. RCC-8 is made up of a set of eight dyadic relations, i.e. disconnected (DC), externally connected (EC), partially overlapping (PO), identical (EQ), tangential proper part (TPP), non-tangential proper part (NTPP) and the inverses of both tangential and non-tangential proper parts (TPP⁻¹ and NTPP⁻¹). It is also possible to reason about spatio-temporal continuity between these RCC-8 relations [13], enabling explanation of spatio-temporal data over time. This ability enhances its attractiveness for representing spatio-temporal relationships. Figure 2.2 shows the eight relations of RCC-8 and the possible transitions between them.

Figure 2.2: Illustration of the relations of RCC-8 and their possible transitions. Reproduced with permission from [7].

RCC-8 serves as a useful inspiration for modelling relationships between components of compound entities. However, it may be too fine-grained for our problem, as the distinction between being externally connected and partially overlapping is not essential. It is sufficient to use a simplified version rather than all eight relations of RCC-8. The same can be said of Allen's Interval Calculus. There may perhaps be a need to introduce a new, simple relation relevant to the task.
2.3 Representing Compound Spatio-temporal Entities
Apart from the representation of relations between entities, there is also a need to consider the representation of an instance of the compound entity within a scene. In addition, a general representation of a compound entity is needed which can represent all instances in a condensed way.

2.3.1 First Order Logic
First order logic is a natural choice for the representation of knowledge. In most of the literature dealing with qualitative relations, the relations are represented in first order logic. For example, Magee et al. [25] use Prolog statements to represent objects, states, actions, etc. and infer different actions using logical induction. The ability of first order logic to represent both spatial and temporal aspects of knowledge (e.g. various relations between spatial entities at different time intervals), as well as to produce sound inferences, makes it a potential candidate for the task at hand. However, a good representation of the different states will first need to be devised in order for it to perform well.
2.3.2 String Grammars
A string grammar is the specification of a formal language comprising strings of characters. Formally, a string grammar G can be defined as a 4-tuple (V, Σ, P, S), where V is a finite set of variables or non-terminals, Σ is the vocabulary of terminal symbols, P is a set of production rules, and S is the start variable. A language is the set of valid strings which can be generated by its grammar. More detailed descriptions of string grammars can be found in [39].

String grammars could be an ideal representation for this work. A prerequisite, however, is that instances of the compound spatio-temporal entity are represented as string sequences. Whilst the temporal aspect of the entity forms a linear sequence, the spatial aspects do not. It is nevertheless a possible solution if the spatial aspects can be represented as character strings, and the temporal aspects as sequences of these strings. A valid grammar can then be constructed to generalise this representation of a compound spatio-temporal entity.
2.3.3 Graph Grammars
Graph grammars are a more expressive generalisation of string grammars. Whereas a string grammar describes a linear sequence of data, graph grammars can be used to describe non-sequential data. A graph grammar is a formal specification of rules that define a valid graph. It can be used to parse a graph or to generate a valid graph. Figure 2.3 shows an example of a graph grammar rule. Graph grammars were first introduced by Pfaltz and Rosenfeld in [33] for representing images. There are many applications of graph grammars, such as optical music recognition [2] and biology [10]. More recently, graph grammars have been used extensively in defining visual languages; many variations of graph grammars have been proposed for this purpose [20, 36, 47]. In computer vision, graph grammars have been used to represent images, e.g. for image parsing of scenes [12]. However, there is no known work on graph grammar representation of scenes in videos. Nevertheless, the successful use of graph grammars in existing applications is a major motivation for using graph grammars to represent compound spatio-temporal entities.

Figure 2.3: An example production rule of a graph grammar.
2.4 Recognising Compound Spatio-temporal Entities from Visual Data
The recognition aspect of this work is heavily dependent on the chosen representation. A general representation of a compound entity in first order logic will require the use of a logical reasoner or inference engine to match instances of the compound entity against the general representation. Various implementations of Prolog, such as SICStus Prolog (http://www.sics.se/isl/sicstuswww/site/index.html) or SWI-Prolog (http://www.swi-prolog.org/), can be used for this purpose. On the other hand, a string grammar representation will require the construction of a parser for recognition purposes, i.e. to determine whether an instance belongs to the grammar, also known as the membership problem. Similarly, graph grammar parsers are required to match an instance of a compound entity against graph grammar specifications. A major setback in this respect is that there are no known implementations of graph grammar parsers; most parsers discussed in the literature are either theoretical in nature [27] or too specialised for common use [47]. In fact, Rozenberg and Welzl [37] claim that the membership problem is NP-hard even for restricted classes of graph grammars. Hence, graph grammar based solutions may not be ideal for this work at this time.

2.5 Object Tracking
A prerequisite to reasoning about compound spatio-temporal entities is to track the individual components of the entity (e.g. people). A popular method involves image-based or motion-based segmentation at the pixel level. McKenna et al. [28] track moving people by employing background subtraction using colour and gradient information. Stauffer and Grimson [41] model pixels as a mixture of Gaussians; pixels which do not fit the background model are considered foreground. Apart from pixel-level segmentation, there is also model-based tracking, which concerns fitting a generalised model of an object, or a hypothesis, to an image sequence. Cootes et al.'s Active Appearance Model [9] builds a deformable statistical model and matches the model to a new image by predicting changes to parameters based on residual errors. Wren et al. [45] use blob models to represent different parts of the body. More recently, object tracking has been used in combination with detection to deal with scene occlusion and moving cameras [22, 46].

Despite these advances, object tracking remains an unsolved problem. In many cases occlusion is a problem; this can be handled using multiple cameras [21], or by reasoning about temporal continuity at a higher level [3]. In addition to occlusion, most trackers in the literature work well only in controlled environments, although some work has been done to deal with crowded scenes (e.g. [48]). Due to time constraints, this thesis assumes that objects in the scene have already been tracked.
2.6 Learning from Data
As opposed to manually defining rules governing compound spatio-temporal entities, another approach is to allow the system to learn the configuration of a compound spatio-temporal entity from examples. Various approaches to machine learning, such as decision trees, clustering, instance-based learning, neural networks and graphical models, are possible; they are discussed in depth in [4, 29] and will not be covered here due to space constraints.

In computer vision, learning is often used in pattern and object recognition. For example, the Adaboost learning algorithm is used by Viola and Jones in [43] to build classifiers for face detection. Support vector machines (SVM) are also frequently used to perform classification, and are often applied to object recognition (e.g. in [34]). More recently, Needham et al. [30] use Inductive Logic Programming to learn protocols of activities by observation. Although many other works in computer vision deal with learning, one recent approach which may be useful to our task is described in a paper by Hamid et al. [11]. They propose a framework for discovering activities in an unsupervised manner by representing activities as bags of n-grams and modelling an activity as a variable memory Markov chain. As our task is similar, in the sense that compound spatio-temporal entities can be considered "activities", it is possible to learn the concepts associated with compound spatio-temporal entities using the same approach. Besides this, it is also possible to classify compound spatio-temporal entities using the various machine learning techniques mentioned.

Chapter 3
Representing and Recognising Compound Spatio-Temporal Entities
3.1 Overview
This chapter proposes several solutions for representing and recognising compound spatio-temporal entities. Preprocessing tasks will first be discussed. A qualitative spatio-temporal representation of scenes from videos will then be presented, and several compact representations of compound spatio-temporal entities will be proposed. Recognition of compound entities will then be discussed. Finally, an overview of the implementation of these proposed solutions is given.
3.2 Preprocessing
3.2.1 Dataset
For the purpose of this research, footage of people forming and leaving queues has been recorded. The result, after some minor editing, is a collection of videos which serves as the dataset for this thesis. Some videos clearly contain queues, whilst others contain some noise. In addition to queues, videos of non-queues have also been captured for evaluation purposes. The content of these videos is discussed in more detail in Chapter 5.
3.2.2 Tracking Objects in Video
Although it would be ideal to automatically track objects within a video, this is beyond the scope of this work. This thesis therefore assumes that the objects have already been tracked, and concentrates instead on high-level representation of, and reasoning about, the behaviour of objects. A manual approach is used to annotate the position of objects in each frame. This is done using an open-source ground-truth authoring tool named ViPER-GT (http://viper-toolkit.sourceforge.net/products/gt/). The output is an XML-based file containing various metadata, such as the objects and their positions in each frame.
3.2.3 Mapping of Image Coordinates to Ground Plane Coordinates
Videos are recorded from a fixed angle and perspective. Therefore, two people standing near the camera may appear further apart than those further away from the camera. Hence there is a need to project the locations of entities onto the ground plane to eliminate the effects of skewing and perspective. A 2-dimensional homography is used to achieve this. A homography is the mapping between a point projected on one camera and the same point on another camera. Formally, a 2D homography can be defined as the relation

\lambda \begin{bmatrix} x_2 \\ y_2 \\ 1 \end{bmatrix} = H \begin{bmatrix} x_1 \\ y_1 \\ 1 \end{bmatrix}    (3.1)

where

H = \begin{bmatrix} a & b & c \\ d & e & f \\ g & h & i \end{bmatrix}

is the homography matrix, (x_1, y_1) and (x_2, y_2) are the coordinates of the same point from two different viewpoints, and \lambda is a constant scalar value.

For this work, a homography is used to map image coordinates (from video) to ground plane coordinates. As the camera viewpoint is identical for all videos from the same dataset, a single homography matrix can be calculated for each dataset and used to map between image coordinates and ground plane coordinates. Four pairs of points are chosen from the video, and their equivalent ground plane coordinates are defined. These point correspondences are used to determine the homography matrix by solving Equation 3.1. The matrix is calculated manually, and verified separately using MATLAB. The initial homography matrix produced some problems due to differences in scale between the x-axis and y-axis, causing disparities between distances for different queue orientations. This was rectified by selecting better ground plane coordinates to produce a uniform scale for both axes. The homography matrices used in this research for the first and second datasets are

H_1 = \begin{bmatrix} 0.0060 & -0.0048 & 1.2879 \\ 0.0032 & 0.0153 & -5.6228 \\ -0.0001 & 0.0033 & -0.1511 \end{bmatrix}  and  H_2 = \begin{bmatrix} 0.0059 & -0.0049 & 1.2107 \\ 0.0036 & 0.0147 & -5.0811 \\ -0.0001 & 0.0030 & 0.0280 \end{bmatrix}

respectively, with \lambda set to 1.
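To make the mapping concrete, the sketch below shows how Equation 3.1 can be applied to a tracked point in Java (the language of the implementation). The class and method names are illustrative rather than taken from the thesis code, and the homogeneous scale factor w plays the role of λ and is divided out:

    // Illustrative sketch (not the thesis code): applying a 3x3 homography H
    // to map an image point (x1, y1) to ground plane coordinates, as in Eq. 3.1.
    public final class Homography {
        private final double[][] h; // 3x3 homography matrix

        public Homography(double[][] h) {
            this.h = h;
        }

        /** Maps an image point to ground plane coordinates (homogeneous normalisation). */
        public double[] toGroundPlane(double x1, double y1) {
            double x2 = h[0][0] * x1 + h[0][1] * y1 + h[0][2];
            double y2 = h[1][0] * x1 + h[1][1] * y1 + h[1][2];
            double w  = h[2][0] * x1 + h[2][1] * y1 + h[2][2]; // the scale factor lambda
            return new double[] { x2 / w, y2 / w };
        }
    }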
3.3 Qualitative Scene Representation
With coordinates of entities available, there is now a need to represent scenes from video in a form which supports reasoning. For this work, we are concerned with representing qualitative spatial relationships between entities, as well as temporal relationships between different states. An ideal data structure for these relationships is a graph. Vertices in the graph can represent objects, instances of objects and different time states. Edges, on the other hand, can be used to represent spatial and temporal relationships.
3.3.1 Spatial Relationships
RCC-8 [35] is concerned with representing relationships between regions qualitatively. However, our work deals with points as opposed to regions. Points are used in this research because they are sufficient as a solution; the use of regions does not bring any additional advantage and would only increase the computational complexity of our work. In addition, the relationships of RCC-8 are too fine-grained for our task. In defining relations between components, it is sufficient to reason about whether one component is interacting with another. Therefore, a dyadic "connected to" relation is adequate to represent this interaction. The connectedness of two entities is determined by measuring the distance between them. A value of 0.4250 has been empirically determined to be the maximum distance (in ground plane coordinates) allowed for two entities to be considered "connected to" each other. Disconnectedness between components is implicit if there is no "connected to" relation between them.

A new "formerly connected to" relation was introduced later as the research progressed. The rationale behind this addition is to represent the fact that an object was previously connected to another. This extension makes the representation more informative, as the relation can be used to determine whether a member was previously a component of the compound entity. It enables differentiation between people reconnecting and people connecting for the first time. This additional information may prove crucial in the recognition aspect of our work.

A graph is used to represent these spatial relationships at a time state. Two types of vertices are involved: an "entity" vertex to represent an object which spans several time states, and an "instance" vertex to represent the instance of an object bound to a particular time state. "Entity" and "instance" vertices are linked with an edge named "is instance of". Spatial relationships between instances are represented with the edges "connected to" and "formerly connected to". Figure 3.1 illustrates an example graph representing spatial relationships between instances at a time state.

Figure 3.1: Visualisation of a graph representing spatial relationships between instances of entities.
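A minimal sketch of the connectedness test described above, assuming points are represented as ground plane (x, y) pairs (class and constant names are illustrative, not taken from the thesis code):

    // Two entities are "connected to" each other if their ground plane distance
    // is at most the empirically chosen threshold of 0.4250.
    public final class SpatialRelations {
        private static final double CONNECTED_THRESHOLD = 0.4250; // ground plane units

        /** Returns true if two ground plane points are "connected to" each other. */
        public static boolean connected(double[] p, double[] q) {
            double dx = p[0] - q[0];
            double dy = p[1] - q[1];
            return Math.sqrt(dx * dx + dy * dy) <= CONNECTED_THRESHOLD;
        }
    }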

3.3.2 Temporal Relationships
In addition to spatial relationships, temporal relationships are required to connect different time states. It is impractical to record the state of the spatial relationships for every single frame in a video sequence. Therefore, in this thesis, a new time state is recorded only when the spatial relationships between entities change. It is sufficient to use a "before" relation to connect the time state vertices as a sequential pipeline. Figure 3.2 illustrates this.

Figure 3.2: Visualisation of a graph representing temporal relationships between time states.
3.3.3 Combining Spatial and Temporal Relationships
The graphs from Sections 3.3.1 and 3.3.2 can be combined to form a spatio-temporal representation of the scene. This is done by connecting each "instance" vertex with a "time" vertex, denoting the time state corresponding to the instance. The spatio-temporal graph contains three types of vertices: "instance", "entity", and "time". It also contains several types of edges: "connected to", "formerly connected to", "is instance of", "at time", and "before". The first two kinds of edges are undirected, and only connect two "instance" vertices. The remaining edges are directed: "is instance of" connects an "instance" to an "entity"; "at time" connects an "instance" to a "time"; and "before" connects two "time" vertices. Figure 3.3 shows an example graph of two successive time states.
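For illustration only, the vertex and edge typing described above could be captured with plain Java data structures such as the following. All names are hypothetical; the actual implementation uses the JGraphT library, as described in Section 3.6.1.

    import java.util.ArrayList;
    import java.util.List;

    // The three vertex types and five edge types of the spatio-temporal graph.
    enum VertexType { INSTANCE, ENTITY, TIME }
    enum EdgeType { CONNECTED_TO, FORMERLY_CONNECTED_TO, IS_INSTANCE_OF, AT_TIME, BEFORE }

    final class Vertex {
        final VertexType type;
        final String label;   // e.g. "person1@t3", "person1", "t3"
        Vertex(VertexType type, String label) { this.type = type; this.label = label; }
    }

    final class Edge {
        final EdgeType type;
        final Vertex from, to; // for the two undirected edge types the order is irrelevant
        Edge(EdgeType type, Vertex from, Vertex to) { this.type = type; this.from = from; this.to = to; }
    }

    final class SceneGraph {
        final List<Vertex> vertices = new ArrayList<>();
        final List<Edge> edges = new ArrayList<>();
        Vertex addVertex(VertexType t, String label) { Vertex v = new Vertex(t, label); vertices.add(v); return v; }
        void addEdge(EdgeType t, Vertex a, Vertex b) { edges.add(new Edge(t, a, b)); }
    }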
3.4 General Representation
The graph representation in Section 3.3 denotes an instance of a scene containing a queue. There is also a need to generalise this representation to produce a compact definition of a queue in general. Section 3.4.1 discusses an initial proposed solution and the problems with that approach. Section 3.4.2 proposes an alternative solution, which forms part of the final solution to the task.

Figure 3.3: Visualisation of a spatio-temporal graph representing a scene.
3.4.1 Graph Grammar Representation
An initial proposed solution is to define compound spatio-temporal entities using graph grammars. This solution seems ideal, since graph grammars can be used to generalise the graph representation in Section 3.3. A more in-depth analysis, however, revealed otherwise. The proposed spatio-temporal graph representation requires at least a context-sensitive graph grammar. This is due to the multi-dimensionality of the graph: there are multiple edges pointing to a single "time" or "entity" vertex. Thus, a context-sensitive graph grammar for this graph may require a rule as shown in Figure 3.4 (cf. Figure 2.3). As discussed in Section 2.4, there are no known context-sensitive graph grammar parsers at this time, so this is not a viable solution at the moment.

Figure 3.4: An example production rule of a context-sensitive graph grammar.

One way to restrict the class of the graph grammar is to simplify the graph representation altogether. As opposed to having "time" and "entity" nodes, each "instance" vertex can be linked to another "instance" vertex in the subsequent time state using an "equal" relation, which also implicitly represents a "before" relation. This proposal is illustrated in Figure 3.5. The representation significantly reduces the number of edges in the graph, and enables it to be represented with a context-free graph grammar. An attempt has been made to construct a graph grammar for this simplified representation by hand; part of it is shown in Figure 3.6. The main problem, however, is that the grammar requires many external annotations to specify certain conditions or restrictions (e.g. that the degree of a node must be zero). In addition, it is error-prone, and cannot handle the slightest bit of noise. Hence, as anticipated by our research methodology, a completely different solution needed to be proposed by exploring other alternatives.

Figure 3.5: Visualisation of a simplified spatio-temporal graph.

Figure 3.6: Part of the proposed context-free graph grammar.

3.4.2 Queue as a Sequence of Events
Because of the problems with the graph grammar approach, a completely different solution is proposed in this section. As opposed to representing a queue as a sequence of states, it can instead be represented as a sequence of perceived events. For example, an instance of a queue could be described by a sequence of events such as "joining, joining, leaving, moving forward, reconnecting". This removes the complexities of defining a graph grammar by encapsulating the activities and removing graph-specific details. In this case, string grammars are sufficient to represent the sequence of events.

3.4.2.1 Inferring Events from Changes in State

An event can be inferred from the changes between two states. In this context, an event answers the question "what is happening between two successive states?" For example, a "join" event can be inferred when a disconnected entity connects to a queue in a subsequent time state. Nine events have been identified for this work. They are:

• join (join): a person joins the back of a queue (of at least 2 people),
• leave (lv): the head of a queue (of at least 3 people) disconnects,
• intermediate disconnection (idc): a person in the middle of a queue disconnects,
• reconnect (rc): a person reconnects to another,
• connect (con): 2 people, both disconnected from anybody else, connect to each other,
• disconnect (dc): 2 people connected only to each other disconnect,
• appear (app): a person appears in the scene,
• vanish (van): a person completely leaves the scene, and
• undefined (undefined): any unknown event.

These events are merely labels: they could just as easily have been represented as numbers or symbols. They are represented here using natural language for clarity. There was initially another event, "forward", which describes a person in the middle of a queue stepping forward to fill the void left by the person in front of him or her. However, it has since been merged with "reconnect" due to the strong similarities between the two events.

The assignment of events is rule-based. A rule in this context is a particular pattern between the subgraphs of two consecutive time states. Several distinct rules may denote the same event. Figure 3.7 illustrates some of the rules used in this thesis to assign appropriate events to changes between two consecutive time states.

Figure 3.7: Examples of several rules and their corresponding events.

3.4.2.2 Representing Sequences of Events as a Grammar

With a sequence of events in place, it remains to define a grammar which governs the valid sequences of events to be considered a queue. The complexity in providing this definition is that it must accommodate different instances of queues; the grammar should be lenient enough to handle different versions of queues, but strict enough to reject non-queues. In addition, the grammar should also provide some allowance for handling noise. An initial string grammar is proposed after studying the dataset and considering various scenarios. The grammar, with rules expressed as regular expressions, is as follows:

Queue := Noise ("join" Noise | "lv" Noise "idc" Noise ("idc" Noise "rc" Noise)* "rc" Noise)+
Noise := ("app" | "van" | "con" | "dc")*

The grammar considers a "join" event to be a minimal and sufficient requirement for identifying the formation of a queue. In the case of a queue being served, a particular sequence of events is required: the head of the queue leaves, the others temporarily disconnect from the person behind them to step forward, and the last person simply steps forward. The sequence "lv idc (idc rc)* rc" is used to describe this. Events such as "appear", "vanish", "connect" and "disconnect" are considered to be noise. It is quite possible for any of these to occur at any time during the sequence, and this has therefore been taken into consideration during the construction of the grammar.

A problem with this initial grammar, however, is that a "join" event cannot occur while a queue is being served. This is in fact quite possible, and has occurred in several videos. Some revisions have been made to rectify this. The final proposed grammar for a queue is:

Queue := Noise ("join" Noise | "lv" Noise2 "idc" Noise2 ("idc" Noise2 "rc" Noise2)* "rc" Noise2)+
Noise := ("app" | "van" | "con" | "dc")*
Noise2 := ("app" | "van" | "con" | "dc" | "join")*

The representation of compound spatio-temporal entities as event sequences eliminates the problems with the graph grammar approach. However, the encapsulation process also obscures some detailed information, such as which components are involved in the queue. This makes spatial localisation difficult, as individual components cannot be identified.
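Because the grammar above is regular, its membership test can also be expressed compactly as a Java regular expression, with each event label mapped to a single character. The sketch below is purely illustrative (the actual implementation uses a JavaCC-generated parser, described in Section 3.6.3); the class name, character codes and mapping are assumptions made for this example.

    import java.util.Map;
    import java.util.regex.Pattern;

    // Compact illustration of the final queue grammar as a Java regular expression.
    public final class QueueGrammar {
        private static final Map<String, String> CODES = Map.of(
                "join", "J", "lv", "L", "idc", "I", "rc", "R",
                "app", "A", "van", "V", "con", "C", "dc", "D");

        // Noise  := [AVCD]*        Noise2 := [AVCDJ]*
        // Queue  := Noise ( J Noise | L Noise2 I Noise2 (I Noise2 R Noise2)* R Noise2 )+
        private static final Pattern QUEUE = Pattern.compile(
                "[AVCD]*(?:J[AVCD]*|L[AVCDJ]*I[AVCDJ]*(?:I[AVCDJ]*R[AVCDJ]*)*R[AVCDJ]*)+");

        /** Returns true if the event sequence is a valid queue according to the grammar. */
        public static boolean isQueue(String[] events) {
            StringBuilder sb = new StringBuilder();
            for (String e : events) {
                sb.append(CODES.getOrDefault(e, "?")); // unknown events fail the match
            }
            return QUEUE.matcher(sb).matches();
        }
    }

For instance, the sequence app join lv idc rc maps to "AJLIR" and is accepted as a queue.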

3.4.2.3 Addition of "meet" Event

An observed problem during testing is the large number of "join" events within non-queue sequences. This is a problem for the rule-based system, as the existence of a single "join" event is deemed sufficient for a queue. Some wrongly assigned "join" events are shown in Figure 3.8. The main problem is that the rules governing the assignment of the "join" event are too lax. One solution is to define a stricter rule for "join": a person can only "join" a queue if everybody within the queue is connected to at most one person. In addition, more events can be introduced to replace "join". A new event introduced as a result of this change is "meet", used to describe a person joining a group of people who are not standing in a single line. This does not fully fix the problem of "join" events within non-queue sequences, as such cases still exist: whilst Figure 3.8(a) is rightfully assigned a "meet" event, Figures 3.8(b) and 3.8(c) are still considered "join" events. Further possible improvements may be to include the notion of directionality to assist in the assignment process. There is also the possibility of revising our hand-crafted grammar to increase the minimum number of "join" events which must occur for a sequence to be considered a queue. However, neither is investigated because of time constraints.

(a) The leftmost person "joins" the "queue" on the right. (b) The fourth person from the left "joins" the left "queue". (c) The rightmost person "joins" the couple on her left.

Figure 3.8: Example scenes of misclassified "join" events. Of the three examples, only (a) is successfully changed to "meet" in the updated version.
3.5 Recognition
With a representation of a queue in place, some means of recognising its existence in a video is required. There are several possible problems to tackle in the recognition aspect of our work. These include answering the following questions:

1. Does a queue exist in the video? (Classification)
2. When does a queue occur, if it exists? (Temporal localisation)
3. Which objects are part of the queue when it exists? (Spatial localisation)

The main focus of this work is temporal localisation. Classification is also investigated in Chapter 4. Because of time constraints, spatial localisation is outside the scope of this research.

3.5.1 Recognising Complete Sequences
Based on the proposed representation, recognising a complete sequence of events as a queue involves parsing the string sequence using a customised parser for the proposed grammar. This is essentially a membership problem, i.e. validating whether a sequence belongs to the grammar. A parser needs to be constructed which takes a string of tokens (each token representing an event) and constructs a valid parse tree for the sequence based on the defined grammar. The output is a "yes/no" answer, i.e. whether the sequence is a queue.
3.5.2 Localising Subsequences
Section 3.5.1 deals with a complete sequence of events: it validates whether a whole sequence is a queue. In reality, this is not always the case. It is possible that only a partial segment of the video contains an instance, or perhaps several instances, of a queue. Thus, the proposed solution should also be able to handle such cases.

Algorithm 1: Recognise queues within subsequences of a sequence of events
Input: Sequence of events, input
Output: Collection of start & end indices of subsequences identified as queues, Q
 1   Q ← {}
 2   index ← 0
 3   while index < input.length do
 4       end-indices ← {}
 6       for i ← index+1 to input.length do
 7           if substring(index, i) is a valid sequence then
 8               append i to end-indices
 9           end
10       end
11       append (index, max(end-indices)) to Q
13       index ← max(end-indices)
14   end

This work uses a greedy approach to search for potential queue candidates (i.e. subsequences). The proposed algorithm tries to find the largest non-overlapping subsequences that successfully match the defined grammar. The candidate subsequences are validated using the parser from Section 3.5.1. Algorithm 1 describes the initial proposed algorithm. In this algorithm, the start index is initially pivoted at the beginning of the input whilst the whole sequence is scanned from left to right (line 6). The start index then moves to the end of the largest queue found (line 13), and the process repeats itself.

The main problem with Algorithm 1 is its increased time complexity, as it repeatedly scans the input from left to right and chooses the largest valid subsequence. A better method is to scan from right to left and choose the first subsequence found, which also happens to be the largest. This idea is exploited in Algorithm 2. Figure 3.9 illustrates the execution of both algorithms and gives a visual comparison of their time complexities. Algorithm 2 is favoured since it executes much faster on average because of the reduced number of irrelevant comparisons. The algorithm returns a list of start and end indices of subsequences deemed to contain instances of queues. These event indices are mapped to their equivalent frame numbers, which indicate the occurrence of queues.

Algorithm 2: Recognise queues within subsequences of a sequence of events (version 2)
Input: Sequence of events, input
Output: Collection of start & end indices of subsequences identified as queues, Q
 1   Q ← {}
 2   index ← 0
 3   while index < input.length do
 5       for i ← input.length down to index do
 6           if substring(index, i) is a valid sequence then
 7               append (index, i) to Q
 8               break
 9           end
10       end
12       if i == index then index ← index + 1 else index ← i
13   end

Figure 3.9: Illustration of Algorithms 1 & 2. Algorithm 2 has a lower average time complexity.
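As a concrete illustration, Algorithm 2 translates almost directly into Java. The sketch below uses hypothetical class and method names, and assumes a membership test such as the regular-expression sketch from Section 3.4.2.2 (the actual system uses the JavaCC-generated parser described in Section 3.6.3):

    import java.util.ArrayList;
    import java.util.List;

    public final class QueueLocaliser {
        /** Returns [start, end) index pairs of the largest non-overlapping queue subsequences. */
        public static List<int[]> localise(String[] events) {
            List<int[]> result = new ArrayList<>();
            int index = 0;
            while (index < events.length) {
                int i = events.length;
                // scan right to left: the first valid subsequence found is the largest
                for (; i > index; i--) {
                    if (isValidQueue(events, index, i)) {
                        result.add(new int[] { index, i });
                        break;
                    }
                }
                index = (i == index) ? index + 1 : i; // advance past the match, or by one
            }
            return result;
        }

        // Placeholder for the grammar membership test on the subsequence [from, to).
        private static boolean isValidQueue(String[] events, int from, int to) {
            String[] slice = java.util.Arrays.copyOfRange(events, from, to);
            return QueueGrammar.isQueue(slice); // assumes the sketch from Section 3.4.2.2
        }
    }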
3.6 Implementation
The proposed solution is implemented in Java. Java is chosen for its useful collection of libraries, as well as the ease with which it integrates third-party packages. Multiple subprograms have been written for different parts of the solution, and they are later combined into a single command-line program for ease of use.
3.6.1 Preprocessing and Scene Representation
A program has been written to convert the coordinates of each person in each frame from the annotated data file (in XML format) produced by ViPER-GT to ground plane coordinates. For this task, the homography matrices from Section 3.2.3 are used. The output is a collection of "Entity" data structures containing the ground plane coordinates of each object at each frame in which it is present. A text file containing this information is also produced by the program for analysis purposes. These tasks took a total of about 100-200 lines of code.

The next stage is to produce a high-level representation of the scene as a graph data structure, as described in Section 3.3. An open source Java library, JGraphT (http://jgrapht.sourceforge.net/), is used for this task. It is chosen for its support for several graph data structures, including directed and undirected graphs, as well as various useful graph operations. The graph can easily be serialised to different file formats such as the ASCII-based Graph Modelling Language (http://www.infosun.fim.uni-passau.de/Graphlet/GML/gml-tr.html) or the XML-based GraphML (http://graphml.graphdrawing.org/). For this stage, the Java program generates a new state whenever a change in spatial relation occurs between any two entities; each state is linked to the next state using a "before" relation. The complete generation of the graph data structure required roughly 200 lines of code.
3.6.2 Generation of Sequence of Events
In this stage, the program produces the sequence of events associated with a video. To achieve this, it scans through the graph from the previous stage two successive time states at a time. It analyses these two time states in terms of the differences between their edges and vertices, and assigns an appropriate event as described in Section 3.4.2.1. The equivalent frame number for each time state is also stored for use in the recognition stage. The complexity of analysing the graph data structure, together with the rule-based approach to assigning events, resulted in about 400 lines of code for this stage alone.
3.6.3 Localisation
The recognition stage of the program involves constructing a customised parser as well as implementing Algorithm 2 from Section 3.5.2 to localise queues within a video. To ease the otherwise tedious process of constructing a parser, a Java-based parser generator known as JavaCC (https://javacc.dev.java.net/) is used for this task. JavaCC generates a parser based on the provided grammar and lexical specifications. The implementation of Algorithm 2 involves reading an input, breaking it up into different candidate subsequences to be validated by the parser, and returning the indices of events corresponding to queues. The mapping from event indices to frame numbers is done using the information stored in the previous stage.

Chapter 4
Learning from Examples
4.1 Overview
In Chapter 3, we proposed a high-level scene representation using a spatio-temporal graph data structure. We also explored a rule-based approach to representing events and sequences of events. In addition, the recognition aspect of the work involves validating sequences using a grammar parser. The primary problem with rule-based approaches is that they are ad hoc and may not deal well with varied instances. In addition, it would be difficult to extend the system to recognise different compound spatio-temporal entities. It would be better if compound entities could be recognised simply by learning from example videos. This chapter extends the work in Chapter 3 by exploring the possibilities of incorporating learning into our proposed solution.
4.2 Graph Grammar Induction
In Section 3.4.1, an initial proposal for representing compound spatio-temporal entities as graph grammars was discussed. Although this solution is not viable as a rule-based approach, it is still worth investigating the idea from a machine learning point of view. Not many algorithms exist for unsupervised graph grammar induction. Some attempts can be found in [5] and [15]. Jonyer et al. [16] propose a graph grammar inference algorithm called SubdueGL, based on frequent common substructures in graphs, and have shown a real-world application of the algorithm by inferring a graph grammar for protein sequences in [17]. Nevertheless, SubdueGL is only able to infer context-free graph grammars. Therefore, it is not suitable for our proposed representation, which requires at least a context-sensitive graph grammar. In addition, no full, working implementation of the algorithm is available at this time; we have failed to acquire an implementation from the author, and implementing the algorithm ourselves would be costly time-wise. However, there exists an implementation of the SUBDUE algorithm for substructure discovery [8], on which SubdueGL is based. To examine its potential, a graph of an example scene from our Chapter 3 implementation was generated, converted to SUBDUE's native file format, and input into SUBDUE. Some substructures produced by SUBDUE are noteworthy, and are visualised in Figure 4.1. However, it is concluded that although the results show some promise, this is not a suitable solution for the task at this time. A better solution needs to be proposed.

Figure 4.1: Substructures found using SUBDUE. Figure (a) possibly depicts a "leave" or "disconnect" event. Figure (b) can be inferred as a "join" or "connect" event.
4.3 Learning Events
Chapter 3 proposed a solution in which an instance of a queue is represented as an event sequence. Events are assigned using rules defined from patterns between two successive time states. This section proposes a method to learn these rules and their corresponding events from example data. By learning from examples, the process of defining rules and assigning events can be automated. The proposed method involves generalising events in an unsupervised fashion by means of clustering.
4.3.1 Feature Vector
Like many machine learning algorithms, clustering requires a feature vector to describe an object, or in this case, an event. Since events are defined as patterns emerging from the subgraph of two successive time states, the feature vector is similarly derived from various graph relations, each forming an attribute of the feature vector. The attributes are as follows (a minimal code sketch of this extraction is given at the end of this subsection):

1. Difference between number of vertices. This attribute is used mainly to describe events such as "appear" and "vanish". It is initially represented as an integer, i.e. the result of subtracting the number of vertices in state_{t+1} from the number of vertices in state_t. Nevertheless, since the exact number of vertices is not significant for our task, only three possible values are used in the final solution: positive (an object has vanished from the scene), negative (a new vertex has appeared) and zero.

2. Difference between number of edges. This attribute is used primarily to distinguish events like "join", "leave", "connect" and "disconnect". It is calculated by subtracting the number of edges in state_{t+1} from the number of edges in state_t. Like the first attribute, the three possible values are positive, negative and zero.

3. Graph difference. This attribute denotes the difference between the graphs for state_t and state_{t+1} (i.e. graph for state_t minus graph for state_{t+1}). It returns the edges which exist in state_t but not in state_{t+1}. The possible values for this attribute are the edge types "connected to", "formerly connected to" or "empty". For simplicity, this work assumes that there is at most one change in spatial relation between two time states; if the result happens to contain more than one edge, only the first edge is chosen.

4. Graph difference (inverse). This is similar to the attribute above, except that it is now the inverse (i.e. graph for state_{t+1} minus graph for state_t). This attribute denotes edges in state_{t+1} which do not exist in state_t. Both this and the previous attribute are used to reason about the change in spatial relationships between two entities.

5. Edges of neighbouring nodes. This attribute takes the two adjacent vertices sharing the same edge from the graph difference between two time states, and evaluates the edges of other vertices adjacent to these two vertices. The possible values are: "connected to", "formerly connected to", "both" (indicating that a vertex is incident to both types of edges) and "empty". The outputs for the two vertices are listed as two separate attributes.

For each pair of successive time states, a feature vector with the aforementioned attributes is generated. A scene will therefore contain a list of feature vectors, each representing the equivalent of an event. Figure 4.2 illustrates an example feature vector for two successive time states.

Figure 4.2: Example feature vector for two successive time states.
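As an illustration, the sketch below assembles the first four nominal attributes from two successive state graphs. The graph representation (vertex identifiers plus labelled edges) and all names are simplified assumptions rather than the actual implementation, and attribute 5 is omitted for brevity.

```java
import java.util.*;

// Sketch: feature-vector extraction for two successive time states (Section 4.3.1).
// Vertex/edge representation and labels are simplified assumptions.
public class EventFeatureExtractor {

    // An undirected labelled edge between two tracked objects.
    record Edge(int a, int b, String label) {}   // label: "connected_to" or "formerly_connected_to"

    static String sign(int diff) {
        return diff > 0 ? "positive" : diff < 0 ? "negative" : "zero";
    }

    /** Builds the first four nominal attributes described above. */
    static List<String> extract(Set<Integer> verticesT, Set<Edge> edgesT,
                                Set<Integer> verticesT1, Set<Edge> edgesT1) {
        List<String> features = new ArrayList<>();
        // 1. Difference between number of vertices (state_t minus state_{t+1}).
        features.add(sign(verticesT.size() - verticesT1.size()));
        // 2. Difference between number of edges.
        features.add(sign(edgesT.size() - edgesT1.size()));
        // 3. Graph difference: first edge present in state_t but not in state_{t+1}.
        features.add(firstMissing(edgesT, edgesT1));
        // 4. Inverse graph difference: first edge present in state_{t+1} but not in state_t.
        features.add(firstMissing(edgesT1, edgesT));
        // Attribute 5 (edges of neighbouring nodes) would inspect the vertices incident
        // to the differing edge; it is omitted here for brevity.
        return features;
    }

    static String firstMissing(Set<Edge> from, Set<Edge> to) {
        for (Edge e : from) if (!to.contains(e)) return e.label();
        return "empty";   // assume at most one change in spatial relation
    }

    public static void main(String[] args) {
        // Two people connected at time t; the connection is broken at time t+1.
        Set<Integer> vT = Set.of(1, 2, 3), vT1 = Set.of(1, 2, 3);
        Set<Edge> eT  = Set.of(new Edge(1, 2, "connected_to"));
        Set<Edge> eT1 = Set.of(new Edge(1, 2, "formerly_connected_to"));
        System.out.println(extract(vT, eT, vT1, eT1));
        // [zero, zero, connected_to, formerly_connected_to] -> a "disconnect"-like pattern
    }
}
```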
4.3.2 Clustering
Events are learnt by clustering instances, with each resulting cluster denoting an event. Each feature vector acts as an instance to be clustered, and a set of instances is produced by combining the feature vectors from all training data. This work uses the classic k-means clustering technique [44], chosen because it is effective and sufficient for the task. The technique requires defining the number of clusters to be sought, i.e. the parameter k. Ideally, k should equal the total number of events, although noise and variations of events should also be taken into account. It was decided empirically that k be set to 18 for this experiment. It is also possible to automate the selection of k by iteratively incrementing its value and choosing the value with maximum error below some threshold; however, this is beyond the scope of this thesis.

Assigning events to a video sequence is then a matter of comparing the change between two time states and assigning the cluster (label) whose centroid has the minimum distance (e.g. Euclidean distance) to the corresponding feature vector. As the attribute values are nominal, the distance is either 0 (same) or 1 (different) for each attribute. The output of the assignment is a sequence of cluster labels, each representing an event learnt from the clustering process. Some example sequences of clustered events are illustrated in Figure 4.3.

Figure 4.3: Examples of rule-based event sequences and their corresponding clustered event sequences.
4.4 Learning Sequences of Events
The previous section concerns only the learning of events. It is still necessary to learn the valid sequences of events which govern the activity of a queue.
4.4.1 String Grammar Induction
In Chapter 3, we proposed the use of a string grammar to define valid sequences of events. The main problem with this hand-crafted approach is that it is rigid in its definition, and may not deal well with exceptional cases. It would be ideal if the grammar for queues could be inferred directly from positive examples (i.e. event sequences) of queues. Various algorithms for grammar induction, such as [18, 31, 42], have been proposed in the literature. More recently, Solan et al. have proposed the ADIOS algorithm [40], which has been used to learn context-free grammars from natural language. However, the lack of example sentences has prevented the use of this algorithm in this work: our experiment using artificially generated sentences shows that at least 50 sentences are required for the algorithm to produce a decent grammar, whereas only 10 example sentences are available in our queue dataset. It is therefore not a feasible solution at this time, although the availability of more example videos could make it an attractive and even practical one.
4.4.2 N-gram Modelling
Since data sparseness is a major problem in this work, inferring grammars from examples is not an appropriate solution. To overcome the lack of examples, an n-gram based solution is proposed. Although n-gram models are most often used in natural language processing, many applications in other areas, including computer vision, are also possible. The use of bags of discrete event n-grams to represent activities has been successfully explored in [11], and the proposed solution for this research is based on this idea. N-grams are attractive for our task because they capture both temporal information and content from a given sequence. In our solution, an event n-gram represents a sequence of events of length n, and a model of a queue can then be represented by a histogram of event n-grams. Figure 4.4 gives an illustration of the idea. The model can be built in an unsupervised manner by generating overlapping event n-grams from training data and constructing histograms from this bag of n-grams. There remains the problem of deciding on the value of n. In our implementation, n is chosen to be 3 (i.e. trigrams): although higher-order n-grams provide more rigid temporal ordering information, they would also over-fit the training data [11], whereas lower-order n-grams (i.e. bigrams) reduce the temporal information and produce an over-generalised model.
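A minimal sketch of how overlapping trigrams might be extracted from a single event sequence and accumulated into a histogram is shown below; the event labels are illustrative only and not taken from the implementation.

```java
import java.util.*;

// Sketch: build a histogram of overlapping event trigrams from one event sequence.
// Event labels ("app", "con", "dc", ...) are illustrative.
public class NgramModel {

    static Map<String, Integer> trigramHistogram(List<String> events) {
        Map<String, Integer> histogram = new LinkedHashMap<>();
        for (int i = 0; i + 3 <= events.size(); i++) {
            String ngram = String.join("-", events.subList(i, i + 3));
            histogram.merge(ngram, 1, Integer::sum);   // count each overlapping trigram
        }
        return histogram;
    }

    public static void main(String[] args) {
        List<String> sequence = List.of("app", "con", "con", "dc", "con", "dc");
        System.out.println(trigramHistogram(sequence));
        // {app-con-con=1, con-con-dc=1, con-dc-con=1, dc-con-dc=1}
    }
}
```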
4.5 Recognition
The recognition aspect of our work with regard to the machine learning approach involves two components: temporal localisation and classification. Both involve generating the n-gram model from Section 4.4.2 from training data, and comparing the learnt model(s) with test data.

Figure 4.4: Modelling of an event sequence as an n-gram model. A list of n-grams is extracted from an example sequence, and a histogram of this set of n-grams is generated.

For the temporal localisation part of this work, a single n-gram model (i.e. a histogram) is generated from the example queue event sequences. The process of localising an instance of a queue concerns identifying clusters of valid n-grams in the test data. N-grams are considered valid if they are part of the learnt model's vocabulary; it is worth noting that the frequencies of n-grams do not play a role in this model. A consecutive sequence of a fixed number of valid n-grams indicates a cluster. The number of valid n-grams required to be considered a cluster is an empirical question: the occurrence of a single valid n-gram cannot be considered a cluster since it may just be a coincidence. A smaller cluster size will increase the acceptance of instances of brief queues, but is at the same time more susceptible to noise. In contrast, a larger cluster size may produce more false negatives, but at the same time reduces the number of false positives. Chapter 5 will discuss the results of evaluation using cluster sizes of 2, 3 and 4.

The classification problem, on the other hand, requires generating models for both queues and non-queues. Rather than generating a combined histogram, a histogram is generated for each instance. In this case, a common vocabulary needs to be determined from the union of all instances. It is also possible to use bit vectors instead of histograms; empirical results suggest that bit vectors, as opposed to histograms, do not generally make much difference and in fact produce slightly better results in some cases. For this reason, bit vectors are used in our implementation. Classifiers can then be trained on these bit vectors using appropriate machine learning techniques. The classifiers chosen are support vector machines, nearest neighbour, and decision trees. The Euclidean distance measure is used to compare histograms in most of the above cases. Another similarity metric, proposed by Hamid et al. in [11], is also used in a separate experiment. The equation is reproduced as follows:

\mathrm{sim}(A, B) = 1 - \frac{1}{|Y| + |Z|} \sum_{\nu_i \in Y \cup Z} \frac{|f(\nu_i \mid H_A) - f(\nu_i \mid H_B)|}{f(\nu_i \mid H_A) + f(\nu_i \mid H_B)}    (4.1)

where H_A and H_B are the histograms for two event sequences, Y and Z are the sets of n-grams with non-zero frequencies in H_A and H_B respectively, and f(\nu_i \mid H_A) and f(\nu_i \mid H_B) are the frequencies of the n-gram \nu_i in H_A and H_B. Classification can then be achieved using nearest neighbour, i.e. classifying the test data with the label of the most similar training instance. Bit vectors are not applicable for this measure since histograms are required; due to this difference in representation, it is treated as a separate experiment.
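The sketch below computes this similarity between two n-gram histograms. The normalising factor is interpreted as the combined number of non-zero n-grams in the two histograms, and the histogram representation is an assumption; it is a sketch of the measure rather than the implementation used in the experiments.

```java
import java.util.*;

// Sketch: histogram similarity of Equation 4.1 (after Hamid et al. [11]).
// Histograms map n-grams to their frequencies; the representation is an assumption.
public class HistogramSimilarity {

    static double similarity(Map<String, Integer> hA, Map<String, Integer> hB) {
        // Support: n-grams with a non-zero count in either histogram (Y union Z).
        Set<String> support = new HashSet<>(hA.keySet());
        support.addAll(hB.keySet());

        double sum = 0.0;
        for (String ngram : support) {
            double fA = hA.getOrDefault(ngram, 0);
            double fB = hB.getOrDefault(ngram, 0);
            sum += Math.abs(fA - fB) / (fA + fB);   // denominator > 0 for any ngram in the support
        }
        long sizeYplusZ = hA.values().stream().filter(v -> v > 0).count()
                        + hB.values().stream().filter(v -> v > 0).count();
        return 1.0 - sum / sizeYplusZ;              // 1.0 for identical histograms, lower otherwise
    }

    public static void main(String[] args) {
        Map<String, Integer> a = Map.of("con-dc-con", 3, "idc-rc-rc", 2);
        Map<String, Integer> b = Map.of("con-dc-con", 1, "van-van-van", 4);
        System.out.println(similarity(a, b));       // higher values indicate more similar sequences
    }
}
```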
4.6 Reduction of Noise in Event Sequences
A main problem with the current event sequences is the occurrence of a large number of "appear" and "vanish" events within them. These events may not contribute much to our representation; in fact, they might disrupt the n-gram model by appearing between more relevant sequences of events. In accordance with the research methodology, the solution is updated to discard these events. They are only retained if they happen to be boundary cases (i.e. the last person vanishes or the first person appears), so as to clearly indicate the start or end frames of a compound entity. For comparison, both the initial and the modified versions of the event sequences are evaluated in Chapter 5.
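For illustration, a minimal sketch of this filtering step is shown below. It assumes the boundary condition (first person appears, last person vanishes) has already been flagged when each event was generated; the event representation is illustrative, not the thesis implementation.

```java
import java.util.*;
import java.util.stream.Collectors;

// Sketch: discard "appear"/"vanish" events unless they are boundary cases.
// The boundary flag is assumed to be set when the event is generated.
public class EventFilter {

    record Event(String label, boolean boundary) {}

    static List<Event> removeNoise(List<Event> sequence) {
        return sequence.stream()
                .filter(e -> !(e.label().equals("appear") || e.label().equals("vanish"))
                             || e.boundary())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Event> seq = List.of(
                new Event("appear", true),    // first person enters: kept
                new Event("connect", false),
                new Event("appear", false),   // a passer-by appears mid-sequence: discarded
                new Event("leave", false),
                new Event("vanish", true));   // last person leaves the scene: kept
        System.out.println(removeNoise(seq).stream().map(Event::label).toList());
        // [appear, connect, leave, vanish]
    }
}
```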

4.7 Implementation
As in Chapter 3, the learning aspects of this work are implemented in Java. The implementation involves three phases: generating event instances, clustering these instances, and generating n-gram models from queue event sequences. The model is then trained and tested on various temporal localisation and classification tasks.
4.7.1 Generation of Feature Vectors
In addition to assigning events, the previous implementation has been extended to generate a feature vector for each event, as discussed in Section 4.3.1. This is done by evaluating each pair of successive subgraphs and assigning the appropriate attribute values to its feature vector; each pair of successive time states produces one instance. Since Weka (http://www.cs.waikato.ac.nz/ml/weka/) is used in the clustering stage, the output is formatted as a text file (in Weka's ARFF format) containing the set of instances to be clustered. An additional 200 lines of code were produced to accomplish this task.
4.7.2 Clustering
Weka's clustering implementation is chosen for this task because of its compatibility with Java and the ease of integrating the library within our implementation. More specifically, the SimpleKMeans class is used as our clusterer. The implementation merges the ARFF files produced by the feature vector generator for each example video into a single set of instances to be clustered. The system is then trained by clustering this set of instances into 18 clusters, with a random seed of 10; both these numbers are hard-coded. Each event is then assigned a numeric label, producing a sequence of clustered (numbered) events for each example video.
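A minimal sketch of how SimpleKMeans might be invoked from Java is given below. The choice of 18 clusters and a seed of 10 follows the description above; the ARFF file name and the surrounding scaffolding are assumptions, not the actual code.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;

// Sketch: cluster the merged event instances with Weka's SimpleKMeans (k = 18, seed 10),
// then label each instance with its cluster number. The file name is illustrative.
public class EventClusterer {
    public static void main(String[] args) throws Exception {
        Instances data = new Instances(new BufferedReader(new FileReader("all_events.arff")));

        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(18);   // one cluster per (generalised) event type
        kmeans.setSeed(10);
        kmeans.buildClusterer(data);

        // Assign a numeric event label to every instance, producing the clustered event sequence.
        for (int i = 0; i < data.numInstances(); i++) {
            int label = kmeans.clusterInstance(data.instance(i));
            System.out.println(i + " -> event " + label);
        }
    }
}
```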
4.7.3 N-gram Modelling
The generation of n-gram models is implemented by extracting n-grams from each event sequence using a brute-force algorithm. In the case of temporal localisation, the histogram comprises the union of n-grams from all training (queue) instances; this histogram acts as a mean prototype for queues. For classification tasks, a histogram is generated for each training instance, all of which share the same vocabulary.

Temporal localisation is implemented by scanning the input to find occurrences of at least k consecutive valid n-grams, where k is the minimum cluster size discussed in Section 4.5. These consecutive n-gram sequences are interpreted as instances of queues; the subprogram returns a list of start and end indices marking these instances, which are finally mapped to their equivalent frame numbers.

Classification involves generating a set of feature vectors to act as training and test data. Each feature vector is a bit vector for an event sequence. A text file (in Weka's ARFF format) is generated, with n-grams as attributes for each instance. Four versions of ARFF files are produced: they are generated from either the rule-based event sequences or the clustered event sequences, and for both cases, from the initial version and the modified version (with "appear" and "vanish" removed). These files are then evaluated using various implementations of machine learning algorithms in Weka, namely SMO (Weka's implementation of support vector machines), IB1 (nearest neighbour) and ID3 (decision trees), with the default options in all cases. In addition to the experiments in Weka, a further implementation is done in Java for the separate experiment mentioned towards the end of Section 4.5: histograms are generated instead of bit vectors, and the nearest neighbour approach is used for classification. Leave-one-out cross-validation is used in this case as it is easier to implement than 10-fold cross-validation.
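Returning to the temporal localisation scan described at the start of this subsection, the sketch below slides over a test event sequence, marks positions whose trigram is in the learnt vocabulary, and reports runs of at least k consecutive valid trigrams. The names, event labels and run-length bookkeeping are assumptions about how such a scan could look, not the thesis code.

```java
import java.util.*;

// Sketch: temporal localisation by finding runs of at least k consecutive valid trigrams.
// "Valid" means the trigram occurs in the vocabulary learnt from the training queues.
public class QueueLocaliser {

    /** Returns [startIndex, endIndex] pairs (inclusive event positions) of detected queues. */
    static List<int[]> localise(List<String> events, Set<String> vocabulary, int k) {
        List<int[]> detections = new ArrayList<>();
        int runStart = -1, runLength = 0;

        for (int i = 0; i + 3 <= events.size(); i++) {
            String trigram = String.join("-", events.subList(i, i + 3));
            if (vocabulary.contains(trigram)) {
                if (runLength == 0) runStart = i;
                runLength++;
            } else {
                if (runLength >= k) detections.add(new int[]{runStart, runStart + runLength + 1});
                runLength = 0;
            }
        }
        if (runLength >= k) detections.add(new int[]{runStart, runStart + runLength + 1});
        return detections;   // these indices would then be mapped back to frame numbers
    }

    public static void main(String[] args) {
        Set<String> vocab = Set.of("con-con-dc", "con-dc-con", "dc-con-dc");
        List<String> test = List.of("app", "con", "con", "dc", "con", "dc", "van", "app");
        for (int[] d : localise(test, vocab, 2))
            System.out.println("queue from event " + d[0] + " to event " + d[1]);
    }
}
```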

Chapter 5
Evaluation
5.1 Overview
This chapter reports on a series of formal experiments performed to evaluate the performance of the proposed solutions. The experiments involve both temporal localisation and classification tasks. This chapter begins by describing the available datasets for both queues and non-queues. It then goes on to report the temporal localisation and classification experiments. For each experiment, the evaluation measures used are described. The results of each experiment are then reported and discussed. This chapter concludes with a discussion on the performance of the system.
5.2 Datasets
As discussed briefly in Chapter 3, two datasets are used in this work: the first is a collection of 10 videos of queues forming and being served; the second is a collection of 14 videos without any queues. Both datasets are manually annotated using ViPER-GT. All scenes are acted out by volunteers; however, they are situated in a relatively unconstrained environment, and many of the videos contain passers-by. The resolution of all videos is 640x480 pixels, and the frame rate is 29 frames per second.

Figure 5.1: Examples of people walking through queues.

The first dataset (the queue dataset) was recorded early, in anticipation of its need, when the research had just commenced. Most videos depict scenes of people joining and leaving queues, except for three which only contain people leaving queues (being served). Two of the videos contain an instance where somebody disrupts the queue by walking through it (Figure 5.1). Several videos contain passers-by, all but one of them in the background (Figure 5.2). The length of the videos ranges from 802 to 2620 frames, with an average of about 1630 frames.
Figure 5.2: Examples of passers-by in the queue dataset: (a) passer-by in the background; (b) passer-by in the foreground.

The second dataset (the non-queue dataset) was filmed at the same location, albeit at a slightly different camera angle and scale. This dataset is used primarily for classification tasks. Its production was not in our schedule and took an additional week to record and annotate; this was not a major issue as there was plenty of buffer time from being ahead of schedule. The second dataset comprises various scenes involving more than one person: groups of people walking, people meeting up, people fighting, a marathon, etc. Some videos contain a combination of several scenes, and many also contain passers-by. The length of the videos is about 1245 frames on average, varying between 409 and 2025 frames. Some example scenes from this dataset are shown in Figure 5.3. Descriptions of each video in both datasets are provided in Appendix C.
Figure 5.3: Example scenes from the non-queue dataset.
5.3 Temporal Localisation of Queues
This experiment involves temporal localisation of queues in the queue dataset. More precisely, it evaluates how well the systems identify when a queue starts and ends. The rule-based system from Chapter 3 and the n-gram based system from Chapter 4 are both evaluated in this experiment.
5.3.1 Evaluation Measures
Several evaluation measures are chosen for this experiment. The first three criteria evaluate how well the system identifies the start frame (Start_TEST) and end frame (End_TEST) of a queue against the annotated ground truth (Start_GT and End_GT). These criteria are:

\text{Start Displacement (frames)} = Start_{TEST} - Start_{GT}    (5.1)

\text{End Displacement (frames)} = End_{TEST} - End_{GT}    (5.2)

\text{Error} = \frac{|Start_{TEST} - Start_{GT}| + |End_{TEST} - End_{GT}|}{\text{number of frames in video}} \times 100    (5.3)

Equations 5.1 and 5.2 evaluate the accuracy in identifying the start and end frames of a queue respectively. A positive displacement denotes the start or end frame being identified later than ground truth; a negative displacement indicates the frame being identified too early. Since start and end frames are evaluated separately, these measures are especially useful if a user is more interested in one than the other. Equation 5.3, on the other hand, combines the displacements (in absolute terms) for both start and end frames into a single measure, serving as an overall summary of the error the system makes when compared to ground truth. It is possible to modify the equation to give more weight to one of the components (start or end error); however, they are assumed to be of equal weight in this task.

These three criteria give an idea of how well the system localises a queue, but they are insufficient to provide a complete evaluation. In addition to the start and end frames, it is also important to evaluate how well the system localises the whole queue, i.e. whether the queue is fully or partially identified, and whether the recognised queue is indeed a queue. In this respect, two additional evaluation measures are proposed:

\text{Precision} = \frac{|\text{frames predicted as queues} \cap \text{frames annotated as queues}|}{|\text{frames predicted as queues}|} \times 100    (5.4)

\text{Recall} = \frac{|\text{frames predicted as queues} \cap \text{frames annotated as queues}|}{|\text{frames annotated as queues}|} \times 100    (5.5)

Precision and recall are frequently used measures in Information Retrieval [26], and are applicable to our task by treating frames as "information". Precision measures the percentage of predicted frames which are indeed queue frames; recall measures the proportion of queue frames in ground truth successfully predicted by the system. These measures complement the previous three by evaluating the completeness of the localised queue. All five evaluation criteria are applied to each video from the queue (first) dataset, and the results are summarised using standard statistical measures, e.g. mean, standard deviation and median.
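As a sketch, the frame-level measures above could be computed as follows for a single video, treating the predicted and ground-truth queue as inclusive frame intervals. The single-interval assumption and the frame numbers are purely illustrative.

```java
// Sketch: frame-level error, precision and recall (Equations 5.3-5.5) for one video,
// treating the predicted and ground-truth queue as inclusive frame intervals.
public class LocalisationMetrics {

    static int overlap(int startA, int endA, int startB, int endB) {
        return Math.max(0, Math.min(endA, endB) - Math.max(startA, startB) + 1);
    }

    public static void main(String[] args) {
        int startGT = 217, endGT = 1500;        // ground-truth queue interval (illustrative)
        int startTest = 91, endTest = 1430;     // predicted queue interval (illustrative)
        int framesInVideo = 2620;               // illustrative video length

        int intersect = overlap(startTest, endTest, startGT, endGT);
        double precision = 100.0 * intersect / (endTest - startTest + 1);
        double recall    = 100.0 * intersect / (endGT - startGT + 1);
        double error     = 100.0 * (Math.abs(startTest - startGT) + Math.abs(endTest - endGT))
                           / framesInVideo;

        System.out.printf("precision=%.2f%% recall=%.2f%% error=%.2f%%%n", precision, recall, error);
    }
}
```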

5.3.2 Rule-based Approach
Evaluation of the rule-based system from Chapter 3 is performed by providing the system with input data and analysing the output produced. Two versions of the implementation are evaluated and compared: the initial version and the modified version (with "appear" and "vanish" removed, as discussed in Section 4.6). As both "appear" and "vanish" events are treated as noise in the rule-based system, not much difference is expected between the two versions.

Out of the 10 videos from the queue dataset tested, only 8 were identified to contain queues. The queues in the remaining 2 videos could not be localised by the system because they contain at most two participants in the queue at any single point in time, whereas the defined grammar requires the presence of at least 3 persons in the queue. Table 5.1 summarises the results of the experiment for the 8 videos. The results for the initial and modified versions of the event sequences were similar for all measures except recall, and are hence displayed as a single summary, with recall reported separately for each version. In general, the system performed well, with a low mean error rate of 5.45%. The average precision was 98.88%, whilst the average recall was about 90%. The standard deviation of the measures, however, was considerably high.

            Start         End           Error   Precision  Recall     Recall
            Displacement  Displacement                     (initial)  (modified)
Mean        -15.75        -61.63        5.45%   98.88%     90.30%     89.40%
Median        0.00        -17.00        2.72%   100.00%    95.73%     95.73%
Std. Dev.    41.67        110.07        6.65%   2.96%      13.74%     16.03%

Table 5.1: Statistical summary of the evaluation of the rule-based system.

The system appears to have performed well in recognising the start frames of queues. As illustrated in Figure 5.4(a), the start frames of the queues in all except one video were correctly identified. Nevertheless, the results may be slightly skewed: the queues in 7 of the 8 videos start at the first frame. Only the queue in the outlier video has a start frame of 217; this was predicted as frame 91. This major discrepancy in that particular video is due to the system recognising the queue the moment two people walking side-by-side appear in the scene (Figure 5.5(a)), whereas ground truth only considers the queue formed in frame 217 (Figure 5.5(c)).

Figure 5.4: Distribution of Start and End Displacements.

On the other hand, the results for end displacement were much more varied, as can be seen from Figure 5.4(b). In this case, all displacements were negative, indicating that the system always recognised the end of a queue too early. In most cases, this was because people are annotated by the location of their feet, which gives the impression that a person has completely left the scene as soon as their feet can no longer be tracked, whereas people are considered in ground truth to have left the scene only when their body can no longer be seen. Setting aside this difference in definition, we can conclude that the system performed generally well. Nevertheless, there was also an outlier with a displacement of -348. This was due to a unique occurrence in the scene: a person who has walked through the queue joins the queue at a later time without leaving the scene (henceforth known as the "disruption" case). The consequence is that the person continues to form a "formerly connected to" relation with other potential heads of the queue, disrupting the "leave" event and causing an invalid sequence according to the grammar definition. This shows that the rule-based system does not handle such cases well. Figure 5.6 illustrates the distribution of error over the test data. Errors were generally below 10%, except for an outlier with an error of 21.09%, which is again the "disruption" case. We can conclude from these three measures that, except for the aforementioned cases, the rule-based system was able to identify the start and end frames with a reasonable degree of accuracy.

Figure 5.5: Displacement of the start frame in the outlier video. (a) Frame 97: the predicted start of the queue. (b) Frame 174: two people walking side-by-side before forming a queue in (c) frame 217, as defined in ground truth.

For precision, 7 of the 8 test videos achieved 100% precision, with the remaining video achieving 91.06%. This result shows that queues identified by the system are usually indeed queues; the only exception is the sole negative start displacement case discussed and illustrated earlier in Figure 5.5. In the case of recall, the average recall was 90.30% (initial version) and 89.40% (modified version) respectively. In both cases, the recall for each test video ranged from 87.41% to 99.73% excluding an outlier (55.35% and 48.19% respectively), which was the only difference between the two versions; this minor difference is due to the absence of an "appear" event. The outlier is the "disruption" case discussed earlier: only partial localisation of the queue was achieved in this case. Apart from this special case, the system performed well in completely localising queues in addition to identifying start and end frames.
5.3.3 Machine Learning Approach
This section discusses the evaluation of the machine learning approach from Chapter 4. Experiments were run using 10-fold cross-validation, where 9 example videos from the queue dataset were used to train the n-gram model whilst the remaining video was used for testing. Both rule-based and clustered event sequences were evaluated and compared. As with the rule-based approach, versions with and without "appear" and "vanish" events were also evaluated. In addition, as discussed in Section 4.5, the parameter k, which denotes the minimum cluster size required for a sequence of valid n-grams to be considered a queue, was varied from 2 to 4. The combination of these variations resulted in a set of 12 experiments.

Figure 5.6: Distribution of Errors by Video.

Table 5.2: Results of experiments on the n-gram based system with various key parameters.

The results are summarised in Table 5.2. Queues were found in 9 out of the 10 videos for all experiments except one, in which queues were identified in only 8 of the videos. In general, learning and localising with clustered event sequences resulted in poorer performance than with the rule-based version. This may be due to the increased number of represented events in the clustered version, which results in a greater variety of sequences and n-grams; localisation becomes more difficult since there is a greater chance of encountering an unseen n-gram in a test sequence. The choice of cluster size k also affected the results: performance deteriorated as k was increased, due to the increase in the number of false negatives. However, increasing k may also decrease the number of false positives, so a cluster size of 3 appears to be a good compromise. Finally, the omission of "appear" and "vanish" events generally enhanced precision and recall, but increased the error rate. The omission generally reduces the amount of noise, allowing a more complete localisation; the increase in error rate, however, was mainly due to failures in localising smaller clusters of valid n-grams occurring before an invalid sequence (e.g. the "disruption" case). These cases favour the initial version, since the presence of "appear" and "vanish" events produces a larger cluster of valid n-grams, increasing its likelihood of being localised.

The machine learning approach generally performed worse than its rule-based counterpart. Start displacements were zero in about 50% of the videos in most cases, with the remainder being positive displacements. In contrast, end displacements were negative in all cases except one; the reason for the disagreement between ground truth and the system in this exceptional case is depicted in Figure 5.7, where the last person is not considered part of the queue (Figure 5.7(c)). For both start and end displacements, the standard deviation was large. Precision for the n-gram based system was good: 100% precision was achieved in all cases except one (91.46%), the exception being due to the case in Figure 5.7. Recall was also generally good, with a mean above 70% in many cases. The n-gram based system may not have performed as well as the rule-based system, but this was expected as example videos of queues are scarce; the results may differ if more training data were available.
Figure 5.7: The sole positive end displacement case. The person entering in Figure (c) is not considered part of the queue according to ground truth, but was considered part of it by the n-gram based system.

5.4 Classification of Queues and Non-Queues
The following sets of experiments evaluate how well the system classifies queues and non-queues. Unlike the previous experiments, these do not involve temporal localisation, but only require a yes/no answer. They involve only the n-gram based system from Chapter 4. Videos from both the queue and non-queue datasets are used to train and evaluate the system.
5.4.1 Evaluation Measures
Since these experiments do not involve temporal localisation of queues, the evaluation measures proposed in Section 5.3.1 are not applicable. Instead, the outcome of each classification task can be represented as a confusion matrix [19]. In this thesis, confusion matrices are two-dimensional and of the following form:

                               Predicted
                               Queue            Non-Queue
Ground Truth   Queue           True Positive    False Negative
               Non-Queue       False Positive   True Negative

Table 5.3: Format of confusion matrices in this thesis.

In addition to the confusion matrix, several measures are used to evaluate the results:

\text{Accuracy} = \frac{\text{true positives} + \text{true negatives}}{\text{all test data}} \times 100    (5.6)

\text{True Positive Rate} = \frac{\text{true positives}}{\text{actual positives}} \times 100    (5.7)

\text{True Negative Rate} = \frac{\text{true negatives}}{\text{actual negatives}} \times 100    (5.8)

Accuracy (Equation 5.6) is a frequently used measure in classification tasks and gauges a classifier's overall performance in classifying queues and non-queues. In addition to this overall summary, it is also useful to measure the classification of queues and non-queues separately, as a classifier may perform well in one case but not the other. This can be evaluated using two additional measures: the true positive rate, or recall for queues (Equation 5.7), and the true negative rate, or specificity for non-queues (Equation 5.8).
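As a worked example, consider a classifier that labels 9 of the 10 queue videos and 13 of the 14 non-queue videos correctly (the SVM confusion matrix reported later in Table 5.4); the three measures then evaluate to:

\text{Accuracy} = \frac{9 + 13}{24} \times 100 = 91.67\%, \qquad \text{TPR} = \frac{9}{10} \times 100 = 90.00\%, \qquad \text{TNR} = \frac{13}{14} \times 100 \approx 92.86\%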
5.4.2 Evaluation of N-gram Based Classifier
In the evaluation of the n-gram based classifier from Section 4.5, two separate sets of experiments were performed. The first set involved training classifiers on the n-gram bit vectors using various machine learning techniques: support vector machines (SVM), nearest neighbour and decision trees. Both the queue and non-queue datasets were used, and experiments were performed in Weka using 10-fold cross-validation. The second set of experiments involved the custom implementation of the nearest neighbour classifier using the similarity measure from [11], as discussed in Section 4.5; in this case, leave-one-out cross-validation was used. For both sets of experiments, four versions of event sequences were evaluated: the initial and modified (without "appear" and "vanish") versions, and for both cases the rule-based and clustered events.

                                 SVM                   Nearest Neighbour     Decision Trees
initial, rule-based events       91.67%  [9 1; 1 13]   58.33%  [10 0; 10 4]  62.50%  [5 5; 4 10]
initial, clustered events        91.67%  [10 0; 2 12]  54.17%  [10 0; 11 3]  66.67%  [6 4; 4 10]
modified, rule-based events      91.67%  [9 1; 1 13]   50.00%  [10 0; 12 2]  91.67%  [9 1; 1 13]
modified, clustered events       87.50%  [9 1; 2 12]   50.00%  [10 0; 12 2]  91.67%  [9 1; 1 13]

Table 5.4: Accuracy rates and confusion matrices (in the format [TP FN; FP TN]) for different classifiers and event sequences.

Table 5.4 displays the confusion matrices and accuracy rates for the first set of experiments. Of the three classifiers, the SVM classifier produced the best performance, performing consistently well across all four versions of event sequences with about 90% accuracy. In stark contrast, the nearest neighbour classifier only managed to achieve an accuracy rate of about 50% to 60%; this poor performance may be due to limited training data. In the case of the decision tree classifier, an interesting observation can be made: it performed poorly with the initial event sequences as input, but performed on par with the SVM classifier using the modified sequences. This discrepancy can be explained by examining the decision tree built by the classifier. Figure 5.8 shows two decision trees built from rule-based event sequences, one from the initial sequences and one from the modified sequences. In Figure 5.8(a), classification of queues is strongly dictated by the presence of the n-grams "van-van-van" and "idc-idc-van", both of which are infrequent in queue sequences. The use of the modified sequences, however, resulted in the decision tree in Figure 5.8(b), where "idc-rc-rc" is an important subsequence in a "standard" queue, whilst "con-dc-con" is a common feature of short queues. This shows that decision trees do not deal well with noise. In the case of SVM and nearest neighbour, the difference between the initial and modified versions was not as significant.
Figure 5.8: Decision trees generated from (a) initial and (b) modified event sequences.

Apart from accuracy, it is also worth examining the true positive and true negative rates, which are visualised in Figure 5.9. The SVM classifier was consistent for both positive and negative cases. In contrast, the nearest neighbour classifier's low true negative rate was the main cause of its poor accuracy. In addition to sparse training data, the Euclidean distance measure may not be the most suitable measure for the task; a different distance metric, such as Equation 4.1, might produce better results. The decision tree classifier, on the other hand, had a relatively low rate in both cases when used with the initial event sequences, due to the amount of noise as discussed earlier.

Figure 5.9: True positive and true negative rates for the first set of experiments.

The second set of experiments involved only the custom nearest neighbour classifier using Equation 4.1 as its distance metric. The results are summarised in Table 5.5. Compared to the nearest neighbour classifier in the first set of experiments, this classifier produced results comparable to the SVM classifier. The highest accuracy rate achieved was 95.83%, obtained when the initial, clustered event sequences were used. This suggests that the distance measure proposed in [11] is more suitable for the nearest neighbour classifier than the Euclidean distance measure. For comparison, the experiments were repeated using the Euclidean distance measure, which produced results similar to those of the nearest neighbour classifier in the first set of experiments. It is notable how the Euclidean-distance classifier was affected by examples of (spatially) short queues (i.e. queues with at most 2 people at any point in time), as most false positives identified these as their nearest neighbour. Detailed results of these experiments for both measures are included in Appendix D. Comparing the initial and modified versions of the event sequences, the former performed marginally better; in this case, events of people appearing and vanishing may produce a certain pattern (e.g. people join a queue after appearing in the scene) and help discriminate queues from non-queues.

                               Confusion Matrix  Accuracy  True Positive Rate  True Negative Rate
initial, rule-based events     [9 1; 1 13]       91.67%    90.00%              92.86%
initial, clustered events      [10 0; 1 13]      95.83%    100.00%             92.86%
modified, rule-based events    [9 1; 3 11]       83.33%    90.00%              78.58%
modified, clustered events     [9 1; 2 12]       87.50%    90.00%              85.71%

Table 5.5: Results for the custom nearest neighbour classifier (confusion matrices in the format [TP FN; FP TN]).
5.5 Summary
For temporal localisation tasks, the rule-based system appeared to outperform the n-gram based system. However, this conclusion may be slightly biased, as the analysis of the rule-based system's results is based on only 8 out of 10 videos, whilst the results of the n-gram based system are based on 9 videos. The rule-based system is more rigid in its definition of a queue, and does not handle cases such as the "disruption" case well. The n-gram based system trained with the initial event sequences handled the "disruption" case slightly better, achieving 60% to 80% recall and a 9.45% error rate across the different parameters. The n-gram based system's performance may improve given more training data.

For classification tasks, the system performed generally well as a whole. The SVM classifier gave consistent results. The nearest neighbour classifier performed well with Equation 4.1 as its distance measure, and the initial version performed slightly better than the modified version. The decision tree classifier worked well in the case of the modified event sequences. We can conclude that "appear" and "vanish" events may help the classification task in certain cases but not in others.

Chapter 6
Conclusions
6.1 Summary
This thesis has presented solutions to represent and recognise compound spatio-temporal entities, using queues as an example. Chapter 2 outlined a literature review on spatio-temporal relations, spatio-temporal representations, object recognition, object tracking and machine learning. Chapter 3 proposed representing scenes qualitatively as a spatio-temporal graph, and discussed initial attempts at specifying a graph grammar for spatio-temporal graphs and the problems associated with it. It went on to propose the use of string grammars to represent event sequences generated from rules, and a means of localising a valid event sequence. Chapter 4 explored the possibilities of learning and recognising queues from examples using an n-gram model. Finally, Chapter 5 described the evaluation of the systems proposed in Chapters 3 and 4 on temporal localisation and classification tasks.

This work is novel in that there has been no known previous work on representing and recognising compound spatio-temporal entities. The main contribution of this thesis is a qualitative representation for compound spatio-temporal entities using event sequences and string grammars. The extension of this work to incorporate learning has also produced a representation of compound spatio-temporal entities as an event n-gram model, since grammar induction is not viable due to sparse training data. Representing activities as n-gram models is a recent idea (e.g. [11]), as most prior work is based on grammars (e.g. [14, 38]). Evaluation shows that although the n-gram based system did not localise queues as well as the rule-based system, it produced excellent results in classification tasks.
6.2 Future Work
There are many possibilities for extending the work described in this thesis. As the thesis concentrates only on representing and recognising queues, the idea could be extended to other compound spatio-temporal entities and activities, such as the process of stacking containers onto a cargo ship or of loading vehicles onto a ferry. One direction is to specify a grammar for such a new entity. Another aspect worth exploring is whether the n-gram based system can discriminate between different compound spatio-temporal entities.

The idea can also be extended to spatial localisation, or more precisely answering the question "where is the queue and who is involved in it?". Spatial localisation can be useful in recognition since temporal localisation only identifies the frames containing compound spatio-temporal entities; there might well be more than one compound entity within a frame sequence. In this respect, another avenue for future work is to extend the system to handle multiple instances of compound spatio-temporal entities in a given video. These instances may be similar or different, and they may even occur simultaneously.

This work concerns only high-level qualitative representations and ignores the problems of low-level tracking. It is possible to fully automate the system by extending the work to track entities from videos; the outcome would be seamless generation and recognition of n-gram models directly from example videos.

Bibliography
[1] J. F. Allen. Maintaining knowledge about temporal intervals. Commun. ACM, 26(11):832–843, 1983.
[2] S. Baumann. A simplified attributed graph grammar for high-level music recognition. In ICDAR '95: Proceedings of the Third International Conference on Document Analysis and Recognition (Volume 2), page 1080, Washington, DC, USA, 1995. IEEE Computer Society.
[3] B. Bennett, D. R. Magee, A. G. Cohn, and D. C. Hogg. Using spatio-temporal continuity constraints to enhance visual tracking of moving objects. In Lorenza Saitta, editor, Proceedings of the 16th European Conference on Artificial Intelligence (ECAI-04). ECCAI, IOS Press, 2004.
[4] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006.
[5] R. C. Carrasco, J. Oncina, and J. Calera-Rubio. Stochastic inference of regular tree languages. Machine Learning, 44(1/2):185–197, 2001.
[6] A. G. Cohn and S. M. Hazarika. Qualitative spatial representation and reasoning: An overview. Fundamenta Informaticae, 46(1-2):1–29, 2001.
[7] A. G. Cohn, D. Magee, A. Galata, D. Hogg, and S. Hazarika. Towards an architecture for cognitive vision using qualitative spatio-temporal representations and abduction. In C. Freksa, C. Habel, and K. F. Wender, editors, Spatial Cognition III, Lecture Notes in Computer Science, pages 232–248. Springer, 2003.
[8] D. J. Cook, L. B. Holder, and S. Djoko. Knowledge discovery from structural data. Journal of Intelligent Information Systems, 5(3):229–248, 1995.
[9] T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active appearance models. In H. Burkhardt and B. Neumann, editors, Proceedings of European Conference on Computer Vision 1998, volume 2, pages 484–498. Springer, 1998.
[10] F. D. Fracchia. Integrating lineage and interaction for the visualization of cellular structures. In Graph Grammars and Their Application to Computer Science, 5th International Workshop, Williamsburg, VA, USA, November 13-18, 1994, Selected Papers, volume 1073 of Lecture Notes in Computer Science, pages 512–535. Springer, 1996.
[11] R. Hamid, S. Maddi, A. Johnson, A. Bobick, I. Essa, and C. Isbell. Unsupervised activity discovery and characterization from event-streams. In Proceedings of the 21st Annual Conference on Uncertainty in Artificial Intelligence (UAI-05), pages 251–15, Arlington, Virginia, 2005. AUAI Press.
[12] F. Han and S. C. Zhu. Bottom-up/top-down image parsing by attribute graph grammar. In ICCV '05: Proceedings of the Tenth IEEE International Conference on Computer Vision, pages 1778–1785, Washington, DC, USA, 2005. IEEE Computer Society.
[13] S. M. Hazarika and A. G. Cohn. Qualitative spatio-temporal continuity. In D. R. Montello, editor, Spatial Information Theory: Foundations of Geographic Information Science; Proceedings of COSIT'01, volume 2205 of LNCS, pages 92–107, Morro Bay, CA, 2001. Springer.
[14] Y. A. Ivanov and A. F. Bobick. Recognition of visual activities and interactions by stochastic parsing. IEEE Trans. Pattern Anal. Mach. Intell., 22(8):852–872, 2000.
[15] E. Jeltsch and H. Kreowski. Grammatical inference based on hyperedge replacement. In Proceedings of the 4th International Workshop on Graph-Grammars and Their Application to Computer Science, pages 461–474, London, UK, 1991. Springer-Verlag.
[16] I. Jonyer, L. B. Holder, and D. J. Cook. Concept formation using graph grammars. In Proceedings of the KDD Workshop on Multi-Relational Data Mining, pages 71–79, 2002.
[17] I. Jonyer, L. B. Holder, and D. J. Cook. MDL-based context-free graph grammar induction and applications. International Journal of Artificial Intelligence Tools, 13(1):65–79, 2004.
[18] D. Klein and C. D. Manning. Natural language grammar induction using a constituent-context model. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, Cambridge, MA, 2002. MIT Press.
[19] R. Kohavi and F. Provost. Glossary of terms. Machine Learning, 30(2/3):271–274, 1998. Editorial for the Special Issue on Applications of Machine Learning and the Knowledge Discovery Process.
[20] J. Kong and K. Zhang. On a spatial graph grammar formalism. In VLHCC '04: Proceedings of the 2004 IEEE Symposium on Visual Languages - Human Centric Computing (VLHCC'04), pages 102–104, Washington, DC, USA, 2004. IEEE Computer Society.
[21] J. Krumm, S. Harris, B. Meyers, B. Brumitt, M. Hale, and S. Shafer. Multi-camera multi-person tracking for EasyLiving. In VS '00: Proceedings of the Third IEEE International Workshop on Visual Surveillance (VS'2000), page 3, Washington, DC, USA, 2000. IEEE Computer Society.
[22] B. Leibe, N. Cornelis, K. Cornelis, and L. J. Van Gool. Dynamic 3D scene analysis from a moving vehicle. In CVPR07, pages 1–8, 2007.
[23] G. Ligozat. Reasoning about cardinal directions. Journal of Visual Languages and Computing, 1(9):23–44, 1998.
[24] D. Lowe. Distinctive image features from scale-invariant keypoints. In International Journal of Computer Vision, volume 20, pages 91–110, 2003.
[25] D. Magee, C. J. Needham, P. Santos, A. G. Cohn, and D. C. Hogg. Autonomous learning for a cognitive agent using continuous models and inductive logic programming from audio-visual input. In Proceedings of the AAAI Workshop on Anchoring Symbols to Sensor Data, 2004.
[26] C. Manning and H. Schütze. Foundations of Statistical Natural Language Processing, chapter 8, page 267. MIT Press, Cambridge, MA, 1999.
[27] C. McCreary. Parsing a graph-grammar. In CSC '88: Proceedings of the 1988 ACM Sixteenth Annual Conference on Computer Science, pages 249–255, New York, NY, USA, 1988. ACM Press.
[28] S. J. McKenna, S. Jabri, Z. Duric, and H. Wechsler. Tracking interacting people. In Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition 2000, pages 348–353. IEEE Computer Society, 2000.
[29] T. M. Mitchell. Machine Learning. McGraw-Hill, London, 1997.
[30] C. Needham, P. Santos, D. R. Magee, V. Devin, D. C. Hogg, and A. G. Cohn. Protocols from perceptual observations. Artificial Intelligence, 167:103–136, 2005.
[31] C. G. Nevill-Manning and I. H. Witten. Identifying hierarchical structure in sequences: A linear-time algorithm. Journal of Artificial Intelligence Research, 7:67–82, 1997.
[32] J. A. Panico. Queuing theory: a study of waiting lines for business, economics, and science. Prentice-Hall, Englewood Cliffs, NJ, 1969.
[33] J. L. Pfaltz and A. Rosenfeld. Web grammars. In IJCAI, pages 609–620, 1969.
[34] M. Pontil and A. Verri. Support vector machines for 3D object recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(6):637–646, 1998.
[35] D. A. Randell, Z. Cui, and A. Cohn. A spatial logic based on regions and connection. In Bernhard Nebel, Charles Rich, and William Swartout, editors, KR'92. Principles of Knowledge Representation and Reasoning: Proceedings of the Third International Conference, pages 165–176. Morgan Kaufmann, San Mateo, California, 1992.
[36] J. Rekers and Andy Schurr. Defining and parsing visual languages with layered graph grammars. Journal of Visual Languages and Computing, 8(1):27–55, 1997.
[37] G. Rozenberg and E. Welzl. Boundary NLC graph grammars: basic definitions, normal forms, and complexity. Inf. Control, 69(1-3):136–167, 1986.
[38] M. S. Ryoo and J. K. Aggarwal. Recognition of composite human activities through context-free grammar based representation. In CVPR06, volume 2, pages 1709–1718, 2006.
[39] M. Sipser. Introduction to the Theory of Computation. Thomson Course Technology, Boston, Massachusetts, second edition, 2006.
[40] Z. Solan, D. Horn, E. Ruppin, and S. Edelman. Unsupervised learning of natural languages. Proceedings of the National Academy of Sciences of the United States of America, 102(33):11629–11634, August 2005.
[41] C. Stauffer and W. E. L. Grimson. Adaptive background mixture models for real-time tracking. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition 1999, volume 2, pages 246–252, 1999.
[42] A. Stolcke and S. Omohundro. Inducing probabilistic grammars by Bayesian model merging. In R. C. Carrasco and J. Oncina, editors, Grammatical Inference and Applications (ICGI-94), pages 106–118. Springer, Berlin, Heidelberg, 1994.
[43] P. Viola and M. J. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154, 2004.
[44] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques, chapter 4, pages 137–138. Morgan Kaufmann, San Francisco, CA, second edition, 1999.
[45] C. R. Wren, A. Azarbayejani, T. Darrell, and A. Pentland. Pfinder: Real-time tracking of the human body. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):780–785, 1997.
[46] B. Wu and R. Nevatia. Tracking of multiple, partially occluded humans based on static body part detection. In CVPR '06: Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 951–958, Washington, DC, USA, 2006. IEEE Computer Society.
[47] D. Zhang and K. Zhang. Reserved graph grammar: A specification tool for diagrammatic VPLs. In Proceedings, 1997 IEEE Symposium on Visual Languages, pages 284–91, Isle of Capri, Italy, 23–26 1997.
[48] T. Zhao and R. Nevatia. Tracking multiple humans in crowded environment. CVPR04, 02:406–413, 2004.

Appendix A
Reflection
In a research-based project, things do not always run smoothly. There are times when a proposed solution does not perform as well as expected, or when ideas which seem good in theory do not work well in practice. There are also times when a working implementation of an algorithm cannot be acquired from its author, and when a custom implementation is impractical due to time constraints. There are also occasions when one is stuck without any ideas or progress: these are the times when a supervisor's advice is crucial. Time management is important to accommodate all these unforeseen circumstances. Realistic goals should be set right from the beginning, and the schedule should be made as flexible as possible to deal with contingencies.

The chosen methodology is suitable for research-based projects such as this work, as ideas can be explored, experimented with, and incrementally revised or improved. It is sometimes worthwhile to experiment with simple solutions first, to assess their suitability and identify any problems early; more complex and robust solutions can then be built on the simple ones. The solutions described in this thesis are not exhaustive: many alternative solutions were considered but did not make it into the thesis due to lack of space, and many of these are simple ideas which could not be implemented due to various constraints.

In the process of extending this work to explore the possibilities of learning from examples, data sparseness has been a major issue. Several proposed solutions had to be revised or even abandoned as they were not feasible given the lack of training examples. This problem arises here because of the novelty of the task: it is quite difficult to find video datasets of queues, as well as suitable counter-examples. Although the datasets described in this thesis were specially recorded for this work, they are still insufficient. Disk space is another issue to consider: 1GB of space is required for approximately 10-15 minutes of video (assuming 640x480 resolution).

Manual annotation often cannot be avoided in a computer vision project. In this work, annotation is required both as the assumed output of the "tracker" and as ground truth. The process of annotating is mundane and time-consuming, and requires a great deal of effort and patience; the schedule should therefore take this into account. For example, the annotation of all 10 videos in the queue dataset took about a week to complete, and the annotation of the non-queue videos was unexpected and took an additional week. It is therefore worthwhile to plan ahead and consider what data sources are required before committing to a particular project.

Finally, the evaluation phase of the project should not be underestimated. The time required to test, analyse and report the results of experiments is often greater than one would expect, and with a prototyping methodology, experiments often have to be repeated after changes to the implementation; a systematic approach is thus important in carrying out the experiments. Writing up the work and revising the write-up can also take a lot of time and should therefore be started early.

Appendix B
Interim Report

Appendix C
Dataset
Table C.1 gives a brief description of all 10 videos from the queue dataset, and Table C.2 describes all 14 videos from the non-queue dataset. Some videos in both datasets contain passers-by in the background; these are not explicitly mentioned. In all cases, a person who leaves the scene and returns at a later time is considered to be two different people.

Name     Description
data101  2 persons already in line, 2 others enter and join the queue together. Queue gets served.
data102  1 person stands alone, 4 others join one by one. Queue gets served.
data103  4 persons stand in line. Queue gets served.
data104  4 persons stand in line. Queue gets served.
data105  Queue forms and gets served almost alternately. 8 persons altogether.
data106  2 persons enter together and form a line, 2 others enter together and join the queue. Queue gets served.
data107  A short queue (maximum 2 persons) forms and gets served alternately. Last person enters and leaves the scene (see Figure 5.7). 6 persons altogether.
data108  4 persons form a queue, one person walks through the queue and later joins at the back. Queue gets served.
data109  3 persons stand in line. One person walks through the queue. First person in line leaves, and another joins. A person walks by in the foreground. Remaining queue gets served.
data110  A short queue quite similar to data108. 6 persons altogether.

Table C.1: The queue dataset.

Name     Description
data201  A group-of-2 walks away from the camera, followed by a group-of-3. A passer-by walks by in the foreground. 4 persons walk/run towards the camera.
data202  A group-of-2 walks away from the camera. A group-of-2 and a group-of-3 enter from and leave in opposite directions. A group-of-3 walks towards the camera.
data203  A group-of-6 walks away from the camera. A passer-by walks by in the foreground. 1 person walks to the centre. Another enters and meets up with the person. Both leave the scene together.
data204  2 persons walk to the centre. Another person enters. Person 1 walks away, and the remaining 2 leave in different directions. Two groups-of-2 enter from different directions and leave in the same direction.
data205  Two groups-of-2 enter from different directions, cross each other's paths, and leave in different directions.
data206  4 persons enter from different directions, cross paths in the centre, and leave in different directions. 4 persons enter from different directions, meet up in the centre, and leave in different directions.
data207  2 persons walk to the centre from different directions. 2 persons walk together towards the centre. All 4 leave together.
data208  2 persons fight each other whilst 4 others try to pull them apart.
data209  A group-of-6 marches towards the camera.
data210  A group-of-2 walks away from the camera. A group-of-2 and a group-of-3 enter from opposite directions, meet up and exchange members in the centre, and both groups walk away in opposite directions.
data211  5 persons meet up in the centre. They leave one-by-one in random directions.
data212  A 5-person marathon. A person passes by in the foreground. A group-of-2 walks towards the camera, followed by a group-of-3.
data213  A group-of-5 walks away from the camera.
data214  4 persons enter from different directions, meet up in the centre, and leave simultaneously in different directions.

Table C.2: The non-queue dataset.

Appendix D
Detailed Results for Nearest Neighbour Classifier
The following tables complement the results of the second set of experiments in Section 5.4.2. Experiments were performed using leave-one-out cross-validation on 24 instances. The results displayed are based on the initial, clustered event sequences. Table D.1 uses Equation 4.1 as the similarity measure, whereas Table D.2 uses the Euclidean distance measure. Queue instances are labelled 1 to 10, whilst non-queues are labelled 11 to 24. Rows indicate instances, and columns represent candidate neighbours. The nearest neighbour for each instance (row) is boldfaced and underlined. The tables are deliberately divided into 4 quadrants to resemble the format of the confusion matrix used. All numbers are rounded to 2 decimal places because of space constraints. It is worth noting that classification using the Euclidean distance measure is greatly affected by the existence of examples of short queues (i.e. queues with at most 2 people at any point in time). As can be seen from Table D.2, all false positives are due to instances 7 and 10, both of which are examples of short queues.

Confusion Matrix: [10 0; 1 13]    Accuracy: 95.83%    True Positive Rate: 100.00%    True Negative Rate: 92.86%
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 1 - 0.60 0.62 0.36 0.00 0.12 0.00 0.28 0.06 0.00 0.10 0.18 0.04 0.13 0.22 0.08 0.06 0.04 0.00 0.00 0.11 0.03 0.11 0.10 2 0.60 - 0.44 0.26 0.13 0.14 0.00 0.39 0.05 0.00 0.12 0.09 0.04 0.05 0.00 0.00 0.00 0.00 0.00 0.04 0.00 0.00 0.05 0.00 3 0.62 0.44 - 0.42 0.00 0.29 0.00 0.21 0.07 0.00 0.00 0.08 0.00 0.07 0.09 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 4 0.36 0.26 0.42 - 0.06 0.13 0.00 0.05 0.06 0.00 0.00 0.08 0.05 0.07 0.00 0.04 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 5 0.00 0.13 0.00 0.06 - 0.00 0.28 0.13 0.00 0.20 0.22 0.03 0.04 0.00 0.00 0.04 0.00 0.00 0.00 0.07 0.00 0.06 0.06 0.00 6 0.12 0.14 0.29 0.13 0.00 - 0.06 0.09 0.11 0.06 0.09 0.04 0.00 0.12 0.07 0.00 0.05 0.03 0.00 0.08 0.05 0.03 0.00 0.00 7 0.00 0.00 0.00 0.00 0.28 0.06 - 0.00 0.00 0.73 0.21 0.00 0.00 0.06 0.00 0.05 0.07 0.03 0.00 0.04 0.00 0.04 0.00 0.00 8 0.28 0.39 0.21 0.05 0.13 0.09 0.00 - 0.09 0.00 0.08 0.03 0.07 0.00 0.00 0.09 0.13 0.00 0.04 0.11 0.00 0.08 0.09 0.00 9 0.06 0.05 0.07 0.06 0.00 0.11 0.00 0.09 - 0.00 0.04 0.07 0.04 0.06 0.00 0.10 0.05 0.03 0.05 0.04 0.05 0.06 0.06 0.05 10 0.00 0.00 0.00 0.00 0.20 0.06 0.73 0.00 0.00 - 0.26 0.04 0.11 0.15 0.00 0.09 0.06 0.03 0.00 0.14 0.00 0.04 0.00 0.00 11 0.10 0.12 0.00 0.00 0.22 0.09 0.21 0.08 0.04 0.26 - 0.26 0.18 0.14 0.10 0.10 0.23 0.06 0.00 0.28 0.04 0.07 0.09 0.08 12 0.18 0.09 0.08 0.08 0.03 0.04 0.00 0.03 0.07 0.04 0.26 - 0.30 0.21 0.12 0.05 0.07 0.04 0.00 0.11 0.10 0.11 0.17 0.09 13 0.04 0.04 0.00 0.05 0.04 0.00 0.00 0.07 0.04 0.11 0.18 0.30 - 0.13 0.05 0.23 0.24 0.03 0.03 0.28 0.07 0.07 0.30 0.14 14 0.13 0.05 0.07 0.07 0.00 0.12 0.06 0.00 0.06 0.15 0.14 0.21 0.13 - 0.14 0.04 0.05 0.07 0.00 0.13 0.06 0.06 0.11 0.05 15 0.22 0.00 0.09 0.00 0.00 0.07 0.00 0.00 0.00 0.00 0.10 0.12 0.05 0.14 - 0.08 0.06 0.10 0.10 0.00 0.12 0.06 0.13 0.21 16 0.08 0.00 0.00 0.04 0.04 0.00 0.05 0.09 0.10 0.09 0.10 0.05 0.23 0.04 0.08 - 0.13 0.08 0.10 0.11 0.11 0.18 0.17 0.06 17 0.06 0.00 0.00 0.00 0.00 0.05 0.07 0.13 0.05 0.06 0.23 0.07 0.24 0.05 0.06 0.13 - 0.05 0.10 0.31 0.10 0.06 0.30 0.00 18 0.04 0.00 0.00 0.00 0.00 0.03 0.03 0.00 0.03 0.03 0.06 0.04 0.03 0.07 0.10 0.08 0.05 - 0.20 0.16 0.13 0.07 0.03 0.11 19 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.04 0.05 0.00 0.00 0.00 0.03 0.00 0.10 0.10 0.10 0.20 - 0.27 0.22 0.03 0.17 0.20 20 0.00 0.04 0.00 0.00 0.07 0.08 0.04 0.11 0.04 0.14 0.28 0.11 0.28 0.13 0.00 0.11 0.31 0.16 0.27 - 0.11 0.05 0.18 0.04 21 0.11 0.00 0.00 0.00 0.00 0.05 0.00 0.00 0.05 0.00 0.04 0.10 0.07 0.06 0.12 0.11 0.10 0.13 0.22 0.11 - 0.06 0.22 0.17 22 0.03 0.00 0.00 0.00 0.06 0.03 0.04 0.08 0.06 0.04 0.07 0.11 0.07 0.06 0.06 0.18 0.06 0.07 0.03 0.05 0.06 - 0.03 0.08 23 0.11 0.05 0.00 0.00 0.06 0.00 0.00 0.09 0.06 0.00 0.09 0.17 0.30 0.11 0.13 0.17 0.30 0.03 0.17 0.18 0.22 0.03 - 0.11 24 0.10 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.05 0.00 0.08 0.09 0.14 0.05 0.21 0.06 0.00 0.11 0.20 0.04 0.17 0.08 0.11 -
Table D.1: Custom nearest neighbour classifier using Equation 4.1 as the similarity metric.

Confusion Matrix: [10 0; 11 3]    Accuracy: 54.17%    True Positive Rate: 100.00%    True Negative Rate: 21.43%
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 1 - 4.00 3.16 4.24 5.74 5.29 4.58 5.39 5.57 4.69 6.16 6.71 6.56 5.20 4.58 6.86 5.66 7.42 6.00 6.78 5.57 7.75 5.29 5.92 2 4.00 - 4.47 5.29 6.08 6.00 5.57 5.57 6.40 5.66 6.78 7.68 7.28 6.24 6.08 7.81 6.63 8.19 6.78 7.35 6.71 8.49 6.32 7.00 3 3.16 4.47 - 3.74 5.39 4.47 4.12 5.39 5.20 4.24 6.16 6.86 6.40 5.00 4.58 6.86 5.48 7.28 5.66 6.48 5.57 7.62 5.29 5.92 4 4.24 5.29 3.74 - 5.39 5.10 4.36 6.08 5.39 4.47 6.32 7.00 6.40 5.20 5.00 6.86 5.66 7.42 5.83 6.63 5.74 7.75 5.48 6.08 5 5.74 6.08 5.39 5.39 - 5.92 4.00 6.16 6.00 4.36 5.92 7.48 6.78 5.83 5.48 7.21 6.08 7.75 6.24 6.71 6.16 7.81 5.74 6.48 6 5.29 6.00 4.47 5.10 5.92 - 4.58 6.24 5.57 4.69 6.32 7.42 6.86 5.39 5.20 7.28 5.83 7.55 6.16 6.63 5.92 7.87 5.83 6.40 7 4.58 5.57 4.12 4.36 4.00 4.58 - 5.66 4.90 1.73 5.00 6.78 6.00 4.47 4.24 6.32 4.80 6.78 5.20 5.92 5.10 7.14 4.80 5.48 8 5.39 5.57 5.39 6.08 6.16 6.24 5.66 - 6.32 5.74 7.00 8.00 7.21 6.48 6.16 7.48 6.24 8.25 6.71 7.14 6.78 8.19 6.24 7.07 9 5.57 6.40 5.20 5.39 6.00 5.57 4.90 6.32 - 5.00 6.56 7.35 6.78 5.66 5.48 6.93 5.92 7.62 6.08 6.86 6.00 7.81 5.74 6.32 10 4.69 5.66 4.24 4.47 4.36 4.69 1.73 5.74 5.00 - 4.90 6.71 5.74 4.36 4.36 6.24 4.90 6.86 5.29 5.66 5.20 7.21 4.90 5.57 11 6.16 6.78 6.16 6.32 5.92 6.32 5.00 7.00 6.56 4.90 - 7.00 6.86 6.08 5.92 7.55 5.83 8.06 6.93 6.48 6.71 8.25 6.32 6.86 12 6.71 7.68 6.86 7.00 7.48 7.42 6.78 8.00 7.35 6.71 7.00 - 6.93 6.63 6.78 8.49 7.42 8.83 7.81 7.94 7.35 8.77 6.86 7.62 13 6.56 7.28 6.40 6.40 6.78 6.86 6.00 7.21 6.78 5.74 6.86 6.93 - 6.32 6.32 7.07 6.08 8.37 7.00 6.56 6.78 8.43 5.57 6.78 14 5.20 6.24 5.00 5.20 5.83 5.39 4.47 6.48 5.66 4.36 6.08 6.63 6.32 - 4.90 7.07 5.74 7.35 6.08 6.40 5.83 7.68 5.39 6.16 15 4.58 6.08 4.58 5.00 5.48 5.20 4.24 6.16 5.48 4.36 5.92 6.78 6.32 4.90 - 6.63 5.39 6.93 5.39 6.56 5.29 7.42 5.00 5.29 16 6.86 7.81 6.86 6.86 7.21 7.28 6.32 7.48 6.93 6.24 7.55 8.49 7.07 7.07 6.63 - 6.86 8.49 7.14 7.68 7.07 8.19 6.56 7.48 17 5.66 6.63 5.48 5.66 6.08 5.83 4.80 6.24 5.92 4.90 5.83 7.42 6.08 5.74 5.39 6.86 - 7.55 6.00 5.83 5.92 7.87 4.90 6.56 18 7.42 8.19 7.28 7.42 7.75 7.55 6.78 8.25 7.62 6.86 8.06 8.83 8.37 7.35 6.93 8.49 7.55 - 7.00 7.81 7.35 9.11 7.55 7.62 19 6.00 6.78 5.66 5.83 6.24 6.16 5.20 6.71 6.08 5.29 6.93 7.81 7.00 6.08 5.39 7.14 6.00 7.00 - 6.00 5.57 8.12 5.48 5.92 20 6.78 7.35 6.48 6.63 6.71 6.63 5.92 7.14 6.86 5.66 6.48 7.94 6.56 6.40 6.56 7.68 5.83 7.81 6.00 - 6.71 8.60 6.16 7.28 21 5.57 6.71 5.57 5.74 6.16 5.92 5.10 6.78 6.00 5.20 6.71 7.35 6.78 5.83 5.29 7.07 5.92 7.35 5.57 6.71 - 7.94 5.20 6.00 22 7.75 8.49 7.62 7.75 7.81 7.87 7.14 8.19 7.81 7.21 8.25 8.77 8.43 7.68 7.42 8.19 7.87 9.11 8.12 8.60 7.94 - 7.87 8.06 23 5.29 6.32 5.29 5.48 5.74 5.83 4.80 6.24 5.74 4.90 6.32 6.86 5.57 5.39 5.00 6.56 4.90 7.55 5.48 6.16 5.20 7.87 - 5.92 24 5.92 7.00 5.92 6.08 6.48 6.40 5.48 7.07 6.32 5.57 6.86 7.62 6.78 6.16 5.29 7.48 6.56 7.62 5.92 7.28 6.00 8.06 5.92 -
Table D.2: Custom nearest neighbour classifier using Euclidean distance measure.
