

(C) 2005, The University of Michigan

Information Retrieval 
 

Dragomir R. Radev

University of Michigan

radev@umich.edu 

September 19, 2005



About the instructor 

  • Dragomir R. Radev
  • Associate Professor, University of Michigan
    • School of Information
    • Department of Electrical Engineering and Computer Science
    • Department of Linguistics
  • Head of CLAIR (Computational Linguistics And Information Retrieval) at U. Michigan
  • Treasurer, North American Chapter of the ACL
  • Ph.D., 1998, Computer Science, Columbia University
  • Email: radev@umich.edu
  • Home page: http://tangra.si.umich.edu/~radev

 



Introduction



IR systems 

  • Google
  • Vivísimo
  • AskJeeves
  • NSIR
  • Lemur
  • MG
  • Nutch


Examples of IR systems 

  • Conventional (library catalog).  
    Search by keyword, title, author, etc.
  • Text-based (Lexis-Nexis, Google, FAST). 
    Search by keywords. Limited search using queries in natural language.
  • Multimedia (QBIC, WebSeek, SaFe). 
    Search by visual appearance (shapes, colors, ...).
  • Question answering systems (AskJeeves, NSIR, Answerbus). 
    Search in (restricted) natural language.






Need for IR 

  • Advent of WWW - more than 8 billion documents indexed on Google
  • How much information? 200 TB according to Lyman and Varian 2003.

    http://www.sims.berkeley.edu/research/projects/how-much-info/

  • Search, routing, filtering
  • User's information need


Some definitions of Information Retrieval (IR) 

Salton (1989): "Information-retrieval systems process files of records and requests for information, and identify and retrieve from the files certain records in response to the information requests. The retrieval of particular records depends on the similarity between the records and the queries, which in turn is measured by comparing the values of certain attributes attached to records and information requests." 
 
Kowalski (1997): "An Information Retrieval System is a system that is capable of storage, retrieval, and maintenance of information. Information in this context can be composed of text (including numeric and date data), images, audio, video, and other multi-media objects."



Sample queries (from Excite) 

In what year did baseball become an offical sport?

play station codes . com

birth control and depression

government

"WorkAbility I"+conference

kitchen appliances

where can I find a chines rosewood

tiger electronics

58 Plymouth Fury

How does the character Seyavash in Ferdowsi's Shahnameh exhibit characteristics of a hero?

emeril Lagasse

Hubble

M.S Subalaksmi

running



Mappings and abstractions 

Reality → Data 

Information need → Query 

From Korfhage's book



Typical IR system 

  • (Crawling)
  • Indexing
  • Retrieval
  • User interface


Key Terms Used in IR 

  • QUERY: a representation of what the user is looking for - can be a list of words or a phrase.
  • DOCUMENT: an information entity that the user wants to retrieve
  • COLLECTION: a set of documents
  • INDEX: a representation of information that makes querying easier
  • TERM: word or concept that appears in a document or a query


Documents



Documents 

  • Not just printed paper
  • collections vs. documents
  • data structures: representations
  • Bag of words method
  • document surrogates: keywords, summaries
  • encoding: ASCII, Unicode, etc.


Document preprocessing 

  • Formatting
  • Tokenization (Paul's, Willow Dr., Dr. Willow, 555-1212, New York, ad hoc)
  • Casing (cat vs. CAT)
  • Stemming (computer, computation)
  • Soundex

 



Document representations 

  • Term-document matrix (m x n)
  • Term-term matrix (m x m x n)
  • Document-document matrix (n x n)
  • Example: 3,000,000 documents (n) with 50,000 terms (m)
  • Sparse matrices
  • Boolean vs. integer matrices


Document representations 

  • Term-document matrix
    • Evaluating queries (e.g., (A AND B) OR C)
    • Storage issues
  • Inverted files
    • Storage issues
    • Evaluating queries
    • Advantages and disadvantages


IR models



Major IR models 

  • Boolean
  • Vector
  • Probabilistic
  • Language modeling
  • Fuzzy retrieval
  • Latent semantic indexing

 



Major IR tasks 

  • Ad-hoc
  • Filtering and routing
  • Question answering
  • Spoken document retrieval
  • Multimedia retrieval

 



Venn diagrams 

[Figure: Venn diagram of documents D1 and D2, with regions labeled w, x, y, z]



Boolean model 





Boolean queries 

restaurants AND (Mideastern OR vegetarian) AND inexpensive 

  • What types of documents are returned?
  • Stemming
  • Thesaurus expansion
  • Inclusive vs. exclusive OR
  • Confusing uses of AND and OR
 

dinner AND sports AND symphony 

4 OF (Pentium, printer, cache, PC, monitor, computer, personal)



Boolean queries 

  • Weighting (Beethoven AND sonatas)
  • Precedence
 
 

coffee AND croissant OR muffin 

raincoat AND umbrella OR sunglasses 

  • Use of negation: potential problems
  • Conjunctive and Disjunctive normal forms
  • Full CNF and DNF


Transformations 

  • De Morgan's laws:
 

NOT (A AND B) = (NOT A) OR (NOT B) 

NOT (A OR B) = (NOT A) AND (NOT B) 

  • CNF or DNF?
    • Reference librarians prefer CNF - why?


Boolean model 

  • Partition
  • Partial relevance?
  • Operators: AND, NOT, OR, parentheses

 



Exercise 

  • D1 = "computer information retrieval"
  • D2 = "computer retrieval"
  • D3 = "information"
  • D4 = "computer information"
  • Q1 = "information retrieval" 
  • Q2 = "information ¬computer"


Exercise 

[Figure: Venn diagram over four authors (Swift, Shakespeare, Milton, Chaucer), with numbered regions for each combination of authors mentioned in a document]

((chaucer OR milton) AND (NOT swift)) OR ((NOT chaucer) AND (swift OR shakespeare))



Stop lists 

  • The 250-300 most common words in English account for 50% or more of a given text.
  • Example: "the" and "of" represent 10% of tokens. "and", "to", "a", and "in" - another 10%. Next 12 words - another 10%.
  • Moby Dick Ch. 1: 859 unique words (types), 2256 word occurrences (tokens). Top 65 types cover 1132 tokens (> 50%).
  • Token/type ratio: 2256/859 = 2.63


Vector models 

[Figure: documents Doc 1, Doc 2, Doc 3 plotted as vectors in a space whose axes are Term 1, Term 2, Term 3]



Vector queries 

  • Each document is represented as a vector
  • Inefficient representations (bit vectors)
  • Dimensional compatibility


The matching process 

  • Document space
  • Matching is done between a document and a query (or between two documents)
  • distance vs. similarity
  • Euclidean distance, Manhattan distance, Word overlap, Jaccard coefficient, etc.


Miscellaneous similarity measures 

  • The Cosine measure:

    σ(D,Q) = Σi (di × qi) / (√(Σi di²) × √(Σi qi²)) = |X ∩ Y| / (√|X| × √|Y|)

  • The Jaccard coefficient:

    σ(D,Q) = |X ∩ Y| / |X ∪ Y|


Exercise 

  • Compute the cosine measures σ(D1,D2) and σ(D1,D3) for the documents: D1 = <1,3>, D2 = <100,300> and D3 = <3,1>
  • Compute the corresponding Euclidean distances, Manhattan distances, and Jaccard coefficients (a sketch follows below).
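
A minimal Python sketch of the four measures, following the formulas above (the set-based Jaccard here treats the nonzero dimensions as set members):

import math

def cosine(d, q):
    # sigma(D,Q) = sum(d_i * q_i) / (sqrt(sum d_i^2) * sqrt(sum q_i^2))
    num = sum(di * qi for di, qi in zip(d, q))
    return num / (math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(x * x for x in q)))

def euclidean(d, q):
    return math.sqrt(sum((di - qi) ** 2 for di, qi in zip(d, q)))

def manhattan(d, q):
    return sum(abs(di - qi) for di, qi in zip(d, q))

def jaccard(d, q):
    # |X intersect Y| / |X union Y| over the sets of nonzero dimensions
    x = {i for i, v in enumerate(d) if v}
    y = {i for i, v in enumerate(q) if v}
    return len(x & y) / len(x | y)

D1, D2, D3 = [1, 3], [100, 300], [3, 1]
print(cosine(D1, D2), cosine(D1, D3))   # ~1.0 and 0.6: D1 and D2 point the same way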


Evaluation



Relevance 

  • Difficult to define: fuzzy, inconsistent
  • Methods: exhaustive, sampling, pooling, search-based


Contingency table 

                    relevant      not relevant
retrieved               w               x           n1 = w + x
not retrieved           y               z
                   n2 = w + y                           N



Precision and Recall 

Recall: R = w / (w + y)

Precision: P = w / (w + x)



Exercise 

Go to Google (www.google.com) and search for documents on Tolkien's "Lord of the Rings". Try different ways of phrasing the query: e.g., Tolkien, "JRR Tolkien", +"JRR Tolkien" +"Lord of the Rings", etc. For each query, compute the precision (P) based on the first 10 documents returned by Google.  

Note! Before starting the exercise, have a clear idea of what a relevant document for your query should look like. Try different information needs.

Later, try different queries.



[From Salton's book]



Interpolated average precision (e.g., 11-point)

Interpolation: what is precision at recall = 0.5?



Issues 

  • Why not use accuracy A = (w+z)/N?
  • Average precision
  • Average P at given "document cutoff values"
  • Report P where P = R (the break-even point)
  • F measure: F = (β²+1)PR / (β²P + R)
  • F1 measure: F1 = 2/(1/R + 1/P): the harmonic mean of P and R
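
A minimal sketch tying the contingency-table counts to these measures (the w, x, y, z values in the example call are made up):

def evaluate(w, x, y, z):
    # w: relevant and retrieved; x: retrieved only; y: relevant only; z: neither
    p = w / (w + x)                      # precision
    r = w / (w + y)                      # recall
    f1 = 2 / (1 / r + 1 / p)             # harmonic mean of P and R
    return p, r, f1

def f_measure(p, r, b):
    # F = (b^2 + 1) P R / (b^2 P + R)
    return (b * b + 1) * p * r / (b * b * p + r)

print(evaluate(w=20, x=10, y=5, z=965))  # (0.667, 0.8, 0.727)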


Kappa 

  • κ = (P(A) - P(E)) / (1 - P(E)), where P(A) is the observed agreement and P(E) the expected (chance) agreement
  • N: number of items (index i)
  • n: number of categories (index j)
  • k: number of annotators


Kappa example (from Manning, Schuetze, Raghavan) 

         J1+     J1-
J2+      300      20
J2-       10      70



Kappa (cont'd) 

  • P(A) = 370/400 = 0.925
  • P(-) = (20+70+10+70)/800 = 0.2125
  • P(+) = (10+20+300+300)/800 = 0.7875
  • P(E) = 0.2125 * 0.2125 + 0.7875 * 0.7875 = 0.665
  • K = (0.925-0.665)/(1-0.665) = 0.776
  • Kappa higher than 0.67 is tentatively acceptable; higher than 0.8 is good
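
The same computation as a sketch, with the '+' and '-' marginals pooled across the two judges as above:

def kappa(both_plus, j1_plus_only, j2_plus_only, both_minus):
    n = both_plus + j1_plus_only + j2_plus_only + both_minus
    p_a = (both_plus + both_minus) / n                                # observed agreement
    p_plus = (2 * both_plus + j1_plus_only + j2_plus_only) / (2 * n)  # pooled P(+)
    p_minus = (2 * both_minus + j1_plus_only + j2_plus_only) / (2 * n)
    p_e = p_plus ** 2 + p_minus ** 2                                  # chance agreement
    return (p_a - p_e) / (1 - p_e)

print(kappa(300, 10, 20, 70))  # ~0.776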


Relevance collections 

  • TREC ad hoc collections, 2-6 GB
  • TREC Web collections, 2-100GB


Sample TREC query 

<top>

<num> Number: 305

<title> Most Dangerous Vehicles  

<desc> Description:

Which are the most crashworthy, and least crashworthy,

passenger vehicles?

 

<narr> Narrative:

A relevant document will contain information on the crashworthiness of a given vehicle or vehicles that can be used to draw a comparison with other vehicles.  The document will have to describe/compare vehicles, not drivers.  For instance, it should be expected that vehicles preferred by 16-25 year-olds would be involved in more crashes, because that age group is involved in more crashes.  I would view number of fatalities per 100 crashes to be more revealing of a vehicle's crashworthiness than the number of crashes per 100,000 miles, for example.

</top> 

Relevant documents:

LA031689-0177

FT922-1008

LA090190-0126

LA101190-0218

LA082690-0158

LA112590-0109

FT944-136

LA020590-0119

FT944-5300

LA052190-0048

LA051689-0139

FT944-9371

LA032390-0172 

LA042790-0172 
LA021790-0136 
LA092289-0167 
LA111189-0013 
LA120189-0179 
LA020490-0021 
LA122989-0063 
LA091389-0119 
LA072189-0048 
FT944-15615 
LA091589-0101 
LA021289-0208



<DOCNO> LA031689-0177 </DOCNO>

<DOCID> 31701 </DOCID>

<DATE><P>March 16, 1989, Thursday, Home Edition </P></DATE>

<SECTION><P>Business; Part 4; Page 1; Column 5; Financial Desk </P></SECTION>

<LENGTH><P>586 words </P></LENGTH>

<HEADLINE><P>AGENCY TO LAUNCH STUDY OF FORD BRONCO II AFTER HIGH RATE OF ROLL-OVER ACCIDENTS </P></HEADLINE>

<BYLINE><P>By LINDA WILLIAMS, Times Staff Writer </P></BYLINE>

<TEXT>

<P>The federal government's highway safety watchdog said Wednesday that the Ford Bronco II appears to be involved in more fatal roll-over accidents than other vehicles in its class and that it will seek to determine if the vehicle itself contributes to the accidents. </P>

<P>The decision to do an engineering analysis of the Ford Motor Co. utility-sport vehicle grew out of a federal accident study of the Suzuki Samurai, said Tim Hurd, a spokesman for the National Highway Traffic Safety Administration. NHTSA looked at Samurai accidents after Consumer Reports magazine charged that the vehicle had basic design flaws. </P>

<P>Several Fatalities </P>

<P>However, the accident study showed that the "Ford Bronco II appears to have a higher number of single-vehicle, first event roll-overs, particularly those involving fatalities," Hurd said. The engineering analysis of the Bronco, the second of three levels of investigation conducted by NHTSA, will cover the 1984-1989 Bronco II models, the agency said. </P>

<P>According to a Fatal Accident Reporting System study included in the September report on the Samurai, 43 Bronco II single-vehicle roll-overs caused fatalities, or 19 of every 100,000 vehicles. There were eight Samurai fatal roll-overs, or 6 per 100,000; 13 involving the Chevrolet S10 Blazers or GMC Jimmy, or 6 per 100,000, and six fatal Jeep Cherokee roll-overs, for 2.5 per 100,000. After the accident report, NHTSA declined to investigate the Samurai. </P>

...

</TEXT>

<GRAPHIC><P> Photo, The Ford Bronco II "appears to have a higher number of single-vehicle, first event roll-overs," a federal official said. </P></GRAPHIC>

<SUBJECT>

<P>TRAFFIC ACCIDENTS; FORD MOTOR CORP; NATIONAL HIGHWAY TRAFFIC SAFETY ADMINISTRATION; VEHICLE INSPECTIONS;

RECREATIONAL VEHICLES; SUZUKI MOTOR CO; AUTOMOBILE SAFETY </P>

</SUBJECT>

</DOC>

 



TREC (cont'd) 

  • http://trec.nist.gov/tracks.html
  • http://trec.nist.gov/presentations/presentations.html

 



Word distribution models



Shakespeare 

  • Romeo and Juliet:
  • And, 667; The, 661; I, 570; To, 515; A, 447; Of, 382; My, 356; Is, 343; That, 343; In, 314; You, 289; Thou, 277; Me, 262; Not, 257; With, 234; It, 224; For, 223; This, 215; Be, 207; But, 181; Thy, 167; What, 163; O, 160; As, 156; Her, 150; Will, 147; So, 145; Thee, 139; Love, 135; His, 128; Have, 127; He, 120; Romeo, 115; By, 114; She, 114; Shall, 107; Your, 103; No, 102; Come, 96; Him, 96; All, 92; Do, 89; From, 86; Then, 83; Good, 82; Now, 82; Here, 80; If, 80; An, 78; Go, 76; On, 76; I'll, 71; Death, 69; Night, 68; Are, 67; More, 67; We, 66; At, 65; Man, 65; Or, 65; There, 64; Hath, 63; Which, 60;
  • ...
  • A-bed, 1; A-bleeding, 1; A-weary, 1; Abate, 1; Abbey, 1; Abhorred, 1; Abhors, 1; Aboard, 1; Abound'st, 1; Abroach, 1; Absolved, 1; Abuse, 1; Abused, 1; Abuses, 1; Accents, 1; Access, 1; Accident, 1; Accidents, 1; According, 1; Accursed, 1; Accustom'd, 1; Ache, 1; Aches, 1; Aching, 1; Acknowledge, 1; Acquaint, 1; Acquaintance, 1; Acted, 1; Acting, 1; Action, 1; Acts, 1; Adam, 1; Add, 1; Added, 1; Adding, 1; Addle, 1; Adjacent, 1; Admired, 1; Ado, 1; Advance, 1; Adversary, 1; Adversity's, 1; Advise, 1; Afeard, 1; Affecting, 1; Afflicted, 1; Affliction, 1; Affords, 1; Affray, 1; Affright, 1; Afire, 1; Agate-stone, 1; Agile, 1; Agree, 1; Agrees, 1; Aim'd, 1; Alderman, 1; All-cheering, 1; All-seeing, 1; Alla, 1; Alliance, 1; Alligator, 1; Allow, 1; Ally, 1; Although, 1;
 

http://www.mta75.org/curriculum/english/Shakes/indexx.html



The BNC (Adam Kilgarriff) 

  • 1 6187267 the det
  • 2 4239632 be v
  • 3 3093444 of prep
  • 4 2687863 and conj
  • 5 2186369 a det
  • 6 1924315 in prep
  • 7 1620850 to infinitive-marker
  • 8 1375636 have v
  • 9 1090186 it pron
  • 10 1039323 to prep
  • 11 887877 for prep
  • 12 884599 i pron
  • 13 760399 that conj
  • 14 695498 you pron
  • 15 681255 he pron
  • 16 680739 on prep
  • 17 675027 with prep
  • 18 559596 do v
  • 19 534162 at prep
  • 20 517171 by prep
 

Kilgarriff, A. Putting Frequencies in the Dictionary. 
International Journal of Lexicography 10(2), 1997, pp. 135-155.





Zipf's law 

Rank × Frequency ≈ Constant
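
A quick empirical check as a minimal Python sketch ('corpus.txt' is a placeholder for any large plain-text file):

from collections import Counter

tokens = open('corpus.txt').read().lower().split()   # placeholder corpus
for rank, (word, freq) in enumerate(Counter(tokens).most_common(10), start=1):
    print(rank, word, freq, rank * freq)             # rank * freq should stay roughly constant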



Zipf's law is fairly general! 

  • Frequency of accesses to web pages
    • in particular, Wikipedia page access counts, with s approximately equal to 0.3
    • page access counts on the Polish Wikipedia (data for late July 2003) approximately obey Zipf's law with s about 0.5
  • Words in the English language
    • for instance, in Shakespeare's play Hamlet, with s approximately 0.5
  • Sizes of settlements
  • Income distributions amongst individuals
  • Size of earthquakes
  • Notes in musical performances
 

http://en.wikipedia.org/wiki/Zipf's_law



Zipf's law (cont'd) 

  • Limitations:
    • Low and high frequencies
    • Lack of convergence
  • Power law with coefficient c = -1:
    • y = kx^c
  • Li (1992): typing words one letter at a time, including spaces


Heaps' law 

  • Size of vocabulary: V(n) = Kn^b
  • In English, K is between 10 and 100, and b is between 0.4 and 0.6.

[Figure: V(n) plotted against n]

http://en.wikipedia.org/wiki/Heaps%27_law
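
A sketch that traces V(n) over a token stream; fitting V(n) = Kn^b to the collected points estimates K and b:

def heaps_points(tokens, step=1000):
    seen, points = set(), []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if n % step == 0:
            points.append((n, len(seen)))   # (tokens seen so far, vocabulary size V(n))
    return points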



Heaps' law (cont'd) 

  • Related to Zipf's law: generative models
  • Zipf's and Heaps' law coefficients change with language
 

Alexander Gelbukh, Grigori Sidorov. Zipf and Heaps Laws' Coefficients Depend on Language. 
Proc. CICLing-2001, Conference on Intelligent Text Processing and Computational Linguistics, 
February 18-24, 2001, Mexico City. Lecture Notes in Computer Science 2004, 
ISSN 0302-9743, ISBN 3-540-41687-0, Springer-Verlag, pp. 332-335.



Indexing



Methods 

  • Manual: e.g., Library of Congress subject headings, MeSH
  • Automatic


LOC subject headings 

http://www.loc.gov/catdir/cpso/lcco/lcco.html 

A -- GENERAL WORKS 
B -- PHILOSOPHY. PSYCHOLOGY. RELIGION 
C -- AUXILIARY SCIENCES OF HISTORY 
D -- HISTORY (GENERAL) AND HISTORY OF EUROPE 
E -- HISTORY: AMERICA 
F -- HISTORY: AMERICA 
G -- GEOGRAPHY. ANTHROPOLOGY. RECREATION 
H -- SOCIAL SCIENCES 
J -- POLITICAL SCIENCE 
K -- LAW 
L -- EDUCATION 
M -- MUSIC AND BOOKS ON MUSIC 
N -- FINE ARTS 
P -- LANGUAGE AND LITERATURE 
Q -- SCIENCE 
R -- MEDICINE 
S -- AGRICULTURE 
T -- TECHNOLOGY 
U -- MILITARY SCIENCE 
V -- NAVAL SCIENCE 
Z -- BIBLIOGRAPHY. LIBRARY SCIENCE. INFORMATION RESOURCES (GENERAL)



Medicine 

           CLASS R - MEDICINE

           Subclass R

           R5-920 Medicine (General)

           R5-130.5  General works

           R131-687  History of medicine.  Medical expeditions

           R690-697  Medicine as a profession.  Physicians

           R702-703  Medicine and the humanities.  Medicine and disease in relation to history, literature, etc.

           R711-713.97 Directories

           R722-722.32 Missionary medicine.  Medical missionaries

           R723-726  Medical philosophy.  Medical ethics

           R726.5-726.8 Medicine and disease in relation to psychology.  Terminal care.  Dying

           R727-727.5  Medical personnel and the public.  Physician and the public

           R728-733  Practice of medicine.  Medical practice economics

           R735-854  Medical education.  Medical schools.  Research

           R855-855.5  Medical technology

           R856-857  Biomedical engineering.  Electronics.  Instrumentation

           R858-859.7  Computer applications to medicine.  Medical informatics

           R864  Medical records

           R895-920  Medical physics.  Medical radiology.  Nuclear medicine



Finding the most frequent terms in a document 

  • Typically stop words: the, and, in
  • Not content-bearing
  • Terms vs. words
  • Luhn's method


Luhn's method 

[Figure: Luhn's curve of word frequency vs. rank, with upper and lower frequency cutoffs; words of intermediate frequency are the best content indicators]



Computing term salience 

  • Term frequency (TF)
  • Document frequency (DF)
  • Inverse document frequency (IDF)


Applications of TFIDF 

  • Cosine similarity
  • Indexing
  • Clustering

 



Vector-based matching 

  • The cosine measure
 

sim(D,C) = Σk (dk × ck × idf(k)) / (√(Σk dk²) × √(Σk ck²))



IDF: Inverse document frequency 

N: number of documents 
dk: number of documents containing term k 
fik: absolute frequency of term k in document i 
wik: weight of term k in document i 

idfk = log2(N/dk) + 1 = log2 N - log2 dk + 1 

wik = fik × idfk 

TF * IDF is used for automated indexing and for topic 
discrimination:
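
Before the example term lists, a minimal Python sketch of these definitions (the counts in the example call are made up):

import math

def idf(N, d_k):
    # idf_k = log2(N / d_k) + 1
    return math.log2(N / d_k) + 1

def weight(f_ik, N, d_k):
    # w_ik = f_ik * idf_k
    return f_ik * idf(N, d_k)

# a term occurring 5 times in a document, and in 10 of 1,000 documents:
print(weight(5, 1000, 10))   # 5 * (log2(100) + 1), about 38.2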



Asian and European news 

Asian news: 

622.941        deng

306.835        china

196.725        beijing

153.608        chinese

152.113        xiaoping

124.591        jiang

108.777        communist

102.894        body

  85.173        party

  71.898        died

  68.820        leader

  43.402        state

  38.166        people 

European news: 

97.487        nato

92.151        albright

74.652        belgrade

46.657        enlargement

34.778        alliance

34.778        french

33.803        opposition

32.571        russia

14.095        government

  9.389        told

  9.154        would

  8.459        their

  6.059        which



Other topics 

120.385        shuttle

  99.487        space

  90.128        telescope

  70.224        hubble

  59.992        rocket

  50.160        astronauts

  49.722        discovery

  47.782        canaveral

  47.782        cape

  40.889        mission

  35.778        florida

  27.063        center 

74.652        compuserve

65.321        massey

55.989        salizzoni

29.996        bob

27.994        online

27.198        executive

15.890        interim

15.271        chief

11.647        service

11.174        second

  6.781        world

  6.315        president



Compression



Compression 

  • Methods
    • Fixed length codes
    • Huffman coding
    • Ziv-Lempel codes


Fixed length codes 

  • Binary representations
    • ASCII
    • Representational power (2^k symbols where k is the number of bits)

 



Variable length codes 

  • Alphabet:

        A .-      N -.      0 -----
        B -...    O ---     1 .----
        C -.-.    P .--.    2 ..---
        D -..     Q --.-    3 ...--
        E .       R .-.     4 ....-
        F ..-.    S ...     5 .....
        G --.     T -       6 -....
        H ....    U ..-     7 --...
        I ..      V ...-    8 ---..
        J .---    W .--     9 ----.
        K -.-     X -..-
        L .-..    Y -.--
        M --      Z --..

  • Demo:
    • http://www.babbage.demon.co.uk/morse.html
    • http://www.scphillips.com/morse/

 



Most frequent letters in English 

  • Most frequent letters:
    • E T A O I N S H R D L U
    • http://www.math.cornell.edu/~mec/modules/cryptography/subs/frequencies.html
  • Demo:
    • http://www.amstat.org/publications/jse/secure/v7n2/count-char.cfm
  • Also: bigrams:
    • TH HE IN ER AN RE ND AT ON NT
    • http://www.math.cornell.edu/~mec/modules/cryptography/subs/digraphs.html


Useful links about cryptography 

  • http://world.std.com/~franl/crypto.html
  • http://www.faqs.org/faqs/cryptography-faq/
  • http://en.wikipedia.org/wiki/Cryptography

 



Huffman coding 

  • Developed by David Huffman (1952)
  • Average of 5 bits per character (37.5% compression)
  • Based on frequency distributions of symbols
  • Algorithm: iteratively build a tree of symbols starting with the two least frequent symbols
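
A minimal Python sketch of this bottom-up merging (the symbol frequencies are illustrative, not taken from the figure that follows):

import heapq

def huffman_codes(freqs):
    # start with one leaf per symbol; repeatedly merge the two least frequent nodes
    heap = [(f, i, {sym: ''}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: '0' + c for s, c in left.items()}         # left subtree gets a 0
        merged.update({s: '1' + c for s, c in right.items()})  # right subtree gets a 1
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

print(huffman_codes({'a': 45, 'b': 13, 'c': 12, 'd': 16, 'e': 9, 'f': 5}))
# {'a': '0', 'c': '100', 'b': '101', 'f': '1100', 'e': '1101', 'd': '111'}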































[Figure: step-by-step construction of a Huffman tree and the resulting codes]





Exercise 

  • Consider the bit string:  
    01101101111000100110001110100111000110101101011101
  • Use the Huffman code from the example to decode it.
  • Try inserting, deleting, and switching some bits at random locations and try decoding.


Ziv-Lempel coding 

  • Two types - one is known as LZ77 (used in GZIP)
  • Code: set of triples <a,b,c>
  • a: how far back in the decoded text to look for the upcoming text segment
  • b: how many characters to copy
  • c: new character to add to complete segment


  • <0,0,p> p
  • <0,0,e> pe
  • <0,0,t> pet
  • <2,1,r> peter
  • <0,0,_> peter_
  • <6,1,i> peter_pi
  • <8,2,r> peter_piper
  • <6,3,c> peter_piper_pic
  • <0,0,k> peter_piper_pick
  • <7,1,d> peter_piper_picked
  • <7,1,a> peter_piper_picked_a
  • <9,2,e> peter_piper_picked_a_pe
  • <9,2,_> peter_piper_picked_a_peck_
  • <0,0,o> peter_piper_picked_a_peck_o
  • <0,0,f> peter_piper_picked_a_peck_of
  • <17,5,l> peter_piper_picked_a_peck_of_pickl
  • <12,1,d> peter_piper_picked_a_peck_of_pickled
  • <16,3,p> peter_piper_picked_a_peck_of_pickled_pep
  • <3,2,r> peter_piper_picked_a_peck_of_pickled_pepper
  • <0,0,s> peter_piper_picked_a_peck_of_pickled_peppers
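
A decoder sketch for triples like these; copying character by character also handles copies that overlap the segment being built:

def lz77_decode(triples):
    out = []
    for back, length, ch in triples:
        start = len(out) - back           # how far back in the decoded text to look
        for i in range(length):           # how many characters to copy
            out.append(out[start + i])
        out.append(ch)                    # the new character completing the segment
    return ''.join(out)

triples = [(0,0,'p'), (0,0,'e'), (0,0,'t'), (2,1,'r'), (0,0,'_'), (6,1,'i'), (8,2,'r')]
print(lz77_decode(triples))   # peter_piper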


Links on text compression 

  • Data compression:
    • http://www.data-compression.info/
  • Calgary corpus:
    • http://links.uwaterloo.ca/calgary.corpus.html
  • Huffman coding:
    • http://www.compressconsult.com/huffman/
    • http://en.wikipedia.org/wiki/Huffman_coding
  • LZ
    • http://en.wikipedia.org/wiki/LZ77


Relevance feedback  
and 
query expansion



Relevance feedback 

  • Problem: initial query may not be the most appropriate to satisfy a given information need.
  • Idea: modify the original query so that it gets closer to the right documents in the vector space


Relevance feedback 

  • Automatic
  • Manual
  • Method: identifying feedback terms 

    Q' = a1Q + a2R - a3N

    Often a1 = 1, a2 = 1/|R|, and a3 = 1/|N|
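
A minimal sketch of this update with a2 = 1/|R| and a3 = 1/|N| folded in as centroids; documents are equal-length term-count vectors, and the numbers are made up:

def rocchio(q, rel, nonrel, a1=1.0):
    # Q' = a1*Q + (1/|R|) * sum(rel) - (1/|N|) * sum(nonrel), per dimension
    def centroid(docs):
        return [sum(col) / len(docs) for col in zip(*docs)] if docs else [0.0] * len(q)
    r, n = centroid(rel), centroid(nonrel)
    return [a1 * qi + ri - ni for qi, ri, ni in zip(q, r, n)]

# toy 3-term vocabulary:
print(rocchio([1, 1, 0], rel=[[1, 1, 1], [1, 0, 1]], nonrel=[[0, 1, 0]]))
# [2.0, 0.5, 1.0]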


Example 

  • Q = "safety minivans"
  • D1 = "car safety minivans tests injury statistics" - relevant
  • D2 = "liability tests safety" - relevant
  • D3 = "car passengers injury reviews" - non-relevant
  • R = ?
  • S = ?
  • Q' = ?


Pseudo relevance feedback 

  • Automatic query expansion
    • Thesaurus-based expansion (e.g., using latent semantic indexing - covered later)
    • Distributional similarity
    • Query log mining


Examples 

Lexical semantics (hypernymy): 

Book: publication, product, fact, dramatic composition, record

Computer: machine, expert, calculator, reckoner, figurer

Fruit: reproductive structure, consequence, product, bear

Politician: leader, schemer

Newspaper: press, publisher, product, paper, newsprint  

Distributional clustering: 

Book: autobiography, essay, biography, memoirs, novels

Computer: adobe, computing, computers, developed, hardware

Fruit: leafy, canned, fruits, flowers, grapes

Politician: activist, campaigner, politicians, intellectuals, journalist

Newspaper: daily, globe, newspapers, newsday, paper



Examples (query logs) 

  • Book: booksellers, bookmark, blue
  • Computer: sales, notebook, stores, shop
  • Fruit: recipes cake salad basket company
  • Games: online play gameboy free video
  • Politician: careers federal office history
  • Newspaper: online website college information
  • Schools: elementary high ranked yearbook
  • California: berkeley san francisco southern
  • French: embassy dictionary learn


Problems with automatic query expansion 

  • Adding frequent words may dilute the results (example?)

 



Stemming



Goals 

  • Motivation:
    • Computer, computers, computerize, computational, computerization
    • User, users, using, used
  • Representing related words as one token
  • Simplify matching
  • Reduce storage and computation
  • Also known as: term conflation


Methods 

  • Manual (tables)
    • Achievement -> achiev
    • Achiever -> achiev
    • Etc.
  • Affix removal (Harman 1991, Frakes 1992)
    • If a word ends in "ies" but not "eies" or "aies", then "ies" -> "y"
    • If a word ends in "es" but not "aes", "ees", or "oes", then "es" -> "e"
    • If a word ends in "s" but not "us" or "ss", then "s" -> NULL
    • (Apply only the first applicable rule.)


Porter's algorithm (Porter 1980) 

  • Home page:
    • http://www.tartarus.org/~martin/PorterStemmer
  • Reading assignment:
    • http://www.tartarus.org/~martin/PorterStemmer/def.txt
  • Consonant-vowel sequences:
    • CVCV ... C
    • CVCV ... V
    • VCVC ... C
    • VCVC ... V
    • Shorthand: [C]VCVC ... [V]


Porter's algorithm (cont'd) 

  • [C](VC){m}[V]
      • {m} indicates repetition
  • Examples:
      • m=0 TR, EE, TREE, Y, BY
      • m=1 TROUBLE, OATS, TREES, IVY
      • m=2 TROUBLES, PRIVATE, OATEN
  • Conditions:
      • *S - the stem ends with S (and similarly for the other letters).
      • *v* - the stem contains a vowel.
      • *d - the stem ends with a double consonant (e.g. -TT, -SS).
      • *o - the stem ends cvc, where the second c is not W, X or Y (e.g. -WIL, -HOP).


Step 1a

      SSES -> SS caresses -> caress

      IES -> I ponies -> poni ties -> ti

      SS -> SS caress -> caress

      S -> cats -> cat

Step 1b

      (m>0) EED -> EE feed -> feed agreed -> agree

      (*v*) ED -> plastered -> plaster bled -> bled

      (*v*) ING -> motoring -> motor sing -> sing

Step 1b1

      If the second or third of the rules in Step 1b is successful,

      the following is done:

      AT -> ATE conflat(ed) -> conflate

      BL -> BLE troubl(ed) -> trouble

      IZ -> IZE siz(ed) -> size (*d and not (*L or *S or *Z)) -> single letter

            hopp(ing) -> hop

            tann(ed) -> tan

            fall(ing) -> fall

            hiss(ing) -> hiss

            fizz(ed) -> fizz

      (m=1 and *o) -> E fail(ing) -> fail fil(ing) -> file



Step 1c

      (*v*) Y -> I happy -> happi sky -> sky

Step 2

      (m>0) ATIONAL -> ATE relational -> relate

      (m>0) TIONAL -> TION conditional -> condition rational -> rational

      (m>0) ENCI -> ENCE valenci -> valence

      (m>0) ANCI -> ANCE hesitanci -> hesitance

      (m>0) IZER -> IZE digitizer -> digitize

      (m>0) ABLI -> ABLE conformabli -> conformable

      (m>0) ALLI -> AL radicalli -> radical

      (m>0) ENTLI -> ENT differentli -> different

      (m>0) ELI -> E vileli -> vile

      (m>0) OUSLI -> OUS analogousli -> analogous

      (m>0) IZATION -> IZE vietnamization -> vietnamize

      (m>0) ATION -> ATE predication -> predicate

      (m>0) ATOR -> ATE operator -> operate

      (m>0) ALISM -> AL feudalism -> feudal

      (m>0) IVENESS -> IVE decisiveness -> decisive

      (m>0) FULNESS -> FUL hopefulness -> hopeful

      (m>0) OUSNESS -> OUS callousness -> callous

      (m>0) ALITI -> AL formaliti -> formal

      (m>0) IVITI -> IVE sensitiviti -> sensitive

      (m>0) BILITI -> BLE sensibiliti -> sensible



Step 3

      (m>0) ICATE -> IC triplicate -> triplic

      (m>0) ATIVE -> formative -> form

      (m>0) ALIZE -> AL formalize -> formal

      (m>0) ICITI -> IC electriciti -> electric

      (m>0) ICAL -> IC electrical -> electric

      (m>0) FUL -> hopeful -> hope

      (m>0) NESS -> goodness -> good

Step 4

      (m>1) AL -> revival -> reviv

      (m>1) ANCE -> allowance -> allow

      (m>1) ENCE -> inference -> infer

      (m>1) ER -> airliner -> airlin

      (m>1) IC -> gyroscopic -> gyroscop

      (m>1) ABLE -> adjustable -> adjust

      (m>1) IBLE -> defensible -> defens

      (m>1) ANT -> irritant -> irrit

      (m>1) EMENT -> replacement -> replac

      (m>1) MENT -> adjustment -> adjust

      (m>1) ENT -> dependent -> depend

      (m>1 and (*S or *T)) ION -> adoption -> adopt

      (m>1) OU -> homologou -> homolog

      (m>1) ISM -> communism -> commun

      (m>1) ATE -> activate -> activ

      (m>1) ITI -> angulariti -> angular

      (m>1) OUS -> homologous -> homolog

      (m>1) IVE -> effective -> effect

      (m>1) IZE -> bowdlerize -> bowdler



Step 5a

      (m>1) E -> probate -> probat rate -> rate

      (m=1 and not *o) E -> cease -> ceas  

Step 5b

      (m > 1 and *d and *L) -> single letter controll -> control roll -> roll



Porter's algorithm (cont'd) 

Example: the word "duplicatable" 

duplicat          rule 4 
duplicate        rule 1b1 
duplic             rule 3 

The other applicable rule in step 4, removing "ic", cannot 
be applied, since only one rule from each step may be applied.  

% cd /clair4/class/ir-w03/tf-idf

% ./stem.pl computers 

computers       comput



Porter's algorithm



Stemming 

  • Not always appropriate (e.g., proper names, titles)
  • The same applies to casing (e.g., CAT vs. cat)

 



String matching



String matching methods 

  • Index-based
  • Full or approximate
    • E.g., theater = theatre


Index-based matching 

  • Inverted files
  • Position-based inverted files 
  • Block-based inverted files 
     
 

1    6  9 11    17 19  24  28   33     40    46  50   55   60

This is a text. A text has many words. Words are made from letters. 

Text: 11, 19

Words: 33, 40

From: 55



Inverted index (trie) 

Letters: 60 

Text: 11, 19 

Words: 33, 40 

Made: 50 

Many: 28 










Sequential searching 

  • No indexing structure given
  • Given: database d and search pattern p.
    • Example: find ��words�� in the earlier example
  • Brute force method
    • try all possible starting positions
    • O(n) positions in the database and O(m) characters in the pattern so the total worst-case runtime is O(mn)
    • Typical runtime is actually O(n) given that mismatches are easy to notice


Knuth-Morris-Pratt 

  • Average runtime similar to brute force
  • Worst case runtime is linear: O(n)
  • Idea: reuse knowledge
  • Need preprocessing of the pattern


Knuth-Morris-Pratt (cont'd) 

  • Example (http://en.wikipedia.org/wiki/Knuth-Morris-Pratt_algorithm)
 
 

database: ABC ABC ABC ABDAB ABCDABCDABDE

pattern: ABCDABD  

index 0 1 2 3 4 5 6 7

char  A B C D A B D –

pos  -1 0 0 0 0 1 2 0  




Knuth-Morris-Pratt (cont'd) 

ABC ABC ABC ABDAB ABCDABCDABDE

ABCDABD

   ^

ABC ABC ABC ABDAB ABCDABCDABDE

   ABCDABD

   ^

ABC ABC ABC ABDAB ABCDABCDABDE

    ABCDABD

    ^

ABC ABC ABC ABDAB ABCDABCDABDE

                  ABCDABD

                        ^

ABC ABC ABC ABDAB ABCDABCDABDE

                      ABCDABD

                        ^
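
A sketch of both phases; for ABCDABD the preprocessing produces exactly the pos row above, and the search finds the single match in the example database:

def failure_table(p):
    # t[i+1] = length of the longest proper prefix of p[:i+1] that is also its suffix
    t = [-1] + [0] * len(p)
    k = -1
    for i, ch in enumerate(p):
        while k >= 0 and p[k] != ch:
            k = t[k]                      # fall back instead of restarting
        k += 1
        t[i + 1] = k
    return t

def kmp_search(text, p):
    t, k, hits = failure_table(p), 0, []
    for i, ch in enumerate(text):
        while k >= 0 and p[k] != ch:
            k = t[k]
        k += 1
        if k == len(p):                   # full match ending at position i
            hits.append(i - k + 1)
            k = t[k]                      # reuse the matched border and keep scanning
    return hits

print(failure_table("ABCDABD"))                                 # [-1, 0, 0, 0, 0, 1, 2, 0]
print(kmp_search("ABC ABC ABC ABDAB ABCDABCDABDE", "ABCDABD"))  # [22]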



Boyer-Moore 

  • Used in text editors
  • Demos
    • http://www-sr.informatik.uni-tuebingen.de/~buehler/BM/BM.html
    • http://www.blarg.com/~doyle/pages/bmi.html

 



Word similarity 

  • Hamming distance - when words are of the same length
  • Levenshtein distance - number of edits (insertions, deletions, replacements)
    • color --> colour (1)
    • survey --> surgery (2)
    • com puter --> computer   ?
  • Longest common subsequence (LCS)
    • lcs (survey, surgery) = surey


Levenshtein edit distance 

  • Examples:
    • Theatre-> theater
    • Ghaddafi->Qadafi
    • Computer->counter
  • Edit distance (inserts, deletes, substitutions)
    • Edit transcript
  • Done through dynamic programming


Recurrence relation 

  • Three dependencies:
    • D(i,0) = i
    • D(0,j) = j
    • D(i,j) = min[D(i-1,j)+1, D(i,j-1)+1, D(i-1,j-1)+t(i,j)]
  • Simple edit distance:
    • t(i,j) = 0 if S1(i) = S2(j), 1 otherwise
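
The recurrence filled in bottom-up, as a minimal sketch:

def edit_distance(s1, s2):
    # D(i,0) = i; D(0,j) = j;
    # D(i,j) = min(D(i-1,j)+1, D(i,j-1)+1, D(i-1,j-1)+t(i,j))
    m, n = len(s1), len(s2)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i
    for j in range(n + 1):
        D[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + t)
    return D[m][n]

print(edit_distance("survey", "surgery"))   # 2
print(edit_distance("color", "colour"))     # 1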


Example 

Gusfield 1997 













































[Figure: edit-distance dynamic-programming table]



Example (cont'd) 

Gusfield 1997 






































































[Figure: completed edit-distance DP table with the final distances]



Tracebacks 

Gusfield 1997 






































































[Figure: traceback pointers through the DP table, giving the edit transcript]



Weighted edit distance 

  • Used to emphasize the relative cost of different edit operations
  • Useful in bioinformatics
    • Homology information
    • BLAST
    • Blosum
    • http://eta.embl-heidelberg.de:8000/misc/mat/blosum50.html


  • Web sites:
    • http://www.merriampark.com/ld.htm
    • http://odur.let.rug.nl/~kleiweg/lev/


Clustering



Clustering 

  • Exclusive/overlapping clusters
  • Hierarchical/flat clusters
  • The cluster hypothesis 
    • Documents in the same cluster are relevant to the same query


Representations for document clustering 

  • Typically: vector-based
    • Words: "cat", "dog", etc.
    • Features: document length, author name, etc.
  • Each document is represented as a vector in an n-dimensional space
  • Similar documents appear nearby in the vector space (distance measures are needed)


Hierarchical clustering 
Dendrograms 

http://odur.let.rug.nl/~kleiweg/clustering/clustering.html 

E.g., language similarity:



Another example 

  • Kingdom = animal
  • Phylum = Chordata
  • Subphylum = Vertebrata
  • Class = Osteichthyes
  • Subclass = Actinoptergyii
  • Order = Salmoniformes
  • Family = Salmonidae
  • Genus = Oncorhynchus
  • Species = Oncorhynchus kisutch (Coho salmon)


Clustering using dendrograms 

REPEAT

    Compute pairwise similarities

    Identify closest pair

    Merge pair into single node

UNTIL only one node left

Q: what is the equivalent Venn diagram representation? 

Example: cluster the following sentences: 

      A B C B A

      A D C C A D E

      C D E F C D A

      E F G F D A

      A C D A B A
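
A minimal sketch of this loop using single-linkage and simple word overlap, run on the sentences above:

def agglomerate(items, sim):
    clusters = [[x] for x in items]
    merges = []
    while len(clusters) > 1:
        # closest pair of clusters (single link: best similarity between any two members)
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = max(pairs, key=lambda p: max(
            sim(a, b) for a in clusters[p[0]] for b in clusters[p[1]]))
        merges.append((tuple(clusters[i]), tuple(clusters[j])))
        clusters[i] += clusters.pop(j)    # merge pair into a single node
    return merges

overlap = lambda a, b: len(set(a.split()) & set(b.split()))   # word-overlap similarity
sents = ["A B C B A", "A D C C A D E", "C D E F C D A", "E F G F D A", "A C D A B A"]
for left, right in agglomerate(sents, overlap):
    print(left, '+', right)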



Methods 

  • Single-linkage
    • One common pair is sufficient
    • Disadvantages: long chains
  • Complete-linkage
    • All pairs have to match
    • Disadvantages: too conservative
  • Average-linkage
  • Centroid-based (online)
    • Look at distances to centroids
  • Demo:
    • /clair4/class/ir-w05/clustering


k-means 

  • Needed: small number k of desired clusters
  • hard vs. soft decisions
  • Example: Weka


k-means 

1 initialize cluster centroids to arbitrary vectors

2 while further improvement is possible do

3     for each document d do

4        find the cluster c whose centroid is closest to d

5        assign d to cluster c

6   end for

7   for each cluster c do

8       recompute the centroid of cluster c based on its documents

9   end for

10 end while



Example 

  • Cluster the following vectors into two groups:
    • A = <1,6>
    • B = <2,2>
    • C = <4,0>
    • D = <3,3>
    • E = <2,5>
    • F = <2,1>
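
A runnable sketch of the numbered loop above, applied to these six vectors; it uses a fixed iteration cap instead of the "further improvement" test, and random initialization, so runs can differ:

import math, random

def kmeans(docs, k, iters=20):
    centroids = random.sample(docs, k)                    # line 1: arbitrary start
    clusters = []
    for _ in range(iters):                                # line 2 (fixed cap)
        clusters = [[] for _ in range(k)]
        for d in docs:                                    # lines 3-6: assignment
            i = min(range(k), key=lambda c: math.dist(d, centroids[c]))
            clusters[i].append(d)
        for i, cl in enumerate(clusters):                 # lines 7-9: recompute centroids
            if cl:
                centroids[i] = [sum(col) / len(cl) for col in zip(*cl)]
    return centroids, clusters

docs = [[1, 6], [2, 2], [4, 0], [3, 3], [2, 5], [2, 1]]
print(kmeans(docs, 2))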


Complexity 

  • Complexity = O(kn) per iteration, because each of the n documents has to be compared to each of the k centroids.


Weka 

  • A general environment for machine learning (e.g. for classification and clustering)
  • Book by Witten and Frank
  • www.cs.waikato.ac.nz/ml/weka
 

 



Demos 

  • http://vivisimo.com/
  • http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/AppletKM.html
  • http://cgm.cs.mcgill.ca/~godfried/student_projects/bonnef_k-means
  • http://www.cs.washington.edu/research/imagedatabase/demo/kmcluster
  • http://www.cc.gatech.edu/~dellaert/html/software.html
  • http://www-2.cs.cmu.edu/~awm/tutorials/kmeans11.pdf
  • http://www.ece.neu.edu/groups/rpl/projects/kmeans/


Human clustering 

  • Significant disagreement in the number of clusters, overlap of clusters, and the composition of clusters (Macskassy et al. 1998).


Lexical networks



Lexical Networks 

  • Used to represent relationships between words
  • Example: WordNet - created by George Miller's team at Princeton
  • Based on synsets (synonyms, interchangeable words) and lexical matrices


Lexical matrix



Synsets 

  • Disambiguation
    • {board, plank}
    • {board, committee}
  • Synonyms
    • substitution
    • weak substitution
    • synonyms must be of the same part of speech


$ ./wn board -hypen 

Synonyms/Hypernyms (Ordered by Frequency) of noun board 

9 senses of board 

Sense 1

board

       => committee, commission

           => administrative unit

               => unit, social unit

                   => organization, organisation

                       => social group

                           => group, grouping 

Sense 2

board

       => sheet, flat solid

           => artifact, artefact

               => object, physical object

                   => entity, something 

Sense 3

board, plank

       => lumber, timber

           => building material

               => artifact, artefact

                   => object, physical object

                       => entity, something



Sense 4

display panel, display board, board

       => display

           => electronic device

               => device

                   => instrumentality, instrumentation

                       => artifact, artefact

                           => object, physical object

                               => entity, something 

Sense 5

board, gameboard

       => surface

           => artifact, artefact

               => object, physical object

                   => entity, something 

Sense 6

board, table

       => fare

           => food, nutrient

               => substance, matter

                   => object, physical object

                       => entity, something



Sense 7

control panel, instrument panel, control board, board, panel

       => electrical device

           => device

               => instrumentality, instrumentation

                   => artifact, artefact

                       => object, physical object

                           => entity, something 

Sense 8

circuit board, circuit card, board, card

       => printed circuit

           => computer circuit

               => circuit, electrical circuit, electric circuit

                   => electrical device

                       => device

                           => instrumentality, instrumentation

                               => artifact, artefact

                                   => object, physical object

                                       => entity, something 

Sense 9

dining table, board

       => table

           => furniture, piece of furniture, article of furniture

               => furnishings

                   => instrumentality, instrumentation

                       => artifact, artefact

                           => object, physical object

                               => entity, something



Antonymy 

  • "x" vs. "not-x"
  • "rich" vs. "poor"?
  • {rise, ascend} vs. {fall, descend}

 



Other relations 

  • Meronymy: X is a meronym of Y when native speakers of English accept sentences similar to "X is a part of Y", "X is a member of Y".
  • Hyponymy: {tree} is a hyponym of {plant}.
  • Hierarchical structure based on hyponymy (and hypernymy).


Other features of WordNet 

  • Index of familiarity
  • Polysemy

 



Familiarity and polysemy 

board used as a noun is familiar (polysemy count = 9) 

bird used as a noun is common (polysemy count = 5) 

cat used as a noun is common (polysemy count = 7) 

house used as a noun is familiar (polysemy count = 11) 

information used as a noun is common (polysemy count = 5) 

retrieval used as a noun is uncommon (polysemy count = 3) 

serendipity used as a noun is very rare (polysemy count = 1) 



Compound nouns 

advisory board

appeals board

backboard

backgammon board

baseboard

basketball backboard

big board

billboard

binder's board

binder board 

blackboard

board game

board measure

board meeting

board member

board of appeals

board of directors

board of education

board of regents

board of trustees



Overview of senses 

1. board -- (a committee having supervisory powers; "the board has seven members")

2. board -- (a flat piece of material designed for a special purpose; "he nailed boards across the windows")

3. board, plank -- (a stout length of sawn timber; made in a wide variety of sizes and used for many purposes)

4. display panel, display board, board -- (a board on which information can be displayed to public view)

5. board, gameboard -- (a flat portable surface (usually rectangular) designed for board games; "he got out the board and set up the pieces")

6. board, table -- (food or meals in general; "she sets a fine table"; "room and board")

7. control panel, instrument panel, control board, board, panel -- (an insulated panel containing switches and dials and meters for controlling electrical devices; "he checked the instrument panel"; "suddenly the board lit up like a Christmas tree")

8. circuit board, circuit card, board, card -- (a printed circuit that can be inserted into expansion slots in a computer to increase the computer's capabilities)

9. dining table, board -- (a table at which meals are served; "he helped her clear the dining table"; "a feast was spread upon the board")



Top-level concepts 

{act, action, activity}

{animal, fauna}

{artifact}

{attribute, property}

{body, corpus}

{cognition, knowledge}

{communication}

{event, happening}

{feeling, emotion}

{food}

{group, collection}

{location, place}

{motive} 

{natural object}

{natural phenomenon}

{person, human being}

{plant, flora}

{possession}

{process}

{quantity, amount}

{relation}

{shape}

{state, condition}

{substance}

{time}



WordNet and DistSim 

wn reason -hypen  - hypernyms

wn reason -synsn   - synsets

wn reason -simsn   - synonyms

wn reason -over     - overview of senses

wn reason -famln   - familiarity/polysemy

wn reason -grepn   - compound nouns 

/data2/tools/relatedwords/relate reason



System comparison



Comparing two systems 

  • Comparing A and B
  • One query?
  • Average performance?
  • Need: A to consistently outperform B
 
 
 
 

[this slide: courtesy James Allan]



The sign test 

  • Example 1:
    • A > B (12 times)
    • A = B (25 times)
    • A < B (3 times)
    • p < 0.035 (significant at the 5% level)
  • Example 2:
    • A > B (18 times)
    • A < B (9 times)
    • p < 0.122 (not significant at the 5% level)
 

[this slide: courtesy James Allan]



Other tests 

  • The t test:
    • Takes into account the actual performances, not just which system is better
    • http://nimitz.mcs.kent.edu/~blewis/stat/tTest.html
  • The sign test:
    • http://www.fon.hum.uva.nl/Service/Statistics/Sign_Test.html


Techniques for dimensionality 
reduction: SVD and LSI



Techniques for dimensionality reduction 

  • Based on matrix decomposition (goal: preserve clusters, explain away variance)
  • A quick review of matrices
    • Vectors
    • Matrices
    • Matrix multiplication


SVD: Singular Value Decomposition 

  • A = USVᵀ
  • This decomposition exists for all matrices, dense or sparse
  • If A has 3 rows and 5 columns, then U will be 3x3 and V will be 5x5
  • In Matlab, use [U,S,V] = svd(A)
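
The same decomposition in Python with NumPy; the 3x5 matrix here is a made-up toy, not the Berry and Browne matrix used in the following slides:

import numpy as np

A = np.array([[1., 0., 1., 0., 0.],
              [0., 1., 0., 0., 0.],
              [1., 1., 0., 1., 1.]])          # 3 rows, 5 columns

U, s, Vt = np.linalg.svd(A)                   # U: 3x3, Vt: 5x5, s: singular values
k = 2                                         # keep the top-k singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k approximation, as in LSI
print(np.round(A_k, 4))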


Term matrix normalization 

[Figure: term-document matrix with columns D1-D5, before and after normalization]



Example (Berry and Browne) 

  • T1: baby
  • T2: child
  • T3: guide
  • T4: health
  • T5: home
  • T6: infant
  • T7: proofing
  • T8: safety
  • T9: toddler
 
  • D1: infant & toddler first aid
  • D2: babies & children's room (for your home)
  • D3: child safety at home
  • D4: your baby's health and safety: from infant to toddler
  • D5: baby proofing basics
  • D6: your guide to easy rust proofing
  • D7: beanie babies collector's guide


Document term matrix



Decomposition 

u = 

       -0.6976   -0.0945    0.0174   -0.6950    0.0000    0.0153    0.1442   -0.0000         0

   -0.2622    0.2946    0.4693    0.1968   -0.0000   -0.2467   -0.1571   -0.6356    0.3098

   -0.3519   -0.4495   -0.1026    0.4014    0.7071   -0.0065   -0.0493   -0.0000    0.0000

   -0.1127    0.1416   -0.1478   -0.0734    0.0000    0.4842   -0.8400    0.0000   -0.0000

   -0.2622    0.2946    0.4693    0.1968    0.0000   -0.2467   -0.1571    0.6356   -0.3098

   -0.1883    0.3756   -0.5035    0.1273   -0.0000   -0.2293    0.0339   -0.3098   -0.6356

   -0.3519   -0.4495   -0.1026    0.4014   -0.7071   -0.0065   -0.0493    0.0000   -0.0000

   -0.2112    0.3334    0.0962    0.2819   -0.0000    0.7338    0.4659   -0.0000    0.0000

   -0.1883    0.3756   -0.5035    0.1273   -0.0000   -0.2293    0.0339    0.3098    0.6356 

v = 

   -0.1687    0.4192   -0.5986    0.2261         0   -0.5720    0.2433

   -0.4472    0.2255    0.4641   -0.2187    0.0000   -0.4871   -0.4987

   -0.2692    0.4206    0.5024    0.4900   -0.0000    0.2450    0.4451

   -0.3970    0.4003   -0.3923   -0.1305         0    0.6124   -0.3690

   -0.4702   -0.3037   -0.0507   -0.2607   -0.7071    0.0110    0.3407

   -0.3153   -0.5018   -0.1220    0.7128   -0.0000   -0.0162   -0.3544

   -0.4702   -0.3037   -0.0507   -0.2607    0.7071    0.0110    0.3407



Decomposition 

s =

    1.5849            0             0            0             0             0             0

             0    1.2721            0            0             0             0             0

             0            0    1.1946            0             0             0             0

             0            0             0    0.7996            0             0             0

             0            0             0            0    0.7100             0             0

             0            0             0            0             0    0.5692             0

             0            0             0            0             0             0    0.1977

             0            0             0            0             0             0             0

             0            0             0            0             0             0             0 

Spread on the v1 axis



Rank-4 approximation 

s4 = 

    1.5849         0         0         0         0         0         0

         0    1.2721         0         0         0         0         0

         0         0    1.1946         0         0         0         0

         0         0         0    0.7996         0         0         0

         0         0         0         0         0         0         0

         0         0         0         0         0         0         0

         0         0         0         0         0         0         0

         0         0         0         0         0         0         0

         0         0         0         0         0         0         0

 



Rank-4 approximation 

u*s4*v'

   -0.0019    0.5985   -0.0148    0.4552    0.7002    0.0102    0.7002

   -0.0728    0.4961    0.6282    0.0745    0.0121   -0.0133    0.0121

    0.0003   -0.0067    0.0052   -0.0013    0.3584    0.7065    0.3584

    0.1980    0.0514    0.0064    0.2199    0.0535   -0.0544    0.0535

   -0.0728    0.4961    0.6282    0.0745    0.0121   -0.0133    0.0121

    0.6337   -0.0602    0.0290    0.5324   -0.0008    0.0003   -0.0008

    0.0003   -0.0067    0.0052   -0.0013    0.3584    0.7065    0.3584

    0.2165    0.2494    0.4367    0.2282   -0.0360    0.0394   -0.0360

    0.6337   -0.0602    0.0290    0.5324   -0.0008    0.0003   -0.0008



Rank-4 approximation 

u*s4 

   -1.1056   -0.1203    0.0207   -0.5558         0         0         0

   -0.4155    0.3748    0.5606    0.1573         0         0         0

   -0.5576   -0.5719   -0.1226    0.3210         0         0         0

   -0.1786    0.1801   -0.1765   -0.0587         0         0         0

   -0.4155    0.3748    0.5606    0.1573         0         0         0

   -0.2984    0.4778   -0.6015    0.1018         0         0         0

   -0.5576   -0.5719   -0.1226    0.3210         0         0         0

   -0.3348    0.4241    0.1149    0.2255         0         0         0

   -0.2984    0.4778   -0.6015    0.1018         0         0         0


(C) 2005, The University of Michigan 

167  

Rank-4 approximation 

s4*v' 

   -0.2674   -0.7087   -0.4266   -0.6292   -0.7451   -0.4996   -0.7451

    0.5333    0.2869    0.5351    0.5092   -0.3863   -0.6384   -0.3863

   -0.7150    0.5544    0.6001   -0.4686   -0.0605   -0.1457   -0.0605

    0.1808   -0.1749    0.3918   -0.1043   -0.2085    0.5700   -0.2085

         0         0         0         0         0         0         0

         0         0         0         0         0         0         0

         0         0         0         0         0         0         0

         0         0         0         0         0         0         0

         0         0         0         0         0         0         0


(C) 2005, The University of Michigan 

168  

Rank-2 approximation 

s2 = 

    1.5849         0         0         0         0         0         0

         0    1.2721         0         0         0         0         0

         0         0         0         0         0         0         0

         0         0         0         0         0         0         0

         0         0         0         0         0         0         0

         0         0         0         0         0         0         0

         0         0         0         0         0         0         0

         0         0         0         0         0         0         0

         0         0         0         0         0         0         0


(C) 2005, The University of Michigan 

169  

Rank-2 approximation 

u*s2*v' 

    0.1361    0.4673    0.2470    0.3908    0.5563    0.4089    0.5563

    0.2272    0.2703    0.2695    0.3150    0.0815   -0.0571    0.0815

   -0.1457    0.1204   -0.0904   -0.0075    0.4358    0.4628    0.4358

    0.1057    0.1205    0.1239    0.1430    0.0293   -0.0341    0.0293

    0.2272    0.2703    0.2695    0.3150    0.0815   -0.0571    0.0815

    0.2507    0.2412    0.2813    0.3097   -0.0048   -0.1457   -0.0048

   -0.1457    0.1204   -0.0904   -0.0075    0.4358    0.4628    0.4358

    0.2343    0.2454    0.2685    0.3027    0.0286   -0.1073    0.0286

    0.2507    0.2412    0.2813    0.3097   -0.0048   -0.1457   -0.0048


(C) 2005, The University of Michigan 

170  

Rank-2 approximation 

u*s2 

   -1.1056   -0.1203         0         0         0         0         0

   -0.4155    0.3748         0         0         0         0         0

   -0.5576   -0.5719         0         0         0         0         0

   -0.1786    0.1801         0         0         0         0         0

   -0.4155    0.3748         0         0         0         0         0

   -0.2984    0.4778         0         0         0         0         0

   -0.5576   -0.5719         0         0         0         0         0

   -0.3348    0.4241         0         0         0         0         0

   -0.2984    0.4778         0         0         0         0         0


(C) 2005, The University of Michigan 

171  

Rank-2 approximation 

s2*v' 

   -0.2674   -0.7087   -0.4266   -0.6292   -0.7451   -0.4996   -0.7451

    0.5333    0.2869    0.5351    0.5092   -0.3863   -0.6384   -0.3863

         0         0         0         0         0         0         0

         0         0         0         0         0         0         0

         0         0         0         0         0         0         0

         0         0         0         0         0         0         0

         0         0         0         0         0         0         0

         0         0         0         0         0         0         0

         0         0         0         0         0         0         0


(C) 2005, The University of Michigan 

172  

Documents to concepts and terms to concepts 

>> A(:,1)'*u*s 

-0.4238    0.6784   -0.8541    0.1446   -0.0000   -0.1853    0.0095 

>> A(:,1)'*u*s4 

-0.4238    0.6784   -0.8541    0.1446         0         0         0 
 

>> A(:,1)'*u*s2 

-0.4238    0.6784         0         0         0         0         0 

>> A(:,2)'*u*s2 

-1.1233    0.3650         0         0         0         0         0 

>> A(:,3)'*u*s2 

-0.6762    0.6807         0         0         0         0         0

 


(C) 2005, The University of Michigan 

173  

Documents to concepts and terms to concepts 

>> A(:,4)'*u*s2 

-0.9972    0.6478         0         0         0         0         0 

>> A(:,5)'*u*s2 

-1.1809   -0.4914         0         0         0         0         0 

>> A(:,6)'*u*s2 

-0.7918   -0.8121         0         0         0         0         0 

>> A(:,7)'*u*s2 

-1.1809   -0.4914         0         0         0         0         0

 


(C) 2005, The University of Michigan 

174  

Cont'd 

>> (s2*v'*A(1,:)')' 

-1.7523   -0.1530         0         0         0         0         0         0         0 

>> (s2*v'*A(2,:)')' 

-0.6585    0.4768         0         0         0         0         0         0         0 

>> (s2*v'*A(3,:)')' 

-0.8838   -0.7275         0         0         0         0         0         0         0 

>> (s2*v'*A(4,:)')' 

-0.2831    0.2291         0         0         0         0         0         0         0 

>> (s2*v'*A(5,:)')' 

-0.6585    0.4768         0         0         0         0         0         0         0

 


(C) 2005, The University of Michigan 

175  

Cont'd 

>> (s2*v'*A(6,:)')' 

   -0.4730    0.6078         0         0         0         0         0         0         0 

>> (s2*v'*A(7,:)')' 

   -0.8838   -0.7275         0         0         0         0         0         0         0 

>> (s2*v'*A(8,:)')' 

   -0.5306    0.5395         0         0         0         0         0         0         0 

>> (s2*v'*A(9,:)')' 

   -0.4730    0.6078         0         0         0         0         0         0         0


(C) 2005, The University of Michigan 

176  

Properties 

A*A' 

     1.5471    0.3364    0.5041    0.2025    0.3364    0.2025    0.5041    0.2025    0.2025

  0.3364    0.6728         0         0    0.6728         0         0    0.3364         0

  0.5041         0    1.0082         0         0         0    0.5041         0         0

  0.2025         0         0    0.2025         0    0.2025         0    0.2025    0.2025

  0.3364    0.6728         0         0    0.6728         0         0    0.3364         0

  0.2025         0         0    0.2025         0    0.7066         0    0.2025    0.7066

  0.5041         0    0.5041         0         0         0    1.0082         0         0

  0.2025    0.3364         0    0.2025    0.3364    0.2025         0    0.5389    0.2025

  0.2025         0         0    0.2025         0    0.7066         0    0.2025    0.7066 

A'*A 

          1.0082         0         0    0.6390         0         0         0

         0    1.0092    0.6728    0.2610    0.4118         0    0.4118

         0    0.6728    1.0092    0.2610         0         0         0

    0.6390    0.2610    0.2610    1.0125    0.3195         0    0.3195

         0    0.4118         0    0.3195    1.0082    0.5041    0.5041

         0         0         0         0    0.5041    1.0082    0.5041

         0    0.4118         0    0.3195    0.5041    0.5041    1.0082 
 

A is a term-by-document matrix. What is A*A', what is A'*A?


(C) 2005, The University of Michigan 

177  

Latent semantic indexing (LSI) 

  • Dimensionality reduction = identification of hidden (latent) concepts
  • Query matching in latent space
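
The same pipeline in miniature (a sketch: numpy is assumed, and the toy term-by-document matrix and query below are made-up stand-ins, not the data from the MATLAB session above):

import numpy as np

# rows = terms, columns = documents
A = np.array([[1, 0, 0, 1, 0, 0, 0],
              [0, 1, 1, 0, 1, 0, 1],
              [0, 1, 1, 0, 0, 0, 0],
              [1, 0, 0, 1, 0, 0, 0],
              [0, 1, 0, 0, 1, 1, 1],
              [0, 0, 0, 0, 1, 1, 1]], dtype=float)

u, s, vt = np.linalg.svd(A, full_matrices=False)   # A = u * diag(s) * vt
k = 2
A2 = u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]         # rank-2 approximation, as in u*s2*v'

# fold a query (a vector over the terms) into the latent space: q' * u_k * inv(s_k)
q = np.array([1, 0, 1, 0, 0, 0], dtype=float)
q_hat = (q @ u[:, :k]) / s[:k]
docs = vt[:k, :].T                                 # document coordinates in latent space
scores = (docs @ q_hat) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q_hat))
print(scores)                                      # cosine similarity of query to each document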

(C) 2005, The University of Michigan 

178  

Useful pointers 

  • http://lsa.colorado.edu
  • http://lsi.research.telcordia.com/
  • http://www.cs.utk.edu/~lsi/
  • http://javelina.cet.middlebury.edu/lsa/out/lsa_definition.htm
  • http://citeseer.nj.nec.com/deerwester90indexing.html
  • http://www.pcug.org.au/~jdowling/

(C) 2005, The University of Michigan 

179  

Models of the Web


(C) 2005, The University of Michigan 

180  

Size 

  • The Web is the largest repository of data and it grows exponentially.
    • 320 Million Web pages [Lawrence & Giles 1998]
    • 800 Million Web pages, 15 TB [Lawrence & Giles 1999]
    • 8 Billion Web pages indexed [Google 2005]
  • Amount of data
    • roughly 200 TB [Lyman et al. 2003]

(C) 2005, The University of Michigan 

181  

Bow-tie model of the Web 

SCC (strongly connected core): 56M pages 
IN: 44M 
OUT: 44M 
TENDRILS: 44M 
DISCONNECTED: 17M 

24% of pages are reachable from a given page 

Broder et al. WWW 2000, Dill et al. VLDB 2001


(C) 2005, The University of Michigan 

182  

Power laws 

  • Web site size (Huberman and Adamic 1999)
  • Power-law connectivity (Barabási and Albert 1999): exponents 2.45 for the out-degree and 2.1 for the in-degree
  • Others: call graphs among telephone carriers, citation networks (Redner 1998), collaboration graphs (e.g., the Erdős co-authorship graph, the graph of actors), metabolic pathways (Jeong et al. 2000), protein networks (Maslov and Sneppen 2002). All values of gamma are around 2-3.
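
A rough sketch of estimating such an exponent (numpy assumed; the Zipf sample is synthetic stand-in data, and a log-log least-squares fit is the simple textbook method, not the most robust estimator):

import numpy as np

degrees = np.random.zipf(2.1, size=10000)        # synthetic degree sample
ks, counts = np.unique(degrees, return_counts=True)
logk, logc = np.log10(ks), np.log10(counts)
slope, intercept = np.polyfit(logk, logc, 1)     # fit: log count = slope * log k + intercept
print("estimated gamma ~", -slope)               # P(k) ~ k^(-gamma)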

(C) 2005, The University of Michigan 

183  

Small-world networks 

  • Diameter = average length of the shortest path between all pairs of nodes. Example:
  • Milgram experiment (1967)
    • Kansas/Omaha --> Boston (42/160 letters)
    • diameter = 6
  • Albert et al. 1999 – average distance between two vertices is d = 0.35 + 2.06 log10 n. For n = 10^9, d = 18.89.
  • Six degrees of separation

(C) 2005, The University of Michigan 

184  

Clustering coefficient 

  • Cliquishness (c): the fraction of the kv(kv – 1)/2 possible pairs among a node's kv neighbors that are actually connected.
  • Examples (d = characteristic path length; drand, crand = values for a random graph of the same size; data as in Watts and Strogatz 1998):

    Network      n        k      d      drand   c       crand 
    Actors       225226   61     3.65   2.99    0.79    0.00027 
    Power grid   4941     2.67   18.7   12.4    0.08    0.005 
    C. Elegans   282      14     2.65   2.25    0.28    0.05 
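
A minimal sketch of computing c for one node (toy graph in plain Python):

from itertools import combinations

graph = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2}, 4: {1}}

def clustering(v):
    nbrs = graph[v]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in graph[a])
    return links / (k * (k - 1) / 2)     # fraction of the k(k-1)/2 possible neighbor pairs

print(clustering(1))                     # 1 of 3 neighbor pairs is linked -> 0.33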


(C) 2005, The University of Michigan 

185  

Models of the Web 

  • Erdős/Rényi 59, 60
  • Barabási/Albert 99
  • Watts/Strogatz 98
  • Kleinberg 98
  • Menczer 02
  • Radev 03
 
  • Evolving networks: fundamental object of statistical physics, social networks, mathematical biology, and epidemiology

(C) 2005, The University of Michigan 

186  

Self-triggerability across hyperlinks 

  • Document closures for information retrieval
  • Self-triggerability [Mosteller&Wallace 84] → Poisson distribution
  • Two-Poisson [Bookstein&Swanson 74]
  • Negative Binomial, K-mixture [Church&Gale 95]
  • Triggerability across hyperlinks?

[Figure: pages pi and pj joined by a hyperlink, with trigger probabilities p and p' shown for the word groups "by, with, from" and "photo, dream, path"]


(C) 2005, The University of Michigan 

187  

Evolving Word-based Web 

  • Observations:
    • Links are made based on topics
    • Topics are expressed with words
    • Words are distributed very unevenly (Zipf, Benford, self-triggerability laws)
  • Model
    • Pick n
    • Generate n lengths according to a power-law distribution
    • Generate n documents using a trigram model
 
  • Model (cont'd)
    • Pick words in decreasing order of r.
    • Generate hyperlinks with random directionality
  • Outcome
    • Generates power-law degree distributions
    • Generates topical communities
    • Natural variation of PageRank: LexRank

(C) 2005, The University of Michigan 

188  

Social network analysis for IR


(C) 2005, The University of Michigan 

189  

Social networks 

  • Induced by a relation
  • Symmetric or not
  • Examples:
    • Friendship networks
    • Board membership
    • Citations
    • Power grid of the US
    • WWW

(C) 2005, The University of Michigan 

190  

[Figure: a social network map (Krebs 2004)]


(C) 2005, The University of Michigan 

191  

Prestige and centrality 

  • Degree centrality: how many neighbors each node has.
  • Closeness centrality: how close a node is to all of the other nodes
  • Betweenness centrality: based on the role that a node plays by virtue of being on the path between two other nodes
  • Eigenvector centrality: the paths in the random walk are weighted by the centrality of the nodes that the path connects.
  • Prestige = same as centrality but for directed graphs.

(C) 2005, The University of Michigan 

192  

Graph-based representations 

[Figure: a graph G(V,E) and its square connectivity (incidence) matrix]


(C) 2005, The University of Michigan 

193  

Markov chains 

  • A homogeneous Markov chain is defined by an initial distribution x and a Markov kernel E.
  • Path = sequence (x0, x1, …, xn). 
    xi = xi-1*E
  • The probability of a path can be computed as a product of probabilities for each step i.
  • Random walk = find xj given x0, E, and j.

 


(C) 2005, The University of Michigan 

194  

Stationary solutions 

  • The fundamental Ergodic Theorem for Markov chains [Grimmett and Stirzaker 1989] says that the Markov chain with kernel E has a stationary distribution p under three conditions:
    • E is stochastic
    • E is irreducible
    • E is aperiodic
  • To make these conditions true:
    • All rows of E add up to 1 (and no value is negative)
    • Make sure that E is strongly connected
    • Make sure that E is not bipartite
  • Example: PageRank [Brin and Page 1998]: use "teleportation"
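
A sketch of that fix (numpy assumed; the damping value d = 0.85 is a conventional choice, not from the slide): mix the link matrix with a uniform jump so the kernel becomes stochastic, irreducible, and aperiodic.

import numpy as np

adj = np.array([[0, 1, 1],
                [0, 0, 1],
                [1, 0, 0]], dtype=float)   # toy link graph, adj[i, j] = 1 if i links to j
n = len(adj)
out = adj.sum(axis=1, keepdims=True)
P = np.where(out > 0, adj / np.maximum(out, 1), 1.0 / n)  # dangling pages jump uniformly
d = 0.85
E = d * P + (1 - d) / n      # every row sums to 1 and every state reaches every other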

(C) 2005, The University of Michigan 

195  

Example 

This graph E has a second graph E' (not drawn) superimposed on it: 
E' is the uniform transition graph. 

[Figure: the walk on E at t=0 and t=1]


(C) 2005, The University of Michigan 

196  

Eigenvectors 

  • An eigenvector is an implicit "direction" for a matrix.

    Mv = λv, where v is non-zero, though λ can be any complex number in principle.

  • The largest eigenvalue of a stochastic matrix E is real: λ1 = 1.
  • For λ1, the left (principal) eigenvector is p; the right eigenvector is 1 (the all-ones vector).
  • In other words, E^T p = p.

 


(C) 2005, The University of Michigan 

197  

Computing the stationary distribution 

function PowerStatDist (E):
begin
   p(0) = u;   (or p(0) = [1, 0, …, 0])
   i = 1;
   repeat
      p(i) = E^T p(i-1);
      L = ||p(i) - p(i-1)||_1;
      i = i + 1;
   until L < ε;
   return p(i)
end 

Solution for the 
stationary distribution
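
A direct Python transcription (a sketch: numpy is assumed, and the toy kernel E and tolerance value are stand-ins):

import numpy as np

def power_stat_dist(E, eps=1e-10):
    n = E.shape[0]
    p = np.full(n, 1.0 / n)                  # p(0) = u, the uniform distribution
    while True:
        p_next = E.T @ p                     # p(i) = E^T p(i-1)
        if np.abs(p_next - p).sum() < eps:   # L1 norm of the change
            return p_next
        p = p_next

E = np.array([[0.5, 0.5, 0.0],               # a small stochastic, irreducible, aperiodic kernel
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
print(power_stat_dist(E))                    # the stationary p, satisfying E^T p = p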


(C) 2005, The University of Michigan 

198  

Example 

[Figure: the distribution of the walk at t=0, t=1, and t=10]


(C) 2005, The University of Michigan 

199  

How Google works 

  • Crawling
  • Anchor text
  • Fast query processing
  • PageRank

(C) 2005, The University of Michigan 

200  

More about PageRank 

  • Named after Larry Page, founder of Google (and UM alum)
  • Reading: "The anatomy of a large-scale hypertextual web search engine" by Brin and Page.
  • Independent of the query (although more recent work by Haveliwala (WWW 2002) has identified topic-sensitive PageRank).

 


(C) 2005, The University of Michigan 

201  

HITS 

  • Query-dependent model (Kleinberg 97)
  • Hubs and authorities (e.g., cars, Honda)
  • Algorithm 
     
    • obtain a root set using the input query
    • expand the root set by radius one
    • run iterations on the hub and authority scores together (as sketched below)
    • report top-ranking authorities and hubs
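
A sketch of the iteration on a toy base set (numpy assumed; the fixed 50 sweeps and the L2 normalization are implementation choices):

import numpy as np

A = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)    # A[i, j] = 1 if page i links to page j
a = np.ones(3)                            # authority scores
h = np.ones(3)                            # hub scores
for _ in range(50):
    a = A.T @ h; a /= np.linalg.norm(a)   # good authorities are pointed to by good hubs
    h = A @ a;  h /= np.linalg.norm(h)    # good hubs point to good authorities
print("authorities:", a, "hubs:", h)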

(C) 2005, The University of Michigan 

202  

The link-content hypothesis 

  • Topical locality: a page is similar in content to the page that points to it.
  • Davison (TF*IDF, 100K pages)
    • 0.31 same domain
    • 0.23 linked pages
    • 0.19 sibling
    • 0.02 random
  • Menczer (373K pages, non-linear least squares fit; fit parameters 1.8 and 0.6)
  • Chakrabarti (focused crawling) - probability of losing the topic

Van Rijsbergen 1979, Chakrabarti et al. WWW 1999, Davison SIGIR 2000, Menczer 2001 


(C) 2005, The University of Michigan 

203  

Measuring the Web


(C) 2005, The University of Michigan 

204  

Bharat and Broder 1998 

  • Based on crawls of HotBot, Altavista, Excite, and InfoSeek
  • 10,000 queries in mid and late 1997
  • Estimate is 200M pages
  • Only 1.4% are indexed by all of them

 


(C) 2005, The University of Michigan 

205  

Example (from Bharat&Broder) 

A similar approach by Lawrence and Giles yields 320M pages

(Lawrence and Giles 1998).


(C) 2005, The University of Michigan 

206  

Crawling the web


(C) 2005, The University of Michigan 

207  

Basic principles 

  • The HTTP/HTML protocols
  • Following hyperlinks
  • Some problems:
    • Link extraction
    • Link normalization
    • Robot exclusion
    • Loops
    • Spider traps
    • Server overload

(C) 2005, The University of Michigan 

208  

Example 

  • U-M's root robots.txt file:
  • http://www.umich.edu/robots.txt
    • User-agent: *
    • Disallow: /~websvcs/projects/
    • Disallow: /%7Ewebsvcs/projects/
    • Disallow: /~homepage/
    • Disallow: /%7Ehomepage/
    • Disallow: /~smartgl/
    • Disallow: /%7Esmartgl/
    • Disallow: /~gateway/
    • Disallow: /%7Egateway/
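
These rules can be applied with Python's standard urllib.robotparser (a sketch; it fetches the file over the network):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.umich.edu/robots.txt")
rp.read()
print(rp.can_fetch("*", "http://www.umich.edu/~homepage/"))  # False: disallowed above
print(rp.can_fetch("*", "http://www.umich.edu/"))            # True: not disallowed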

(C) 2005, The University of Michigan 

209  

Example crawler 

  • E.g., poacher
    • http://search.cpan.org/~neilb/Robot-0.011/examples/poacher
    • /data0/projects/perltree-index

 


(C) 2005, The University of Michigan 

210  
 

use WWW::Robot;    # the robot module configured in Initialise() below

&ParseCommandLine();

&Initialise();

$robot->run($siteRoot); 

#=======================================================================

# Initialise() - initialise global variables, contents, tables, etc

# This function sets up various global variables such as the version number

# for WebAssay, the program name identifier, usage statement, etc.

#=======================================================================

sub Initialise

{

    $robot = new WWW::Robot(

                            'NAME'      => $BOTNAME,

                            'VERSION'   => $VERSION,

                            'EMAIL'     => $EMAIL,

                            'TRAVERSAL' => $TRAVERSAL,

                            'VERBOSE'   => $VERBOSE,

                           );

    $robot->addHook('follow-url-test',     \&follow_url_test);

    $robot->addHook('invoke-on-contents',  \&process_contents);

    $robot->addHook('invoke-on-get-error', \&process_get_error);

}

#=======================================================================

# follow_url_test() - tell the robot module whether it should follow a link

#=======================================================================

sub follow_url_test { return 1; }    # minimal stub: follow every link

#=======================================================================

# process_get_error() - hook function invoked whenever a GET fails

#=======================================================================

sub process_get_error {}

#=======================================================================

# process_contents() - process the contents of a URL we've retrieved

#=======================================================================

sub process_contents

{

    run_command($COMMAND, $filename) if defined $COMMAND;

}


(C) 2005, The University of Michigan 

211  

Focused crawling 

  • Topical locality
    • Pages that are linked are similar in content (and vice-versa: Davison 00, Menczer 02, 04, Radev et al. 04)
  • The radius-1 hypothesis
    • given that page i is relevant to a query and that page i points to page j, then page j is also likely to be relevant (at least, more so than a random web page)
  • Focused crawling
    • Keeping a priority queue of the most relevant pages (see the skeleton below)
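
A skeleton of such a crawler (a sketch: fetch_links and relevance are hypothetical helpers you would supply; heapq is Python's stdlib priority queue):

import heapq

def focused_crawl(seeds, relevance, fetch_links, budget=1000):
    frontier = [(-relevance(u), u) for u in seeds]   # negate scores: heapq is a min-heap
    heapq.heapify(frontier)
    seen = {u for _, u in frontier}
    while frontier and budget > 0:
        _, url = heapq.heappop(frontier)             # most promising page first
        budget -= 1
        for link in fetch_links(url):                # radius-1 expansion
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-relevance(link), link))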

(C) 2005, The University of Michigan 

212  

Question answering


(C) 2005, The University of Michigan 

213  

People ask questions 

  • Excite corpus of 2,477,283 queries (one day's worth)
  • 8.4% of them are questions
    • 43.9% factual (what is the country code for Belgium)
    • 56.1% procedural (how do I set up TCP/IP) or other
  • In other words, 100 K questions per day

 


(C) 2005, The University of Michigan 

214  

People ask questions 

In what year did baseball become an offical sport? 

Who is the largest man in the world? 

Where can i get information on Raphael? 

where can i find information on puritan religion? 

Where can I find how much my house is worth? 

how do i get out of debt? 
Where can I found out how to pass a drug test? 
When is the Super Bowl? 
who is California's District State Senator? 
where can I buy extra nibs for a foutain pen? 
how do i set up tcp/ip ? 
what time is it in west samoa? 
Where can I buy a little kitty cat? 
what are the symptoms of attention deficit disorder? 
Where can I get some information on Michael Jordan? 
How does the character Seyavash in Ferdowsi's Shahnameh exhibit characteristics of a hero? 
When did the Neanderthal man live? 
Which Frenchman declined the Nobel Prize for Literature for ideological reasons? 
What is the largest city in Northern Afghanistan? 



(C) 2005, The University of Michigan 

215


(C) 2005, The University of Michigan 

216  

Question answering 

What is the largest city in Northern Afghanistan?


(C) 2005, The University of Michigan 

217  

Possible approaches 

  • Map?
  • Knowledge base 
    Find x: city(x) ∧ located(x, "Northern Afghanistan") ∧ 
    ¬∃y: city(y) ∧ located(y, "Northern Afghanistan") ∧ 
    greaterthan(population(y), population(x))
  • Database?
  • World factbook?
  • Search engine?

(C) 2005, The University of Michigan 

218  

The TREC Q&A evaluation 

  • Run by NIST [Voorhees and Tice 2000]
  • 2GB of input
  • 200 questions
  • Essentially fact extraction
    • Who was Lincoln's secretary of state?
    • What does the Peugeot company manufacture?
  • Questions are based on text
  • Answers are assumed to be present
  • No inference needed

(C) 2005, The University of Michigan 

219  

Q: When did Nelson Mandela become president of South Africa?

A: 10 May 1994

Q: How tall is the Matterhorn?

A: The institute revised the Matterhorn 's height to 14,776 feet 9 inches

Q: How tall is the replica of the Matterhorn at Disneyland?

A: In fact he has climbed the 147-foot Matterhorn at Disneyland every week end for the last 3 1/2 years

Q: If Iraq attacks a neighboring country, what should the US do?

A: ?? 

Question answering


(C) 2005, The University of Michigan 

220


(C) 2005, The University of Michigan 

221  

NSIR 

  • Current project at U-M
    • http://tangra.si.umich.edu/clair/NSIR/html/nsir.cgi
  • Reading:
    • [Radev et al., 2005a]
      • Dragomir R. Radev, Weiguo Fan, Hong Qi, Harris Wu, and Amardeep Grewal. Probabilistic question answering on the web. Journal of the American Society for Information Science and Technology 56(3), March 2005
      • http://tangra.si.umich.edu/~radev/bib2html/radev-bib.html

 


(C) 2005, The University of Michigan 

222


(C) 2005, The University of Michigan 

223  

... Afghanistan, Kabul, 2,450 ... Administrative capital and largest city (1997 est ... Undetermined. 
Panama, Panama City, 450,668. ... of the Gauteng, Northern Province, Mpumalanga ...  
www.infoplease.com/cgi-bin/id/A0855603

... died in Kano, northern Nigeria's largest city, during two days of anti-American riots 
led by Muslims protesting the US-led bombing of Afghanistan, according to ...  
www.washingtonpost.com/wp-dyn/print/world/

... air strikes on the city. ... the Taliban militia in northern Afghanistan in a significant 
blow ... defection would be the largest since the United States ...  
www.afgha.com/index.php - 60k

... Kabul is the capital and largest city of Afghanistan. . ... met. area pop. 2,029,889), 
is the largest city in Uttar Pradesh, a state in northern India. . ...  
school.discovery.com/homeworkhelp/worldbook/atozgeography/ k/k1menu.html

... Gudermes, Chechnya's second largest town. The attack ... location in Afghanistan's outlying 
regions ... in the city of Mazar-i-Sharif, a Northern Alliance-affiliated ...  
english.pravda.ru/hotspots/2001/09/17/

... Get Worse By RICK BRAGG Pakistan's largest city is getting a jump on the ... Region: Education 
Offers Women in Northern Afghanistan a Ray of Hope. ...  
www.nytimes.com/pages/world/asia/

... within three miles of the airport at Mazar-e-Sharif, the largest city in northern 
Afghanistan, held since 1998 by the Taliban. There was no immediate comment ...  
uk.fc.yahoo.com/photos/a/afghanistan.html 

Google


(C) 2005, The University of Michigan 

224  

Pipeline: Document retrieval, Query modulation, Sentence retrieval, Answer extraction, Answer ranking 

Worked example:

Question: What is the largest city in Northern Afghanistan? 

Modulated query: (largest OR biggest) city "Northern Afghanistan" 

Retrieved documents: 
www.infoplease.com/cgi-bin/id/A0855603 
www.washingtonpost.com/wp-dyn/print/world/ 

Retrieved sentences: 
Gudermes, Chechnya's second largest town … location in Afghanistan's outlying regions 
within three miles of the airport at Mazar-e-Sharif, the largest city in northern Afghanistan 

Extracted answers: Gudermes, Mazar-e-Sharif 

Ranked answers: Mazar-e-Sharif, Gudermes


(C) 2005, The University of Michigan 

225


(C) 2005, The University of Michigan 

226


(C) 2005, The University of Michigan 

227  

Research problems 

  • Source identification:
    • semi-structured vs. text sources
  • Query modulation:
    • best paraphrase of a NL question given the syntax of a search engine?
    • Compare two approaches: noisy channel model and rule-based
  • Sentence ranking
    • n-gram matching, Okapi, co-reference?
  • Answer extraction
    • question type identification
    • phrase chunking
    • no general-purpose named entity tagger available
  • Answer ranking
    • what are the best predictors of a phrase being the answer to a given question: question type, proximity to query words, frequency
  • Evaluation (MRDR)
    • accuracy, reliability, timeliness

(C) 2005, The University of Michigan 

228  

Document retrieval 

  • Use existing search engines: Google, AlltheWeb, NorthernLight
  • No modifications to question
  • Cf. work on QASM (ACM CIKM 2001)

 


(C) 2005, The University of Michigan 

229  

Sentence ranking 

  • Weighted N-gram matching:
  • Weights are determined empirically, e.g., 0.6, 0.3, and 0.1 
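
A plausible reading of the score, since the formula itself is not shown here: a weighted sum over n = 1..3 of the fraction of the question's n-grams found in the sentence, with the weights above (Python sketch):

def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def sentence_score(question, sentence, weights=(0.6, 0.3, 0.1)):
    q, s = question.lower().split(), sentence.lower().split()
    score = 0.0
    for n, w in enumerate(weights, start=1):       # n = 1, 2, 3
        qn = ngrams(q, n)
        if qn:
            score += w * len(qn & ngrams(s, n)) / len(qn)
    return score

print(sentence_score("largest city in Northern Afghanistan",
                     "Mazar-e-Sharif is the largest city in northern Afghanistan"))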
     
     

(C) 2005, The University of Michigan 

230  

Probabilistic phrase reranking 

  • Answer extraction: probabilistic phrase reranking. What is 
      p(ph is answer to q | q, ph)?
  • Evaluation: TRDR (total reciprocal document rank)
    • Example: answers at ranks (2, 8, 10) give 1/2 + 1/8 + 1/10 = 0.725
    • Document, sentence, or phrase level
  • Criterion: presence of answer(s)
  • High correlation with manual assessment

(C) 2005, The University of Michigan 

231  

Phrase types 

PERSON PLACE DATE NUMBER DEFINITION 
ORGANIZATION DESCRIPTION ABBREVIATION 
KNOWNFOR RATE LENGTH MONEY REASON 
DURATION PURPOSE NOMINAL OTHER


(C) 2005, The University of Michigan 

232  

Question Type Identification 

  • Wh-type not sufficient:
      • Who: PERSON 77, DESCRIPTION 19, ORG 6
      • What: NOMINAL 78, PLACE 27, DEF 26, PERSON 18, ORG 16, NUMBER 14, etc.
      • How: NUMBER 33, LENGTH 6, RATE 2, etc.
  • Ripper:
    • 13 features: Question-Words, Wh-Word, Word-Beside-Wh-Word, Is-Noun-Length, Is-Noun-Person, etc.
    • Top 2 question types
  • Heuristic algorithm:
    • About 100 regular expressions based on words and parts of speech

(C) 2005, The University of Michigan 

233  

Ripper performance 


    Training     Test     Train Error Rate   Test Error Rate 
    TREC9        TREC8    22.4%              24% 
    TREC8,9      TREC10   17.03%             30% 
    TREC8,9,10   —        20.69%             — 


(C) 2005, The University of Michigan 

234  

Regex performance 

    Training     Test on TREC9   Test on TREC8   Test on TREC10 
    TREC9        7.8%            15%             18% 
    TREC8,9      7.4%            6%              18.2% 
    TREC8,9,10   4.6%            5.5%            7.6% 


(C) 2005, The University of Michigan 

235  

Phrase ranking 

  • Phrases are identified by a shallow parser (ltchunk from Edinburgh)
  • Four features:
    • Proximity
    • POS (part-of-speech) signature (qtype)
    • Query overlap
    • Frequency

(C) 2005, The University of Michigan 

236  

Proximity 

  • Phrasal answers tend to appear near words from the query
  • Average distance = 7 words, range = 1 to 50 words
  • Use linear rescaling of scores

(C) 2005, The University of Michigan 

237  

Part of speech signature 

    Signature   Phrase types 
    VBD         NO (100%) 
    DT NN       NO (86.7%) PERSON (3.8%) NUMBER (3.8%) ORG (2.5%) 
    NNP         PERSON (37.4%) PLACE (29.6%) DATE (21.7%) NO (7.6%) 
    DT JJ NNP   NO (75.6%) NUMBER (11.1%) PLACE (4.4%) ORG (4.4%) 
    NNP NNP     PLACE (37.3%) PERSON (35.6%) NO (16.9%) ORG (10.2%) 
    DT NNP      ORG (55.6%) NO (33.3%) PLACE (5.6%) DATE (5.6%) 

Example: "Hugo/NNP Young/NNP" 
  P (PERSON | "NNP NNP") = .458 

Example: "the/DT Space/NNP Flight/NNP Operations/NNP contractor/NN" 
  P (PERSON | "DT NNP NNP NNP NN") = 0 

Penn Treebank tagset (DT = determiner, JJ = adjective)


(C) 2005, The University of Michigan 

238  

Query overlap and frequency 

  • Query overlap:
    • What is the capital of Zimbabwe?
    • Possible choices: 
       Mugabe, Zimbabwe, Luanda, Harare
  • Frequency:
    • Not necessarily accurate but rather useful

(C) 2005, The University of Michigan 

239  

Reranking 

Rank Probability and phrase 

1 0.599862 the_DT Space_NNP Flight_NNP Operations_NNP contractor_NN ._. 
2 0.598564 International_NNP Space_NNP Station_NNP Alpha_NNP  
3 0.598398 International_NNP Space_NNP Station_NNP  
4 0.598125 to_TO become_VB  
5 0.594763 a_DT joint_JJ venture_NN United_NNP Space_NNP Alliance_NNP  
6 0.593933 NASA_NNP Johnson_NNP Space_NNP Center_NNP  
7 0.587140 will_MD form_VB  
8 0.585410 The_DT purpose_NN  
9 0.576797 prime_JJ contracts_NNS  
10 0.568013 First_NNP American_NNP  
11 0.567361 this_DT bulletin_NN board_NN  
12 0.565757 Space_NNP :_:  
13 0.562627 'Spirit_NN '_'' of_IN  
...   
41 0.516368 Alan_NNP Shepard_NNP  

Proximity = .5164


(C) 2005, The University of Michigan 

240  

Reranking  

Rank Probability and phrase 

1 0.465012 Space_NNP Administration_NNP ._. 
2 0.446466 SPACE_NNP CALENDAR_NNP _. 
3 0.413976 First_NNP American_NNP  
4 0.399043 International_NNP Space_NNP Station_NNP Alpha_NNP  
5 0.396250 her_PRP$ third_JJ space_NN mission_NN  
6 0.395956 NASA_NNP Johnson_NNP Space_NNP Center_NNP  
7 0.394122 the_DT American_NNP Commercial_NNP Launch_NNP Industry_NNP  
8 0.390163 the_DT Red_NNP Planet_NNP ._.  
9 0.379797 First_NNP American_NNP  
10 0.376336 Alan_NNP Shepard_NNP  
11 0.375669 February_NNP  
12 0.374813 Space_NNP  
13 0.373999 International_NNP Space_NNP Station_NNP  
 

Qtype = .7288 
Proximity * qtype = .3763


(C) 2005, The University of Michigan 

241  

Reranking 

Rank Probability and phrase 

1 0.478857 Neptune_NNP Beach_NNP ._.  
2 0.449232 February_NNP  
3 0.447075 Go_NNP  
4 0.437895 Space_NNP  
5 0.431835 Go_NNP  
6 0.424678 Alan_NNP Shepard_NNP  
7 0.423855 First_NNP American_NNP  
8 0.421133 Space_NNP May_NNP  
9 0.411065 First_NNP American_NNP woman_NN  
10 0.401994 Life_NNP Sciences_NNP  
11 0.385763 Space_NNP Shuttle_NNP Discovery_NNP STS-60_NN  
12 0.381865 the_DT Moon_NNP International_NNP Space_NNP Station_NNP  
13 0.370030 Space_NNP Research_NNP A_NNP Session_NNP  
 

All four features


(C) 2005, The University of Michigan 

242


(C) 2005, The University of Michigan 

243


(C) 2005, The University of Michigan 

244


(C) 2005, The University of Michigan 

245  

Document level performance 

    Engine      Avg      #>0 
    Google      1.3361   164 
    NLight      1.0495   163 
    AlltheWeb   0.8355   149 

TREC 8 corpus (200 questions)


(C) 2005, The University of Michigan 

246  

Sentence level performance 

    Engine   Avg    #>0 
    GO O     0.49   135 
    GO L     0.54   137 
    GO U     2.55   159 
    NL O     0.44   119 
    NL L     0.48   121 
    NL U     2.53   159 
    AW O     0.26   99 
    AW L     0.31   99 
    AW U     2.13   148 

(GO = Google, NL = NorthernLight, AW = AlltheWeb)


(C) 2005, The University of Michigan 

247  

Phrase level performance 

    Engine          Combined   Global proximity   Appearance order   Upperbound 
    Google S+P      0.199      0.0646             0.0646             1.941 
    Google D+P      0.157      0.058              0.068              2.698 
    NorthernLight   0.117      0.054              0.048              2.652 
    AlltheWeb       0.105      0.038              0.026              2.176 

Experiments performed Oct.–Nov. 2001


(C) 2005, The University of Michigan 

248


(C) 2005, The University of Michigan 

249


(C) 2005, The University of Michigan 

250  

Text classification


(C) 2005, The University of Michigan 

251  

Introduction 

  • Text classification: assigning documents to predefined categories
  • Hierarchical vs. flat
  • Many techniques: generative (maxent, knn, Naïve Bayes) vs. discriminative (SVM, regression)
  • Generative: model joint prob. p(x,y) and use Bayesian prediction to compute p(y|x)
  • Discriminative: model p(y|x) directly.

(C) 2005, The University of Michigan 

252  

Generative models: knn 

  • K-nearest neighbors
  • Very easy to program 
     
  • Issues: choosing k, b?
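
A minimal sketch (numpy assumed; vectors are bag-of-words counts, and the unweighted majority vote is one simple choice):

import numpy as np
from collections import Counter

def cosine(a, b):
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return 0.0 if na == 0 or nb == 0 else (a @ b) / (na * nb)

def knn_classify(x, X_train, y_train, k=3):
    sims = [cosine(x, xi) for xi in X_train]
    top = np.argsort(sims)[-k:]                                   # the k nearest training documents
    return Counter(y_train[i] for i in top).most_common(1)[0][0]  # majority vote among labels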

(C) 2005, The University of Michigan 

253  

Feature selection: The χ2 test 

  • For a term t, form the 2x2 contingency table of the class C against the occurrence indicator It:

            It = 0   It = 1 
    C = 0   k00      k01 
    C = 1   k10      k11 

  • Testing for independence: 
    P(C=0, It=0) should be equal to P(C=0) P(It=0) 
    • P(C=0) = (k00+k01)/n
    • P(C=1) = 1-P(C=0) = (k10+k11)/n
    • P(It=0) = (k00+k10)/n
    • P(It=1) = 1-P(It=0) = (k01+k11)/n


(C) 2005, The University of Michigan 

254  

Feature selection: The χ2 test 

  • High values of χ2 indicate lower belief in independence.
  • In practice, compute χ2 for all words and pick the top k among them.
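
For the 2x2 table above, the standard statistic (presumably what the missing formula showed) is:

    χ2 = n (k11 k00 - k10 k01)^2 / [ (k00+k01)(k10+k11)(k00+k10)(k01+k11) ]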

(C) 2005, The University of Michigan 

255  

Feature selection: mutual information 

  • No document length scaling is needed
  • Documents are assumed to be generated according to the multinomial model
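
For reference, the usual definition for a term t with indicator It and class C (presumably what the missing formula showed) is:

    MI(It; C) = Σ over It in {0,1} and C in {0,1} of P(It, C) log [ P(It, C) / (P(It) P(C)) ]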

(C) 2005, The University of Michigan 

256  

Naïve Bayesian classifiers 

  • Naïve Bayesian classifier
  • Assuming statistical independence 
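
Under that assumption the classifier takes the familiar form (w1 … wn are the document's words; presumably what the missing formula showed):

    c* = argmax over c of P(c) Πi P(wi | c)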
     

(C) 2005, The University of Michigan 

257  

Spam recognition 

Return-Path: <ig_esq@rediffmail.com>

X-Sieve: CMU Sieve 2.2

From: "Ibrahim Galadima" <ig_esq@rediffmail.com>

Reply-To: galadima_esq@netpiper.com

To: webmaster@aclweb.org

Date: Tue, 14 Jan 2003 21:06:26 -0800

Subject: Gooday 

DEAR SIR 

FUNDS FOR INVESTMENTS 

THIS LETTER MAY COME TO YOU AS A SURPRISE SINCE I HAD

NO PREVIOUS CORRESPONDENCE WITH YOU 

I AM THE CHAIRMAN TENDER BOARD OF INDEPENDENT

NATIONAL ELECTORAL COMMISSION INEC I GOT YOUR

CONTACT IN THE COURSE OF MY SEARCH FOR A RELIABLE

PERSON WITH WHOM TO HANDLE A VERY CONFIDENTIAL

TRANSACTION INVOLVING THE ! TRANSFER OF FUND VALUED AT

TWENTY ONE MILLION SIX HUNDRED THOUSAND UNITED STATES

DOLLARS US$20M TO A SAFE FOREIGN ACCOUNT 

THE ABOVE FUND IN QUESTION IS NOT CONNECTED WITH

ARMS, DRUGS OR MONEY LAUNDERING IT IS A PRODUCT OF

OVER INVOICED CONTRACT AWARDED IN 1999 BY INEC TO A


(C) 2005, The University of Michigan 

258  

Well-known datasets 

  • 20 newsgroups
    • http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/
  • Reuters-21578
    • Cats: grain, acquisitions, corn, crude, wheat, trade, …
  • WebKB
    • http://www-2.cs.cmu.edu/~webkb/
    • course, student, faculty, staff, project, dept, other
    • NB performance (2000), per-class precision and recall in %:
    • P = 26, 43, 18, 6, 13, 2, 94
    • R = 83, 75, 77, 9, 73, 100, 35

(C) 2005, The University of Michigan 

259  

Support vector machines 

  • Introduced by Vapnik in the early 90s.

 


(C) 2005, The University of Michigan 

260  

Semi-supervised learning 

  • EM
  • Co-training
  • Graph-based

(C) 2005, The University of Michigan 

261  

Additional topics 

  • Soft margins
  • VC dimension
  • Kernel methods

(C) 2005, The University of Michigan 

262  

  • SVMs are widely considered to be the best method for text classification (look at papers by Sebastiani, Cristianini, Joachims), e.g., 86% accuracy on Reuters.
  • NB also good in many circumstances

 


(C) 2005, The University of Michigan 

263  

Readings 

  • Books:
  • 1. Ricardo Baeza-Yates and Berthier Ribeiro-Neto; Modern Information Retrieval, Addison-Wesley/ACM Press, 1999.
  • 2. Pierre Baldi, Paolo Frasconi, Padhraic Smyth; Modeling the Internet and the Web: Probabilistic Methods and Algorithms; Wiley, 2003, ISBN: 0-470-84906-1
  • Papers: 
  • Barabasi and Albert "Emergence of scaling in random networks" Science   (286) 509-512, 1999
  • Bharat and Broder "A technique for measuring the relative size and overlap of public Web search engines" WWW 1998
  • Brin and Page "The Anatomy of a Large-Scale Hypertextual Web Search Engine" WWW 1998
  • Bush "As we may think" The Atlantic Monthly 1945
  • Chakrabarti, van den Berg, and Dom "Focused Crawling" WWW 1999
  • Cho, Garcia-Molina, and Page "Efficient Crawling Through URL Ordering" WWW 1998
  • Davison "Topical locality on the Web" SIGIR 2000
  • Dean and Henzinger "Finding related pages in the World Wide Web" WWW 1999
  • Deerwester, Dumais, Landauer, Furnas, Harshman "Indexing by latent semantic analysis" JASIS 41(6) 1990

(C) 2005, The University of Michigan 

264  

Readings 

  • Erkan and Radev "LexRank: Graph-based Lexical Centrality as Salience in Text Summarization" JAIR 22, 2004
  • Jeong and Barabasi "Diameter of the world wide web" Nature (401) 130-131, 1999
  • Hawking, Voorhees, Craswell, and Bailey "Overview of the TREC-8 Web Track" TREC 2000
  • Haveliwala "Topic-sensitive pagerank" WWW 2002
  • Kumar, Raghavan, Rajagopalan, Sivakumar, Tomkins, Upfal "The Web as a graph" PODS 2000
  • Lawrence and Giles "Accessibility of information on the Web" Nature (400) 107-109, 1999
  • Lawrence and Giles "Searching the World-Wide Web" Science (280) 98-100, 1998
  • Menczer "Links tell us about lexical and semantic Web content" arXiv 2001
  • Page, Brin, Motwani, and Winograd "The PageRank citation ranking: Bringing order to the Web" Stanford TR, 1998
  • Radev, Fan, Qi, Wu and Grewal "Probabilistic Question Answering on the Web" JASIST 2005
  • Singhal "Modern Information Retrieval: an Overview" IEEE 2001

(C) 2005, The University of Michigan 

265  

More readings 

  • Gerard Salton, Automatic Text Processing, Addison-Wesley (1989)
  • Gerald Kowalski, Information Retrieval Systems: Theory and Implementation, Kluwer (1997)
  • Gerard Salton and M. McGill, Introduction to Modern Information Retrieval, McGraw-Hill (1983)
  • C. J. van Rijsbergen, Information Retrieval, Butterworths (1979)
  • Ian H. Witten, Alistair Moffat, and Timothy C. Bell, Managing Gigabytes, Van Nostrand Reinhold (1994)
  • ACM SIGIR Proceedings, SIGIR Forum
  • ACM conferences in Digital Libraries

(C) 2005, The University of Michigan 

266  

Thank you! 

Благодаря! (Bulgarian: "Thank you!")

