Databases as a tool for the content specialist
Applying databases to indexing in the information
specialist's world
Johan van Wyk (M.Bibl.; BA Hons History, THED)
When you look into the literature on indexing in databases, you are confronted
with terms such as BTree, TFSG parsing, clustered indexes, filtered indices etc.
Those are all techniques and tools used to achieve retrieval of information in
database systems. But this is not at all our concern here today. Let's just say that
we accept this as a given (like we do with most technology we don't understand!).
As information professionals in the information industry, what is our concern with
database systems?
Database systems can be arranged on a continuum between DBMSs and full text
systems. On the one extreme are the DBMSs: very structured, with good sorting and
report facilities, but awful with textual information. They were never built to do that.
Examples of DBMSs: Oracle, SQL
On the other extreme are the full text systems. They were specifically built for
textual documents. Examples of those systems are BRS Search and Brainware, and
you could include Internet search engines such as Alta Vista.
The distinguishing factors are:
DBMSs:
  Structure
  Report facility to manipulate the output by sorting the structure elements
  Field lengths often limited
  Searchable fields are limited
  Complicated search
Full text systems:
  The full document is the unit of information
  Every word is searchable
  No formal structure
Then we find a third category in the industry: text retrieval systems. Text retrieval
systems were developed to get the best of both worlds. They were developed to
have the structured elements of DBMSs and the ability to handle full text. Thus
we find that text retrieval systems have the following characteristics or features:
  Structure facilitated by means of fields
  Report facility
  All fields of variable length
  Indexing of all fields, full text
Interestingly, though, text retrieval systems were developed before full text
systems. Some of the reasons were computing power, retrieval techniques and (to
us "obviously") the needs of information specialists. One is always amazed at the
searching functionality of systems like IBM's "STAIRS" (1972) and online systems
used by BRS, Medline and Dialog in the same period.
To be able to retrieve information in these systems, programming techniques such
as BTree etc. were developed. These techniques are evidently very successful today:
volume of data is not a problem anymore. Today, as you all know, we search the
internet and are amazed at the amount of information retrieved. And then we are
confused and irritated by the overdose. And that is exactly what the core function
of this profession is: to make sense of the masses of information. The user is not
interested in being the world champion in the number of items retrieved. The
question is whether his need is being addressed: the retrieval of relevant
information is what matters.
The IT sector has now reached a point where the indexing of masses of data is
"solved". What more can you want when you have full text indexing? Why is the
user now confused or irritated? The reason is that the information need was not
met: finding relevant information.
Relevance is a very difficult concept to tie down. When testing information
retrieval performance, this is the crucial variable to define, because it could skew
your research totally. The measuring of retrieval performance is done using two
parallel measures, both relying on the judgement of relevance:
  Recall (the ability to retrieve relevant information)
  Precision (the ability to withhold non-relevant information from the
  information retrieved)
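
The two measures can be sketched in a few lines of Python. This is only an illustration: the function name, the record identifiers and the relevance judgements are hypothetical, and the usual set-based definitions are assumed (recall = relevant items retrieved / all relevant items; precision = relevant items retrieved / all items retrieved).

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for one search, given relevance judgements."""
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant items that were actually retrieved
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical search: 4 items are judged relevant, 10 are retrieved,
# and 3 of the retrieved items are relevant.
relevant_ids = {"d1", "d2", "d3", "d4"}
retrieved_ids = ["d1", "d2", "d3", "d5", "d6", "d7", "d8", "d9", "d10", "d11"]
recall, precision = recall_precision(retrieved_ids, relevant_ids)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

In this made-up search most of the relevant material is found (high recall), but the user still has to wade through seven non-relevant items (low precision).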
The technology used for full text indexing greatly enhances recall
performance, to the detriment of precision performance. Information
retrieval research then started focussing on techniques for
manipulating the output. Here we find techniques such as "relevance ranking",
widely used in, for example, internet search engines. These techniques have been
quite successful since the computing power became available.
In these text retrieval systems we work with words, word stems or phrases.
Almost all of the relevance ranking techniques are based upon the work of Karen
Spärck Jones in Cambridge (UK). Spärck Jones brought us what became known as
the term frequency-inverse document frequency theory. The tf-idf weight is a
statistical measure used to evaluate how important a word is within a document
and then within a collection of information items. The importance increases
proportionally to the number of times a word appears in the document but is
offset by the frequency of the word in a collection. An unpublished study at
Syracuse University (USA) in 1984 tested 29 variations of the tf-idf weighting
scheme to determine the difference between these variations and came to the
conclusion that there was no significant difference.
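
As a rough illustration of the weighting idea, the sketch below uses one common textbook variant: the raw count of a word in a document multiplied by the natural logarithm of (number of documents / number of documents containing the word). This is an assumption for the example, not necessarily any of the variations tested in the study mentioned above; the function name and the three tiny "documents" are hypothetical.

```python
import math
from collections import Counter


def tf_idf(collection):
    """collection: list of token lists. Returns one {word: weight} dict per document."""
    n_docs = len(collection)
    doc_freq = Counter()
    for tokens in collection:
        doc_freq.update(set(tokens))  # count each word once per document
    weights = []
    for tokens in collection:
        counts = Counter(tokens)  # occurrences of each word in this document
        weights.append(
            {w: c * math.log(n_docs / doc_freq[w]) for w, c in counts.items()}
        )
    return weights


# Hypothetical three-document collection, already split into words.
docs = [
    "indexing of text databases".split(),
    "text retrieval and text indexing".split(),
    "history of the province".split(),
]
weights = tf_idf(docs)
for word, weight in sorted(weights[1].items(), key=lambda item: -item[1]):
    print(f"{word:10s} {weight:.3f}")
```

Note that "text" occurs twice in the second document but also appears in another document, so its weight ends up below that of "retrieval", which occurs only once but in this document alone.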
Another technique used for retrieval is "fuzzy set theory" or "fuzzy logic". This
technique attempts to retrieve information without being bound to specific words
or spelling. Here, too, the result is greatly enhanced recall performance at the
expense of precision.
But all of these techniques are on the output side. In the indexing profession we
are on the input side. We would like to see that the entries created for retrieval
improve the ability to retrieve relevant information.
Words and textual "strings" (multi-word sequences or phrases) are used to index and
search these systems. Words are obviously just representatives of concepts. The
user actually needs information about a concept. This is exactly what the information
professional tries to achieve, namely to identify what the information item is
"about". Hutchins (1978) coined the phrase "aboutness" to describe this
phenomenon. This is exactly where the indexer performs his skill: to create
retrieval elements for optimising relevance. It is all about the process of
identifying and creating index entries to represent the "aboutness" of a concept.
One of the areas investigated since the 1970s was the application of linguistic
theory. The main problem here was that linguistic theory only gave answers to the
function of words up to the unit of a sentence. Very little applicable theory was
available for working with whole documents or even collections of textual items.
Today we want to look at what database systems can offer us on the input side.
Database systems (specifically text retrieval systems) give us the following
features:
  The ability to house and work with a collection of information items
  The structure for organising or categorising the meta-data of a collection
  The ability to retrieve full text: every word
  The ability to retrieve items specific to individual meta-data categories
  Searching functionality: Boolean, word stem and proximity searching
  Word and string indexes related to the collection as a whole or within a field
  Batch modification of indexes
  Some even give us a thesaurus capability
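
To make the Boolean and proximity searching in this list concrete, here is a minimal sketch of a word index that remembers word positions. It is an illustration only and does not describe how any particular product works; the function names and the two sample records are hypothetical.

```python
from collections import defaultdict


def build_positional_index(records):
    """records: {record_id: text}. Returns word -> record_id -> list of positions."""
    index = defaultdict(lambda: defaultdict(list))
    for rec_id, text in records.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][rec_id].append(pos)
    return index


def boolean_and(index, a, b):
    """Records that contain both words (Boolean AND)."""
    return set(index.get(a, {})) & set(index.get(b, {}))


def within(index, a, b, distance):
    """Records where the two words occur at most `distance` positions apart."""
    hits = set()
    for rec_id in boolean_and(index, a, b):
        if any(
            abs(pa - pb) <= distance
            for pa in index[a][rec_id]
            for pb in index[b][rec_id]
        ):
            hits.add(rec_id)
    return hits


# Hypothetical records.
records = {
    "r1": "text retrieval systems index every word",
    "r2": "retrieval of relevant text is what matters",
}
index = build_positional_index(records)
print(boolean_and(index, "text", "retrieval"))
print(within(index, "text", "retrieval", 1))
```

The Boolean search returns both records, while the proximity search keeps only the record where the two words stand next to each other: a small example of trading recall for precision.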
Because text retrieval database systems build indexes, we can use these to
identify more successful index terms. Here I would suggest using the elements of
the term frequency-inverse document frequency theory: the word frequencies.
This theory acknowledges the distinctive roles of words in a document, in a
collection and the documents related to a word. Automatic indexing theory
identified these as:
  Within document frequency (WDF)
  Collection frequency (CF)
  Document frequency (DF)
WDF gives us the frequency of a word in a document. Too low and too high
frequencies give us non-significant words. We need to look at the medium to
higher frequencies. The focus here is the significance and role of a term within a
document.
CF gives you the frequency in a collection. Again, the medium to higher frequencies
are significant. The focus here is the significance and role of a term within a
document collection.
DF shows the number of documents linked to a word. Again, the medium to higher
frequencies are significant. The focus here is the significance of a document in a
collection with regard to a specific term or concept.
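
The three frequencies can be illustrated with a short sketch. It assumes a collection held as a simple mapping of record identifiers to text and splits words on whitespace; the function name and the sample records are hypothetical, and a real text retrieval system would read these counts from its own indexes.

```python
from collections import Counter


def word_frequencies(collection):
    """collection: {doc_id: text}. Returns (wdf, cf, df).

    wdf[doc_id][word] -> occurrences of the word within that document
    cf[word]          -> occurrences of the word in the whole collection
    df[word]          -> number of documents that contain the word
    """
    wdf = {doc_id: Counter(text.lower().split()) for doc_id, text in collection.items()}
    cf = Counter()
    df = Counter()
    for counts in wdf.values():
        cf.update(counts)         # adds the per-document occurrence counts
        df.update(counts.keys())  # adds one per document containing the word
    return wdf, cf, df


# Hypothetical collection of three short records.
collection = {
    "rec1": "mining law and mining rights",
    "rec2": "water law",
    "rec3": "mining history",
}
wdf, cf, df = word_frequencies(collection)
print(wdf["rec1"]["mining"], cf["mining"], df["mining"])
```

For "mining" this gives a within document frequency of 2 in the first record, a collection frequency of 3 and a document frequency of 2.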
Text retrieval systems, because they build indexes, can give us these word
frequencies. Through the knowledge of word frequency theory one can assess the
significance of terms. Text retrieval systems also allow us to incorporate full text
into one of the fields alongside the meta-data.
Indexers tend to focus on index terms related to a specific item of information,
relating to the "within document frequency". One can therefore say that this is
the element best catered for, namely the role and significance of a term within an
information item.
The other two elements should be a concern.
The significance in a collection is often taken care of by:
  Using an indexer that "knows the subject area"
  The indexer's knowledge of the collection
  The indexer's knowledge of the users, company or environment.
All three of these are done in a subconscious manner and are often not really taken
care of.
The third element, namely the role or significance of a document in a collection
with regard to a specific term, is often not even considered. An analysis of the
subject coverage of the documentation used in a countrywide HSRC study in
1988 gave alarming results. This is not apparent from looking at the study's reports.
We can use text retrieval database systems to determine better index terms in
the following way:
With single documents one can do an analysis of the term frequencies within a
document. Most text retrieval systems can load the full text of a single document
into a record in the database, generating the word frequencies in the index. Stop
word lists can successfully be applied.
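
A minimal sketch of this single-document analysis follows. The stop word list, the threshold of two occurrences and the sample text are all assumptions made for the example; in practice the frequencies would come from the index the text retrieval system builds.

```python
import re
from collections import Counter

# Illustrative stop word list only; a real list would be much longer.
STOP_WORDS = {"the", "of", "and", "a", "in", "to", "is", "for"}


def candidate_terms(text, stop_words=STOP_WORDS, min_count=2):
    """Return (word, count) pairs for non-stop words occurring at least min_count times."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    return [(w, c) for w, c in counts.most_common() if c >= min_count]


# Hypothetical document text.
text = (
    "The thesaurus is a tool for the indexer. The indexer builds the "
    "thesaurus in the database, and the database stores the thesaurus."
)
terms = candidate_terms(text)
print(terms)
```

The stop word list removes the very frequent function words, and the minimum count removes words that occur only once, leaving the medium to higher frequency words as candidate index terms for the indexer to judge.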
With collections of documents the text retrieval database systems can be used to:
  Determine collection frequencies of terms in a collection, within or across
  fields, before indexing a new document
  Determine document frequencies of terms (the number of documents
  relating to the index term) in a collection when indexing a document.
The structure of database systems can successfully be used to:
  Assign index terms to a specific meta-data element, increasing the
  specific role of a term
  Create database fields that would enhance the role of index terms. An
  example here is where the subject field is divided into a field of broad terms
  from a controlled vocabulary and a separate field for free text indexing, and
  even a separate field for names of persons, places and companies
Most text retrieval systems have the ability to do batch changes in indexes. When
Northern Province became Limpopo, how many of the information services made
that change?
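
What such a batch change amounts to can be sketched as follows. The record layout (a list of dictionaries with a "places" field) and the function name are hypothetical; a real text retrieval system would offer this through its own batch modification facility.

```python
def batch_replace_term(records, field, old_term, new_term):
    """Replace old_term with new_term in one field of every record.

    Returns the number of records that were changed.
    """
    changed = 0
    for record in records:
        terms = record.get(field, [])
        if old_term in terms:
            record[field] = [new_term if t == old_term else t for t in terms]
            changed += 1
    return changed


# Hypothetical records with a separate field for place names.
records = [
    {"title": "Report A", "places": ["Northern Province", "Gauteng"]},
    {"title": "Report B", "places": ["Limpopo"]},
    {"title": "Report C", "places": ["Northern Province"]},
]
changed = batch_replace_term(records, "places", "Northern Province", "Limpopo")
print(changed, [r["places"] for r in records])
```

Because the place names sit in their own field, the change cannot accidentally touch the same words elsewhere in a record.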
As mentioned, some text retrieval database systems have a thesaurus feature.
Some are linked to an existing thesaurus, some allow the user or indexer to build
a thesaurus. Building your own thesaurus can be of great value in narrower
subject fields. This could be a very valuable tool for the indexer to use on the input
side.
Lastly, we can use text retrieval systems to test the effectiveness of our indexing,
using the performance measures of recall and precision. Information services
should, on a regular basis, reflect on the effectiveness of the indexes of a collection
by analysing and rectifying problems.
Database systems, and specifically text retrieval database systems, have
developed to a standard where they are very efficient, with a whole collection of
features relevant to the input side of indexing. They are therefore a powerful tool in
the hands of the indexer.
Bibliography
Bertino, E., B. C. Ooi, R. Sacks-Davis, K-L. Tan, J. Zobel, B. Shidlovsky, and
B. Catania. 1997. Indexing Techniques for Advanced Database Systems. Kluwer
Academic Publishers.
Elmasri, R. and S. Navathe. 2000. Fundamentals of database systems.
Addison-Wesley.
Graf, P. 1996. Term Indexing. Springer.
Hutchins, W. J. The concept of "aboutness" in subject indexing. Aslib Proceedings
30 (5), May 1978, p.172-181.
Manning, Christopher D.; Raghavan, Prabhakar and Schütze, Hinrich. Introduction
to Information retrieval, 2008. Cambridge University Press.
http://informationretrieval.org/
http://nlp.stanford.edu/IR-book/html/htmledition/irbook.html
Ramesh et al., 2001. In:
http://www.cs.toronto.edu/~mcosmin/publications/thesis/node61.html
Salton, G. and M.J. McGill. 1983. Introduction to modern information retrieval.
McGraw Hill.