PAC3 at JALT 2001 Conference Proceedings
International Conference Centre, Kitakyushu, Japan
November 22-25, 2001
Developing a One Million Word Spoken
EFL Learner Corpus
Yukio Tono
Meikai University
Tomoko Kaneko
Showa Women’s University
Hitoshi Isahara
Communication Research Laboratory
Emi Izumi
Communication Research Laboratory
Toyomi Saiga
Communication Research Laboratory
Emiko Kaneko
ALC Press
This paper will report on the progress of a project to
compile a one-million-word spoken corpus of Japanese
learners of English. In 1999, we launched a project to
compile a new learner corpus in collaboration with ALC
Press Inc. and Communications Research Laboratory. The
major characteristic of the project is that the corpus data
is entirely based upon the audio-recordings of an English
oral proficiency interview test called ACTFL-ALC Standard
Speaking Test. One of the unique features of the corpus
is that each speaker’s data include his or her proficiency
profile based on the SST evaluation schemes. This makes
it possible for corpus users to study the learner language
across different proficiency levels, a feature which has
not often been available for other learner corpora. In this
paper, we will describe the project by summarizing the data
collection procedure, text format, transcription guidelines,
and annotation schemes (especially error tagging schemes),
as well as the theoretical and pedagogical implications of
the project.

Japanese abstract (translated): This study is an interim report on a
project to build a one-million-word spoken corpus of Japanese learners
of English. In 1999, we began an attempt to create a new corpus with
the cooperation of the publisher ALC and the Communications Research
Laboratory. The most distinctive feature of the project is that the
corpus is based on audio recordings of an English oral proficiency
interview test called the ACTFL-ALC Standard Speaking Test. Because the
standardized test ratings are included in the data as speaker
attributes, learner groups can be divided by proficiency level for
purposes such as subcorpus searches. In this paper, we outline the data
collection method, format, transcription guidelines, and annotation
(especially the error tag specification), and discuss the theoretical
and pedagogical significance of the project.

The paper reports on an on-going project to compile a one-million-word
spoken corpus of Japanese learners of English and provides possible
implications for SLA and ELT research as well as language engineering.
Recently, SLA researchers and language teaching professionals have
begun to realize the importance of learner corpora as resources for
teaching and research, and major dictionary publishers such as Longman
and Cambridge University Press have already compiled their own learner
corpora in order to enrich their dictionary content by providing
information on common learner errors. Projects such as ICLE
(International Corpus of Learner English; see Granger 1998) and JEFLL
(Japanese EFL Learner) Corpus (see Tono 2000) both aim to compile
learner corpora to describe the interlanguage of particular L2 learner
groups using a corpus-linguistic methodology. To date, however, most of
these learner corpus projects are composed of written data only, while
the few spoken learner corpora that do exist (e.g. LINDSEI project at
Louvain, described in De Cock, et al. 1999) are rather small in size.

We launched our Standard Speaking Test (SST) corpus project in 1999.
This project is a collaboration between Communication Research
Laboratory and ALC Press, with a few advisory members from
universities. While English is taught over six years in secondary
schools in Japan, many of us feel that Japanese learners still cannot
function properly in English for communication purposes. It has been
argued that one of the problems of English language teaching in Japan
is that no serious attempt has been made to systematically record and
describe the acquisition process of Japanese learners of English in our
EFL context. It is very important to know objectively how much English
we have acquired after six years of instruction, and the progression of
standards. Mere adoption and application of teaching methods from
foreign countries will not always work in our country. We need to
gather data
on how Japanese learners learn English and to describe the
developmental path of their interlanguage. Thus, one of the main
purposes of this project is to identify the features of interlanguage
at different stages of L2 acquisition and construct a model of
interlanguage development. We hope to identify the mechanisms of
development from one stage of interlanguage to another, which we hope
will lead to the improvement of teaching methods and more rigorous
empirical research on the effect of learning methods on the transition
process.

The Standard Speaking Test

The Standard Speaking Test (SST) is a collaboration between the
American Council on the Teaching of Foreign Languages (ACTFL) and ALC
Press. It is based on the ACTFL Proficiency Guidelines for speaking and
the Oral Proficiency Interview (OPI). The ACTFL-OPI was first developed
in 1982, and since then it has been one of the most influential
speaking tests in the world despite the fact that there has been some
criticism against the empirical bases of the guideline (see, for
example, Bachman and Savignon 1986; Chalhoub-Deville 1997). ACTFL and
ALC Press worked together to develop a new speaking test for Japanese
learners of English. The proficiency guideline defines 9 different
proficiency levels (Level 1: ‘Novice-Low’ through Level 9: ‘Advanced’).
Each level is defined specifically in terms of the following criteria:
(a) context/content area, (b) text type, (c) global
task & function, (d) accuracy, which includes grammatical accuracy,
fluency, and pronunciation.

The SST takes the form of a 10 to 15-minute tape-recorded conversation
between a trained interviewer and a test candidate. The SST utilizes
interview techniques and picture prompts to simulate natural
conversation to the maximum extent possible in a testing situation.
The tape-recorded interview is scored by two different Raters (in case
of disagreement, the interview is rated by a Master Rater). In the SST,
the elicitation and scoring of speech samples are separate procedures,
unlike the OPI where both tasks are performed on the spot by the
interviewer. The SST interview process elicits speech samples through
the application of the following five-stage format:
1. Warm-up and initial assessment
2. Single picture prompt with level checks and probes
3. Role-play with level checks and probes
4. Single or picture sequence prompt with level checks and probes
5. Wind-down
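
The scoring procedure described above (two Raters, with a Master Rater consulted in case of disagreement) can be sketched as follows. This is a minimal illustration only: the function name `final_sst_level`, the treatment of any difference in level as a disagreement, and the rule that the Master Rater's rating is final are assumptions, not details stated in the paper.

```python
from typing import Callable, Optional

MIN_LEVEL, MAX_LEVEL = 1, 9  # the nine SST proficiency levels


def _check(level: int) -> int:
    """Reject ratings outside the nine-level scale."""
    if not MIN_LEVEL <= level <= MAX_LEVEL:
        raise ValueError(f"SST level must be between 1 and 9, got {level}")
    return level


def final_sst_level(
    rater_a: int,
    rater_b: int,
    master_rater: Optional[Callable[[], int]] = None,
) -> int:
    """Return the final level for one recorded interview.

    Assumption: the two Raters agree only if they assign the same level;
    otherwise the Master Rater's rating is taken as final.
    """
    _check(rater_a)
    _check(rater_b)
    if rater_a == rater_b:
        return rater_a
    if master_rater is None:
        raise ValueError("Raters disagree and no Master Rater was supplied")
    # The Master Rater is only consulted when the two Raters disagree.
    return _check(master_rater())
```

Passing the Master Rater as a callable reflects the idea that a third rating is requested only when it is needed.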

Although the interviewer is not responsible for the formal rating of
the test candidate, the interviewer must be able to conduct an on-going
informal assessment of the speaker’s proficiency in order to tailor the
questions,
prompts, and role plays most suited to the test candidate’s interests
and level of speaking proficiency. If the speech sample is poorly
elicited via prompts inappropriate to the speaker’s level of
proficiency, the rating of the speech sample may become invalid.

The SST serves to discriminate spoken English at Novice to Intermediate
High levels of proficiency utilizing a shorter interview than the OPI.
The SST discriminates more finely at Intermediate proficiency levels
than do other existing standardized measurement instruments of speaking
proficiency. The SST can also serve as a potential screening interview
for speakers who might be ready for, and benefit from taking, the OPI.

ALC Press possesses large archives of audio recordings of this test. We
saw these data as a potentially very useful spoken resource, as most
learner corpora to date consist of written data only. For this reason,
we decided to launch the present project to transcribe these archived
recordings and convert them into a spoken learner corpus. The strength
of this corpus project is that each file/piece of data has specific
information on the examinee’s oral proficiency level, as assessed by
the professional examiner. Whilst there are some developmental
interlanguage corpora available (e.g. JEFLL), the labelling or
determination of learner proficiency levels is often based on external
factors such as school years. Thus, a comparison between subcorpora
based on school years sometimes causes a problem. In
the Longman Learner’s Corpus, learner proficiency for each file is
encoded in its header, but judgements about the proficiency levels seem
to be entirely up to the teachers who donated the composition data, and
are thus not entirely reliable, since we have no information about the
possibly varied criteria or standards used by the different teachers.
Compared with these other learner corpora, therefore, SST data have
more reliable information on learners’ proficiency levels, which will
help to make comparative research based on proficiency subsections of
the corpus more valid.

The SST Corpus Project

The SST Corpus has been compiled as part of a larger project called
‘Research & Development of Congruent Communication Technologies’ led by
the Telecommunications Advancement Organization of Japan (TAO) and
Communication Research Laboratory (CRL). The primary goal of this
project is to develop natural language processing technologies that can
handle human errors in speech or writing and their application in such
areas as support systems for writing in English, machine translation,
and applications in education (e.g. learner error databases,
computer-assisted self-access language learning systems).

The SST learner corpus data have various types of error (lexical,
syntactic, phonetic, etc.). Information on the types and rates of error
that learners make
at various proficiency levels, the contexts in which particular learner
errors occur, etc. will serve as the input for machine learning of
interlanguage grammar. CRL has created an automatic machine learning
system (consisting of a lexicon and a grammar) which can be potentially
very useful for testing and evaluating the interlanguage grammar model
that the machine learns automatically. The results can be used for NLP
and educational purposes.
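
To make the idea of error rates at various proficiency levels concrete, the sketch below computes errors per 100 words for each SST level. It is an illustration under stated assumptions: the `Interview` record, its field names, and the choice of a per-100-words rate are hypothetical and are not taken from the project's actual file format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Interview:
    """Hypothetical summary of one transcribed interview."""
    level: int      # SST proficiency level, 1-9
    n_words: int    # number of words produced by the learner
    n_errors: int   # number of error tags in the transcript


def errors_per_100_words(interviews):
    """Pool interviews by level and return {level: errors per 100 words}."""
    words = defaultdict(int)
    errors = defaultdict(int)
    for interview in interviews:
        words[interview.level] += interview.n_words
        errors[interview.level] += interview.n_errors
    # Levels with no words are skipped to avoid division by zero.
    return {
        level: 100.0 * errors[level] / words[level]
        for level in sorted(words)
        if words[level] > 0
    }
```

Pooling counts within a level before dividing weights each interview by its length, which is one of several reasonable choices.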

The initial phase of the project includes the development of the
following tools and guidelines: (1) transcription guidelines, (2)
tagging schemes, (3) a tag editor, (4) error tagging schemes, and (5)
error tagging support tools. At the time of writing (December 2001),
the second version of a set of transcription guidelines has been
prepared for transcribing the data of about 510 subjects, using a
“TagEditor” (see Figure 1).

TagEditor has been developed specially for this learner corpus project.
Besides the usual functionalities common to text editors, it has unique
features such as the ability to insert customized XML-like tags,
validate tags (see Figure 2), show tags in colour, do simple grep
searches and concordancing, use file templates, check file formats, do
automatic ‘search and replace’, and so on. A similar tool was developed
at Louvain for the ICLE project, but our tool has several new features,
including tag validation and a simple concordancer.

Figure 1: A screenshot of TagEditor Version 1.2.

Figure 2: Tag validation function of TagEditor
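
The kind of check that tag validation performs can be sketched as a small well-formedness test for XML-like tags. This is not TagEditor's actual implementation, which the paper does not describe; the function name `validate_tags` and the tag names `err` and `pause` used in the usage note are hypothetical.

```python
import re

# Matches <name ...>, </name> and <name ... />.
TAG = re.compile(r"<(/?)([A-Za-z][\w-]*)[^<>]*?(/?)>")


def validate_tags(text, allowed):
    """Return a list of problems found in the XML-like tags of `text`.

    `allowed` is the set of tag names the transcription scheme permits.
    An empty list means every tag is known and properly nested.
    """
    problems = []
    open_tags = []  # stack of currently open tag names
    for match in TAG.finditer(text):
        is_closing = match.group(1) == "/"
        name = match.group(2)
        is_self_closing = match.group(3) == "/"
        if name not in allowed:
            problems.append(f"unknown tag <{name}> at offset {match.start()}")
            continue
        if is_self_closing:
            continue  # nothing to open or close
        if not is_closing:
            open_tags.append(name)
        elif open_tags and open_tags[-1] == name:
            open_tags.pop()
        else:
            problems.append(f"unexpected </{name}> at offset {match.start()}")
    problems.extend(f"unclosed <{name}>" for name in open_tags)
    return problems
```

For example, with `allowed = {"err", "pause"}`, the line `I <err>goed home` is reported as containing an unclosed `<err>`, while `I <err>goed</err> home <pause/>` passes.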

Currently, the team is working on error tagging. It is extremely
difficult to define a generic error tagset that can be applied to any
type of learner corpus. So far we have developed a generic error tagset
and an accompanying error tag manual. We focus mainly on lexical and
syntactic errors at this stage. At a later stage, however, we will
shift our focus to other types of errors, e.g. prosodic, pragmatic, and
discoursal errors. The data will be made publicly available after the
three-year project ends in 2003.

SST Data, SLA/ELT Research and Beyond

Using SST Corpus for SLA research

The influence of corpus linguistics has spread rather slowly in the
fields of SLA research. Leech (1998) states
two main reasons for this: firstly, it is extremely time-consuming and
labour-intensive to create a corpus. Secondly, the intellectual climate
in applied linguistics has been such that theory-driven experimental
studies have prevailed for the last 20 years and the data-driven
descriptive methods of corpus linguistics have not been preferred.
Recently, however, there is a growing awareness that a computational
analysis of a large body of language use data could provide useful data
for language researchers as more focus is now on usage-based linguistic
models as well as lexico-grammatical aspects of a language.

L2 learner corpora have great potential for clarifying the L2
acquisition process. They serve as better data than what we have mostly
relied on so far and enable us to investigate the non-native speaking
learners’ language not only from a negative point of view (‘What did
the learner get wrong?’) but also from a positive one (‘What did the
learner get right?’). They also make it possible to investigate L2
learners’ avoidance behaviour by examining overuse/underuse phenomena.
Undeniably, all types of SLA data have their strengths and weaknesses
and one cannot help but agree with Ellis (1994: 676) that ‘Good
research is
research that makes use of multiple sources of data.’ The learner
corpora, if properly compiled, serve well as a valuable addition to
current SLA data sources. They can be used for (a) verification of
theories and hypotheses in SLA and (b) description of interlanguage
development (developmental stages of linguistic features
(lexico-grammatical/discoursal/pragmatic), L1 transfer effect,
overuse/underuse of particular features, universal/L1 specific errors,
native-like/non-native-like performance) (Tono 1998).

Figure 3: The potential of SST Corpus for SLA research

Figure 3 illustrates this point. If the patterns of use/misuse and
overuse/underuse are identified across different stages of acquisition
and a proper comparison is made between the SST data and different
types of learner corpora such as written or parallel versions (with
corrected errors), we will be able to identify various aspects of
interlanguage grammar development. Ellis (2001) predicts an important
role of corpora in SLA in the future and distinguishes three types of
corpora as useful: (1) corpora of ‘authentic’ native speaker language
use, (2) corpora of learner language use, and (3) corpora of native
speaker language use with learners (i.e. samples of foreigner talk).
The SST data serves as a useful tool for the second type.
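
One way overuse and underuse of a particular item might be quantified is to compare its frequency in a learner corpus with its frequency in a reference corpus. The paper does not say which statistic the project would use; the log-likelihood ratio below is a measure commonly used in corpus linguistics and is swapped in here purely for illustration. The function names are hypothetical.

```python
import math


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Log-likelihood ratio for one item observed in two corpora.

    freq_*: occurrences of the item; size_*: total words in each corpus.
    """
    combined = freq_a + freq_b
    total = size_a + size_b
    expected_a = size_a * combined / total
    expected_b = size_b * combined / total
    score = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2.0 * score


def usage(freq_learner, size_learner, freq_ref, size_ref):
    """Label an item as overused or underused by learners, with its score."""
    rate_learner = freq_learner / size_learner
    rate_ref = freq_ref / size_ref
    if rate_learner > rate_ref:
        direction = "overuse"
    elif rate_learner < rate_ref:
        direction = "underuse"
    else:
        direction = "equal"
    score = log_likelihood(freq_learner, size_learner, freq_ref, size_ref)
    return direction, score
```

A larger score indicates a frequency difference that is less likely to be due to chance; the direction label only records which corpus has the higher relative frequency.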

The implications of the SST Corpus project for English language teaching

The exploitation of SST Corpus for English language teaching can be
either direct or indirect. As shown in Figure 4, as indirect
applications of the data, spoken learner corpus data could provide the
information about L2 learners’ acquisition process and possible error
sources and patterns, which can be invaluable resources for (a) ELT
syllabus and materials design, (b) test construction and (c) teacher
education. For direct use, SST Corpus can be exploited as a part of
on-line resources for L2 learners’ data-driven learning or a part of
research resources for postgraduate students in applied linguistics.

Figure 4: The potential of SST Corpus for English language teaching

Future technologies

Another interesting possibility is the collaboration with researchers
in natural language processing (NLP) fields. They have considerable
expertise in using language corpora for machine translation,
information retrieval, and language modelling. Some computers can learn
rules of a language statistically on the basis of a large amount of
natural language data. If a learner corpus is fed into such a computer,
it could possibly learn a grammar of interlanguage. Since SST Corpus
has nine proficiency levels, it would be an interesting possibility to
explore how a computer can statistically learn the grammars of
interlanguages at different proficiency levels. This approach is in
line with the one that Dan Jurafsky and his colleagues have taken
recently in what they call ‘computational psycholinguistics.’ We could
possibly develop an error detector or a language
proficiency assessment system that can analyse the input from learners
and identify which developmental stage a particular learner is at.
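
As a toy illustration of learning level-specific models statistically, the sketch below trains one add-one-smoothed bigram model per proficiency level and assigns a new utterance to the level whose model gives it the highest probability. Everything here is an assumption made for illustration: the paper does not specify a model type, the class and function names are hypothetical, and the example sentences are invented rather than drawn from the SST data.

```python
import math
from collections import Counter


def bigrams(tokens):
    """Pair each token with its successor, padding with boundary markers."""
    padded = ["<s>"] + tokens + ["</s>"]
    return list(zip(padded, padded[1:]))


class BigramModel:
    """Bigram model with add-one smoothing over a shared vocabulary."""

    def __init__(self, utterances, vocab_size):
        self.vocab_size = vocab_size
        self.context_counts = Counter()
        self.bigram_counts = Counter()
        for utterance in utterances:
            for prev, cur in bigrams(utterance.lower().split()):
                self.context_counts[prev] += 1
                self.bigram_counts[(prev, cur)] += 1

    def log_prob(self, utterance):
        """Smoothed log-probability of an utterance under this model."""
        total = 0.0
        for prev, cur in bigrams(utterance.lower().split()):
            numerator = self.bigram_counts[(prev, cur)] + 1
            denominator = self.context_counts[prev] + self.vocab_size
            total += math.log(numerator / denominator)
        return total


def train_models(level_data):
    """Train one model per level; the vocabulary is shared across levels."""
    vocab = {"</s>"}
    for utterances in level_data.values():
        for utterance in utterances:
            vocab.update(utterance.lower().split())
    vocab_size = len(vocab) + 1  # one extra slot for unseen words
    return {
        level: BigramModel(utterances, vocab_size)
        for level, utterances in level_data.items()
    }


def guess_level(models, utterance):
    """Return the level whose model scores the utterance highest."""
    return max(models, key=lambda level: models[level].log_prob(utterance))


# Invented toy data, NOT taken from the SST Corpus.
TOY_DATA = {
    2: ["i go school", "i like tennis", "he go home"],
    7: ["i went to school yesterday", "i have been playing tennis",
        "he went home early"],
}
```

With these toy models, `guess_level(train_models(TOY_DATA), "i go home")` selects level 2. Sharing one vocabulary size across levels keeps the smoothed probabilities comparable between models.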

Learner data can also be used for the development of a noise-proof
machine translation system. A future translation tool could hear the
non-native speakers’ erroneous sentences and interpret them properly.
The learner corpus data can be useful resources for developing the
knowledge base for such a system. Our team is ideal in the sense that
we have both SLA and NLP researchers working together to fully exploit
a corpus of Japanese-speaking learners of English.

We are hoping to finish the data collection within a year and release
the corpus in the fiscal year 2003 so that it will be freely available
for research and commercial purposes. We are looking forward to your
input and comments on this exciting project.

References

Bachman, L. F. and Savignon, S. (1986). The evaluation of communicative
language proficiency: A critique of the ACTFL oral interview. Modern
Language Journal 70, 380-390.
Chalhoub-Deville, M. (1997). Theoretical models, assessment frameworks
and test construction. Language Testing 14, 3-22.
De Cock, S., Granger, S. and Petch-Tyson, S. (1999). The Louvain
International Database of Spoken English Interlanguage (LINDSEI)
Project. An internal report at Catholic University of Louvain.
Ellis, R. (1994). The Study of Second Language Acquisition. Oxford
University Press: Oxford.
Ellis, R. (2001). Real Data and Real Pedagogy. Lecture given at the 2nd
Learner Corpus Workshop at Showa Women’s University, Tokyo.
Leech, G. (1998). Preface. In S. Granger (ed.) Learner English on
Computer. London: Longman.
Tono, Y. (1998). A computer learner corpus-based analysis of the
acquisition order of English grammatical morphemes. In TALC (Teaching
and Language Corpora) 98 Conference Proceedings, Keble College Oxford.
Tono, Y. (2000). A corpus-based analysis of interlanguage development:
analysing part-of-speech tag sequences of EFL learner corpora. In
Lewandowska-Tomaszczyk & J. Melia (eds.) PALC’99: Practical
Applications in Language Corpora. Peter Lang GmbH, Frankfurt am Main,
pp. 323-342.