The Icor team has offered a TEI export of the Clapi corpora since 2006, to promote the dissemination of oral corpora in a standardized format and to share this practice across the projects that use the Clapi databank. Our aim is to use TEI to encode both metadata and transcripts. Our transcripts include interactional phenomena such as overlaps, pauses, prosody, vocal events and gestures, at a fine level of granularity. Moreover, a single interactional situation may have several recordings, more or less anonymized, from audio or video sources, at different quality levels depending on end-user needs. The subset of TEI elements has changed with the databank releases (the latest in November 2013). It now includes a mechanism for adapted spelling, such as the token `fin in place of the token enfin, so that we are able to switch to standard spelling and benefit from automatic annotation tools that rely on standard spelling (the query tool in Clapi, morphosyntactic tools in Orféo).
Technical choices
As a first step, Clapi prefers not to develop its own tagset through an ODD customization, and would rather wait for a common solution within the oral corpora community, based on the Ircom interoperability workgroup, the ISO TEI workgroup, or other future initiatives.
In the same way, the oral corpora community has to define a common classification of corpora with optional sublevels. Currently Clapi bases its own classification on the type of interactional situation (private, professional, institutional, commercial, medical, didactic, ...), encoded with <keywords> and <item> elements.
Data available in TEI
Metadata are available in TEI for all the transcripts (595 transcripts). Both metadata and transcripts are available in TEI for the searchable transcripts (123 transcripts), that is, those normalized in XML by the Clapi team.
List of required changes to elements and attributes
<teiHeader>
Recording and transcript
An element to describe recording quality, rather than adding a desc element inside media: <quality level="///">
An element to describe the kind of anonymization applied to the recording, rather than adding a desc element inside media: <anonymization type="bip/contour prosodique">
A dedicated element to describe the transcript: its duration, its time span if it is partial, and its format (Transcriber, Praat, CLAN, ELAN, ...): <transcript duration="" from="" to="" format="" url="">
A list of annotations (type, level)
An element dedicated to transcript checking: <checking>
Multilinguism and speakers
For multilingual interactions, the main and secondary languages have to be given as percentages, which are difficult to evaluate; a main/other attribute would be more suitable.
The language of the speaker's parents is missing.
The place, or the kind of place, where non-native speakers learned the language is missing; an element like <learning> is needed.
<text>
For the timecode definition and duration, we have chosen the timeline element: <timeline unit="s" origin="#T0"><when xml:id="T0" absolute="00:00:00"/><when xml:id="T1" absolute="00:01:05:20"/> ... To synchronize verbal or non-verbal events with timecodes, we decided to use anchor elements with Txx references, rather than defining relative timecodes and linking the elements to them. This lets us distinguish an overlapped segment from an overlapping segment, and the result is easier to read, because our data often contain a thousand overlaps or more: <anchor synch="#Txx"/>
We then use an anchor with an xml:id attribute to mark the overlapped segment, with an identifier like CH_xxx, and an anchor with a synch attribute for the corresponding overlapping segment.
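As an illustration of how the Txx references can be followed, here is a minimal Python sketch that looks up the when element targeted by each anchor. It uses only the standard library; the sample fragment is a reduced, partly invented excerpt (the second anchor and the u content are assumptions), and the function name anchor_times is hypothetical. The absolute values are returned as written, without interpreting the timecode format.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:id attribute under the XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Reduced sample: the timeline values come from the text above,
# the second anchor and the <u> content are illustrative only.
SAMPLE = """
<text xml:lang="fr">
  <timeline unit="s" origin="#T0">
    <when xml:id="T0" absolute="00:00:00"/>
    <when xml:id="T1" absolute="00:01:05:20"/>
  </timeline>
  <body>
    <anchor synch="#T0"/>
    <u who="#ELI"><w n="1">hm</w><anchor synch="#T1"/></u>
  </body>
</text>
"""

def anchor_times(text):
    """Return (timeline id, absolute value) for each anchor that points to a <when>."""
    times = {when.get(XML_ID): when.get("absolute") for when in text.iter("when")}
    resolved = []
    for anchor in text.iter("anchor"):
        target = anchor.get("synch", "")
        # Keep only references of the form "#Txx" that exist in the timeline.
        if target.startswith("#") and target[1:] in times:
            resolved.append((target[1:], times[target[1:]]))
    return resolved

print(anchor_times(ET.fromstring(SAMPLE)))
```

Anchors whose synch value does not match a when identifier (for example the CH_xx overlap references) are simply skipped by this lookup.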
<anchor xml:id="CH_xx"/> and <anchor synch="#CH_xx"/> MAR [ça va ____] BEA [on pose nos ] affaires là
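The pairing of the two anchors can be sketched in Python as follows. This is only an illustration under stated assumptions: the sample turns are rebuilt from the MAR/BEA example above with invented n values, and the function name pair_overlaps is hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:id attribute under the XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Illustrative sample rebuilt from the MAR/BEA example; the n values are invented.
SAMPLE = """
<body>
  <u who="#MAR"><anchor xml:id="CH_1"/><w n="1">ça</w> <w n="2">va</w></u>
  <u who="#BEA"><anchor synch="#CH_1"/><w n="3">on</w> <w n="4">pose</w>
    <w n="5">nos</w> <w n="6">affaires</w> <w n="7">là</w></u>
</body>
"""

def pair_overlaps(body):
    """Return (overlap id, overlapped speaker, overlapping speaker) triples."""
    overlapped = {}
    # First pass: anchors carrying xml:id mark the overlapped segment.
    for u in body.iter("u"):
        for anchor in u.iter("anchor"):
            ident = anchor.get(XML_ID)
            if ident:
                overlapped[ident] = u.get("who")
    pairs = []
    # Second pass: anchors carrying synch mark the overlapping segment.
    for u in body.iter("u"):
        for anchor in u.iter("anchor"):
            target = anchor.get("synch", "")
            if target.startswith("#") and target[1:] in overlapped:
                pairs.append((target[1:], overlapped[target[1:]], u.get("who")))
    return pairs

print(pair_overlaps(ET.fromstring(SAMPLE)))
```

Because the overlapped side is identified by xml:id and the overlapping side by synch, the direction of each overlap can be read without comparing timecodes.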
An overlap annotated inside a token [c'est chou]ette <anchor xml:id="CH_6" n="155/3a"/> <w n="155/3b">c'</w> <w n="155/3c">est</w> <choice n="155/3d"> <orig>chou <anchor xml:id="CH_7" n="155/3d"/> ette </orig> <reg>chouette</reg> </choice>
A pause: we added the rend attribute to preserve the transcriber's original notation, so that it can be restored: <pause type="short/long" dur="PT0.2S" rend="(0.2)">
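A minimal Python sketch of how the original notation could be restored from a pause element, and how dur and rend could be checked against each other. The function names restore_notation and is_consistent are hypothetical, and the sketch assumes that dur always has the form PT<seconds>S shown in the example.

```python
import re
import xml.etree.ElementTree as ET

# Assumed shape of the dur attribute, e.g. "PT0.2S" or "PT25.5S".
DUR_RE = re.compile(r"^PT(\d+(?:\.\d+)?)S$")
# A timed notation such as "(0.2)"; untimed notations such as "(.)" do not match.
TIMED_REND_RE = re.compile(r"^\((\d+(?:\.\d+)?)\)$")

def restore_notation(pause):
    """Return the transcriber's notation: rend if present, else rebuilt from dur."""
    rend = pause.get("rend")
    if rend is not None:
        return rend
    match = DUR_RE.match(pause.get("dur", ""))
    if match is None:
        raise ValueError("pause has neither rend nor a usable dur")
    return "(" + match.group(1) + ")"

def is_consistent(pause):
    """True when a timed rend agrees with dur; untimed rend values are accepted."""
    timed = TIMED_REND_RE.match(pause.get("rend", ""))
    if timed is None:
        return True
    match = DUR_RE.match(pause.get("dur", ""))
    return match is not None and float(match.group(1)) == float(timed.group(1))

pause = ET.fromstring('<pause type="short" dur="PT0.2S" rend="(0.2)"/>')
print(restore_notation(pause), is_consistent(pause))
```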
A token <w n="m7/8e4">euh</w>
A token in adapted spelling like `fin in place of enfin : <choice n="m7/7jd"><orig>`fin</orig><reg>enfin</reg></choice>
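The switch between adapted and standard spelling can be sketched in Python as below. This is an illustration only: the sample turn is assembled from the examples in this section, and the function names token_text and tokens are hypothetical. It assumes that tokens are either w elements or choice elements with orig and reg children.

```python
import xml.etree.ElementTree as ET

# Sample turn assembled from the examples in this section (illustrative only).
SAMPLE = """
<u who="#BEA">
  <choice n="m7/7jd"><orig>`fin</orig><reg>enfin</reg></choice>
  <w n="155/3b">c'</w>
  <w n="155/3c">est</w>
  <choice n="155/3d">
    <orig>chou <anchor xml:id="CH_7" n="155/3d"/> ette </orig>
    <reg>chouette</reg>
  </choice>
</u>
"""

def token_text(elem):
    """Join the text of a token and drop the whitespace left around inner markers."""
    return "".join("".join(elem.itertext()).split())

def tokens(elem, spelling="reg"):
    """List the tokens in document order, reading <reg> or <orig> inside <choice>."""
    out = []
    for child in elem:
        if child.tag == "w":
            out.append(token_text(child))
        elif child.tag == "choice":
            picked = child.find(spelling)
            if picked is not None:
                out.append(token_text(picked))
        else:
            # Descend into containers such as seg or unclear.
            out.extend(tokens(child, spelling))
    return out

turn = ET.fromstring(SAMPLE)
print(tokens(turn, "reg"))   # standard spelling
print(tokens(turn, "orig"))  # adapted spelling
```

Reading reg gives the input expected by tools based on standard spelling, while reading orig keeps the transcriber's adapted forms.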
Lengthening tomber::: <w n="155/12f">tomber <shift feature="tempo" new="rall"/> <shift feature="tempo" new="rall"/> <shift feature="tempo" new="rall"/></w> For lengthening annotated inside a token, we need to use the choice element to keep the original annotated token ou:ais: <choice n="155/5o"> <orig>ou <shift feature="tempo" new="rall"/>ais <shift feature="tempo" new="rall"/></orig> <reg>ouais</reg> </choice>
The loudness of some syllables: nON <w n="155/2j">n<shift feature="loud" new="f">on</shift></w>
Unclear segments or words. An unclear word (mais): <w n="155/3lb"> <unclear>mais</unclear> </w> An unclear segment (qui tournent): <unclear><w n="155/4c3">qui</w> <w n="155/4c4">tournent</w></unclear>
Information (vocal or kinesic) that applies to a set of words or pauses requires shift to be extended so that it can span w and pause elements. <((en riant)) bon> <seg type="non-speech" rend="((en riant))"><w n="155/ee">bon</w></seg> <((rebouche la bouteille)) (0.3)> <seg type="incident" rend="((rebouche la bouteille))"> <pause type="short" dur="PT0.3S" rend="(0.3)" n="155/148"/> </seg>
Comments : vocal or note comments ((rires)) <vocal><desc n="155/179">((rires))</desc></vocal>
An additional element to align overlapped and overlapping segments over the same span: <space style="overlap" rend="________"/>
((France - Repas - Kiwi 2008, Repas entre amies))
(00:00:00)
(25.5)
((ELI marche vers la porte et ouvre la porte))
ELI BONJOUR:/
(2.8)
((les invitées arrivent devant la porte))
BEA salut/
ELI ça va/
BEA ça va et toi/
(1.6)
((BEA et ELI se font la bise))
ELI j'ai ent[endu du BR]UIT
MAR ________[salut/ ___]
MAR oui
BEA ouais
((ELI et MAR se font la bise))
MAR <((rires)) (0.4)> merde on nous entend arriver de loin
ELI hm
(0.3)
BEA ah il fait chaud ici
MAR ah ouais ça fait du bien
ELI bon ben voilà vous aviez pas vu/
BEA [nON]
MAR [nON]
(.)
ELI [voilà ____]
BEA [c'est chou]ette/ (.) c'est chouette ouais\
(0.2)
MAR c'est cool
BEA [.h ouais]
ELI [.H: ____] bon j` vous laisse vous installer les [filles/]
<text xml:lang="fr">
  <timeline unit="s" origin="#T0">
    <when xml:id="T0" absolute="00:00:00"/>
    <when xml:id="T1" absolute="00:01:05:20"/>
    <when xml:id="T2" absolute="00:01:06:20"/>
    <when xml:id="T3" absolute="00:01:10:70"/>
    <when xml:id="T4" absolute="00:01:10:85"/>
    <when xml:id="T5" absolute="00:01:13:90"/>
    <when xml:id="T6" absolute="00:01:26:70"/>
    ...
  </timeline>
  <body>
    <anchor synch="#T0"/>
    <note>
      <desc n="155/1">((France - Repas - Kiwi 2008, Repas entre amies))</desc>
    </note>
    <pause type="long" dur="PT25.5S" rend="(25.5)" n="155/2"/>
    <note>
      <desc n="155/3">((ELI marche vers la porte et ouvre la porte))</desc>
    </note>
    <u who="#ELI">
      <w n="155/4">
        <shift feature="loud" new="f">bonjour</shift>
        <shift feature="tempo" new="rall"/>
        <shift feature="pitch" new="asc"/>
      </w>
      <space style="overlap" rend="_______"/>
      <pause type="long" dur="PT2.8S" rend="(2.8)" n="155/6"/>
      <note>
        <desc n="155/7">((les invitées arrivent devant la porte))</desc>
      </note>
    </u>
    <u who="#BEA">
      <w n="155/17">ouais</w>
      <space style="overlap" rend="_______"/>
      <note>
        <desc n="155/19">((ELI et MAR se font la bise))</desc>
      </note>
    </u>
DTD and XML schema (ongoing, depending on the choices of the oral corpora community)
<!ELEMENT TEI ( teiHeader, text ) > <!ATTLIST TEI xmlns CDATA #REQUIRED >
<!ELEMENT activity ( #PCDATA ) >
<!ELEMENT age EMPTY > <!ATTLIST age from NMTOKEN #REQUIRED > <!ATTLIST age to NMTOKEN #REQUIRED >
<!ELEMENT anchor EMPTY > <!ATTLIST anchor n CDATA #IMPLIED > <!ATTLIST anchor synch CDATA #IMPLIED > <!ATTLIST anchor xml:id ID #IMPLIED >
<!ELEMENT availability ( licence ) > <!ATTLIST availability status NMTOKEN #REQUIRED >
<!ELEMENT birth EMPTY > <!ATTLIST birth from NMTOKEN #REQUIRED > <!ATTLIST birth to NMTOKEN #REQUIRED >
<!ELEMENT body ( anchor, note, u+ ) >
<!ELEMENT choice ( orig, reg+ ) > <!ATTLIST choice n CDATA #REQUIRED >
<!ELEMENT correction ( p ) >
<!ELEMENT creation ( date ) >
<!ELEMENT date ( #PCDATA ) >
<!ELEMENT desc ( #PCDATA ) > <!ATTLIST desc n CDATA #IMPLIED >
<!ELEMENT editorialDecl ( interpretation, correction, normalization ) >
<!ELEMENT education ( #PCDATA ) >
<!ELEMENT encodingDesc ( p+, editorialDecl ) >
<!ELEMENT fileDesc ( titleStmt, publicationStmt, notesStmt, sourceDesc ) >
<!ELEMENT interpretation ( p ) >
<!ELEMENT item ( #PCDATA ) >
<!ELEMENT keywords ( list ) > <!ATTLIST keywords scheme NMTOKEN #REQUIRED >
<!ELEMENT langKnowledge ( langKnown ) >
<!ELEMENT langKnown ( #PCDATA ) > <!ATTLIST langKnown level NMTOKEN #REQUIRED > <!ATTLIST langKnown tag NMTOKEN #REQUIRED >
<!ELEMENT langUsage ( language ) >
<!ELEMENT language ( #PCDATA ) > <!ATTLIST language ident NMTOKEN #REQUIRED > <!ATTLIST language usage NMTOKEN #REQUIRED >
<!ELEMENT licence ( p+ ) > <!ATTLIST licence target CDATA #REQUIRED >
<!ELEMENT list ( item ) >
<!ELEMENT listPerson ( person+, personGrp ) >
<!ELEMENT media ( desc+ ) > <!ATTLIST media dur NMTOKEN #REQUIRED > <!ATTLIST media mimeType CDATA #REQUIRED > <!ATTLIST media url CDATA #REQUIRED >
<!ELEMENT name ( #PCDATA ) >
<!ELEMENT normalization ( p ) > <!ATTLIST normalization source CDATA #REQUIRED >
<!ELEMENT note ( #PCDATA | desc )* >
<!ELEMENT notesStmt ( note ) >
<!ELEMENT occupation ( #PCDATA ) >
<!ELEMENT orig ( #PCDATA | anchor | shift | unclear )* >
<!ELEMENT p ( #PCDATA ) >
<!ELEMENT particDesc ( listPerson ) >
<!ELEMENT pause EMPTY > <!ATTLIST pause dur NMTOKEN #IMPLIED > <!ATTLIST pause n CDATA #REQUIRED > <!ATTLIST pause rend CDATA #REQUIRED > <!ATTLIST pause type ( long | short ) #REQUIRED >
<!ELEMENT person ( age | birth | education | langKnowledge | note | occupation )* > <!ATTLIST person sex NMTOKEN #REQUIRED > <!ATTLIST person xml:id NMTOKEN #REQUIRED >
<!ELEMENT personGrp EMPTY > <!ATTLIST personGrp sex NMTOKEN #REQUIRED > <!ATTLIST personGrp size NMTOKEN #REQUIRED >
<!ELEMENT principal ( #PCDATA ) >
<!ELEMENT profileDesc ( creation, settingDesc, langUsage, particDesc, textClass ) >
<!ELEMENT pubPlace ( #PCDATA ) >
<!ELEMENT publicationStmt ( publisher, pubPlace, availability ) >
<!ELEMENT publisher ( #PCDATA ) >
<!ELEMENT recording ( media ) >
<!ELEMENT recordingStmt ( recording ) >
<!ELEMENT reg ( #PCDATA ) >
<!ELEMENT resp ( #PCDATA ) >
<!ELEMENT respStmt ( resp, name+ ) >
<!ELEMENT seg ( anchor | choice | pause | seg | shift | unclear | w )* > <!ATTLIST seg rend CDATA #REQUIRED > <!ATTLIST seg type ( non-speech | voice ) #REQUIRED >
<!ELEMENT setting ( activity ) >
<!ELEMENT settingDesc ( setting ) >
<!ELEMENT shift ( #PCDATA ) > <!ATTLIST shift feature ( loud | pitch | tempo ) #IMPLIED > <!ATTLIST shift new ( asc | breathing-in | breathing-out | desc | f | rall ) #REQUIRED >
<!ELEMENT sourceDesc ( recordingStmt ) >
<!ELEMENT space ( #PCDATA ) > <!ATTLIST space style NMTOKEN #REQUIRED > <!ATTLIST space rend NMTOKEN #REQUIRED >
<!ELEMENT teiHeader ( fileDesc, encodingDesc, profileDesc ) > <!ATTLIST teiHeader xml:lang NMTOKEN #REQUIRED >
<!ELEMENT text ( timeline, body ) > <!ATTLIST text xml:lang NMTOKEN #REQUIRED >
<!ELEMENT textClass ( keywords ) >
<!ELEMENT timeline ( when+ ) > <!ATTLIST timeline origin CDATA #REQUIRED > <!ATTLIST timeline unit NMTOKEN #REQUIRED >
<!ELEMENT title ( #PCDATA ) >
<!ELEMENT titleStmt ( title, principal, respStmt+ ) >
<!ELEMENT u ( #PCDATA | space | anchor | choice | note | pause | seg | shift | unclear | vocal | w )* > <!ATTLIST u who CDATA #REQUIRED >
<!ELEMENT unclear ( #PCDATA | space | anchor | choice | pause | seg | shift | unclear | vocal | w )* > <!ATTLIST unclear extent CDATA #IMPLIED >
<!ELEMENT vocal ( desc ) >
<!ELEMENT w ( #PCDATA | anchor | shift | unclear )* > <!ATTLIST w n CDATA #REQUIRED >
<!ELEMENT when EMPTY > <!ATTLIST when absolute NMTOKEN #REQUIRED > <!ATTLIST when xml:id ID #REQUIRED >
Using TEI in the Ciel project: Daniel Alcon, Carole Etienne
The Ciel-f project has chosen to archive its corpora in both the Clapi and Moca databanks. The corpus collections will be shared and exchanged regularly; transcripts are in the same Praat format and metadata are in TEI format.
using the <teiCorpus> element for the whole collection
using the <teiHeader> element for the whole Ciel project
using the <teiCorpus> element for a corpus based on an area
using the <TEI> element for each interactional recording, with its own <teiHeader>
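The nesting listed above can be sketched in Python as follows. This is an illustration under stated assumptions, not the actual Ciel-f export: the function name build_collection, the use of the n attribute, and the labels "area-1", "rec-1" and "rec-2" are hypothetical, and the teiHeader added to each area corpus is an assumption based on the TEI content model for teiCorpus.

```python
import xml.etree.ElementTree as ET

def build_collection(areas):
    """Build a nested teiCorpus skeleton.

    areas maps a (hypothetical) area label to a list of recording labels.
    """
    collection = ET.Element("teiCorpus")          # the whole collection
    ET.SubElement(collection, "teiHeader")        # header for the whole project
    for area, recordings in areas.items():
        corpus = ET.SubElement(collection, "teiCorpus", {"n": area})
        ET.SubElement(corpus, "teiHeader")        # assumption: one header per area corpus
        for recording in recordings:
            tei = ET.SubElement(corpus, "TEI", {"n": recording})
            ET.SubElement(tei, "teiHeader")       # metadata of one recording
            ET.SubElement(tei, "text")            # transcript of one recording
    return collection

root = build_collection({"area-1": ["rec-1", "rec-2"]})
print(ET.tostring(root, encoding="unicode"))
```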
Using TEI in the Orfeo project
The Orfeo project builds on an initial corpus made up of several corpora, drawn from several resources and transcribed in different formats; it would use TEI to manage them and to deliver its own annotations.
Clapi Management Council Heike Baldauf-Quilliatre, Isabel Colon de Carvajal, Carole Etienne, Justine Lascar, Véronique Traverso
Other contributions to CLAPI design and development: Kamel Aouiche, Lukas Balthazar, Michel Bert, Sylvie Bruxelles, Emilie Jouin-Chardon, Lorenza Mondada, Christian Plantin, Daniel Valéro
The recordings and transcripts of CLAPI are available under the Creative Commons 4.0 International licence (CC BY-NC-SA 4.0): they may be reused as is or modified, but not for commercial purposes, and with appropriate credit (CLAPI, http://clapi.icar.cnrs.fr); they may be redistributed under the same licence.