Exporting Oral CLAPI corpora in TEI

Technical choices
Data available in TEI
List of evolutions required concerning elements and attributes
Metadata example
Transcript example
DTD (schema in progress, following the Ircom workgroup)
Using TEI in the Ciel project: Daniel Alcon, Carole Etienne
Using TEI in the Orféo ANR project


Goals
The Icor team has offered a TEI export of Clapi corpora since 2006, to promote the dissemination of oral corpora in a standardized format and to share this practice across the projects that involve the Clapi databank.
Our aim is to use TEI to encode both metadata and transcripts. Our transcripts include interactional phenomena such as overlaps, pauses, prosody, vocal events and gestures, at a fine level of granularity. Moreover, a single interactional situation may have several recordings: more or less anonymized, audio or video, or at different quality levels according to end-user needs.
The subset of TEI elements in use has changed with each databank release (the latest in November 2013).
The latest release includes a new tool for adapted spelling, such as the token `fin in place of the token enfin. We are now able to switch to standard spelling and so benefit from automatic annotation tools that rely on it (the query tool in Clapi, the morphosyntactic tools in Orféo).
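The switch between adapted and standard spelling can be illustrated with a minimal Python sketch. It assumes the <choice>/<orig>/<reg> encoding used in the transcript example on this page, with no namespace declared on the fragment; the function names tokens and text_of are hypothetical and are not part of any Clapi tool.

```python
import xml.etree.ElementTree as ET

# Fragment adapted from the transcript example on this page (no namespace declared).
FRAGMENT = """
<u who="#ELI">
  <w n="155/4b">bon</w>
  <choice n="155/4c"><orig>j`</orig><reg>je</reg></choice>
  <w n="155/4d">vous</w>
  <w n="155/4e">laisse</w>
</u>
"""

def text_of(element):
    """Concatenate all text inside an element and drop layout whitespace."""
    return "".join("".join(element.itertext()).split())

def tokens(u, spelling="standard"):
    """Return the tokens of one <u>, reading <reg> or <orig> inside each <choice>."""
    wanted = "reg" if spelling == "standard" else "orig"
    result = []
    for child in u:
        if child.tag == "w":
            result.append(text_of(child))
        elif child.tag == "choice":
            variant = child.find(wanted)
            if variant is not None:
                result.append(text_of(variant))
    return result

u = ET.fromstring(FRAGMENT)
standard = tokens(u, "standard")  # spelling expected by annotation tools
adapted = tokens(u, "adapted")    # spelling as transcribed
print(standard)
print(adapted)
```

A tool based on standard spelling would read the first list, while a display faithful to the transcriber's choices would read the second.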

Technical choices
As a first step, Clapi prefers not to develop its own tagset through an ODD customization, and would rather wait for a common solution within the oral corpora community, based on the Ircom interoperability workgroup, the ISO TEI workgroup, or other future initiatives.

This choice implies continuing to use generic elements such as <p> or <note> to describe needs specific to oral data (corpora, recordings, transcripts). These needs are gathered in "List of evolutions required concerning elements and attributes".

In the same way, the oral corpora community has to define a common classification of corpora with optional sublevels. Currently, Clapi bases its own classification on the type of interactional situation (private, professional, institutional, commercial, medical, didactic, ...), encoded with <keywords> and <item> elements.
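As an illustration of how this classification can be read back, here is a minimal Python sketch. It assumes the <textClass> layout shown in the metadata example on this page, with no namespace declared on the fragment; the function name classification is hypothetical.

```python
import xml.etree.ElementTree as ET

# Fragment copied from the metadata example on this page.
FRAGMENT = """
<textClass>
  <keywords scheme="genre_clapi">
    <list>
      <item>Interactions de travail</item>
    </list>
  </keywords>
</textClass>
"""

def classification(text_class):
    """Map each keywords scheme to the list of its <item> labels."""
    result = {}
    for keywords in text_class.iter("keywords"):
        scheme = keywords.get("scheme", "")
        items = [(item.text or "").strip() for item in keywords.iter("item")]
        result.setdefault(scheme, []).extend(items)
    return result

genres = classification(ET.fromstring(FRAGMENT))
print(genres)
```

Keying the result on the scheme attribute would let a shared community classification sit alongside the Clapi one without changing the reading code.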

Data available in TEI
Metadata for all transcripts are available in TEI (595 transcripts).
Both metadata and transcripts are available in TEI for searchable transcripts (123 transcripts), that is, those normalized in XML by the Clapi team.

List of evolutions required concerning elements and attributes

<teiHeader>
Recording and transcript
Multilingualism and speakers
<text>

Metadata Example


<teiHeader xml:lang="fr">

<fileDesc>


<titleStmt>
<title>
Réunion de conception en architecture - mosaic ~ Mosaic - architecture ~ Mosaic - architecture - xml
</title>
<principal>Detienne Françoise</principal>
<principal>Traverso Véronique</principal>

<respStmt>
<resp>conçu par</resp>
<name>Baker Mickael</name>
<name>Bruxelles Sylvie</name>
<name>Darses Françoise</name>
<name>Detienne Françoise</name>
<name>Lund Kris</name>
<name>Mondada Lorenza</name>
<name>Sejourne Arnaud</name>
<name>Traverso Véronique</name>
<name>Visser Willemien</name>
</respStmt>

<respStmt>
<resp>collecté par</resp>
<name>Detienne Françoise</name>
<name>Visser Willemien</name>
</respStmt>

<respStmt>
<resp>transcrit par </resp>
<name>Greco Luca</name>
<name>Lascar Justine</name>
<name>Jouin-chardon Émilie</name>
</respStmt>

<respStmt>
<resp>préparé et balisé par</resp>
<name>CLAPI - Equipe Médiathèque</name>
</respStmt>

</titleStmt>

<publicationStmt>

<publisher>Groupe ICOR/ Plateforme CLAPI</publisher>
<pubPlace>http://clapi.univ-lyon2.fr</pubPlace>

<availability status="restricted">
<licence target="http://clapi.univ-lyon2.fr/V3_CGU.php">
<p>Conditions générales d'accès pour ce document</p>
<p>Copyright © ICAR. Tous droits réservés.</p>
<p>
Enregistrement vidéo d'une durée de 1h18m45s téléchargeable sous convention de recherche
</p>
<p>
Transcription mosaic - architecture - adaptée CLAPI au format doc en téléchargement libre
</p>
<p>
Transcription mosaic - architecture - clan au format clan - ca ou cha en téléchargement libre
</p>
<p>Transcription requêtable par les outils librement</p>
<p>Agrément CNIL de Clapi numéro : 2-12064</p>
</licence>
</availability>

</publicationStmt>

<notesStmt>
<note>
Des notes plus complètes sont disponibles en ligne sur CLAPI <ref target="http://clapi.univ-lyon2.fr/V3_Feuilleter.php?num_corpus=42"/>
</note>
</notesStmt>


<sourceDesc>

<recordingStmt>

<recording>
<media dur="PT1H18M45S" mimeType="vidéo" url="file://Clapi_Signal_reunion_conception_a_mosaic__architectur_ed7fb39ece.mov">
<desc>Niveau de qualité bonne</desc>
<desc>Streaming</desc>
<desc>enregistrement anonymisé</desc>
</media>

</recording>
</recordingStmt>

</sourceDesc>

</fileDesc>

<encodingDesc>
<p>Transcription totale</p>
<p>Orthographe adaptee</p>
<p>Format xml</p>

<editorialDecl>

<interpretation>
<p>
La transcription a été vérifiée d'après la convention fournie par le transcripteur et est disponible en ligne <ref target="http://clapi.univ-lyon2.fr/V3_Feuilleter.php?choix_corpus=42"/>
</p>
</interpretation>

<correction>
<p>
Les erreurs qui génaient le traitement automatique de son contenu ont fait l'objet de corrections reportées dans le fichier de la convention, <ref target="http://clapi.univ-lyon2.fr/V3_Feuilleter.php?num_corpus=42">disponible en ligne</ref>
</p>
</correction>

<normalization source="http://icar.univ-lyon2.fr/projets/corinte/bandeau_droit/convention_icor.htm">
<p>
En cas d'incohérence, les modifications ont suivi la norme ICOR, disponible en ligne
</p>
</normalization>

</editorialDecl>
</encodingDesc>

<profileDesc>

<creation>
<date>2002-Nov</date>
</creation>

<settingDesc>
<place type="country" xml:id="FR">
<placeName>FRANCE</placeName>
</place>

<setting>
<activity>réunion de conception en architecture
Enregistrement vidéo d'une réunion de conception entre trois architectes (et une observatrice) à qui la réhabilitation d'un château en centre de séminaires a été confiée.
. 3 + l'observatrice
</activity>
</setting>

</settingDesc>

<langUsage>
<language ident="fr" usage="100">français</language>
</langUsage>
<particDesc>

<listPerson>

<person xml:id="c" sex="1">
<age from="40" to="50"/>
<birth from="1952" to="1962"/>
<occupation>En activité architecte</occupation>
<education>Niveau d'études Supérieur</education>
<langKnowledge>
<langKnown level="first" tag="fr">français</langKnown>
</langKnowledge>
<note>C= Charles</note>
</person>

<person xml:id="m" sex="2">
<age from="30" to="40"/>
<birth from="1962" to="1972"/>
<occupation>En activité architecte d'intérieur</occupation>
<education>Niveau d'études Supérieur</education>
<langKnowledge>
<langKnown tag="fr">français</langKnown>
</langKnowledge>
<note>M = Marie</note>
</person>

<person xml:id="l" sex="1">
<age from="30" to="40"/>
<birth from="1962" to="1972"/>
<occupation>En activité architecte,architecte d'intérieur</occupation>
<education>Niveau d'études Supérieur</education>
<langKnowledge>
<langKnown level="first" tag="fr">français</langKnown>
</langKnowledge>
<note>L = Louis</note>
</person>

<personGrp size="3" sex="mixed"/>

</listPerson>

</particDesc>

<textClass>
<keywords scheme="genre_clapi">
<list>
<item>Interactions de travail</item>
</list>
</keywords>
</textClass>

</profileDesc>

</teiHeader>
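The dur attribute of <media> above, like the dur attribute of <pause> in the transcript example, holds an ISO 8601 duration. The following Python sketch converts such values to seconds. It only covers the hour, minute and second forms that appear on this page, and the function name duration_seconds is hypothetical.

```python
import re

# Subset of ISO 8601 durations seen in the examples, e.g. PT1H18M45S or PT25.5S.
DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+(?:\.\d+)?)S)?$")

def duration_seconds(value):
    """Convert a PT..H..M..S duration to a number of seconds (float)."""
    match = DURATION.match(value)
    if match is None or value == "PT":
        raise ValueError(f"unsupported duration: {value!r}")
    hours, minutes, seconds = match.groups()
    return int(hours or 0) * 3600 + int(minutes or 0) * 60 + float(seconds or 0)

video = duration_seconds("PT1H18M45S")  # <media dur="..."> in the header
pause = duration_seconds("PT25.5S")     # <pause dur="..."> in the transcript
print(video, pause)
```

Durations with day, month or year components would need a fuller parser.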


Transcript example



((France - Repas - Kiwi 2008, Repas entre amies))
(00:00:00)
(25.5) ((ELI marche vers la porte et ouvre la porte))
ELI BONJOUR:/
(2.8) ((les invitées arrivent devant la porte))
BEA salut/
ELI ça va/
BEA ça va et toi/
(1.6) ((BEA et ELI se font la bise))
ELI j'ai ent[endu du BR]UIT
MAR ________[salut/ ___]
MAR oui
BEA ouais
((ELI et MAR se font la bise))
MAR <((rires)) (0.4)> merde on nous entend arriver de loin
ELI hm
(0.3)
BEA ah il fait chaud ici
MAR ah ouais ça fait du bien
ELI bon ben voilà vous aviez pas vu/
BEA [nON]
MAR [nON]
(.)
ELI [voilà ____]
BEA [c'est chou]ette/ (.) c'est chouette ouais\
(0.2)
MAR c'est cool
BEA [.h ouais]
ELI [.H: ____] bon j` vous laisse vous installer les [filles/]



<text xml:lang="fr">


<timeline unit="s" origin="#T0">
<when xml:id="T0" absolute="00:00:00"/>
<when xml:id="T1" absolute="00:01:05:20"/>
<when xml:id="T2" absolute="00:01:06:20"/>
<when xml:id="T3" absolute="00:01:10:70"/>
<when xml:id="T4" absolute="00:01:10:85"/>
<when xml:id="T5" absolute="00:01:13:90"/>
<when xml:id="T6" absolute="00:01:26:70"/>
...
</timeline>

<body>
<anchor synch="#T0"/>

<note>
<desc n="155/1">((France - Repas - Kiwi 2008, Repas entre amies))</desc>
</note>
<pause type="long" dur="PT25.5S" rend="(25.5)" n="155/2"/>
<note>
<desc n="155/3">((ELI marche vers la porte et ouvre la porte))</desc>
</note>
<u who="#ELI">
<w n="155/4">
<shift feature="loud" new="f">bonjour</shift>
<shift feature="tempo" new="rall"/>
<shift feature="pitch" new="asc"/>
</w>
<space style="overlap" rend="_______"/>
<pause type="long" dur="PT2.8S" rend="(2.8)" n="155/6"/>
<note>
<desc n="155/7">((les invitées arrivent devant la porte))</desc>
</note>
</u>

<u who="#BEA">
<w n="155/9">salut
<shift feature="pitch" new="asc"/>
</w>
</u>

<u who="#ELI">
<w n="155/a">ça</w>
<w n="155/b">va
<shift feature="pitch" new="asc"/>
</w>
</u>

<u who="#BEA">
<w n="155/c">ça</w>
<w n="155/d">va</w>
<w n="155/e">et</w>
<w n="155/f">toi
<shift feature="pitch" new="asc"/>
</w>
<space style="overlap" rend="_______"/>
<pause type="long" dur="PT1.6S" rend="(1.6)" n="155/h"/>
<note>
<desc n="155/i">((BEA et ELI se font la bise))</desc>
</note>
</u>

<u who="#ELI">
<w n="155/j">j'</w>
<w n="155/k">ai</w>
<choice n="155/l">
<orig>ent
<anchor xml:id="CH_1" n="155/l"/>endu
</orig>
<reg>entendu</reg>
</choice>
<w n="155/n">du</w>
<choice n="155/o">
<orig>
<shift feature="loud" new="f">br</shift>
<anchor xml:id="CH_2" n="155/o"/>
<shift feature="loud" new="f">uit</shift>
</orig>
<reg>bruit</reg>
</choice>
</u>

<u who="#MAR">
<space style="overlap" rend="________"/>
<anchor synch="#CH_1" n="155/12"/>
<w n="155/13">salut
<shift feature="pitch" new="asc"/>
</w>
<space style="overlap" rend="___"/>
<anchor synch="#CH_2" n="155/15"/>
</u>

<u who="#MAR">
<w n="155/16">oui</w>
</u>

<u who="#BEA">
<w n="155/17">ouais</w>
<space style="overlap" rend="_______"/>
<note>
<desc n="155/19">((ELI et MAR se font la bise))</desc>
</note>
</u>

<u who="#MAR">
<kinesic>
<desc>((rires))</desc>
<pause type="short" dur="PT0.4S" rend="(0.4)" n="155/1c"/>
</kinesic>
<w n="155/1e">merde</w>
<w n="155/1f">on</w>
<w n="155/1g">nous</w>
<w n="155/1h">entend</w>
<w n="155/1i">arriver</w>
<w n="155/1j">de</w>
<w n="155/1k">loin</w>
</u>

<u who="#ELI">
<w n="155/1l">hm</w>
<space style="overlap" rend="_______"/>
<pause type="short" dur="PT0.3S" rend="(0.3)" n="155/1n"/>
</u>

<u who="#BEA">
<w n="155/1o">ah</w>
<w n="155/20">il</w>
<w n="155/21">fait</w>
<w n="155/22">chaud</w>
<w n="155/23">ici</w>
</u>

<u who="#MAR">
<w n="155/24">ah</w>
<w n="155/25">ouais</w>
<w n="155/26">ça</w>
<w n="155/27">fait</w>
<w n="155/28">du</w>
<w n="155/29">bien</w>
</u>

<u who="#ELI">
<w n="155/2a">bon</w>
<w n="155/2b">ben</w>
<w n="155/2c">voilà</w>
<w n="155/2d">vous</w>
<w n="155/2e">aviez</w>
<w n="155/2f">pas</w>
<w n="155/2g">vu
<shift feature="pitch" new="asc"/>
</w>
</u>

<u who="#BEA">
<anchor xml:id="CH_3" n="155/2i"/>
<w n="155/2j">n
<shift feature="loud" new="f">on</shift>
</w>
<anchor xml:id="CH_4" n="155/2l"/>
</u>

<u who="#MAR">
<anchor synch="#CH_3" n="155/2n"/>
<w n="155/2o">n
<shift feature="loud" new="f">on</shift>
</w>
<anchor synch="#CH_4" n="155/31"/>
<space style="overlap" rend="_______"/>
<pause type="short" rend="(.)" n="155/33"/>
</u>

<u who="#ELI">
<anchor synch="#CH_3" n="155/35"/>
<w n="155/36">voilà</w>
<space style="overlap" rend="____"/>
<anchor xml:id="CH_5" n="155/38"/>
</u>

<u who="#BEA">
<anchor xml:id="CH_6" n="155/3a"/>
<w n="155/3b">c'</w>
<w n="155/3c">est</w>
<choice n="155/3d">
<orig>chou
<anchor xml:id="CH_7" n="155/3d"/>ette
<shift feature="pitch" new="asc"/>
</orig>
<reg>chouette</reg>
</choice>
<pause type="short" rend="(.)" n="155/3f"/>
<w n="155/3g">c'</w>
<w n="155/3h">est</w>
<w n="155/3i">chouette</w>
<w n="155/3j">ouais
<shift feature="pitch" new="desc"/>
</w>
<space style="overlap" rend="_______"/>
<pause type="short" dur="PT0.2S" rend="(0.2)" n="155/3l"/>
</u>

<u who="#MAR">
<w n="155/3m">c'</w>
<w n="155/3n">est</w>
<w n="155/3o">cool</w>
</u>

<u who="#BEA">
<anchor xml:id="CH_8" n="155/41"/>
<shift new="breathing-in"/>
<w n="155/43">ouais</w>
<anchor xml:id="CH_9" n="155/45"/>
</u>

<u who="#ELI">
<anchor synch="#CH_8" n="155/47"/>
<shift new="breathing-in"/>
<shift feature="tempo" new="rall"/>
<space style="overlap" rend="____"/>
<anchor synch="#CH_9" n="155/4a"/>
<w n="155/4b">bon</w>
<choice n="155/4c">
<orig>j`</orig>
<reg>je</reg>
</choice>
<w n="155/4d">vous</w>
<w n="155/4e">laisse</w>
<w n="155/4f">vous</w>
<w n="155/4g">installer</w>
<w n="155/4h">les</w>
<anchor xml:id="CH_10" n="155/4j"/>
<w n="155/4k">filles
<shift feature="pitch" new="asc"/>
</w>
<anchor xml:id="CH_11" n="155/4m"/>
</u>
...
</body>
</text>
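In the encoding above, an overlap is marked by <anchor xml:id="..."/> elements in one speaker's turn and by <anchor synch="#..."/> elements pointing to them in the overlapping speaker's turn. The Python sketch below pairs the two. It assumes a fragment without namespace declaration, and the function name overlaps is hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Fragment adapted from the transcript example on this page.
FRAGMENT = """
<body>
  <u who="#BEA">
    <anchor xml:id="CH_3" n="155/2i"/>
    <w n="155/2j">n<shift feature="loud" new="f">on</shift></w>
    <anchor xml:id="CH_4" n="155/2l"/>
  </u>
  <u who="#MAR">
    <anchor synch="#CH_3" n="155/2n"/>
    <w n="155/2o">n<shift feature="loud" new="f">on</shift></w>
    <anchor synch="#CH_4" n="155/31"/>
  </u>
</body>
"""

def overlaps(body):
    """Return (speaker, overlapped_speaker, anchor_id) for every synch anchor."""
    owner = {}
    for u in body.iter("u"):
        for anchor in u.iter("anchor"):
            anchor_id = anchor.get(XML_ID)
            if anchor_id:
                owner[anchor_id] = u.get("who")
    links = []
    for u in body.iter("u"):
        for anchor in u.iter("anchor"):
            # split() tolerates stray spaces inside the synch value
            for target in (anchor.get("synch") or "").split():
                anchor_id = target.lstrip("#")
                if anchor_id in owner:
                    links.append((u.get("who"), owner[anchor_id], anchor_id))
    return links

links = overlaps(ET.fromstring(FRAGMENT))
print(links)
```

Pointers whose target is not defined in the fragment are silently skipped here; a stricter check could report them instead.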



DTD and XML schema (in progress, depending on the choices of the oral corpora community)


<!ELEMENT TEI ( teiHeader, text ) ><!ATTLIST TEI xmlns CDATA #REQUIRED >
<!ELEMENT activity ( #PCDATA ) >
<!ELEMENT age EMPTY ><!ATTLIST age from NMTOKEN #REQUIRED ><!ATTLIST age to NMTOKEN #REQUIRED >
<!ELEMENT anchor EMPTY ><!ATTLIST anchor n CDATA #IMPLIED ><!ATTLIST anchor synch CDATA #IMPLIED ><!ATTLIST anchor xml:id ID #IMPLIED >
<!ELEMENT availability ( licence ) ><!ATTLIST availability status NMTOKEN #REQUIRED >
<!ELEMENT birth EMPTY ><!ATTLIST birth from NMTOKEN #REQUIRED ><!ATTLIST birth to NMTOKEN #REQUIRED >
<!ELEMENT body ( anchor, note, u+ ) >
<!ELEMENT choice ( orig, reg+ ) ><!ATTLIST choice n CDATA #REQUIRED >
<!ELEMENT correction ( p ) >
<!ELEMENT creation ( date ) >
<!ELEMENT date ( #PCDATA ) >
<!ELEMENT desc ( #PCDATA ) ><!ATTLIST desc n CDATA #IMPLIED >
<!ELEMENT editorialDecl ( interpretation, correction, normalization ) >
<!ELEMENT education ( #PCDATA ) >
<!ELEMENT encodingDesc ( p+, editorialDecl ) >
<!ELEMENT fileDesc ( titleStmt, publicationStmt, notesStmt, sourceDesc ) >
<!ELEMENT interpretation ( p ) >
<!ELEMENT item ( #PCDATA ) >
<!ELEMENT keywords ( list ) ><!ATTLIST keywords scheme NMTOKEN #REQUIRED >
<!ELEMENT langKnowledge ( langKnown ) >
<!ELEMENT langKnown ( #PCDATA ) ><!ATTLIST langKnown level NMTOKEN #REQUIRED ><!ATTLIST langKnown tag NMTOKEN #REQUIRED >
<!ELEMENT langUsage ( language ) >
<!ELEMENT language ( #PCDATA ) ><!ATTLIST language ident NMTOKEN #REQUIRED ><!ATTLIST language usage NMTOKEN #REQUIRED >
<!ELEMENT licence ( p+ ) ><!ATTLIST licence target CDATA #REQUIRED >
<!ELEMENT list ( item ) >
<!ELEMENT listPerson ( person+, personGrp ) >
<!ELEMENT media ( desc+ ) ><!ATTLIST media dur NMTOKEN #REQUIRED ><!ATTLIST media mimeType CDATA #REQUIRED ><!ATTLIST media url CDATA #REQUIRED >
<!ELEMENT name ( #PCDATA ) >
<!ELEMENT normalization ( p ) ><!ATTLIST normalization source CDATA #REQUIRED >
<!ELEMENT note ( #PCDATA | desc )* >
<!ELEMENT notesStmt ( note ) >
<!ELEMENT occupation ( #PCDATA ) >
<!ELEMENT orig ( #PCDATA | anchor | shift | unclear )* >
<!ELEMENT p ( #PCDATA ) >
<!ELEMENT particDesc ( listPerson ) >
<!ELEMENT pause EMPTY ><!ATTLIST pause dur NMTOKEN #IMPLIED ><!ATTLIST pause n CDATA #REQUIRED ><!ATTLIST pause rend CDATA #REQUIRED ><!ATTLIST pause type ( long | short ) #REQUIRED >
<!ELEMENT person ( age | birth | education | langKnowledge | note | occupation )* ><!ATTLIST person sex NMTOKEN #REQUIRED ><!ATTLIST person xml:id NMTOKEN #REQUIRED >
<!ELEMENT personGrp EMPTY ><!ATTLIST personGrp sex NMTOKEN #REQUIRED ><!ATTLIST personGrp size NMTOKEN #REQUIRED >
<!ELEMENT principal ( #PCDATA ) >
<!ELEMENT profileDesc ( creation, settingDesc, langUsage, particDesc, textClass ) >
<!ELEMENT pubPlace ( #PCDATA ) >
<!ELEMENT publicationStmt ( publisher, pubPlace, availability ) >
<!ELEMENT publisher ( #PCDATA ) >
<!ELEMENT recording ( media ) >
<!ELEMENT recordingStmt ( recording ) >
<!ELEMENT reg ( #PCDATA ) >
<!ELEMENT resp ( #PCDATA ) >
<!ELEMENT respStmt ( resp, name+ ) >
<!ELEMENT seg ( anchor | choice | pause | seg | shift | unclear | w )* ><!ATTLIST seg rend CDATA #REQUIRED ><!ATTLIST seg type ( non-speech | voice ) #REQUIRED >
<!ELEMENT setting ( activity ) >
<!ELEMENT settingDesc ( setting ) >
<!ELEMENT shift ( #PCDATA ) ><!ATTLIST shift feature ( loud | pitch | tempo ) #IMPLIED ><!ATTLIST shift new ( asc | breathing-in | breathing-out | desc | f | rall ) #REQUIRED >
<!ELEMENT sourceDesc ( recordingStmt ) >
<!ELEMENT space ( #PCDATA ) ><!ATTLIST space style NMTOKEN #REQUIRED ><!ATTLIST space rend NMTOKEN #REQUIRED >
<!ELEMENT teiHeader ( fileDesc, encodingDesc, profileDesc ) ><!ATTLIST teiHeader xml:lang NMTOKEN #REQUIRED >
<!ELEMENT text ( timeline, body ) ><!ATTLIST text xml:lang NMTOKEN #REQUIRED >
<!ELEMENT textClass ( keywords ) >
<!ELEMENT timeline ( when+ ) ><!ATTLIST timeline origin CDATA #REQUIRED ><!ATTLIST timeline unit NMTOKEN #REQUIRED >
<!ELEMENT title ( #PCDATA ) >
<!ELEMENT titleStmt ( title, principal, respStmt+ ) >
<!ELEMENT u ( #PCDATA | space | anchor | choice | note | pause | seg | shift | unclear | vocal | w )* ><!ATTLIST u who CDATA #REQUIRED >
<!ELEMENT unclear ( #PCDATA | space | anchor | choice | pause | seg | shift | unclear | vocal | w )* ><!ATTLIST unclear extent CDATA #IMPLIED >
<!ELEMENT vocal ( desc ) >
<!ELEMENT w ( #PCDATA | anchor | shift | unclear )* ><!ATTLIST w n CDATA #REQUIRED >
<!ELEMENT when EMPTY ><!ATTLIST when absolute NMTOKEN #REQUIRED ><!ATTLIST when xml:id ID #REQUIRED >
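Since the DTD is still evolving, a quick consistency check between the declarations and an exported transcript can be useful. The Python sketch below is not a DTD validation: it only compares element names, using the standard library. The excerpts are copied from this page and the function names are hypothetical. On these excerpts it reports kinesic, which the transcript example uses but the DTD listed above does not declare.

```python
import re
import xml.etree.ElementTree as ET

# A few declarations copied from the DTD above.
DTD_EXCERPT = """
<!ELEMENT u ( #PCDATA | space | anchor | choice | note | pause | seg | shift | unclear | vocal | w )* >
<!ELEMENT w ( #PCDATA | anchor | shift | unclear )* >
<!ELEMENT pause EMPTY >
<!ELEMENT desc ( #PCDATA ) >
"""

# One utterance copied from the transcript example.
SAMPLE = """
<u who="#MAR">
  <kinesic>
    <desc>((rires))</desc>
    <pause type="short" dur="PT0.4S" rend="(0.4)" n="155/1c"/>
  </kinesic>
  <w n="155/1e">merde</w>
</u>
"""

def declared_elements(dtd_text):
    """Collect the element names introduced by <!ELEMENT ...> declarations."""
    return set(re.findall(r"<!ELEMENT\s+([\w:.-]+)", dtd_text))

def undeclared(sample, dtd_text):
    """List element names used in the sample but absent from the DTD text."""
    used = {element.tag for element in ET.fromstring(sample).iter()}
    return sorted(used - declared_elements(dtd_text))

missing = undeclared(SAMPLE, DTD_EXCERPT)
print(missing)
```

Content models and attribute declarations are ignored here; checking them would require a validating parser.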


Using TEI in the Ciel project: Daniel Alcon, Carole Etienne

The Ciel-f project has chosen to archive its corpora in both the Clapi and Moca databanks. The corpus collections will be shared and exchanged regularly; transcripts are in the same Praat format and metadata are in TEI format:
  • using the <teiCorpus> element for the whole collection
  • using the <teiHeader> element for the whole Ciel project
  • using the <teiCorpus> element for a corpus based on an area
  • using the <TEI> element, with its <teiHeader>, for each interactional recording
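
The layout listed above can be sketched in Python as follows. This is only one possible reading: it assumes that the area corpora are nested inside the collection-level <teiCorpus>, and the area names, recording titles, use of the n attribute and the function name build_collection are all hypothetical.

```python
import xml.etree.ElementTree as ET

def build_collection(areas):
    """Build a skeleton from a mapping of area name to recording titles."""
    collection = ET.Element("teiCorpus")        # the whole collection
    ET.SubElement(collection, "teiHeader")      # header for the whole Ciel project
    for area, recordings in areas.items():
        # one corpus per area (nesting is an assumption of this sketch)
        corpus = ET.SubElement(collection, "teiCorpus", {"n": area})
        for title in recordings:
            tei = ET.SubElement(corpus, "TEI")  # one <TEI> per interactional recording
            header = ET.SubElement(tei, "teiHeader")
            file_desc = ET.SubElement(header, "fileDesc")
            title_stmt = ET.SubElement(file_desc, "titleStmt")
            ET.SubElement(title_stmt, "title").text = title
    return collection

root = build_collection({
    "area-1": ["recording-1", "recording-2"],   # placeholder names
    "area-2": ["recording-3"],
})
print(ET.tostring(root, encoding="unicode"))
```

A complete export would also need a <teiHeader> in each area corpus and the remaining mandatory header content, which this skeleton leaves out.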


Using TEI in the Orfeo project

The Orfeo project builds on an initial collection made up of several corpora, drawn from several resources and transcribed in different formats. It will use TEI to manage them and to deliver its own annotations.

Copyright © Laboratoire ICAR. All rights reserved.