Implementing two dialogue act taggers using Conditional Random Field.
- Speaker change indicator.
- First utterance indicator.
- A feature for each tokens in the utterance.
- A feature for each part-of-speech in the utterance.
- Length of utterance.
- Bigrams of tokens.
- various features extracted from tokens string: e.g. IS_UPPER to indicate whether the token is in upper case.
- A feature for each token and a feature for each POS, like in the baseline.
- Non-words sounds in the utterance, e.g. <Laughter>
- Non-words sounds in the previous utterance, e.g. <Laughter>
- Speaker change indicator: as in the baseline.
- A feature for each part-of-speech in the previous utterance.
- A Bias feature.
The Switchboard (SWBD), which is a collection of phone dialogues of volunteers over a predetermine topics, is used. The tags in the corpus are the SWBD-DAMSL dialogue acts. See this for the annotation manual.
A python interface of the CRFsuite is used, see this for installation and documentation.
hw2_corpus_tool.py
is written by Christopher Wienberg, a previous TA for CSCI544 at USC.