The task has concluded and the data has been released. Please see MediaEval Datasets.
The 2013 Similar Segments in Social Speech Task
Further information on this task is available at:
The task involves searching in social multimedia, specifically conversations between students in one academic department. This is the first task on retrieval over spoken dialog to use pure similarity judgments, rather than approximations based on topic, term, or dialog-act matches.
The scenario is this: A new member has joined an organization or social group that has a small archive of conversations among its members. He starts to listen, looking for any information that can help him better understand, participate in, enjoy, find friends in, and succeed in this group. As he listens to the archive (perhaps at random, perhaps based on some social tags, perhaps based on an initial keyword search) he finds something of interest, and wants to find more like it, across the entire archive. He marks what he found as a region of interest and requests more like it. The system comes back with a set of "jump-in" points, places in the archive to which he could jump and start listening/watching with the expectation of finding something similar.
In the task, the input to the systems will be a 1-10 second audio/video region of interest, and the desired output is an ordered list of regions similar to it, matching as closely as possible the judgments of human searchers.
This task is of interest to researchers in the areas of speech technology, dialog, and topic modeling.
Participants will receive a 2-hour collection of dyadic English-language conversations, each 5-10 minutes in length, by members of a semi-cohesive group. These will include video, two-microphone stereo audio, speech recognition transcripts, and a small set of prosodic features computed every 10 milliseconds. Metadata will include the native languages of the speakers. There will be several dozen similarity sets, each containing 5-40 regions, each about 3-20 seconds long, which were judged by one of the user population to all be similar in some way.
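As an illustration of how the per-frame prosodic features might be used, the sketch below pools the 10-millisecond frames of a region into a single fixed-length descriptor by averaging. The function name and the array layout are assumptions for illustration, not part of the task release:

```python
import numpy as np

def region_vector(frame_features, start_s, end_s, frame_step_s=0.010):
    """Average per-frame prosodic features over one region.

    frame_features: (n_frames, n_dims) array, one row per 10-ms frame
        (a hypothetical layout; the actual release format may differ).
    start_s, end_s: region boundaries in seconds.
    """
    i0 = int(round(start_s / frame_step_s))
    i1 = int(round(end_s / frame_step_s))
    return frame_features[i0:i1].mean(axis=0)

# toy example: 10 frames (0.1 s) of 2-dimensional features
frames = np.arange(20, dtype=float).reshape(10, 2)
vec = region_vector(frames, 0.0, 0.05)  # mean of the first 5 frames
```

Such a pooled descriptor gives each region, regardless of its 3-20 second length, a vector of the same dimensionality, which makes regions directly comparable.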
The test set will be a smaller set of conversations and a set of regions of interest, or seeds. For each seed, a system will return a list of jump-in points for its inferred similar-region set.
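One minimal baseline for this setup would rank candidate jump-in points by the cosine similarity between a fixed-length descriptor of the seed region and descriptors of the candidates. The function, the conversation identifiers, and the feature vectors below are illustrative assumptions, not part of the task definition:

```python
import numpy as np

def rank_jump_in_points(seed_features, candidates):
    """Rank candidate jump-in points by cosine similarity to a seed region.

    seed_features: 1-D vector summarizing the seed region (hypothetical).
    candidates: list of (conversation_id, time_seconds, feature_vector).
    Returns (conversation_id, time_seconds, score) sorted most-similar first.
    """
    def cosine(a, b):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(np.dot(a, b) / denom) if denom else 0.0

    scored = [(cosine(seed_features, f), conv, t) for conv, t, f in candidates]
    scored.sort(reverse=True)
    return [(conv, t, s) for s, conv, t in scored]

# toy example with made-up 3-dimensional feature vectors
seed = np.array([1.0, 0.0, 0.5])
cands = [("conv1", 12.0, np.array([0.9, 0.1, 0.4])),
         ("conv2", 80.0, np.array([-1.0, 0.2, 0.0]))]
ranking = rank_jump_in_points(seed, cands)
```

The returned list is already in the format the task asks for: an ordered list of places in the archive, best match first.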
Ground truth and evaluation
The ground truth is created manually and provided by the task organizers.
Nigel G. Ward and Steven D. Werner. The Nature of Search Needs, section 3 of Thirty-Two Sample Audio Search Tasks. UTEP Technical Report UTEP-CS-12-39, 2012; and the references therein.
Martha Larson and Gareth J. F. Jones. Spoken content retrieval: A survey of techniques and technologies. Foundations and Trends in Information Retrieval, vol. 5, no. 4-5, pp. 235-422, 2012.
- Nigel Ward, University of Texas at El Paso, USA;
- David G. Novick, University of Texas at El Paso, USA;
- Tatsuya Kawahara, Kyoto University, Japan;
- Elizabeth Shriberg, Microsoft, USA;
- Louis-Philippe Morency, University of Southern California, USA;
- Catharine Oertel, KTH, Sweden.
- 1 April: Familiarization pack release
- 25 May: Development data release (updated release date)
- 15 July: Test set release (updated release date)
- 5 September: Run submission deadline