The 2015 Query by Example Search on Speech Task (QUESST)
The Query by Example Search on Speech Task (QUESST) involves searching FOR audio content WITHIN audio content USING an audio content query. The task is of particular interest to speech researchers working on spoken term detection and zero/low-resource speech processing.

The task consists of determining how likely it is that a query appears within an audio file; the formulation is the same as in last year's task. Given an audio file and a spoken query, systems have to produce a score: the higher the score, the more likely it is that the query appears in the audio file. Note that the task involves verifying the presence of the query anywhere in the file, not finding the exact time of each query occurrence.
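One way to produce such a score, and a common back-end in past SWS/QUESST editions, is subsequence dynamic time warping (DTW) over frame-level features such as phone posteriorgrams. The minimal sketch below is only an illustration, not an official baseline: the cosine distance, the DTW transitions and the query-length normalisation are illustrative choices, and feature extraction is assumed to happen elsewhere.

import numpy as np

def cosine_dist(a, b):
    """Frame-level local distance in [0, 2]; 0 means identical direction."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

def subsequence_dtw_score(query, utterance):
    """Return a detection score: higher means the query is more likely present.

    query:     (n, d) array of feature frames for the spoken query
    utterance: (m, d) array of feature frames for the audio file
    """
    n, m = len(query), len(utterance)
    # Local distances between every query frame and every utterance frame
    dist = np.array([[cosine_dist(q, u) for u in utterance] for q in query])
    # Accumulated cost; the match may start at any utterance frame, so the
    # first query frame only pays its local cost (free start point).
    acc = np.full((n, m), np.inf)
    acc[0, :] = dist[0, :]
    for i in range(1, n):
        acc[i, 0] = acc[i - 1, 0] + dist[i, 0]
        for j in range(1, m):
            acc[i, j] = dist[i, j] + min(acc[i - 1, j],       # stay on the same utterance frame
                                         acc[i, j - 1],       # stay on the same query frame
                                         acc[i - 1, j - 1])   # advance both (diagonal step)
    # Free end point: best accumulated cost over all utterance frames,
    # roughly normalised by the query length.
    best = acc[n - 1, :].min() / n
    return -best  # negate so that higher scores mean better matches

A real system would typically add speech activity detection, a better path-length normalisation and per-query score calibration on top of this skeleton.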

The task data comprises a set of audio files in multiple languages (some resource-limited, some recorded in challenging acoustic conditions, and some containing heavily accented speech), which will be provided to researchers. In addition, two sets of spoken queries (one for development and one for testing) will be provided.

No transcriptions, language tags or any other metadata will be provided for the development and test corpora (except for timing information locating the relevant query words inside the longer Type 3 query recordings, as described below). The task therefore requires researchers to build a language-independent audio-within-audio search system.

Three main novelties that set this year’s QUESST task apart from past years are:
1. A larger set of languages.
2. More challenging acoustic and noise conditions.
3. Three evaluation conditions oriented towards different use cases:
  • Read queries, exact matches.
  • Read queries, exact and approximate matches, word-level re-orderings allowed.
  • Spontaneous speech queries, exact and approximate matches, word-level re-orderings allowed.
Target group
The target group of participants for this task includes researchers in the area of multilingual speech technology (also for under-resourced languages), spoken term detection and spoken content search.

Data
The spirit of the task, along with the fundamental characteristics of the data, remains similar to QUESST 2014. We are planning to cover 8 languages (most of them European). As in QUESST 2014, we will provide a single search corpus in which both sets of queries will be searched. Given that the search corpus contains files from different languages, accents and acoustic environments, systems will need to be as generic as possible to succeed in finding queries across these multiple sources. The size of this year's search corpus will not exceed that of last year's (21 hours of audio and 600 dev/eval queries).

Unlike in previous years, we would like to push participants to deal with more challenging acoustic environments (noisy or reverberant speech). This introduces a larger channel mismatch between the queries (naturally spoken into a recording device) and the utterances (taken from various corpora).

Three different types of query search are proposed for this year's evaluation. The term "query search" relates to the use case, reflecting both how the query was created (read versus spontaneous) and the type of matches that systems are expected to find (exact, approximate, with word-level re-orderings, etc.):

Type 1 search – Exact match. Occurrences of single- or multiple-word queries in utterances should exactly match the lexical representation of the query. For example, the query "white horse" should match the utterance "My white horse is beautiful" but should not match "The whiter horse is faster".

Type 2 search – Re-ordering and small lexical variations. Here the search algorithm should cope with:
  • Lexical variations. Occurrences of single- or multiple-word queries might differ slightly (either at the beginning or at the end of the query) from the lexical form of the query. Systems will therefore need to account for small portions of audio at the beginning or the end of the segment that do not match the lexical form of the reference. In all cases, the matching part of any query will exceed 5 phonemes / 250 ms, and the non-matching part will be much smaller than the matching part. An example of this type of search would be "researcher" matching an utterance containing "research" (note that the reverse would also be possible).
  • Word-level re-orderings and small amounts of filler content. Occurrences of multiple-word queries may involve two or more words appearing in a different order than in the spoken query. For example, when searching for the query "white horse", systems should be able to match "horse white". The spoken queries will not contain silent portions between words, but the audio files may contain a small amount of filler content between the matching words. For example, when searching for the query "white horse", systems should be able to match both "My horse is white" and "I have two white and beautiful horses". As the latter example shows, the matching words may also exhibit slight variations with respect to the lexical form of the query. Under no circumstances will these queries have a large amount of filler content between the matching words.
Type 3 search – Conversational queries in context. This type of search is another step towards realistic use-case scenarios. The spoken query contains not only the relevant terms but also useless (filler) items, and the speaking-style mismatch is larger here. The search procedure could be similar to that developed for Type 2 queries, but now the relevant terms form just part of a sentence that may contain filler items, such as silent pauses, filled pauses and irrelevant words. For example, "OK Google, let me find some red [uh] white [pau] horse to ride this weekend" could be one of these complex queries. As it is extremely difficult to distinguish automatically between the query words ("white [pau] horse") and the "fillers" ("OK Google, let me find some red [uh]" and "to ride this weekend"), we will provide timing metadata for the relevant words inside the spoken query. Participants are free to use the rest of the spoken query (e.g., for adaptation).
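The sketch below shows one possible, deliberately simplified, strategy for the approximate matching required by Type 2 and Type 3 searches: cutting the query into parts and searching each part independently, so that the order in which the words appear in the utterance no longer matters. The helper name, the fixed split and the score combination are assumptions for illustration only; participants are free to handle approximate matches in any way they choose.

import numpy as np

def reorder_tolerant_score(query, utterance, score_fn, n_parts=2):
    """Score a multi-word query while tolerating word-level re-ordering.

    query:     (n, d) array of query feature frames
    utterance: (m, d) array of utterance feature frames
    score_fn:  any whole-query detector returning "higher = better",
               e.g. the subsequence_dtw_score sketch shown earlier
    n_parts:   number of chunks the query is cut into (a crude stand-in
               for a word-level segmentation)
    """
    parts = np.array_split(query, n_parts)
    # Each part is searched on its own, so its position in the utterance is
    # irrelevant; averaging the per-part scores combines the evidence.
    return sum(score_fn(part, utterance) for part in parts) / n_parts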

The query set will contain queries of all kinds, forcing participants who want to obtain high scores in the evaluation to account for all search types in their systems. Queries in the development set will be labeled according to search type in order to facilitate system development. All queries will be recorded manually in order to avoid the acoustic-context problems that arise when cutting queries out of a longer sentence. The queries will be recorded at a normal speaking speed and in a clear speaking style. For Type 3 search queries, we will provide recordings of whole sentences including conversational speech; to that end, speakers will be told the relevant terms and the use case, but not the exact sentence. Also, some of the queries may be contaminated with noise or have their channel modified to simulate more challenging conditions.

Like last year, there will be a single submission track covering both zero-resource and low-resource systems. However, participants should mark their systems by category: zero-resource or low-resource. Zero-resource systems are those that do not use any external speech resources or engines (such as speech transcriptions, phone labels, phonetic dictionaries or pre-trained classifiers) to develop their systems. Low-resource systems are those that use external data or sub-systems. Participants will be required to describe in their working notes paper what kind of system they developed (zero-resource or low-resource) and what data (if any) they used to develop and train it. Our main reason for asking for this information is to be able to compare both types of systems properly and to see what effect external data might have on this year's task.

Ground truth and evaluation
Participants will be provided with the ground truth for the development data and with scoring tools, together with a basic calibration routine to achieve reasonably calibrated output. The primary metric this year will be the normalized cross entropy cost (Cnxe). This metric has been used for several years in the speaker recognition community and has interesting properties; experimentally, its results correlate well with the ATWV metric. To compute this entropy-based metric properly, we will require all participants to return a result for each query-utterance pair, attaching to it a log-likelihood ratio score that can take any value in (-inf, inf). Alternatively, a default score can be provided for all trials that are not returned. The Actual Term Weighted Value (ATWV) will be used as a secondary metric.
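For concreteness, the following is a minimal sketch of how Cnxe can be computed from per-trial scores and ground-truth labels. It is not the official scoring tool, and the target prior p_tar used below is only a placeholder; the evaluation will rely on the prior and the calibration routine distributed with the official scoring package.

import math

def cnxe(scores, labels, p_tar=0.01):
    """Normalised cross entropy for a set of trials.

    scores: log-likelihood ratios (natural log) for each query-utterance trial
    labels: True for target trials (query present in the file), False otherwise
    p_tar:  prior probability of a target trial (placeholder value here)
    """
    logit_prior = math.log(p_tar / (1.0 - p_tar))
    xe_tar = xe_non = 0.0
    n_tar = n_non = 0
    for s, is_target in zip(scores, labels):
        # Posterior probability that the query is present, given score and prior
        p_post = 1.0 / (1.0 + math.exp(min(-(s + logit_prior), 700.0)))
        if is_target:
            xe_tar += -math.log2(max(p_post, 1e-300))
            n_tar += 1
        else:
            xe_non += -math.log2(max(1.0 - p_post, 1e-300))
            n_non += 1
    # Empirical cross entropy, weighted by the target prior
    c_xe = p_tar * xe_tar / n_tar + (1.0 - p_tar) * xe_non / n_non
    # Cross entropy of a "know-nothing" system that only uses the prior
    c_prior = -p_tar * math.log2(p_tar) - (1.0 - p_tar) * math.log2(1.0 - p_tar)
    return c_xe / c_prior

A well-calibrated, discriminative system approaches Cnxe = 0, a system whose scores carry no information scores about 1, and badly miscalibrated scores push Cnxe above 1.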

Recommended reading
[1] Anguera, X., Metze, F., Buzo, A., Szőke, I., Rodriguez-Fuentes, L. J. The Spoken Web Search Task. In Proceedings of the MediaEval 2013 Multimedia Benchmark Workshop, CEUR-WS.org, 1043, Barcelona, Spain, 2013.

[2] Anguera, X., Rodriguez-Fuentes, L. J., Szőke, I., Buzo, A., Metze, F. Query-by-example Spoken Term Detection Evaluation on Low-resource Languages. In Proceedings of the 4th International Workshop on Spoken Language Technologies for Under-resourced Languages (SLTU 2014), International Speech Communication Association, St. Petersburg, Russia, 2014, pp. 24-31.

[3] Fiscus, J., Ajot, J., Garofolo, J., Doddington, G. Results of the 2006 Spoken Term Detection Evaluation. In Proceedings of the ACM SIGIR 2007 Workshop on Searching Spontaneous Conversational Speech. Amsterdam, Netherlands, 2007.

[4] Larson et al. (eds.) Proceedings of the MediaEval 2013 Multimedia Benchmark Workshop, CEUR-WS.org, 1043, Barcelona, Spain, 2013.

[5] Larson et al. (eds.) Working Notes Proceedings of the MediaEval 2012 Workshop, CEUR-WS.org, 927, ISSN: 1613-0073. Pisa, Italy, 2012.

[6] Metze, F., Anguera, X., Barnard, E., Davel, M., Gravier, G. Language Independent Search in MediaEval's Spoken Web Search Task. In IEEE Speech and Language Processing Technical Committee (SLTC) Newsletter, 2013.

Task organizers
Igor Szőke, Brno University of Technology, Czech Republic
Xavier Anguera, Telefonica Research, Spain
Luis Javier Rodriguez-Fuentes, University of the Basque Country, Spain
Andi Buzo, University Politehnica of Bucharest, Romania
Florian Metze, Carnegie Mellon University, USA

Task schedule
1 April: Release of the audio dataset, the development query set and development set ground truth
1 May: Test query set release
22 July: Deadline for submission of test query set results (updated date)
29 July: System results are returned to participants (updated date)
28 August: Working notes paper deadline
14-15 September: MediaEval 2015 Workshop, Satellite Event of Interspeech 2015