The 2017 Emotional Impact of Movies Task

Task Results
Task overview: [Slides] [Presentation video]
Participant results: [Playlist of all presentation videos]

Task Description
Affective video content analysis aims at the automatic recognition of emotions elicited by videos. It has a large number of applications, including emotion-based personalized content delivery, video indexing, summarization, and the protection of children from potentially harmful video content. While major progress has been achieved in computer vision for visual object detection, scene understanding and high-level concept recognition, a natural further step is the modeling and recognition of affective concepts. This has recently received increasing interest from research communities, e.g., computer vision and machine learning, with the overall goal of endowing computers with human-like perception capabilities. This task therefore offers researchers a place to compare their approaches for the prediction of the emotional impact of movies. It is a sequel to last year's task.

The prediction of the emotional impact of movies is here considered as the prediction of the expected emotion. The expected emotion is the emotion that the majority of the audience feels in response to the same content. In other words, the expected emotion is the expected value of experienced (i.e. induced) emotion in a population. While the induced emotion is subjective and context dependent, the expected emotion can be considered objective, as it reflects the more-or-less unanimous response of a general audience to a given stimulus [4].

This year, two new scenarios are proposed as subtasks. In both cases, long movies are considered and the emotional impact has to be predicted for consecutive ten-second segments sliding over the whole movie with a shift of 5 seconds (a small sketch of this segmentation follows the list):

1. Valence/Arousal prediction: participants' systems are expected to predict a score of induced valence (negative-positive) and induced arousal (calm-excited) for each consecutive ten-second segment;

2. Fear prediction: the purpose here is to predict, for each consecutive ten-second segment, whether it is likely to induce fear or not. The targeted use case is the prediction of frightening scenes to help systems protect children from potentially harmful video content.
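As an illustration, the following minimal Python sketch enumerates the segment boundaries implied by this scheme. The handling of a final, shorter window is an assumption; the official segment list is the one released with the annotations.

```python
# Minimal sketch: enumerate the (start, end) times, in seconds, of the
# consecutive ten-second segments with a five-second shift used in both
# subtasks. Window and shift lengths follow the task description; dropping
# any trailing window shorter than ten seconds is an assumption.

def segment_boundaries(duration_s, window_s=10.0, shift_s=5.0):
    """Yield (start, end) pairs covering a movie of `duration_s` seconds."""
    start = 0.0
    while start + window_s <= duration_s:
        yield (start, start + window_s)
        start += shift_s

# Example: a 32-second clip yields (0,10), (5,15), (10,20), (15,25), (20,30).
print(list(segment_boundaries(32)))
```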

Among the possible directions for developing such prediction systems, one can envisage predicting the emotional impact of each ten-second segment independently of the others.
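For instance, a minimal sketch of such an independent, per-segment baseline for the valence/arousal subtask could look as follows. The per-segment feature matrices and the choice of an SVR regressor are illustrative assumptions, not part of the official task resources.

```python
# Minimal sketch of an "independent segments" baseline: each ten-second
# segment is described by a fixed-length feature vector (e.g., pooled
# audio-visual descriptors) and one regressor is trained per emotional
# dimension. How descriptors are pooled per segment is an assumption.
import numpy as np
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def train_and_predict(X_train, y_train, X_test):
    """Fit one SVR per dimension (column 0: valence, column 1: arousal)."""
    preds = []
    for dim in range(y_train.shape[1]):
        model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0))
        model.fit(X_train, y_train[:, dim])
        preds.append(model.predict(X_test))
    return np.column_stack(preds)  # shape: (n_test_segments, 2)
```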

Another possibility is to model the temporal information. Indeed, as the emotion felt while watching a movie scene may depend not only on the current scene but also on previous scenes and previously felt emotions, temporal modeling may be useful. Temporal information can thus be included in the machine learning models, for example using Long Short-Term Memory neural networks [7,8,9,10], or by simply applying temporal smoothing to the predicted values.
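As a simple illustration of the second option, the sketch below applies a centred moving average to the sequence of per-segment predictions of one movie. The window length is an arbitrary assumption to be tuned on the development set.

```python
# Minimal sketch of temporal post-processing: a centred moving average
# over the chronologically ordered per-segment predictions of one movie.
import numpy as np

def smooth_predictions(scores, window=5):
    """Centred moving average over a 1-D array of per-segment scores."""
    kernel = np.ones(window) / window
    # mode="same" keeps one smoothed value per segment; np.convolve pads the
    # edges with zeros, so the first and last few segments are attenuated.
    return np.convolve(scores, kernel, mode="same")
```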

Target group
This task targets (but is not limited to) researchers in the areas of multimedia information retrieval, machine learning, event-based processing and analysis, affective computing and multimedia content analysis.

Data
The dataset used in this task is the LIRIS-ACCEDE dataset (liris-accede.ec-lyon.fr). It contains videos from a set of 160 professionally made and amateur movies, shared under Creative Commons licenses that allow redistribution.

Thirty movies are provided with continuous annotations of fear, valence and arousal for consecutive ten-second segments with a shift of 5 seconds.

Additional data and annotations for both subtasks will be provided later as the test set.

In addition to the data, participants will also be provided with general purpose audio and visual content descriptors.

In solving the task, participants are expected to exploit the provided resources. The use of external resources (e.g., Internet data) will, however, be allowed in specific runs.

Ground truth and evaluation
Standard evaluation metrics will be used to assess the systems’ performance. We will consider Mean Square Error and Pearson correlation coefficient for the first subtask, and Mean Average Precision for the second subtask.
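For reference, these measures can be computed with standard scientific Python libraries. The sketch below assumes per-segment ground truth and predictions stored as NumPy arrays, and illustrates Mean Average Precision as the average precision of ranked fear scores; the exact official aggregation (e.g., averaging per movie) is not specified here.

```python
# Minimal sketch of the evaluation measures named above, assuming
# per-segment ground truth and predictions as NumPy arrays.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import mean_squared_error, average_precision_score

def valence_arousal_metrics(y_true, y_pred):
    """Mean Square Error and Pearson's r for one emotional dimension."""
    mse = mean_squared_error(y_true, y_pred)
    r, _ = pearsonr(y_true, y_pred)
    return mse, r

def fear_metric(labels, scores):
    """Average precision of fear scores against binary fear labels."""
    return average_precision_score(labels, scores)
```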

Recommended reading
[1] Sjöberg, M., Baveye, Y., Wang, H., Quang, V. L., Ionescu, B., Dellandréa, E., Schedl, M., Demarty, C.-H., Chen, L. The MediaEval 2015 Affective Impact of Movies Task. In MediaEval 2015 Workshop, 2015.

[2] Baveye, Y., Dellandréa, E., Chamaret, C., Chen, L. LIRIS-ACCEDE: A Video Database for Affective Content Analysis. IEEE Transactions on Affective Computing, 2015.

[3] Baveye, Y., Dellandréa, E., Chamaret, C., Chen, L. Deep Learning vs. Kernel Methods: Performance for Emotion Prediction in Videos. In 2015 Humaine Association Conference on Affective Computing and Intelligent Interaction (ACII), 2015.

[4] Hanjalic, A. Extracting Moods from Pictures and Sounds: Towards Truly Personalized TV. IEEE Signal Processing Magazine, 23(2):90-100, 2006.

[5] Eggink, J. A Large Scale Experiment for Mood-Based Classification of TV Programmes. In IEEE International Conference on Multimedia and Expo (ICME), 2012.

[6] Benini, S., Canini, L., Leonardi, R. A Connotative Space for Supporting Movie Affective Recommendation. IEEE Transactions on Multimedia, 13(6):1356-1370, 2011.

[7] Nicolaou, M. A., Gunes, H., Pantic, M. A Multi-Layer Hybrid Framework for Dimensional Emotion Classification. In Proceedings of the 19th ACM International Conference on Multimedia, pp. 933-936, 2011.

[8] Ringeval, F., Eyben, F., Kroupi, E., Yuce, A., Thiran, J.-P., Ebrahimi, T., Lalanne, D., Schuller, B. Prediction of Asynchronous Dimensional Emotion Ratings from Audiovisual and Physiological Data. Pattern Recognition Letters, 2014.

[9] Soleymani, M., Asghari-Esfeden, S., Pantic, M., Fu, Y. Continuous Emotion Detection Using EEG Signals and Facial Expressions. In IEEE International Conference on Multimedia and Expo (ICME), pp. 1-6, 2014.

[10] Weninger, F., Eyben, F., Schuller, B. On-line Continuous-Time Music Mood Regression with Deep Recurrent Neural Networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5412-5416, 2014.

Task organizers
Emmanuel Dellandréa, Ecole Centrale de Lyon, France (contact person), emmanuel.dellandrea at ec-lyon.fr
Martijn Huigsloot, NICAM, Netherlands
Liming Chen, Ecole Centrale de Lyon, France
Yoann Baveye, Université de Nantes, France
Mats Sjöberg, University of Helsinki, Finland

Task schedule
1 May: Development data release
1 June: Test data release
17 August: Run submission
21 August: Results returned to participants
28 August: Working notes paper deadline
13-15 Sept: MediaEval Workshop in Dublin

Acknowledgments
Visen
VideoSense