iTalk2Learn partner, the University of Hildesheim, research perceived task-difficulty recognition by means of features extracted from students’ speech input
An important aspect of the iTalk2Learn platform is not only the integration of speech output (via a speech production engine), but also the platform's ability to recognise and record students' speech as they interact with the system.
Whilst the speech production engine generates synthetic speech output from text, so that the system is able to 'speak' to the student, the speech recognition component recognises the student's answer and internally delivers it as text.
The idea is now to react to the students' speech. This reaction could take the form of feedback to the student via hints, prompts or encouragement (see e.g. [1] or [7]). Alternatively, information extracted from the speech input could be used for adaptive task sequencing (see [8], [9]) within the iTalk2Learn system.
Task sequencing determines the order in which tasks are shown to the student. One part of the iTalk2Learn system is a personalised task sequencer that aims to avoid student frustration and boredom, with the objective of keeping students within the Zone of Proximal Development [10].
In order to supply this task sequencer with information, such as a gauge of how difficult the student perceives a task to be, we aim to apply perceived task-difficulty recognition to features extracted from students' speech input (see [2], [3]).
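As a purely illustrative sketch (the labels, levels and function below are assumptions, not the actual iTalk2Learn sequencer), the perceived difficulty predicted from speech could be one signal the sequencer uses when choosing the next task:

```python
# Illustrative only: use the speech-based difficulty prediction to step the
# task level up or down, aiming to keep the student in the Zone of Proximal
# Development. All names, labels and levels here are assumptions.
def next_task(predicted_difficulty, current_level, task_pool):
    if predicted_difficulty == "over-challenged":
        target = current_level - 1       # step down to avoid frustration
    elif predicted_difficulty == "under-challenged":
        target = current_level + 1       # step up to avoid boredom
    else:
        target = current_level           # difficulty seems appropriate
    # Pick the task whose level is closest to the target level.
    return min(task_pool, key=lambda task: abs(task.level - target))
```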
Speech Features
The features we use belong to two different types of speech features: (a) amplitude features ([2], [3]) and (b) articulation features [4]; a rough code sketch of both types follows after the descriptions below.
a) Amplitude features: Extracted from the amplitudes of the speech input, the amplitude features deliver information about speech and pauses. An example is the maximum and average length of speech phases and pauses: when over-challenged and thinking about the problem at hand, children often exhibit longer pauses of silence. Conversely, they usually take fewer and shorter pauses when in full flow.
b) Articulation features: Articulation features, on the other hand, stem from an intermediate step of the speech recognition process and are built from information about silence, vowels and consonants. Examples of articulation features are the maximum, average and minimum length of vowels and consonants. The idea behind this kind of feature is that, depending on the affective state, the speaker lengthens or shortens these sounds.
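To make these two feature types more concrete, here is a minimal Python sketch of one plausible way to compute them. The frame size, energy threshold, vowel set and segment format are assumptions chosen for illustration; the actual extraction used in the project is described in [2], [3] and [4].

```python
import numpy as np

# (a) Amplitude features: segment the signal into speech and pause phases by
# thresholding a frame-wise energy envelope, then summarise the phase lengths.
# The 25 ms frame size and the threshold are assumptions, not project settings.
def amplitude_features(signal, sample_rate, threshold=0.02):
    frame_len = int(0.025 * sample_rate)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    is_speech = np.sqrt((frames ** 2).mean(axis=1)) > threshold

    # Run-length encode the speech/pause decision, in seconds.
    speech, pauses = [], []
    run_value, run_len = bool(is_speech[0]), 0
    for value in is_speech:
        if bool(value) == run_value:
            run_len += 1
        else:
            (speech if run_value else pauses).append(run_len * frame_len / sample_rate)
            run_value, run_len = bool(value), 1
    (speech if run_value else pauses).append(run_len * frame_len / sample_rate)

    return {"max_speech": max(speech, default=0.0),
            "avg_speech": float(np.mean(speech)) if speech else 0.0,
            "max_pause": max(pauses, default=0.0),
            "avg_pause": float(np.mean(pauses)) if pauses else 0.0}

# (b) Articulation features: given a phone-level segmentation from the
# recogniser's intermediate output (labels and timestamps are assumptions),
# summarise the lengths of vowels and consonants.
VOWELS = {"a", "e", "i", "o", "u"}   # simplified vowel set (assumption)

def articulation_features(segments):
    """segments: iterable of (label, start_sec, end_sec); 'sil' marks silence."""
    vowel_lens, consonant_lens = [], []
    for label, start, end in segments:
        if label == "sil":
            continue
        (vowel_lens if label in VOWELS else consonant_lens).append(end - start)

    def stats(lengths, prefix):
        if not lengths:
            return {f"{prefix}_{key}": 0.0 for key in ("max", "avg", "min")}
        return {f"{prefix}_max": max(lengths),
                f"{prefix}_avg": sum(lengths) / len(lengths),
                f"{prefix}_min": min(lengths)}

    return {**stats(vowel_lens, "vowel"), **stats(consonant_lens, "consonant")}
```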
Statistical analyses in [2] and [3] showed that the proposed features are suitable for perceived task-difficulty recognition in adaptive intelligent tutoring systems.
Perceived Task-Difficulty Recognition
These speech features serve as input for a state-of-the-art classification model, the support vector machine (SVM), which recognises the perceived task-difficulty. To run experiments with an SVM applied to the described features, we conducted a study in which we collected real speech data from students and labelled it with the corresponding perceived task-difficulty.
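As a minimal sketch of this setup (using scikit-learn purely for illustration; it is not necessarily the toolkit used in the project), an SVM can be trained on one feature vector per utterance together with the collected difficulty labels:

```python
# Minimal sketch: train and roughly evaluate an SVM on per-utterance feature
# vectors (amplitude + articulation features) with perceived-difficulty labels.
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_difficulty_classifier(X, y):
    """X: one row of speech features per utterance; y: perceived-difficulty
    labels from the study (the label names are assumptions)."""
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
    scores = cross_val_score(clf, X, y, cv=5)   # rough generalisation estimate
    clf.fit(X, y)
    return clf, scores.mean()
```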
The results of these experiments are reported in [4] and indeed prove that perceived task-difficulty can be recognised from the features fed into an SVM. However, the results were not yet satisfactory for employing the affect recognition in a real system, so we are working on improving the affect recognition performance.
The idea is to adapt the hybrid neural network plait (HNNP) approach for improving phoneme recognition, which we proposed and investigated in [5] and [6]. The first promising steps in this direction are shown in [4], where we demonstrated that the affect recognition performance can be improved significantly for binary classification problems. The next steps for improvement will aim to show this for multi-class classification as well.
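For context only, and explicitly not the HNNP/plait approach of [4], [5] and [6]: a generic way to move from a binary to a multi-class affect classifier is a one-vs-rest scheme, sketched here with scikit-learn as an assumed toolkit.

```python
# Generic illustration only: a one-vs-rest scheme as a simple multi-class
# baseline. This is NOT the HNNP/plait approach of [4], [5] and [6].
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

def multiclass_baseline(X, y):
    # y may now hold several perceived-difficulty levels instead of two.
    return OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)
```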
References:
[1] Grawemeyer, B., Mavrikis, M., Gutierrez-Santos, S. and Hansen, A. 2014. Interventions during student multimodal learning activities: which, and why? In Extended Proceedings of the 7th International Conference on Educational Data Mining (EDM 2014).
[2] Janning, R., Schatten, C. and Schmidt-Thieme, L. 2014. Multimodal Affect Recognition for Adaptive Intelligent Tutoring Systems. In Extended Proceedings of the 7th International Conference on Educational Data Mining (EDM 2014).
[3] Janning, R., Schatten, C. and Schmidt-Thieme, L. 2014. Feature Analysis for Affect Recognition Supporting Task Sequencing in Adaptive Intelligent Tutoring Systems. In Proceedings of the European Conference on Technology Enhanced Learning (EC-TEL 2014).
[4] Janning, R., Schatten, C., Schmidt-Thieme, L., Backfried, G. and Pfannerer, N. 2014. An SVM Plait for Improving Affect Recognition in Intelligent Tutoring Systems. In Proceedings of the IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2014).
[5] Janning, R., Schatten, C. and Schmidt-Thieme, L. 2014. Automatic Subclasses Estimation for a Better Classification with HNNP. In Proceedings of the 21st International Symposium on Methodologies for Intelligent Systems (ISMIS 2014).
[6] Janning, R., Schatten, C. and Schmidt-Thieme, L. 2014. Local Feature Extractors Accelerating HNNP for Phoneme Recognition. In Proceedings of the 37th German Conference on Artificial Intelligence (KI 2014).
[7] Mavrikis, M., Grawemeyer, B., Hansen, A. and Gutierrez-Santos, S. 2014. Exploring the Potential of Speech Recognition to Support Problem Solving and Reflection – Wizards Go to School in the Elementary Maths Classroom. In Proceedings of the European Conference on Technology Enhanced Learning (EC-TEL 2014).
[8] Schatten, C. and Schmidt-Thieme, L. 2014. Adaptive Content Sequencing without Domain Information. In Proceedings of the Conference on Computer Supported Education (CSEDU 2014).
[9] Schatten, C., Janning, R. and Schmidt-Thieme, L. 2014. Integration and Evaluation of a Matrix Factorization Sequencer in Large Commercial ITS. In Proceedings of the 29th AAAI Conference on Artificial Intelligence (AAAI15).
[10] Vygotsky, L. 1978. Mind in society: The development of higher psychological processes. Harvard University Press.