This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Mental Health, is properly cited. The complete bibliographic information, a link to the original publication on https://mental.jmir.org/, as well as this copyright and license information must be included.
Emotion dysregulation is a key dimension of adult psychological functioning. There is interest in developing a computer-based, multimodal, and automatic measure of it.
We aimed to train a deep multimodal fusion model to estimate emotion dysregulation in adults based on their responses to the Multimodal Developmental Profile, a computer-based psychometric test, using only a small training sample and without transfer learning.
Two hundred and forty-eight participants from 3 different countries took the Multimodal Developmental Profile test, which exposed them to 14 picture and music stimuli and asked them to express their feelings about them, while the software extracted the following features from the video and audio signals: facial expressions, linguistic and paralinguistic characteristics of speech, head movements, gaze direction, and heart rate variability derivatives. Participants also responded to the brief version of the Difficulties in Emotional Regulation Scale. We separated and averaged the feature signals that corresponded to the responses to each stimulus, building a structured data set. We transformed each person’s per-stimulus structured data into a multimodal codex, a grayscale image encoding the person’s multimodal features.
We found an average Pearson correlation (r) of 0.55 (P<.001) across the 8 folds between our estimation of emotion dysregulation and the DERS-16 scores, with an average concordance correlation coefficient of 0.54 and a mean absolute error of 0.16.
In psychometrics, our results represent excellent evidence of convergent validity, suggesting that the Multimodal Developmental Profile could be used in conjunction with this methodology to provide a valid measure of emotion dysregulation in adults. Future studies should replicate our findings using a hold-out test sample. Our methodology could be implemented more generally to train deep neural networks where only small training samples are available.
Emotion regulation is currently conceptualized as involving the following 5 distinct abilities: (1) having awareness and an understanding of one’s emotions, (2) being able to accept them, (3) being able to control impulsive behaviors related to them, (4) having the capacity to behave according to one’s desired goals in the midst of negative emotions, and (5) having the capacity to implement emotion regulation strategies as required to meet individual goals and situational demands. The absence of these abilities indicates the presence of emotion dysregulation.
Emotion dysregulation is typically assessed through a self-report questionnaire, the Difficulties in Emotional Regulation Scale (DERS) [
Attempts to measure psychological dimensions “in the wild” (ie, a naturalistic approach) using machine learning and unimodal sensing approaches, such as measuring heart rate throughout the day with a smartwatch or measuring the patterns of a user’s social media interactions, have not yet produced results good enough to change the way the mental health industry practices psychometrics. It still relies almost entirely on self-assessment questionnaires or professional interviews [
To overcome these limitations, in 2017, we introduced the Biometric Attachment Test (BAT) in the Journal of Medical Internet Research [
Developing deep multimodal fusion models to combine the MDP-obtained features in order to predict actual psychological dimensions, such as emotion dysregulation, is a challenge due in part to the typically small samples available in psychology research [
In this work, we propose a series of methods that we hypothesize will allow us to train a scoring model for the MDP to estimate emotion dysregulation in adults. We hypothesize that such an estimation of emotion dysregulation will have psychometric convergence with the “gold standard” measure, the DERS. Our approach is also particularly relevant for the machine learning field: we hypothesize that our methodology will make it possible to train deep neural networks for multimodal fusion with a very small training sample.
The organization of the rest of this paper is as follows. First, we will introduce the multimodal codex, which is the heart of our approach, and the techniques required to build it and fill its missing values. Second, we will present our convolutional neural network (CNN)-transformer network architecture, including our new layer, the Feature Map Average Pooling (FMAP) layer. Third, we will discuss our training methodology. Fourth, we will present our results, including the quality of our estimation of emotion dysregulation in adults. Lastly, we will discuss these results.
This subsample consisted of 69 participants (39 females and 30 males) and was recruited online using Amazon Mechanical Turk and Prolific services between January and July 2019. The mean age for this subsample was 35.05 years (SD 12.5 years, minimum 18 years, maximum 68 years). We did not intentionally recruit any clinical participants for this subsample, but we cannot guarantee the absence of clinical patients within it.
This subsample consisted of 146 participants (88 females and 58 males) recruited between the months of January and July 2019, and was formed from multiple sources in different regions of France. Of the 146 participants, 10 clinical patients were recruited at University Hospital Center Saint-Étienne and 22 at the Ville-Evrard Center of Psychotherapy and Psychotrauma in Saint-Denis, 33 volunteers were enrolled in Paris and 19 in Lyon, 3 college students were enrolled at Paris Descartes University and 11 at University Bourgogne Franche-Comté (Dijon), and 43 clinical private practice patients were enrolled in Paris and 5 in Lyon. The mean age for this subsample was 39.25 years (SD 13.6 years, minimum 18 years, maximum 72 years). Clinical patients were included to examine whether the MDP was capable of rightly assessing more extreme emotion dysregulation cases.
This subsample consisted of 33 Tunisian participants (21 females and 12 males) recruited in July 2019 in the city of Tunis. The mean age was 37.6 years (SD 10.5 years, minimum 17 years, maximum 55 years). While there was no intention to recruit clinical participants for this subsample, we cannot guarantee the absence of clinical patients within it.
The original DERS [
Explored in depth in an article in the Journal of Medical Internet Research [
Importantly, the first stimulus is fully neutral and allows us to acquire a baseline for all our measurements, which is later subtracted from them. In theory, this allows us to work with signals that react solely to the stimuli. Whether the participants came already upset to the test situation or whether they were already fatigued, the test will measure this during the first stimulus and then subtract it from the following signals; thus, it will only take into account whether a stimulus made them more upset or more fatigued, or perhaps whether a stimulus managed to soothe or relax them. The short duration of the test assures us that any abrupt changes in the signals from which the baseline was subtracted will indeed be caused by the test situation itself and not due to time simply passing by. Furthermore, the order of the stimuli themselves is such that stress and soothing themes are alternated, allowing us to get more contrast in our measurements of what each stimulus is doing to the person.
A simple way of conceptualizing the MDP is as a series of controlled experiments, in which each stimulus constitutes an experimental condition and the participant’s multimodal response to it is the measurement.
As the participant perceives the stimuli and responds aloud to them, the software automatically collects video and audio data and automatically extracts features from them. Specifically, the MDP uses a remote photoplethysmography (RPPG) method to extract heart rate variability (HRV) features that allow measuring the sympathetic and parasympathetic branches of the autonomic nervous system; detects facial action units, head movements, and gaze direction with respect to the stimuli being presented; and analyzes speech, extracting paralinguistic features as well as conducting a linguistic analysis [
An important aspect of the MDP is that it does not rely on a naturalistic approach. Rather, it is based on a tightly controlled experiment carefully conceived and validated in order to evoke specific reactions.
In addition, the MDP has a stimuli pool from which the software automatically chooses each stimulus, so that no single participant experiences the exact same stimuli set.
Finally, contrary to most projects, wherein a machine learning system is trained to predict a category with relation to mental health, such as depressed vs not depressed, the MDP is trained to estimate a continuous score for a psychological dimension, making this a regression rather than a classification problem.
To prevent any form of data leaking, every step described below was conducted separately within each fold of the cross-validation procedure.
A few participants took the test twice at intervals of a few weeks to help with a future study on test-retest reliability, and we included both of their sessions in this study, treating them as if they were different participants. To prevent data leakage, however, when one of them was randomly put into the validation set, their other session was automatically placed there as well. This explains why the validation set size changes from fold to fold (with a range of 29 to 35).
All data preparation was performed in MATLAB 2021b (MathWorks). The MDP outputs a set of CSV files containing the structured data for each sense modality (facial expressions, linguistic analysis, etc). In most cases, this comes in the form of a table containing the timestamps as rows and the features as columns.
We averaged each feature per stimulus (ie, an average of values for facial action unit 10 from the moment stimulus 3 was shown till the moment it disappeared). We discounted the first stimulus’s results, the neutral one (see previous section), from all others so that we dealt solely with the variance produced by the test itself. Features were scaled to the −1 to 1 range, using either previous knowledge about the actual signal’s minimum and maximum values, or the empirical minimum and maximum levels found within the signal in all our training samples for a given fold.
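As a minimal sketch of this preparation step (assuming a pandas table with a `stimulus` column and one column per feature; the column names and table layout are assumptions, since the MDP's actual CSV schema is not reproduced here), the per-stimulus averaging, baseline subtraction, and scaling could look like:

```python
import numpy as np
import pandas as pd

def per_stimulus_features(df, n_stimuli=14):
    """Average each feature over the response window of each stimulus,
    then subtract the neutral first stimulus (the baseline) from all
    others, keeping only the variance evoked by the test itself."""
    means = df.groupby("stimulus").mean().reindex(range(1, n_stimuli + 1))
    baseline = means.iloc[0]
    return (means.iloc[1:] - baseline).to_numpy()

def scale_features(x, lo, hi):
    """Linearly map values from [lo, hi] to [-1, 1], clipping overflow.
    lo/hi come from prior knowledge of the signal or from the training
    sample's empirical minimum and maximum for the current fold."""
    x = np.clip(x, lo, hi)
    return 2 * (x - lo) / (hi - lo) - 1
```

Note that the scaling bounds must be learned from the training sample of each fold only, mirroring the leak-prevention principle used throughout.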
DERS-16 scores were also linearly scaled, to the 0-1 range, to allow for quicker training times and easier interpretation of results. An important step in our data preparation procedure was to uniformize our training sample with regard to the ground truth (ie, DERS-16 scores) so that all levels of the ground truth could be equally represented in terms of the number of samples being fed to our learning algorithm. Our code did this by binning the DERS-16 score and up-sampling our data set until all bins (ie, all score levels) had the same number of cases representing them. This, of course, presented the problem of potentially overfitting these repeated cases. In the section about test-time data augmentation, we present how we dealt with this problem.
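The binning and up-sampling step above can be sketched as follows (the number of bins is an assumption; the paper does not state the value used):

```python
import numpy as np

def uniformize_by_bins(X, y, n_bins=10, rng=None):
    """Up-sample (X, y) until every ground-truth bin holds as many cases
    as the most populated bin. y is assumed already scaled to [0, 1]."""
    rng = np.random.default_rng(rng)
    bins = np.minimum((y * n_bins).astype(int), n_bins - 1)
    target = max(np.sum(bins == b) for b in np.unique(bins))
    idx = []
    for b in np.unique(bins):
        members = np.flatnonzero(bins == b)
        idx.extend(members)                                    # keep originals
        idx.extend(rng.choice(members, target - len(members))) # repeat to fill
    idx = np.array(idx)
    return X[idx], y[idx]
```

The repeated cases this produces are what the noise-based data augmentation described later is designed to de-duplicate on the fly.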
From a clinician’s perspective, a typical assessment interview can be thought of as having 2 main components as follows: what is happening at any given moment (a spatial component, ie, the different things that happen simultaneously) and how events succeed one another over the course of the interview (a temporal component).
Based on years of clinical experience, we argue that the psychologist or psychiatrist ends the interview with a newly acquired succession of “mental images” of the patient, each combining the different sense modalities observed at a given moment.
The multimodal codex is our attempt to imitate this clinical phenomenon in a machine learning multimodal fusion context.
The multimodal codex is a grayscale computer image that encodes within it a set of meaningful multimodal features representing human responses to a controlled experiment. A multimodal codex is created for each stimulus response, so that each participant is represented by a sequence of codexes.
The multimodal codex is also a practical way to encode structured tabular data in a format that can more readily be taken advantage of by CNNs. CNNs are of practical interest because (1) they ditch the need for feature engineering as they create their own features and (2) they can be trained with relatively few learnable parameters, helping to prevent overfitting.
Converting tabular data sets to images in order to use CNNs on them has been exploited by several researchers recently. Alvi et al showed that tabular data on neonatal infections could be successfully exploited using a CNN by implementing a simple transformation where features (ie, columns) are assigned, one by one, to an X-Y coordinate, with their values becoming the pixel’s intensity [
Buturović et al designed a tabular-data-to-graphical mapping in which each feature vector is treated as a kernel, which is then applied to an arbitrary base image [
The approach closest to ours is that of DeepInsight [
The approach we used for creating the multimodal codexes is similar, yet it differs from DeepInsight’s approach in that we implement a more modern and reliable dimensionality reduction method, the Uniform Manifold Approximation and Projection (UMAP) [
Our proposed method to missing data imputation can be described by the following pseudocode:
For each fold, we learn the missing data imputation models from the learning set and use them to fill the missing values of both the training and validation sets.
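Since the original pseudocode figure is not reproduced here, the following is only a sketch of the per-fold fit-on-learning-set, apply-to-both-sets pattern, using simple per-feature means as a stand-in imputation model (the actual model may be more elaborate):

```python
import numpy as np

def fit_imputer(train_X):
    """Learn one fill value per feature from the learning set only.
    Per-feature means are a stand-in for the paper's imputation model."""
    return np.nanmean(train_X, axis=0)

def apply_imputer(X, fill_values):
    """Fill missing values using statistics learned on the learning set,
    so the validation set never influences the imputation model."""
    X = X.copy()
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = fill_values[cols]
    return X
```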
Our proposed process to create a multimodal codex sequence is summarized in the following pseudocode:
The resultant images look like those in
From test to result. Top left: a woman taking the Multimodal Developmental Profile test. Top center: the audio wave and video frames, with the latter showing the analysis for head pose, eye gaze, and facial expressions. Top right: tabular data of some of the features extracted from the audio and video. Bottom: the 2nd, 3rd, 4th, and 14th multimodal codexes for a participant in the sample. CNN: convolutional neural network; w/: with.
This process naturally builds images with distinct clusters of features for each stimulus depending on the specific relationship between the typical responses to the said stimulus in the sample and the ground truth variable. Like a clinician’s intuition described earlier, our approach could end up clustering together a series of language markers, facial expressions, and HRV features, which might not initially be obvious, in the context of what is evoked by a specific stimulus and the typical response pattern in the sample.
Practically, this takes the guessing out of feature engineering, while also providing the CNNs with smaller clusters to “look at,” which in turn puts less stringent requirements on the network’s effective receptive field.
An important limitation of UMAP and all other visualization techniques of the sort is that the proximity of points in the projection they generate does not follow a predictable pattern. While points that are closer together typically are more related than those projected far away, this is not guaranteed for all cases, and the relationship between distance and importance is certainly not linear.
On occasion, the mapping for two or more features falls in the exact same X and Y coordinates. While this could be easily remediated by enlarging the codex resolution, we decided to leave this as a feature. When UMAP considers 2 features to be so close, they might as well mean the exact same thing. In that case, we average the value of the features to find the value of the pixel in question.
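Assuming the 2D coordinates of each feature have already been learned with UMAP on the training set, the rasterization step, including the averaging of colliding features, can be sketched as follows (the codex resolution here is an assumption):

```python
import numpy as np

def rasterize_codex(feature_values, coords, size=64):
    """Build one grayscale multimodal codex from a participant's
    per-stimulus feature vector. `coords` holds the 2D UMAP embedding of
    the features (one x, y pair per feature), learned on the training
    set only. Features mapped to the same pixel are averaged."""
    mins, maxs = coords.min(axis=0), coords.max(axis=0)
    px = np.clip(np.round((coords - mins) / (maxs - mins + 1e-12)
                          * (size - 1)), 0, size - 1).astype(int)
    img = np.zeros((size, size))
    counts = np.zeros((size, size))
    for value, (x, y) in zip(feature_values, px):
        img[y, x] += value
        counts[y, x] += 1
    np.divide(img, counts, out=img, where=counts > 0)  # average collisions
    return img
```

Pixels not assigned to any feature stay at zero, matching the background described in the text.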
For each fold, we learned the mapping from the learning set and created with it the multimodal codexes for the learning and validation sets.
As described in the previous section, the problem of assessing a psychological construct during an interview is both a spatial problem (ie, measuring different things that happen simultaneously) and a temporal problem (understanding the succession of events and their relationship).
For dealing with the first part of the problem, we implemented 13 CNNs, with 1 per stimulus (minus the baseline stimulus). The reason not to rely on just 1 network for all of the stimuli is that we do not assume the features that are important to predict emotion dysregulation are the same during each stimulus response. On the contrary, a clinician will look for specific patterns in the patient’s behaviors depending on the cue the therapist has just sent during the interview. Patterns can actually reverse. A cluster of features indicative of emotion dysregulation given 1 stimulus can actually be indicative of good regulation during another.
We confronted the following challenges when designing the architecture for our CNNs: (1) How to create a deep enough network that will be able to extract complex concepts, while keeping the number of learnables (ie, weights) very lean to avoid overfitting (ie, memorizing) our small training set? (2) How to avoid downsampling/blurriness of the codex when going deeper into the network, a classic byproduct of pooling layers, so that deeper layers can still take advantage of details while simultaneously uncovering more global patterns? To overcome these challenges, we implemented cutting-edge best practices as well as some innovations.
The network begins with a multimodal codex augmentation layer that we will explore later. The rest of the network consists of 8 convolutional blocks, each containing a depth-wise separable convolution layer [
Importantly, our proposed architecture dispenses with pooling layers entirely. They are typically used as a means to increase the effective receptive field when moving deeper into the network. They were replaced with a carefully calculated set of kernel dilation factors, which increase from the 1st block to the 5th, then decrease for blocks 6 and 7, and then increase once again in block 8 before the network ends. This decrease and increase between blocks 6 and 8 is what Hamaguchi et al have called a local feature extraction (LFE) module [
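A hedged Keras sketch of one such block and the dilation schedule follows. Filter counts, kernel sizes, and the exact dilation factors are assumptions; GELU stands in for SGELU and LayerNormalization stands in for group normalization. The sketch only illustrates the pooling-free, dilation-based design described above, not the exact 339-weight network:

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Channel-wise averaging (the FMAP idea) as a parameter-free Lambda layer.
fmap = lambda: layers.Lambda(lambda t: tf.reduce_mean(t, axis=-1, keepdims=True))

def conv_block(x, dilation):
    """Depth-wise separable convolution with a dilated kernel (dilation
    replaces pooling as the receptive-field mechanism), normalization,
    a GELU-family activation, and channel averaging."""
    x = layers.SeparableConv2D(4, 3, padding="same", dilation_rate=dilation)(x)
    x = layers.LayerNormalization()(x)   # stand-in for group normalization
    x = layers.Activation("gelu")(x)     # stand-in for SGELU
    return fmap()(x)

inputs = keras.Input((64, 64, 1))
x = inputs
# Dilation rises to block 5, dips for the local feature extraction (LFE)
# module (blocks 6 and 7), then rises again in block 8; the exact factors
# here are assumptions illustrating that schedule.
for d in [1, 2, 4, 8, 16, 4, 2, 8]:
    x = conv_block(x, d)
x = layers.GlobalAveragePooling2D()(x)   # GAP head
outputs = layers.Dense(1)(x)
model = keras.Model(inputs, outputs)
```

Because each block ends by averaging its channels, the activation depth stays at 1 throughout, keeping the weight count very lean.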
Our convolutional architecture (339 weights). LFE: local feature extraction; SGELU: Symmetrical Gaussian Error Linear Units.
In the following paragraphs, we provide a brief description of each of the components of the network as well as the rationale behind their implementation in the context of deep learning from small data sets.
Depth-wise separable convolutional layers were first introduced in a previous study by Chollet [
SGELU activation was recently introduced in a previous study by Yu et al [
Mean shifting [
Group normalization was introduced by the Facebook AI Research (FAIR) team in 2019 [
The networks end with a GAP [
The full CNN model is shown in
After each of the 13 CNNs produces an estimation of emotion dysregulation, those estimations become the sequential data fed to the next and final architecture, the transformer, which deals with the temporal aspect of our problem.
Endowed with the task of decoding the sequential meaning of the participant’s responses to the succession of MDP’s controlled experiments, our transformer network is of course inspired by the seminal work of Vaswani and the team at Google Brain [
At their core is the multiheaded attention mechanism, which allows evaluating, in parallel and for each data point in a sequence, which other data points in the said sequence are relevant to the assessment. The attention heads in our encoder block are of size 13, to cover the whole MDP sequence, as opposed to the size of 64 used in the study by Vaswani et al, and we used 4 heads as opposed to 8. Our encoder block also includes residual connections, layer normalization, and dropout. The projection layers are implemented using a 1D convolution layer.
The encoder was followed by a 1D GAP layer to reduce the output tensor of the encoder to a vector of features for each data point in the current batch. Right after this is the multilayer perceptron regression head, consisting of a stack of fully connected layers with ReLU activation, followed by a final 1 neuron–sized fully connected layer with linear activation that produces the actual estimation of emotion dysregulation. We tried implementing positional encodings, as per the original paper, as well as look-ahead masking; however, both methods yielded worse results for our use case, so we discarded them.
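A sketch of this temporal model in Keras follows. The head count (4) and attention size (13) come from the text; the hidden widths and dropout rate are assumptions, and the feed-forward projection uses 1D convolutions as described:

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def build_transformer(seq_len=13):
    """One encoder block over the 13 per-stimulus CNN estimates,
    followed by 1D global average pooling and an MLP regression head
    ending in a single linear neuron."""
    inputs = keras.Input((seq_len, 1))
    # Multi-head self-attention with dropout, residual connection,
    # and layer normalization.
    attn = layers.MultiHeadAttention(num_heads=4, key_dim=seq_len)(inputs, inputs)
    x = layers.LayerNormalization()(inputs + layers.Dropout(0.1)(attn))
    # Feed-forward projection implemented with 1D convolutions.
    ff = layers.Conv1D(4, 1, activation="relu")(x)
    ff = layers.Conv1D(1, 1)(ff)
    x = layers.LayerNormalization()(x + ff)
    x = layers.GlobalAveragePooling1D()(x)
    x = layers.Dense(8, activation="relu")(x)
    # Linear single-neuron output: the emotion dysregulation estimate.
    return keras.Model(inputs, layers.Dense(1)(x))
```

As in the text, no positional encoding or look-ahead masking is applied.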
In the original paper, Vaswani et al implemented label smoothing. Given that ours is a regression problem, we switched this for test-time augmentation (TTA), which will be described later.
The loss function for our transformer architecture was the concordance correlation coefficient (CCC) [
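The CCC is the Pearson correlation penalized by any shift in mean or scale between the two series; a minimal NumPy version (from which a training loss can be formed as 1 − CCC) is:

```python
import numpy as np

def ccc(y_true, y_pred):
    """Lin's concordance correlation coefficient:
    2*cov(t, p) / (var(t) + var(p) + (mean(t) - mean(p))^2)."""
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = ((y_true - mu_t) * (y_pred - mu_p)).mean()
    return 2 * cov / (var_t + var_p + (mu_t - mu_p) ** 2)
```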
Our transformer architecture (4223 weights).
This new kind of layer computes the average of the activations or feature maps produced by a 2D convolution layer as follows:

FMAP(A)(i, j) = (1/C) × Σ A(i, j, c), summing over c = 1 to C

where A(i, j, c) is the activation at spatial position (i, j) of feature map (ie, channel) c, and C is the number of feature maps produced by the convolution layer.
It was inspired by the GAP layer, which revolutionized CNNs by drastically reducing the number of weights without sacrificing performance, while increasing regularization. However, the FMAP layer averages tensors among feature maps (ie, channels), as opposed to across the 2 dimensions of each feature map like GAP does.
If included at the end of every convolutional block, FMAP assures that the depth (ie, number of channels) of the activations flowing forward in the network remains flat (ie, 1 channel) at all depths of the network, instead of exponentially increasing, as is typically the case.
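A minimal Keras implementation of such a layer, as described above, could look like this:

```python
import tensorflow as tf
from tensorflow import keras

class FMAP(keras.layers.Layer):
    """Feature Map Average Pooling: averages activations across the
    channel axis, so the block always emits a single feature map,
    whatever the number of preceding filters. Unlike a 1x1 convolution
    used for the same merging purpose, it has no learnable weights."""
    def call(self, inputs):
        return tf.reduce_mean(inputs, axis=-1, keepdims=True)
```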
It is important to realize that a sort of weighted average is still implicitly learned: the convolution kernels that precede an FMAP layer can scale each feature map’s contribution before the averaging takes place.
The FMAP can also be thought of as a nonlearnable version of the point-wise convolution (ie, convolutions with kernel size 1×1, typically used to reduce the complexity of a model by merging its feature maps). By using a fixed function (the average) instead of a learned one, we obtain a decrease in learnable weights in our model: a point-wise convolution needs 1 weight per input feature map plus 1 bias, whereas FMAP needs none. We also prevent the network from overfitting the training set during this computation.
In terms of the decrease in the number of weights for a network, in our own CNNs, the reduction is 71% (from 1172 weights to 339). This remarkable reduction in weights has several effects, including reducing computational demands for both training and prediction, and, as we mentioned earlier, reducing the number of degrees of freedom in the model, thus reducing the potential to overfit the training set.
We believe this layer forces an ensembling effect onto the network’s block in which it is inserted. It is a common observation that ensembles of trained neural networks generalize better than just 1 trained neural network [
In our quest against overfitting, we implemented data augmentation. In its classic form, it allows for the on-the-fly creation of new training examples based on random transformations of the original ones.
With regard to our CNNs, we created a layer designed to introduce uniform random noise within the multimodal codexes. During training, it introduces up to 10% noise for each pixel representing a feature in the multimodal codex (while it leaves all other pixels, the ones not representing any feature, alone). This meant that, for each epoch, the network saw an up to 10% different version of each image.
This procedure was especially important given that our uniformization of the ground truth variable by upsampling meant that a nonnegligible number of repeated images (multimodal codexes) were being fed to the CNNs. This data augmentation scheme allowed them to actually be slightly different images at every epoch, mitigating overfitting of the repeated cases.
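The augmentation step can be sketched as follows. Whether the 10% is absolute or relative to the pixel's value is not stated in the text, so additive noise on the [-1, 1]-scaled values is an assumption here:

```python
import numpy as np

def augment_codex(codex, feature_mask, max_noise=0.10, rng=None):
    """Add up to 10% uniform random noise to every feature pixel of a
    multimodal codex, leaving background (non-feature) pixels untouched.
    `feature_mask` is 1 where a pixel encodes a feature, 0 elsewhere."""
    rng = np.random.default_rng(rng)
    noise = rng.uniform(-max_noise, max_noise, size=codex.shape)
    return codex + noise * feature_mask
```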
Another more modern form of data augmentation is TTA [
The way we implement TTA is innovative. We use it between our spatial (CNNs) and temporal (transformer) networks. When our 13 CNNs predict their final emotion dysregulation estimates, we do so using TTA, and moreover, we repeat the process 10 times. As a result, we provide the transformer with both better predictions and more diverse data to train on. We believe this procedure can greatly increase the generalization of the network to unseen data.
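A sketch of this bridging step follows. The noise scheme mirrors the training augmentation and is an assumption, as is averaging over the augmented copies within one TTA pass:

```python
import numpy as np

def tta_predict(predict_fn, codex, feature_mask, n_augment=10, rng=None):
    """Test-time augmentation for one CNN: average predictions over
    n randomly perturbed copies of the codex. Per the text, the whole
    procedure is then repeated 10 times, giving the transformer several
    slightly different 13-step sequences per participant."""
    rng = np.random.default_rng(rng)
    preds = []
    for _ in range(n_augment):
        noise = rng.uniform(-0.10, 0.10, size=codex.shape) * feature_mask
        preds.append(predict_fn(codex + noise))
    return float(np.mean(preds))
```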
We used the vanilla Adam optimizer for both our CNNs and the transformer network, with default settings. We did not implement any learning rate scheduler.
We trained our CNNs for 500 epochs each. We trained our transformer network for 100 epochs. At each epoch, the models were saved. By the end of training, our code automatically selected the best model, which was the one with the highest Pearson correlation for our CNNs and that with the highest CCC for our transformer, between predictions and the ground truth on the validation set.
As we described earlier, all the aforementioned steps were implemented within each fold of a cross-validation procedure. Eight folds were utilized overall.
Pearson correlation coefficient was calculated using SciPy, version 1.7.1 (Community Library Project). Mean absolute error and the CCC were assessed using Tensorflow, version 2.6.0 (Google Brain; code included in the associated Google Colab, see section below). Means and standard deviations were calculated using NumPy, version 1.19.5 (Community Project).
Convergent validity is the extent to which a measure produces results that are similar to other validated measures measuring the same construct [
We decided to port a large portion of our work from MATLAB to Tensorflow/Keras (created by François Chollet) and to prepare a Jupyter Notebook within Google Colab so that every reader can replicate our findings. The notebook can be accessed online [
The results are presented in
Scatter plot. Prediction (ie, estimation) vs Difficulties in Emotion Regulation Scale, brief version (DERS-16) for each fold. Pearson
Eight folds’ validation sets combined (N=248). Pearson
Data per fold for our system’s estimated emotion dysregulation versus the findings with the Difficulties in Emotion Regulation Scale, brief version (DERS-16; ground truth).
| Fold | Number | Pearson r | P value | CCCa | MAEb |
| --- | --- | --- | --- | --- | --- |
| 1 | 35 | 0.51 | .002 | 0.51 | 0.20 |
| 2 | 31 | 0.45 | .01 | 0.45 | 0.18 |
| 3 | 30 | 0.44 | .01 | 0.44 | 0.15 |
| 4 | 31 | 0.46 | .01 | 0.43 | 0.18 |
| 5 | 31 | 0.54 | .002 | 0.52 | 0.14 |
| 6 | 31 | 0.72 | <.001 | 0.72 | 0.12 |
| 7 | 30 | 0.61 | <.001 | 0.60 | 0.17 |
| 8 | 29 | 0.64 | <.001 | 0.64 | 0.17 |
| Mean valuec | N/Ad | 0.55 | <.001 | 0.54 | 0.16 |
| SD valuee | N/A | 0.10 | .01 | 0.10 | 0.02 |
aCCC: concordance correlation coefficient.
bMAE: mean absolute error.
cThe mean across folds for each metric.
dN/A: not applicable.
eThe standard deviation across folds for each metric.
Can computers detect emotion dysregulation in adults, by looking at their behavior and physiology during a set of controlled experiments? Can they generate “mental images” containing different sense modalities, like clinicians do? Can they do so in a sample that spans different cultures and languages? Can one train a deep multimodal fusion neural network using only a couple of thousand parameters? These are some of the questions we set out to answer in this work. This study evaluated the convergent validity of MDP’s emotion dysregulation estimation with regard to DERS-16, a brief version of the “gold standard” measure for emotion dysregulation. We interpret our results as excellent evidence for convergent validity between MDP’s emotion dysregulation estimation and the DERS-16 in our sample, suggesting that scores obtained using the MDP are valid measures of emotion dysregulation in adults.
It is important to reflect on the diversity of our sample. It spanned 3 continents and 2 languages, with a broad age range, and included individuals with psychopathology to represent the higher end of the emotion dysregulation spectrum. With that in mind, we believe it is impressive that emotion dysregulation estimations were so correlated with their DERS-16 counterparts for all folds, showing similar results. We think this shows a preliminary form of cross-cultural validity for the approach, adding to the evidence we found in our prior work [
We think the multimodal codex approach captures quite well the mental processes that occur in the mind of a clinician while conducting an assessment interview. We attribute the success of our approach in large part to the good framing of the problem as spatiotemporal, and believe this representation of all sense modalities as a combined image is closer to the way we humans do multimodal fusion.
To our knowledge, the MDP is the first test of its kind. It is a validated exposure-based psychometric test that implements deep multimodal fusion to analyze responses within a set of controlled experiments in order to measure psychological constructs.
Its advantages over classical questionnaires and interview-based tests are manifold. They are as follows: the MDP takes less than 10 minutes to complete; it can be taken at home with a computer or tablet and is resilient to unpredictable variability in the test conditions; it is scored automatically in minutes; it is objective and replicable in its observations; it is holistic, taking into account language, voluntary and involuntary behavior, and physiology; it can be used in different cultures with only minimal translation efforts; and it can evolve over time, learning new scoring models based on different validated psychometric measures.
In terms of deep learning, we cannot stress enough how this work defies current trends and tenets within the field. In the current international race toward the trillion-parameter model, how can anyone dare to present a deep network capable of estimating very abstract psychological phenomena with only 8630 weights? In a field powered by Google, Apple, Facebook, Amazon, and other American and Asian tech giants data mining free online services for millions of data points, how can anyone dare to present a model that can be well trained with only 274 examples? We think this work should be seen as pertaining to a competing and perhaps literally opposite trend. Humans do not need that many examples to learn something, even something complex. Maybe machines do not need them either, provided intelligent constraints are put in place (a sort of training wheels for children) to prevent the system from falling into tendencies (memorization, ie, overfitting) that would prevent real learning. We think that at the heart of this competing view of machine learning, there is chaos in the form of randomness. Random noise has been added to our samples as data augmentation. There are random paths toward minima spearheaded by an increase in stochasticity due to small batches during training. There is randomness during prediction by implementing TTA. There is randomness in the random initialization of each kernel within each convolutional block, and in the way the FMAP layers force them to ensemble. There is randomness in the automatic choice of the stimulus from the stimuli pool so that no single person experiences the exact same stimuli set. There is randomness in the random errors that occur in pretty much every one of the feature extraction processes implemented by the MDP software. Randomness might seem to be just noise, but what if, in reality, it is what allows us to separate signal from noise?
One of the obvious limitations of our work is the size of our sample. Although we purposely set out to show that one can learn very complex and deep multimodal models that are accurate and reliable with just a few hundred cases, this does not in any way disprove the common sense assumption that, with more data, the model would improve even more. In addition to sheer sample size, we believe it would be interesting, and quite unexplored in psychometrics, to use census-based samples (data sets whose distribution in terms of sex, age, income, etc, matches the census of a given country). Online recruiting agencies are beginning to propose this as a service, and we hope to work with such a sample in the near future.
Another weak point of our study is the lack of a hold-out test set. We did not implement one primarily because of a lack of enough data. Indeed, it is known that validation sets can be overfitted, in a process some have called “model hacking” [
Furthermore, we question whether a hold-out sample, proportional in size to our overall sample, would have been a better unbiased estimator (how can a sample with a size of around 30 be taken as representative of the whole population?). In the future, we will look to the works of Martin and Corneanu [
In this work, we successfully trained a deep neural network consisting of spatial (convolutional) and sequential (transformer) submodels, to estimate emotion dysregulation in adults. Remarkably, we were able to do so with only a small sample of 248 participants, without using transfer learning. The metrics of performance we used show not only that the network seems to generalize well, but also that its correlation with the “gold standard” DERS-16 questionnaire is such that our system is a promising alternative. Perhaps most importantly, it was confirmed that deep learning does not need to mean millions of parameters or even millions of training examples. Carefully designed experiments, diverse small data, and careful design choices that increase self-regularization might be sufficient.
Biometric Attachment Test
concordance correlation coefficient
convolutional neural network
Difficulties in Emotional Regulation Scale
Feature Map Average Pooling
Global Average Pooling
heart rate variability
local feature extraction
Multimodal Developmental Profile
natural language processing
remote photoplethysmography
Symmetrical Gaussian Error Linear Units
test-time augmentation
Uniform Manifold Approximation and Projection
We want to thank Gwenaëlle Persiaux for her recruiting efforts in Lyon, France; Nahed Boukadida for her recruiting efforts in Tunisia; Susana Tereno, Carole Marchand, Eva Hanras, and Clara Falala-Séchet for their recruiting efforts in Paris, France; and Khalid Kalalou and Dominique Januel for their recruiting efforts at Etablissement Public De Santé Ville-Evrard in Saint-Denis, France. Funding for this publication (fees) was provided by FP and the University of Bourgogne Franche-Comté.
FP handled project funding, training scheme, network design, multimodal codex development, coding, and recruitment in Paris and the United States. YB handled remote photoplethysmography algorithm development, recruitment in Dijon, and academic review. FY handled recruitment in Dijon and academic review.
None declared.