Deep Learning With Anaphora Resolution for the Detection of Tweeters With Depression: Algorithm Development and Validation Study

Background: Mental health problems are widely recognized as a major public health challenge worldwide. This concern highlights the need to develop effective tools for detecting mental health disorders in the population. Social networks are a promising source of data wherein patients publish rich personal information that can be mined to extract valuable psychological cues; however, these data come with their own set of challenges, such as the need to disambiguate between statements about oneself and third parties. Traditionally, natural language processing techniques for social media have looked at text classifiers and user classification models separately, hence presenting a challenge for researchers who want to combine text sentiment and user sentiment analysis. Objective: The objective of this study is to develop a predictive model that can detect users with depression from Twitter posts and instantly identify textual content associated with mental health topics. The model can also address the problem of anaphoric resolution and highlight anaphoric interpretations. Methods: We retrieved the data set from Twitter by using a regular expression or stream of real-time tweets comprising 3682 users, of which 1983 self-declared their depression and 1699 declared no depression. Two multiple instance learning models were developed—one with and one without an anaphoric resolution encoder—to identify users with depression and highlight posts related to the mental health of the author. Several previously published models were applied to our data set, and their performance was compared with that of our models. Results: The maximum accuracy, F1 score, and area under the curve of our anaphoric resolution model were 92%, 92%, and 90%, respectively. The model outperformed alternative predictive models, which ranged from classical machine learning models to deep learning models. Conclusions: Our model with anaphoric resolution shows promising results when compared with other predictive models and provides valuable insights into textual content that is relevant to the mental health of the tweeter.


Background
Mental health problems are widely recognized as major public health challenges worldwide.According to the World Health Organization, 264 million people were affected by depression globally in 2020 [1].Mental illness, in general, is one of the leading causes of the global burden of this disease.It was estimated that in England, 105 billion British pounds (US $145 billion) were spent on mental health services and treatments or lost in productivity at work in 2018 [2], with the global costs expected to rise to US $6 trillion by 2030 [3].A significant contributor to this cost is that people living with mental health problems sometimes receive inaccurate assessments [1].This highlights the need for effective mental health services and a novel approach for diagnosing mental health disorders.
User-generated content on social media, reviews, blogs, and message board platforms offers an opportunity for researchers to explore and classify the huge amount of content in different domains, such as marketing [4], politics [5], and health [6][7][8], thereby providing a rapid method to understand user-created text and expressed emotion using text classification algorithms.Social networking (eg, Facebook and LinkedIn) and microblogging platforms (eg, Twitter and Tumblr) provide internet users with a safe space to post their feelings, thoughts, and activities.With some users publicly expressing their mental health statuses on their profiles, it becomes possible to train classification engines to detect internet users with mental health problems [9,10].Using Twitter data, in particular, studies have examined users with depression [11][12][13][14], postpartum depression [15], anxiety, obsessive compulsive disorder, and posttraumatic stress disorder [11,16].In addition, Facebook data were also used to detect users with depression [17,18] and postpartum depression [19].
Generally, text classifiers and user classification models tend to be developed separately.This presents a challenge for researchers who want to simultaneously understand both text sentiment analysis and user sentiment analysis.In this paper, we present a predictive model that can detect users with depression and identify their tweets as those related to health.An ideal technique for developing this type of model is multiple instance learning (MIL) [20], where the model can learn from a set of labeled bags or users instead of a set of individual instances or user-generated messages.
Anaphora resolution is an established natural language processing (NLP) problem and an emerging field in the analysis of social media content that helps with determining which previously mentioned person is the subject of a subsequent statement and understanding references to someone in the content on social media.This is particularly relevant to social media, as posts may frequently refer to individuals other than the tweeter [21].

Objectives
To the best of our knowledge, no study has focused on detecting users with depression on social networks with an anaphoric interpretation of the content.In this study, we aim to address the problem of anaphora resolution in user-generated content and present a predictive model that can reliably identify statements, thoughts, and attitudes relating to the tweeter, rather than a third party.
The objective of this study is to investigate whether user-generated content from Twitter can be used to detect users with depression.This raises three research questions: 1. Can MIL be used to develop a predictive model for detecting users with depression from their tweets? 2. Can sentiments of unlabeled tweets be predicted from the labels of users with depression? 3. Can anaphora resolution be combined with MIL to eliminate false positives?
This paper introduces MIL models with and without anaphora resolution to detect users with depression from their generated textual content on Twitter and predictive models that can highlight posts relevant to mental health.The results show that our algorithm outperforms the major recently published algorithms in the field.We further illustrate the differences in the tweets related to mental health from users with self-declared depression and users with no depression.

This Study
This study focuses on text analysis, predictive models for detecting social network users with mental disorders, and MIL.
The most relevant studies published to date are reviewed below.
Text analysis is an NLP approach for identifying information within text.This technique has been developed to understand the textual content automatically and computationally.During the early stages of sentiment and emotion analysis, researchers manually annotated the text [22].With the possibility of identifying emotions in text, the content has been computationally analyzed using a keyword or corpus-based approach and a learning-based approach [23,24].
The learning-based approach uses a predictive model to determine the relationship between an input and output word.Word embedding is a common learning-based technique that transforms the words of a document into dimensional vectors for word representation and determines word similarity.Global Vectors for Word Representation (GloVe) is a word-embedding approach that computes and aggregates word co-occurrence for representing the closest linguistic or semantic similarity between co-occurrent words as vectors [25].GloVe was trained on several textual data sets, such as Wikipedia and common crawl (a copy of web content), and supported 50D, 100D, 200D, and 300D vectors.
Anaphora resolution is another text analysis problem related to determining which person is mentioned within textual content.There are three reference resolution algorithms [26].The rule-based entity resolution extracts syntactic rules and semantic knowledge from the text.The statistical and machine learning-based entity resolution is a method to understand the coreference of a reference to an early entity.Deep learning for entity resolution reduces handcrafted feature requirements and represents words as vectors conveying semantic units.Aktaş et al [21] investigated anaphora resolution for conversations on Twitter using a corpus and manual annotation.Twitter conversations revealed the cues of anaphora resolution to identify a mentioned person and provide context.
De Choudhury and Gamon [13] pioneered NLP and machine learning approaches for developing predictive models to detect users with mental disorders from social network data using a mental health screening questionnaire and linguistic analysis tools to extract emotional words and web-based behaviors from users' posts.However, the screening and data collection process was time consuming, and Coppersmith et al [11] introduced an automatic data gathering method using keywords to find the target users and programmatically retrieve the posts.
Following these initial studies, a number of novel methods have emerged for predicting mental disorders in social network users.
The early work focused on classical supervised machine learning techniques and traditional text analysis approaches.
The psychometric analysis of textual content was used to compute the percentage of emotional, functional, and social concern words [13,15].Linguistic inquiry and word count (LIWC) was used to compute the percentage of words relevant to categories from each tweet.The extracted percentages were then used to train a predictive model based on a support vector machine with a radial basis function [13].
Language models have been applied to analyze social media texts to address spelling errors, shortenings, and emoticons [11].
The language model was developed from an n-gram, which learns from the sequences of text and computes the probability of unseen text relevant to a category of the trained model.This model scored the probabilities of users with depression based on a higher probability of the positive class language model trained from the tweets of users with depression or the negative class language model developed from the tweets of control users [11].
A predictive model based on topic models was developed from the social network profiles of clinically diagnosed patients [17].
The topic model used latent Dirichlet allocation to extract topics from the text.All tweets from each user were used to compute 200 topics, which were then used to develop a logistic regression model for classifying the users with depression [17].
Building on the popularity of neural networks, novel models have been developed using word embedding [27,28] and deep neural network models [28].The Usr2Vec model transformed text into an embedding matrix, where words commonly used together were represented in closely dimensional spaces for classifying users.The embeddings were learned from users' tweets and then summarized as user representations.The embedding matrices were used to train a predictive model using a multinomial logistic regression technique [27].
The deep learning model uses word embeddings to represent the sequential words of users' tweets.A predictive model was trained using a 1D convolutional neural network (CNN) and a global max pooling layer [28].
In addition to the textual content of the posts, a number of writing features can be analyzed: post or blog lengths, time gap between consecutive posts, and day of the week and time of the day of postings.Further network features of interest include likes, numbers of followers or following, characteristics of comments on other users' posts compared with original posts, and numbers of shares or retweets.Image analysis was used to characterize user posts [29,30].
To develop a predictive model, this study focused on MIL.It is a supervised learning technique first proposed by Keeler et al [31,32].Although classical supervised learning requires an instance and a single label to learn during the training process, MIL can learn from a bag of instances X=x 1 , x 2 ,…, x N .Each instance x n can be independent and has its own individual label, y n , where y n ∈ {0, 1} for n=1,…, N, and it is assumed that each y n is unknown during the training process.On the basis of these assumptions, an MIL classifier can predict a label Y for a given bag X as follows: On the basis of these assumptions, MIL can provide an extreme result Y=1 in the case of having a predicted positive-instance label y n =1 in a given input X.The relaxation of the MIL assumption can be computed using aggregated probabilistic distributions of instances, where Y=P(x n ) for n=1,…, N.
The purpose of MIL is to facilitate the development of a predictive model for detecting social media users with depression and instantly label each of the posts associated with either mental health or other topics.Normally, data sets from social networking are labeled at the user level but not at the post level.This makes it difficult to find a change in patterns in the message topics posted on social networks.
MIL models have been widely applied to image classification [32], object detection [33], image annotation [34], medical image and video analysis [35,36], sentence selection [37], and document classification [38].In document sentiment analysis, Angelidis and Lapata [20] proposed the MIL network (MILNET) to classify web-based review documents and instantly identify the sentiment polarity of each segment of given documents.MILNET comprises segment encoding, segment classification, and document classification via an attention mechanism.Segment encoding transformed sentences in a document into segments via word-embedding matrices and a CNN.Each segment representation was classified using a softmax classifier.An attention mechanism based on a bidirectional gated recurrent unit (GRU) was used to weight the important segments to make a final document prediction as the weighted sum of the segment distributions.MILNET performed well in predicting the sentiment of a document and identifying the sentiment of the text segments but was not as successful in identifying a person mentioned in the document.
In this study, we adopt the MIL approach to develop two models, namely multiple instance learning for social network (MIL-SocNet) and multiple instance learning with an anaphoric resolution for social network (MILA-SocNet), to classify users with depression and highlight published posts associated with the mental health topic of a tweeter.Both models use novel document segment encoding, a tweet encoder, and user

XSL • FO
RenderX representation rather than a document vector.The latter model also includes the anaphora resolution, which further improves the performance.

Data Set
The data set was retrieved from Twitter, which provides an application programming interface (API) to search public tweets using regular expressions or stream real-time tweets.This study collected only tweets and users set as public.All collected tweets and users were anonymized.This study was approved by the King's College Research Ethics Committee (reference number LRS-16/17-4705).
We selected a group of users with depression using the method proposed by Coppersmith et al [11].Specifically, a regular expression was used to search tweets that contained the statement "I was diagnosed with depression" between January and May 2019.This resulted in 4892 tweets from 4545 unique users, who were then manually screened to ensure that the tweets did not refer to jokes, quotes, or someone else's depression symptoms.After removing these messages, all tweets in the profiles of the users who posted the tweets were downloaded.After verification, 2132 unique users were included in this data set.
A control group was randomly selected from a list of 2036 users who posted tweets between June 1 and June 7, 2019.Users from the group with depression were removed from the list of the control group.
The limits imposed by the Twitter API allowed us to only download the 3200 most recent tweets of all verified users from the depressed and control groups.In total, 5 million tweets were collected from the 2132 users with depression and 4.2 million tweets from the 2036 users with no declared depression.

Preprocessing
Before developing our MIL model, several transformations were performed on the data set.First, the user ID in each tweet was replaced by a generic user.Similarly, any numbers mentioned in tweets were replaced by the number and any specific URLs by url.The # character in each hashtag was replaced by the string hashtag (eg, #depression became hashtag depression).Finally, users with fewer than 100 tweets or less than 80% of tweets in English were removed from the data set, resulting in 3682 users, 1983 with declared depression and 1699 with no declared depression, as depicted in on the left-hand side of Figure 1.In addition, other dimensions of the data set were explored, as shown in Figure 1. Figure 2 illustrates the distribution of the number of tweets between the depressed and control groups.Slight differences were present between the control and depressed groups.
All tweets in our final data set were embedded from pretrained GloVe word vectors.GloVe is an unsupervised machine learning approach and an NLP technique that represents a word as a set of word vectors.GloVe computes and aggregates word co-occurrences to create a vector representation of the closest linguistic or semantic similarity between co-occurrent words [22].As explained earlier, GloVe was trained on several textual data sets, for example, Wikipedia and common crawl (a copy of web content), and supported 50D, 100D, 200D, and 300D vectors.However, our study used pretrained word vectors trained on 2 billion tweets and 100D vectors to transform our tweets into word embedding.

Overview
This section describes the structure of our predictive model to classify a Twitter user with depression.This section will explain how an MIL model with supervised neural networks classifies users and provides a changing pattern of generated text associated with mental health or other topics.
Our proposed MIL-SocNet architecture comprises a tweet encoder, word attention on a tweet, tweet classification, a user encoder, tweet attention, and user classification (Figure 3).The differences between MIL-SocNet and the basic MILNET architecture are the tweet encoder and word attention, respectively.Our model uses a GRU, whereas MILNET uses a CNN and does not have an attentional mechanism.
Furthermore, the MIL-SocNet model was extended with an anaphoric resolution to create the MILA-SocNet model.We present this model to improve performance by adding an anaphora resolution encoder to ensure that the algorithm focuses on posts related to the author (Figure 4).

Tweet Encoder
The first layer of our proposed model transforms each tweet into a machine-readable form.First, tweets were transformed into word-embedding matrices.Each user publishes j=1, 2,…, n tweets, where n is the number of tweets used to train the model.Each tweet contains k=1, 2,…, i words, where i is the number of words in each tweet and varies from post to post.W jk represents the kth word in the jth tweet.Every w jk is then embedded through an embedding matrix W e to be received a word vector x jk .This layer embeds all words w jk of jth post to the word vector: The abovementioned equation operates times.After embedding all words, a bidirectional GRU is used to encode the vector: The bidirectional GRU presents a hidden representation of h jk , which is concatenated from and .The word hidden vector h jk is then sent to an attention mechanism to select the important words.

Word Attention on a Tweet
Not every word equally represents tweet meanings.An attention mechanism is used to select words that best capture the relevant XSL • FO RenderX meaning of a tweet.The attention layer comprises a tanh function to produce an attention vector u jk of the kth word in the jth tweet, where W w and b w are weights and bias, respectively.
The importance of words or attention weights a jk is calculated via the normalized similarity of u jk with the context vector of the word level u w , which is learned and updated during the training step.
Finally, the tweet vector t j is computed using the weighted sum of word importance with the hidden representation of h jk generated from the bidirectional GRU.

Tweet Classification
To make a prediction about a tweet related to either a mental health or another topic, each tweet vector t 1 , t 2 ,..., t n from the attention layer is classified through a softmax function [39].

User Encoder
Detecting users with depression requires a pattern to differentiate between user groups.To predict these users, this study used a temporal pattern of posting generated from the tweet classification layer.This layer concatenates the probabilities of every classified tweet label into a single list of label probabilities called user representation.The user representations between the 2 groups are expected to differ, which will be explored and illustrated in the Discussion section.Then, user representation is passed through a bidirectional GRU to learn the changing patterns of text categories over the observation time.This generates the forward hidden state and the backward hidden state of the user representation.Finally, they were concatenated to h j .

Anaphora Resolution Encoder
For the MILA-SocNet model with anaphora resolution, pronoun features from LIWC [40] are used to add informative interpretations to each tweet.Every tweet was analyzed for emotions, thinking styles, social states, parts of speech, and psychological dimensions.
Each tweet is combined between the extracted pronoun features s j from the LIWC and a tweet classified label p j from the tweet classification layer, where with represents the extracted features in the jth tweet.This yields the following anaphora resolution vector: The vector is then passed through a bidirectional GRU to learn the text category and anaphoric features.This generates h j combined from the forward hidden state and backward hidden state .

Tweet Attention
Not all user tweets were equally associated with depression.Some tweets may contain cues relevant to depression, whereas others may not.For this purpose, an attention mechanism is applied to reward tweets that correctly represent the characteristics and are important for correctly detecting a user with depression.This layer performs similarly in both MIL-SocNet and MILA-SocNet.A multilayer perceptron (MLP) produces the attention vector u j of the jth tweet.The parameter W t denotes the weights of the tweet and parameter b t represents the bias of the tweet.
The attention weights of tweets or important tweets α j are computed through the similarity of u j with the context vector of tweet level u t , which is learned and updated during the training step.
The user vector v is achieved by summarizing all the information of the tweet label possibilities of a user.

User Classification
Finally, a predictive model for detecting a user with depression can be achieved through the user vector v derived from encoding the concatenation of the probabilities and the attention weights of the classified tweet labels from the user.A softmax function was again used to perform the classification.

Training the MIL Model
To train MILA-SocNet and MIL-SocNet, we used the Keras library with TensorFlow backend, a Python library for neural network APIs.We used an adaptive and momental bound method (AdaMod) [41], and the binary cross-entropy loss function to minimize loss.Every tweet from each user was tokenized and limited to 55 tokens or words.The model was trained using 2000 recent tweets from each user, with users with fewer than 2000 tweets having empty tweets padded with 0 values to achieve the matching length.To eliminate overfitting, dropout and early stopping were applied to the model during the training step.
Both our models and replicated models were trained and tested with holdout cross-validation.We split the users experiencing depression into four equal chunks and trained the models against all control users.Thus, each round used 496 users experiencing depression (22.60%) and 1699 control users (77.40%), mirroring the real-world incidence of depression.From the total users included in each round, 20% were used as test sets to evaluate the performance of the models.To reserve the same proportions of classes between the training and test tests, stratified cross-validation was used.Figure 5 shows the cross-validation process.

Model Evaluation
To predict whether each Twitter user was likely to be depressed, we also trained a set of published predictive models ranging from classical machine learning to deep learning techniques by using user-generated textual content.Accuracy, precision, recall, and F1 scores were averaged across the test sets.Each model was trained and tested with the same samples in each round; however, data transformations differed in some cases, as explained in the Background section.
To compute the predictive performance of models for detecting social network users with depression, we used the following metrics: To further compare the performance of MILA-SocNet and MIL-SocNet with the other published models, Akaike information criterion (AIC) was applied across all the models.
AIC is a commonly used tool for model comparison and selection [42,43] that measures the information loss in each model, considering the model's complexity as well.AIC is defined as follows: where n is the number of samples and K is the number of parameters or features of a model.
denotes the natural logarithm of likelihood [44].The equation also uses bias adjustment because of the small sample size [45,46].A lower AIC value indicates better performance.

Results
This section shows the performance of MILA-SocNet and MIL-SocNet and compares their results in terms of accuracy, precision, recall, and F1 score against several published models including LIWC [13], language [11], topic [17], Usr2Vec [27], and deep learning [28] models, as explained in the Background section.
Table 1 shows the performance of our proposed MILA-SocNet and MIL-SocNet models against the alternative models.As observed, the MILA-SocNet achieves a maximum accuracy (92%), precision (92%), recall (92%), and F1 score (92%), immediately followed by the MIL-SocNet.The MIL-SocNet yielded an accuracy, precision, recall, and F1 score of 90%, 91%, 90%, and 90%, respectively.Each model was evaluated using the area under the curve of the receiver operating characteristic curve.As can be seen in Figure 6, the MILA-SocNet and MIL-SocNet models achieved the highest areas under the curve-93% in both cases.It should be noted that in those studies, the replicated models were reported with different proportions of classes.These results might be higher or lower than our reported results.In our study, the baseline result was 77% in the case of predicting the majority class in all cases.As can be observed, all the models achieved results that were above this baseline.
Table 2 lists the AIC values for each model.The likelihood was computed from the model-based probabilities of the observed labels.The number of parameters of the MILA-SocNet, MIL-SocNet, and deep learning models were recovered from the number of trainable parameters reported by the Keras library.The number of parameters of the language model was taken from the number of vocabularies in the positive and negative language models.The number of parameters of LIWC, Usr2Vec, and topic models were features in the models.The likelihoods and AICs were averaged from cross-validation, as explained earlier.As can be observed, MILA-SocNet achieves the lowest AIC, reflecting the best performance.

Principal Findings
In this study, we presented two novel MIL models for detecting social network users with depression based on their self-identifying tweets.The original MIL-SocNet model was extended with anaphoric resolution to produce the second MILA-SocNet model.We also compared the performance of both models with that of several previously published models.As can be seen from Tables 1 and 2, MILA-SocNet and MIL-SocNet outperformed all other models in all metrics.We now look at several potential reasons for this result.
Although deep learning models can be trained on raw textual data, traditional machine learning models (eg, the LIWC, language, topic, and Usr2Vec models) require feature extraction to be performed using external tools, which may introduce the additional risk of losing useful information from short textual data [47,48].For instance, misspelled and abbreviated words in tweets may not be present in the dictionary of an extraction tool, resulting in the mislabeling of words.This may be one of the reasons why traditional machine learning techniques performed worse than our proposed models.
Another reason for the performance gap may be that the sequential ordering of words in a tweet and tweets posted on a timeline may influence model performance.Training a predictive model with traditional machine learning methods requires aggregated data, which may cause the loss of contextual information compared with deep neural networks that can learn from the sequential information in the data [49][50][51][52].
Unlike the deep learning model that we have compared against [28], MILA-SocNet and MIL-SocNet used an attention mechanism that highlights words and tweets relevant to mental health.This attention mechanism may have contributed to our proposed models outperforming the deep learning model, even though our approach is also based on deep learning techniques.
Another important point to consider is that the addition of anaphoric resolution improves the performance of the base MIL model.The difference between MILA-SocNet and MIL-SocNet is only in anaphora resolution encoding, which highlights posts related to the tweeters rather than someone else.This is an important feature that has not been widely investigated in the field and should be considered while designing future studies.
We further explored our proposed models by comparing the model performance under different conditions.A set of different parameters was used to train the models.The number of each user's posts used to train a model ranged from 500 to 3200 XSL • FO RenderX posts.The numbers of embedded dimensions were 50 and 100.The lengths of word tokens used to train the models were 18 and 55 tokens, respectively.Table 3 and Figure 7 show the predictive results of MILA-SocNet and MIL-SocNet with different parameters.Longer post length and longer word token provide better results, which is expected as these provide more textual content.Furthermore, models with fewer embedded dimensions perform worse than models with more dimensions.
After training the models, we investigated their interpretability by observing the attention weights to find out which tweets the model paid most attention to.Two users from each group were randomly selected from those correctly labeled by our model, and attention weights were extracted from the tweet attention layer.Textbox 1 highlights the tweets that achieved the highest and lowest weights for these 4 users, offering some insight into model's decision-making.Our predictive model with anaphoric resolution can identify tweets related to the tweeters' own experiences.Textbox 1. Attention weights of posts.The "text" was paraphrased to anonymize users' identities.

•
User 1 • Highest weight: I was also dealing with depression and anxiety badly.School was hell.
• Lowest weight: @user Exam without someone's supervision is bad.

•
Highest weight: I get some rest, take medication, and engage with what I like.These help me and I do not force myself to do things.
• Lowest weight: Talk about offensive things to physical harm: url.

Users with no depression
• User 1 • Highest weight: The lady christmas jumper: url.
• Lowest weight: All the best for your match and hope to see you play.
• User 2 • Highest weight: He reminds me someone in a football team.He can play many positions and he is our best player.
• Lowest weight: People believe you when you have evidence.
A recent survey on using social media data to identify users with depression showed that users from the United Kingdom expressed serious concerns about privacy risks and did not see the potential societal benefits outweighing these risks [53].Thus, if these technologies are to have a meaningful impact on people's lives, increased importance must be placed on the transparency and trust of the analytics performed.
Achieving this trust is, to an extent, helped by the compliance of any research with ethical codes and with the General Data Protection Regulation (GDPR), which helps in raising confidence in data safety and transparent analysis.However, GDPR Article 9: Processing of special categories of personal data specifically mentions that consent is not required if permission relates to personal data that are manifestly made public by the data subject.A core problem is the perception that any data in the public domain are automatically available for research.This is highly controversial from an ethical point of view, as the disruption presented by the wide availability of social network data impacts the norms that guide our perception of the usage of our data for research.Ultimately, GDPR is focused on process, not on the objective of the research, which is fundamental to shaping any research consent and the social consensus around it.
This study had some limitations.Collecting control group data is challenging because the samples may contain users with depression who do not publicly express their mental health state on their profiles.Although keyword-based self-declaration is a popular way of asserting depression [11,12], social media users with depression may use more complex ways of communicating their mental health state [54].There is evidence that social media users post less frequently when they feel low, suggesting that there may be less data available for modeling depression [53].
With regard to technical limitations, this study used additional features from a language analysis tool, which counts words in psychological and word function categories.This may prevent our models from learning word functions directly from sentences.Our future work will use sentence structures extracted from text and train a predictive model with those features [55], which may produce further performance improvements.
The availability of data for model validation is another major concern.Owing to potential ethical issues, there are currently no open data sets to evaluate the performance of predictive models on social network data, making it difficult to compare the model performance.The alternative benchmarking approach used in this study is to replicate well-known study models in the field and apply them to the same data set as the new model being investigated.
Another source of potential bias is the pages that publish tweets about mental health information (eg, mental health charities) and users who report depression experiences of other people (eg, users' friends, family, or a celebrity).Although we filtered those instances in our study, a significant concern still exists for similar work in the field.

Conclusions
This paper proposes two novel MIL models with and without anaphoric resolution to detect Twitter users with depression.
Anaphoric resolution is introduced to address the problem of identifying the subject of a statement made in the post.The classifiers developed comprise a tweet encoder, word attention, tweet classification, user encoder, anaphoric resolution encoder, tweet attention, and user classification layers.Bidirectional long short-term memory layers were used to learn the sequence of words and order of tweets posted on a timeline.Word embedding was applied to transform the textual content into vectors.Additional pronoun features were used to add informative dimensions to our proposed model and highlight posts relevant to the posters themselves.The approach was evaluated against previously published traditional machine learning and deep learning techniques, and the experimental results show that our proposed model produces notably better results.Anaphoric resolution, in particular, improved the performance of our model further and should be considered for inclusion in future studies.
The potential impact of this research lies in its ability to offer social media users exhibiting signs of depression that are suitable for their formal diagnosis.As in other mental health disorders, treatments for depression produce better outcomes and at a lower cost of treatment, the earlier patients get into services.Targeted advertising by mental health charities may be seen as intrusive but is no different than companies advertising any other products to potential consumers based on their web activity.
Early research into public perception of this type of data usage shows that there is public skepticism about this approach.To overcome this animosity toward using social media data for mental health prediction modeling, we believe that future research in this area should focus on explainability and interpretability.We have shown that deep learning MIL models perform well, but they offer no explanation of their decision-making processes [56,57].Extraction of patterns from the models can provide interpretability, as we demonstrated with tweet weight examples, and systematic sampling should be used to achieve the levels of trust acceptable to users.To gauge how acceptable these techniques are to the public, we intend to work with citizen juries to explore the change in opinion that such explainability can deliver [58].
©Akkapon Wongkoblap, Miguel A Vadillo, Vasa Curcin.Originally published in JMIR Mental Health (https://mental.jmir.org),06.08.2021.This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/),which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Mental Health, is properly cited.The complete bibliographic information, a link to the original publication on https://mental.jmir.org/,as well as this copyright and license information must be included.
The function generates the probabilities of tweet labels , where with 1 denoting a mental health-related post and 0 denoting a non-mental health-related post.The labels used to train this layer are derived and computed from the labels of the user level only.The parameters W c and b c are learned and updated during the training step.Every predicted tweet label is used to teach a predictive model and detect a user with depression.
a MILA-SocNet: multiple instance learning with an anaphoric resolution for social network.bMIL-SocNet: multiple instance learning for social network.
c LIWC: linguistic inquiry and word count.

Table 1 .
Performance of our proposed MILA-SocNet (multiple instance learning with an anaphoric resolution for social network) and MIL-SocNet (multiple instance learning for social network) models and all replicated models.

Table 2 .
The Akaike information criterion (AIC) results against all models.Each row is reported with the number of parameters (K), the residual sum of squares, and the AIC.A lower AIC is better.

Table 3 .
Performance of MILA-SocNet (multiple instance learning with an anaphoric resolution for social network) and MIL-SocNet (multiple instance learning for social network) with different parameters.The first number in the model name (first column) represents the number of posts, the second is the number of embedded dimensions, and the last is the number of word tokens.