This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Mental Health, is properly cited. The complete bibliographic information, a link to the original publication on http://mental.jmir.org/, as well as this copyright and license information must be included.
The Internet is a particularly dynamic way to capture the perceptions of a population quickly and in real time. As a complement to traditional face-to-face communication, online social networks help patients improve self-esteem and foster self-help.
The aim of this study was to use text mining on material from an online forum exploring patients’ concerns about treatment (antidepressants and anxiolytics).
Concerns about treatment were collected from discussion titles in an online patient community related to antidepressants and anxiolytics. To examine the content of these titles automatically, we used text mining methods such as word frequency analysis in a document-term matrix and word co-occurrence analysis using a network approach. It was thus possible to identify the topics discussed on the forum.
The forum included 2415 discussions on antidepressants and anxiolytics over a period of 3 years. After a preprocessing step, the text mining algorithm identified the 99 most frequently occurring words in titles, among which were escitalopram, withdrawal, antidepressant, venlafaxine, paroxetine, and effect. Patients’ concerns related to antidepressant withdrawal, the need to share experiences about symptoms and effects, and questions on weight gain with some drugs.
Patients’ expression on the Internet is a potential additional resource for addressing patients’ concerns about treatment. The profile of these patients is close to that of patients treated in psychiatry.
The various Internet resources make it possible to share content quickly and to interact with a large population. The biomedical literature shows a significant increase in studies of online communities. In 2015, almost 400 publications relating to Facebook, over 300 relating to Twitter, and almost 400 documents relating to blogs and forums were published. Online health communities (OHCs), social media, and blogs are a potential mine of information exchanged daily. Among the applications of text mining, the automatic analysis of data from the Internet is a challenge: the large amount of data available can be processed with natural language processing tools. Moreover, screening the Internet manually is almost impossible, which makes it a very interesting application for automatic data extraction tools [
Social networks are a particularly dynamic way to capture the concerns of a population in real time. Blogs, microblogs such as Twitter [
Online social networks provide a valuable complement to face-to-face communication and help patients improve their self-esteem and social skills [
Some studies have focused on the use of social media and Internet psychiatry forums. OHCs and social media are important resources for mental health service users, reducing stigma and promoting help-seeking behaviors. The analysis of information shared on the Web is crucial in psychiatry, as public misinformation could negatively affect mood. Research on social networks has increased and has explored different aspects of people’s health status. Several studies have considered the way depression and eating disorders are discussed on Twitter [
The influence of social networks on the development of attitudes of mutual trust and self-help has been studied. In some cases, face-to-face support cannot provide adequate help for patients with mental disorders [
Several studies have been published on the applications of text mining to Web data. The exploitation of Internet data has provided early monitoring information on adverse reactions to drugs [
Text mining is proving to be a powerful method to exploit large continuous flows of user-generated content on the World Wide Web. This resource has rarely been exploited to understand the perceptions of patients about treatments. We set out to study an online discussion forum dedicated to the use of antidepressants and anxiolytics to explore patient concerns.
Our dataset is derived from a French online discussion forum about drugs, illness, procedures, and other information relating to general health. As mentioned in the forum charter, discussions can be read and potentially used by all. We focused initially on the titles of discussions from 2013 to 2015, in which the participants themselves summarize the topic or question they post on the forum. In other words, we focused on a condensed form of participants’ concerns regarding antidepressants or anxiolytics.
The data extraction step depends on the data source and differs according to whether the data come from a website, from patient medical records, or from qualitative interviews. In our study, health information was extracted from Web pages by a Web-crawling program written with R packages (R Project for Statistical Computing). Our corpus of documents was formed from discussion titles. A page contains 50 topics, and each topic consists of a title, an initial message (demand), and potential responses (messages). To create this database, we implemented the following process: first, we extracted the pages listing the discussions. Links to each discussion are found at a specific location on the website, and the storage address of a discussion is indicated by a URL link. By capturing all the discussion addresses, we had access to the messages stored there in Hyper Text Markup Language (HTML) format. Second, each URL and each discussion were analyzed to remove unnecessary information (images and advertising) and to extract the date, titles, and discussion messages into a Microsoft Excel file.
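The title-extraction logic can be sketched as follows. The study used R packages, so this Python version with the standard library's HTML parser is purely illustrative: the `topic-title` class attribute, the URLs, and the French titles are all invented stand-ins for the forum's real markup.

```python
# Illustrative sketch only: the study used R, and the real forum markup
# differs. The "topic-title" class and sample titles are assumptions.
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text of links assumed to carry discussion titles."""
    def __init__(self):
        super().__init__()
        self.in_title_link = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # Assumption: discussion links carry class="topic-title".
        if tag == "a" and ("class", "topic-title") in attrs:
            self.in_title_link = True

    def handle_data(self, data):
        if self.in_title_link and data.strip():
            self.titles.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_title_link = False

page = ('<ul><li><a class="topic-title" href="/t1">sevrage paroxetine</a></li>'
        '<li><a class="topic-title" href="/t2">effets escitalopram</a></li></ul>')
parser = TitleParser()
parser.feed(page)
print(parser.titles)  # ['sevrage paroxetine', 'effets escitalopram']
```

In practice, the same parser would be fed each listing page retrieved by the crawler, accumulating one title per discussion before the date and messages are extracted.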
There has been discussion about the ethical concerns of analyzing data retrieved from OHCs [
Once the data extraction is performed, the data preparation stage can begin. The tools need to be adapted to the language (eg, English, French, or Spanish) and to the vocabulary if certain words are used in a specific domain (eg, medical). In addition, words used on the Internet via social networks or blogs are not the same as those used in a newspaper. We therefore need to pay special attention to spelling irregularities and to include everyday words used in spoken language.
A morphological analysis is the first part of the process. It consists of analyzing the morphology of the sentences in the text. All messages are reviewed by screening for particular typographic elements such as accents. This step is essential in French because accents are characteristic of the language. For example, the letter “a” can have several accented variants, each of which is replaced by its generic form without an accent. Then there is a harmonization step, consisting of converting all uppercase letters to lowercase. Each sentence is then segmented using the punctuation that delimits it, and punctuation marks are deleted. Finally, the spaces between the words are used to delineate them.
The next step is to parse the text and to remove noninformative elements such as numbers or function words occurring in the database. Markup codes left over from data extraction may also be present, such as XML and HTML tags (<html>, </n>, ...). Finally, a list of “stop words” is predefined in the software to automatically delete prepositions and articles (what, my, ...) that are not informative, reducing the list to the words most relevant to the analysis.
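The normalization steps above (tag removal, accent stripping, case harmonization, punctuation-based segmentation, and stop-word filtering) can be sketched in a few lines. This is not the study's pipeline; the tiny French stop-word list is an illustrative subset only.

```python
import re
import unicodedata

# Illustrative subset only; a real French stop-word list is much larger.
STOP_WORDS = {"le", "la", "de", "mon", "et"}

def preprocess(title):
    # Strip residual HTML/XML tags left over from extraction.
    title = re.sub(r"<[^>]+>", " ", title)
    # Replace accented letters by their generic unaccented form (é -> e).
    title = unicodedata.normalize("NFKD", title)
    title = "".join(c for c in title if not unicodedata.combining(c))
    # Harmonize case, then split on punctuation, digits, and spaces.
    tokens = re.findall(r"[a-z]+", title.lower())
    # Remove non-informative stop words.
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("Sevrage de la Paroxétine : mon témoignage <br>"))
# ['sevrage', 'paroxetine', 'temoignage']
```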
The last step is stemming, which consists of grouping similar words according to a common root. For instance, a verb can have different spellings following conjugation rules. Stemming enables us to group every inflection of the verb into one term, which is the root. For instance, the word “continue” exists in different variants—“continued,” “continuing,” “continuous,” “continuation,” and so on. These inflected forms will all be identified as relating to “continue.” The same principle is applied for compound words or words with a prefix or suffix. To perform the stemming step, it is necessary to have an exhaustive list of words including all variants and the associated root. The quality of this processing varies with the software used and from one language to another and depends on the list of words referenced. Finally, to simplify the analysis of the treatments mentioned, we harmonized the names of the different drugs by using their international nonproprietary names.
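A deliberately naive suffix-stripping sketch illustrates the idea on the “continue” example above. A real pipeline would rely on a full stemmer (eg, Snowball) backed by the exhaustive word list the text describes; the suffix list here is an assumption chosen only to make the example work.

```python
# Naive suffix stripping for illustration; real stemming uses an
# exhaustive, language-specific list of variants and roots.
SUFFIXES = ["ation", "ous", "ing", "ed", "e", "s"]  # longest first

def stem(word):
    for suf in SUFFIXES:
        # Keep at least 4 characters so short words are not mangled.
        if word.endswith(suf) and len(word) - len(suf) >= 4:
            return word[: -len(suf)]
    return word

variants = ["continue", "continued", "continuing", "continuous", "continuation"]
print({w: stem(w) for w in variants})
# every variant reduces to the common root 'continu'
```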
Initially, every word in every sentence is recognized as unique. At the end of this stage of data preparation, the number of words is reduced following the simplification of word variants. To analyze the occurrence of these words in our corpus, a contingency table called a document-term matrix (DTM) is created. In our study, we analyzed only the words used in the titles of discussions. Our DTM shows the number of times a word (column) was used in a discussion title (row). The majority of words appear in only a few titles. Accordingly, the DTM is sparse, meaning that the table contains a large number of 0s. We therefore need to adapt the data modeling approach to this type of data.
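Building a DTM from preprocessed titles amounts to counting each vocabulary word per title. A minimal sketch, using invented French tokens as the toy corpus:

```python
from collections import Counter

# Toy corpus of preprocessed titles (invented for illustration).
titles = [
    ["sevrage", "paroxetine"],
    ["effet", "escitalopram"],
    ["sevrage", "escitalopram", "effet"],
]

# Vocabulary = every distinct word across the corpus, in stable order.
vocab = sorted({w for t in titles for w in t})

# One row per title, one column per word: occurrence counts.
dtm = [[Counter(t)[w] for w in vocab] for t in titles]

print(vocab)
for row in dtm:
    print(row)  # most cells are 0: the matrix is sparse
```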
The easiest and commonest way to visualize textual data is the word cloud. The aim is to display each word and represent its frequency by the size of the font used. First, only words included in the DTM are used. Then, the frequency of each word is calculated, and the list is ordered in decreasing order. The word that appears most frequently is displayed in the largest font. The second word is most often graphically smaller than the first word but larger than the third word in the list, and so on. In the end, the word cloud reflects the word frequency table, maximizing the visibility of the most common terms.
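The frequency-to-font mapping behind a word cloud can be sketched as follows; the 10-to-40-point size range is an arbitrary assumption, and the toy titles are invented.

```python
from collections import Counter

# Toy corpus of preprocessed titles (invented for illustration).
titles = [["sevrage", "paroxetine"], ["sevrage", "effet"], ["sevrage"], ["effet"]]

freq = Counter(w for t in titles for w in t)
ranked = freq.most_common()  # words in decreasing frequency order
max_count = ranked[0][1]

# Font size proportional to frequency (assumed 10-40 point range).
sizes = {w: 10 + 30 * c / max_count for w, c in ranked}

print(ranked)  # [('sevrage', 3), ('effet', 2), ('paroxetine', 1)]
```

The most frequent word thus gets the maximum font size, and every other word scales down with its count, which is exactly the visual ordering the word cloud conveys.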
To analyze the patterns of occurrence of words in the discussion titles, we studied the influence of each word in terms of co-occurrence. Due to the sparsity of the DTM, correlation analysis was not appropriate. The analysis of co-occurrences, via centrality measures using graph theory, is an alternative that has been shown to perform better in both quantity and quality [
The first type identifies a centrality pattern, which highlights words of importance based on their favorable positioning in the co-occurrence relationships. These words play a central role in some units of the graph. Centrality can be measured locally by degree centrality, which considers words with many connections to be the most important. The degree of a word thus measures its importance by the number of interactions it is involved in, acting as an exposure index to what flows through the network. Another way of looking at centrality is to consider how important words are in connecting other words (betweenness centrality). The idea is to reflect the mediating role of words based on how many words each word would have to go through to reach the others.
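Degree centrality on a co-occurrence graph can be sketched directly from the titles: two words are linked whenever they appear in the same title, and a word's degree is its number of distinct neighbors. This illustrative sketch uses invented tokens; the study computed these measures (including betweenness, which additionally requires shortest-path counting) on the full corpus.

```python
from collections import defaultdict
from itertools import combinations

# Toy corpus of preprocessed titles (invented for illustration).
titles = [
    ["sevrage", "paroxetine", "effet"],
    ["sevrage", "escitalopram"],
    ["effet", "escitalopram", "sevrage"],
]

# Two words co-occur when they appear in the same title.
neighbors = defaultdict(set)
for t in titles:
    for a, b in combinations(set(t), 2):
        neighbors[a].add(b)
        neighbors[b].add(a)

# Degree centrality: number of distinct co-occurring words.
degree = {w: len(n) for w, n in neighbors.items()}
print(sorted(degree.items(), key=lambda kv: (-kv[1], kv[0])))
```

Words with the highest degree (here “sevrage” and “effet”) are the ones exposed to the most distinct interactions, mirroring the central terms the analysis surfaces in the forum.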
The second type identifies a modular pattern of occurrence (community), where the words are grouped into classes based on semantic similarities (ie, similar semantic patterns of word occurrences). The aim of this analysis was to identify the thematic structure of the text [
The Health Forum studied is a French-language website, Doctissimo [
The preprocessing step is represented in
The titles of discussions extracted initially contained 3025 different words. After the preprocessing step, only 99 words were retained as the most representative, that is, the words that appeared most frequently in the titles and were considered the most informative (excluding prepositions, articles, and some adverbs).
The final table reduction step was applied to remove words that appeared infrequently. We did not analyze all the terms in the titles because many words are not informative. To reduce the size of our final DTM without the risk of losing information, we removed the words occurring in less than 0.05% of the titles. Only a few words are thus retained by text mining as the most relevant to define the title content, and the content of some titles was reduced to one or two words. More than 400 titles did not contain any of the words listed in the DTM as the most frequent. Several reasons could explain this phenomenon: first, some titles could include uncommon words; second, noninformative words are deleted during the preprocessing phase.
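The rare-word filter can be sketched as a document-frequency threshold: a word is kept only if it appears in at least a given share of titles. The toy corpus and the 50% threshold below are chosen purely for illustration (the study's threshold was 0.05% of the 2415 titles).

```python
from collections import Counter

# Toy corpus of preprocessed titles (invented for illustration).
titles = [["sevrage", "effet"], ["sevrage"], ["effet", "rare"], ["sevrage", "effet"]]

# Document frequency: number of titles containing each word.
df = Counter(w for t in titles for w in set(t))
n_titles = len(titles)

min_share = 0.5  # illustrative; the study used 0.0005 (0.05%)
keep = {w for w, c in df.items() if c / n_titles >= min_share}
filtered = [[w for w in t if w in keep] for t in titles]

print(keep)      # the rare word is dropped
print(filtered)  # some titles shrink to one word or fewer
```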
Words related to antidepressants were the most frequent. The words corresponding to information sharing between participants, such as “help,” “testimony,” “advice,” “need,” and “opinion” are also present.
Preprocessing step.
Wordcloud.
Centrality of co-occurrences based on degree algorithm.
Cluster of co-occurrences-community detection based on modularity (fast greedy algorithm).
Centrality reflects the relative importance of a word within a corpus (ie, it measures the position of a word in the network of links between words). The centrality measure based on degree enables visualization of the most frequently used words in the forum.
As in the word cloud, the most prominent words are “withdrawal,” “stop,” and “antidepressant.” These words reflect major concerns expressed in the forum. Betweenness centrality identifies words with a mediator role, serving as paths linking other words in the network. It quantifies the control of a word over the communication between other words. Six words are identified as mediators: linking terms relative to the request (“after,” “under,” and “do”) and terms defining different topics around common themes (“escitalopram,” “withdrawal,” and “antidepressant”).
The detection of “communities” makes it possible to highlight localized, nonhierarchical patterns of co-occurrence. Community detection based on modularity (fast greedy algorithm) is used to visualize different topics in
Nine clusters of words are identified, representing the interconnection of terms on the basis of their co-occurrence in titles:
Depression, distress, and anxiety, where people ask about the experiences of people who took treatments against these symptoms (11 words in turquoise)
Withdrawal linked to paroxetine, escitalopram, alprazolam, and changes of treatment especially with venlafaxine and sertraline (9 words in red)
Effects after stopping medication of duloxetine and agomelatine (7 words in violet)
Search for advice, assistance, and libido issues (7 words in yellow)
Weight gain with fluoxetine and aripiprazole (4 blue words)
Effects of amitriptyline (3 words in green)
Side effects of risperidone (3 words in orange)
Changing prescription and switching between two antidepressants: duloxetine and agomelatine (2 green words)
Concerns about the effects and side effects of medications (2 gray words)
Detecting communities is an interesting graphic approach to visualize knowledge of relational data and to bring information to light more quickly when it is hidden in large volumes of data. Similar results were found using the random walk algorithm (walktrap).
The principal concerns in the forum relate to withdrawal and the discontinuation of certain antidepressants. We can see the central role of withdrawal in patients’ questions. This issue was long minimized. In 1997, a survey concluded that many physicians denied being aware of the existence of antidepressant withdrawal symptoms [
The study of interactions on OHCs provides an additional source of information to better understand the difficulties encountered in real life. Patients may be able to develop skills to overcome the difficulties of communication and recovery. In our study, we identified patients’ need to share experiences about illness management and to brighten their lives through social interactions on these online platforms [
We focused our analysis on people posting messages via the Internet, meaning that they have Internet access. There are still many people who do not use the Web on a regular basis. In these communities, there may be no easy way to obtain general health information, and we cannot therefore extrapolate our results to the views and behaviors of other populations. However, Internet usage is increasing exponentially as technology and connectivity become ever more widely available. There is, therefore, a need to monitor the changing demographics of website users (geographical location, age, and gender). In addition, not all information has the same impact on the Internet, and certain factors can be used to quantify its influence on patient behavior [
For the moment, no guidelines are available on how to deal with data ownership. Although OHC content analysis has potential benefits, it introduces new ethical challenges. The lack of clear guidelines for conducting online human subjects research leaves researchers with no clear way to analyze data shared on the Internet [
Consequently, this gap discourages social scientists from conducting online research. Researchers have used several approaches to publish research based on Internet data. Some scientists do not publish any information about ethics considerations. Computer scientists raise fewer concerns because they are often unfamiliar with the ethical and social implications. One study using Twitter data asked the advice of an institutional review board, which qualified the project as not being human subjects research because public handles are avatars and not identifiable living individuals according to local and national regulations [
Despite these limitations, the antidepressants and anxiolytics cited are coherent for the management of patients with depression. Escitalopram, paroxetine, venlafaxine, and sertraline are the main antidepressants used in practice to treat depressed patients in France [
Our analysis focuses on the most frequent words used in 2415 titles on a French online forum about antidepressants and anxiolytics. The major concerns addressed in the titles are as follows: (1) stopping medications (using the word “withdrawal”) for certain antidepressants, (2) the need to share the experience of symptoms (depression and anxiety), (3) effects, and (4) questions concerning weight gain with some treatments. The analysis of centrality gives a general idea of the words used. In addition, community analysis provides the context of use of these words, helping to identify the questions discussed in the forum. Our findings show that the profile of patients asking questions in the forum is close to that of patients treated in psychiatry. The concerns expressed are coherent with real-life situations and are not outlandish requests or complaints about mental health issues. Our research is based on text mining tools that are increasingly used in drug surveillance. In practice, pharmacovigilance relies almost exclusively on spontaneous reporting systems. As a complement to this standard approach, analyzing what patients spontaneously report could improve pharmacovigilance investigations.
Top 20 most frequent words in titles.
Frequency of drug names in titles.
API: application program interface
DTM: document-term matrix
HTML: Hyper Text Markup Language
OHC: online health community
The authors would like to thank Doctissimo for allowing them to use data from their website.
None declared.