Twitter content analysis: WORDij and LIWC software tools

WORDij and LIWC are two Computer-Assisted Qualitative Data Analysis Software (CAQDAS) tools that can be used to run a content analysis of tweets.

WORDij

WORDij[1] is a software suite whose components serve different research purposes. One of them, WordLink, extracts word pairs, which “are the basis for creating networks” (Danowski[2], 2012, p.3). The words are the network nodes, and the connections between them are weighted by “link strengths”, which reflect how often the two words co-occur.

Word pairs are formed on the basis of word proximity, which, according to Danowski, is calculated by a “window that slides through the text, counting all word pairs inside as it moves from word in the full text” (p.4). The application counts “pairs 3 positions before each focal word, as well as those within 3 words after it”, so the window size is 3 by default. This value can be customised according to one’s needs. An example of a semantic network is pictured in Figure 1.

Figure 1: A spring-embedded graph of nodes and arrows forming a semantic network of the top 20 nodes, with a minimum link value of 3, produced with the WORDij software suite from the Twitter communications of Social Europe in 2012
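The sliding-window idea described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not WORDij's actual implementation: the function name count_word_pairs, the choice to record pairs as directed (focal word, following word) and the sample sentence are all assumptions made for the example.

```python
from collections import Counter


def count_word_pairs(tokens, window=3):
    """Count (earlier word, later word) pairs that fall within `window` positions.

    Hypothetical helper for illustration; not part of WORDij.
    """
    pairs = Counter()
    for i, focal in enumerate(tokens):
        # Looking ahead from every focal word also covers the "before"
        # direction, because each earlier word has already been paired
        # with the current one when it was the focal word.
        for other in tokens[i + 1 : i + 1 + window]:
            pairs[(focal, other)] += 1
    return pairs


if __name__ == "__main__":
    sample = "social europe debates jobs growth and social policy".split()
    for pair, n in count_word_pairs(sample, window=3).most_common(5):
        print(pair, n)
```

Changing the window argument mirrors the customisable proximity parameter: a larger window links words that sit further apart in the text.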

The WORDij software suite has mainly been used to carry out “text mining and semantic network analysis” (Yuan et al., forthcoming, p.1). The word pair co-occurrence is explained by Danowski[3] (2013) as follows:

“Defining word-pair link strength as the number of times each word occurs closely in text with another, all possible word pairs have an occurrence distribution whose values range from zero on up. This ratio scale of measurement allows the use of sophisticated statistical tools from social network analysis toolkits. These enable the mapping of the structure of the word network. They identify word groups, or clusters, and quantify the structure of the network at different levels. Using these word-pair data as input to network analysis tools, you map the language landscape. On the map, instead of cities, the nodes are words. Rather than roads, there are links or edges among words” (2013, online).
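To make the map metaphor concrete, the sketch below turns word-pair counts into a small undirected network and ranks the nodes by the total strength of their links. It is a minimal illustration using only the Python standard library; the function names build_network and node_strength, the min_strength threshold and the sample counts are assumptions for the example and are not taken from WORDij.

```python
from collections import defaultdict


def build_network(pair_counts, min_strength=1):
    """Build an undirected weighted network from {(word_a, word_b): count}.

    Counts for (a, b) and (b, a) are merged into one link strength, and
    links weaker than `min_strength` are dropped. Hypothetical helper.
    """
    merged = defaultdict(int)
    for (a, b), n in pair_counts.items():
        if a != b:  # ignore a word paired with itself
            merged[frozenset((a, b))] += n

    graph = defaultdict(dict)
    for pair, strength in merged.items():
        if strength >= min_strength:
            a, b = tuple(pair)
            graph[a][b] = strength
            graph[b][a] = strength
    return dict(graph)


def node_strength(graph, word):
    """Sum of the link strengths attached to one word (0 if absent)."""
    return sum(graph.get(word, {}).values())


if __name__ == "__main__":
    # Made-up counts, purely for illustration.
    counts = {
        ("social", "europe"): 4,
        ("europe", "social"): 1,
        ("social", "policy"): 3,
        ("policy", "jobs"): 2,
    }
    net = build_network(counts, min_strength=3)
    for word in sorted(net, key=lambda w: node_strength(net, w), reverse=True):
        print(word, node_strength(net, word))
```

A dictionary of this shape can be handed to a dedicated network analysis toolkit for clustering or layout; the threshold plays the same role as the minimum link value mentioned in the figure caption.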

In a paper comparing the professional behaviour of high-contact and low-contact designers on the LinkedIn platform, Danowski (2012) describes how WORDij detects proximate word pairs without resorting to the “bag of words” approach. “Bag of words” is a technique in which a text corpus is represented as a bag of its words, ignoring grammar and word order but preserving word frequency. Applied to word pairing, such an approach would count “words as paired that appeared anywhere in the same profile document” (Danowski, 2012, p.9).

WordLink detects word pairs “within 3 word positions on either side of each word in the text” (p.9). Danowski used a “stop list” to remove common function words from the text. He then discarded words and word pairs with frequencies of 1 and 2, as this “is supported by empirical research in natural language processing” (p.9). He also dropped numerals and punctuation and normalised contractions. Finally, he used advanced WORDij functions to “test for differences in word-pair frequencies” (p.9).
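The cleaning steps listed above (stop list, contractions, numerals, punctuation, dropping frequencies of 1 and 2) could look roughly like this in Python. The stop list and contraction table below are tiny placeholders invented for the example, not the lists Danowski used, and the function names preprocess and drop_rare are likewise hypothetical.

```python
import re
from collections import Counter

# Placeholder lists for illustration only; a real study would use fuller ones.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}
CONTRACTIONS = {"don't": "do not", "it's": "it is", "can't": "cannot"}


def preprocess(text):
    """Lower-case, expand contractions, drop numerals/punctuation and stop words."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Keeping alphabetic runs only removes numerals and punctuation in one step.
    tokens = re.findall(r"[a-z]+", text)
    return [t for t in tokens if t not in STOP_WORDS]


def drop_rare(counts, min_count=3):
    """Discard words or word pairs seen fewer than `min_count` times."""
    return Counter({item: n for item, n in counts.items() if n >= min_count})


if __name__ == "__main__":
    tokens = preprocess("Don't count the 3 words, it's 2012!")
    print(tokens)
    print(drop_rare(Counter(tokens), min_count=1))
```

Note that this sketch only recognises contractions written with a straight apostrophe; tweets that use typographic apostrophes would need an extra replacement step.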

Danowski (2012) argues that WORDij could be employed for automatic link coding with proximities since it overcomes the problems that occur when employing the “bag of words” method:

“While word bags are useful for document retrieval they blur social meaning by ignoring the relationships of social units within the texts, whether these units are words, people, or other entities” (p.217).

LIWC

In support of employing CAQDAS, Gluesing et al.[4] (2009) provide clear evidence that “it is possible to design research that takes full advantage of information technologies to gather large amounts of data for data mining and network analysis, but also to embed qualitative methods in parallel and in a measured, targeted way to maximize the richness of results while minimizing the costs usually involved in long-term, labour-intensive ethnographic studies” (p.25).

Derczynski et al.[5] (2013) consider Twitter a “noisy environment” given the SMS-like behaviour of users, who tend to use word abbreviations and SMS conventions. To overcome or reduce this linguistic noise, the data must be normalised, which means additional work for researchers analysing Twitter content: abbreviations and other conventions have to be transformed into their full-word equivalents. The work can be done in two stages:

1) identification of orthographic errors and

2) the correction of the errors (p.7).
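The two stages can be illustrated with a simple lookup-based sketch in Python. This is a deliberately naive example, not the method used by Derczynski et al.: the abbreviation table and the function names identify_nonstandard and correct are assumptions made purely for illustration.

```python
# Hypothetical lookup table; real normalisation lexicons are far larger.
ABBREVIATIONS = {"u": "you", "gr8": "great", "2moro": "tomorrow", "pls": "please"}


def identify_nonstandard(tokens, lookup=ABBREVIATIONS):
    """Stage 1: return the positions of tokens that match a known short form."""
    return [i for i, tok in enumerate(tokens) if tok.lower() in lookup]


def correct(tokens, positions, lookup=ABBREVIATIONS):
    """Stage 2: replace the flagged tokens with their full-word equivalents."""
    fixed = list(tokens)
    for i in positions:
        fixed[i] = lookup[tokens[i].lower()]
    return fixed


if __name__ == "__main__":
    tweet = "c u 2moro pls".split()
    flagged = identify_nonstandard(tweet)
    print(flagged)
    print(correct(tweet, flagged))
```

Keeping identification and correction separate makes it possible to inspect what was flagged before anything in the tweet is changed.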

Elson et al.[6] (2012) state that until 2012 researchers studied social media content manually: they focused on certain pieces of content and on specific users, then interpreted and reported the findings. The authors point out that researchers may therefore have analysed unrepresentative pieces of content, so that their findings hold for particular cases and situations but may not apply to Twitter mass communication.

The authors propose employing LIWC[7] (Linguistic Inquiry and Word Count) as a “computerized method to study the content of social media”, which may compensate for the limitations of the “manual” methods and therefore reduce “the chance that their biases affect their interpretations of social media texts” (p.xii).

Despite its later popularity as a piece of CAQDAS, LIWC was still “largely untested in political contexts” when the paper was published in 2012.

The authors employed LIWC to “investigate how men and women communicate differently” and realised that the software application “has not been widely applied to understanding a non-Western political context” (p.xii).

Elson et al. (2012) analysed the opinions of Iranian Twitter users, how they felt about the 2009 election in Iran and what their attitudes were towards certain countries. The researchers also wanted to validate the LIWC methodology on this particular kind of Twitter content, which had not been done previously.

It is worth noting that Elson et al. reviewed some of the previous research projects that employed LIWC as a CAQDAS and claim that LIWC provides clear evidence of people’s behaviours, attitudes and emotions. For example, “greater use of first-person singular pronouns… has been shown to suggest feelings of depression”, while “second-person or plural pronouns indicate reaching out to others… and a sense of community or group identity” (p.xiii). LIWC has proven to be a reliable method to identify “emotions in written language… and the results it generates are comparable to those produced with other content-analysis methods” (p.xiii). Furthermore, “LIWC has been successfully applied more recently to various forms of social media” (p.xiii), and the procedure “holds much promise”, conclude Elson et al. (p.xvii).

The LIWC software application was developed by the psychologist James W. Pennebaker and his team. LIWC processes texts and provides information about 80 linguistic categories contained in the analysed texts. Its strength lies in its ability to output information about the positive and negative emotions carried by the texts:

“LIWC represents only a transitional text analysis program in the shift from traditional language analysis to a new era of language analysis” (Tausczik and Pennebaker[8], 2010, p. 38).

LIWC goes through any type of text, including blogs, novels, speeches and poems, and checks each word against a set of dictionaries embedded in the application. Each dictionary defines a particular word category, and together they “capture different psychological concepts” (Pennebaker[9], 2011, p. 6). Once the programme has checked and counted all the words, it “calculates a ratio of the number of words in each word category” (Servi and Elson[10], 2012).
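The dictionary look-up and ratio calculation can be sketched as follows. This is a toy illustration of the general word-counting approach and not LIWC itself: the three categories and their word lists are invented for the example, the function name category_ratios is hypothetical, and the ratio is assumed here to mean category hits divided by the total number of words.

```python
import re
from collections import Counter

# Toy dictionaries for illustration; LIWC's own dictionaries are much larger.
CATEGORIES = {
    "first_person_singular": {"i", "me", "my", "mine"},
    "positive_emotion": {"happy", "good", "love"},
    "negative_emotion": {"sad", "bad", "hate"},
}


def category_ratios(text, categories=CATEGORIES):
    """Return, per category, the share of words in `text` that belong to it."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    hits = Counter()
    for tok in tokens:
        for name, words in categories.items():
            if tok in words:
                hits[name] += 1
    # Guard against empty input so the division never fails.
    return {name: (hits[name] / total if total else 0.0) for name in categories}


if __name__ == "__main__":
    for name, ratio in category_ratios("I love my good dog").items():
        print(f"{name}: {ratio:.2f}")
```

Because each word is matched in isolation, a sketch like this has no way of telling a sincere “love” from a sarcastic one.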

Pennebaker (2011) acknowledges some of LIWC’s limitations. In his view, word-counting programmes cannot detect linguistic nuances such as irony or sarcasm. LIWC also fails “to capture the context of language. One word, for example, can have very different meanings, depending on how it is used” (p.8). However, researchers aim to create “smarter word-count programs that will eventually take into account syntax, grammar, and context in general” (p.9).

References

[1] http://wordij.net/

[2] Danowski, J. A. (2012), “Social network size and designers’ semantic networks for collaboration”, in International Journal of Organization Design and Engineering, 2(4)

[3] Danowski, J. A. (2013), WORDij version 3.0: Semantic network analysis software, Chicago: University of Illinois at Chicago, available from http://wordij.net/

[4] Gluesing, J. et al. (2009), “Mixing ethnography and information technology data mining to visualize innovation networks in global networked organizations”, in Dominguez, S. and Hollstein, B. (eds.), Mixed methods in studying social networks, Cambridge University Press

[5] Derczynski, L. et al. (2013), “Microblog-Genre Noise and Impact on Semantic Annotation Accuracy”, in Proceedings of the 24th ACM Conference on Hypertext and Social Media

[6] Elson, S. B. et al. (2012), Technical Report: Using Social Media to Gauge Iranian Public Opinion and Mood After the 2009 Election, Rand Corporation

[7] http://liwc.net/

[8] Tausczik, Y. R. and Pennebaker, J. W. (2010), “The Psychological Meaning of Words: LIWC and Computerized Text Analysis Methods”, in Journal of Language and Social Psychology, 29(1), 24-54, Sage Publications

[9] Pennebaker, J. W. (2011), The Secret Life of Pronouns: What our words say about us, New York: Bloomsbury Press

[10] Servi, L. and Elson, S. B. (2012), “A Mathematical Approach to Identifying and Forecasting Shifts in the Mood of Social Media Users”, in MITRE Technical Report #120090, p. 27-30