class: left, bottom, title-slide, dictionary, title-slide

# Studying News Use with Computational Methods
## Text Analysis in R, Part I: Text Description, Word Metrics and Dictionary Methods
### Julian Unkel
### University of Konstanz
### 2021/06/21

---

# Agenda

.pull-left[
At its most basic, automated content analysis is just counting stuff: most frequent words, co-occurring words, specific words, etc. We can already learn a lot about a corpus of documents just by looking at word metrics and applying dictionaries.

Even if they are not part of the main research interest, the following methods are useful for describing and familiarizing yourself with a large text corpus.
]

--

.pull-right[
Our agenda today:

- Text description and word metrics
  - Frequencies
  - Keywords in context
  - Collocations
  - Co-occurrences
  - Lexical complexity
  - Keyness
- Dictionary-based methods
  - Basics
  - Applying categorical dictionaries
  - Applying weighted dictionaries
  - Validating dictionaries
]

---
class: middle

# Text description and word metrics

---

# Setup

We will mainly be using the packages known from the last few sessions:

```r
library(tidyverse)
library(tidytext)
library(quanteda)
```

```
## Package version: 3.0.0
## Unicode version: 13.0
## ICU version: 69.1
```

```
## Parallel computing: 16 of 16 threads used.
```

```
## See https://quanteda.io for tutorials and examples.
```

```r
library(quanteda.textstats)
```

---

# Setup

We will be working with a sample of 10,000 Guardian articles published in 2020:

```r
guardian_tibble <- readRDS("data/guardian_sample_2020.rds")
```

---

# Setup

Before we start, let's add an extra column indicating the day the respective article was published (you'll soon enough see why):

```r
guardian_tibble <- guardian_tibble %>% 
  mutate(day = lubridate::date(date))
```

```r
guardian_tibble %>% 
  select(date, day)
```

```
## # A tibble: 10,000 x 2
##    date                day
##    <dttm>              <date>
##  1 2020-01-01 00:09:23 2020-01-01
##  2 2020-01-01 00:34:18 2020-01-01
##  3 2020-01-01 02:59:09 2020-01-01
##  4 2020-01-01 06:20:56 2020-01-01
##  5 2020-01-01 07:00:58 2020-01-01
##  6 2020-01-01 08:00:01 2020-01-01
##  7 2020-01-01 08:50:00 2020-01-01
##  8 2020-01-01 09:01:00 2020-01-01
##  9 2020-01-01 10:00:02 2020-01-01
## 10 2020-01-01 10:57:37 2020-01-01
## # ... with 9,990 more rows
```

---

# Preprocessing

Just like last time, we'll do some preprocessing of our data by creating a corpus object, tokenizing all documents and creating a DFM. Keep all of these objects, as different methods require differently structured data.

```r
guardian_corpus <- corpus(guardian_tibble, docid_field = "id", text_field = "body")

guardian_tokens <- guardian_corpus %>% 
  tokens(remove_punct = TRUE,
         remove_symbols = TRUE,
         remove_numbers = TRUE,
         remove_url = TRUE,
         remove_separators = TRUE) %>% 
  tokens_tolower()

guardian_dfm <- guardian_tokens %>% 
  dfm()
```

---

# Word frequencies

`featfreq()` counts all features. Note that the resulting list is not sorted:

```r
featfreq(guardian_dfm)
```

```
## there is a message woven
## 18152 77962 187892 930 21
## into everything the prime minister
## 11856 1856 453840 2482 3635
## says about these fires carefully
## 9596 20189 6695 394 281
## threaded through every pronouncement that
## 9 6086 4226 5 86117
## they are not extraordinary unprecedented
## 28376 39966 32524 476 526
## with skill of man who
## 54959 141 205550 2789 24401
## made pre-politics career messaging scott
## 6620 1 1314 155 517
## morrison's narrative disaster in no
## 86 381 490 157939 12547
## way different from disasters australians
## 6723 2873 37464 102 590
## have faced past terrible event
## 43054 677 2739 368 1135
## to be sure but one
## 225486 43044 1548 38462 18749
## which we will recover resilience
## 18940 29679 24175 271 210
## and aussie spirit always shown
## 197056 36 431 3242 640
## during our long history similar
## 5489 11591 4069 2162 1306
## crises whatever trials befallen us
## 156 657 382 5 11965
## never succumbed panic do this
## 4020 37 324 10384 33182
## now face current fire crisis
## 11200 2593 1611 1200 3670
## generations went before including first
## 259 2416 8077 5282 10442
## also natural floods global conflicts
## 13465 693 107 2358 81
## disease drought he told new
## 1315 134 41298 5610 13902
## [ reached getOption("max.print") -- omitted 135380 entries ]
```
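If you just need a quick sorted overview, one option is to sort the named frequency vector yourself (a minimal sketch; the dedicated helpers on the next slides offer more options):

```r
# sort the named frequency vector returned by featfreq() and peek at the top entries
featfreq(guardian_dfm) %>% 
  sort(decreasing = TRUE) %>% 
  head(10)
```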
---

# Word frequencies

`topfeatures()` returns the *n* most common features (default: 10):

```r
topfeatures(guardian_dfm)
```

```
## the to of and a in that is for on
## 453840 225486 205550 197056 187892 157939 86117 77962 75739 66469
```

---

# Word frequencies

Some more options, including grouping by docvars, are available with `textstat_frequency()`:

```r
textstat_frequency(guardian_dfm, n = 5, groups = pillar)
```

```
##    feature frequency rank docfreq group
## 1  the     73441     1    1713    Arts
## 2  of      38415     2    1708    Arts
## 3  a       37528     3    1711    Arts
## 4  and     37483     4    1711    Arts
## 5  to      33283     5    1708    Arts
## 6  the     31317     1    860     Lifestyle
## 7  a       18502     2    842     Lifestyle
## 8  and     18090     3    850     Lifestyle
## 9  to      17431     4    854     Lifestyle
## 10 of      15079     5    846     Lifestyle
## 11 the     253420    1    5325    News
## 12 to      127021    2    5321    News
## 13 of      110784    3    5319    News
## 14 and     100977    4    5317    News
## 15 a       91590     5    5301    News
## 16 the     42100     1    845     Opinion
## 17 to      21923     2    845     Opinion
## 18 of      21479     3    845     Opinion
## 19 and     18732     4    845     Opinion
## 20 a       17047     5    845     Opinion
## [ reached 'max' / getOption("max.print") -- omitted 5 rows ]
```

---

# Word frequencies

Let's get some more useful results by removing stopwords:

```r
dfm_remove(guardian_dfm, stopwords("english")) %>% 
  textstat_frequency(n = 5, groups = pillar)
```

```
##    feature    frequency rank docfreq group
## 1  one        3929      1    1330    Arts
## 2  like       3124      2    1096    Arts
## 3  people     2883      3    909     Arts
## 4  just       2389      4    993     Arts
## 5  says       2376      5    504     Arts
## 6  one        1807      1    647     Lifestyle
## 7  can        1787      2    592     Lifestyle
## 8  says       1551      3    263     Lifestyle
## 9  like       1499      4    566     Lifestyle
## 10 people     1298      5    433     Lifestyle
## 11 said       28843     1    4490    News
## 12 people     13557     2    3579    News
## 13 one        8569      3    3514    News
## 14 government 8521      4    2841    News
## 15 new        8351      5    3095    News
## 16 people     2404      1    650     Opinion
## 17 one        1850      2    699     Opinion
## 18 can        1573      3    633     Opinion
## 19 us         1509      4    544     Opinion
## 20 now        1398      5    615     Opinion
## [ reached 'max' / getOption("max.print") -- omitted 5 rows ]
```
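---

# Word frequencies

Frequency tables like this are often easier to read as a plot. A minimal sketch with `ggplot2` (loaded via `tidyverse` above); showing 15 features is an arbitrary choice:

```r
guardian_dfm %>% 
  dfm_remove(stopwords("english")) %>% 
  textstat_frequency(n = 15) %>% 
  ggplot(aes(x = reorder(feature, frequency), y = frequency)) +
  geom_col() +
  coord_flip() +
  theme_classic() +
  labs(x = NULL, y = "Frequency")
```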
---

# Word frequencies

More relevant features emerge after some strong trimming of the DFM:

```r
dfm_trim(guardian_dfm, max_docfreq = .20, docfreq_type = "prop") %>% 
  textstat_frequency(n = 3, groups = pillar)
```

```
##    feature   frequency rank docfreq group
## 1  film      1686      1    558     Arts
## 2  show      1480      2    612     Arts
## 3  music     1358      3    440     Arts
## 4  fashion   508       1    99      Lifestyle
## 5  food      498       2    194     Lifestyle
## 6  add       430       3    139     Lifestyle
## 7  trump     4029      1    826     News
## 8  police    3621      2    926     News
## 9  cases     3443      3    1249    News
## 10 trump     808       1    184     Opinion
## 11 political 660       2    291     Opinion
## 12 black     632       3    150     Opinion
## 13 league    2266      1    684     Sport
## 14 players   1962      2    669     Sport
## 15 season    1824      3    688     Sport
```

---

# Keywords in context

Use `kwic()` to get a view of up to 1000 occurrences of a keyword in a given context window (default: 5 words before/after):

```r
kwic(guardian_tokens, "belarus") %>% 
  as_tibble()
```

```
## # A tibble: 66 x 7
##    docname from  to    pre                  keyword post                  pattern
##    <chr>   <int> <int> <chr>                <chr>   <chr>                 <fct>
##  1 959     609   609   and europe we went ~ belarus she said it was rea~ belarus
##  2 1633    445   445   jack on a stick as   belarus gives the uk a desu~ belarus
##  3 2033    321   321   that were stuck in ~ belarus and they were after~ belarus
##  4 2637    112   112   wants noah explaine~ belarus president alexander~ belarus
##  5 2945    62    62    the authoritarian p~ belarus and turkmenistan ov~ belarus
##  6 2978    196   196   countries president~ belarus has made the claim ~ belarus
##  7 3656    54    54    sporting plans alth~ belarus burundi tajikistan ~ belarus
##  8 3692    14    14    include thousands t~ belarus for ve day parade d~ belarus
##  9 3694    133   133   looked very differe~ belarus where elderly veter~ belarus
## 10 3901    350   350   action beyond the b~ belarus haaland's desire to~ belarus
## # ... with 56 more rows
```

---

# Keywords in context

Use `phrase()` for multi-word keywords and set the window size with `window`:

```r
kwic(guardian_tokens, phrase("champions league"), window = 3) %>% 
  as_tibble()
```

```
## # A tibble: 321 x 7
##    docname from  to    pre             keyword      post             pattern
##    <chr>   <int> <int> <chr>           <chr>        <chr>            <fct>
##  1 20      126   127   restart of the  champions l~ all competition~ champions ~
##  2 29      171   172   to swap probab~ champions l~ qualification a~ champions ~
##  3 42      1331  1332  performance in~ champions l~ fixture suggest~ champions ~
##  4 96      419   420   the league and  champions l~ and his selecti~ champions ~
##  5 113     45    46    scored in genk~ champions l~ defeat by liver~ champions ~
##  6 138     148   149   qualify for the champions l~ victory against~ champions ~
##  7 138     396   397   rather than the champions l~ however there w~ champions ~
##  8 155     202   203   scored in barc~ champions l~ final defeat to  champions ~
##  9 155     312   313   victory in the  champions l~ final in june    champions ~
## 10 223     480   481   bus carrying l~ champions l~ winners drive p~ champions ~
## # ... with 311 more rows
```

---

# Collocations

*Collocations* are combinations of words that directly follow each other and can be computed with `textstat_collocations()`. The `\(\lambda\)` parameter increases if *exactly* this combination of words is more common than the same words appearing in other combinations; the output is sorted by the associated `\(z\)` statistic. Note that this can be very computationally expensive, so adjust the `min_count` argument accordingly:

```r
guardian_tokens %>% 
  tokens_remove(stopwords("english")) %>% 
  textstat_collocations(min_count = 100) %>% 
  as_tibble()
```

```
## # A tibble: 615 x 6
##    collocation     count count_nested length lambda z
##    <chr>           <int>        <int>  <dbl>  <dbl> <dbl>
##  1 prime minister   1880            0      2   8.92  169.
##  2 last week        1567            0      2   5.33  168.
##  3 last year        1694            0      2   4.95  167.
##  4 social media     1074            0      2   6.67  157.
##  5 public health    1196            0      2   5.17  149.
##  6 chief executive   986            0      2   8.39  149.
##  7 white house       871            0      2   6.45  145.
##  8 years ago        1081            0      2   6.22  142.
##  9 human rights      756            0      2   7.45  141.
## 10 climate change    733            0      2   6.54  135.
## # ... with 605 more rows
```
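---

# Collocations

A common follow-up (not shown in the original example) is to merge strong collocations into single tokens so that, e.g., `prime_minister` is treated as one feature downstream. A hedged sketch using `tokens_compound()`; the z-score cutoff of 10 is an arbitrary choice:

```r
guardian_collocations <- guardian_tokens %>% 
  tokens_remove(stopwords("english")) %>% 
  textstat_collocations(min_count = 100)

# compound the strongest collocations in the original tokens object
guardian_tokens_compounded <- guardian_tokens %>% 
  tokens_compound(pattern = phrase(guardian_collocations$collocation[guardian_collocations$z > 10]))
```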
---

# Collocations

We can look for multi-word collocations of any size by adjusting the `size` parameter:

```r
guardian_tokens %>% 
  tokens_remove(stopwords("english")) %>% 
  textstat_collocations(min_count = 10, size = 4) %>% 
  as_tibble()
```

```
## # A tibble: 653 x 6
##    collocation                             count count_nested length lambda z
##    <chr>                                   <int>        <int>  <dbl>  <dbl> <dbl>
##  1 andrés manuel lópez obrador                18            0      4  12.9   2.96
##  2 new york los angeles                       10            0      4  10.9   2.93
##  3 prime minister narendra modi               19            0      4  11.0   2.82
##  4 crown prince mohammed bin                  16            0      4   9.91  2.81
##  5 kenan malik observer columnist             12            0      4  10.0   2.55
##  6 prime minister boris johnson               52            0      4   6.42  2.39
##  7 department education spokesperson said     13            0      4   4.41  2.26
##  8 prime minister viktor orbán                20            0      4   8.51  2.20
##  9 thousands inboxes every weekday            20            0      4   7.51  2.06
## 10 ruby princess cruise ship                  13            0      4   5.81  2.04
## # ... with 643 more rows
```

---

# Co-occurrences

*Co-occurrences* capture words appearing in the same document (and not just directly after each other). Co-occurrences are best represented as a *feature co-occurrence matrix* of size `n_features * n_features`. Create one with `fcm()`. Again, to decrease computational load, some trimming of the DFM may be useful:

```r
guardian_fcm <- guardian_dfm %>% 
  dfm_remove(stopwords("english")) %>% 
  dfm_trim(min_termfreq = 100, max_docfreq = .25, docfreq_type = "prop") %>% 
  fcm()
```

---

# Co-occurrences

```r
guardian_fcm
```

```
## Feature co-occurrence matrix of: 6,009 by 6,009 features.
##                features
## features        message everything prime minister  says fires carefully
##   message           293        237   436      567  1206    81        34
##   everything          0        590   468      616  4777   128        77
##   prime               0          0  2576     7549  2154   119       104
##   minister            0          0     0     4361  2928   197       156
##   says                0          0     0        0 42752   430       493
##   fires               0          0     0        0     0  1414         7
##   carefully           0          0     0        0     0     0        21
##   extraordinary       0          0     0        0     0     0         0
##   unprecedented       0          0     0        0     0     0         0
##   skill               0          0     0        0     0     0         0
##                features
## features        extraordinary unprecedented skill
##   message                  76            69    17
##   everything              156            98    51
##   prime                   151           226    21
##   minister                193           271    21
##   says                    696           652   243
##   fires                    41           139     6
##   carefully                21            18     7
##   extraordinary            48            55     6
##   unprecedented             0            68     7
##   skill                     0             0    13
## [ reached max_feat ... 5,999 more features, reached max_nfeat ... 5,999 more features ]
```

---

# Co-occurrences

A simple way to get at the most common co-occurrences is to transform the FCM into a tibble with the `tidy()` function:

```r
guardian_fcm %>% 
  tidy() %>% 
  filter(document != term) %>% 
  arrange(desc(count))
```

```
## # A tibble: 16,598,119 x 3
##    document  term     count
##    <chr>     <chr>    <dbl>
##  1 died      hospital 25139
##  2 died      family   16223
##  3 president trump    15829
##  4 trump     biden    14949
##  5 hospital  family   14809
##  6 trump     trump's  13384
##  7 hospital  covid-19 12021
##  8 died      worked   12013
##  9 trump     election 11424
## 10 died      covid-19 11209
## # ... with 16,598,109 more rows
```

---

# Lexical complexity

*Lexical complexity* may be indicated through a document's readability and lexical diversity.

`textstat_readability()` offers several readability measures, by default the `Flesch Reading Ease`, which is based on the average sentence length and the average syllable count per word (note that we need to use the corpus object in this case, as sentences are preserved here).
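For reference, the Flesch Reading Ease score is commonly computed as `\(206.835 - 1.015 \times \frac{\text{total words}}{\text{total sentences}} - 84.6 \times \frac{\text{total syllables}}{\text{total words}}\)`, so longer sentences and longer words both push the score down.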
Lower values indicate a lower readability:

```r
textstat_readability(guardian_corpus) %>% 
  as_tibble()
```

```
## # A tibble: 10,000 x 2
##    document Flesch
##    <chr>     <dbl>
##  1 1          39.6
##  2 2          60.7
##  3 3          48.7
##  4 4          52.5
##  5 5          42.0
##  6 6          46.9
##  7 7          45.8
##  8 8          55.2
##  9 9          59.9
## 10 10         47.6
## # ... with 9,990 more rows
```

---

# Lexical complexity

Accordingly, `textstat_lexdiv()` offers several measures to quantify the lexical diversity of documents. By default, the *Type-Token-Ratio* (unique tokens divided by the number of tokens per document) is computed. Note that the *TTR* is heavily influenced by document length:

```r
textstat_lexdiv(guardian_dfm) %>% 
  as_tibble()
```

```
## # A tibble: 10,000 x 2
##    document   TTR
##    <chr>    <dbl>
##  1 1        0.453
##  2 2        0.634
##  3 3        0.438
##  4 4        0.669
##  5 5        0.429
##  6 6        0.427
##  7 7        0.657
##  8 8        0.509
##  9 9        0.508
## 10 10       0.491
## # ... with 9,990 more rows
```

---

# Keyness

Finally, *keyness* (and accordingly `textstat_keyness()`) presents a measure of the distinctiveness of words for a certain (group of) documents as compared to other documents. For example, we can group our corpus by the `pillar` docvar (Arts, Lifestyle, News, Opinion, or Sport) and get the most distinctive terms for Sport documents:

```r
guardian_dfm %>% 
  dfm_group(pillar) %>% 
  textstat_keyness(target = "Sport") %>% 
  as_tibble()
```

```
## # A tibble: 135,480 x 5
##    feature    chi2     p n_target n_reference
##    <chr>     <dbl> <dbl>    <dbl>       <dbl>
##  1 league   14537.     0     2266         298
##  2 players  12498.     0     1962         270
##  3 game      8593.     0     1813         754
##  4 season    8592.     0     1824         770
##  5 football  6760.     0     1299         420
##  6 team      6221.     0     1770        1309
##  7 cup       6182.     0     1019         184
##  8 club      6046.     0     1292         554
##  9 player    4816.     0      828         181
## 10 ball      4537.     0      803         197
## # ... with 135,470 more rows
```
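---

# Keyness

Keyness results are often easier to interpret as a plot. A minimal sketch, assuming the `quanteda.textplots` package is installed (it is not loaded above):

```r
guardian_dfm %>% 
  dfm_group(pillar) %>% 
  textstat_keyness(target = "Sport") %>% 
  quanteda.textplots::textplot_keyness(n = 10)
```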
---

# Text description and word metrics

**Exercise 1: Text description**

`btw_tweets.csv` (on ILIAS) contains 1377 tweets by the three German chancellor candidates Annalena Baerbock, Armin Laschet & Olaf Scholz made in 2021, as obtained via Twitter's Academic API.

- Load the tweets into R and do the necessary preprocessing
- Investigate the tweets using the text and word metrics you just learned
  - What are the most common words?
  - What are the most common collocations?
  - What are the most distinct words per account?

<center><img src="https://media.giphy.com/media/LmNwrBhejkK9EFP504/giphy.gif"></center>

---
class: middle

# Dictionary-based methods

---

# Basics

*Dictionaries* contain a list of predefined words (or other features) that should represent a latent construct. This is probably the simplest way to automatically analyze texts for the presence of latent constructs.

--

At their core, dictionary-based methods just count the presence of the dictionary words in the documents. Usually, this is based on two (implicit) assumptions:

- **Bag-of-words**: Just like with many other automated text analysis methods, word order and thus semantic and syntactic relationships are ignored.
- **Additivity**: The more words from the dictionary are found in a document, the more pronounced the latent construct.

---

# Terminology

Dictionaries are commonly differentiated along two dimensions, the first being the source of the dictionary:

- **Organic** dictionaries are created for the specific research task from scratch, for example based on theoretical assumptions about the latent construct(s), by investigating the most common features, etc.
- **Off-the-shelf** dictionaries are pre-made, (hopefully) pre-validated dictionaries used for specific purposes, for example sentiment analysis.

--

Second, dictionaries may be either categorical or weighted:

- In **categorical** dictionaries, every word is valued the same.
- In **weighted** dictionaries, weights are assigned to words. For example, in a positivity dictionary, "love" may have a higher weight than "like".

---

# Applying categorical dictionaries

We start by applying categorical dictionaries to texts. In `quanteda`, dictionaries are simply created by passing a named list of the constructs represented in the dictionary, with each construct represented by a character vector of words.

--

For demonstration purposes, we create our own dictionary from the populism dictionary by [Rooduijn & Pauwels (2011)](https://www.tandfonline.com/doi/full/10.1080/01402382.2011.616665). Note that dictionary terms may include asterisks as placeholders:

```r
pop_words <- list(populism = c(
  "elit*", "consensus*", "undemocratic*", "referend*", "corrupt*",
  "propagand*", "politici*", "*deceit*", "*deceiv*", "shame*",
  "scandal*", "truth*", "dishonest*", "establishm*", "ruling*")
)
```

---

# Applying categorical dictionaries

We create the actual dictionary by using `quanteda`'s `dictionary()` function:

```r
pop_dictionary <- dictionary(pop_words)

pop_dictionary
```

```
## Dictionary object with 1 key entry.
## - [populism]:
##   - elit*, consensus*, undemocratic*, referend*, corrupt*, propagand*, politici*, *deceit*, *deceiv*, shame*, scandal*, truth*, dishonest*, establishm*, ruling*
```

---

# Applying categorical dictionaries

Applying the dictionary to our corpus is simple as well: we use the function `dfm_lookup()` on our DFM (remember, word order doesn't matter). This counts all dictionary features and reduces the dimensionality of the DFM to `n_documents * n_dictionary_constructs`:

```r
guardian_pop <- dfm(guardian_dfm) %>% 
  dfm_lookup(pop_dictionary)

guardian_pop
```

```
## Document-feature matrix of: 10,000 documents, 1 feature (74.61% sparse) and 5 docvars.
##     features
## docs populism
##    1        0
##    2        0
##    3        0
##    4        0
##    5        0
##    6        0
## [ reached max_ndoc ... 9,994 more documents ]
```

---

# Applying categorical dictionaries

`tidytext`'s `tidy()` function is again helpful for transforming and analyzing the results. For example, we can sort by count to get the IDs of the documents with the highest count of dictionary words:

```r
guardian_pop %>% 
  tidy() %>% 
  arrange(desc(count))
```

```
## # A tibble: 2,539 x 3
##    document term     count
##    <chr>    <chr>    <dbl>
##  1 526      populism    16
##  2 4257     populism    16
##  3 5610     populism    14
##  4 4799     populism    13
##  5 8717     populism    13
##  6 2727     populism    12
##  7 9436     populism    12
##  8 5169     populism    11
##  9 5761     populism    11
## 10 6214     populism    11
## # ... with 2,529 more rows
```

---

# Applying categorical dictionaries

Let's take a look at the article with the highest count of populism terms (i.e., the *most populist* article in our corpus):

```r
guardian_tibble %>% 
  filter(id == 526)
```

```
## # A tibble: 1 x 7
##      id title       body        url        date                pillar day
##   <int> <chr>       <chr>       <chr>      <dttm>              <chr>  <date>
## 1   526 ‘Middle Cl~ Democrats ~ https://w~ 2020-01-20 11:00:24 Opini~ 2020-01-20
```

It's the article [‘Middle Class’ Joe Biden has a corruption problem – it makes him a weak candidate | Zephyr Teachout](https://www.theguardian.com/commentisfree/2020/jan/20/joe-biden-corruption-donald-trump), an opinion piece about Joe Biden and the US election.
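---

# Applying categorical dictionaries

A quick way to check the face validity of such hits is to look at the matched dictionary terms in context. A sketch using `kwic()`, which also accepts dictionary objects as patterns (the document ID `526` comes from the output above):

```r
kwic(guardian_tokens, pattern = pop_dictionary, window = 4) %>% 
  as_tibble() %>% 
  filter(docname == "526")
```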
---

# Applying categorical dictionaries

Relying on raw counts ignores document length, though, so longer documents per se have a higher chance of including dictionary terms. It is thus a good idea to weight the DFM beforehand to get the share of dictionary terms among all terms of a document:

```r
guardian_pop_prop <- guardian_dfm %>% 
  dfm_weight(scheme = "prop") %>% 
  dfm_lookup(pop_dictionary)

guardian_pop_prop
```

```
## Document-feature matrix of: 10,000 documents, 1 feature (74.61% sparse) and 5 docvars.
##     features
## docs populism
##    1        0
##    2        0
##    3        0
##    4        0
##    5        0
##    6        0
## [ reached max_ndoc ... 9,994 more documents ]
```

---

# Applying categorical dictionaries

Let's again check the documents with the highest share of populism terms:

```r
guardian_pop_prop %>% 
  tidy() %>% 
  arrange(desc(count))
```

```
## # A tibble: 2,539 x 3
##    document term      count
##    <chr>    <chr>     <dbl>
##  1 4799     populism 0.0216
##  2 526      populism 0.0171
##  3 5141     populism 0.0163
##  4 5761     populism 0.0146
##  5 4257     populism 0.0143
##  6 6259     populism 0.0139
##  7 188      populism 0.0136
##  8 5169     populism 0.0130
##  9 4817     populism 0.0126
## 10 6597     populism 0.0124
## # ... with 2,529 more rows
```

---

# Applying categorical dictionaries

One handy tool in applying dictionaries is `dfm_group()`. For example, we can group the DFM by `day` before applying the dictionary to get the share of populism terms in Guardian articles on each day:

```r
guardian_pop_by_day <- guardian_dfm %>% 
  dfm_group(day) %>% 
  dfm_weight(scheme = "prop") %>% 
  dfm_lookup(pop_dictionary)

guardian_pop_by_day
```

```
## Document-feature matrix of: 366 documents, 1 feature (0.00% sparse) and 1 docvar.
##             features
## docs            populism
##   2020-01-01 0.0006833869
##   2020-01-02 0.0004933129
##   2020-01-03 0.0007507508
##   2020-01-04 0.0004430268
##   2020-01-05 0.0002653576
##   2020-01-06 0.0012358648
## [ reached max_ndoc ... 360 more documents ]
```

---

# Applying categorical dictionaries

Let's plot this. When would we expect the highest share of populist terms?

```r
p_pop_guardian_by_day <- guardian_pop_by_day %>% 
  tidy() %>% 
  mutate(day = as.Date(document)) %>% 
  ggplot(aes(x = day, y = count)) +
  geom_line() +
  theme_classic() +
  scale_y_continuous(labels = scales::percent) +
  labs(x = NULL, y = "Share of populism terms")
```

---

# Applying categorical dictionaries

```r
p_pop_guardian_by_day
```

<!-- -->
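---

# Applying categorical dictionaries

Daily shares are quite noisy, so a smoothed trend can be easier to read. A sketch of one possible variation (for a series of this length, `geom_smooth()` defaults to a LOESS smoother):

```r
guardian_pop_by_day %>% 
  tidy() %>% 
  mutate(day = as.Date(document)) %>% 
  ggplot(aes(x = day, y = count)) +
  geom_line(color = "grey70") +
  geom_smooth(se = FALSE) +
  theme_classic() +
  scale_y_continuous(labels = scales::percent) +
  labs(x = NULL, y = "Share of populism terms")
```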
---

# Applying categorical dictionaries

**Exercise 2: Applying categorical dictionaries**

The [Bing Liu opinion lexicon](https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#lexicon) is a widely used, multi-categorical dictionary for sentiment analysis, including ~6000 terms indicating positive and negative sentiment. The word lists are stored in separate files (`positive-words.txt` and `negative-words.txt`) on ILIAS. Load them into R with `scan()`:

```r
positive_words <- scan("data/positive-words.txt", what = character(), skip = 30)
negative_words <- scan("data/negative-words.txt", what = character(), skip = 31)
```

---

# Applying categorical dictionaries

**Exercise 2: Applying categorical dictionaries**

Then:

- create a `quanteda` dictionary with the two categories "positive" and "negative"
- apply the dictionary to the Guardian corpus
- investigate the difference between weighting the DFM proportionally before and after applying the dictionary
- plot the sentiment by day

<center><img src="https://media.giphy.com/media/LmNwrBhejkK9EFP504/giphy.gif"></center>

---

# Applying weighted dictionaries

Applying weighted dictionaries is simple as well, but relies on `tidytext` again.

The `tidytext` package also provides the function `get_sentiments()` to access common sentiment dictionaries. The AFINN dictionary is one widely used weighted dictionary:

```r
get_sentiments("afinn")
```

```
## # A tibble: 2,477 x 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # ... with 2,467 more rows
```

---

# Applying weighted dictionaries

In the `tidytext` style, applying dictionaries is just joining them with an unnested text corpus. Note that using `inner_join()` throws out all terms not found in the dictionary - if you want to preserve those terms, use `left_join()` instead:

```r
guardian_afinn_sentiments <- guardian_tibble %>% 
  unnest_tokens(word, body) %>% 
  select(id, day, word) %>% 
  inner_join(get_sentiments("afinn"))
```

```
## Joining, by = "word"
```

```r
guardian_afinn_sentiments
```

```
## # A tibble: 421,362 x 4
##       id day        word      value
##    <int> <date>     <chr>     <dbl>
##  1     1 2020-01-01 carefully     2
##  2     1 2020-01-01 disaster     -2
##  3     1 2020-01-01 no           -1
##  4     1 2020-01-01 disasters    -2
##  5     1 2020-01-01 terrible     -3
##  6     1 2020-01-01 spirit        1
##  7     1 2020-01-01 disasters    -2
##  8     1 2020-01-01 panic        -3
##  9     1 2020-01-01 fire         -2
## 10     1 2020-01-01 crisis       -3
## # ... with 421,352 more rows
```

---

# Applying weighted dictionaries

We can now use `tidyverse` functions to group and summarise sentiment, for example per day:

```r
p_guardian_sentiment_afinn <- guardian_afinn_sentiments %>% 
  group_by(day) %>% 
  summarise(sentiment = mean(value)) %>% 
  ggplot(aes(x = day, y = sentiment)) +
  geom_line(color = "blue") +
  geom_hline(yintercept = 0, linetype = "dashed") +
  theme_classic() +
  labs(x = NULL, y = "Sentiment")
```

---

# Applying weighted dictionaries

```r
p_guardian_sentiment_afinn
```

<!-- -->
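---

# Applying weighted dictionaries

Note that this averages over all matched tokens and thus weights long articles more heavily. An alternative sketch first computes a mean score per article and then averages those per day; which aggregation is appropriate depends on your research question:

```r
guardian_afinn_sentiments %>% 
  group_by(id, day) %>% 
  summarise(doc_sentiment = mean(value), .groups = "drop") %>% 
  group_by(day) %>% 
  summarise(sentiment = mean(doc_sentiment))
```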
---

# Validating dictionaries

Now to the million-dollar question: Do the values we just computed actually represent sentiment?

--

**Validating** the results is arguably the most important task of not just dictionary-based methods, but also automated content analysis in general. Three common ways of validation include:

- Comparing the results with (manual) gold standards
- Computing data fit indices
- Investigating meaningful relationships of results with other variables in the data (e.g., a terrorism dictionary should lead to higher scores in the aftermath of terrorist attacks)

---

# Validating dictionaries with `oolong`

The [`oolong`](https://github.com/chainsawriot/oolong) package provides a simple way for gold-standard validation directly in R. As it is still in early active development, the latest development version is usually the best choice:

```r
remotes::install_github("chainsawriot/oolong")
```

--

As always, load it with `library()`:

```r
library(oolong)
```

---

# Validating dictionaries with `oolong`

We first create a random sample of our data for the gold standard test with the `gs()` function, indicating the construct to validate. Note that it is suggested to use at least 1% of the data for validation, but for demonstration purposes, let's stick to a smaller number of 20 articles:

```r
gs_test <- gs(input_corpus = guardian_corpus, construct = "positive", exact_n = 20, userid = "Julian")
gs_test
```

```
## 
```

```
## -- oolong (gold standard generation) -------------------------------------------
```

```
## :) Julian
```

```
## i GS: n = 20, 0 coded.
```

```
## i Construct: positive.
```

```
## 
```

```
## -- Methods --
```

```
## 
```

```
## * <$do_gold_standard_test()>: generate gold standard
```

```
## * <$lock()>: finalize this object and see the results
```

---

# Validating dictionaries with `oolong`

As outlined in the resulting object, we can now start coding the data (and thus providing a manual gold standard) by using the method `$do_gold_standard_test()`:

```r
gs_test$do_gold_standard_test()
```

This opens a coding window in RStudio's *Viewer* pane:

---

# Validating dictionaries with `oolong`

*(Screenshot of the oolong coding interface in the Viewer pane)*

---

# Validating dictionaries with `oolong`

After you have finished coding the data, `$lock()` it to perform the actual gold standard test:

```r
gs_test$lock()
```

---

# Validating dictionaries with `oolong`

We can now apply our dictionary as before by using the `$turn_gold()` method. This creates a `quanteda` corpus:

```r
gs_corpus <- gs_test$turn_gold()
gs_corpus
```

```
## Corpus consisting of 20 documents and 1 docvar.
## 2476 :
## "A meat-eating dinosaur with a feathered body, iron grip and ..."
## 
## 2501 :
## "Three weeks ago, Tony Robinson completed a six-part series f..."
## 
## 4695 :
## "My husband and I run a quirky, colourful music bar in Herefo..."
## 
## 487 :
## "It’s time to go rogue with your eyeliner. Many SS20 catwalks..."
## 
## 8787 :
## "The funniest sketch I’ve ever seen … Siblings – a hilarious ..."
## 
## 2874 :
## "Americans consistently rate the Fox News Channel as one of t..."
## 
## [ reached max_ndoc ... 14 more documents ]
```

```
## i Access the answer from the coding with quanteda::docvars(obj, 'answer')
```

---

# Validating dictionaries with `oolong`

Let's apply the dictionary just as before (here, the `liu_dict` sentiment dictionary created in Exercise 2):

```r
gs_dict <- gs_corpus %>% 
  tokens() %>% 
  dfm() %>% 
  dfm_weight(scheme = "prop") %>% 
  dfm_lookup(liu_dict)

gs_dict
```

```
## Document-feature matrix of: 20 documents, 2 features (2.50% sparse) and 1 docvar.
##       features
## docs     positive   negative
##   2476 0.02156334 0.01617251
##   2501 0.02357724 0.01788618
##   4695 0.02657807 0.02214839
##   487  0.04215852 0.02866779
##   8787 0.01980198 0.03217822
##   2874 0.03694268 0.05095541
## [ reached max_ndoc ... 14 more documents ]
```

---

# Validating dictionaries with `oolong`

We need one value per document to compare our manual codings to:

```r
gs_values <- gs_dict %>% 
  convert("data.frame") %>% 
  mutate(sentiment = positive - negative) %>% 
  pull(sentiment)

gs_values
```

```
##  [1]  0.0053908356  0.0056910569  0.0044296788  0.0134907251 -0.0123762376
##  [6] -0.0140127389 -0.0078843627  0.0189393939  0.0091324201  0.0132248220
## [11] -0.0241545894 -0.0245231608  0.0035569106 -0.0186766275 -0.0126715945
## [16]  0.0009569378 -0.0103412616  0.0017889088 -0.0063391442 -0.0343137255
```

---

# Validating dictionaries with `oolong`

Finally, use the `summarize_oolong()` function to get the test results:

```r
gs_results <- summarize_oolong(gs_test, target_value = gs_values)
gs_results
```

---

# Validating dictionaries with `oolong`

The summary object also includes a `plot()` method that displays various important measures at once:

```r
plot(gs_results)
```

<!-- -->

---

# Dictionaries and beyond

Improve dictionary-based methods by:

- Including negating bigrams (see the sketch below)
- Removing common sources of error (phrases like "good bye", etc.)
- Minding the context the dictionary was developed for
- *Always* (re-)validating dictionaries
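A rough illustration of the first point (a sketch, not a full negation handler): compounding bigrams that start with a negator turns e.g. "not good" into the single token `not_good`, which a plain positivity dictionary no longer matches:

```r
# assumption: this small set of negators is only for illustration
guardian_tokens_negated <- guardian_tokens %>% 
  tokens_compound(pattern = phrase(c("not *", "no *", "never *")))
```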
--

Dictionaries provide a simple way for classifying documents into latent constructs. Supervised machine learning classification may drastically improve such classifications, but also comes with increased effort.

For example, look at [Rudkowsky et al., 2018](https://www.tandfonline.com/doi/full/10.1080/19312458.2018.1455817) for a word embeddings approach towards sentiment analysis.

---
class: middle

# Exercise solutions

---

# Exercise solutions

**Exercise 1: Text description**

First, load the tweets (remember to explicitly read in Twitter IDs as character):

```r
btw_tweets <- read_csv("data/tweets_btw.csv", col_types = list(id = col_character()))
```

Then, create a corpus:

```r
btw_corpus <- corpus(btw_tweets, docid_field = "id", text_field = "text")
```

---

# Exercise solutions

There are of course multiple possibilities for text preprocessing. This way, we remove most of the (probably) unwanted features:

```r
btw_tokens <- tokens(btw_corpus,
                     remove_punct = TRUE,
                     remove_symbols = TRUE,
                     remove_numbers = TRUE,
                     remove_url = TRUE,
                     remove_separators = TRUE) %>% 
  tokens_tolower() %>% 
  tokens_remove(c(stopwords("german", "nltk"), "rt", "#*", "@*")) %>% 
  tokens_select(min_nchar = 2) %>% 
  tokens_keep("\\w+", valuetype = "regex")
```

We will also need a DFM:

```r
btw_dfm <- dfm(btw_tokens)
```

---

# Exercise solutions

The rest is just applying the various text and word metrics functions. For example, get a list of the most frequent words per account:

```r
textstat_frequency(btw_dfm, n = 3, groups = author)
```

```
##   feature           frequency rank docfreq group
## 1 the               26        1    21      ABaerbock
## 2 heute             23        2    23      ABaerbock
## 3 mehr              22        3    21      ABaerbock
## 4 heute             32        1    30      ArminLaschet
## 5 the               23        2    8       ArminLaschet
## 6 ministerpräsident 22        3    22      ArminLaschet
## 7 heute             85        1    81      OlafScholz
## 8 mehr              76        2    67      OlafScholz
## 9 müssen            66        3    63      OlafScholz
```

---

# Exercise solutions

Or all collocations in the tweets:

```r
textstat_collocations(btw_tokens)
```

```
##    collocation               count count_nested length lambda   z
## 1  ab uhr                    17    0            2      6.394060 16.01189
## 2  bürger innen              16    0            2      5.769122 14.72774
## 3  sagt bundesfinanzminister 13    0            2      5.455357 14.10808
## 4  herzlichen glückwunsch    15    0            2      8.716410 13.77752
## 5  geht's los                12    0            2      7.930611 13.42734
## 6  unserer gesellschaft      10    0            2      5.676450 13.21986
## 7  bürgerinnen bürger        12    0            2      7.832576 12.93256
## 8  gleich geht's             8     0            2      6.689422 12.36857
## 9  live dabei                8     0            2      5.419750 11.97853
## 10 dafür sorgen              11    0            2      6.067464 11.93917
## 11 vielen dank               7     0            2      6.261835 11.88686
## 12 europäische union         7     0            2      6.153480 11.80651
## 13 gutes gespräch            6     0            2      6.469644 11.37955
## 14 seit jahren               7     0            2      5.498415 11.16812
## 15 of the                    9     0            2      4.135198 10.64397
## 16 gesellschaft respekts     6     0            2      6.237010 10.63205
## [ reached 'max' / getOption("max.print") -- omitted 665 rows ]
```

---

# Exercise solutions

For keyness, you first need to group the DFM per author and then set the target account:

```r
btw_dfm %>% 
  dfm_group(author) %>% 
  textstat_keyness(target = "ABaerbock")
```

```
##    feature     chi2      p            n_target n_reference
## 1  from        25.808169 3.770891e-07 9        0
## 2  is          23.007328 1.613850e-06 15       8
## 3  born        22.494735 2.107204e-06 8        0
## 4  klimaschutz 20.384591 6.333776e-06 14       8
## 5  jewish      19.187319 1.184980e-05 7        0
## 6  kinder      18.305508 1.881623e-05 16       12
## 7  to          17.086892 3.570791e-05 18       16
## 8  of          16.084673 6.057230e-05 21       22
## 9  girl        15.888712 6.717818e-05 6        0
## 10 herzlichen  15.709383 7.385688e-05 13       8
## 11 this        15.176496 9.791462e-05 8        2
## 12 and         13.632950 2.222504e-04 18       19
## 13 been        12.603943 3.849338e-04 5        0
## 14 deported    12.603943 3.849338e-04 5        0
## 15 more        12.603943 3.849338e-04 5        0
## 16 winfried    12.603943 3.849338e-04 5        0
## 17 jahre       12.248282 4.656868e-04 15       15
## 18 kind        12.235172 4.689701e-04 7        2
## 19 are         10.497441 1.195400e-03
8 4 ## 20 verloren 9.392023 2.179316e-03 6 2 ## [ reached 'max' / getOption("max.print") -- omitted 6143 rows ] ``` --- # Exercise solutions ```r btw_dfm %>% dfm_group(author) %>% textstat_keyness(target = "OlafScholz") ``` ``` ## feature chi2 p n_target n_reference ## 1 bundesfinanzminister 30.994409 2.587728e-08 45 0 ## 2 uhr 22.347416 2.275190e-06 43 3 ## 3 innen 21.986248 2.746111e-06 60 9 ## 4 geht 20.749690 5.234004e-06 58 9 ## 5 gesellschaft 20.142413 7.188483e-06 33 1 ## 6 dafür 19.743496 8.856255e-06 59 10 ## 7 respekt 18.771061 1.473867e-05 31 1 ## 8 schaltet 15.130191 1.003456e-04 22 0 ## 9 spd 15.015281 1.066442e-04 32 3 ## 10 gibt 13.852374 1.977467e-04 36 5 ## 11 schaffen 13.301928 2.651333e-04 23 1 ## 12 live 13.100998 2.951384e-04 32 4 ## 13 kanzlerkandidat 13.064438 3.009554e-04 19 0 ## 14 plan 12.376033 4.348801e-04 18 0 ## 15 sagt 11.234277 8.030039e-04 49 12 ## 16 ganz 11.201510 8.173081e-04 29 4 ## 17 ostdeutschland 10.311353 1.322143e-03 15 0 ## 18 darum 9.820005 1.726239e-03 24 3 ## 19 bürger 9.336305 2.246581e-03 26 4 ## 20 geht's 9.229010 2.382104e-03 17 1 ## [ reached 'max' / getOption("max.print") -- omitted 6143 rows ] ``` --- # Exercise solutions ```r btw_dfm %>% dfm_group(author) %>% textstat_keyness(target = "ArminLaschet") ``` ``` ## feature chi2 p n_target n_reference ## 1 ministerpräsident 91.332497 0.000000e+00 22 1 ## 2 nordrhein-westfalen 69.321275 1.110223e-16 16 0 ## 3 de 36.070796 1.902772e-09 12 3 ## 4 gespräch 27.794642 1.348992e-07 13 7 ## 5 modernisierungsjahrzehnt 27.315149 1.728519e-07 7 0 ## 6 la 22.329953 2.295973e-06 7 1 ## 7 düsseldorf 18.054805 2.146362e-05 5 0 ## 8 nrw-ministerpräsident 18.054805 2.146362e-05 5 0 ## 9 et 13.617375 2.241018e-04 5 1 ## 10 tweet 13.462455 2.433851e-04 4 0 ## 11 wolfgang 13.462455 2.433851e-04 4 0 ## 12 minister 13.045333 3.040411e-04 7 4 ## 13 with 10.847924 9.890656e-04 8 7 ## 14 armin 10.508752 1.188105e-03 5 2 ## 15 freund 9.455777 2.104851e-03 4 1 ## 16 präsidenten 9.455777 2.104851e-03 4 1 ## 17 austausch 9.446190 2.115881e-03 8 8 ## 18 gutes 9.446190 2.115881e-03 8 8 ## 19 on 9.446190 2.115881e-03 8 8 ## 20 außenminister 8.928127 2.808121e-03 3 0 ## [ reached 'max' / getOption("max.print") -- omitted 6143 rows ] ``` --- # Exercise solutions **Exercise 2: Applying dictionaries** Create the dictionary by creating a list of the two constructs and pass it to the `dictionary()` function: ```r liu_dict <- dictionary(list( positive = positive_words, negative = negative_words )) ``` --- # Exercise solutions Weighting the DFM before applying the dictionary gives the proportion of *construct terms* in the document: ```r guardian_dfm %>% dfm_weight(scheme = "prop") %>% dfm_lookup(liu_dict) ``` ``` ## Document-feature matrix of: 10,000 documents, 2 features (0.92% sparse) and 5 docvars. ## features ## docs positive negative ## 1 0.02152080 0.03873745 ## 2 0.03658537 0.02439024 ## 3 0.02188184 0.01969365 ## 4 0.02828283 0.03232323 ## 5 0.01991150 0.01880531 ## 6 0.03152174 0.01630435 ## [ reached max_ndoc ... 9,994 more documents ] ``` --- # Exercise solutions Weighting the DFM after applying the dictionary gives the proportion of *constructs* in the document (ignoring all other terms): ```r guardian_dfm %>% dfm_lookup(liu_dict) %>% dfm_weight(scheme = "prop") ``` ``` ## Document-feature matrix of: 10,000 documents, 2 features (0.92% sparse) and 5 docvars. 
## features ## docs positive negative ## 1 0.3571429 0.6428571 ## 2 0.6000000 0.4000000 ## 3 0.5263158 0.4736842 ## 4 0.4666667 0.5333333 ## 5 0.5142857 0.4857143 ## 6 0.6590909 0.3409091 ## [ reached max_ndoc ... 9,994 more documents ] ``` --- # Exercise solutions If we use the second way (proportion of constructs), we only need to plot one category; 50% then marks the transition from predominantly positive to predominantly negative sentiment: ```r p_guardian_sentiment_liu <- guardian_dfm %>% dfm_group(day) %>% dfm_lookup(liu_dict) %>% dfm_weight(scheme = "prop") %>% tidy() %>% filter(term == "positive") %>% mutate(day = as.Date(document)) %>% ggplot(aes(x = day, y = count)) + geom_line(color = "blue") + geom_hline(yintercept = .5, linetype = "dashed") + theme_classic() + scale_y_continuous(labels = scales::percent) + labs(x = NULL, y = "Share of positive sentiment") ``` --- # Exercise solutions ```r p_guardian_sentiment_liu ``` <!-- --> --- class: middle # Thanks Credits: - Slides created with [`xaringan`](https://github.com/yihui/xaringan) - Title image by [Joshua Hoehne / Unsplash](https://unsplash.com/photos/j2Qa8culzDY) - Coding cat gif by [Memecandy/Giphy](https://giphy.com/gifs/memecandy-LmNwrBhejkK9EFP504)