A little bit of Monica in my life

One analysis in the R community that caught my attention is Hilary Parker’s analysis of the most poisoned baby name in US history. I was surprised that my own name didn’t show up in the analysis. If Hilary had a huge loss in 1993, what happened to Monica in 1999?

For a period when I was a kid, the name ‘Monica’ was what made the adults turn down the NPR broadcast. Its entrance into popular culture through the Clinton impeachment, a shoutout in Mambo No. 5, and its use as the name of the least likable roommate on Friends (#bossy)[1] wasn’t exactly the kind of material that was going to make me cool.

But now I am fond of my name again and I’m also looking back on that cultural moment with a new lens thanks to Monica Lewinsky’s awesome talk and essay. Still, with the babynames package on CRAN, I had to take a look at where ‘Monica’ falls on the poisoned names list.

🎙 One, two, three, four, five 🎙

The babynames package

Since Hilary conducted her analysis, it has become much easier to get the baby names data because it is now available as a package on CRAN. The data frame includes the year, sex, name, and frequency of the name (n). It also includes prop, the proportion of babies of that sex born in that year who were given that name. One other difference is that we can now calculate the relative risk using the tidyverse.

library(babynames)
head(babynames)
## # A tibble: 6 x 5
##    year sex   name          n   prop
##   <dbl> <chr> <chr>     <int>  <dbl>
## 1  1880 F     Mary       7065 0.0724
## 2  1880 F     Anna       2604 0.0267
## 3  1880 F     Emma       2003 0.0205
## 4  1880 F     Elizabeth  1939 0.0199
## 5  1880 F     Minnie     1746 0.0179
## 6  1880 F     Margaret   1578 0.0162
str(babynames)
## tibble [1,924,665 × 5] (S3: tbl_df/tbl/data.frame)
##  $ year: num [1:1924665] 1880 1880 1880 1880 1880 1880 1880 1880 1880 1880 ...
##  $ sex : chr [1:1924665] "F" "F" "F" "F" ...
##  $ name: chr [1:1924665] "Mary" "Anna" "Emma" "Elizabeth" ...
##  $ n   : int [1:1924665] 7065 2604 2003 1939 1746 1578 1472 1414 1320 1288 ...
##  $ prop: num [1:1924665] 0.0724 0.0267 0.0205 0.0199 0.0179 ...

In Hilary’s analysis she looked at the top 1000 names in a given year, so I will follow that methodology.

Data wrangling

First I will limit the data set to the top 1000 names in a year and then calculate the relative risk and percentage loss.

As Hilary explains in her post, the relative risk is a measure used to compare proportions. In public health we use it to compare the proportion of people exposed to something who get a disease to the proportion of unexposed people who get the disease. For example, if 10 out of 100 people (10%) who get a flu shot end up getting the flu in a given year, and 15 out of 100 people (15%) who don’t get a flu shot end up getting the flu in a given year, then we can divide 0.10 by 0.15 to get a relative risk of 0.67. This means that people who get the flu shot have 0.67 times the risk of getting the flu compared to those who don’t get the flu shot. In Hilary’s analysis, she expresses the relative risk as a loss percent to help us think of it as a decrease. In the flu example, the loss percent would be (1-0.67)*100 = 33, meaning 33% less likely.
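
As a quick check on the flu numbers, here is a minimal sketch in base R. The variable names are made up for illustration and are not part of the babynames analysis.

```r
# Flu example from the text: 10 of 100 vaccinated people and
# 15 of 100 unvaccinated people get the flu.
risk_exposed   <- 10 / 100   # risk among people who got the flu shot
risk_unexposed <- 15 / 100   # risk among people who did not

relrisk  <- risk_exposed / risk_unexposed   # relative risk
loss_pct <- (1 - relrisk) * 100             # relative risk expressed as a percent decrease

round(relrisk, 2)   # 0.67
round(loss_pct)     # 33
```

The same two lines of arithmetic (a ratio, then one minus the ratio times 100) are what the relrisk and loss_pct columns hold for each name and year.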

Here’s how we can calculate this in the tidyverse.

library(tidyverse)

top1000 <- babynames %>%
  group_by(sex, year) %>%
  arrange(desc(n)) %>%
  mutate(rank = row_number()) %>%
  filter(rank <= 1000) %>%
  group_by(sex, name) %>%
  arrange(year) %>%
  mutate(relrisk = prop/lag(prop),
         loss_pct = (1-relrisk)*100)

The only problem with doing it like this is that if a name drops out of the top 1000 and later returns, lag() pulls the proportion from the name’s last top-1000 year rather than the previous calendar year, so the calculation is off. For example, look at the name Aarush. Aarush was in the top 1000 in 2010, dipped below the top 1000 from 2011 to 2014, and was back in the top 1000 in 2015.

library(knitr)

top1000 %>%
  filter(name == "Aarush") %>%
  kable()
year  sex  name    n    prop       rank  relrisk    loss_pct
2010  M    Aarush  227  0.0001106  900   NA         NA
2015  M    Aarush  211  0.0001035  974   0.9358163  6.418369

Ok, let’s try this again.

top1000 <- babynames %>%
  group_by(sex, year) %>%
  arrange(desc(n)) %>%
  mutate(rank = row_number()) %>%
  filter(rank <= 1000) %>%
  group_by(sex, name) %>%
  arrange(year) %>%
  mutate(relrisk = ifelse(year == lag(year)+1, prop/lag(prop), NA),
         loss_pct = (1-relrisk)*100) %>%
  ungroup()

It works!

top1000 %>%
  group_by(sex, name) %>%
  filter(year != lag(year)+1) %>%
  arrange(name, year, sex)
## # A tibble: 13,573 x 8
## # Groups:   sex, name [4,213]
##     year sex   name       n      prop  rank relrisk loss_pct
##    <dbl> <chr> <chr>  <int>     <dbl> <int>   <dbl>    <dbl>
##  1  2017 M     Aaden    240 0.000122    888      NA       NA
##  2  2015 M     Aarush   211 0.000104    974      NA       NA
##  3  1882 M     Ab         5 0.0000410   943      NA       NA
##  4  1885 M     Ab         6 0.0000518   836      NA       NA
##  5  1887 M     Ab         5 0.0000457   921      NA       NA
##  6  1890 M     Abb        6 0.0000501   884      NA       NA
##  7  1891 M     Abbie      5 0.0000458   964      NA       NA
##  8  1937 F     Abbie     52 0.0000472   964      NA       NA
##  9  1942 F     Abbie     65 0.0000468   968      NA       NA
## 10  1957 F     Abbie    120 0.0000572   938      NA       NA
## # … with 13,563 more rows
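
For a self-contained look at what the year == lag(year)+1 guard does, here is a small sketch on just the two Aarush rows. It assumes dplyr is installed; the toy data frame and the column names naive and guarded are made up for illustration, and the props are the rounded values printed in the table above.

```r
library(dplyr)

# Toy data: the two Aarush rows shown earlier (props as printed, so rounded)
aarush <- data.frame(
  year = c(2010, 2015),
  prop = c(0.0001106, 0.0001035)
)

gap_check <- aarush %>%
  arrange(year) %>%
  mutate(
    # Unguarded version: compares 2015 with 2010, five years apart
    naive   = prop / lag(prop),
    # Guarded version: only compares consecutive calendar years
    guarded = ifelse(year == lag(year) + 1, prop / lag(prop), NA)
  )

gap_check
```

The naive column carries a ratio for 2015 even though the previous row is from 2010, while the guarded column is NA in both rows because there is no consecutive-year pair to compare.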

Biggest percent drops

OK, let’s see who has the biggest drops! Did I reproduce the list?

top1000 %>%
  arrange(desc(loss_pct)) %>%
  filter(sex == "F") %>%
  filter(row_number() <= 14) %>%
  select(name, loss_pct, year, n) %>%
  kable()
name        loss_pct  year  n
Farrah      78.08533  1978  332
Dewey       74.43853  1899  24
Catina      73.58773  1974  329
Khadijah    72.48679  1995  438
Deneen      71.88842  1965  421
Hilary      70.19101  1993  343
Katina      69.31745  1974  765
Renata      69.02648  1981  224
Iesha       68.91567  1992  581
Clementine  68.82256  1881  6
Minna       67.88056  1883  7
Ashanti     67.84702  2003  962
Infant      67.48330  1991  187
Tennille    66.79955  1978  141

These rankings are slightly different than Hilary’s but pretty darn close! I think the difference comes from the fact that I didn’t round the loss percent. Close enough!

🎺 Jump up and down and move it all around 🎺

So where does Monica stand? Let’s take a look at its loss percent compared to the other names.

Finding Monica in babynames

First I filtered the data set to see how the loss percent for Monica compares to the list of the top poisoned names.

top1000 %>%
  filter(name == "Monica") %>%
  select(name, year, loss_pct, n) %>%
  arrange(desc(loss_pct)) 
## # A tibble: 136 x 4
##    name    year loss_pct     n
##    <chr>  <dbl>    <dbl> <int>
##  1 Monica  1999     34.2  2133
##  2 Monica  1902     32.0    33
##  3 Monica  1998     24.7  3229
##  4 Monica  1914     21.8   156
##  5 Monica  1899     18.9    30
##  6 Monica  1919     18.3   178
##  7 Monica  1888     18.0    13
##  8 Monica  1918     17.3   223
##  9 Monica  1936     17.1   273
## 10 Monica  2013     16.8   597
## # … with 126 more rows

Only a 34% loss for Monica in 1999 compared to 70% for Hilary in 1993 and 78% for Farrah in 1978.

Graphing it

Let’s take a look at the names side by side using ggplot2.

top1000 %>%
  mutate(percent = prop*100) %>%
  filter(name == "Monica" | name == "Hilary" | name == "Hillary") %>%
  ggplot(aes(year, percent, colour = name)) +
  geom_line() +
  labs(title = "Percent of girls given the name Hilary/Hillary or Monica over time",
       y = "Percent",
       x = "Year")

Wow, that really changes my perspective. Hilary may have the largest loss percent in a given year, but was the impact of the drop in Monicas greater because there were more Monicas to begin with? It’s hard to tell. It looks like there was already a downward trend after the 1970s, but the trend started to level off in the 1990s (maybe Monica from Friends wasn’t so unpopular after all?!) before plummeting in ’98 and ’99.

🎵 And if it looks like this then you’re doing it right 🎵

The risk difference

In public health we use both the relative risk[2] and the risk difference. They provide two different perspectives on the same information, and the usefulness of each measure depends on what question you are hoping to answer[3]. When the question is about the population-level impact of a factor on an outcome, the risk difference is the more useful measure. When the question is about the strength of an association, the relative risk is the better option. The question about which baby name was the most poisoned (that is, which name had the steepest relative drop in popularity) is a question about the strength of an association.

However, when I looked at the plot above, a new question came to mind: did the drop in the baby name Monica have a greater impact in terms of the overall number of babies being named in the 90s?

In more detail, the risk difference (RD) is the difference in proportions. It tells us what the excess risk of disease is among those who have been exposed to something vs. those who have not been exposed. In a study on alcohol use and breast cancer[4], a 40-year-old woman has an absolute risk of 1.45 percent of developing breast cancer in the next 10 years. If she is a light drinker, this risk becomes 1.51 percent. The risk ratio is 1.51/1.45 = 1.04, or a 4% increase in risk, which seems like a non-negligible amount. However, the risk difference is 1.51-1.45 = 0.06 percentage points, or an excess risk of 0.06 per 100. That’s 6 people for every 10,000 light drinkers, which is a pretty low population impact. When you put it in absolute terms, the RD brings a different perspective to certain types of risk.
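
To put the two measures side by side for the breast cancer numbers, here is a minimal sketch in base R. The variable names are made up for illustration; the two risks are the ones quoted above.

```r
# Ten-year absolute risks quoted above, in percent
risk_baseline      <- 1.45
risk_light_drinker <- 1.51

risk_ratio <- risk_light_drinker / risk_baseline   # relative risk
risk_diff  <- risk_light_drinker - risk_baseline   # risk difference, in percentage points

round(risk_ratio, 2)     # 1.04
round(risk_diff * 100)   # 6 extra cases per 10,000 women
```

Multiplying by 100 converts a difference expressed per 100 into one expressed per 10,000, which is the same rescaling idea used for the excess risk column in the next section.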

The risk difference in babynames

What would the risk difference say about these poisoned names? I’ll calculate the risk differences in the babynames data set. In this analysis, we can think of the year as the “exposure” and the name as the “outcome.” We can then calculate the “risk” or probability of being named Monica in any given year compared to the prior year. To make this more interpretable, I also calculated the excess risk per 10,000 by multiplying the risk difference by 10,000: how many more (or fewer) babies per 10,000 were given a name compared to the previous year.

rds <- top1000 %>%
  group_by(sex, name) %>%
  arrange(year) %>%
  mutate(riskdiff = ifelse(year == lag(year)+1, prop - lag(prop), NA),
         excessrisk_per10000 = round(riskdiff*10000,1)) %>%
  ungroup()

rds %>%
  filter(sex == "F") %>%
  arrange(riskdiff) %>%
  filter(row_number() <= 14) %>%
  select(year, name, n, relrisk, loss_pct, riskdiff, excessrisk_per10000) %>%
  kable()
year  name      n      relrisk    loss_pct   riskdiff    excessrisk_per10000
1937  Shirley   26816  0.7459047  25.409533  -0.0082912  -82.9
1936  Shirley   35159  0.8372019  16.279809  -0.0063452  -63.5
1950  Linda     80432  0.8821419  11.785807  -0.0061105  -61.1
1951  Linda     73972  0.8755442  12.445582  -0.0056921  -56.9
1985  Jennifer  42650  0.8238794  17.612060  -0.0049392  -49.4
1952  Linda     67088  0.8807161  11.928385  -0.0047766  -47.8
1970  Lisa      38964  0.8326375  16.736249  -0.0042753  -42.8
1957  Deborah   40070  0.8224526  17.754744  -0.0041239  -41.2
1954  Linda     55381  0.8757818  12.421825  -0.0039456  -39.5
1958  Cynthia   31003  0.8008784  19.912163  -0.0037328  -37.3
1883  Mary      8012   0.9475668  5.243321   -0.0036927  -36.9
1977  Amy       26731  0.8150250  18.497499  -0.0036882  -36.9
1938  Shirley   23769  0.8556230  14.437703  -0.0035140  -35.1
1953  Linda     61275  0.9006528  9.934718   -0.0035037  -35.0

So in absolute terms, the biggest drops were among names that have higher frequencies. The overall impact is greater. There is a decrease of 83 Shirleys per 10,000 girls born in 1937 compared to the prior year. Sorry Shirley!

Hilary/Hillary and Monica are nowhere near the top losses if we look at the risk difference.

But what if we compare Hilary to Monica? Who has the biggest drop in absolute terms?

rds %>%
  filter(sex == "F") %>%
  filter(name == "Monica" | name == "Hilary" | name == "Hillary",
         year > 1990) %>%
  arrange(riskdiff) %>%
  filter(row_number() <= 10) %>%
  select(year, name, n, riskdiff, excessrisk_per10000) %>%
  kable()
year  name     n     riskdiff    excessrisk_per10000
1993  Hillary  1064  -0.0007180  -7.2
1999  Monica   2133  -0.0005702  -5.7
1998  Monica   3229  -0.0005462  -5.5
1993  Hilary   343   -0.0004097  -4.1
1994  Hillary  408   -0.0003304  -3.3
1993  Monica   3900  -0.0003101  -3.1
1991  Monica   4156  -0.0001239  -1.2
2000  Monica   1990  -0.0000984  -1.0
2003  Monica   1613  -0.0000964  -1.0
2001  Monica   1793  -0.0000920  -0.9

And Hillary wins with an excess risk of -7.2 per 10,000 babies in 1993! In 1993, there were about 7 fewer babies named Hillary than in 1992 for every 10,000 girls born. It’s even larger if you combine Hilary and Hillary. Interestingly, Hillary had a larger risk difference than Hilary in 1993. By both the risk ratio and the risk difference, Hilary (in one spelling or the other) is the more poisoned name.
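
To get a rough sense of the combined drop, here is a minimal sketch that adds the two 1993 risk differences from the table above. It assumes the two spellings are treated as one name, so their proportions (and therefore their year-over-year differences) simply add; the variable names are made up for illustration, and the inputs are the rounded values as printed.

```r
# 1993 risk differences as printed in the table above
rd_hillary <- -0.0007180
rd_hilary  <- -0.0004097

# The combined proportion is the sum of the two proportions, so the
# combined year-over-year difference is the sum of the two differences
rd_combined <- rd_hillary + rd_hilary

round(rd_combined * 10000, 1)   # -11.3
```

That works out to roughly 11 fewer Hilarys and Hillarys per 10,000 girls born in 1993 than in 1992, about double Monica’s largest single-year drop.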

P.S. Seriously, watch this talk or read this story by Monica Lewinsky. It’s inspiring and adds a lot to current conversations about harassment and bullying.


  1. I’ll admit that I’ve never actually watched Friends, but this is the impression I got from random clips!

  2. The relative risk is also referred to as the risk ratio or the cumulative incidence ratio in epidemiology. I know, why don’t we just stick to calling it one thing?!

  3. The usefulness of these measures is often the source of a lot of confusion in the media when there are stories about the benefit of certain treatments and screenings, for example, the benefits of mammography or the harms of alcohol use and birth control. One reason I love The Upshot is because they do an excellent job of explaining this to a non-epidemiologist audience.

  4. https://www.nytimes.com/2017/11/10/upshot/health-alcohol-cancer-research.html

Monica Gerber
Data Scientist in Public Health