Spread of the Golden Arches

I decided to take on another small data science and programming project in spite of a few approaching exams. This time, we follow in the footsteps of the Golden Arches as they take over the globe.

Inspiration

This data visualization project was inspired by a post I came across a few months ago: an animated map of Burger King's spread across the globe. The author detailed their process of making the plot in a blog post.

Most of the data scraping and visualization posts I've come across on r/dataisbeautiful seem to be made with R. I decided to learn the language by adapting the original code to make a McDonald's version of the Burger King plot. After installing RStudio and taking a quick lesson on the syntax, I took on the project.

Process

To start off, we need to import the libraries that will help us scrape and analyze the data. Since my end goal is the same as the original project, I imported the same libraries as the post.

    # data manipulation, stringr, purrr, and ggplot2
    library(tidyverse)
    # web scraping
    library(rvest)
    # simple features for spatial data
    library(sf)
    # string interpolation for plot labels
    library(glue)
    # world map data
    library(rnaturalearth)
    # gif rendering
    library(gifski)

Web Scraping
You must have the data on hand before you can plot it. I located a Wikipedia page listing all the countries with McDonald's restaurants.

To scrape the data, we must know the XPath of the element we want. This can be done through "inspect element" in the browser, then copying the XPath. We can then read the HTML element from our URL and store the contents in a table variable.

    url <- "https://en.wikipedia.org/wiki/List_of_countries_with_McDonald%27s_restaurants"
    # confirms the url
    url

    # reads in the HTML table from Wikipedia
    table <- url %>%
      read_html() %>%
      html_nodes(xpath = '/html/body/div[3]/div[3]/div[4]/div/table[2]') %>%
      html_table() %>%
      .[[1]]
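
As an aside, absolute XPaths copied this way are brittle: they break whenever Wikipedia changes its page layout. A sketch of a sturdier alternative (my own variation, not from the original post) selects the tables by their wikitable CSS class instead; the [[2]] index mirrors the table[2] in the XPath above and may need adjusting:

    # selects the wikitables by CSS class rather than by absolute position
    table <- url %>%
      read_html() %>%
      html_nodes("table.wikitable") %>%
      .[[2]] %>%
      html_table()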

Data Manipulation
Now we have the data in a table, but we still have some housekeeping to do before plotting it. We only need two columns from the scraped data: the country and the year the first restaurant opened. So we start by taking the second and third columns from the table.

Another problem I noticed while looking at the data was that, unlike the Burger King page, full dates were provided in the field, and some cells had citation brackets and additional information.

To fix that problem, I used the stringr package to identify the text to remove. R lets us access columns by name with the $ operator, but our scraped column names are not exactly friendly, so we first rename the columns to Country and Year. Then we use the str_replace() function to strip anything matching the patterns we specify; in our case, anything in parentheses or brackets.

A similar method extracts the year from the full date: the str_sub() function takes the last four characters of each date, and as.numeric() converts them to numbers.
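
For instance, here is what those two steps do to a single cell (the value below is a made-up example in the style of the scraped dates):

    # cleaning one hypothetical cell step by step
    x <- "May 15, 1940[4]"
    x <- str_replace(x, "\\[.*", "")    # "May 15, 1940"
    as.numeric(str_sub(x, -4, -1))      # 1940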

    # extracts the country and year columns from the data
    table <- table[2:3]

    # learning: checks what type of object a one-column subset is
    class(table[2])
    # renames the columns to something friendlier
    colnames(table)[1] <- "Country"
    colnames(table)[2] <- "Year"

    # removes any additional details about countries in parentheses
    table$Country <- str_replace(table$Country, " \\(.*", "")
    table$Country <- str_replace(table$Country, "\\(.*", "")
    # removes citations like [13]
    table$Year <- str_replace(table$Year, "\\[.*", "")
    # extracts only the opening year from the full date
    table$Year <- as.numeric(str_sub(table$Year, -4, -1))

There are always exceptions to the pattern (thanks, Wikipedia). I inserted the correct year for the United States and Germany by hand, and removed every country without a McDonald's yet.

    # manually inserts the correct years for the United States and Germany,
    # which didn't follow the date pattern
    table$Year[1] <- 1940
    table$Year[11] <- 1971
    # removes countries where McDonald's hasn't opened yet; their "opening soon"
    # entries failed to parse as years, leaving NA
    table <- table[!is.na(table$Year), ]

    table

As stated in the original blog post, we should check for any mismatches between the country names we scraped and the country names in the rnaturalearth dataset; an anti_join (shown in the code below) surfaces these:

    > mismatches
                   Country Year
    1        United States 1940
    2  U.S. Virgin Islands 1970
    3              Curacao 1974
    4              Bahamas 1975
    5          PR of China 1990
    6           Martinique 1991
    7           Guadeloupe 1992
    8    Northern Marianas 1993
    9              Reunion 1997
    10           Gibraltar 1999
    11       French Guiana 2000
            

We then rename the mismatched countries that do exist in the rnaturalearth dataset (under different names), ignoring the ones that have no separate entry in the map data. Then we can join the two datasets into one dataframe called joined_data.

    countries_sf <- ne_countries(scale = "medium", returnclass = "sf")

    # lists the scraped country names with no match in the map data
    mismatches <- table %>%
      anti_join(countries_sf, by = c("Country" = "name_en"))
    mismatches

    # lookup table mapping our names to the rnaturalearth names
    country_match <- tribble(
      ~Country, ~sf_country,
      "United States", "United States of America",
      "U.S. Virgin Islands", "United States Virgin Islands",
      "Curacao", "Curaçao",
      "Bahamas", "The Bahamas",
      "Northern Marianas", "Northern Mariana Islands",
      "PR of China", "People's Republic of China")
    country_match

    # joins the data frames to fix name mismatches
    joined_data <-
      table %>%
      left_join(country_match, by = "Country") %>%
      mutate(country = ifelse(is.na(sf_country), Country, sf_country)) %>%
      left_join(countries_sf, by = c("country" = "name_en")) %>%
      # the join returns a plain tibble, so convert back to sf for geom_sf()
      st_as_sf()
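
As a side note, dplyr's coalesce() expresses the same fallback ("use the matched name if it exists, otherwise keep the original") a little more compactly. A sketch of the equivalent pipeline:

    # coalesce() picks sf_country unless it is NA, in which case Country is kept
    joined_data <-
      table %>%
      left_join(country_match, by = "Country") %>%
      mutate(country = coalesce(sf_country, Country)) %>%
      left_join(countries_sf, by = c("country" = "name_en")) %>%
      st_as_sf()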

Map Plotting
We finally reach the stage where we can plot the data, using code modified from the original post to plot the map and export it as a GIF.

    # quick test plot of the base map
    ggplot() + geom_sf(data = countries_sf)

    # plots the opening of each McDonald's on the map, code adapted from
    # https://r-mageddon.netlify.com/post/the-burger-king-pandemic/
    plot_fun <- function(open_year){
      p <- ggplot() +
        ## Add the background countries, but filter away Antarctica because it is unnecessary.
        geom_sf(data = filter(countries_sf, name_en != "Antarctica")) +
        ## Add the countries data, filtering so countries only appear after their first McDonald's has opened
        geom_sf(data = filter(joined_data, Year <= open_year), fill = rgb(236, 28, 36, maxColorValue = 255)) +
        ## Change the theme so we don't get a background plot or axes. The coord_sf part is a workaround to a bug that makes gridlines appear.
        theme_void() + coord_sf(datum = NA) + guides(fill = FALSE) +
        ## Stitch the year to the title with glue
        labs(title = glue("    Year: {open_year}"),
             subtitle = glue("      Countries with McDonald's: {nrow(filter(joined_data, Year <= open_year))}"))
      print(p)
    }

    # saves the visualization as a gif, one frame per year
    save_gif(walk(min(joined_data$Year):max(joined_data$Year), plot_fun), delay = 0.6, gif_file = "mcdonalds.gif")
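
If you want a different size or speed for the animation, gifski's save_gif() also takes width, height, and delay arguments; a quick sketch (the values here are arbitrary):

    # renders a larger, slower gif: 1000x600 pixels, one second per frame
    save_gif(
      walk(min(joined_data$Year):max(joined_data$Year), plot_fun),
      gif_file = "mcdonalds_large.gif",
      width = 1000, height = 600, delay = 1
    )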

Challenges

The main challenge I had to overcome was learning R and getting comfortable with its syntax and libraries. The code in the original post made no sense at first, but it became clearer as I dug into how to carry out each step of the process.

Another challenge was that the HTML table on the McDonald's page was formatted differently from the Burger King one, so I had to scrape and clean the data differently.

Accomplishments

  • Learnt the basics of R and how to scrape, manipulate, and plot data
  • Used the rvest library to scrape an HTML table
  • Plotted the map with the rnaturalearth, sf, and glue libraries
  • Made an animated visualization with the gifski package

Future Plans

I plan to keep working with R to visualize interesting relationships and concepts.