“Gendered” is a new series of posts that look at gender stereotypes through data. The goal is to expose the stereotypes and equip people with tools that help recognize them in everyday life. Because they are everywhere. Really.
A few months ago a friend posted a picture that had been making the rounds on the internet a couple of years earlier: side-by-side covers of two teen magazines, Girls’ Life and Boys’ Life. The difference was so striking that it caused a modest uproar.
It caught my attention, so I visited the magazines’ websites. The situation looked even worse once I compared the covers of dozens of issues over time. To quantify this, because that’s what I do, I analyzed the words that occurred most commonly on these covers and ended up creating my first (ever) infographic. Enjoy and degender.
For those of you who want to look under the hood and see how this was done, here are some details:
Sources – girlslife.com, boyslife.org, magazine-agent.com-sub.info, childstats.gov
Analysis of text on the magazine covers – I first created data files with all the words/sentences from the magazine covers. I then used the R tm (text mining) package to stem the text and remove common words like “a” and “the”. I also removed four irrelevant but common words that appeared in almost every issue and would otherwise dominate the cloud and make the other words harder to see: “quiz”, “story”, “scout”, “true” (Boys’ Life is a Boy Scouts magazine, so “scout” appears in every issue). Finally, I used the wordcloud package to create the colourful word clouds. Below is the code I used, in case you’d like to do some text mining yourself.
library(tm)            # text mining: corpus handling and cleaning
library(SnowballC)     # stemming backend used by stemDocument
library(wordcloud)     # word cloud plotting
library(RColorBrewer)  # colour palettes

# the cover text is in the second column of the data file
data <- read.csv('girls_life.csv')
docs <- Corpus(VectorSource(data[,2]))

# convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
# remove numbers
docs <- tm_map(docs, removeNumbers)
# remove common English stopwords
docs <- tm_map(docs, removeWords, stopwords('english'))
# remove punctuation
docs <- tm_map(docs, removePunctuation)
# remove extra white spaces
docs <- tm_map(docs, stripWhitespace)
# stem the words
docs <- tm_map(docs, stemDocument)
# remove additional stopwords (only 'quiz' here; extend the vector as needed)
docs <- tm_map(docs, removeWords, c('quiz'))

# convert to a data frame of words sorted by frequency
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing=TRUE)
d <- data.frame(word = names(v), freq = v)
head(d, 10)

# generate the word cloud
par(bg='#FFDD9D')
wordcloud(d$word, d$freq,
          # PuBuGn has at most 9 colours; asking for more only triggers a warning
          col=brewer.pal(n = 9, name = "PuBuGn"),
          random.order=FALSE,
          rot.per=0.3)
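If you want to see what the counting step boils down to before installing anything, here is a minimal base R sketch of the same idea: lower-case the text, strip everything that is not a letter, drop unwanted words and count what is left. It is not the code I used for the infographic. The cover lines are made up for illustration, the two stopword vectors are tiny hand-picked stand-ins for tm's `stopwords('english')`, `count_words` is a hypothetical helper name, and stemming is skipped entirely.

```r
# Toy cover lines, invented for illustration (not taken from the real covers)
covers <- c("Quiz: Are you ready for a new look?",
            "Your best look ever! Fashion quiz inside",
            "Explore your future: new jobs, new ideas")

# Tiny hand-picked stopword lists (assumptions; tm's list is much longer)
common_words <- c("a", "the", "are", "you", "your", "for")
extra_words  <- c("quiz")

# Hypothetical helper: returns a data frame of words sorted by frequency
count_words <- function(lines, drop = character(0)) {
  text  <- tolower(lines)
  text  <- gsub("[^a-z ]", " ", text)      # digits and punctuation become spaces
  words <- unlist(strsplit(text, " +"))    # split on runs of spaces
  words <- words[nchar(words) > 0 & !(words %in% drop)]
  freq  <- sort(table(words), decreasing = TRUE)
  data.frame(word = names(freq), freq = as.integer(freq),
             stringsAsFactors = FALSE)
}

d_toy <- count_words(covers, drop = c(common_words, extra_words))
head(d_toy, 3)
```

The resulting data frame has the same `word` and `freq` columns as `d` in the script above, so it can be passed to `wordcloud()` in the same way.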