Analyze the text of 100,000 stories to identify gender-associated verbs, terms related to violence, and more.
Also in this series: Examining the Arc of 100,000 Stories.
I was fascinated by my colleague Julia Silge’s blog post on what verbs tend to occur after “he” or “she” in several novels, and what they might imply about gender roles within fictional work. This made me wonder what trends could be found in a larger dataset of stories.
As I usually do for text analysis, I’ll be using the tidytext package Julia and I developed last year. To learn more about analyzing datasets like this, see our online book Text Mining With R: A Tidy Approach, published by O’Reilly. I’ll provide code for the text mining sections so you can follow along. I don’t show the code for most of the visualizations to keep the post concise, but as with all of my posts, the code can be found here on GitHub.
We’ll start with the same code from the last post that read in the plot_text variable from the raw dataset. Just as Julia did, we then tokenize the text into bigrams, or consecutive pairs of words, with the tidytext package, then filter for cases where a word occurred after “he” or “she.”
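The tokenize-and-filter step can be illustrated outside of R as well. The following is a minimal plain-Python sketch, not the tidytext code used for this post: the helper names `bigrams` and `he_she_pairs` are hypothetical, the regular-expression tokenizer is a simplifying assumption, and the sample sentence is made up rather than taken from the dataset.

```python
import re

def bigrams(text):
    """Return consecutive word pairs from lowercased text.

    Assumption: words are runs of ASCII letters and apostrophes, so
    punctuation is dropped and pairs can span sentence boundaries.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return list(zip(words, words[1:]))

def he_she_pairs(text):
    """Keep only the bigrams whose first word is 'he' or 'she'."""
    return [(w1, w2) for w1, w2 in bigrams(text) if w1 in ("he", "she")]

# Toy sentence invented for illustration; not from the plot dataset.
sample = "She screams when he kidnaps her, but he is caught."
pairs = he_she_pairs(sample)
print(pairs)
```

Each resulting pair holds the pronoun and the word that follows it, which is the unit the rest of the analysis counts.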
Which words were most shifted towards occurring after “he” or “she”? We’ll filter for words that appeared at least 200 times.
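One common way to quantify such a shift is a smoothed log ratio of the two rates. The sketch below assumes that approach: the add-one smoothing and the base-2 logarithm are assumptions on my part, and only the minimum-count filter (200 in the text, lowered here for the toy data) comes from the post. The function name `log_ratios` and the toy counts are hypothetical.

```python
import math
from collections import Counter

def log_ratios(pairs, min_total=200):
    """Map each word to log2(rate after 'she' / rate after 'he').

    Assumption: add-one smoothing on both counts, so a word seen after
    only one pronoun does not yield an infinite ratio.
    """
    he = Counter(w2 for w1, w2 in pairs if w1 == "he")
    she = Counter(w2 for w1, w2 in pairs if w1 == "she")
    vocab = set(he) | set(she)
    he_denom = sum(he[w] + 1 for w in vocab)
    she_denom = sum(she[w] + 1 for w in vocab)
    ratios = {}
    for w in vocab:
        if he[w] + she[w] < min_total:
            continue  # drop rare words, as the text does with its 200 cutoff
        he_rate = (he[w] + 1) / he_denom
        she_rate = (she[w] + 1) / she_denom
        ratios[w] = math.log2(she_rate / he_rate)
    return ratios

# Toy counts invented for illustration; not results from the dataset.
toy_pairs = (
    [("she", "screams")] * 3 + [("he", "screams")]
    + [("he", "kills")] * 3 + [("she", "kills")]
)
toy_ratios = log_ratios(toy_pairs, min_total=4)
```

Under this convention a positive value means the word leans towards “she,” a negative value towards “he,” and a value of 1 or -1 corresponds to a twofold difference.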
This can be visualized in a bar plot of the most skewed words.
I think this paints a somewhat dark picture of gender roles within typical story plots. Women are more likely to be in the role of victims: “she screams,” “she cries,” or “she pleads.” Men tend to be the aggressor: “he kidnaps” or “he beats.” Not all male-oriented terms are negative (many, like “he saves” or “he rescues,” are distinctly positive), but almost all are active rather than receptive.
We could alternatively visualize the data by comparing the total number of words to the difference in association with “he” and “she.” This helps find common words that show a large shift.
There are a number of very common words (“is,” “has,” “was”) that occur equally often after “he” or “she,” but also some fairly common ones (“agrees,” “loves,” “tells”) that are shifted. “She accepts” and “he kills” are the two most shifted verbs that occurred at least a thousand times, as well as the most frequent words with more than a twofold shift.
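To make the two thresholds concrete: if the shift is measured as a base-2 log ratio (an assumption, as is everything else in this sketch), then “more than a twofold shift” means an absolute log ratio above 1. The function name `common_and_shifted`, the placeholder word labels, and the numbers are all hypothetical and do not come from the dataset.

```python
import math

def common_and_shifted(stats, min_total=1000, min_fold=2.0):
    """Select words that are both frequent and strongly shifted.

    stats maps word -> (total count, log2 ratio). A word qualifies when
    its total is at least min_total and its ratio exceeds min_fold in
    either direction. Results are sorted by total count, descending.
    """
    threshold = math.log2(min_fold)
    hits = [
        (word, total, ratio)
        for word, (total, ratio) in stats.items()
        if total >= min_total and abs(ratio) > threshold
    ]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

# Placeholder labels and invented numbers, for illustration only.
toy_stats = {
    "very_common_neutral": (50000, 0.02),
    "common_mild_shift": (4000, 0.6),
    "common_she_shift": (1200, 1.4),
    "common_he_shift": (1500, -1.3),
    "rare_strong_shift": (300, 2.1),
}
shifted = common_and_shifted(toy_stats)
```

Only the two entries that pass both filters survive, which mirrors how a scatter of total count against shift draws attention to its frequent, off-center points.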
Women in storylines are not always passive victims. The fact that the verb “stabs” is shifted towards female characters is interesting. What does the shift look like for other words related to violence or crime?
The fact that men are only slightly more likely to “shoot” in fiction is also notable since the article noted that men are considerably more likely to choose guns as a murder weapon than women are.
This data shows a shift in which verbs are used after “he” and “she,” and therefore in the roles male and female characters tend to have within stories. However, it’s only scratching the surface of the questions that can be examined with this data.
I’d also note that we could expand the analysis to include not only pronouns but also first names (for example, not only “she tells,” but “Mary tells” or “Susan tells”), which would probably improve the accuracy of the analysis.
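As a rough sketch of how that extension might look, a name lookup can map first names onto the same two categories before pairing them with the following word. Everything here is an assumption for illustration: the tiny `NAME_TO_PRONOUN` table is hypothetical (a real analysis would need a much larger name list and would have to handle ambiguous names), and the sample text is made up.

```python
import re

# Hypothetical, deliberately tiny lookup; not a real name dataset.
NAME_TO_PRONOUN = {"mary": "she", "susan": "she", "john": "he"}

def subject_verb_pairs(text, name_map=NAME_TO_PRONOUN):
    """Pair each pronoun or known first name with the word after it.

    Names are replaced by the pronoun they map to, so "Mary tells" is
    counted together with "she tells".
    """
    words = re.findall(r"[a-z']+", text.lower())
    pairs = []
    for w1, w2 in zip(words, words[1:]):
        subject = w1 if w1 in ("he", "she") else name_map.get(w1)
        if subject is not None:
            pairs.append((subject, w2))
    return pairs

# Toy text invented for illustration.
name_pairs = subject_verb_pairs("Mary tells the truth. John agrees, and she leaves.")
print(name_pairs)
```

The output has the same shape as the pronoun-only pairs, so the counting and ratio steps would not need to change.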
Again, the full code for this post is available here and I hope others explore this data more deeply.