Machine learning analyzed 3.5 million books to find top adjectives for men and women
Researchers from the University of Copenhagen and the United States used machine learning to analyze 3.5 million English-language books and found that men are commonly described with adjectives such as "brave" and "rational", while women are commonly described with adjectives such as "lovely" and "beautiful".
"We are clearly able to see that the words used for women refer much more to their appearances than the words used to describe men. Thus, we have been able to confirm a widespread perception, only now at a statistical level," says Isabelle Augenstein of the University of Copenhagen’s Department of Computer Science. The researchers found that negative verbs relating to body and appearance are used five times more often for women than for men. The analyses also show that positive and neutral adjectives relating to body and appearance occur roughly twice as often in descriptions of women, while men are most frequently described with adjectives referring to their personal qualities and behavior.
Linguists used to examine the prevalence of gendered language and bias using small data sets. Today, algorithms allow scientists to analyze vast amounts of data – 11 billion words in this case.
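The core idea behind such an analysis can be sketched in a few lines. The toy example below is purely illustrative and not the researchers' actual pipeline (they worked at a far larger scale with more sophisticated linguistic analysis): it counts adjectives from a small stand-in lexicon that directly precede a gendered word, tallying them separately for each gender. All word lists and the sample text are invented for the example.

```python
import re
from collections import Counter

# Tiny stand-in word lists; a real study would use far richer resources.
FEMALE_WORDS = {"woman", "girl", "wife", "mother", "she"}
MALE_WORDS = {"man", "boy", "husband", "father", "he"}
ADJECTIVES = {"brave", "rational", "lovely", "beautiful"}

def count_gendered_adjectives(text):
    """Count adjectives immediately preceding a gendered word."""
    tokens = re.findall(r"[a-z]+", text.lower())
    female, male = Counter(), Counter()
    for adj, noun in zip(tokens, tokens[1:]):
        if adj in ADJECTIVES:
            if noun in FEMALE_WORDS:
                female[adj] += 1
            elif noun in MALE_WORDS:
                male[adj] += 1
    return female, male

sample = "The brave man met a lovely woman. A rational man and a beautiful girl."
female_counts, male_counts = count_gendered_adjectives(sample)
```

Run over billions of words, even a crude tally like this surfaces which adjectives cluster around descriptions of men versus women.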
Old gender stereotypes
Books published decades ago still play an active role: data in the form of text feeds the algorithms behind machines and applications that understand human language. This is the technology that allows smartphones to recognize our voices, enables Google to provide keyword suggestions, and more.
"If the language we use to describe men and women differs, in employee recommendations, for example, it will influence who is offered a job when companies use IT systems to sort through job applications," says Isabelle Augenstein.
It is important to be aware of gendered language as artificial intelligence and language technology become more prominent.
Augenstein continues: "We can try to take this into account when developing machine-learning models by either using less biased text or by forcing models to ignore or counteract bias. All three things are possible."
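One well-known way to "counteract bias" in a trained model is to remove a gender direction from word embeddings by projecting it out. The sketch below is a hedged illustration of that general idea, not the method described in the article; the two-dimensional vectors and the interpretation of their axes are invented for the example.

```python
# Toy debiasing sketch: subtract a vector's projection onto an
# assumed "gender direction" so that component is removed while
# the rest of the vector is preserved.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def subtract_projection(vec, direction):
    """Return vec with its component along `direction` removed."""
    scale = dot(vec, direction) / dot(direction, direction)
    return [a - scale * d for a, d in zip(vec, direction)]

# Hypothetical 2-d embedding: axis 0 carries "gender" signal,
# axis 1 carries everything else (e.g. a personality trait).
gender_direction = [1.0, 0.0]   # e.g. the difference of "he" and "she" vectors
brilliant = [0.8, 0.6]          # leans toward the "male" pole before debiasing

debiased = subtract_projection(brilliant, gender_direction)
# The gender component is zeroed out; the other component is untouched.
```

In practice this kind of projection is applied to high-dimensional embeddings, and removing bias fully is harder than this sketch suggests, but it captures the spirit of forcing a model to ignore a gendered signal.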
A top-11 list of the most frequently occurring adjectives, grouped by category.