Biased language models can result from internet training data

The controversy over AI researcher Timnit Gebru’s departure from Google, and what biased language models might mean for the search industry.

Last year, Google announced BERT, calling it the biggest change to its search engine in nearly five years, and the model is now used on nearly every English-language query. However, language models like BERT are trained on large datasets, and there are potential risks associated with developing language models in this way.

The ousting of researcher Timnit Gebru from Google has been linked to these issues, as have concerns about how biased language models might affect search for marketers and users alike.

A prominent AI researcher and her departure from Google

Who is she? Before her departure, Gebru was best known for publishing a groundbreaking study in 2018 that found facial analysis software had an error rate of nearly 35% for dark-skinned women, compared with less than 1% for light-skinned men. She is also a Stanford Artificial Intelligence Laboratory alumna, an advocate for diversity who has criticized the lack of it among technology company employees, and a co-founder of Black in AI, a nonprofit dedicated to increasing Black representation in the field of AI. She was recruited by Google in 2018 with the promise of full academic freedom, becoming the company’s first Black female researcher, the Washington Post reports.

Why she no longer works for Google. Gebru’s exit followed a dispute over a paper she co-authored (“On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?”) examining the potential risks of training language models on very large datasets. Gebru says her “resignation” was made effective immediately: she was on vacation at the time and had been promoted to co-lead of the company’s Ethical AI team less than two months earlier.

In a public response, Jeff Dean, senior vice president of Google AI, said the paper “ignored too much relevant research” and was submitted for review only a day before its deadline. He also said Gebru listed a number of conditions that would have to be met for her to continue working at Google, including disclosing the identity of everyone Dean had consulted as part of the paper’s review process. “Timnit wrote that if we didn’t meet these demands, she would leave Google and work on an end date. We accept and respect her decision to resign from Google,” Dean said.

In a series of tweets, Gebru countered: “I didn’t resign; I asked for simple conditions first,” adding, “I said these are the conditions; if not, I can work on a last date. Then [my skip-level manager] sent an email to my direct reports saying she had accepted my resignation.”

Asked for further comment, Google had nothing to add; instead, the company pointed to Dean’s public response and a memo from CEO Sundar Pichai.

While the nature of the split with Google is disputed, Gebru now counts among a growing number of former Google employees who spoke up and suffered the consequences. As an advocate for marginalized groups, a leader in AI ethics and one of the few Black women in the field, she has also drawn attention to Google’s diversity, equity and inclusion practices.

Gebru’s paper may have painted an unflattering picture of Google’s technology

The research paper, which is not yet publicly available, provides an overview of the risks associated with training language models on large datasets.

The environmental impact. One of the concerns Gebru and her co-authors explored was the potential environmental cost, according to the MIT Technology Review. Gebru’s paper references a 2019 paper by Emma Strubell et al., which found that training a particular type of neural architecture search method produced an estimated 626,155 pounds of CO2 equivalent, roughly the same as 315 round-trip flights between San Francisco and New York.
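The two figures quoted above are consistent under a per-passenger estimate of roughly 1,984 lbs of CO2 equivalent for one San Francisco–New York round trip (that per-flight figure is an assumption here, taken from the same comparison in Strubell et al.). A quick back-of-the-envelope check:

```python
# Sanity-check of the flight equivalence quoted above.
# Assumption: ~1,984 lbs of CO2 equivalent per passenger for one
# San Francisco <-> New York round trip.
nas_training_lbs = 626_155   # CO2e from neural architecture search training
per_round_trip_lbs = 1_984   # CO2e per passenger, SF <-> NY round trip

equivalent_flights = nas_training_lbs / per_round_trip_lbs
print(int(equivalent_flights))  # 315, matching the figure quoted
```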

Biased inputs can produce biased models. Language models trained on data from the internet can absorb the racist, sexist and otherwise biased language found there, and that bias can surface in whatever the models are used for, including search engine algorithms.

Biased training data can produce biased language models

“Language models trained from the text that exists on the internet absolutely produce biased models,” said Rangan Majumder, vice president of search and artificial intelligence at Microsoft. “Many of these pre-trained models are trained using ‘masking,’ which means they learn the linguistic nuances needed to fill in blanks in text; bias can come from many things, but the data the models are trained on is certainly one of them.”
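To illustrate how fill-in-the-blank training can pick up skew from its corpus, here is a minimal sketch. The tiny corpus, the names in it and the frequency-based “prediction” are all invented for illustration; they stand in for a real masked language model, which scores candidates with a neural network rather than raw counts:

```python
from collections import Counter

# Hypothetical toy corpus standing in for web-scale training text;
# the gender skew is deliberate, to make the effect visible.
corpus = (
    "he is an engineer . he is an engineer . he is an engineer . "
    "she is a nurse . she is a nurse . she is an engineer ."
).split()

def mask_predict(target, tokens):
    """Rank candidates for the masked slot in '[MASK] is a(n) <target>'
    by how often each word fills it in the corpus -- a crude frequency
    stand-in for masked-language-model scoring."""
    counts = Counter()
    for i, word in enumerate(tokens):
        if word == target and i >= 3:
            counts[tokens[i - 3]] += 1
    return counts.most_common()

# The "model" fills '[MASK] is an engineer' with 'he' far more often.
print(mask_predict("engineer", corpus))  # [('he', 3), ('she', 1)]
```

Swap in a balanced corpus and the ranking flattens: the “prediction” is nothing more than the statistics of whatever text the model was trained on.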

“You can see the biased data for yourself,” said Britney Muller, a former senior SEO scientist at Moz. In the screenshot above, a t-SNE visualization of a subset of Google’s Word2Vec corpus isolates the entities most closely related to the term “engineer,” which include given names commonly associated with men, such as Keith, George, Herbert and Michael.
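The clustering a t-SNE plot reveals comes down to vector similarity: words that co-occur in similar contexts end up with similar embeddings. A minimal sketch with invented 2-D vectors (real Word2Vec embeddings have hundreds of dimensions; these numbers and the word list are made up to mimic the pattern described above):

```python
import math

# Hypothetical 2-D "embeddings" mimicking the association a biased
# corpus can produce; the values are invented for this sketch.
vectors = {
    "engineer": (0.90, 0.10),
    "keith":    (0.80, 0.20),  # male given names cluster near "engineer"
    "michael":  (0.85, 0.15),
    "nurse":    (0.10, 0.90),
}

def cosine(a, b):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Male names sit closer to "engineer" than "nurse" does in this toy space.
sims = {w: cosine(vectors["engineer"], v) for w, v in vectors.items()}
print(sorted(sims, key=sims.get, reverse=True))
```

Nothing in the geometry is inherently gendered; the proximity of male names to “engineer” is just a compressed record of the co-occurrence statistics of the training text.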

The internet’s biases are, of course, not limited to gender. “Economic bias, popularity bias, language bias (most of the internet is in English, for example), to name just a few,” said Dawn Anderson, managing director at Bertey. When these biases appear in training data, and the models trained on that data are used in search engine algorithms, the biases can manifest in query suggestions or even in ranking and retrieval.

A “smaller slice of the search engine pie” for marketers. “With these large-scale models being deployed everywhere, there’s a danger that they simply reinforce these biases in search, by the very logic of the training data the models learned from,” Anderson said.

The potential harm also extends to personalized content served by search engines, via features such as Google’s Discover feed. “It will certainly lead to more polarized results and perspectives,” said Muller. “For example, it could be great for Minnesota Vikings fans who only want to see Minnesota Vikings news, but it can get very divisive when it comes to politics or conspiracy theories. For marketers, this can mean a smaller slice of the search engine pie as content gets served to ever-narrower audiences,” she added.

If biased models make their way into search algorithms (if they haven’t already), it could undermine the purpose of much SEO work. “The entire [SEO] industry is all about ranking websites on Google for lucrative keywords,” said Pete Watson-Wailes, founder of digital consultancy Tough & Competent. “I’d suggest that this means we’re optimizing sites for models that actively disadvantage people and nudge human behavior.”

However, this is a relatively well-known problem, and companies are working to mitigate the impact of such bias.
