
Borrowing from the law to filter training data for foundation models


Using a “Pile of Law” dataset, Stanford researchers explore filtering private or toxic content from training data for foundation models.
Foundation models are often trained on what is essentially the entire internet. By learning from such a vast dataset, they can memorize and reproduce an impressive amount of the information we want them to learn. For example, they might learn to accurately answer factual questions such as “Who is the president of the United States?”
At the same time, however, foundation models can memorize and reproduce information that could be harmful. For example, they might disclose people’s Social Security numbers, credit card information, or criminal records, or answer questions about Muslims by suggesting they are terrorists.
These are problems that the creators of foundation models need to fix, says Peter Henderson, a JD/Ph.D. student at Stanford: “We don’t want models to associate people with either their private content or with harmful characteristics.” 
To avoid such consequences, the creators of foundation models sometimes try to filter out private or toxic content before using a dataset to train a model. But trying to remove all — or even most — of the private or toxic content from the entirety of the internet is extremely challenging. One reason: Context matters. Privacy expectations differ across cultures and even across time. And deciding if a phrase is toxic might depend on who is speaking, why they are using a particular phrase, and the expectations of the readers. In sum: It’s a balancing act, and different researchers apply different standards. 
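A common baseline for this kind of filtering is a rule-based scrub of documents before training. The sketch below is a minimal, hypothetical illustration of that idea in Python; the regular expressions, blocklist, and helper names are assumptions for demonstration only, not the Stanford team’s method, and real filtering would require far more nuance than pattern matching.

```python
import re

# Illustrative patterns only: real PII/toxicity filtering needs far more
# than regexes, and context (who is speaking, and why) changes the answer.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")           # e.g. 123-45-6789
CREDIT_CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")  # loose 13-16 digit run
BLOCKLIST = {"some_toxic_term"}                          # placeholder word list

def keep_document(text: str) -> bool:
    """Return True if a pretraining document passes this naive filter."""
    if SSN_RE.search(text) or CREDIT_CARD_RE.search(text):
        return False
    tokens = set(text.lower().split())
    return not (tokens & BLOCKLIST)

corpus = [
    "The president of the United States lives in the White House.",
    "Call me back, my SSN is 123-45-6789.",
]
filtered = [doc for doc in corpus if keep_document(doc)]
print(filtered)  # only the first document survives the filter
```

Even this toy example shows why the balancing act is hard: the same digit pattern can be a Social Security number in one document and a harmless identifier in another, which is exactly the kind of context-dependence the researchers point to.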
“We wondered if there was a more principled way to filter pretraining data,” Henderson says. He and his colleagues, including Mark Krass, also a JD/Ph.D. student, had an idea: Look to the law. There’s a long history of courts setting standards for information disclosure, so why not import those standards into the machine learning (ML) environment?
To test their idea, Henderson and his colleagues assembled Pile of Law, a vast dataset of court and administrative opinions, legal code, casebooks, and other legal documents. They then explored whether Pile of Law could help identify a principled way to filter pretraining data, with a particular focus on privacy and toxicity.
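The corpus is distributed publicly, so experimenting with it is straightforward. Below is a minimal sketch of streaming a slice of it with the Hugging Face `datasets` library; the repository name, subset name, and the `"text"` field are assumptions based on how the dataset is commonly hosted and may need adjusting to the actual release.

```python
from datasets import load_dataset  # pip install datasets

# Assumption: the corpus is published on the Hugging Face Hub under
# "pile-of-law/pile-of-law"; the subset name below is illustrative.
ds = load_dataset(
    "pile-of-law/pile-of-law",
    "courtlistener_opinions",   # one of the legal-document subsets
    split="train",
    streaming=True,             # the corpus is large; stream rather than download
)

# Peek at a few documents to see what the legal text looks like.
for example in ds.take(3):
    print(example["text"][:200])
```

Streaming keeps the example lightweight; a full filtering pipeline would iterate over the whole corpus to learn what courts do and do not consider disclosable.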
