When datasets are scaled up to the volume of the (partial) internet, together with the idea that scale will average out the noise, large dataset builders came up with a human-not-in-the-loop, cheaper-than-cheap-labor method to clean them: heuristic filtering. Heuristics in this context are basically a set of rules that engineers come up with, guided by their imagination and their estimation of what works best for their idea of “cleaning”. Most datasets adopt heuristics from existing ones, then add some extra filtering rules for their own specific characteristics. I would like to invite you to have a taste together of these silent, anonymous yet upheld estimations and not-guaranteed rationalities in current sociotechnical artifacts, and to ask for whom these estimations are good enough, as they will soon be part of our technological infrastructures.
In the 1980s, non-white women’s body size data was categorized as dirty data when the first women’s sizing system was established in the US. Now, in the age of GPT, what is considered dirty data, and how is it removed from massive training materials?
Datasets for training large models have by now expanded to the volume of the (partial) internet. Under the idea that “scale averages out noise”, these datasets are scaled up by scraping whatever data is freely available on the internet, then “cleaned” with a human-not-in-the-loop, cheaper-than-cheap-labor method: heuristic filtering. Heuristics in this context are basically a set of rules that engineers come up with, out of their imagination and estimation, that are “good enough” to remove what counts as “dirty data” from their perspective, with no guarantee of being optimal, perfect, or rational.
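To make the mechanics concrete, here is a minimal sketch of what such heuristic rules can look like in code, assuming rules in the spirit of common web-corpus pipelines (length cutoffs, word-length checks, blocklist matching); the thresholds, the blocklist, and the function name are illustrative assumptions, not the rules of any specific dataset discussed in the talk.

```python
# Illustrative sketch of heuristic filtering: a handful of hand-written rules
# decide whether a scraped document is kept or discarded. All values below are
# assumptions for demonstration, not any real dataset's configuration.

BLOCKLIST = {"badword1", "badword2"}  # hypothetical placeholder terms

def keep_document(text: str) -> bool:
    words = text.split()
    if not words:
        return False
    # Rule 1: drop documents judged "too short" to be useful prose.
    if len(words) < 50:
        return False
    # Rule 2: drop documents whose average word length looks implausible.
    mean_word_len = sum(len(w) for w in words) / len(words)
    if mean_word_len < 3 or mean_word_len > 10:
        return False
    # Rule 3: drop the whole document if any blocklisted word appears.
    if any(w.lower().strip(".,!?") in BLOCKLIST for w in words):
        return False
    return True
```

Each rule encodes an engineer's estimation of what “clean” means; a document failing any single check disappears from the corpus entirely.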
The talk will show some intriguing patterns of “dirty data” drawn from 23 extraction-based datasets, such as how NSFW gradually comes to mean NSFTM (not safe for training models), reflect on these silent, anonymous yet upheld estimations and not-guaranteed rationalities in current sociotechnical artifacts, and ask for whom these estimations are good enough, as they will soon be part of our technological infrastructures.
jiawen (b. ~362 PPM) exists as a user most of the time in their life, having little agency as a standard user, both technologically and politically. She seeks possibilities in the cracks of the internet for queering the given identity of a user; their wish is to be an analogkäse in the digital figurations and drip sizzling fat onto the cables and ports.
apart from that, jiawen's work focuses on technology as memory and desire: with a contaminated history yet appearing pure, sterilized, decontextualized and dehistoricized, operating through reducing rather than relating. jiawen looks into the infrastructure, (counter)history, materiality, locality, poetics and politics of technology. She sees l̸͓̪̜͕̮̆̎̓e̷͍̲͕̫̦͌a̵̡̝͈̗̐k̴̢͖̮̞̣̄͛̓͛̐ as an instance of the space-time continuum and a definite part of digital reality, and leaking as a method to survive together.
currently based in Bremen, Germany.