A team of computer scientists at the University of California, Riverside has introduced a method for removing private and copyrighted information from artificial intelligence models without needing access to the original training data. The research was presented in July at the International Conference on Machine Learning in Vancouver, Canada.
This new approach responds to growing concerns about personal and copyrighted materials remaining accessible in AI models even after efforts by creators to restrict or delete such content. Traditional methods require retraining AI models with the original datasets, which is often costly and consumes significant energy. The UC Riverside method allows targeted information to be erased while preserving the functionality of the remaining model.
“In real-world situations, you can’t always go back and get the original data,” said Ümit Yiğit Başaran, a doctoral student in electrical and computer engineering at UC Riverside and lead author of the study. “We’ve created a certified framework that works even when that data is no longer available.”
The need for this type of technology has increased as tech companies face privacy regulations such as the European Union’s General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA), both designed to protect personal data used in large-scale machine learning systems.
Legal actions have also highlighted these issues; for example, The New York Times is currently suing OpenAI and Microsoft over alleged unauthorized use of its articles to train language models such as GPT.
AI models generate responses by predicting word patterns based on large collections of online texts. This sometimes leads to near-verbatim reproductions of original content, potentially allowing users to bypass paywalls or copyright protections.
The UC Riverside team—Başaran, professor Amit Roy-Chowdhury, and assistant professor Başak Güler—developed what they describe as a “source-free certified unlearning” technique. This involves using a surrogate dataset that statistically resembles the original data, adjusting the model’s parameters, and introducing controlled random noise so that the deleted information cannot be reconstructed later.
Their system builds upon existing optimization techniques in AI that estimate how a model would change if retrained from scratch. They enhanced this process with new noise-calibration mechanisms to account for differences between surrogate and original datasets.
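The article does not spell out the paper’s exact procedure, but the general shape of such an update can be sketched. The Python snippet below is a minimal illustration, not the authors’ algorithm: for an L2-regularized logistic-regression model, it takes one Newton-style step to cancel the forget set’s gradient contribution, estimates the required curvature from surrogate data rather than the unavailable training set, and then adds Gaussian noise. The function names, the logistic-regression setting, and the fixed `noise_scale` are all assumptions made for illustration; in the paper, the noise is calibrated to account for the surrogate/original mismatch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad(w, X, y, lam):
    # Mean gradient of the L2-regularized logistic loss over (X, y).
    p = sigmoid(X @ w)
    return X.T @ (p - y) / len(y) + lam * w

def hessian(w, X, lam):
    # Hessian of the same loss, estimated on whatever data is passed in.
    p = sigmoid(X @ w)
    d = p * (1.0 - p)
    return (X.T * d) @ X / len(X) + lam * np.eye(X.shape[1])

def surrogate_unlearn(w, X_forget, y_forget, X_surrogate,
                      lam=0.01, noise_scale=0.05, rng=None):
    """Illustrative source-free unlearning step (a sketch, not the paper's method)."""
    rng = np.random.default_rng() if rng is None else rng
    # Gradient of the points being erased: the influence to cancel.
    g = grad(w, X_forget, y_forget, lam)
    # Curvature is estimated from statistically similar surrogate data,
    # since the original training set is unavailable in this setting.
    H = hessian(w, X_surrogate, lam)
    # One Newton-style step approximating "retrain without the forget set":
    # at the trained optimum the total gradient is ~0, so removing the
    # forget set leaves a residual of -g, and the correcting step is +H^{-1} g.
    w_new = w + np.linalg.solve(H, g)
    # Controlled Gaussian noise masks residual traces of the erased data;
    # noise_scale stands in for the paper's calibrated noise mechanism.
    return w_new + rng.normal(0.0, noise_scale, size=w.shape)
```

In a sketch like this, the noise scale is the crux: too little noise and traces of the erased data may survive, too much and the remaining model’s utility degrades, which is why calibrating it to the statistical gap between surrogate and original data matters.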
Testing on synthetic and real-world datasets showed that the method achieved privacy protection comparable to full retraining while requiring significantly less computing power.
The technique currently works for the simpler AI models that remain in wide use, and it could eventually be extended to more complex systems such as ChatGPT. Roy-Chowdhury noted that its potential impact extends beyond regulatory compliance: media organizations, healthcare providers, and other entities that handle sensitive information embedded in AI could benefit from this tool. It may also give individuals greater control over having their personal or copyrighted content removed from AI systems.
“People deserve to know their data can be erased from machine learning models—not just in theory, but in provable, practical ways,” Güler said.
The research paper is titled “A Certified Unlearning Approach without Access to Source Data.” The project included collaboration with Sk Miraj Ahmed from Brookhaven National Laboratory in Upton, NY. Both Roy-Chowdhury and Güler hold faculty appointments in UC Riverside’s Department of Electrical and Computer Engineering as well as secondary appointments in Computer Science and Engineering.