Can Old Personal Data Haunt an AI Model Under DPDP?
Why the real privacy risk may not be in the dataset, but inside the model that survived it

A company may delete the old dataset. The harder question is whether the model trained on that dataset has also become legally clean.
That question is now commercially important for Indian businesses. Many companies trained recommendation engines, fraud scoring systems, customer segmentation tools, chatbots, risk engines and analytics models before the Digital Personal Data Protection Act, 2023 became the centre of compliance conversations. Those models may continue to run inside the business today. The legal issue is not merely historical. It is present and operational.
The central question is this: when personal data was used to train an AI model before DPDP applied in practice, can the model continue to be used after DPDP without a fresh privacy analysis?
The wrong question and the better question
The wrong question is whether DPDP retrospectively makes every past act of AI training unlawful. That would be too broad. Indian law generally presumes that a statute does not operate retrospectively, particularly where penal consequences are involved, unless the statute clearly says so. The better question is whether the present use of the model after DPDP involves continuing processing of personal data.
This distinction matters because AI systems are not one single object. A business may say that the training is complete. But the live AI system may still include model weights, fine tuning layers, vector databases, prompt logs, user profiles, feedback loops, embeddings and output monitoring. Some of these layers may continue to involve identifiable individuals.
Therefore, the legal analysis should not start with the date on which the model was trained. It should start with the technical anatomy of the system being used today.
The DPDP hook: processing is broad enough to capture continued use
The DPDP Act defines personal data as data about an individual who is identifiable by or in relation to such data. It also defines processing broadly enough to include operations such as collection, recording, organisation, storage, adaptation, retrieval, use, alignment, combination, indexing, sharing, disclosure, erasure and destruction.
That breadth is important. If a model or the surrounding AI system continues to use, retrieve, combine, infer from or act upon identifiable personal data, the company may not be dealing only with a past training event. It may be engaged in present processing.
Section 5 of the DPDP Act also matters because it deals with notice. Section 5(2) addresses the situation where consent was given before commencement: the Data Fiduciary must give notice as soon as reasonably practicable and may continue processing until the Data Principal withdraws consent. Old consent and continuing processing are therefore not ignored. They are brought into the new compliance framework.
The practical consequence is simple. A business cannot rely only on the fact that data was collected earlier. It must ask whether the present AI use is covered by a valid ground of processing, a clear specified purpose, appropriate notice and a defensible retention position.
The European treatment under GDPR: the model may itself be the problem
The European Data Protection Board has treated this issue far more directly under GDPR. In Opinion 28/2024 on AI models, the EDPB rejected any automatic assumption that an AI model trained on personal data is anonymous. The assessment depends on whether personal data can be extracted from the model or obtained from queries, outputs or other available means.
The EDPB separates the problem into scenarios. First, a model may have been developed through unlawful processing and may retain personal data when later used by the same controller. Second, it may retain personal data and then be used by another controller. Third, it may be anonymised before later deployment. The legal consequence changes depending on which scenario is true.
This is the key point for Indian companies to understand. Under the European approach, a model is not treated as clean merely because the training dataset has been deleted. Regulators ask whether unlawfully processed data has left a continuing legal residue inside the model or its deployment environment.
For India, the same reasoning is not binding, but it is persuasive. DPDP has its own language and structure, but the underlying compliance issue is similar: if identifiable personal data survives inside a model or in the system built around it, continued use may need independent justification.
What EU case law adds to the analysis
The strongest way to analyse this issue is not only through AI guidance, but through older European case law on personal data, identifiability, control and erasure. Four principles are especially useful.
| Authority | Principle | Relevance to AI models |
| --- | --- | --- |
| Breyer v Bundesrepublik Deutschland, C-582/14 | Data may be personal even where identification requires additional information held elsewhere, if reasonable means of identification exist. | A model output, embedding or log may be personal data if it can be linked back to an individual through available means. |
| Nowak v Data Protection Commissioner, C-434/16 | Personal data is not limited to obvious identifiers. Content may relate to a person because of its content, purpose or effect. | A model output that profiles, scores or affects a person may relate to that person even if it does not display a name. |
| Google Spain v AEPD, C-131/12 | A search engine operator was treated as processing personal data by locating, indexing, storing and making information available. | The analogy is powerful. AI deployment may not be passive. Retrieval, indexing and generation can be present processing. |
| Wirtschaftsakademie, C-210/16 and Fashion ID, C-40/17 | A party may be responsible for a processing operation even if it has limited access to raw data, where it influences the means or purposes of that operation. | A company using third party AI models, APIs, analytics tools or embedded AI features cannot assume that lack of raw data access removes all responsibility. |
India does not yet have a reported judgment deciding whether an AI model trained on historical personal data is itself personal data under DPDP. That absence should not make the analysis shallow. Indian privacy law has a constitutional foundation in Justice K.S. Puttaswamy v Union of India, where the Supreme Court recognised privacy as a fundamental right and located informational control within individual dignity and autonomy.
Puttaswamy is not an AI case. It does not answer questions about model weights, vector stores or machine unlearning. But it gives the direction of travel. Privacy is not only about secrecy. It is also about control over personal information and the consequences of data driven decision making.
For AI systems, that means the inquiry should not be limited to whether a spreadsheet containing names still exists. The more serious question is whether a person remains exposed to profiling, inference, automated treatment or loss of control because of a model trained on their data.
For a meaningful DPDP review, the company should divide the AI system into layers. The risk is different at each layer.
| System layer | DPDP concern |
| --- | --- |
| Training dataset | Highest obvious risk if raw personal data is retained without a valid purpose or lawful ground. |
| Model weights | The question is whether the model has memorised, can reproduce or can reveal identifiable personal data. |
| Embeddings and vector stores | Often overlooked. These may encode user documents, conversations, profiles or behavioural traces in retrievable form. |
| Fine tuning layers and adapters | May carry client specific or user specific information even where the base model is generic. |
| Prompt logs and feedback loops | Fresh personal data may enter the system every day through user interactions. |
| Outputs and decisions | Even where inputs are not displayed, the system may generate profiles, risk labels, recommendations or scores affecting individuals. |
First conclusion: old training does not automatically invalidate the model.
If the training happened before DPDP became operational, that fact alone should not be treated as automatic illegality. There must be a present connection with personal data processing or a continuing compliance obligation.
Second conclusion: continued use may still be fresh processing.
If the system uses fresh inputs, stores logs, retrieves profiles, generates individual level outputs, or relies on embeddings containing personal data, DPDP analysis is triggered today.
Third conclusion: the hardest case is model residue.
If the raw dataset is gone but the model can reproduce, reveal or infer identifiable personal data, the company must treat the model as a live privacy risk and not merely as an old asset.
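One way to make the residue question testable is to probe the model and scan its outputs for identifiers known to have been in the old training data. The sketch below is illustrative only. The `generate` callable is a hypothetical stand-in for whatever inference call the company's model exposes, and the prompts, names and identifiers are invented. A clean result does not show that the model is anonymous. It only shows that these probes did not surface these strings.

```python
from typing import Callable, Iterable


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so trivial formatting does not hide a match."""
    return " ".join(text.lower().split())


def find_residue(
    generate: Callable[[str], str],
    prompts: Iterable[str],
    known_identifiers: Iterable[str],
) -> list[tuple[str, str]]:
    """Return (prompt, identifier) pairs where an output contains a known identifier verbatim."""
    needles = [(ident, normalise(ident)) for ident in known_identifiers]
    hits = []
    for prompt in prompts:
        output = normalise(generate(prompt))
        for ident, needle in needles:
            if needle and needle in output:
                hits.append((prompt, ident))
    return hits


# Hypothetical stand-in for a real model call; it has "memorised" one record.
def fake_generate(prompt: str) -> str:
    if "contact" in prompt:
        return "You can reach Asha Rao at asha.rao@example.com"
    return "I do not have that information."


hits = find_residue(
    fake_generate,
    prompts=["Who is the contact for policy 1182?", "Summarise the refund policy."],
    known_identifiers=["asha.rao@example.com", "+91 98765 43210"],
)
print(hits)
```

Verbatim matching is the weakest form of this test. It will miss paraphrased or inferred information, which is why the residue question also has to be asked of profiles, scores and other outputs about individuals.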
Before continuing to use a model trained on historical personal data, a company should prepare a short AI data provenance note. It should answer the following questions.
1. What categories of personal data were used for training, fine tuning or evaluation?
2. Was the original notice clear enough to cover model training, profiling, product improvement or analytics?
3. Was consent obtained, and if it was obtained before commencement, has fresh DPDP compliant notice been provided where required?
4. Can the model reproduce, expose or infer information about identifiable individuals?
5. Do embeddings, vector databases or logs retain personal data separately from the model?
6. Can a withdrawal, correction or erasure request be operationally honoured?
7. Is the system making decisions, recommendations or scores about individuals?
8. Are third party AI vendors, processors and APIs contractually bound to security, deletion and purpose restrictions?
9. Is retraining, filtering, model editing, output suppression or retirement required for high risk use cases?
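The nine questions above can be kept as a structured record, so that unanswered items stay visible each time the model is reviewed. This is a minimal sketch under stated assumptions: the short keys, the `ProvenanceNote` class and the sample system name are hypothetical labels chosen for illustration, not terms from the DPDP Act or Rules.

```python
from dataclasses import dataclass, field

# The nine provenance questions, keyed by short hypothetical labels.
QUESTIONS = {
    "data_categories": "What categories of personal data were used for training, fine tuning or evaluation?",
    "original_notice": "Was the original notice clear enough to cover model training, profiling, product improvement or analytics?",
    "consent_and_fresh_notice": "Was consent obtained, and has fresh notice been provided where required?",
    "reproduction_risk": "Can the model reproduce, expose or infer information about identifiable individuals?",
    "surrounding_stores": "Do embeddings, vector databases or logs retain personal data separately from the model?",
    "rights_requests": "Can a withdrawal, correction or erasure request be operationally honoured?",
    "individual_decisions": "Is the system making decisions, recommendations or scores about individuals?",
    "vendor_contracts": "Are third party vendors, processors and APIs bound to security, deletion and purpose restrictions?",
    "remediation": "Is retraining, filtering, model editing, output suppression or retirement required?",
}


@dataclass
class ProvenanceNote:
    system_name: str
    answers: dict[str, str] = field(default_factory=dict)

    def record(self, key: str, answer: str) -> None:
        """Store an answer against one of the nine questions."""
        if key not in QUESTIONS:
            raise KeyError(f"Unknown question key: {key}")
        self.answers[key] = answer.strip()

    def open_questions(self) -> list[str]:
        """Return the keys that have no answer, or only a blank one."""
        return [key for key in QUESTIONS if not self.answers.get(key)]


note = ProvenanceNote("churn-model-v3")  # hypothetical system name
note.record("data_categories", "Contact details and transaction history")
note.record("surrounding_stores", "Yes: support chat embeddings in a vector store")
for key in note.open_questions():
    print(f"OPEN: {QUESTIONS[key]}")
```

A record like this does not decide anything by itself. Its value is that an empty answer is visible, rather than silently assumed to be favourable.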
A model trained before DPDP can continue to be used only after the company answers a more precise question: does the present AI system process personal data today?
If the answer is no because the model is demonstrably anonymous, no personal data is retained in surrounding systems, and deployment does not process personal data, the risk is lower. If the answer is yes because the system uses personal inputs, stores logs, retrieves user level embeddings, produces individual profiles or can reproduce personal data, the company needs a DPDP compliant basis for continued use.
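The test in the paragraph above can be written as a simple rule over facts about the deployed system. The sketch below encodes this article's reasoning and is not a statutory test. The flag names are hypothetical, and treating a model that has not been shown to be anonymous as a trigger is a conservative assumption made here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemFacts:
    uses_personal_inputs: bool
    stores_logs_with_personal_data: bool
    retrieves_user_level_embeddings: bool
    produces_individual_profiles: bool
    can_reproduce_personal_data: bool
    model_demonstrably_anonymous: bool


def processing_triggers(facts: SystemFacts) -> list[str]:
    """List the facts that point to present processing of personal data."""
    checks = {
        "personal inputs": facts.uses_personal_inputs,
        "logs containing personal data": facts.stores_logs_with_personal_data,
        "user level embeddings": facts.retrieves_user_level_embeddings,
        "individual profiles or scores": facts.produces_individual_profiles,
        "reproducible personal data": facts.can_reproduce_personal_data,
        # Assumption: unproven anonymity is treated as a trigger.
        "model not shown to be anonymous": not facts.model_demonstrably_anonymous,
    }
    return [label for label, present in checks.items() if present]


def needs_dpdp_basis(facts: SystemFacts) -> bool:
    """True if any trigger is present, so continued use needs a DPDP compliant basis."""
    return bool(processing_triggers(facts))


legacy_system = SystemFacts(
    uses_personal_inputs=False,
    stores_logs_with_personal_data=True,
    retrieves_user_level_embeddings=False,
    produces_individual_profiles=False,
    can_reproduce_personal_data=False,
    model_demonstrably_anonymous=True,
)
print(needs_dpdp_basis(legacy_system), processing_triggers(legacy_system))
```

Even an anonymous model fails the test here because of its logs, which reflects the point that the analysis attaches to the system and not only to the model.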
In other words, the privacy question has moved from the dataset to the system. The dataset may be gone. The model may remain. The law will increasingly ask whether the individual has also been removed from the system, or only hidden inside it.
Citations:
Digital Personal Data Protection Act, 2023: sections 2, 3, 4, 5, 6, 8, 10 and 12.
Digital Personal Data Protection Rules, 2025: notice, phased commencement and implementation framework.
Justice K.S. Puttaswamy v Union of India, Supreme Court of India, 2017.
EDPB Opinion 28/2024 on certain data protection aspects related to the processing of personal data in the context of AI models.
Breyer v Bundesrepublik Deutschland, Court of Justice of the European Union, C-582/14.
Nowak v Data Protection Commissioner, Court of Justice of the European Union, C-434/16.
Google Spain v AEPD, Court of Justice of the European Union, C-131/12.
Wirtschaftsakademie, Court of Justice of the European Union, C-210/16.
Fashion ID, Court of Justice of the European Union, C-40/17.
FTC Everalbum and Rite Aid enforcement actions as comparative examples of regulators looking beyond raw datasets toward models and algorithms.
