
Scraped Data Is Tainting Your AI Models

Based on research by noyb

schrems · personal data · data protection · gdpr · ai

A major investigation by noyb into Austrian credit agency CRIF has exposed how public registries are being systematically scraped and repurposed for commercial credit scoring. This isn't just a privacy issue; it is a fundamental breach of data integrity that threatens the reliability of AI-driven financial models. For Swedish tech leaders, this highlights a critical vulnerability in how training data is sourced and validated.

The core problem is the violation of the GDPR principle of purpose limitation under Article 5(1)(b). Public registers like land or company registries are designed to prove ownership or legal standing, not to serve as address books for data brokers. By scraping these sources without technical safeguards like query limits, brokers like AZ Direct and Compass-Verlag are collecting master data for purposes entirely unrelated to the original legal intent. This creates a black box where the provenance of data is unknown, making it impossible for individuals to verify if their data was legally obtained.

For CTOs and CISOs, the risk is twofold. First, using such data for AI model training introduces severe compliance gaps. If the underlying data collection violates purpose limitation, the derived insights or scores may be legally tainted, exposing your organization to regulatory scrutiny and potential fines. Second, the lack of traceability means you cannot guarantee the quality or bias-free nature of your inputs. A credit score based on scraped public data rather than actual financial behavior is statistically unreliable, leading to flawed business decisions and reputational damage.

This case reinforces the urgent need for data sovereignty and local processing. When data flows through opaque, cross-border broker networks, accountability vanishes. Processing data within the EU or Sweden, under strict local governance and technical controls, ensures that purpose limitations are respected and that data provenance is auditable. It shifts the paradigm from trusting third-party brokers to controlling your own data supply chain, ensuring that your AI systems are built on lawful, verifiable, and high-integrity foundations.
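The "technical controls" side of local processing can be as simple as failing fast when a pipeline stage would run outside governed infrastructure. A minimal sketch, assuming a hypothetical region allow-list checked at pipeline start-up:

```python
# Regions where processing is permitted under this hypothetical policy,
# e.g. Stockholm and Frankfurt.
ALLOWED_REGIONS = {"eu-north-1", "eu-central-1"}


def assert_local_processing(processor_region: str) -> None:
    """Refuse to start if a stage would run outside the approved EU
    boundary, so data never leaves governed infrastructure silently."""
    if processor_region not in ALLOWED_REGIONS:
        raise RuntimeError(
            f"Processing region {processor_region!r} is outside the "
            "approved EU boundary; refusing to start."
        )
```

Enforcing the boundary in code, at start-up, is what turns a governance document into an auditable control: the check either passed or the pipeline never ran.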