The European Commission’s Search-Data Trust Fall

by Staff Reporter

The European Commission is trying to pull off a difficult trick: force Google to share search-query data with rivals while insisting the shared data is no longer personal data at all.

That is the central tension in the Commission’s April 16 preliminary findings under Article 8(2) of the Digital Markets Act (DMA), which specify how Alphabet must comply with Article 6(11)’s data-sharing obligations. The consultation closed May 1, and a final implementing act is due by July 27.

The proposed measures are detailed, and they reflect a serious effort to reconcile the DMA’s data-access mandate with the General Data Protection Regulation’s (GDPR) anonymization requirement. In particular, the Commission proposes a two-layer regime: a technical anonymization pipeline—attribute suppression, allowlisting, length thresholds, metadata generalization, and “mini-sessionization”—backed by contractual restrictions and recurring audits.

The problem is that the regime works only if both layers hold. And each layer depends heavily on trust in the other.

This post examines two questions the Commission has not adequately answered.

First, are the technical measures sufficient—on their own terms—to render the Search Dataset anonymous under European Union law? Put differently, do they ensure that re-identification “appears in reality to be insignificant,” under the Court of Justice of the European Union’s (CJEU) standard in Breyer, later reaffirmed in Single Resolution Board (SRB)?

Second, are the contractual and audit mechanisms robust enough to handle the realistic range of recipients? That includes recipients who are technically competent, commercially sophisticated, and not formally designated as hostile, but who may still behave adversarially in practice.

My short answer to both questions is no. More importantly, the two weaknesses reinforce each other.

The Commission has shifted meaningful risk-bearing work from the technical layer to the contractual layer, and from the contractual layer to private enforcement after the fact. If the audit process cannot be trusted, the anonymization process cannot be trusted. If the anonymization process cannot be trusted, the data was never lawful to share.

I will explain why both lines of defense are weaker than the consultation document suggests. In doing so, I draw on my earlier analyses of the coming GDPR/DMA Article 6(11) conflict, my coverage of the 2024 and 2025 DMA compliance workshops, and my comparison of the EU DMA regime with Judge Amit Mehta’s user-side data-sharing remedy in U.S. v. Google. I also draw on the International Center for Law & Economics (ICLE) comments submitted to the Commission during the consultation by Geoffrey Manne, Dirk Auer, and Mario Zúñiga regarding Alphabet’s Article 6(11) obligations.

The Recital Can’t Save the Rule

Start with the legal standard. Article 6(11) of the DMA requires gatekeepers to provide access to “ranking, query, click and view data” on fair, reasonable, and nondiscriminatory (FRAND) terms. It then adds the crucial condition: “[a]ny such query, click and view data that constitutes personal data shall be anonymised.” Recital 61 further states that anonymization should occur “without substantially degrading the quality or usefulness of the data.”

The recital is doing more work in the Commission’s draft than it can bear.

As Peter Craddock has argued, and as Mark Leiser similarly argues in his consultation submission, the operative provision is unconditional: personal data “shall be anonymised.” The recital’s utility caveat operates within the anonymization requirement, not against it. If a given technique sufficiently anonymizes the data, the gatekeeper should prefer the version that preserves more utility. But if no available technique can adequately anonymize the data at a given utility level, the legal answer is to suppress the data—not to weaken the anonymization standard. The Commission’s Article 8(2) specification power does not extend to creating a softer, DMA-specific definition of “anonymisation.”

So what does anonymization require?

Under CJEU case law—most notably Breyer, and now SRB—the test is “relative.” The same dataset can be personal data for one entity and anonymous data for another, depending on the “means reasonably likely to be used” by the recipient, or by anyone else whose capabilities must realistically be considered.

Importantly, Breyer limits the analysis to lawful means of identification. The question is whether the risk of identification “appears in reality to be insignificant.” Opinion 05/2014 of the Article 29 Working Party supplies the operational framework: anonymization must prevent singling out, linkability, and inference.

The Commission’s two-layer approach—technical restrictions plus contractual controls—is, charitably read, an attempt to satisfy SRB by tightly constraining the recipient environment, so that re-identification tools available to recipients no longer count as “reasonably likely” means. The draft Joint Guidelines on the interplay between the DMA and the GDPR endorse this combined approach. (See paragraphs 180-181.)

The hard question—and the one the Preliminary Measures largely glide past—is whether that combination actually delivers what the DMA requires.

The Technical Pipeline’s Four Big Problems

To be fair, the five-step pipeline described in Section 3.1 of the Preliminary Measures is more sophisticated than Google’s earlier frequency-thresholding implementation. Third parties complained that Google’s original approach was so restrictive that it yielded little useful data. At the 2025 compliance workshop, DuckDuckGo and Seznam argued that 99% of distinct queries were excluded. 

As Alba Ribera Martínez explains, the Commission’s newer regime is more nuanced. It combines an entity-based allowlist—more than 50 signed-in users issuing queries containing a given entity over 13 months—with length-based suppression, metadata generalization, and “mini-sessionization.” That is more robust than Google’s original “30 globally signed-in users per exact query” filter, at least for queries that are lexically rare but semantically common.

Still, the pipeline has at least four structural weaknesses the Commission has not adequately addressed.

The detector layer is brittle at scale

The technical pipeline relies heavily on “personal data detectors” to identify names, addresses, and phone numbers before queries are split into entities. (See paragraph 22(a)(1) of the Preliminary Measures.) As Craddock explains, names are not a tractable detection problem at internet scale.

Capitalization conventions vary by user habit. A search for “james brown” may evade a capitalized-name detector. Naming conventions vary by culture: “Charles de Gaulle,” “LeBron James,” and “Skłodowska-Curie” all behave differently. Meanwhile, countless ordinary words are also common surnames: “Cook,” “Smith,” “Rose,” “Bill.”

Across billions of search queries, even a low false-negative rate produces millions of records in which personal data slips through. Tightening the detectors creates the opposite problem: more false positives and more over-suppression of legitimate queries. Either way, the system degrades.
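
A rough sketch makes the scale argument concrete. The inputs below are invented for illustration only: the total query volume, the share of queries containing a name, and the candidate false-negative rates are all assumptions, not figures from the Preliminary Measures or from Google.

```python
def expected_leaks(total_queries: int, share_with_names: float,
                   false_negative_rate: float) -> float:
    """Expected number of name-bearing queries the detector fails to flag."""
    return total_queries * share_with_names * false_negative_rate


# Hypothetical inputs, chosen only to show orders of magnitude.
TOTAL_QUERIES = 5_000_000_000   # assumed queries in the export window
SHARE_WITH_NAMES = 0.05         # assumed share of queries containing a name

for fnr in (0.001, 0.01, 0.05):
    leaks = expected_leaks(TOTAL_QUERIES, SHARE_WITH_NAMES, fnr)
    print(f"false-negative rate {fnr:.1%}: ~{leaks:,.0f} undetected name queries")
```

Under these assumed inputs, a detector that misses only 1% of names still leaves roughly 2.5 million name-bearing queries unflagged. The absolute numbers are not the point; the linear relationship between volume and leakage is.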

Notably, the Preliminary Measures contain no acceptable-error-rate metric, no public benchmark against which detector performance can be audited, and no contingency plan for inevitable misclassifications.

There is also a deeper problem, and this is where Leiser’s critique becomes especially powerful. Search queries routinely contain personal data about people other than the searcher: relatives, colleagues, public figures, complainants, alleged wrongdoers, former partners, and so on.

The Commission’s pipeline effectively assumes that the only relevant data subject is the user entering the search query. Article 6(11) does not support that assumption. The operative provision states that “any such query, click and view data that constitutes personal data shall be anonymised.” Under Article 4(1) of the GDPR, “personal data” is defined by reference to the person the data relates to—not the person who generated it.

To be sure, the “such data” referenced in Article 6(11) originates from end-user activity. But the phrase “constitutes personal data” does not limit the relevant data subject to the end user. Recital 61 opens by discussing “personal data of end users,” and the Commission appears to rely heavily on that wording. But the recital’s own anonymization test immediately drops the end-user qualifier, referring instead to whether the information relates to “an identified or identifiable natural person.”

In any event, a recital cannot narrow the operative provision, as Craddock notes. Reading Article 6(11) to authorize disclosure of identifiable third-party information—with no notice, no controller chain, and no remedy for the affected third party—would also sit uneasily with Article 52(1) of the Charter of Fundamental Rights, given the Article 7 privacy rights of those named individuals.

The mechanism is easy to see.

Imagine a user searches: “Jean Dupont infidelity divorce Brussels.” “Jean Dupont” is a placeholder for a moderately well-known professional whose name has appeared often enough in Google Search queries to land on the Commission’s allowlist. The detector recognizes the name. The entity clears the 50-user threshold. The remaining terms—“infidelity,” “divorce,” and “Brussels”—are common enough to survive filtering. The query length falls below the 95th-percentile threshold. The record flows into the export dataset.

Recipients then receive the full query text, along with country, language, device information, and an S2-cell-level location indicator. The searching user’s metadata may satisfy the Commission’s k=50 anonymity threshold. Jean Dupont’s data does not. The pipeline does not meaningfully treat him as a data subject at all.

The result is disclosure of identifiable information about Jean Dupont’s family life—and potentially his sexual conduct, depending on the inferences recipients draw—to entities with which he has no relationship, no contract, and no GDPR controller chain. Under the Commission’s design, he receives neither notice nor remedy.

Nor does the detector layer solve this. Even a perfect name detector would suppress third-party data only if the name itself were blocked, and only if suppression extended to the rest of the query string. The Commission’s pipeline does neither. If the name appears on the allowlist, the surrounding query text remains intact.
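
The Jean Dupont example can be traced through a toy model of the export rule. This is a simplified sketch based only on the description above, not the Commission’s actual pipeline: the `Entity` class, the `is_exported` function, the user counts, and the `MAX_ENTITIES` stand-in for the length threshold are all hypothetical.

```python
from dataclasses import dataclass

ALLOWLIST_MIN_USERS = 50  # entity threshold as described in the text
MAX_ENTITIES = 6          # hypothetical stand-in for the length threshold


@dataclass(frozen=True)
class Entity:
    text: str
    is_person_name: bool  # output of an assumed, perfectly accurate name detector


def is_exported(entities, users_per_entity):
    """Toy export rule: short enough, and every entity clears the allowlist."""
    if len(entities) > MAX_ENTITIES:
        return False
    return all(users_per_entity.get(e.text, 0) >= ALLOWLIST_MIN_USERS
               for e in entities)


# Hypothetical counts of signed-in users who searched each entity.
users_per_entity = {
    "Jean Dupont": 400,
    "infidelity": 90_000,
    "divorce": 250_000,
    "Brussels": 800_000,
}

query = [
    Entity("Jean Dupont", True),
    Entity("infidelity", False),
    Entity("divorce", False),
    Entity("Brussels", False),
]

exported = is_exported(query, users_per_entity)
names_in_export = [e.text for e in query if e.is_person_name] if exported else []
print(exported, names_in_export)
```

In this model the detector does its job perfectly and the query is still exported with the name attached, because detection feeds an allowlist check and does not trigger suppression of the surrounding text.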

In fact, the allowlist may invite leakage. The more publicly salient a person is, the more likely his name appears on the allowlist, and the easier it becomes for any search mentioning him—by any user, for any reason—to flow into the daily export.

The metadata thresholds do not solve composition attacks

The Commission also requires that at least 50 signed-in users share the same combination of inferred language, location, and device type. (See paragraph 31.) This is essentially a k-anonymity guarantee. The problem is that k-anonymity has a familiar weakness: a group can be anonymous without being private.

If all 50 users in a cohort share the same sensitive characteristic, anonymity does little good. For certain query categories—a rare medical condition, for example, or a politically sensitive term within a small linguistic minority—the Commission’s k=50 threshold may still permit effective disclosure of the sensitive attribute across the entire cohort.

The Commission’s implicit answer seems to be that sensitive records can be filtered through entity and length thresholds. But that misses the point. Composition attacks are orthogonal to query rarity or query length.
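
A small sketch shows how a cohort can clear the size threshold while revealing the sensitive attribute of every member, a weakness the k-anonymity literature discusses under the headings of homogeneity and l-diversity. The cohort keys, user identifiers, and topics below are invented, and the `cohort_summary` function is a hypothetical helper rather than anything specified in the Preliminary Measures.

```python
from collections import defaultdict

K = 50  # cohort-size threshold described in the text


def cohort_summary(records):
    """records: iterable of (cohort, user_id, topic) tuples."""
    users = defaultdict(set)
    topics = defaultdict(set)
    for cohort, user_id, topic in records:
        users[cohort].add(user_id)
        topics[cohort].add(topic)
    return {
        cohort: {
            "size": len(users[cohort]),
            "k_anonymous": len(users[cohort]) >= K,
            "distinct_topics": len(topics[cohort]),
        }
        for cohort in users
    }


# Hypothetical cohorts keyed by (language, location cell, device type).
uniform = ("lang-x", "cell-1", "mobile")
mixed = ("lang-y", "cell-2", "desktop")
records = [(uniform, f"u{i}", "rare condition") for i in range(60)]
records += [(mixed, f"v{i}", ["news", "sports", "recipes"][i % 3])
            for i in range(60)]

summary = cohort_summary(records)
for cohort, stats in summary.items():
    print(cohort, stats)
```

Both cohorts pass the size check. Only the second offers any real cover, because knowing that someone belongs to the first cohort is enough to know what they searched for. A size threshold alone cannot tell the two apart.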

The regime does not address active influence attacks 

The regime also appears vulnerable to active influence attacks by recipients themselves, a point Craddock and Leiser both emphasize.

Eligible recipients have lawful access to Google Search. They—or their employees, contractors, or paid panel users—can deliberately run searches designed to push particular entities across the 50-user threshold. Once a recipient has reason to believe a seeded query will appear in the Search Dataset, locating the resulting record becomes much easier.

None of this requires hacking, illicit databases, or breach of contract. It involves lawful conduct. That matters because Breyer asks whether identification methods are “reasonably likely to be used.” Lawful conduct that recipients can undertake unilaterally almost certainly qualifies.

The Commission’s contractual prohibition on “re-identification” or “sessionisation” does not solve this problem. (See paragraph 38(b).) Those restrictions matter only if the Commission later discovers the behavior and successfully enforces the rules after the fact.

The auxiliary-data problem is massive 

Finally, the linkage surface with auxiliary data is enormous. As the Chamber of Progress and others note in their consultation submissions, the dataset includes URLs, approximate timestamps, language, location, device information, access-point data, and increasingly detailed click and dwell-time signals.

Eligible recipients—by definition, rival search engines—already possess their own logs of user activity against many of the same URLs, during roughly the same time periods, and often involving overlapping users. Publishers and advertisers possess additional complementary data.

In other words, the DMA-protected dataset is intentionally designed to complement recipients’ own information. That is the entire competitive rationale behind Article 6(11).

Anonymization that might withstand attack by a recipient with no auxiliary data becomes far weaker when the attacker possesses structurally aligned datasets by design. Recital 26 of the GDPR explicitly directs regulators to consider identification means “reasonably likely to be used … by another person.” The Commission has not grappled seriously enough with that reality.
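
The linkage concern can be illustrated with a deliberately simple join on quasi-identifiers. Everything here is an assumption made for the sketch: the field names (`url`, `hour`, `cell`, `device`), the sample rows, and the `link` function do not reflect the actual schema of the Search Dataset or of any recipient’s logs.

```python
from collections import defaultdict


def link(shared, own_logs):
    """Match shared records to own logs on quasi-identifiers.

    Returns a mapping from shared query text to the single candidate user,
    and only where exactly one own-log row matches.
    """
    index = defaultdict(list)
    for row in own_logs:
        key = (row["url"], row["hour"], row["cell"], row["device"])
        index[key].append(row["user"])

    matches = {}
    for rec in shared:
        key = (rec["url"], rec["hour"], rec["cell"], rec["device"])
        candidates = index.get(key, [])
        if len(candidates) == 1:
            matches[rec["query"]] = candidates[0]
    return matches


# Hypothetical shared records (no user identifier present).
shared = [
    {"query": "query A", "url": "https://example.org/a",
     "hour": "2025-01-01T10", "cell": "cell-1", "device": "mobile"},
    {"query": "query B", "url": "https://example.org/b",
     "hour": "2025-01-01T11", "cell": "cell-2", "device": "desktop"},
]

# Hypothetical logs the recipient already holds about its own users.
own_logs = [
    {"user": "user-17", "url": "https://example.org/a",
     "hour": "2025-01-01T10", "cell": "cell-1", "device": "mobile"},
    {"user": "user-03", "url": "https://example.org/b",
     "hour": "2025-01-01T11", "cell": "cell-2", "device": "desktop"},
    {"user": "user-44", "url": "https://example.org/b",
     "hour": "2025-01-01T11", "cell": "cell-2", "device": "desktop"},
]

print(link(shared, own_logs))
```

The first shared record matches a single row in the recipient’s own logs; the second stays ambiguous between two users. The share of records that resolve uniquely depends on how finely the metadata is generalized and how much overlapping data the recipient holds, which is why the contractual ban on record-level linkage carries so much weight in the Commission’s design.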

In short, the technical pipeline is more sophisticated than Google’s initial implementation. But it remains brittle along several dimensions the Preliminary Measures barely address. The Commission’s answer is to shift the remaining burden to contractual restrictions. That is where the second concern begins.

Trust Us, We Audited It

The Commission’s contractual measures, set out in Section 3.2 of the Preliminary Measures, are extensive on paper.

Recipients are prohibited from re-identification, data augmentation, record-level linkage with auxiliary datasets, sessionization beyond “mini-sessions,” and onward sharing. The regime also imposes purpose limits: recipients may use the Search Dataset only to optimize online search engine (OSE) services. (Paragraph 40.) There is a 13-month retention cap. (Paragraph 41.) The measures further require encryption at rest and in transit, least-privilege access controls, phishing-resistant multifactor authentication, restrictions on local workstation copying, and logging obligations with one-year retention periods. (Paragraphs 45-49.)

The compliance architecture culminates in a two-tier independent audit cycle under ISAE 3000 or an equivalent standard. Level 1 audits assess the design and suitability of controls; Level 2 audits assess operating effectiveness. (Paragraphs 52-70.)

At first glance, this looks like a serious compliance package. The problem is that the audit system is weaker than it appears. Three points matter in particular.

The audits mostly check boxes

The Level 1 and Level 2 reports focus on whether the recipient has documented controls and whether those controls appear to function as designed. (Paragraphs 56-57.)

Alphabet’s role is remarkably limited. It need only confirm that the report exists, is signed by a qualified practitioner, addresses the required assurance objectives, and satisfies the required format. (Paragraph 111.) The Preliminary Measures explicitly state that Alphabet “shall not reassess the substance or scope of the assurance conclusions.”

Nor does the Commission appear to conduct substantive review of the audits, beyond receiving notification that they occurred. The regime largely treats the existence of an audit as evidence of compliance, without meaningfully interrogating the quality of the audit or the rigor of its conclusions.

‘Independent’ auditors are not necessarily independent

The independence requirement is narrower than it sounds. The Preliminary Measures define auditor independence by reference to ISAE 3000 ethics standards. Those standards address the auditor’s independence from the audited entity itself. They do not address independence from outside commercial or strategic interests that may align with the audited entity.

That distinction matters. A formally qualified ISAE 3000 practitioner who is commercially eager for business, operating in a weak supervisory environment, technically outmatched by the auditee, or simply disinclined to push too hard against a paying client can still produce a clean report.

And that is before the harder cases: auditors who fully understand the weaknesses in the system, but nonetheless issue a reasonable-assurance opinion at the outer edge of what the standard tolerates.

The Commission’s framework assumes that “independent auditor” is a meaningful substitute for adversarial verification. Often, it is not.

The data move faster than the audit cycle

The timing mismatch is the final problem. After the initial Level 1 assessment, the audit cycle becomes annual. (Paragraphs 56-57.) The Search Dataset, by contrast, flows daily.

That means the system relies heavily on retrospective compliance review while disclosure occurs continuously and at scale. Even within an audit period, problematic behavior could persist for months before detection—assuming it is detected at all.

Put differently, the regime is built around after-the-fact accountability, not ex ante technical impossibility.

Those weaknesses become more serious once we stress-test the framework against the kind of recipient the Commission’s first assurance objective is supposed to screen out.

The Regime Is Built for Compliant Recipients

The Commission’s first assurance objective, set out in paragraph 67, comes closest to a substantive eligibility screen.

It excludes recipients that are directly or indirectly:

  • sanctioned under EU restrictive measures;
  • covered by sanctions that otherwise prohibit provision of the Search Dataset; 
  • subject to action under the EU foreign direct investment screening regime (Regulation 2019/452);
  • designated as a “high-risk supplier” under Union law; or 
  • identified by Union or member-state authorities as cybersecurity, public-security, or public-order threats. 

If a recipient is not on a sanctions list, has not been screened out under foreign-investment rules, and has not been formally designated a high-risk supplier, it clears the first assurance objective.

That sounds reassuring—until one asks what kinds of actors the framework actually excludes.

Consider a third-party “online search engine” incorporated in an EU member state, but beneficially owned through layered structures by actors aligned with an adversarial state. A Chinese or Russian intelligence-linked commercial proxy is the obvious example, although the same logic applies to organized criminal groups or sanctions-evasion networks.

Given the Commission’s expansive definition of “online search engine”—which now reaches AI chatbots with search functionality—setting up such a service is increasingly easy. The recipient applies, signs the license agreement, hires a formally qualified but commercially marginal ISAE 3000 auditor, obtains a Level 1 report, and begins receiving daily Search Dataset exports through the API.

What protects users—and anyone else named in search queries—in this scenario?

Very little.

The auditor verifies that, at the time of the report, the recipient is not formally sanctioned and has not been subject to an EU investment-screening decision. But the auditor need not pierce the corporate veil to identify ultimate beneficial ownership.

Nor is the auditor conducting an intelligence investigation. The recipient’s “credible and documented plans” for search-engine development, required under paragraph 68, are evaluated largely on the basis of the recipient’s own documents. Serious bad actors can produce credible paperwork.

The “change of control” procedure is similarly weak. (Paragraphs 117-120.) Recipients must notify Alphabet within 10 days of a public announcement, or 30 days before the change takes effect, whichever comes earlier. But the ownership transitions that matter most are precisely the ones designed not to become public.

Alphabet’s role remains largely ministerial. Under paragraph 111, it may verify only the formal completeness of the audit materials. It may not reassess the auditor’s conclusions. The Commission receives notice under paragraph 112, but there is no required forward-looking substantive review.

The contractual restrictions are equally fragile under real-world conditions. Yes, the recipient contract prohibits re-identification, augmentation, onward sharing, and related conduct. But contractual enforcement becomes largely aspirational once data has been exfiltrated, especially across borders.

Article 7 of China’s National Intelligence Law, for example, obliges Chinese organizations and citizens to support and cooperate with state intelligence work. Comparable authorities exist in other adversarial jurisdictions. If a state-linked recipient, or a state-linked subsidiary designed to receive the data, obtains the Search Dataset, the data will move. Any eventual contractual or regulatory penalties will fall on the EU shell entity long after the data has crossed the relevant border.

The expedited-termination mechanism does little to solve this. Paragraph 138 permits suspension only where there is “an urgent risk of serious and irreparable damage to the anonymisation of end users’ personal data.” That is a high evidentiary threshold. By the time the necessary evidence exists, the damage is likely done.

And termination does not rewind disclosure. The contract may require deletion of previously received data. (Paragraph 134.) A bad-faith recipient will simply ignore that obligation, especially where the EU lacks meaningful enforcement leverage.

The contrast with the U.S. District Court for the District of Columbia’s “Qualified Competitor” framework in U.S. v. Google is instructive. As I noted last September, the decree defines a Qualified Competitor as:

[A] Competitor who meets the Plaintiffs’ approved data security standards as recommended by the Technical Committee and agrees to regular data security and privacy audits by the Technical Committee, who makes a sufficient showing to the Plaintiffs, in consultation with the Technical Committee, of a plan to invest and compete in or with the GSE and/or Search Text Ads markets, and who does not pose a risk to the national security of the United States.

Several differences matter here.

First, the audits are conducted by the Technical Committee, not by auditors hired and paid by the recipient. That largely neutralizes the incentive problem on which the Commission’s framework depends.

Second, eligibility is substantively approved by the plaintiffs in consultation with the Technical Committee. It is not granted automatically based on the recipient’s own documents and a formally compliant audit report.

Third, recipient suitability is treated as a continuing supervisory question, overseen by a standing body throughout the life of the decree—not as a one-time compliance screen followed by annual reporting.

The national-security overlay is only the most obvious distinction.

To be clear, most actual recipients will not be hostile actors. Most will be legitimate search engines and AI assistants trying to build competing products. But that is not the relevant design question.

When the dataset may contain highly sensitive personal information, the regime cannot merely be safe for the median recipient. It must be resilient against the worst-case recipient, because the worst-case recipient is the one capable of imposing catastrophic privacy harms on EU users.

The Commission’s own first assurance objective implicitly recognizes this. Otherwise, there would be no sanctions-screening mechanism at all.

The problem is that the Commission has designed the screen around the wrong variable: formal designation status. The harder and more consequential questions—beneficial ownership, jurisdictional control, downstream state-compulsion risk, and operational alignment with adversarial actors—are left almost entirely to the recipient’s own auditor.

Anonymous by Trust Alone

The Commission has plainly moved beyond the crude frequency-thresholding regime that characterized Google’s early implementation, and the Preliminary Measures reflect a genuine effort to take privacy seriously.

But the regime still has two important weaknesses.

First, the technical layer leaves residual identification risks that, by the Commission’s own account, can only be reduced to an “insignificant” level through contractual restrictions. Second, the contractual layer depends heavily on private enforcement and an audit cycle that is formally rigorous but substantively thin. It also lacks anything resembling the continuing recipient-qualification framework the district court imposed in U.S. v. Google.

Together, the two layers produce an uncomfortable result: the legal status of the Search Dataset as “anonymous” depends on a chain of trust. That chain includes the recipient’s good faith, the auditor’s diligence, and Alphabet’s ability to police conduct it cannot directly observe. That is a fragile foundation for compliance with a statutory anonymization requirement.

Two changes would substantially improve the final implementing act.

First, the Commission should strengthen the technical layer, even at some cost to utility. The Preliminary Measures already contemplate multiple “samples” (A, B, and C), while the broader literature, particularly Leiser’s submission, offers a more graduated toolkit: aggregate-only access, controlled API access, clean-room environments, and regulator-supervised escrow arrangements. Differential-privacy budgets for aggregate signals and stronger detector-quality assurance, backed by published false-negative metrics, are both feasible.

The Commission is right that Article 6(11) data should not be rendered useless. But Recital 61’s instruction that utility not be “substantially degraded” operates within the anonymization requirement, not above it. The recital cannot override the operative command that personal data “shall be anonymised,” and the final implementing act should say so clearly.

Second, the recipient-eligibility framework needs to grapple seriously with the risks the current proposal largely sidesteps—especially the incentive problem created when auditees choose and pay their own auditors, and the inability of formal-designation screening to detect undisclosed hostile ownership or jurisdictional control.

The Commission’s current framework does not do that. The precise mechanics of a more substantive merits-based review can be refined through the specification process. But the direction is already clear from the U.S. v. Google comparison.

At bottom, the Commission is attempting something difficult: forcing broad data sharing while insisting the shared data is no longer personal data. That is a legally and technically precarious balancing act.

And if the system works only so long as everyone behaves, it is probably not an anonymization regime. It is a trust regime.
