Saturday, June 23, 2007

The Working Party


The Working Party is a group of representatives from every European country’s data protection authority plus the European Commission, dedicated to working on the harmonized application of data protection across Europe. I think I have the (perhaps dubious) distinction of being the private sector privacy professional who has worked the most with this group in the last decade. Most of my peers avoid the Working Party like the plague, but I agree with Mae West, who said, “Too much of a good thing is wonderful.”

In my many years of privacy practice, I’ve always thought the best strategy is to work constructively with the Working Party. They are thoughtful privacy regulators, trying to improve privacy practices and to enforce often-unclear data protection laws. The companies I’ve worked for are committed to improving their privacy practices and to complying with European laws. And the Working Party itself is committed to becoming more effective at working with the private sector, and in particular with the technology sector. So, based on my many years of experience, how could this all work better? And by the way, if you think I’ll be biased and self-serving in making these observations, feel free to stop reading here.

Here’s my golden rule: when regulators want to change practices across an entire industry, then they shouldn’t just work with one company. To make the point, here’s a little timeline summary of the recent Working Party exchanges with Google.

November 2006: the international data protection authorities issued a resolution calling on all search companies to limit the time periods during which they retain personally identifiable data. No leading search company publicly disclosed a finite retention period at this time.

March 2007: Google chose to lead the industry by announcing it would anonymize its search server logs after 18-24 months.

The announcement generated considerable positive press, in my opinion quite justified, since it was the first such move by a leading search company.

May 2007: the Working Party sent Google a letter asking it to explain its retention decisions, and to justify whether this period was “too long” under European data protection principles. This set off a worldwide press storm, as hundreds of newspapers ran headlines like: “Google violates EU data protection laws.” And many of the EU privacy regulators added fuel to the media flames, as they issued comments expressing their concerns about “Google”, or even declaring Google’s practices to be “illegal”, without even waiting for Google to respond to their letter.

June 2007: Various privacy advocates jumped on the publicity bandwagon. One even went so far as to declare Google to be the “worst” in terms of privacy, due to the vagueness of its data collection and data retention practices. But since Google was the only company on the entire list to have publicly stated a finite retention period, I would have thought Google should have been declared the “best.” Of course, that report was thoroughly debunked by more thoughtful industry observers, such as Danny Sullivan: “Google Bad on Privacy? Maybe it’s Privacy International’s Report that Sucks.” http://searchengineland.com/070610-100246.php

Nonetheless, the press damage was done. Even my dad called me after reading his small-town Florida newspaper to ask me why I was so bad at my job. Argh.

Then, I published a long open letter explaining the factors Google took into account in announcing a new retention period of 18 months: privacy, security, innovation, and retention obligations. http://googleblog.blogspot.com/2007/06/how-long-should-google-remember.html
I wanted us to be transparent about our analysis and the factors that guided it. Of course, I couldn’t really describe all the security reasons for log retention: you can’t describe all your security practices publicly without undermining your security. And you can’t describe all your uses of data for search algorithm improvements without revealing trade secrets to your competitors. But nonetheless, I think we have been remarkably transparent throughout this process. Meanwhile, our competitors have been completely, studiously silent.

Finally, the Working Party realized how unfair all this had become for Google, and told the press that its sub-group, called the Internet Task Force, would consider these issues further in July, and include other search companies in the review.

I’m quite eager to hear from other search companies. I undertook a thorough and thoughtful analysis of Google’s need for logs for these various (sometimes conflicting) purposes. I am intellectually curious to understand whether our peer companies balance these factors in the same way as we did, or differently. Will they announce retention periods too? And will they announce periods that are longer or shorter than ours?

Privacy on the Internet concerns everyone, and all companies. The Working Party has got to learn how to engage with the industry. I remain committed to working with the Working Party, but I fear that other companies in the industry will draw the opposite lesson: keep a low profile and try as hard as possible not to make it onto their radar screen. That would be bad for privacy. Well, the Working Party is a work in progress. And I hope someone tells my dad I’m not doing such a bad job… Or maybe my studiously-silent peers were right, and I was wrong…?

Thursday, June 14, 2007

Server Logs and Security

I recently posted a blog entry explaining why Google retains search server logs for 18 months before anonymizing them.
http://googleblog.blogspot.com/2007/06/how-long-should-google-remember.html
Security is one of the important factors that went into that decision. Google uses logs to help defend its systems from malicious access and exploitation attempts. You cannot have privacy without adequate security. I've heard from many people, all agreeing that server logs are useful tools for security, but some asking why 18 months of logs are necessary. One of my colleagues at Google, Daniel Dulitz, explained it this way:

"1. Some variations are due to cyclical patterns. Some patterns operate on hourly cycles, some daily, some monthly, and others...yearly. In order to detect a pattern, you need more data than the length of the pattern.

2. It is always difficult to detect illicit behavior when bad actors go to great lengths to avoid detection. One method of detecting _new_ illicit behaviors is to compare old data with new data. If at time t all their known characteristics are similar, then you know that there are no _new_ illicit behaviors visible in the characteristics known at time t. So you need "old" data that is old enough to not include the new illicit behaviors. The older the better, because in the distant past illicit behaviors weren't at all sophisticated.

3. Another way of detecting illicit behaviors is to look at old data along new axes of comparison, new characteristics, that you didn't know before. But the "old" data needs to run for a long interval because of (1). So its oldest sample needs to be Quite Old. The older the data, the more previously undetected illicit behaviors you can detect.

4. Some facts can be learned from new data, because they weren't true before. Other facts have been true all along, but you didn't know they were facts because you couldn't distinguish them from noise. Noise comes in various forms. Random noise can be averaged out if you have more data in the same time interval. That's nice, because our traffic grows over time; we don't need old data for that. But some noise is periodic. If there is an annual pattern, but there's a lot of noise that also has an annual period, then the only way you'll see the pattern over the noise is if you have a lot of instances of the period: i.e. a lot of years.

This probably isn't very surprising. If you're trying to learn about whether it's a good idea to buy or rent your house, you don't look only at the last 24 months of data. If you're trying to figure out what to pay for a house you're buying, you don't just look at the price it sold for in the last 24 months. If you have a dataset of house prices associated with cities over time, and someone comes along and scrubs the cities out of the data, it hasn't lost all its value, but it's less useful than it was."
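
To make the periodicity point concrete, here's a toy sketch in Python. The numbers are entirely made up for the illustration (this is my own example, not anything from Google's logs): a weekly traffic pattern buried in noise can only be estimated well once the logs span many instances of the cycle.

```python
# Toy illustration (made-up numbers, not Google data): recovering a
# cyclical pattern from noisy logs requires many instances of the cycle.
import random
import statistics

random.seed(42)

# Hypothetical "true" baseline traffic level for each day of the week.
WEEKLY_PATTERN = [100, 95, 90, 92, 98, 130, 140]

def simulate_day(day_index):
    """Observed traffic = true weekly pattern + substantial random noise."""
    return WEEKLY_PATTERN[day_index % 7] + random.gauss(0, 25)

def estimate_pattern(num_weeks):
    """Estimate the weekly cycle by averaging logs bucketed by weekday."""
    buckets = [[] for _ in range(7)]
    for day in range(num_weeks * 7):
        buckets[day % 7].append(simulate_day(day))
    return [statistics.mean(b) for b in buckets]

for weeks in (1, 4, 52):
    estimate = estimate_pattern(weeks)
    error = statistics.mean(
        abs(e - t) for e, t in zip(estimate, WEEKLY_PATTERN)
    )
    print(f"{weeks:3d} week(s) of logs -> mean estimation error {error:5.1f}")
```

With one week of logs, the noise swamps the cycle; with a year of logs, the weekday pattern emerges clearly. The same logic applies with even more force to monthly and annual cycles, which is why the oldest samples need to be, as Daniel puts it, Quite Old.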

Monday, June 4, 2007

Did you mean Paris France or Paris Hilton?

Here's an OpEd I contributed to the Financial Times.
http://www.ft.com/cms/s/560c6a06-0a63-11dc-93ae-000b5df10621.html

Published: May 25 2007

There was a survey conducted in America in the 1980s that asked people a deceptively simple question: "Who was shot in Dallas?" For many who had lived through the national trauma of 1963, the deliberations of the Warren Commission, the theories about the grassy knoll and the magic bullet, there was only one answer: JFK. For others, who followed every twist of the Ewing family, the oil barons' ball and Cliff Barnes's drink problem, there was also only one answer: JR.

The point of the survey was to show how the same words can have very different meanings to different people depending on their background and their interests. It is the same idea that is driving Google's personal search service.

Our search algorithm is pretty sophisticated and most people end up with what they want. But there is inevitably an element of guesswork involved. When someone searches for "Paris" are they looking for a guide to the French capital or for celebrity gossip? When someone types in "golf" are they looking to play a round on the nearest course or to buy a Volkswagen car? An algorithm cannot provide all the answers.

But if an algorithm is built to take into account an individual's preferences it has much more chance of guessing what that person is looking for. Personalised search uses previous queries to give more weight in its rankings to the results each user is likely to find relevant. If you have searched for information about handicaps or clubs before, a search for "golf" is more likely to return results about the game than the car. If you have been checking out the Louvre, you are less likely to have to wade through all the details of a particular heiress's personal life.
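
To illustrate the general idea (and only the idea: this sketch is mine, with invented scores and a crude keyword model, not Google's actual ranking system), here is how a history-based boost could settle the "golf" ambiguity:

```python
# Illustrative sketch only: re-ranking results using a user's query history.
# The scores, topics, and matching logic here are invented for the example.
PAST_QUERIES = ["golf handicap", "best golf clubs", "st andrews tee times"]

RESULTS = [
    {"title": "Volkswagen Golf review", "topics": {"cars"}, "base_score": 0.9},
    {"title": "Beginner's guide to golf", "topics": {"sport", "golf"}, "base_score": 0.8},
]

def interests_from_history(queries):
    # Crude stand-in for real interest modelling: collect past query words.
    return {word for query in queries for word in query.split()}

def personalised_score(result, interests, boost=0.5):
    # Raise the base relevance score for each topic the user has shown interest in.
    overlap = len(result["topics"] & interests)
    return result["base_score"] + boost * overlap

interests = interests_from_history(PAST_QUERIES)
ranked = sorted(RESULTS, key=lambda r: personalised_score(r, interests), reverse=True)
for result in ranked:
    print(f'{personalised_score(result, interests):.2f}  {result["title"]}')
```

With a history full of golf-the-game queries, the guide to the sport outranks the car review; with no history, the base scores alone would decide.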

This makes search more relevant, more useful and much quicker. But it is not for everybody. As the Financial Times has pointed out this week, personalised search does raise privacy issues. In order for it to work, search engines must have access to your web search history. And there are some people who may not want to share that information because they believe it is too personal. For them, the improved results that personalised search brings are not matched by the "cost" of revealing their web history.

The question is how do we deal with this challenge? Stop all progress on personalised search or give people a choice? We believe that the responsible way to handle this privacy issue is to ask users if they want to opt in to the service. That is why Google requires people to open an account and turn on their personalised search functionality. They do not have to give a real name to open a Google account, but even if they cannot be identified, we think they should have to give explicit consent before their web history is used. Unless they do, they will simply have the standard Google search service.

Our policy puts the user in charge. It is not something Google seeks to control. At any time they can turn off personal search, pause it, remove specific web history items or remove the whole lot. If they want, they can take the whole lot to another search engine. In other words, personalised search is only available with the consent of the user.

If you think of search as a 300-chapter book, we are probably still only on chapter three. There are enormous advances to be made. In the future users will have a much greater choice of service with better, more targeted results. For example, a search engine should be able to recommend books or news articles that are particularly relevant - or jobs that an individual user would be especially well suited to.

Developing more personalised search results is crucial given how much new data is coming online every day. The University of California, Berkeley estimates that humankind created five exabytes of information in 2002 - double the amount generated in 1999. An exabyte is a one followed by 18 noughts. In a world of unlimited information and limited time, more targeted and personal results can really add to people's quality of life.

If you type "Who was shot in Dallas?" into Google today, the results are as divided as the survey's respondents a quarter of a century ago. But with personalised search you are more likely to get the "right" result for you. Giving users the chance to choose a search that is better for them as individuals is something we are proud of and will continue to build on. After all, the web is all about giving people - you and me - more choice and more information.