Poisoning the Data Stream via Firefox Plugin

20 comments

Ever since the AOL data spill and the subsequent "investigation" showing how easy it was to tie searches to real people, I've been a lot more careful about my searching habits. I delete almost all cookies when I close my browser and use proxy servers much more regularly. We've even looked at a few ways of hiding your tracks, but like many of you I found the solutions less than satisfying. I recently came across an alpha release of a Firefox plugin called TrackMeNot, which issues fake queries to Google, AOL, Yahoo, and MSN in the background while you browse.

I've been running the plugin for almost a week without any browser issues, so it seems pretty stable. My only complaint is that, looking at the queries, it seems to combine words in ways that often produce gibberish-like phrases. You can create your own list of searches, and the developers plan to update the keyword lists dynamically.

I can't imagine the search engines being too pleased about wide-scale adoption of a program like this, as it would significantly pollute the data stream. If it reached critical mass and used "normal" search terms, it could throw a serious monkey wrench into the works. I know Google's TOS forbids automated queries, and I imagine the other engines have similar policies.

Now that Google has set a precedent for turning data over to the government, I feel a lot less comfortable with the full cavity search that data mining will become. Throwing garbage queries into my data stream gives me plausible deniability should the men in black show up on my doorstep.

So where's the middle ground, if one even exists? Can searchers be happy giving away a little and feel safe that their queries won't end up part of some searchable public display? Can search engines get the data they want without users creating thousands of fake rabbit holes for them to look down?

So where do you stand? Do you think we can find a happy middle ground? What ideas and suggestions do you have? Let's try to keep politics out of it if possible.

Comments

bad idea

Now when your random query is "underage child porn" or "how to make a bomb for airplanes" and your house gets raided, good luck telling them it must have been that "random query generator".

Random

By the looks of it, you load a list of words and it then randomly makes phrases for you. Just make sure you don't have any dodgy words included in your list and you should be fine.
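The load-a-word-list-and-combine behavior described above can be sketched in a few lines. This is a hypothetical illustration, not TrackMeNot's actual code; the word list, blocklist, and function name are invented for the example.

```python
import random

# Invented seed list; a real plugin ships a much larger one.
SEED_WORDS = ["weather", "recipes", "history", "guitar", "maps",
              "football", "gardening", "camera", "travel", "news"]
BLOCKLIST = {"bomb"}  # words you never want issued on your behalf

def fake_query(rng, min_words=1, max_words=3):
    """Build one decoy query by sampling random words from the seed list,
    dropping anything on the blocklist."""
    n = rng.randint(min_words, max_words)
    words = [w for w in rng.sample(SEED_WORDS, n) if w not in BLOCKLIST]
    return " ".join(words)

print(fake_query(random.Random(42)))
```

Because the words are sampled independently, the output is exactly the kind of gibberish-like phrase the post complains about.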

Seems kinda dumb to me. Just

Seems kinda dumb to me. Just use proxies and clear your cache often if you're concerned about that.

Flock and FF both have a

Flock and FF both have a privacy setting that will clear cookies, cache, etc upon closing. Set that and it works. The IP is the problem, though. It won't take much for Google to say to itself "gee... if the same IP gets a new cookie all the time, check if it appears to be a single user over time. If the queries suggest one user, tie them together". Statistically, it is trivial to identify the big shared IPs (like AOL)... when was the last time you stayed up 24x7 running meaningful queries through Google? I bet it's almost as easy to see the "edges" of a leased IP as it changes owners.
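The "same IP gets a new cookie all the time" heuristic the comment describes is trivial to implement. A minimal sketch, with invented log rows and field names:

```python
from collections import defaultdict

# Toy query-log rows: (ip, cookie_id, query). The cookie ID changes when
# the user clears cookies, but the IP stays put, so grouping by IP
# re-links the "separate" sessions exactly as the comment warns.
log = [
    ("10.0.0.5", "cookie-a", "cheap flights"),
    ("10.0.0.5", "cookie-b", "flights to paris"),   # new cookie, same IP
    ("10.0.0.9", "cookie-c", "python tutorial"),
]

def link_by_ip(rows):
    """Group queries by source IP, ignoring cookie churn."""
    sessions = defaultdict(list)
    for ip, _cookie, query in rows:
        sessions[ip].append(query)
    return dict(sessions)

print(link_by_ip(log))
```

Clearing cookies defeats only the cookie column; the join key that matters survives in the first column.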

What I'm saying is that the IP-as-unique-identifier is the privacy key. It is the one ID that has been removed from consumer control and handed to the "infrastructure" guys. They can and will be corrupted.

Proxies are less and less effective nowadays, even with the minimal amount of effort put into blocking them. That suggests they could become instantly useless if a concerted effort to block them started. Not a strong bet.

"Random" queries are not random enough to be undetectable. A computer is very very good at running stats on large data sets... so any approach that tries to make a "too large" or "too random" data set is short-sighted. While it may be silly (today) to think Gogle will look that closely at your own data set to strip out those pseudo-random queries and see the real you, it is not silly to expect that a "closer look" would involve such a statistical inspection. It is trivial. Anyone of you can do it on the AOL data set with freely available stats software today (and many of you are, from what I understand). That "closer look" at your data will happen any time someone wnats to look at you (or your queries) and so, Shoemoney is correct to say "that's dumb" about adding someone else's queries to your own in this way. If there's one thing I love about Shoemoney, it's how his feet are planted firmly in the practical soil of the real world.

If you want to get esoteric about it (yet still possible, and some say promising), you can look at cyclostationarity. Invented many years ago (by William Gardner?), it is theoretically a way to identify determinism in a random signal. In other words, if a computer generated a signal, there should be some characteristic of that signal (perhaps unrelated to signal content) that is "constant" and not random. It might be related to the timing, the pauses, the energy levels, the way it self-tunes, or of course the content (headers, date/timestamps, etc.). SOMETHING will be non-random, and revealing of a programmed process. This was successfully used by the military to determine the brand and model of radios used to transmit enemy messages... you don't know what they said, but you know it was a GE Model XXX transmitting. When Gardner did it, almost nobody understood it, but these days it's an active research area. Of course, you signals and submarine guys know more about this stuff than I do, and the countermeasures as well, I bet ;-)
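A crude, non-cyclostationary cousin of this idea is easy to demonstrate: a timer-driven plugin fires at near-constant intervals, while a human pauses unevenly. The sketch below flags a query stream whose inter-arrival times are suspiciously regular; the threshold and timestamps are arbitrary, for illustration only.

```python
import statistics

def looks_machine_generated(timestamps, cv_threshold=0.1):
    """Flag a query stream whose gaps between queries are too regular.
    Uses the coefficient of variation (stdev / mean) of the gaps:
    near zero means clockwork timing, i.e. probably a program."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if len(gaps) < 2:
        return False
    mean = statistics.mean(gaps)
    if mean == 0:
        return True  # zero-length gaps: certainly not a human typing
    return statistics.stdev(gaps) / mean < cv_threshold

bot = [0, 60, 120, 180, 240]    # one query a minute, like clockwork
human = [0, 12, 95, 110, 400]   # bursty, uneven pauses
print(looks_machine_generated(bot), looks_machine_generated(human))
```

Real detection would look at far more than timing, but even this toy statistic separates the two streams.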

My point is that computerized approaches to obfuscating data are short-sighted. Peer-to-peer, baby. Mix up the user queries, delayed in time and mixed, and make that "contextual mixing" at a later date when the countermeasures show up ;-) Community surfing... that sort of thing. That's what they are hiding, and that's what we need to use. We SEOs can mine it for keywords, too!

By the way, I love the title of this thread, and expected a discussion of how bad this is for the "industry". It is really, really bad for search, and the investments in search. I have a hard time believing G or MSN aren't stepping in for this chance to "save the integrity of the web" with a high-profile promise to protect the consumer. Such a move by Yahoo! would hurt Yahoo's stock, IMHO, but MSN can't be hurt much more, and G never gets hurt by such outlandish claims (Google stock would almost certainly rise on the news of Google protecting the commercialization of search!)

Disclaimer: I am not a terrorist. Some of this post probably violates the Patriot Act, because it suggests that consumers take actions to protect their interests against monopolies and unaccountable, privacy-abusing commercial corporations. I hereby declare that I am not a pedophile nor a terrorist, and have never been a card-carrying member of the Communist Party.

Wouldn't this hurt seo?

I'd think something like this would make it more difficult to know how to optimize your web sites for search engines.

How will you know what people are *really* searching for? You'd find wordpress data polluted with phrases that you believe people are hunting for, but are not.

If the men in black turn up

If the men in black turn up on your doorstep, the police state proper will be here and plausible deniability will just make them think you are a smartass. They'll probably beat you twice as hard.

Not much point in trying to get away. They can track you through your credit card data, travel tickets, bank teller withdrawals, phone calls and once e-cash arrives the cycle will be complete and the stage is set for total dictatorship.

One better

Seems like this system would fuzz your searches but not decrease the likelihood of a personally identifiable search. A search for 'Aaron Wall's long lost rich uncle' would still only be attributable to Aaron. Maybe something better would be this same system using a p2p network to share search queries. That way you end up with some random mix of other people's personally identifiable search terms. Then the data might show the 'Aaron Wall's long lost rich uncle' query coming from a dozen sources. One could probably still mine such data, but it could muddy the results.

Even better, I guess, would be the search query swapping in conjunction with some sort of smart swapping, where you get search queries from people who search for 'similar' things to you. That'd screw things up good.
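The p2p swap proposed above can be sketched in miniature. This is a hypothetical toy, not an existing system: each user keeps their own queries but also replays a random sample of everyone else's, so a distinctive query appears to come from multiple sources.

```python
import random

def mix_queries(pool, rng, extra=2):
    """pool: {user: [queries]}. Return a copy where each user also
    replays `extra` queries sampled from the other users' pools, so
    distinctive queries show up from several sources."""
    mixed = {user: list(queries) for user, queries in pool.items()}
    for user in pool:
        others = [q for u, qs in pool.items() if u != user for q in qs]
        mixed[user] += rng.sample(others, min(extra, len(others)))
    return mixed

pool = {"alice": ["long lost rich uncle"],
        "bob": ["guitar tabs"],
        "carol": ["seo tools"]}
print(mix_queries(pool, random.Random(1)))
```

A real design would also delay the replays in time, as the earlier comment suggests, so the swap can't be undone by simple timestamp alignment.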

Not really John

Quote:
Proxies are less and less effective nowadays, with the minimal amount of effort put into blocking them. That suggests they can be instantly useless if a concerted effort to block them started. Not a strong bet.

I disagree with that; that's a pretty sweeping statement. If it's so easy to block, then why hasn't it been done? With all the whining Googlers make about people scraping their results, don't you think they have tried?

It's impossible to stop them; proxy lists are always changing and easy to come by. Furthermore, with very little effort you can come up with your own.

if you are worried about the feds..

This plugin is useless. It isn't about the random shit, it is about what can be extracted. Diluting the pool of queries isn't going to make any difference. For those of you who do not understand RegEx or grep, here is an illustration.

scenario #1
blah blah blah tax evasion blah blah blah

scenario #2
blah blah blah blah blah blah blah blah blah
blah blah blah blah blah blah blah blah blah
tax evasion blah blah blah blah blah blah
blah blah blah blah blah blah blah blah blah
blah blah blah blah blah blah blah blah blah
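The point of the two scenarios can be made concrete: a pattern match pulls the sensitive phrase out of the log no matter how much filler surrounds it, so dilution alone hides nothing.

```python
import re

scenario_1 = "blah blah blah tax evasion blah blah blah"
scenario_2 = ("blah " * 30) + "tax evasion " + ("blah " * 30)

pattern = re.compile(r"tax evasion")
for log in (scenario_1, scenario_2):
    print(pattern.findall(log))   # the phrase is found either way
```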

In short, don't waste your bandwidth.

You will be much better off flushing your cookies and switching IPs. Of course, you can't undo the last 5 years, and only the very paranoid (like me) were covering their asses back then.

On the other hand

If there were a grassroots movement where people purposely polluted the query pool with controversial phrases such as 'how to make a bomb' and 'kiddy porn' and 'how to overthrow the government', it would make the query data useless.

blocking proxies

Quote:
If it's so easy to block, then why hasn't it been done? With all the whining Googlers make about people scraping their results, don't you think they have tried? It's impossible to stop them; proxy lists are always changing and easy to come by. Furthermore, with very little effort you can come up with your own.

Web professor:

Q: what % of webmasters use a proxy blocker?
Q: what % of webmasters use rewrite deny rules? of those, how many ever unban a banned IP?
Q: what % of email users use a spam filter other than the one their ISP auto installs?

Very, very low. I agree about zombies as proxies, because whatever centralized system eventually blacklists open proxies will at first make the SpamHaus and RBL people look like liberals. But such actions sell software & services, so it will work out eventually. Look what happened to open mail servers. It took a few years, but the blame landed where it belonged: on the abused sysadmin. So zombies, too, will be shut down.

I doubt very much the "whining Googlers" will stop the open proxy abusers. But I have little doubt their web hosts will, and as they do, the flow gets concentrated and the economics are in place to encourage those serving the bandwidth to shut it down (appropriate or not). Hell even Comcast eventually saw the value of throttling compromised machines (and other high-bandwidth users along the way).

Littleman: I doubt very much

Littleman: I doubt very much any grassroots movement would even know what terms to use for searching such things. Of course, I am sure there would be people providing that info, but those people would almost certainly have ulterior motives. Even John Q. Public has a gut feeling about how risky that kind of thing would be.

>ulterior motives

Protecting privacy would be enough motivation for many.

>Proxies

If your goal is just to trip up search data, you could do it by tapping into vast PPPoE networks like AT&T's. You do not have to use proxies at all; just flush your cookies and reissue your lease.

Aaron,

how is your uncle, btw?

This could work

From the Track Me Not Website:

Future versions are likely to include larger (distributed) query databases, dynamically generated and/or web-harvested queries, as well as grammar-generated natural-language queries. Suggestions for other ways of improving TMN are always welcome!

What they could do is, every time user X does a search, upload the query to a central server to be reused as a random query by other users. The idea is that if you search for "xyz" and then hundreds of other people issue the same bogus search, it would be very hard to figure out which one was real.

Hopefully at some point the search engines would be forced to allow users to opt out of being tracked.

I may be missing the point...

But.. what is the point of confusing your searches? Google / ISPs will just rat you out on the sites you actually visited.

That data can be faked too.

Your browser can make requests to the links on the SERPs so it looks like you visited the page. It won't slow down your web surfing by much if you do one random query a minute and visit one or two of the SERP's links per query.

This would cause problems for website owners though....

Wouldn't this hurt seo?

Quote:
Wouldn't this hurt seo? I'd think something like this would make it more difficult to know how to optimize your web sites for search engines.

More power to those who know what they are doing. I am all for it (obfuscating keywords, that is; personally I don't really care about randomising crap and sending it to Goog, just cover your tracks if you must).

LOL, I have been using this

LOL, I have been using this plugin and never thought of it in this light. I support any tool that helps allow good folks to go as stealth as possible because internet travel is our personal business.

Just don't forget to leave the backdoor in new programs for terrorist tracking, that is important. ;)
