Spam Justice - Scraper Sites Generate $$$'s

Scraper sites are sites composed of content taken from other websites. Usually scraper sites contain short excerpts of content, since republishing excerpts is easier to defend from a copyright perspective than republishing entire works. By scraping a site that has semantically well-crafted content in a given topical niche, you are basically mimicking a successful competitor's keyword presence. Lately they have become my silent partners, generating sales for me at no cost!

Mimicry is the highest form of flattery, so they say. It is also "content theft", according to others. Scraper sites are often credited to "scraper scum", and scraper sites are often called SPAM. In my opinion scraper sites are only SPAM when the material scraped is search results. Certain search engines love to eat their own dog food -- that is, they give preference in the SERPs to scraper pages comprised of.. those very same SERPs. Go figure.

Now other people's scraper sites are making me money. I love it.

How can that be? Evolution may take time, but it's a wonderfully efficient process of rewarding the fittest and weeding out the weak. On a few projects, my content ranks at the top for nearly every worthwhile search phrase in my niche in the major search engines. I got there through a deliberate, carefully executed campaign based on almost a year of research and trials. Unless you copied each of my pages and put together a copy of my site, including its back links, you could not replicate my results. I have been examined repeatedly by webmasters (lots of referral traffic from sophisticated linkdomain constructs and the like) and scraped relentlessly. I was 302'd to death a few months ago, from all around the world. By the way, my site is ugly and appears haphazard. I can understand why it is so heavily researched by others... they probably can't see the obvious because the details they view as important are so seemingly random. Affiliate links out in plain sight, lots of (too many?) banners, HTML that won't validate, etc.

After a while at the top of the SERPs, my affiliate links started to appear in the SERPs. That's an affiliate link, like http://www.vendorsite.tld?id=my_affiliate_id. I can surmise that with the success of my pages at the top of the SERPs, and because those pages included multiple back links to affiliate landing pages which included my affiliate IDs as shown, those vendor landing pages started to rank for relevance. Not at the top, mind you, but certainly high enough to start generating affiliate sales for me, independent of my own websites.

Think about it. That's lead generation for FREE. No bandwidth, no use of my web pages, no liability (?), just pure profit. Sounds good so far.

Those experienced in the audience have already recognized the obvious... as scraper scum build scraper sites from the SERPs, my web pages are naturally included (as evidenced by the 302 hijacking of a few months ago) but SO ARE THE VENDOR PAGES WITH MY AFFILIATE IDs. Count those as backlinks to me, and backlinks to that vendor site, but with my affiliate ID. More scraper sites means more backlinks, and more references to me as being "important". Plus, more references to my vendor as being "important".

Now the theoretical questions.

Is a URL with a query string treated by the search engines as a unique page, or as a variant of the "canonical" page (without the query string)? This is argued frequently. As the SEs "improve" the engines, and start crawling those "dynamic pages" using some compromise along these lines, I win. My pages rank, and my vendor ranks FOR MY AFFILIATE LANDING PAGE. The rich get richer, no?
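To make the two readings of that question concrete, here is a minimal Python sketch. It assumes the affiliate ID travels in a query parameter named `id`, as in the example link above; the function names and the `/landing` path are hypothetical, and nothing here claims to describe how any real search engine behaves. Under the "unique page" reading the raw URLs are compared as-is; under the "canonical" reading the tracking parameter is stripped first.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumption: the vendor tracks affiliates with a query parameter called "id".
AFFILIATE_PARAMS = {"id"}


def canonical_url(url, tracking_params=AFFILIATE_PARAMS):
    """Return the URL with affiliate tracking parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in tracking_params
    ]
    # Rebuild the URL without the tracking parameters and without a fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def same_page(url_a, url_b):
    """True if both URLs collapse to the same canonical URL."""
    return canonical_url(url_a) == canonical_url(url_b)


# Two affiliates pointing at the same (hypothetical) landing page.
a = "http://www.vendorsite.tld/landing?id=affiliate_one"
b = "http://www.vendorsite.tld/landing?id=affiliate_two"
print(a == b)           # "unique page" reading: the raw URLs differ
print(same_page(a, b))  # "canonical" reading: they are the same page
```

Which reading an engine applies decides whether links to those two URLs count toward one page or toward two.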

What about the *other* gazillion affiliates whose backlinks to the same vendor site are supposedly signaling the hierarchical importance of that vendor site for the topic of interest? Does it work on the canonical domain, on the page, or is the URL-with-query-string treated as a page? Is some "page rank" passed upwards to the vendor's domain (root), and if so, is it then distributed back downwards to other pages on that domain, including my affiliate landing page? More affiliates means more relevance for my pages again. The rich get richer again. I'm liking this for sure.

When we hear about the search engines addressing the affiliate situation, it is usually limited to those who link directly to the vendor site via PPC ads and such (not those who build pages which may link to the vendor site). I understand the desire of SEs to avoid representing the same vendor multiple times in the SERPs; that is a basic SEO strategy to be moderated. But if landing pages with affiliate IDs are now ranked independently from the sites that house them, they compete with the vendor's own pages. Why am I winning that competition? I am not generating the majority of traffic to that vendor (I may be one of their super affiliates, but my traffic is not larger than the combined traffic from all other sources -- not even close).

I can hope that the search engines have bigger problems than this, so they won't try to track affiliate IDs, or revert to not indexing "dynamic URLs". There's no need, because this current situation serves the user well. They're buying via the links provided, so they are being served.

So scrape away, scraper scum. Continue generating leads and sales for my vendor under my affiliate ID, and I will continue to not share the profits with you. It's been working well so far! I look forward to continuing our unilaterally beneficial relationship, and finding more like it.


Congratulations on your success... should now commence on making your own scraper sites of the scraper sites. :) Or at least more scraper sites of your main site. Then we will be inundated even further by way of your success.

Other than

from the search engine user's perspective or if you got screwed by a 302, why are there so many SEOs bitching about scraper sites? I mean, if you're being outranked across the board by scrapers WTF are you doing in this biz?

Nothing new under the sun

Was a time (way before Goo) when you could submit your own Amazon affiliate links to the then #1 AltaVista which would achieve fantastic rankings. There was even some specialized software available to scrape the most expensive products off Amazon's site - you'd generate a list, insert your own affiliate link via search and replace, and bingo!

There was - in fact still is: - software available to put a search box on your own web site which will directly scrape Amazon content by search term and enhance the results with your affiliate code.

So the basic concept has been around for quite some time.

not so fast

Thanks for highlighting the prior art and saving me the patent application fees. Another example of the SEs generating their own problems in this case, where no submissions or scraping is required.

I can't let the Anaconda reference pass without a warning, though. Take a close look at Anaconda's Digital Windmill demo and see if you can spot the affiliate insertions. Can ya see em? Huh? Sure you can! I don't see any disclaimers admitting that the listings you serve may include unlabeled sponsored listings, so maybe I am just seeing things.


I have ranted a few times about scraper stuff, not because I care that they exist, but only because Google is completely hypocritical to tell people to make quality content and then throw AdSense ads on any old crap.

Anaconda's out

They haven't updated their site for ages and appear to be giving everything for free (if they care) except for their Foundation Amazon program. Just pointed it out to underline that this sort of thing's been going on since the late 90s at least.


> Is a URL with query string treated by the search engines as a unique page

A unique URL is usually seen as a unique page, and that is one of the problems with crawling dynamic websites - because often it isn't. So, what happens when many unique URLs (i.e. using affiliate IDs) actually show the same page? At first nothing. I've seen over 200,000 such identical pages indexed from one site (front page x 200,000 times!). After a while though, engines often find out (manually or automatically), and the result, each time I've seen this, has been that it all gets dumped.

However, from the publisher's side of it, it is very easy to avoid - or to put a stop to affiliates that do this. Simply block all indexing of URLs with the affiliate ID :)
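The comment doesn't say which blocking mechanism is meant, so the following is only one possible sketch, in Python. It assumes the affiliate ID is carried in a query parameter named `id` and that the vendor's server can attach response headers; the function name is hypothetical. It uses the `X-Robots-Tag: noindex` response header, which only helps with engines that honor it.

```python
from urllib.parse import urlsplit, parse_qs

# Assumption: affiliate IDs arrive in a query parameter called "id".
AFFILIATE_PARAM = "id"


def indexing_headers(url, affiliate_param=AFFILIATE_PARAM):
    """Return extra response headers for a requested URL.

    URLs that carry an affiliate ID are marked noindex, so only the plain
    vendor URL remains eligible for the index.
    """
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    if affiliate_param in query:
        return {"X-Robots-Tag": "noindex"}
    return {}


print(indexing_headers("http://www.vendorsite.tld/?id=my_affiliate_id"))
print(indexing_headers("http://www.vendorsite.tld/"))
```

The affiliate link still works for visitors; it just stops competing with the vendor's own page in the index.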

Google needs to overhaul their AdSense policies and enforcement

I recently became aware that some ScraperScumbag scraped my blog homepage, modified it slightly (but noticeably -- the repeated inclusion of "big tits" was clearly out of context!), and plastered it with AdSense ads.

A four year old could have seen that this site was CLEARLY in violation of the spirit and letter of AdSense policies, and so I figured it'd be a quick fix to just contact AdSense and have 'em pull that jerk's account. Hit 'em in the pocketbook, I figured :)

But the AdSense staffer wrote back saying that if I had a copyright concern, I needed to file a hardcopy DMCA take-down request and then wait for it to be evaluated, blah blah blah.

I wrote back saying, I'm sorry, I think you misunderstood... this isn't mainly a copyright issue; rather, I'm pointing out an AdSense site that is grossly and very obviously flouting your rules... please deal with it in that context.

I got a rather terse and completely unhelpful and unfriendly note saying, basically, look, file the damn DMCA report and shut your trap in the meantime.


I ended up contacting the Webmaster of the offending site (admittedly disingenuously) threatening to get their AdSense account pulled, and within 2 days, the scumscraper replaced my full page content with excerpts from my site and others'. Still sleazy, but I no longer feel like dealing with it.

With that said, though, it's damn tempting to write the advertisers appearing on that jerk's page and say, "Hey, you like where your ad is being shown? Just curious."

And Google wonders why I advise most of my clients against content match. Ha! Hey clients, want some big tits with your luxury travel and international telecommunications?

Adsense has no conscience

AdSense's lax attitude towards obvious content theft seriously undermines Google's "don't be evil" philosophy.

Back to scrapers

it's probably useful to bear in mind that the first and foremost scrapers/scraping sites are the search engines themselves. Which goes to explain a lot of what's happening ...

A substantive difference

C'mon, there is a SUBSTANTIVE difference between a search engine borrowing snippets of someone's content for the purpose of helping people find the original page, and what the "big tits" copier was doing!!!!!!


I'm not sure I agree. If it's just ONE Google, and it's a 200million/day search engine, it is easy to say so. But what if it is 17000 small Googles? Still "k" compared to a scraper?

Let's see if you can pass this Quiz:

Q: A website gets visitors from somewhere looking for something, and presents them with a set of scraped snippets mostly related to the topic of interest, with added advertisements. The snippets were collected by a program that visits websites and caches (saves) their content. If the user clicks a snippet, the website passes the user along to the snippet's site. If the user clicks an ad, the website owner collects a royalty and passes the user to a sponsor's site.

The website so described is (choose all that apply):

A. Google
B. A scraper scum website

You can fancy it up all you like, and see if it gets any easier to answer.

You forgot ....

You forgot the part about distorting the content by adding unrelated and deceptive verbiage.

That is one HUGE difference between a legit search engine and scraper scum.

Plus: How about some SUBSTANTIVE caching

as an ongoing, massively money spawning scraper operation (without the original publishers' express permission, let's not forget, not to mention total lack of remuneration ...) - plus taking said third party content and displaying it in a substantially altered manner (again: without publishers' permission) for fun and profit?

No scraper? Not? Guess who gave that "big tits copier" some funny ideas in the first place, demonstrating to him and all the world that you can actually get away with it for years?

Heh heh ...

"not to mention total lack of remuneration"

Gosh, all the people around here who work so hard to make Google etc. like their sites must be the world's greatest altruists.

Some of the lines can indeed be fuzzy, but I will not accept that it's a valid parallel to compare a collection of excerpts that are presented in response to a search query with content that has been copied and distorted by having irrelevant verbiage sprinkled throughout the text (or even relevant verbiage, for that matter).

I do agree that caching is problematic, but neither will I accept caching as anywhere near parallel to what the "big tits" copier is doing.

Intent isn't everything, you know

Of course that titty copier is, in all probability, acting in full knowledge of what he's doing, that he's stealing content, etc. etc.

But legally, that doesn't necessarily make a difference: caching actually buggers up sites, it breaks relative links, and hell, nobody ever asked you whether you'd be content with having a Goo header splat above your carefully crafted web design.

Also, I'm not saying that culling snippets from sites and displaying them is technically illegal - but scraping it is nevertheless.

As for "altruism" - people will do all sorts of things to make ends meet, adjusting to situations widely or even totally beyond their control (earthquakes, floods, droughts, etc.). That doesn't invalidate the point, however, that search engines are essentially parasites, living off other people's content without giving anything in return.

And no, they aren't: if you want to make a profit from their doing so, it'll be yet more effort, time, resources, expense for you, unless you're prepared to leave it up to chance.

And if you don't opt for the latter, they'll even bash you and persecute your sites for trying to circumvent the bloody rules they set up, not you - rules, mind you, governing what they want to do with your content, not theirs.

If that isn't blatant exploitation pure and unadulterated, I don't know what is ...


"essentially parasites, living off other people's content without giving anything in return"

What portion of your traffic comes from those "parasites"?

I think symbiosis would be a much better metaphor to describe our relationship with legitimate search engines.


n : the relation between two different species of organisms that are interdependent; each gains benefits from the other

What Buckworks said, 100%

Fantomaster, if you're quite so displeased with the Evil Google, why not simply add a line to your robots.txt file to tell the googlebot to sod off?
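For anyone who hasn't done it, a rough sketch of what that would look like, checked with Python's standard `urllib.robotparser`. The two robots.txt lines are the usual way to turn one crawler away from a whole site; the `example.com` URL, the second crawler name and the helper function are placeholders for illustration only.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that turns Googlebot away from the whole site and says
# nothing about any other crawler.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /
"""


def is_allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Check whether the given crawler may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed("Googlebot", "http://example.com/any-page"))     # False
print(is_allowed("SomeOtherBot", "http://example.com/any-page"))  # True
```

Crawlers that ignore robots.txt are unaffected, which is of course the scraper problem in a nutshell.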

Or is it, maybe just possibly, that you're getting rather a substantial benefit from Google traffic?

I'm not in SEO to please Google

or to curry any other search engine's favor for that matter. For me, it's about leveling the playing field. About taking measures to prevail against what is essentially a hand of stacked cards.

In any case, I know it may sound paradoxical but while we do get a fair bit of traffic from the search engines, it's hardly essential to our business. (Search engines themselves obviously are, but the traffic they generate isn't very much.) Of course, that relates more than anything else to our specific target group who tend to find and approach us not so much via the engines as via other conduits.

And while Goo may generate the lion's share of search engine based visitor streams, as any experienced web marketer knows they're also the prime source of junk traffic for the very reason that they are attracting so many surfers who can't be bothered to buy anything, at least not in the industry we're involved in. (Obviously, this cannot be generalized as everything depends on what you're promoting.) So the plain answer is no, we're getting no "substantial benefits from Google traffic".

But even if we were, that wouldn't change the scenario outlined above one bit: it would, at the very best, be a form of compensation, which is not the same thing as the "partnership" so many people in the SEO industry seem to crave.

As for that time-worn "symbiosis" metaphor, buckworks, while I would cordially wish to agree, I have yet to meet the search engine rep who would endorse your view in public. In fact, if anything they still tend to treat SEOs as parasites, whereas it's really quite the other way round IMV because the anciennity is quite clear: no content, no need for search, period.

As long as the very best you can expect from them is a modicum of grudging tolerance, "symbiosis" in any positive sense is nothing but a pipe dream. Nor do I expect this to ever happen: there's a fundamental conflict of interests involved here, and no moralistic "white hat vs. black hat" debate will ever change that one bit. So it will, in all probability, always be a case of two parties fighting it out.

Okay, Fantomaster, I see your point :)

I see myself as a Webmaster first and an SEO guy second (largely because I've specialized in PPC stuff and am not anywhere near as knowledgeable about 'natural SEO' stuff as you guys! :D)

And I think that changes my perspective. As a Webmaster, I DO feel as though I have a symbiotic relationship with Google (and, to be fair, with other search engines as well). I can definitely perceive, however, an occasionally more-antagonistic relationship between the search engines and those who optimize for them. In a way, it's a bit similar to the way journalists see PR folks... but probably less positive :D.


it's a bit similar to the way journalists see PR folks... but probably less positive

Now that metaphor I can subscribe to! :-)
