Google Credits News Aggregator

22 comments

Many people seem complacent about news and thread aggregators that hijack people's SERPs, and most were ambivalent about it in the recent Repackaging Your Sites Content discussion, with comments like "What's the harm?" The harm is that other websites gain popularity off of your work and get the first shot at collecting the ad revenue before your site ever gets exposure, if it gets any at all, to the actual surfer.

Unfortunately our good buddy Aaron Pratt of SeoBuzzBox is learning this the hard way, as Google's duplicate content filter has put his original content in the "supplemental results" as if he were a content scraper. You would think that Google, of all companies, could figure out that snippets linking to another web site aren't the originals, considering that's Google's original business model, and give them a lower priority, but in this case PageRank seems to be more important than being the author.

Maybe they should hire someone from Yahoo! who managed to get this right.

Comments

Major problem indeed

Great story Bill.

It surprises me how many _experts_ think that stealing content from the author is a great thing to do. Google needs to find some way to locate the originator and give them credit for the find or post.

Just think: if the news media or even colleges ran the same way Google does, the person with the original thought would never get the credit, because the bigger or more popular entity would receive it instead.

Essentially, this is nothing more than making a dollar off plagiarism. This needs to end. Google needs to figure out the solution, and do it quickly.

Very Little Buzz

You'll notice that this problem will get very little attention because the aggregators that have taken the position of power sure won't want to give it up.

I'm starting to consider new techniques like a delayed RSS feed, which would wait until Google/Yahoo/etc. crawls my new content before releasing it to the RSS feed, meaning I'd always have a head start at being indexed before the content becomes aggregated.

IMO this "problem" gets very

IMO this "problem" gets very little attention because users don't care. The typical web site visitor does not care whether the site he or she visited has scraped content, and if the typical user doesn't care, there isn't much incentive for the search engines to care either.

Scraped vs Aggregator

There are still millions of webmasters who care, and if Google is paying attention to the WMW AdSense and copyright forums, people there are always screaming about this.

The problem in this case wasn't exactly a scraper but an aggregator, yes, a thin line I'll admit, but Google thinks the aggregator is the originator of the content and is dumping SEOBUZZBOX into the supplemental results. FWIW, in the few cases I checked from his site, Yahoo and MSN are doing a better job than Google on this one, so someone somewhere must care, just not Google.

I'm thinking Crawl Delayed RSS feeds might solve the problem, we just need to test it, and if it does work then it's up to the webmaster to decide whether he wants to take control of his content or completely relinquish all control to others.

I agree with kidmercury,

I agree with kidmercury: until your average user sees this as a problem, Google doesn't need to do anything about it. Users may even prefer the aggregators. If anything, a *good* aggregator that offers users a choice is better than an individual source. Google News?

Scrapers probably go unnoticed by your average user, or users just really hate clicking back and would rather click on an ad. There's an easy way for the public to kill off most scrapers without any search engine algo changes or changes to AdSense: simply get them to stop clicking ads on sites they don't like.

Spamming SERPS

If anyone can explain why the news aggregators need to let the search engines index all of their content, which is ridiculous, then maybe we can make some headway on this issue.

The problem as I see it is this: if I'm looking for a news reader or news aggregator, I'm not looking for individual articles, does that make sense? Once you're using the news reader you can search within it, so there is no value FOR US in letting Google index their site; it's redundant. For instance, if I'm looking for a "technical news source" then it would make sense for Technorati to show up in the SERPs, but I wouldn't expect them to show up for everything they have ever indexed.

Remember, we aren't discussing scrapers who steal just to get into the SERPs; we're talking about aggregators that also spam the SERPs even though they have no real legitimate reason to let that happen, since we already know they're a news reader, like My Yahoo, and we go to them for the articles.

The fact that they're undermining the source of the articles as an authority is a serious issue, and since we're all webmasters this is an issue for our long-term viability, unless we all want to end up working for the news aggregators.

delayed rss

I'm already running a delayed RSS feed on a site using the Feedburner service.

Feedburner respects an RSS 2.0 tag called "TTL" which states the number of minutes between each fetch of the feed. Just set it to a greater interval than your publishing frequency :)
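
For reference, the <ttl> element sits inside the feed's <channel>; the feed below is a made-up example, and only the <ttl> line matters (720 minutes = 12 hours between fetches):

<channel>
  <title>My Blog</title>
  <link>http://www.example.com/</link>
  <description>Original articles</description>
  <ttl>720</ttl>
  <item>
    <title>Latest post</title>
    <link>http://www.example.com/latest-post</link>
  </item>
</channel>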

Delayed feeds...

Headlines and a short summary until the major SEs have fetched and indexed the content page, switching to full content afterwards, would suffice. Easy to automate too.

Not that kind of delay

Not the delay between fetches; we're talking about the delay until the CRAWL, and since you have no clue when Google will crawl, the TTL is just a shot in the dark.

My idea is the following, a CRAWL update delay:

a) Blogger posts new articles but RSS feed not updated
b) Google eventually crawls blog
c) Blog updates RSS feed with all the new articles just crawled

See what I'm trying to solve here?
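
Something like this on the feed side would do it. This is only a sketch; the posts and crawl_flags tables, the crawled column, and the $db handle (a PDO connection or similar) are all made-up names for illustration:

<?php
// Feed-side half of the crawl-delay idea: only put posts into the RSS
// feed once a search engine bot has been seen fetching them.
// Table and column names here are invented for illustration.
$sql = 'SELECT p.* FROM posts p
        JOIN crawl_flags f ON f.post_id = p.id
        WHERE f.crawled = 1
        ORDER BY p.post_date DESC
        LIMIT 10';
foreach ($db->query($sql) as $post) {
    // emit an <item> element for each already-crawled post
}
?>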

Being first to get indexed with your content might make the original source more authoritative and leave the other sources, which only get crawled after the RSS feed update, with duplicate content status.

The question is whether dupe content status is determined solely by PR or by when the content is first encountered, because if it's strictly PR then the lowly bloggers have already lost to the news aggregators.

Ah, I see

Great ideas, Bill and Sebastian! :)

So you should have a User-Agent sniffer on each page, and when G-bot, Y-bot, or M-bot has seen the page you set a flag in the database marking the article as ready for RSS publishing. Nice and simple.
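
Something like this, in rough PHP; the crawl_flags table, the $db handle, and the function name are all made up, and the bot names are just substring checks against the User-Agent header:

<?php
// Sketch: mark a post as "seen by a major crawler" so the feed script
// can release it later. The crawl_flags table and $db handle are assumptions.
function flag_if_crawled($post_id, $db) {
    $ua   = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    $bots = array('Googlebot', 'Slurp', 'msnbot'); // G-bot, Y-bot, M-bot
    foreach ($bots as $bot) {
        if (stripos($ua, $bot) !== false) {
            $stmt = $db->prepare('UPDATE crawl_flags SET crawled = 1 WHERE post_id = ?');
            $stmt->execute(array($post_id));
            break;
        }
    }
}
?>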

While it would not affect scrapers it would still be helpful. I suspect that scrapers are likely to have lower PR than aggregators anyway, but perhaps I'm wrong.

Fetch, crawl ... index

Bill,
My CMS knows exactly when each page has been crawled (fetched by the bot). There are probably spider-tracking plug-ins available for blogs; if not, that's a simple hack. I also know the average time-to-index for each engine, so I can skip the lookup in the SE indexes. With this info it's easy to automate the RSS release for the crawled (or crawled and indexed) pages, or a change from a summary to full content. It's even possible to do that for both Googlebots to ensure that the page is indexed in (or crawled for) both index types.
Exactly what you said :) I just added a tidbit.

Although, from my experience, in some cases PR is still the major "indicator of originality" (at least that's how it looks from the query engine's results), Google has changed a lot recently and often the first crawled page gets the source bonus, that is, it appears in the unfiltered SERPs instead of the dupes with (assumed) higher PR. But that's not the full picture, and there is too much speculation and everflux involved.

More important: sites with an overall high PR get crawled more frequently, so chances are good the high-PR aggregator's copy gets crawled before the low-PR source. It looks like exactly that happened to Aaron. So your idea is definitely worth the coding effort.

Scrapers

It might even slow scrapers, because RSS scrapers, from watching my site, use automated tools that scrape based on the latest RSS feed. They come to my site trying to snag only the latest articles, so this would at least slow them down a little as well.

In any other industry, people would go to jail for most of this activity!

Can you imagine someone "aggregating" or "scraping" food at the grocery store and yelling "fair use" when the manager tries to stop them from leaving without paying?

Thanks dude!

Bill, I really appreciate you making this an issue, because I believe it is a major one. I've noticed this stuff happening more and more recently; it appears to be a problem with the current algorithm/datacenters, and I wonder if they will adapt.

Sebastian, yes, that is correct: whoever Google grabs the article from wins, and we all know how harsh Google is on duplicate content. Ask anyone who owns an "article" database. An SEO friend and I did one for fun and got our asses handed to us (note: this was an article submission database, NOT aggregator robbery). After the Jagger updates it literally lost 100% of its traffic and was removed from the index. We checked the people submitting articles; they were submitting them all over the place. Remember the latest lamest thing, "article submission"? Booo

I just signed up for Feedburner and will check that, thanks. Is that where they grab the feed from, or do they just visit the regular WordPress RSS feed?

Hitslog was answering questions until I mentioned I would burn his ass; he offered to remove all my stuff from his site, but now I am interested in seeing how Google handles this. They SHOULD handle this, not me, right? Hitslog is not the problem; Google is, because they ignore this. Or do they? Does anyone at Google care?

And much thanks: if you guys can figure out something that normal humans can understand to fix this, I want to interview you on www.seobuzzbox.com. I'm really tired of the praise-of-spam thing, though I find daveN and others to be extremely amusing; they, like those who "aggregate", just push buttons. Google needs to react and adapt.

WordPress / Feedburner

For Wordpress, the first step is to edit the file called "wp-rss2.php" - it's found in the main wordpress folder, not in a subfolder. Start by saving a copy of the file as a backup, just in case.

Here, line 28 goes something like this:

<language><?php echo get_option('rss_language'); ?></language>

After that, as a new line 29, insert something like this:

<ttl>720</ttl>

In this example, the number "720" means "720 minutes", i.e. "only fetch the feed every 12 hours". Just take whatever number of hours you prefer and multiply it by 60.

When this is done, test that you can still see your feed. The above edit is the only one you need to do.

After that, sign up for feedburner and download + install + enable the feedburner wordpress plugin.

That should be it.

Quick and dirty fix

Aaron,

on blog/wp-admin/options-reading.php change "full text" to "summary".

Or hack the wp-rss2.php... files and replace the rss_use_excerpt condition with something like post_time + 5 days > today. This would give Google 5 days to index your stuff before the feed's description element is populated with the full post instead of the initial summary. Disclaimer: I don't use WP and only think this could work from a quick look at the code. I'm sure there is a more elegant way to achieve the goal.
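
A rough, untested sketch of what that condition could look like inside wp-rss2.php; it assumes WordPress's standard $post->post_date_gmt field and the template tags the file already uses:

<?php
// Sketch only: serve a summary until the post is 5 days old,
// then switch to the full text. Assumes WordPress's $post object
// and the template tags already used in wp-rss2.php.
$post_age = time() - strtotime($post->post_date_gmt . ' GMT');
$embargo  = 5 * 24 * 60 * 60; // 5 days in seconds

if ( $post_age < $embargo ) : ?>
	<description><![CDATA[<?php the_excerpt_rss() ?>]]></description>
<?php else : ?>
	<description><![CDATA[<?php the_excerpt_rss() ?>]]></description>
	<content:encoded><![CDATA[<?php the_content() ?>]]></content:encoded>
<?php endif; ?>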

Is this idea "something that normal humans can understand"?

Yes indeed

Yes indeed, I have done #1 but it doesn't seem to work, so now I will try #2. Thanks, S.

#1 works

I've just updated your feed and get summaries now.

:-)

:-)

But does hitslog? The guy told me he likes my content so much that he picks and chooses; this would still allow him time to cut and paste. :-(

Matt Cutts not even touched by hitslog

http://hitslog.com/articles/archive/Shark-Jumping-A-Historical-Perspective-1139792763.php

Look how quickly Matt Cutts takes credit for his "shark jumping" post. It's definitely a weenie thing, but fine with me; in my world there is no such thing as a shark jump and I will just carry on building my machine. What do they call that little fish that rides around underneath sharks and whales?

Cut n Paste?

I would suggest you update your terms of service and copyright notice so that only the contents of the RSS feed, as released, may be reposted, and then alert him to your new terms; if he cuts and pastes beyond your release dates, all bets are off.

:-(

If that hitslog guy visits your blog to copy and paste your posts, you can't stop him with RSS snippets. Check his source for your markup: he even hotlinks your smilies, and look at the tag URIs.

Interesting

I turned on partial feeds around the 8th, and there have been no posts since then; looks like he is feeding off Matt Cutts now. How funny is this? (Boy, I wish Matt would get a supplemental, but no dice there; he is not a weenie.)

Bigdaddy is owning the phrase "Blog Scraper Links" at http://www.tony-hill.net/bigdaddywatch/ and yes, I know it is not a high-competition phrase, but it shows no change in Google's current algorithm.
