The Junk Content Problem

Source Title:
WidgetBaiting Free Automated Content Generation Software
Story Text:

Aaron Wall is talking about auto-generated content, something that doesn't bother me particularly, but does annoy me a little when searching for information. Really, the last thing I want to find is a bunch of gobbledygook when I'm researching something...

I don't necessarily think that one needs to use many automated tools to do well. I do well enough with this site without using any of those tools, but it does not mean someone is a bad person if they use a tool which may not comply with the search engine's TOS.

A sentiment I agree wholeheartedly with. I've created a lot of crap in my time, but never generated pure junk. As I said, though, I don't really have an issue with those that do, other than getting a bit miffed when it interferes with my own research.

Such tools for generating junk content abound, one such is WidgetBaiting (email req'd) from TW member JasonD, but my question would be, why bother?

As time goes by, i find myself more and more leaning toward a mixture of quality, topical content and good old fashioned link hunting as the way forward, and less and less creating low grade crap for engines.

It's an interesting tool though, and those with an interest in such stuff should check it out...

Comments

 

Firstly thanks Nick for the commentary. It's much appreciated.

Quote:
As time goes by, i find myself more and more leaning toward a mixture of quality, topical content and good old fashioned link hunting as the way forward, and less and less creating low grade crap for engines.

IMO nothing can beat your recipe above. High quality, human written, topical content shows amazing results but the down side is that it takes time, money and scant resources to deliver it.

The aim for widgetbaiting was to assist in the creation of content for those seeking both quality and quantity.

If you want quantity, the quality definitely suffers, but if you want to "uniqueify" some existing content that may be seen as duplicate, then the Evolve tool assists massively while retaining the quality.

Just be careful

I think the search engines (especially Google) are going to be more and more aggressive when it comes to banning sites that engage in different types of spam. I don't know if random text is a bannable offense, but I'm guessing it will be very soon if it isn't already. The term search engine spam can be pretty wide, and will probably be broadened more and more.

 

First things first, they have to be able to detect it...

Hi SpamHuntress

Hey, it isn't just a random text thing :)

But we do come from different sides of the same coin so I understand your concern.

I don't believe you'll get banned for what you put on your own pages, but caveat emptor etc.

 

>> I don't know if random text is a bannable offense

anything is a bannable offence since they can list or not list whoever they like on their site.

I would love to see them ban all pages with crap content - I'd like to volunteer as the arbiter. While they're at it could they also ditch sites with cruddy design, poor products and anything which flashes purely because it can? :)

Hehe

Some funny comments here - especially Gurtie's.

If you combine random text - probably worse than Jason's tool, with cloaking. You better believe you'll be banned.

I've seen Google take a hard look at link farms. Any hint of hidden links, and it'll be banned. I guess it's a weight thing. The more spammy techniques get poured into the same site, the more likely it'll get banned or at least penalized. In fact, one domain that got mentioned for linkfarming where GoogleGuy was hovering, suddenly lost a bit of PR. Could be coincidence, but I kind of doubt it. Another almost virtual copy of the same site with similar domain name hasn't lost any PR, but there's no cache in Google.

eh?

and a lot of misinformation.

When in doubt, run your own tests. Don't just take others' word for it.

 

Quote:
lost a bit of PR

Define:Page Rank

Green bar measurement designed to con people who want text link ads on your site, into paying more than they are actually worth.

Not true

> If you combine random text - probably worse than Jason's tool, with cloaking. You better believe you'll be banned.

I think you need to modify this to: ..."you CAN get banned"
It is certainly not true that you WILL get banned. I can give you several examples of when you won't - for example:

- If the pages created don't rank for anything
- If the pages only rank for queries that very few people perform
- If the crappy cloaked pages actually lead users to relevant end pages

- and then the most important reason: If Google don't find it

Neither Google nor any other major engine does cloaking detection very well. And for a good reason: You'll end up with far too many false positives if you automate it.

So, please don't listen too much to the "scare campaigns" engines promote. Most of it is not true, but hey, if they can make webmasters think so, that's an easy way for them to clean up their index. Much easier than "spam-detection" :)
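To illustrate the false-positive point above, here is a minimal sketch of a naive automated cloaking check. It is an assumption for illustration only, not how any search engine is known to work: it compares the page served to a crawler with the page served to a visitor using word-shingle overlap. The function names and the 0.5 threshold are hypothetical.

```python
import re

def shingles(text, k=3):
    """Return the set of k-word sequences in text (tags stripped, lowercased)."""
    words = re.sub(r"<[^>]+>", " ", text).lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of the two texts' shingle sets (1.0 = identical)."""
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def looks_cloaked(page_for_bot, page_for_user, threshold=0.5):
    """Naive rule (hypothetical): flag when the two versions share few shingles."""
    return jaccard(page_for_bot, page_for_user) < threshold

# A legitimate page whose headline rotates between requests is flagged too,
# which is exactly the false-positive problem with automating the check.
bot_view = "<h1>News</h1> Top story: markets rally on monday morning trading"
user_view = "<h1>News</h1> Top story: storm closes airport on tuesday evening flights"
print(looks_cloaked(bot_view, user_view))
```

Any page with rotating, personalised or geo-targeted content differs between two fetches, so a rule this simple cannot tell it apart from deliberate cloaking.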

 

Quote:
Green bar measurement designed to con people who want text link ads on your site, into paying more than they are actually worth.

Shhhhh! Or i’ll have to go back to a proper job.

 

I’ve used the WidgetBaiting tool, and using the Markov chain really does produce some great results. Readable text, but significantly different.
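For readers who have not met the technique, a Markov chain text generator can be sketched roughly as below. This is a generic word-level example under my own assumptions, not WidgetBaiting's actual code; the function names, the `order` parameter and the sample text are all hypothetical.

```python
import random
from collections import defaultdict

def build_chain(text, order=2):
    """Map each run of `order` consecutive words to the words seen right after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return dict(chain)

def generate(chain, length=30, seed=None):
    """Walk the chain, picking a random observed follower at each step."""
    rng = random.Random(seed)
    state = rng.choice(sorted(chain))      # random starting word pair
    out = list(state)
    while len(out) < length:
        followers = chain.get(tuple(out[-len(state):]))
        if not followers:                  # dead end: this run was never followed
            break
        out.append(rng.choice(followers))
    return " ".join(out)

# Hypothetical sample corpus; a real run would use a much larger source text.
sample = ("quality content takes time and money but quality content "
          "takes links and links take time and money and patience")
print(generate(build_chain(sample), length=15, seed=1))
```

Because every word is chosen from words that really followed the previous two in the source, the output tends to read locally like the original while differing from it overall, which matches the "readable, but significantly different" result described above.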

Be nice to have a tool like this to alter some data feeds, and avoid any duplicate content issues. I’m sure I can think of some other uses for it too.

Good or bad, when would you use a tool like this?

 

Quote:
If you combine random text - probably worse than Jason's tool, with cloaking. You better believe you'll be banned.

I'm guessing you're quite new to all of this...

 

>quite new to all of this

Believes in SE fud.

>could they also ditch sites with cruddy design

Hey, Gurtie, no need to get personal.

ah

sorry, was that your flashing, singing, single page 10ft long, unstylesheeted H1, all in comic sans, yellow-on-black, PR9 (honest guvner) make-free-money-with-no-effort site I reported the other day?

 

no, i use yellow, orange, and lime green --never on black. (it's true, tell her Nick)

it's also true that i just give them what they respond to.

~added~

BTW, I know I have a reputation for doing auto-generated stuff. I don't do junk, just ugly.

 

You're giving Bob way too much credit, Gurtie.

(And look where it got him.)

oh but lime green and orange is classy.

just please tell me you don't play the national anthem when the site opens?

 

> where it got him

I manage to stay out of the bread lines, yes.

on topic: I do know some of the auto-gen crowd and, quite frankly (if we're scoring by money) they are beating the hell out of whites, greys, and 'ordinary' blacks. In short, they're winning.

off topic: no anthem, gurtie. not that it matters to me, I never have speakers put on my desktops and mute the laptops.

 

Quote:
The more spammy techniques get poured into the same site, the more likely it'll get banned or at least penalized.

Webaward for most "deep thought" comment.

RC, omigod, are those your hundreds of sites submitted to my link directories that all have the same awesome Starburst candy colors? :-)

Adding to what Mikkel said, if done correctly (depends on the shelf life of the project) autogenerated can be perfectly useful. It just depends how much is put into it. If it's something whipped up in an hour, it's probably complete crap. If it's something that you spent a year developing the technology for and you're putting valid information into its "AI" to work with, then you just may have something solid that adds value to the SEs.

 

>autogenerated can be perfectly useful. It just depends how much is put into it

My first 'auto-gen' site is known by senior SE reps. It is approx. 12k pages and went up in '98 or early '99. I built it because people emailed and asked for wider coverage than my high-content site provided. But to be completely candid the intent was, ummm, grey. Truthfully, I spent months on development --building the feed database.

It has been listed from time-to-time by Salon, CNN, numerous newspapers, dozens upon dozens of schools and universities, and even boasts a book credit by a well-known US novelist. It's still out there plugging away and does around 4500 uniques/day. It is now MUCH copied (in concept, not too many scrapers) and needs an overhaul. I've been collecting more material to add to the database for 3 years now.

It is also extraordinarily ugly and (to steal a line from craigslist) has the visual impact of a pipe wrench.

hey Jason

Nice tool! I've been playing with it and it's pretty good.

I don't think there is any way in the foreseeable future that a search engine can fight against junk content on the content front. They need to look for signals of quality elsewhere.

Content is too easily scraped and recycled. In fact, plain old gibberish works well too if linked together properly.

 

yes, you win a ticket to the roadshow, jason.

 

Thanks everyone for the kind words. It's still being worked on and more features will come onboard as the great feedback keeps coming in.

hey RC. Roadshow! ?

 

http://www.seoroadshow.com/

Back on topic, unless you actually cloak these junk pages, are they any good out of the box?

I can't imagine as a surfer that if i came upon a bunch of gobbledegook i'd be particularly inclined to either a) click on an ad, or b) click on anything else at all

I think (and have done, actually) that i'd leave in disgust...

 

Quote:
unless you actually cloak these junk pages, are they any good out of the box

I would say yes and no.

A crap answer, but I'll try to explain what I mean. You aren't an average surfer, Nick; in fact I doubt many of us here are.

It depends on your sites' business model. Let's presume for a moment that you sell PPC advertising. Adsense, Overture or any one of the myriad of others out there.

My tests (and I know others have similar data agreeing with me) show that the fewer options you give someone on a page, the more likely they are to perform the action you want them to. If you sell advertising (clicks), you want them onto your site then off to one of your advertisers as soon as possible.

If the site holds no other option for them than to read crap content or click a link that will hopefully take them to the info they are looking for, then (and please take my word for this) you will see your CTR increase exponentially. People generally do NOT like clicking back on their browser and searching again. They trust the search engine to deliver high quality sites, and they trust their own brain power to pick the right page from the SERPs.

You need some content to give the page a theme worth indexing, but you don't want it to be so great that visitors only read pages from your site (which delivers a cost rather than income). Crap content is an amazing way to get people off site, clicking the links you offer.

 

I know one guy, an adsense master, who deliberately created geocities like pages, almost complete with spinning logo, with deliberately awful navigation and garish colors..

his adsense is the only thing that is simple and intuitive to click.

He does very well indeed at it :)

 

That's the perfect example Nick

Nice tool

Nice tool Jason. I've messed about in the area and my view has always been that you had to cloak because the 're-engineered', or as you euphemistically put it, 'evolved', text was too much like junk to let users read. But this tool does a nice job of making it unique while maintaining readability.

Um

greater minds than SE algos are being fooled by generated content

Quote:
A bunch of computer-generated gibberish masquerading as an academic paper has been accepted at a scientific conference in a victory for pranksters at the Massachusetts Institute of Technology.

Jeremy Stribling said on Thursday that he and two fellow MIT graduate students questioned the standards of some academic conferences, so they wrote a computer program to generate research papers complete with nonsensical text, charts and diagrams.

The trio submitted two of the randomly assembled papers to the World Multiconference on Systemics, Cybernetics and Informatics (WMSCI), scheduled to be held July 10-13 in Orlando, Florida.

To their surprise, one of the papers -- "Rooter: A Methodology for the Typical Unification of Access Points and Redundancy" -- was accepted for presentation.

from Reuters
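For the curious, paper generators of this kind are usually built by expanding a hand-written grammar at random rather than with a Markov chain. The sketch below assumes that approach; the grammar, the vocabulary and the `expand` function are invented for illustration and are not taken from the MIT program.

```python
import random

# Hypothetical toy grammar: upper-case keys are rules, everything else is a literal word.
GRAMMAR = {
    "SENTENCE": [["we", "VERB", "that", "NOUN", "and", "NOUN", "are", "ADJ"],
                 ["our", "NOUN", "VERBS", "NOUN", "in", "a", "ADJ", "manner"]],
    "VERB": [["demonstrate"], ["argue"], ["confirm"]],
    "VERBS": [["unifies"], ["refines"], ["simulates"]],
    "NOUN": [["redundancy"], ["access", "points"], ["the", "lookaside", "buffer"]],
    "ADJ": [["scalable"], ["typical"], ["largely", "incompatible"]],
}

def expand(symbol, rng):
    """Recursively replace a rule name with one randomly chosen production."""
    if symbol not in GRAMMAR:
        return [symbol]                  # literal word: emit as-is
    words = []
    for part in rng.choice(GRAMMAR[symbol]):
        words.extend(expand(part, rng))
    return words

rng = random.Random(7)
print(" ".join(expand("SENTENCE", rng)).capitalize() + ".")
```

Every sentence comes out grammatical because the grammar guarantees the structure, even though the words filling the slots are chosen without regard to meaning.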

Not surprising

Many papers presented at conferences are junk in the first place, the ones that haven't been accepted or have no chance of being accepted for publication in journals. It's in these you find most of the off-the-wall theories that proponents can then say were "presented" at a conference.

(Now, where can I get that MIT program?)
