Dependency Based Word Similarity Tool


While the name may be something only a college professor could love, this one is actually pretty neat. Try plugging in words like [auto] and [laptop]. I found the results for [health] and [hat] curiously intriguing, so go ahead and try it out.

Hat tip to DG


Fun, fun, fun...

...cURL/libcurl time.


interesting, but relevant?

Yes, always interesting. But how much research time does one have?

These tools work off a "newspaper corpus" collected over time (when, I can only guess) and not the web. If you want a picture of how useful this can be (or how disappointing, if it is not relevant to the web), use the two-word co-dependency tool on the tools page. Pick two "theme" words you know have a common subset on the web, and see if the tool detects that subset. The red words are common to both sets.

In each case I tried (a few) it did not.

That means it did not reveal the key words needed to mimic an established set overlap on the web.
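For anyone who wants to repeat that check on their own word lists, here is a minimal sketch of the comparison. It assumes you have already copied the two similar-word lists out of the tool by hand; the function names and the sample words are made up for illustration and do not come from the tool or its data.

```python
def shared_neighbors(neighbors_a, neighbors_b):
    """Return the words from the first list that also appear in the second.

    Comparison is case-insensitive; order follows the first list.
    """
    seen_b = {word.lower() for word in neighbors_b}
    return [word for word in neighbors_a if word.lower() in seen_b]


def missing_from_overlap(neighbors_a, neighbors_b, expected):
    """Return the expected overlap words that the two lists do NOT share."""
    shared = {word.lower() for word in shared_neighbors(neighbors_a, neighbors_b)}
    return [word for word in expected if word.lower() not in shared]


# Hypothetical lists, typed in by hand; not real output from the tool.
list_a = ["Car", "truck", "vehicle"]
list_b = ["vehicle", "CAR", "bus"]

# Words you already know the two themes share on the web (also hypothetical).
known_web_overlap = ["car", "hybrid"]

print(shared_neighbors(list_a, list_b))
print(missing_from_overlap(list_a, list_b, known_web_overlap))
```

An empty result from `missing_from_overlap` would mean the tool found every word in the overlap you expected; anything it returns is a word the corpus failed to surface.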

Lots of words and an interesting tool, but is it worthy of kw research time?

how much research time?

I've got loads of research time. ;) The application isn't meant to help anyone with keyword research and the corpus needs to be larger.

The scientist who created it is now working for Google. Whether the corpus consists of web documents or scanned books or magazines is irrelevant. The corpus won't change word dependency, but a larger corpus would allow for a larger vocabulary for study.

perhaps I misunderstood

Whether the corpus consists of web documents or scanned books or magazines is irrelevant. The corpus won't change word dependency,

DG, can you explain word-word dependency for the non-linguist? I had thought it was tied to the frequency of words in the context of others, but now I suspect it is a grammatical structure relationship (and not tied to semantics). If that is correct, how could the corpus not matter? We write quite differently on the web than in newspapers, no?


It was explained to me like

It was explained to me like this:

Describe a coffee mug.

Did you use the word 'handle'? Or 'ceramic'? What about 'logo' or 'chipped'?

Now describe a plane.

Did you use the words 'wing', 'fly', 'tail', 'jet', or 'prop'?

Or did you use 'wood', 'chips', 'smooth', or 'sharp'?

For a given subject, certain words are inescapable. The problem that I saw is that the corpus used isn't as useful as we'd like it to be because 'bald', 'rogaine', 'obesity', 'diet', 'travel', 'motel', etc. aren't represented well in the corpus. Adding to the corpus would solve the problem.

Wouldn't matter if you used travel websites or travel agency literature to flesh out the corpus. Personally, I would use both, and ideally, you want as large a body of documents as possible.
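To make the mug and plane idea concrete, here is a rough sketch. Note that it swaps in plain window-based co-occurrence counts with cosine similarity, not the dependency parsing the tool itself is named for, and the four-sentence corpus is invented for the example. It only illustrates two points from above: words used in the same company score as similar, and a word missing from the corpus cannot be scored at all.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences, window=4):
    """Count, for each word, the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return dict(vectors)  # plain dict, so lookups never create empty entries


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented toy corpus; a real one would be many thousands of documents.
toy_corpus = [
    "the mug has a ceramic handle",
    "the cup has a ceramic handle",
    "the plane has a wing and a tail",
    "the jet has a wing and a tail",
]

vecs = context_vectors(toy_corpus)

print(cosine(vecs["mug"], vecs["cup"]))    # same company of words
print(cosine(vecs["mug"], vecs["plane"]))  # only the filler words overlap
print("rogaine" in vecs)                   # absent from the corpus, so no vector
```

Adding documents that mention the missing words, whatever their source, is what gives those words a vector to compare.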

>>we write quite differently

Of course, but everyone writes differently, and 'we' make up a small segment of people producing copy for websites. (assuming you're talking about optimized copy)

Additionally, when things like co-occurrence come into play, 'we' need to take another look at our 'optimized' copy. In short, less can be more when we're speaking of keywords.

The risk is only using highly specialized documents to build the corpus. Say, corporate mission statements. ;) Though those might work to build a BS detector...

I always found that when

I always found that when people focus too heavily on optimizing, the content sounds mechanical, and sometimes people strip out important modifiers when they think in terms of keyword density for a main phrase or two.

A while ago I posted that I thought there was a shift from content optimization to content generation. I really believe that for most people, having one or a few well-read channels or many uber-niche channels that are written without much thought towards search engines (other than maybe a clean CMS, cross-referencing related information, and using descriptive titles on some pages) will prove to be far more effective than content which is tweaked over and over again trying to match algorithms.

downloadable databases

Forget curl or LWP. Check out the rest of the site. He has two downloadable databases -- one is a dependency thesaurus and the other is a word proximity thesaurus. Pure gold, baby :)

pure gold.... how about platinum?

DG thanks for clarifying. It renews my interest, because that is what I thought.

Given that, this type of work with the proper corpus would be useful for keyword research (broadly defined, not your typical SEM definition). That current dataset, for the markets I checked, is not helpful.

Addressing Aaron, I believe times have changed with respect to content, despite all the talk about writing content, user-submitted content, etc. I won't go into detail because there's no benefit in doing so (sorry, another topic). However, I expect DG agrees that semantics play a role in today's G, and I think crafted content is the "platinum" to the button-pusher's gold.
