Optimizing Bullshit: A Duplicate Content Experiment

I’ve been spending a good deal of effort getting bullshit indexed in Google. By this I mean unoriginal content that tons of other sites display and that doesn’t pass Copyscape. There are a lot of advantages to having a cache of indexed pages, but more importantly, by experimenting with content that should trigger whatever duplicate content filters are in place, I hope to learn more about how those filters work.

All of the sites I put into this category have an important characteristic: they run themselves.

There are basically four types of these sites that I’ve been playing with:

  • Content database driven sites (jokes, cheat codes, recipes, etc)
  • Video blogs / aggregators
  • Flash arcades
  • Yahoo! Answers content sites

Below is a bit about each type of site, and a rundown of the sites of that type I will be optimising. I’m not going to be linking to or disclosing the URLs of these sites, for fairly obvious reasons.

Content database driven sites

These are my favorite types of automatic sites to work on, because sometimes they end up being decent resources - even if there are thousands like them. These are sites which derive their content from cheap databases purchased from the DigitalPoint forums or even found freely on the internet. This would include cheat codes sites, drink recipes, jokes, etc.

Side note: I have 2 basically identical cheat codes sites, running the same database of 9250 games. One has 1900 pages indexed, the other only 200. The difference between the two is that the more successful site has an XML sitemap and unique meta tags for each page. Nothing revolutionary there, but worth noting.
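
To make the sitemap and meta tag point concrete, here is a minimal Python sketch of both pieces. It is not the code behind either site: the function names, the example.com URLs, and the wording of the title and description are all assumptions made for illustration. The only fixed detail is the sitemap namespace, which comes from the sitemaps.org protocol.

```python
import html
from xml.etree import ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML string with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as "&" when serializing.
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


def meta_tags(game, platform, cheat_count):
    """Build a title and meta description from a database row (hypothetical fields)."""
    title = f"{game} Cheats for {platform}"
    description = f"{cheat_count} cheat codes and tips for {game} on {platform}."
    return (
        f"<title>{html.escape(title)}</title>\n"
        f'<meta name="description" content="{html.escape(description)}">'
    )


if __name__ == "__main__":
    # Hypothetical page URLs; a real site would generate these from its database.
    print(build_sitemap(["http://example.com/cheats/1", "http://example.com/cheats/2"]))
    print(meta_tags("Some Game", "PC", 12))
```

Because the tags are built from fields that differ per row, every page ends up with its own title and description even though the template is shared.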

Video Blogs / aggregators

These are sites which index videos from YouTube, AOL, Break, Metacafe, iFilm, etc. Usually they pull content from RSS feeds, which can be any tag for YouTube or a list of top videos from most other sites.

These take the form of a videoblog, with a new video posted every X hours, or an aggregator which attempts to be a portal of some level.
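
As a rough sketch of the feed-driven approach described above, the Python below parses an RSS 2.0 feed and picks the next video that has not been posted yet. The feed content, the video.example URLs, and the function names are made up for illustration; a real site would download the feed from the video site and run this on a schedule (every X hours).

```python
import xml.etree.ElementTree as ET

# Hypothetical feed content standing in for a downloaded RSS 2.0 document.
SAMPLE_FEED = """<rss version="2.0"><channel>
<title>Example videos</title>
<item><title>Clip one</title><link>http://video.example/1</link></item>
<item><title>Clip two</title><link>http://video.example/2</link></item>
</channel></rss>"""


def parse_items(feed_xml):
    """Return a list of {"title", "link"} dicts, one per <item> in the feed."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default="").strip(),
            "link": item.findtext("link", default="").strip(),
        })
    return items


def next_unposted(items, posted_links):
    """Return the first item whose link is not in posted_links, or None."""
    for item in items:
        if item["link"] not in posted_links:
            return item
    return None


if __name__ == "__main__":
    already_posted = {"http://video.example/1"}
    print(next_unposted(parse_items(SAMPLE_FEED), already_posted))
```

Tracking posted links keeps a videoblog from publishing the same clip twice when it shows up in more than one feed.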

Flash Arcades

I like Flash games, and if the number of Flash arcades is any indicator, so do many others. These sites do not produce their own content, but rather have a few thousand Flash games compiled from other sites. There’s not a lot of difference between these and content database sites, except that they offer viral linking opportunities (real or simulated) and consume much more bandwidth.

Yahoo! Answers content sites

Yahoo! Answers provides a platform for users to post questions, have other members answer them, and select the correct / most helpful answer. Yahoo also provides an API to allow 3rd parties to access this data and mash it up. This provides a limitless supply of content to automatic sites, all for the cost of a “Powered by Yahoo Answers” link in the footer, which should be simple for a spider to recognize.

I have 2 Yahoo Answers driven sites which I will be experimenting with, and when I have time I will look at remixing the content a bit more to improve indexing. My current Y!A sites run a custom CMS which employs caching to reduce the number of requests to the Yahoo! API, and maintains an XML sitemap to encourage full indexing. However, as of now neither has more than 50 pages in the index and visits from the crawlers are relatively infrequent, so there’s a lot of work to be done here.
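
The caching idea can be sketched in a few lines of Python. This is not the custom CMS mentioned above, just one simple way to do it: a time-based cache wrapped around whatever function actually calls the API. The `fetch` callable, the class name, and the one hour lifetime are all assumptions, and no real Yahoo! endpoint is shown.

```python
import time


class TTLCache:
    """Cache results of fetch(key) and reuse them until they are ttl_seconds old."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch        # hypothetical function that performs the API request
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so the cache can be tested without sleeping
        self._store = {}           # key -> (time stored, value)

    def get(self, key):
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]          # still fresh: no API request made
        value = self._fetch(key)   # missing or expired: hit the API once
        self._store[key] = (now, value)
        return value


if __name__ == "__main__":
    def fake_fetch(question_id):
        # Stand-in for the real API call.
        print("fetching", question_id)
        return {"id": question_id, "answers": []}

    cache = TTLCache(fake_fetch, ttl_seconds=3600)
    cache.get("q1")   # triggers a fetch
    cache.get("q1")   # served from the cache
```

With this in place, repeat page views and repeat crawler visits inside the lifetime cost no extra API requests.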

Project Goals

The goal of this experiment is not necessarily ranking highly in SERPs, but rather having as many pages as possible included in the index. Through this process, I hope to gain insight into deep indexing strategies, and generate a cache of indexed pages to use for future projects (and maybe even make some cash).

I’ll be experimenting with breaking the bullshit content up, mixing it with other automatic content sources, internal linking, and meta tag and sitemap optimization to maximize inclusion in the index. If you have any suggestions or have tried a similar experiment, share them in the comments.
