Monthly Archives: March 2010

Creating a duplicate checker and filter for articles

Over the past week we’ve been looking for a way to build a duplicate content checker for the articles we’re importing. We found an open source script written in C that we’ll use for this. The duplicate checking will happen in two phases.

In the first phase we will check for duplicates within the big article pack itself (115.000+ articles). We already have a beta version working, but its capacity is limited (about 23.000 files in total), so our tech team will have to modify it to compare all of the articles in one run. In the first test run on a batch of 23.000 articles we found that about 10% of the articles are probably duplicates. These include badly spun articles (with only a few sentences changed) and articles with only their title changed.

In the second phase we will build a database of current articles. From that point on, every new article will be compared to the existing ones before we add it to the database.
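To give a rough idea of how such a two-phase check can work, here is a minimal Python sketch. It is not the C script mentioned above, whose method this post does not describe. It assumes a common technique instead: comparing article bodies by the Jaccard similarity of their three-word "shingles". The names `ArticleIndex` and `dedupe_batch` and the 0.8 threshold are made up for illustration, and only the body is compared, so an article with just its title changed still counts as a duplicate.

```python
import re


def shingles(text, size=3):
    """Return the set of overlapping word n-grams ("shingles") in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class ArticleIndex:
    """In-memory stand-in for the article database (hypothetical)."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold  # assumed cut-off, needs tuning on real data
        self._articles = {}         # article id -> shingle set of the body

    def find_duplicate(self, body):
        """Return the id of the first stored article similar to body, or None."""
        new = shingles(body)
        for article_id, existing in self._articles.items():
            if jaccard(new, existing) >= self.threshold:
                return article_id
        return None

    def add_if_new(self, article_id, body):
        """Phase two: store the article unless it duplicates a stored one."""
        if self.find_duplicate(body) is not None:
            return False
        self._articles[article_id] = shingles(body)
        return True


def dedupe_batch(articles, threshold=0.8):
    """Phase one: split (id, body) pairs into kept ids and rejected ids."""
    index = ArticleIndex(threshold)
    kept, rejected = [], []
    for article_id, body in articles:
        target = kept if index.add_if_new(article_id, body) else rejected
        target.append(article_id)
    return kept, rejected
```

Every new article is compared against every stored one, so the work grows quadratically with the number of articles. For a pack of 115.000+ articles a faster scheme (for example MinHash with locality-sensitive hashing) would probably be needed.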

Next we’ll build an article filter. This will be done with another script, which filters articles based on keywords or key phrases. We want to avoid copyrighted articles (yes, you can find copyrighted articles in article packs!), articles that are just useless promotional material (with links to websites), articles that have little value because of their length (less than 300 words) and articles with questionable content (e.g. promoting hate speech, stereotypes or similar).
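A filter along these lines could look roughly like the Python sketch below. Only the 300-word minimum comes from the post. The phrase lists, the link pattern and the function name `rejection_reasons` are placeholders invented for illustration, and a real filter would need far longer, carefully reviewed lists.

```python
import re

MIN_WORDS = 300  # length threshold mentioned in the post

# Hypothetical phrase lists, for illustration only.
COPYRIGHT_PHRASES = ("all rights reserved", "copyright ©", "(c) copyright")
BLOCKED_PHRASES = ("example blocked phrase",)

# Counts each run starting with http://, https:// or www. as one link.
LINK_PATTERN = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def rejection_reasons(text, max_links=0):
    """Return a list of reasons to reject the article (empty list = keep)."""
    reasons = []
    lowered = text.lower()
    if len(text.split()) < MIN_WORDS:
        reasons.append("too short")
    if any(phrase in lowered for phrase in COPYRIGHT_PHRASES):
        reasons.append("copyright notice")
    if len(LINK_PATTERN.findall(text)) > max_links:
        reasons.append("promotional links")
    if any(phrase in lowered for phrase in BLOCKED_PHRASES):
        reasons.append("questionable content")
    return reasons
```

Returning the reasons instead of a plain yes/no makes it easier to review borderline articles by hand and to adjust the phrase lists later.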

We strive to have high quality articles, which won’t get anyone into trouble. Even though the filtering will be as thorough as possible, we know that a few articles will probably slip through. This is why we also created a button with which you can mark an article as “unsuitable”. In this case, we’ll review it and delete it if necessary. You can already do this now by clicking “Report article as spam!” at the top of the article page.

Survey results are in

For about 14 days now. 🙂 We have a lot of other things going on, so I haven’t had time to write a blog post. Well, the beta testers have expressed their opinions and here are the questions and answers:

1. How do you rate the quality of the search function?

  • Fantastic, I always find what I’m looking for! – 33% (2 users)
  • Good, most of the time I find what I’m looking for. – 67% (4 users)
  • Bad, usually I can’t find what I’m looking for.

2. How do you rate the quality of articles?

  • Fantastic, these are very good PLR articles!
  • Good, most of the articles are very solid. – 100% (6 users)
  • Bad, most of the articles are bad.

3. How do you rate the ease of use and user friendliness?

  • Great – 67% (4 users)
  • Good – 33% (2 users)
  • Bad

4. How much would you pay per month for this application?
Keep in mind the live version will have over 100.000 articles, a few thousand eBooks and over 1000 fresh articles each month. Additionally, there will be some new features like a download cart (batch downloading), an article spinner, saving an RSS feed of searches, etc.

  • I would pay max 67$ per month – 17% (1 user)
  • I would pay max 37$ per month – 67% (4 users)
  • I would pay max $ per month, enter below – 17% (1 user)
    I would pay max USD (number only): 19.95 $

Now, we’re fully aware these results are not statistically significant. But if they reflect our future users in any way, we’re on the right path, especially since one user stated he/she would be willing to pay up to 67$ per month.
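For the curious, the percentages for the pricing question can be reproduced from the raw counts with a few lines of Python. The average maximum price computed at the end is not part of the survey results. It is simple arithmetic on the six answers and comes to roughly 39 USD per month, which with so few respondents is only a rough indication.

```python
# Answers to question 4: (maximum monthly price in USD, number of users).
answers = [(67.00, 1), (37.00, 4), (19.95, 1)]

total_users = sum(count for _, count in answers)

# Share of users per answer, rounded to whole percent as in the results.
shares = {price: round(100 * count / total_users) for price, count in answers}

# Average of the stated maximum prices, weighted by number of users.
mean_price = sum(price * count for price, count in answers) / total_users

print(total_users, shares, round(mean_price, 2))
```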

Beta test poll

We received a lot of feedback and it was mostly very positive. One tester in particular gave us great, thorough feedback: he really took his time to analyze our application and also brought up quite a few good ideas.

Before we start planning the next phases, I’d also like to give our beta testers a quick, 30-second poll so we can get at least basic feedback from those who haven’t given any yet. We also added a very important question about pricing.

You can find the survey here.

March System Update

A very painful weekend is behind us. We had a server transfer planned and instead of a painless 2-hour downtime we had to transfer the whole site to three different servers before we got a reliable server working (the first two crashed and never booted again). This is the main reason behind the 1+ day outage.

Then we had some problems because DNS didn’t refresh as soon as we had hoped, and a lot of members still had problems accessing the members area today.

This can be resolved by editing your hosts file. Here’s our email from earlier:

As you probably know we had some issues with updating the server and the website through the weekend. For some members the login still doesn’t work and this is the solution we suggest:
Go to your hosts file and add a new IP to the domain:
184.106.220.147 members.bigcontentsearch.com
If you don’t know how to change your hosts files check this YouTube video: http://bit.ly/f5jYKY
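For readers who prefer to script the change, the following Python sketch shows the idea on the text of a hosts file. It only works on a string and does not touch the real file, which usually lives at /etc/hosts on Linux and macOS and at C:\Windows\System32\drivers\etc\hosts on Windows, and needs administrator rights to edit. The function names are made up for illustration.

```python
def has_hosts_entry(hosts_text, ip, hostname):
    """Return True if hosts_text maps hostname to ip on an active line."""
    for line in hosts_text.splitlines():
        # Drop anything after '#', then split the rest on whitespace.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == ip and hostname in fields[1:]:
            return True
    return False


def with_hosts_entry(hosts_text, ip, hostname):
    """Return hosts_text with the 'ip hostname' line appended if missing."""
    if has_hosts_entry(hosts_text, ip, hostname):
        return hosts_text
    if hosts_text and not hosts_text.endswith("\n"):
        hosts_text += "\n"
    return hosts_text + f"{ip} {hostname}\n"


if __name__ == "__main__":
    sample = "127.0.0.1 localhost\n"
    print(with_hosts_entry(sample, "184.106.220.147",
                           "members.bigcontentsearch.com"), end="")
```

An entry like this overrides DNS for that one domain, so it is best removed again once DNS has caught up.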

As far as we know now, all members should have normal access to the site. If you’re not one of them, let us know.

Good News

What’s been done in this update:

  • updated free article spinner (way more useful and with a spintax export button),
  • article preview in the search results page (click the magnifying glass),
  • added 7000 new PLR articles,
  • added quick support button to the bottom of the page,
  • fixed minor bugs,
  • system update.

If you have any comments or would just like to leave feedback, you can do so using the button in the right corner of your browser.

We’re already planning the next update, and we’ll do everything in our power to make it go more smoothly than this one.

Thank you all for your patience.

First feedback from beta testing is in

We’ve invited a couple of Warriors to join our beta testing phase and we’ve already received some feedback. So far these are just basic first-impression comments, but they’re all positive, so we’re really glad people find our service useful.

We hope the beta testers will continue testing and give us even more thorough reviews.

We’ve also found a bug (some articles are not showing) and are already resolving it.

Wanted Feedback

We’d like to hear your feedback on the following topics:

  • quality of search results
  • quality of articles
  • ease of use
  • any problems you encountered
  • wanted future functions