Last week we looked into ways to build a duplicate content checker for the articles we’re importing. We found an open source script written in C which we’ll use for this. We’ll do the duplicate checking in two phases.
In the first phase we’ll check for duplicates within the big article pack itself (115,000+ articles). We already have a beta version working, but it can only handle about 23,000 files per run, so our tech team will have to modify it to compare all of the articles in one pass. In the first test run on a batch of 23,000 articles we found that about 10% of them are probably duplicates. These include badly spun articles (where only a few sentences differ) or articles with only the title changed.
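To give a rough idea of how this kind of check can work, here is a minimal sketch in Python. It is only an illustration of the general technique (word “shingles” plus Jaccard similarity), not the C script we’re actually using, and the similarity threshold is our own assumption:

```python
# Illustrative sketch only -- not the actual C script mentioned above.
# Flags article pairs whose word-shingle overlap exceeds a threshold.

def shingles(text, size=5):
    """Return the set of overlapping word n-grams ("shingles") in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}

def jaccard(a, b):
    """Similarity between two shingle sets: |intersection| / |union|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def find_duplicates(articles, threshold=0.6):
    """Compare every pair of articles; the 0.6 threshold is an assumption."""
    fingerprints = {name: shingles(text) for name, text in articles.items()}
    names = list(fingerprints)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if jaccard(fingerprints[first], fingerprints[second]) >= threshold:
                pairs.append((first, second))
    return pairs
```

Comparing every pair like this grows quickly with the number of articles, which is roughly why a single run tops out at a limited batch size; scaling to 115,000+ articles usually means switching to compact hashed fingerprints (e.g. MinHash) so each article is checked against an index instead of against every other article.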
In the second phase we’ll build a database of our current articles. From that point on, every new article will be compared to the existing ones before it is added to the database.
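A minimal sketch of what this second phase could look like follows below, reusing the `shingles()` and `jaccard()` helpers from the sketch above. The `articles` table and its columns are made up for illustration; our actual database layout may differ:

```python
# Illustrative sketch: compare each new article against the stored ones
# before inserting it. Table and column names are assumptions.
import sqlite3

def is_duplicate(conn, new_text, threshold=0.6):
    """Return True if new_text is too similar to any stored article."""
    new_fp = shingles(new_text)  # shingles()/jaccard() from the sketch above
    for (stored_text,) in conn.execute("SELECT body FROM articles"):
        if jaccard(new_fp, shingles(stored_text)) >= threshold:
            return True
    return False

def add_article(conn, title, body):
    """Insert the article only if it does not duplicate an existing one."""
    if is_duplicate(conn, body):
        return False
    conn.execute("INSERT INTO articles (title, body) VALUES (?, ?)", (title, body))
    conn.commit()
    return True
```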
The next thing we’ll build is an article filter. This will be done with the help of another script, which filters articles based on keywords or key phrases. We want to avoid articles that carry a copyright notice (yes, you can find copyrighted articles in article packs!), articles that are just useless promotional material (with links to websites), articles with no value due to their length (less than 300 words), and articles with questionable content (e.g. promoting hate speech, stereotypes or similar).
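As a rough sketch of how such a filter might look (again in Python, with example phrase lists and an assumed limit on promotional links, not our actual rules):

```python
# Illustrative sketch of the planned article filter.
# The phrase lists and thresholds below are examples, not our real lists.
import re

BLOCKED_PHRASES = [
    "all rights reserved",   # copyright notices
    "copyright ©",
    "visit our website",     # purely promotional material
    "click here to buy",
]

MIN_WORDS = 300              # articles shorter than this have little value

def passes_filter(text):
    """Reject articles that are too short, contain a blocked phrase,
    or are stuffed with promotional links."""
    if len(text.split()) < MIN_WORDS:
        return False
    lowered = text.lower()
    if any(phrase in lowered for phrase in BLOCKED_PHRASES):
        return False
    # More than two URLs is treated as promotional (an assumed cutoff).
    if len(re.findall(r"https?://", lowered)) > 2:
        return False
    return True
```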
We strive to offer high quality articles that won’t get anyone into trouble. Even though the filtering will be as thorough as possible, we know that a few articles will probably slip through. That’s why we also created a button that lets you mark an article as “unsuitable”. In that case, we’ll review it and delete it if necessary. You can already do this now by clicking “Report article as spam!” at the top of the article page.