
Eliminate near duplicates with the Jaccard distance

It often happens that the data has duplicates or near duplicates that should be filtered out. Twitter data has lots of duplicates that can be quite frustrating to work with, even with the -filter:retweets option available for the search API. A quick way to see this is to sort the text column in the spreadsheet; tweets with common prefixes will then be neighbors:

Duplicate tweets that share a prefix

This sort only reveals shared prefixes; there are many more near duplicates that don't share a prefix. This recipe will allow you to find other sources of overlap and set the threshold, the proximity at which near duplicates are removed.

How to do it…

Perform the following steps to eliminate near duplicates with the Jaccard distance:

  1. Type in the command prompt:
    java -cp lingpipe-cookbook.1.0.jar:lib/opencsv-2.4.jar:lib/lingpipe-4.1.0.jar com.lingpipe.cookbook.chapter1.DeduplicateCsvData
    
  2. You will be overwhelmed with a torrent of text:
    Tweets too close, proximity 1.00
     @britneyspears do you ever miss the Disney days? and iilysm please follow me. kiss from Turkey #AskBritneyJean ??
     @britneyspears do you ever miss the Disney days? and iilysm please follow me. kiss from Turkey #AskBritneyJean ??? 
    Tweets too close, proximity 0.50
     Sooo, I want to have a Disney Princess movie night....
     I just want to be a Disney Princess
    
  3. Two example outputs are shown. The first is a near-exact duplicate whose only difference is a trailing ?; because punctuation is not tokenized (see the following section), it has a proximity of 1.00. The next example has a proximity of 0.50; the tweets are different but have a good deal of word overlap. Note that the second pair does not share a prefix.

How it works…

This recipe jumps a bit ahead of the sequence, using a tokenizer to drive the deduplication process. It is here because the following recipe, for sentiment, really needs deduplicated data to work well. Chapter 2, Finding and Working with Words, covers tokenization in detail.

The source for main() is:

String inputPath = args.length > 0 ? args[0] : "data/disney.csv";
String outputPath = args.length > 1 ? args[1] : "data/disneyDeduped.csv";  
List<String[]> data = Util.readCsvRemoveHeader(new File(inputPath));
System.out.println(data.size());

There is nothing new in the preceding code snippet, but the following one introduces a TokenizerFactory:

TokenizerFactory tokenizerFactory = new RegExTokenizerFactory("\\w+");

Briefly, the tokenizer breaks the text into token sequences defined by matching the regular expression \w+ (the first \ escapes the second one in the preceding code; it is a Java string-literal thing). It matches contiguous word characters, so the string "Hi, you here??" produces the tokens "Hi", "you", and "here"; the punctuation is ignored.
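
If you want to see the tokens for yourself, the following is a minimal sketch using the standard LingPipe tokenizer calls; the snippet is an illustration and is not part of the recipe's source:

import com.aliasi.tokenizer.RegExTokenizerFactory;
import com.aliasi.tokenizer.Tokenizer;
import com.aliasi.tokenizer.TokenizerFactory;

// Tokenize a sample string with \w+ and print each token.
TokenizerFactory tokenizerFactory = new RegExTokenizerFactory("\\w+");
char[] chars = "Hi, you here??".toCharArray();
Tokenizer tokenizer = tokenizerFactory.tokenizer(chars, 0, chars.length);
for (String token : tokenizer.tokenize()) {
  System.out.println(token); // prints Hi, then you, then here
}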

Next up, Util.filterJaccard is called with a cutoff of .5, which roughly eliminates tweets that overlap in half of their words. Then, the filtered data is written to disk:

double cutoff = .5;
List<String[]> dedupedData = Util.filterJaccard(data, tokenizerFactory, cutoff);
System.out.println(dedupedData.size());
Util.writeCsvAddHeader(dedupedData, new File(outputPath));
}

The Util.filterJaccard() method's source is as follows:

public static List<String[]> filterJaccard(List<String[]> texts, TokenizerFactory tokFactory, double cutoff) {
  JaccardDistance jaccardD = new JaccardDistance(tokFactory);

In the preceding snippet, a JaccardDistance class is constructed with a tokenizer factory. The Jaccard proximity divides the size of the intersection of the two strings' token sets by the size of their union; the distance is one minus the proximity. Look at the Javadoc for more information.
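
To make the arithmetic concrete, the following sketch reproduces the 0.50 proximity from the second pair of tweets shown in step 2; the snippet is an illustration, not part of the recipe's source:

import com.aliasi.spell.JaccardDistance;
import com.aliasi.tokenizer.RegExTokenizerFactory;
import com.aliasi.tokenizer.TokenizerFactory;

TokenizerFactory tokenizerFactory = new RegExTokenizerFactory("\\w+");
JaccardDistance jaccardD = new JaccardDistance(tokenizerFactory);
String s1 = "Sooo, I want to have a Disney Princess movie night....";
String s2 = "I just want to be a Disney Princess";
// s1 has 10 distinct tokens and s2 has 8; they share 6 ("I",
// "want", "to", "a", "Disney", "Princess"), so the union has 12
// tokens and proximity = 6/12 = 0.50; distance = 1 - 0.50 = 0.50.
System.out.printf("proximity %.2f%n", jaccardD.proximity(s1, s2));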

The nested for loops in the following example compare each row with every other row until a proximity at or above the cutoff is found or until all of the data has been looked at. Do not use this for large datasets, because it is an O(n²) algorithm. If no other row is at or above the cutoff proximity, the row is added to filteredTexts:

List<String[]> filteredTexts = new ArrayList<String[]>();
for (int i = 0; i < texts.size(); ++i) {
  String targetText = texts.get(i)[TEXT_OFFSET];
  boolean addText = true;
  // Compare the current row against every later row; drop it as
  // soon as one is at or above the cutoff proximity.
  for (int j = i + 1; j < texts.size(); ++j) {
    String comparisonText = texts.get(j)[TEXT_OFFSET];
    double proximity = jaccardD.proximity(targetText, comparisonText);
    if (proximity >= cutoff) {
      addText = false;
      System.out.printf(" Tweets too close, proximity %.2f\n", proximity);
      System.out.println("\t" + targetText);
      System.out.println("\t" + comparisonText);
      break;
    }
  }
  if (addText) {
    filteredTexts.add(texts.get(i));
  }
}
return filteredTexts;
}

There are much more efficient ways to filter the texts, at the cost of extra complexity: a simple reverse-word (inverted) lookup index that computes an initial covering set of candidates is vastly more efficient. Search for shingling and text-lookup techniques for O(n) to O(n log(n)) approaches.
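
As a rough sketch of the inverted-index idea, the following builds a token-to-row index inside the scope of filterJaccard, reusing its texts and tokFactory parameters and the TEXT_OFFSET constant; the index itself and the surrounding logic are invented for this illustration (assume the usual java.util imports), not LingPipe API:

// Sketch: map each token to the rows containing it, so a row only
// needs to be scored against the rows in its own tokens' posting
// lists rather than against all n rows.
Map<String, List<Integer>> index = new HashMap<String, List<Integer>>();
for (int i = 0; i < texts.size(); ++i) {
  char[] cs = texts.get(i)[TEXT_OFFSET].toCharArray();
  Tokenizer tokenizer = tokFactory.tokenizer(cs, 0, cs.length);
  for (String token : tokenizer.tokenize()) {
    List<Integer> rows = index.get(token);
    if (rows == null) {
      rows = new ArrayList<Integer>();
      index.put(token, rows);
    }
    // Rows are processed in order, so record each row at most
    // once per token.
    if (rows.isEmpty() || rows.get(rows.size() - 1) != i) {
      rows.add(i);
    }
  }
}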

Setting the threshold can be a bit tricky, but looking at a bunch of data should make the appropriate cutoff fairly clear for your needs.
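
One practical way to pick the cutoff is to sweep a few values and watch how many rows survive; the following small sketch reuses this recipe's Util.filterJaccard (note that it will also print the pairs it drops):

// Try several cutoffs and report how much data survives each one;
// pick the value where obvious duplicates vanish but distinct
// tweets remain.
for (int pct = 30; pct <= 90; pct += 10) {
  double cutoff = pct / 100.0;
  List<String[]> deduped = Util.filterJaccard(data, tokenizerFactory, cutoff);
  System.out.printf("cutoff %.1f keeps %d of %d rows%n", cutoff, deduped.size(), data.size());
}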
