
Applying a classifier to a .csv file

Now, we can test our language ID classifier on the data we downloaded from Twitter. This recipe will show you how to run the classifier on the .csv file and will set the stage for the evaluation step in the next recipe.

How to do it...

Applying a classifier to the .csv file is straightforward! Just perform the following steps:

  1. Open a command prompt and run (on Windows, use ; instead of : as the classpath separator):
    java -cp lingpipe-cookbook.1.0.jar:lib/lingpipe-4.1.0.jar:lib/twitter4j-core-4.0.1.jar:lib/opencsv-2.4.jar com.lingpipe.cookbook.chapter1.ReadClassifierRunOnCsv
    
  2. This will use the default CSV file from the distribution, data/disney.csv, run over each line of the CSV file, and apply the language ID classifier from models/3LangId.LMClassifier to it:
    InputText: When all else fails #Disney
    Best Classified Language: english
    InputText: ES INSUPERABLE DISNEY !! QUIERO VOLVER:(
    Best Classified Language: Spanish
    
  3. You can also specify the input file as the first argument and the classifier model as the second one.

How it works...

We will deserialize a classifier from the externalized model that was described in the previous recipes. Then, we will iterate through each line of the .csv file and call the classify method of the classifier. The code in main() is:

String inputPath = args.length > 0 ? args[0] : "data/disney.csv";
String classifierPath = args.length > 1 ? args[1] : "models/3LangId.LMClassifier";
@SuppressWarnings("unchecked")
BaseClassifier<CharSequence> classifier
  = (BaseClassifier<CharSequence>) AbstractExternalizable.readObject(new File(classifierPath));
List<String[]> lines = Util.readCsvRemoveHeader(new File(inputPath));
for (String[] line : lines) {
  String text = line[Util.TEXT_OFFSET];
  Classification classified = classifier.classify(text);
  System.out.println("InputText: " + text);
  System.out.println("Best Classified Language: " + classified.bestCategory());
}
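
The two patterns in main(), falling back to a default when an argument is missing and classifying the text column of each row, can be tried out without the LingPipe or opencsv JARs. The following is only a minimal sketch under stated assumptions: the class ClassifyRowsSketch, the helper argOrDefault, the column index 3, and the toy keyword-based classifier are all hypothetical stand-ins, not part of the cookbook code or the LingPipe API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch; not part of the cookbook source.
class ClassifyRowsSketch {

    // Assumed column index of the tweet text; stands in for Util.TEXT_OFFSET.
    static final int TEXT_OFFSET = 3;

    // Returns args[index] if it was supplied, otherwise the fallback value.
    static String argOrDefault(String[] args, int index, String fallback) {
        return args.length > index ? args[index] : fallback;
    }

    // Applies the classifier to the text column of every row and collects the labels.
    static List<String> classifyAll(List<String[]> rows,
                                    Function<CharSequence, String> classifier) {
        List<String> labels = new ArrayList<>();
        for (String[] row : rows) {
            labels.add(classifier.apply(row[TEXT_OFFSET]));
        }
        return labels;
    }

    public static void main(String[] args) {
        String inputPath = argOrDefault(args, 0, "data/disney.csv");
        String classifierPath = argOrDefault(args, 1, "models/3LangId.LMClassifier");
        System.out.println("Input: " + inputPath + ", model: " + classifierPath);

        // Toy stand-in for the deserialized classifier: a keyword check, not a language model.
        Function<CharSequence, String> toyClassifier =
            text -> text.toString().contains("QUIERO") ? "spanish" : "english";

        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"", "", "", "When all else fails #Disney"});
        rows.add(new String[] {"", "", "", "ES INSUPERABLE DISNEY !! QUIERO VOLVER:("});

        List<String> labels = classifyAll(rows, toyClassifier);
        for (int i = 0; i < rows.size(); i++) {
            System.out.println("InputText: " + rows.get(i)[TEXT_OFFSET]);
            System.out.println("Best Classified Language: " + labels.get(i));
        }
    }
}
```

In the real recipe, the Function argument is replaced by the deserialized BaseClassifier and the label comes from bestCategory().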

The preceding code builds on the previous recipes and introduces nothing particularly new. Util.readCsvRemoveHeader, shown as follows, reads the .csv file from disk, skips the first (header) line, and returns only the rows that have a non-null, non-empty string in the TEXT_OFFSET position:

public static List<String[]> readCsvRemoveHeader(File file) throws IOException {
  FileInputStream fileIn = new FileInputStream(file);
  InputStreamReader inputStreamReader = new InputStreamReader(fileIn,Strings.UTF8);
  CSVReader csvReader = new CSVReader(inputStreamReader);
  csvReader.readNext();  //skip headers
  List<String[]> rows = new ArrayList<String[]>();
  String[] row;
  while ((row = csvReader.readNext()) != null) {
    if (row[TEXT_OFFSET] == null || row[TEXT_OFFSET].equals("")) {
      continue;
    }
    rows.add(row);
  }
  csvReader.close();
  return rows;
}
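
The header-skipping and filtering rule can be checked in isolation on rows that are already parsed. What follows is a hedged sketch rather than the cookbook's code: the class CsvRowFilterSketch and the method dropHeaderAndEmptyText are hypothetical names, and the guard against rows that are too short to contain the text column is an added assumption that the original method does not make.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; mirrors the filtering rule of readCsvRemoveHeader on in-memory rows.
class CsvRowFilterSketch {

    // Drops the first row (the header) and every row whose text column is
    // missing, null, or empty. The length check is an extra safeguard that
    // is not in the original method.
    static List<String[]> dropHeaderAndEmptyText(List<String[]> parsed, int textOffset) {
        List<String[]> kept = new ArrayList<>();
        for (int i = 1; i < parsed.size(); i++) { // index 0 is the header row
            String[] row = parsed.get(i);
            if (row.length <= textOffset) {
                continue; // row too short to hold a text column
            }
            String text = row[textOffset];
            if (text == null || text.isEmpty()) {
                continue; // nothing to classify
            }
            kept.add(row);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String[]> parsed = new ArrayList<>();
        parsed.add(new String[] {"SCORE", "GUESS", "TRUTH", "TEXT"}); // made-up header
        parsed.add(new String[] {"", "", "", "When all else fails #Disney"});
        parsed.add(new String[] {"", "", "", ""});                    // empty text
        parsed.add(new String[] {"", ""});                            // short row

        List<String[]> kept = dropHeaderAndEmptyText(parsed, 3);
        System.out.println("Rows kept: " + kept.size());
    }
}
```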