官术网_书友最值得收藏!

Applying a classifier to a .csv file

Now, we can test our language ID classifier on the data we downloaded from Twitter. This recipe will show you how to run the classifier on the .csv file and will set the stage for the evaluation step in the next recipe.

How to do it...

Applying a classifier to the .csv file is straightforward! Just perform the following steps:

  1. Get a command prompt and run:
    java -cp lingpipe-cookbook.1.0.jar:lib/lingpipe-4.1.0.jar:lib/twitter4j-core-4.0.1.jar:lib/opencsv-2.4.jar com.lingpipe.cookbook.chapter1.ReadClassifierRunOnCsv
    
  2. This will use the default CSV file from the data/disney.csv distribution, run over each line of the CSV file, and apply a language ID classifier from models/ 3LangId.LMClassifier to it:
    InputText: When all else fails #Disney
    Best Classified Language: english
    InputText: ES INSUPERABLE DISNEY !! QUIERO VOLVER:(
    Best Classified Language: Spanish
    
  3. You can also specify the input as the first argument and the classifier as the second one.

How it works…

We will deserialize a classifier from the externalized model that was described in the previous recipes. Then, we will iterate through each line of the .csv file and call the classify method of the classifier. The code in main() is:

String inputPath = args.length > 0 ? args[0] : "data/disney.csv";
String classifierPath = args.length > 1 ? args[1] : "models/3LangId.LMClassifier";
@SuppressWarnings("unchecked") BaseClassifier<CharSequence> classifier = (BaseClassifier<CharSequence>) AbstractExternalizable.readObject(new File(classifierPath));
List<String[]> lines = Util.readCsvRemoveHeader(new File(inputPath));
for(String [] line: lines) {
  String text = line[Util.TEXT_OFFSET];
  Classification classified = classifier.classify(text);
  System.out.println("InputText: " + text);
  System.out.println("Best Classified Language: " + classified.bestCategory());
}

The preceding code builds on the previous recipes with nothing particularly new. Util.readCsvRemoveHeader, shown as follows, just skips the first line of the .csv file before reading from disk and returning the rows that have non-null values and non-empty strings in the TEXT_OFFSET position:

public static List<String[]> readCsvRemoveHeader(File file) throws IOException {
  FileInputStream fileIn = new FileInputStream(file);
  InputStreamReader inputStreamReader = new InputStreamReader(fileIn,Strings.UTF8);
  CSVReader csvReader = new CSVReader(inputStreamReader);
  csvReader.readNext();  //skip headers
  List<String[]> rows = new ArrayList<String[]>();
  String[] row;
  while ((row = csvReader.readNext()) != null) {
    if (row[TEXT_OFFSET] == null || row[TEXT_OFFSET].equals("")) {
      continue;
    }
    rows.add(row);
  }
  csvReader.close();
  return rows;
}
主站蜘蛛池模板: 内黄县| 乐陵市| 渑池县| 福州市| 淳化县| 乌鲁木齐市| 象山县| 安阳市| 丹东市| 枣强县| 定西市| 刚察县| 剑阁县| 梅州市| 上虞市| 丰都县| 宜城市| 修武县| 汝城县| 锡林郭勒盟| 阿图什市| 吐鲁番市| 辉县市| 汉川市| 仁寿县| 长葛市| 舞阳县| 广饶县| 彭阳县| 赞皇县| 宁津县| 遂溪县| 都兰县| 托里县| 黎城县| 临武县| 金川县| 通州区| 通渭县| 临桂县| 鱼台县|