Ingesting the data

Now, without further ado, let's write some code to ingest the data. First, we need a data structure that represents a training example:

// Example is a tuple representing a classification example
type Example struct {
	Document []string
	Class
}
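
Example embeds a Class, which this chapter only introduces later. To make the struct compile on its own in the meantime, a minimal stand-in might look like the following sketch. The underlying type (byte), the constant values, and the String method are assumptions made for illustration, not the chapter's definition.

```go
package main

import "fmt"

// Class is a hypothetical stand-in for the chapter's Class type.
// The underlying type and the constant values are assumptions.
type Class byte

const (
	Ham Class = iota
	Spam
)

// String makes a Class print as a readable label.
func (c Class) String() string {
	switch c {
	case Ham:
		return "Ham"
	case Spam:
		return "Spam"
	}
	return "Unknown"
}

// Example is a tuple representing a classification example.
type Example struct {
	Document []string
	Class
}

func main() {
	ex := Example{Document: []string{"buy", "now"}, Class: Spam}
	// The embedded field is promoted, so both ex.Class and ex.String() work.
	fmt.Println(ex.Class, len(ex.Document))
}
```

Because Class is embedded rather than named, its methods are promoted to Example. That is convenient, but it also means an Example printed with fmt would use the promoted String method and show only the class label.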

The Example type exists so that we can parse our files into a list of Example values. The function that does the parsing is shown here:

func ingest(typ string) (examples []Example, err error) {
	switch typ {
	case "bare", "lemm", "lemm_stop", "stop":
	default:
		return nil, errors.Errorf("Expected only \"bare\", \"lemm\", \"lemm_stop\" or \"stop\"")
	}

	var errs errList
	start, end := 0, 11

	for i := start; i < end; i++ { // hold 30% for crossval
		matches, err := filepath.Glob(fmt.Sprintf("data/lingspam_public/%s/part%d/*.txt", typ, i))
		if err != nil {
			errs = append(errs, err)
			continue
		}

		for _, match := range matches {
			str, err := ingestOneFile(match)
			if err != nil {
				errs = append(errs, errors.WithMessage(err, match))
				continue
			}

			if strings.Contains(match, "spmsg") {
				// is spam
				examples = append(examples, Example{str, Spam})
			} else {
				// is ham
				examples = append(examples, Example{str, Ham})
			}
		}
	}
	if errs != nil {
		err = errs
	}
	return
}
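
The ingest function collects failures in an errList, a type this section does not show. The sketch below is one way such a type could look, assuming it is simply a slice of errors that implements the error interface. The Error method's formatting and the collect helper are hypothetical, and the standard library errors package is used here instead of the one that provides errors.Errorf and errors.WithMessage above.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errList is a hypothetical sketch of the type used by ingest: a slice of
// errors that itself satisfies the error interface.
type errList []error

// Error joins the messages of all collected errors, one per line.
func (e errList) Error() string {
	msgs := make([]string, 0, len(e))
	for _, err := range e {
		msgs = append(msgs, err.Error())
	}
	return strings.Join(msgs, "\n")
}

// collect is a hypothetical helper that mirrors the tail of ingest: it
// gathers failures and returns a non-nil error only if there were any.
func collect(msgs ...string) error {
	var errs errList
	for _, m := range msgs {
		errs = append(errs, errors.New(m))
	}
	if errs != nil {
		return errs
	}
	return nil
}

func main() {
	fmt.Println(collect() == nil)
	fmt.Println(collect("part1/a.txt: unreadable", "part2/b.txt: unreadable"))
}
```

The nil check before assigning matters. Converting a nil errList directly to an error produces a non-nil interface value, so callers testing err != nil would see a failure even when nothing went wrong.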

In ingest, I used filepath.Glob to find the files that match a pattern within a specific directory, which is hardcoded. It doesn't have to be hardcoded in your actual code, but hardcoding the path makes for simpler demo programs. For each matching filename, we parse the file using the ingestOneFile function. Then we check whether the matched path contains spmsg, the filename prefix that marks a spam message. If it does, we create an Example that has Spam as its class; otherwise, the example is marked as Ham. In the later sections of this chapter, I will walk through the Class type and the rationale for choosing it. For now, here's the ingestOneFile function. Take note of its simplicity:

func ingestOneFile(abspath string) ([]string, error) {
	bs, err := ioutil.ReadFile(abspath)
	if err != nil {
		return nil, err
	}
	return strings.Split(string(bs), " "), nil
}
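
To see what this simple tokenization produces, the following sketch compares strings.Split on a single space, as ingestOneFile uses, with strings.Fields, which splits on any run of whitespace. The helper names and the sample text are made up for illustration, and strings.Fields is shown only as a possible alternative, not as something the chapter uses.

```go
package main

import (
	"fmt"
	"strings"
)

// splitOnSpace tokenizes the way ingestOneFile does: on single spaces only.
func splitOnSpace(s string) []string {
	return strings.Split(s, " ")
}

// splitOnWhitespace is an alternative (not from the original text) that
// splits on any run of whitespace, including newlines and tabs.
func splitOnWhitespace(s string) []string {
	return strings.Fields(s)
}

func main() {
	// Hypothetical sample with a double space and a newline.
	text := "Subject: free  money\nclick here"
	fmt.Printf("%q\n", splitOnSpace(text))
	fmt.Printf("%q\n", splitOnWhitespace(text))
}
```

Splitting on a single space keeps newlines inside tokens and yields empty strings wherever two spaces are adjacent. Whether that matters depends on how the corpus files are formatted and on what the later processing steps do with the tokens.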