
Interacting with data in text format

Text is a great medium and a simple way to exchange information. As a quote attributed to Doug McIlroy puts it: "Write programs to handle text streams, because that is the universal interface."

In this section, we will start reading data from and writing data to text files.

Reading data from text format

Normally, the raw data logs of a system are stored in multiple text files, which can accumulate a large amount of information over time. Thankfully, it is simple to interact with these kinds of files in Python.

Pandas provides a number of functions for reading data from a text file into a DataFrame object. The simplest one is the read_csv() function. Let's start with a small example file:

$ cat example_data/ex_06-01.txt
Name,age,major_id,sex,hometown
Nam,7,1,male,hcm
Mai,11,1,female,hcm
Lan,25,3,female,hn
Hung,42,3,male,tn
Nghia,26,3,male,dn
Vinh,39,3,male,vl
Hong,28,4,female,dn

Tip

cat is a Unix shell command that prints the content of a file to the screen.

In the example file above, the columns are separated by commas and the first row is a header row containing the column names. To read the data file into a DataFrame object, we type the following command:

>>> import pandas as pd
>>> df_ex1 = pd.read_csv('example_data/ex_06-01.txt')
>>> df_ex1
    Name  age  major_id     sex hometown
0    Nam    7         1    male      hcm
1    Mai   11         1  female      hcm
2    Lan   25         3  female       hn
3   Hung   42         3    male       tn
4  Nghia   26         3    male       dn
5   Vinh   39         3    male       vl
6   Hong   28         4  female       dn

We see that the read_csv() function uses a comma as the default delimiter between columns and automatically uses the first row as the column header. To change this behavior, we can use the sep parameter to specify a different separator and set header=None if the file does not have a header row.

See the following example, in which the columns are separated by tabs:

$ cat example_data/ex_06-02.txt
Nam 7 1 male hcm
Mai 11 1 female hcm
Lan 25 3 female hn
Hung 42 3 male tn
Nghia 26 3 male dn
Vinh 39 3 male vl
Hong 28 4 female dn

>>> df_ex2 = pd.read_csv('example_data/ex_06-02.txt',
...                      sep='\t', header=None)
>>> df_ex2
       0   1  2       3    4
0    Nam   7  1    male  hcm
1    Mai  11  1  female  hcm
2    Lan  25  3  female   hn
3   Hung  42  3    male   tn
4  Nghia  26  3    male   dn
5   Vinh  39  3    male   vl
6   Hong  28  4  female   dn
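
When a file has no header row, the integer column labels shown above can be replaced with meaningful names at load time. The following is a minimal sketch that uses the names parameter of read_csv(). The column names are taken from the first example file, and an in-memory string stands in for the file on disk (an assumption made so that the snippet is self-contained):

```python
import io

import pandas as pd

# Tab-separated sample without a header row (first rows of ex_06-02.txt).
raw = (
    "Nam\t7\t1\tmale\thcm\n"
    "Mai\t11\t1\tfemale\thcm\n"
    "Lan\t25\t3\tfemale\thn\n"
)

# header=None says the data has no header row;
# names= supplies the column labels to use instead of 0, 1, 2, ...
df_named = pd.read_csv(
    io.StringIO(raw),
    sep="\t",
    header=None,
    names=["Name", "age", "major_id", "sex", "hometown"],
)
print(df_named)
```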

We can also use a specific row as the header row by setting header to the index of that row. Similarly, to use a column of the data file as the row index of the DataFrame, we set index_col to the name or position of that column. We again use the second data file, example_data/ex_06-02.txt, to illustrate this:

>>> df_ex3 = pd.read_csv('example_data/ex_06-02.txt',
...                      sep='\t', header=None,
...                      index_col=0)
>>> df_ex3
        1  2       3    4
0
Nam     7  1    male  hcm
Mai    11  1  female  hcm
Lan    25  3  female   hn
Hung   42  3    male   tn
Nghia  26  3    male   dn
Vinh   39  3    male   vl
Hong   28  4  female   dn
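
The example above covers index_col given as a position. As a complementary sketch, the snippet below selects a header row by its index and names the index column by label. The sample data, including its extra first row, is hypothetical and is not one of the book's example files:

```python
import io

import pandas as pd

# Hypothetical data: row 0 holds generic labels, row 1 holds the real header.
raw = (
    "A,B,C\n"
    "name,age,hometown\n"
    "Nam,7,hcm\n"
    "Mai,11,hcm\n"
)

# header=1 uses the second row as column names and ignores the row before it;
# index_col="name" turns the "name" column into the row index.
df_hdr = pd.read_csv(io.StringIO(raw), header=1, index_col="name")
print(df_hdr)
```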

Apart from those parameters, we still have a lot of useful ones that can help us load data files into Pandas objects more effectively. The following table shows some common parameters:
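
As an illustration only (this is a selection of read_csv() parameters chosen for this sketch, not a reproduction of the table), the snippet below combines skiprows, usecols, nrows, and na_values. The sample data and the "unknown" marker for missing values are assumptions:

```python
import io

import pandas as pd

# Hypothetical export: one comment line, then a header row and data rows.
raw = (
    "# exported from a hypothetical system\n"
    "Name,age,major_id,sex,hometown\n"
    "Nam,7,1,male,hcm\n"
    "Mai,unknown,1,female,hcm\n"
    "Lan,25,3,female,hn\n"
    "Hung,42,3,male,tn\n"
)

df_opts = pd.read_csv(
    io.StringIO(raw),
    skiprows=1,                          # skip the leading comment line
    usecols=["Name", "age", "hometown"],  # load only these columns
    nrows=3,                             # read at most three data rows
    na_values=["unknown"],               # treat "unknown" as a missing value
)
print(df_opts)
```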

Besides the read_csv() function, we also have some other parsing functions in Pandas:
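
Two such functions that exist in Pandas are read_table(), which defaults to a tab separator, and read_fwf(), which reads fixed-width columns. The sketch below is not necessarily the list the text refers to. The sample strings and the column positions passed to colspecs are assumptions made for illustration:

```python
import io

import pandas as pd

# read_table() behaves like read_csv() but uses a tab as the default separator.
tab_raw = "Nam\t7\thcm\nMai\t11\thcm\n"
df_tab = pd.read_table(io.StringIO(tab_raw), header=None)

# read_fwf() reads fixed-width columns; colspecs gives [start, end) offsets.
fwf_raw = (
    "Nam    7 hcm\n"
    "Mai   11 hcm\n"
    "Nghia 26 dn\n"
)
df_fwf = pd.read_fwf(
    io.StringIO(fwf_raw),
    colspecs=[(0, 6), (6, 8), (9, 12)],
    header=None,
)

print(df_tab)
print(df_fwf)
```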

In some situations, these functions cannot parse a data file automatically, for example, when rows contain different numbers of fields. In that case, we can open the file ourselves and iterate over a reader object provided by the csv module in the standard library:

$ cat example_data/ex_06-03.txt
Nam 7 1 male hcm
Mai 11 1 female hcm
Lan 25 3 female hn
Hung 42 3 male tn single
Nghia 26 3 male dn single
Vinh 39 3 male vl
Hong 28 4 female dn

>>> import csv
>>> with open('example_data/ex_06-03.txt', newline='') as f:
...     r = csv.reader(f, delimiter='\t')
...     for line in r:
...         print(line)
...
['Nam', '7', '1', 'male', 'hcm']
['Mai', '11', '1', 'female', 'hcm']
['Lan', '25', '3', 'female', 'hn']
['Hung', '42', '3', 'male', 'tn', 'single']
['Nghia', '26', '3', 'male', 'dn', 'single']
['Vinh', '39', '3', 'male', 'vl']
['Hong', '28', '4', 'female', 'dn']
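
Once the rows have been read manually, they can still be turned into a DataFrame. One possible approach, sketched below, pads the short rows with None so that every row has the same length. The column names, in particular "status" for the optional sixth field, are hypothetical, and an in-memory string replaces the file:

```python
import csv
import io

import pandas as pd

# A few rows in the style of ex_06-03.txt; one row has an extra field.
raw = (
    "Nam\t7\t1\tmale\thcm\n"
    "Hung\t42\t3\tmale\ttn\tsingle\n"
    "Vinh\t39\t3\tmale\tvl\n"
)

rows = list(csv.reader(io.StringIO(raw), delimiter="\t"))

# Pad every row to the length of the longest one.
width = max(len(row) for row in rows)
padded = [row + [None] * (width - len(row)) for row in rows]

# Hypothetical column names; "status" holds the optional last field.
columns = ["name", "age", "major_id", "sex", "hometown", "status"]
df_ragged = pd.DataFrame(padded, columns=columns)

# csv.reader yields strings, so numeric columns are converted explicitly.
df_ragged = df_ragged.astype({"age": int, "major_id": int})
print(df_ragged)
```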

Writing data to text format

We have seen how to load data from a text file into a Pandas data structure. Now we will learn how to export data from a data object to a text file. As the counterpart of the read_csv() function, Pandas provides the to_csv() method. Let's look at an example:

>>> df_ex3.to_csv('example_data/ex_06-02.out', sep = ';')
 

The result will look like this:

$ cat example_data/ex_06-02.out
0;1;2;3;4
Nam;7;1;male;hcm
Mai;11;1;female;hcm
Lan;25;3;female;hn
Hung;42;3;male;tn
Nghia;26;3;male;dn
Vinh;39;3;male;vl
Hong;28;4;female;dn
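
A file written this way can be read back by passing the same separator to read_csv(). The following sketch shows such a round trip. It uses a small hypothetical DataFrame and an in-memory buffer instead of a file on disk:

```python
import io

import pandas as pd

# Small hypothetical DataFrame with a named index.
df = pd.DataFrame(
    {"age": [7, 11], "hometown": ["hcm", "hcm"]},
    index=pd.Index(["Nam", "Mai"], name="name"),
)

# Write with ';' as the separator into an in-memory text buffer.
buf = io.StringIO()
df.to_csv(buf, sep=";")
text = buf.getvalue()
print(text)

# Read the text back with the same separator and restore the index.
df_back = pd.read_csv(io.StringIO(text), sep=";", index_col="name")
print(df_back)
```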
 

If we want to omit the header line or the index column when writing data out, we can set the header and index parameters to False:

>>> import sys
>>> df_ex3.to_csv(sys.stdout, sep='\t',
...               header=False, index=False)
7 1 male hcm
11 1 female hcm
25 3 female hn
42 3 male tn
26 3 male dn
39 3 male vl
28 4 female dn

We can also write a subset of the columns of the DataFrame to the file by specifying them in the columns parameter:

>>> df_ex3.to_csv(sys.stdout, columns=[3, 1, 4],
...               header=False, sep='\t')
Nam male 7 hcm
Mai female 11 hcm
Lan female 25 hn
Hung male 42 tn
Nghia male 26 dn
Vinh male 39 vl
Hong female 28 dn

With Series objects, we can use the same method to write data into text files, with mostly the same parameters as above.
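
As a brief sketch of the Series case, the snippet below writes a small hypothetical Series to an in-memory buffer. The values and the Series name are made up for illustration:

```python
import io

import pandas as pd

# Hypothetical Series of ages indexed by name.
ages = pd.Series([7, 11, 25], index=["Nam", "Mai", "Lan"], name="age")

# Series.to_csv() accepts sep and header just like DataFrame.to_csv().
buf = io.StringIO()
ages.to_csv(buf, sep="\t", header=False)

lines = buf.getvalue().splitlines()
print(lines)
```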
