- Python Data Analysis
- Ivan Idris
- 748字
- 2021-08-05 17:31:55
pandas DataFrames
A pandas DataFrame
is a data structure, which is a labeled two-dimensional object and is similar in spirit to an Excel worksheet or a relational database table. A similar concept, by the way, was invented originally in the R programming language. (For more information, refer to ways:
- From another
DataFrame
. - From a NumPy array or a composite of arrays that has a two-dimensional shape.
- Likewise, we can create a
DataFrame
out of another pandas data structure calledSeries
. We will learn aboutSeries
in the following section. - A
DataFrame
can also be produced from a file, such as a CSV file.
As an example, we will use data that can be retrieved from http://www.exploredata.net/Downloads/WHO-Data-Set. The original datafile is quite big and has many columns, so we will use an edited file instead, which only contains the first nine columns and is called WHO_first9cols.csv
; the file is in the code bundle of this book. These are the first two lines including the header:
Country,CountryID,Continent,Adolescent fertility rate (%),Adult literacy rate (%),Gross national income per capita (PPP international $),Net primary school enrolment ratio female (%),Net primary school enrolment ratio male (%),Population (in thousands) total Afghanistan,1,1,151,28,,,,26088
In the next steps, we will take a look at pandas DataFrames and its attributes:
- To kick off, load the datafile into a
DataFrame
and print it on the screen:from pandas.io.parsers import read_csv df = read_csv("WHO_first9cols.csv") print "Dataframe", df
The printout is a summary of the
DataFrame
. It is too long to be displayed entirely, so we will just grab the last few lines:57 1340 58 81021 59 833 ... [202 rows x 9 columns]
- The
DataFrame
has an attribute that holds its shape as a tuple, similar tondarray
. Query the number of rows of aDataFrame
as follows:print "Shape", df.shape print "Length", len(df)
The values we obtain comply with the printout of the preceding step:
Shape (202, 9) Length 202
- Check the column's header and data types with the other attributes:
print "Column Headers", df.columns print "Data types", df.dtypes
We receive the column headers in a special data structure:
Column Headers Index([u'Country', u'CountryID', u'Continent', u'Adolescent fertility rate (%)', u'Adult literacy rate (%)', u'Gross national income per capita (PPP international $)', u'Net primary school enrolment ratio female (%)', u'Net primary school enrolment ratio male (%)', u'Population (in thousands) total'], dtype='object')
The data types are printed as follows:
- The pandas
DataFrame
has an index, which is like the primary key of relational database tables. We can either specify the index or have pandas create it automatically. The index can be accessed with a corresponding property as follows:print "Index", df.index
An index helps us search for items quickly, just like the index in this book. The index in this case is a wrapper around an array starting at
0
, with an increment of one for each row:Index Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...], dtype='int64')
- Sometimes, we wish to iterate over the underlying data of a
DataFrame
. Iterating over column values can be inefficient if we utilize the pandas iterators. It's much better to extract the underlying NumPy arrays and work with those. The pandasDataFrame
has an attribute that can aid with this as well:print "Values", df.values
Please note that some values are designated
nan
in the output, for not a number. These values come from empty fields in the input datafile:Values [['Afghanistan' 1 1 ..., nan nan 26088.0] ['Albania' 2 2 ..., 93.0 94.0 3172.0] ['Algeria' 3 3 ..., 94.0 96.0 33351.0] ..., ['Yemen' 200 1 ..., 65.0 85.0 21732.0] ['Zambia' 201 3 ..., 94.0 90.0 11696.0] ['Zimbabwe' 202 3 ..., 88.0 87.0 13228.0]]
The code for the following example can be located in the df_demo.py
file of this book's code bundle:
from pandas.io.parsers import read_csv df = read_csv("WHO_first9cols.csv") print "Dataframe", df print "Shape", df.shape print "Length", len(df) print "Column Headers", df.columns print "Data types", df.dtypes print "Index", df.index print "Values", df.values
- AutoCAD繪圖實用速查通典
- Visualforce Development Cookbook(Second Edition)
- Google Cloud Platform Cookbook
- 三菱FX3U/5U PLC從入門到精通
- Managing Mission:Critical Domains and DNS
- 計算機應用復習與練習
- Getting Started with MariaDB
- 影視后期制作(Avid Media Composer 5.0)
- 樂高創意機器人教程(中級 下冊 10~16歲) (青少年iCAN+創新創意實踐指導叢書)
- B2B2C網上商城開發指南
- 控制系統計算機仿真
- INSTANT Autodesk Revit 2013 Customization with .NET How-to
- 大數據時代
- 人工智能趣味入門:光環板程序設計
- 工業機器人安裝與調試