3. Data Preparation¶
3.1. Data Preparation with Pandas¶
This chapter shows you how to use the pandas package to import and preprocess data. Preprocessing is the process of cleaning and transforming raw data into a standard, normalized format before analysis. Preprocessing covers, among other things, the following aspects:
missing values
data normalization
data standardization
data binning
In this session we will deal only with missing values.
3.1.1. Importing data¶
We will utilize the Iris dataset in this tutorial, which can be downloaded from the UCI Machine Learning Repository, https://archive.ics.uci.edu/ml/datasets/iris.
In the pattern recognition literature, this is probably the best-known database. Fisher’s paper is a classic in the field and is still cited frequently. (See, for example, Duda & Hart.) The dataset contains three classes of 50 instances each, where each class refers to a species of iris plant. The three classes are
Iris Setosa
Iris Versicolour
Iris Virginica.
To begin, use the pandas library to import the data and transform it into a DataFrame.
import pandas as pd
#iris = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header=None,
# names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class'])
iris = pd.read_csv('/Users/Kaemyuijang/SCMA248/Data/iris.data', header=None, names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class'])
#url = 'https://raw.githubusercontent.com/pairote-sat/SCMA248/main/Data/iris.data'
#iris = pd.read_csv(url, header=None, names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class'])
#iris = pd.read_csv('https://raw.githubusercontent.com/pairote-sat/SCMA248/main/Data/iris.data', header=None,
# names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class'])
type(iris)
pandas.core.frame.DataFrame
Here we specify whether there is a header (header) and the variable names (using names and a list).
The resulting object iris is a pandas DataFrame.
We will print the first 10 rows and the last 10 rows of the dataset using the head(10) and tail(10) methods to get an idea of its contents.
iris.head(10)
sepal_length | sepal_width | petal_length | petal_width | class | |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
5 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
6 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
7 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa |
8 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
9 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa |
iris.tail(10)
sepal_length | sepal_width | petal_length | petal_width | class | |
---|---|---|---|---|---|
140 | 6.7 | 3.1 | 5.6 | 2.4 | Iris-virginica |
141 | 6.9 | 3.1 | 5.1 | 2.3 | Iris-virginica |
142 | 5.8 | 2.7 | 5.1 | 1.9 | Iris-virginica |
143 | 6.8 | 3.2 | 5.9 | 2.3 | Iris-virginica |
144 | 6.7 | 3.3 | 5.7 | 2.5 | Iris-virginica |
145 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
146 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
147 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
148 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
149 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica |
To get the names of the columns (the variable names), you can use the columns attribute.
iris.columns
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class'], dtype='object')
To extract the class column, you can simply use the following commands:
iris['class']
0 Iris-setosa
1 Iris-setosa
2 Iris-setosa
3 Iris-setosa
4 Iris-setosa
...
145 Iris-virginica
146 Iris-virginica
147 Iris-virginica
148 Iris-virginica
149 Iris-virginica
Name: class, Length: 150, dtype: object
The Pandas Series is a one-dimensional labeled array that may hold any type of data (integer, string, float, python objects, etc.)
type(iris['class'])
pandas.core.series.Series
iris.dtypes
sepal_length float64
sepal_width float64
petal_length float64
petal_width float64
class object
dtype: object
Tab completion for column names (as well as public attributes), iris.<TAB>, is enabled by default if you’re using Jupyter. For example, type iris. and then press the TAB key. Look for the shape attribute.
The shape attribute of a pandas DataFrame stores the number of rows and columns as a tuple (number of rows, number of columns).
iris.shape
(150, 5)
iris.info
<bound method DataFrame.info of sepal_length sepal_width petal_length petal_width class
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
.. ... ... ... ... ...
145 6.7 3.0 5.2 2.3 Iris-virginica
146 6.3 2.5 5.0 1.9 Iris-virginica
147 6.5 3.0 5.2 2.0 Iris-virginica
148 6.2 3.4 5.4 2.3 Iris-virginica
149 5.9 3.0 5.1 1.8 Iris-virginica
[150 rows x 5 columns]>
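Note that iris.info without parentheses returns the bound method itself, as shown above. Calling it prints a concise summary instead (a small sketch):
# Prints the index dtype, column dtypes, non-null counts and memory usage
iris.info()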
Before moving on, we create a small DataFrame with a date column stored as strings. We will save it to a CSV file and reuse it later in this chapter.
import pandas as pd
values = {'dates': ['20210305','20210316','20210328'],
'status': ['Opened','Opened','Closed']
}
demo = pd.DataFrame(values)
demo
dates | status | |
---|---|---|
0 | 20210305 | Opened |
1 | 20210316 | Opened |
2 | 20210328 | Closed |
demo['dates'] = pd.to_datetime(demo['dates'], format='%Y%m%d')
demo
dates | status | |
---|---|---|
0 | 2021-03-05 | Opened |
1 | 2021-03-16 | Opened |
2 | 2021-03-28 | Closed |
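Once the dates column has the datetime64 dtype, the .dt accessor exposes its components (a small sketch):
# Extract the year and the day of the month from the parsed dates
print(demo['dates'].dt.year)
print(demo['dates'].dt.day)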
demo.to_csv('demo_df.csv')
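Note that to_csv writes the DataFrame index as an unnamed first column by default, which is exactly the situation we deal with in Section 3.3.1. If you do not want the index in the file, you can pass index=False (a small sketch; the filename below is hypothetical):
# Write the CSV without the index column
demo.to_csv('demo_df_noindex.csv', index=False)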
3.2. Data Selection¶
We’ll concentrate on how to slice, dice, and retrieve and set subsets of pandas objects in general. Because Series and DataFrame have received greater development attention in this area, they will be the key focus.
The axis labeling information in pandas objects is useful for a variety of reasons:
Data is identified (metadata is provided) using established indicators, which is useful for analysis, visualization, and interactive console display.
Allows for both implicit and explicit data alignment.
Allows you to access and set subsets of the data set in an intuitive way.
Three different forms of multi-axis indexing are currently supported by pandas.
The indexing operators [] and the attribute operator . in Python and NumPy offer quick and easy access to pandas data structures in a variety of situations.
.loc is primarily label-based, but it can also be used with a boolean array. .loc will raise a KeyError when the items are not found.
.iloc is primarily integer-position based (from 0 to length-1 of the axis), but it can also be used with a boolean array. Except for slice indexers, which allow out-of-bounds indexing, .iloc will raise an IndexError if a requested indexer is out-of-bounds. (This is in line with Python/NumPy slice semantics.)
In this section we will use the Iris dataset. First we obtain the row and column names by using the following commands:
print(iris.index)
print(iris.columns)
RangeIndex(start=0, stop=150, step=1)
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class'], dtype='object')
iris.head()
sepal_length | sepal_width | petal_length | petal_width | class | |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
3.2.1. The indexing operators []¶
To begin, simply indicate the column and the row (via its index) you’re interested in.
You can use the following command to get the sepal width of the fifth row (index 4):
iris['sepal_width'][4]
3.6
# Not working
# iris[4]['sepal_width']
Note: Be careful, because this is not a matrix, so you might be tempted to specify the row first and then the column. Remember that it’s a pandas DataFrame: the [] operator selects a column first, and then an element of the resulting pandas Series.
Retrieving a sub-matrix is straightforward: specify lists (or slices) of indexes rather than scalars.
iris['sepal_width'][0:4]
0 3.5
1 3.0
2 3.2
3 3.1
Name: sepal_width, dtype: float64
iris[['petal_width','sepal_width']][0:4]
petal_width | sepal_width | |
---|---|---|
0 | 0.2 | 3.5 |
1 | 0.2 | 3.0 |
2 | 0.2 | 3.2 |
3 | 0.2 | 3.1 |
iris['sepal_width'][range(4)]
0 3.5
1 3.0
2 3.2
3 3.1
Name: sepal_width, dtype: float64
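The [] operator also accepts a boolean Series, which keeps the rows where the condition is True (a small sketch, not part of the original examples):
# Select the rows whose sepal width exceeds 4.0 using a boolean mask
wide_sepals = iris[iris['sepal_width'] > 4.0]
print(wide_sepals)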
3.2.2. .loc()¶
You can use the .loc indexer (with square brackets) to get something similar to the other, matrix-like approach of obtaining data.
iris.loc[4,'sepal_width']
3.6
# rows at index labels between 0 and 4 (inclusive)
# See https://stackoverflow.com/questions/31593201/how-are-iloc-and-loc-different
iris.loc[0:4,'sepal_width']
0 3.5
1 3.0
2 3.2
3 3.1
4 3.6
Name: sepal_width, dtype: float64
iris.loc[range(4),['petal_width','sepal_width']]
petal_width | sepal_width | |
---|---|---|
0 | 0.2 | 3.5 |
1 | 0.2 | 3.0 |
2 | 0.2 | 3.2 |
3 | 0.2 | 3.1 |
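Label-based selection combines naturally with a boolean mask; for example (a sketch), the sepal measurements of the Iris-setosa rows only:
# Rows where class equals 'Iris-setosa', restricted to two columns
iris.loc[iris['class'] == 'Iris-setosa', ['sepal_length', 'sepal_width']].head()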
3.2.3. .iloc()¶
Finally, there is .iloc, a purely position-based indexer (as in a matrix). It requires you to specify the cell using the row and column numbers.
iris.iloc[4,1]
3.6
The following commands produce output similar to iris.loc[0:4,'sepal_width'] and identical to iris.loc[range(4),['petal_width','sepal_width']]. Note that .iloc slices exclude the end point, so iris.iloc[0:4,1] returns only the rows at positions 0 to 3, whereas iris.loc[0:4,'sepal_width'] also includes the row labeled 4.
# rows at index locations between 0 and 4 (exclusive)
# See https://stackoverflow.com/questions/31593201/how-are-iloc-and-loc-different
iris.iloc[0:4,1]
0 3.5
1 3.0
2 3.2
3 3.1
Name: sepal_width, dtype: float64
iris.iloc[range(4),[3,1]]
petal_width | sepal_width | |
---|---|---|
0 | 0.2 | 3.5 |
1 | 0.2 | 3.0 |
2 | 0.2 | 3.2 |
3 | 0.2 | 3.1 |
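Position-based selection also accepts negative positions, counting from the end (a small sketch):
# The last five rows and the first two columns
iris.iloc[-5:, 0:2]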
Note: .loc, .iloc, and also [] indexing can accept a callable as indexer, as illustrated in the following examples. The callable must be a function with one argument (the calling Series or DataFrame) that returns valid output for indexing.
iris.loc[:,lambda df: ['petal_length','sepal_length']]
petal_length | sepal_length | |
---|---|---|
0 | 1.4 | 5.1 |
1 | 1.4 | 4.9 |
2 | 1.3 | 4.7 |
3 | 1.5 | 4.6 |
4 | 1.4 | 5.0 |
... | ... | ... |
145 | 5.2 | 6.7 |
146 | 5.0 | 6.3 |
147 | 5.2 | 6.5 |
148 | 5.4 | 6.2 |
149 | 5.1 | 5.9 |
150 rows × 2 columns
iris.loc[lambda df: df['sepal_width'] > 3.5, :]
sepal_length | sepal_width | petal_length | petal_width | class | |
---|---|---|---|---|---|
4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
5 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
10 | 5.4 | 3.7 | 1.5 | 0.2 | Iris-setosa |
14 | 5.8 | 4.0 | 1.2 | 0.2 | Iris-setosa |
15 | 5.7 | 4.4 | 1.5 | 0.4 | Iris-setosa |
16 | 5.4 | 3.9 | 1.3 | 0.4 | Iris-setosa |
18 | 5.7 | 3.8 | 1.7 | 0.3 | Iris-setosa |
19 | 5.1 | 3.8 | 1.5 | 0.3 | Iris-setosa |
21 | 5.1 | 3.7 | 1.5 | 0.4 | Iris-setosa |
22 | 4.6 | 3.6 | 1.0 | 0.2 | Iris-setosa |
32 | 5.2 | 4.1 | 1.5 | 0.1 | Iris-setosa |
33 | 5.5 | 4.2 | 1.4 | 0.2 | Iris-setosa |
44 | 5.1 | 3.8 | 1.9 | 0.4 | Iris-setosa |
46 | 5.1 | 3.8 | 1.6 | 0.2 | Iris-setosa |
48 | 5.3 | 3.7 | 1.5 | 0.2 | Iris-setosa |
109 | 7.2 | 3.6 | 6.1 | 2.5 | Iris-virginica |
117 | 7.7 | 3.8 | 6.7 | 2.2 | Iris-virginica |
131 | 7.9 | 3.8 | 6.4 | 2.0 | Iris-virginica |
3.3. Dealing with Problematic Data¶
3.3.1. Problem in setting index in pandas DataFrame¶
It may happen that the dataset contains an index column. How do we import it correctly with pandas?
We will use a very simple dataset, namely demo_df.csv (the file can be downloaded from my GitHub repository), that contains an index column (this is just a counter, not a feature).
import pandas as pd
# How to read CSV file from GitHub using pandas
# https://stackoverflow.com/questions/55240330/how-to-read-csv-file-from-github-using-pandas
# url = 'https://raw.githubusercontent.com/pairote-sat/SCMA248/main/demo_df'
url = 'https://raw.githubusercontent.com/pairote-sat/SCMA248/main/Data/demo_df.csv'
df1 = pd.read_csv(url)
print(df1.head())
df1.columns
---------------------------------------------------------------------------
SSLCertVerificationError                  Traceback (most recent call last)
...
URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1091)>
## Uncomment these commands if the CSV dataset is stored locally.
# df1 = pd.read_csv('/Users/Kaemyuijang/SCMA248/demo_df.csv')
# print(df1.head())
# df1.columns
We want to specify that ‘Unnamed: 0’ is the index column while loading this dataset, using the parameter index_col in the following command:
df1 = pd.read_csv(url, index_col = 0)
df1.head()
dates | status | |
---|---|---|
0 | 2021-03-05 | Opened |
1 | 2021-03-16 | Opened |
2 | 2021-03-28 | Closed |
The dataset is loaded and the index is correct after performing the command.
3.3.2. Convert Strings to Datetime¶
However, we see an issue right away: all of the data, including the dates, has been parsed as integers (or, in other cases, as strings). If the dates do not have a particularly unusual format, you can use pandas’ autodetection routines to identify the column that contains the date data. This works nicely with the following arguments when the data file is stored locally.
# df2 = pd.read_csv('/Users/Kaemyuijang/SCMA248/demo_df.csv', index_col = 0, parse_dates = ['dates'])
# print(df2.head())
# df2.dtypes
For the same dataset downloaded from Github, if a column or index contains an unparseable date, the entire column or index will be returned unaltered as an object data type. For non-standard datetime parsing, use pd.to_datetime after pd.read_csv:
pd.to_datetime(df['DataFrame Column'], format=specify your format)
Remember that the date format for our example is yyyymmdd, which is represented as format = '%Y%m%d'.
See https://datatofish.com/strings-to-datetime-pandas/ for more details.
df2 = pd.read_csv(url, index_col = 0, parse_dates = ['dates'])
print(df2)
df2.dtypes
dates status
0 2021-03-05 Opened
1 2021-03-16 Opened
2 2021-03-28 Closed
dates datetime64[ns]
status object
dtype: object
df2['dates'] = pd.to_datetime(df2['dates'], format='%Y%m%d')
print(df2)
df2.dtypes
dates status
0 2021-03-05 Opened
1 2021-03-16 Opened
2 2021-03-28 Closed
dates datetime64[ns]
status object
dtype: object
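To see the format argument in action on raw strings (a minimal sketch using a made-up Series, not part of the original dataset):
# Parse yyyymmdd strings explicitly
raw = pd.Series(['20210305', '20210316', '20210328'])
parsed = pd.to_datetime(raw, format='%Y%m%d')
print(parsed)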
3.3.3. Missing values¶
We will concentrate on missing values, which is perhaps the most challenging data cleaning operation.
It’s a good idea to have an overall sense of a data set before you start cleaning it. After that, you can develop a plan for cleaning the data.
To begin, I like to ask the following questions:
What are the features?
What sorts of data are required (int, float, text, boolean)?
Is there any evident data missing (values that Pandas can detect)?
Is there any other type of missing data that isn’t as clear (and that Pandas can’t easily detect)?
Let’s have a look at an example using a small sample dataset, namely property_data.csv. The file can be obtained from GitHub: https://raw.githubusercontent.com/pairote-sat/SCMA248/main/property_data.csv.
In what follows, we also specify that ‘PID’ (personal identifier) is the index column while loading this dataset (with the parameter index_col):
url = 'https://raw.githubusercontent.com/pairote-sat/SCMA248/main/Data/property_data.csv'
df = pd.read_csv(url, index_col = 0)
df
ST_NUM | ST_NAME | OWN_OCCUPIED | NUM_BEDROOMS | NUM_BATH | SQ_FT | |
---|---|---|---|---|---|---|
PID | ||||||
10101.0 | 104.0 | Khao San | Y | 3 | 1 | 1000 |
10102.0 | 197.0 | Silom | N | 3 | 1.5 | -- |
10103.0 | NaN | Silom | N | NaN | 1 | 850 |
10104.0 | 201.0 | Sukhumvit | 12 | 1 | NaN | 700 |
NaN | 203.0 | Sukhumvit | Y | 3 | 2 | 1600 |
10106.0 | 207.0 | Sukhumvit | Y | NaN | 1 | 800 |
10107.0 | NaN | Thonglor | NaN | 2 | HURLEY | 950 |
10108.0 | 213.0 | Rama 1 | Y | -- | 1 | NaN |
10109.0 | 215.0 | Rama 1 | Y | na | 2 | 1800 |
We notice that the PID (personal identifier) index contains a missing value, i.e. NaN (not a number). We will replace this missing PID with 10105 and also convert the index from floats to integers.
rowindex = df.index.tolist()
rowindex[4] = 10105.0
rowindex = [int(i) for i in rowindex]
df.index = rowindex
print(df.loc[:,'ST_NUM'])
10101 104.0
10102 197.0
10103 NaN
10104 201.0
10105 203.0
10106 207.0
10107 NaN
10108 213.0
10109 215.0
Name: ST_NUM, dtype: float64
Alternatively, one can use Numpy to produce the same result. Simply run the following commands. Here we use .astype()
method to convert the type of an array.
df = pd.read_csv(url, index_col = 0)
df
import numpy as np
rowindex = df.index.to_numpy()
rowindex[4] = 10105.0
df.index = rowindex.astype(int)
print(df.loc[:,'ST_NUM'])
10101 104.0
10102 197.0
10103 NaN
10104 201.0
10105 203.0
10106 207.0
10107 NaN
10108 213.0
10109 215.0
Name: ST_NUM, dtype: float64
Now I can answer my first question: what are the features? The following features can be obtained from the column names:
ST_NUM is the street number
ST_NAME is the street name
OWN_OCCUPIED: Is the residence owner occupied?
NUM_BEDROOMS: the number of bedrooms
We can also respond to the question, What are the expected types?
ST_NUM is either a float or an int… a numeric type of some sort
ST_NAME is a string variable.
OWN_OCCUPIED: string; N (“No”) or Y (“Yes”)
NUM_BEDROOMS is a numeric type that can be either float or int.
3.3.4. Standard Missing Values¶
So, what exactly do I mean when I say “standard missing values?” These are missing values that Pandas can detect.
Let’s return to our initial dataset and examine the “Street Number” column.
There is an empty cell in the third row (of the original file), and the value “NA” appears in the seventh row.
Both of these values are clearly missing. Let’s see how pandas handles these situations: pandas fills the blank cells with NaN.
We can confirm that both the empty cell and “NA” were detected as missing values using the isnull() method; both return True.
df['ST_NUM'].isnull()
10101 False
10102 False
10103 True
10104 False
10105 False
10106 False
10107 True
10108 False
10109 False
Name: ST_NUM, dtype: bool
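A quick way to get a per-column count of the missing values that pandas detected is isnull() followed by sum() (a small sketch):
# Number of detected missing values in each column
df.isnull().sum()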
Similarly, in the NUM_BEDROOMS column of the original CSV file, users manually entered missing values under different names, “n/a” and “NA”. Pandas also recognized these as missing values and filled them with NaN.
df['NUM_BEDROOMS']
10101 3
10102 3
10103 NaN
10104 1
10105 3
10106 NaN
10107 2
10108 --
10109 na
Name: NUM_BEDROOMS, dtype: object
3.3.5. Missing Values That Aren’t Standard¶
In some cases, missing values appear in different, non-standard formats.
This column contains two other missing values in different formats:
na
--
A simple approach to detect them is to put these formats in a list; pandas will then recognize them immediately when we import the data. Here’s an example of how we might go about it.
# Making a list of missing value types
missing_values = ["na", "--"]
df = pd.read_csv(url, index_col = 0, na_values = missing_values)
df
ST_NUM | ST_NAME | OWN_OCCUPIED | NUM_BEDROOMS | NUM_BATH | SQ_FT | |
---|---|---|---|---|---|---|
PID | ||||||
10101.0 | 104.0 | Khao San | Y | 3.0 | 1 | 1000.0 |
10102.0 | 197.0 | Silom | N | 3.0 | 1.5 | NaN |
10103.0 | NaN | Silom | N | NaN | 1 | 850.0 |
10104.0 | 201.0 | Sukhumvit | 12 | 1.0 | NaN | 700.0 |
NaN | 203.0 | Sukhumvit | Y | 3.0 | 2 | 1600.0 |
10106.0 | 207.0 | Sukhumvit | Y | NaN | 1 | 800.0 |
10107.0 | NaN | Thonglor | NaN | 2.0 | HURLEY | 950.0 |
10108.0 | 213.0 | Rama 1 | Y | NaN | 1 | NaN |
10109.0 | 215.0 | Rama 1 | Y | NaN | 2 | 1800.0 |
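With the extra markers declared in na_values, columns such as NUM_BEDROOMS and SQ_FT are now parsed as numeric and the missing-value counts reflect every marker; a quick check (a sketch):
# Column dtypes and per-column missing-value counts after supplying na_values
print(df.dtypes)
print(df.isnull().sum())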
3.3.6. Unexpected Missing Values¶
We have observed both standard and non-standard missing data so far. What if we have a type that is not expected?
For instance, if our feature is supposed to be a string but it’s a numeric type, it’s technically a missing value.
Take a look at the column labeled “OWN_OCCUPIED” to understand what I’m talking about.
df['OWN_OCCUPIED']
PID
10101.0 Y
10102.0 N
10103.0 N
10104.0 12
NaN Y
10106.0 Y
10107.0 NaN
10108.0 Y
10109.0 Y
Name: OWN_OCCUPIED, dtype: object
We know Pandas will recognize the empty cell in row seven as a missing value because of our prior examples.
The number 12 appears in the fourth row. This entry should be treated as a missing value because the value of OWN_OCCUPIED should clearly be a string (Y or N). Because this example is a little more complicated, we need a plan for identifying such missing values. There are a few alternative routes to take, but this is how I’m going to go about it:
Loop through the OWN_OCCUPIED column.
Try to convert each entry to an integer.
If the entry can be converted to an integer, replace it with a missing value.
If it cannot be converted to an integer, we know it is a string (as expected), so we leave it alone.
df = pd.read_csv(url, index_col = 0)
df
import numpy as np
rowindex = df.index.to_numpy()
rowindex[4] = 10105.0
df.index = rowindex.astype(int)
# Detecting numbers
cnt=10101
for row in df['OWN_OCCUPIED']:
try:
int(row)
df.loc[cnt, 'OWN_OCCUPIED']=np.nan
except ValueError:
pass
cnt+=1
df['OWN_OCCUPIED']
10101 Y
10102 N
10103 N
10104 NaN
10105 Y
10106 Y
10107 NaN
10108 Y
10109 Y
Name: OWN_OCCUPIED, dtype: object
In the code, we loop through each entry in the “Owner Occupied” column. To try to change the entry to an integer, we use int(row). If the value can be changed to an integer, we change the entry to a missing value using np.nan from Numpy. On the other hand, if the value cannot be changed to an integer, we pass it and continue.
You will notice that I have used try and except ValueError. This is called exception handling, and we use this to handle errors.
If we tried to change an entry to an integer and it could not be changed, a ValueError would be raised and the code would terminate. To deal with this, we use exception handling to catch these errors and continue.
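As an aside, a vectorized alternative (a sketch, not the approach used above) is pd.to_numeric with errors='coerce', which flags the entries that can be parsed as numbers:
# Entries of OWN_OCCUPIED that parse as numbers are invalid (the column should be Y/N),
# so mark them as missing
numeric_mask = pd.to_numeric(df['OWN_OCCUPIED'], errors='coerce').notnull()
df.loc[numeric_mask, 'OWN_OCCUPIED'] = np.nan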
3.4. Data Manipulation¶
Now that we have learned how to select the data we want, we need to learn how to manipulate it. One of the most straightforward things we can do is apply aggregation methods to columns or rows.
All of these functions always return a number when applied to a row or column.
We can specify whether the function should be applied to
the rows for each column, using the axis=0 keyword in the function call, or
the columns for each row, using the axis=1 keyword in the function call.
Function | Description |
---|---|
df.describe() | Returns summary statistics of the numerical columns |
df.mean() | Returns the average of all columns in a dataset |
df.corr() | Returns the correlation between columns in a DataFrame |
df.count() | Returns the number of non-null values in each DataFrame column |
df.max() | Returns the highest value in each column |
df.min() | Returns the lowest value in each column |
df.median() | Returns the median of each column |
df.std() | Returns the standard deviation of each column |
iris = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header=None,
names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class'])
type(iris)
pandas.core.frame.DataFrame
iris.describe()
sepal_length | sepal_width | petal_length | petal_width | |
---|---|---|---|---|
count | 150.000000 | 150.000000 | 150.000000 | 150.000000 |
mean | 5.843333 | 3.054000 | 3.758667 | 1.198667 |
std | 0.828066 | 0.433594 | 1.764420 | 0.763161 |
min | 4.300000 | 2.000000 | 1.000000 | 0.100000 |
25% | 5.100000 | 2.800000 | 1.600000 | 0.300000 |
50% | 5.800000 | 3.000000 | 4.350000 | 1.300000 |
75% | 6.400000 | 3.300000 | 5.100000 | 1.800000 |
max | 7.900000 | 4.400000 | 6.900000 | 2.500000 |
For all numerical features, you get the number of observations, their mean, standard deviation, minimum and maximum values, and certain percentiles (25%, 50%, and 75%). This gives you a fair picture of how each feature is distributed.
The following command illustrates how to use the max() function.
iris.max(axis = 0)
sepal_length 7.9
sepal_width 4.4
petal_length 6.9
petal_width 2.5
class Iris-virginica
dtype: object
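To apply a function across each row instead of each column, pass axis=1; for example (a sketch), the row-wise mean of the four numeric columns:
# Mean of the four measurements for each flower (axis=1 works across columns, row by row)
iris[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']].mean(axis=1).head()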
We can perform operations on all values in rows, columns, or a subset of both.
The following example shows how to find the standardized values of each column of the Iris dataset. We first need to drop the class column, which is a categorical variable, using the drop() function with axis = 1.
Recall that Z-scores and standardized values (sometimes known as standard scores or normal deviates) are the same thing. When you take a data point and scale it by population data, you get a standardized value. It tells us how far the value is from the mean in terms of standard deviations.
iris_drop = iris.drop('class', axis = 1)
print(iris_drop.mean())
print(iris_drop.std())
sepal_length 5.843333
sepal_width 3.054000
petal_length 3.758667
petal_width 1.198667
dtype: float64
sepal_length 0.828066
sepal_width 0.433594
petal_length 1.764420
petal_width 0.763161
dtype: float64
def z_score_standardization(series):
return (series - series.mean()) / series.std(ddof = 1)
#iris_normalized = iris_drop
#for col in iris_normalized.columns:
# iris_normalized[col] = z_score_standardization(iris_normalized[col])
iris_normalized = {}
for col in iris_drop.columns:
iris_normalized[col] = z_score_standardization(iris_drop[col])
iris_normalized = pd.DataFrame(iris_normalized)
iris_normalized.head()
sepal_length | sepal_width | petal_length | petal_width | |
---|---|---|---|---|
0 | -0.897674 | 1.028611 | -1.336794 | -1.308593 |
1 | -1.139200 | -0.124540 | -1.336794 | -1.308593 |
2 | -1.380727 | 0.336720 | -1.393470 | -1.308593 |
3 | -1.501490 | 0.106090 | -1.280118 | -1.308593 |
4 | -1.018437 | 1.259242 | -1.336794 | -1.308593 |
Alternatively, the following commands produce the same result.
del(iris_normalized)
iris_normalized = (iris_drop - iris_drop.mean())/iris_drop.std(ddof=1)
iris_normalized.head()
sepal_length | sepal_width | petal_length | petal_width | |
---|---|---|---|---|
0 | -0.897674 | 1.028611 | -1.336794 | -1.308593 |
1 | -1.139200 | -0.124540 | -1.336794 | -1.308593 |
2 | -1.380727 | 0.336720 | -1.393470 | -1.308593 |
3 | -1.501490 | 0.106090 | -1.280118 | -1.308593 |
4 | -1.018437 | 1.259242 | -1.336794 | -1.308593 |
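As a quick sanity check (a sketch), the standardized columns should now have mean approximately 0 and sample standard deviation 1:
# Each standardized column should have mean ~0 and std ~1
print(iris_normalized.mean().round(6))
print(iris_normalized.std(ddof=1).round(6))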
Alternatively, we can standardize a Pandas Column with Z-Score Scaling using scikit-learn.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(iris_drop)
iris_scaled = scaler.fit_transform(iris_drop)
iris_scaled = pd.DataFrame(iris_scaled, columns=iris_drop.columns)
print(iris_scaled)
sepal_length sepal_width petal_length petal_width
0 -0.900681 1.032057 -1.341272 -1.312977
1 -1.143017 -0.124958 -1.341272 -1.312977
2 -1.385353 0.337848 -1.398138 -1.312977
3 -1.506521 0.106445 -1.284407 -1.312977
4 -1.021849 1.263460 -1.341272 -1.312977
.. ... ... ... ...
145 1.038005 -0.124958 0.819624 1.447956
146 0.553333 -1.281972 0.705893 0.922064
147 0.795669 -0.124958 0.819624 1.053537
148 0.432165 0.800654 0.933356 1.447956
149 0.068662 -0.124958 0.762759 0.790591
[150 rows x 4 columns]
Note that scikit-learn uses np.std, which by default computes the population standard deviation (the sum of squared deviations divided by the number of observations), while pandas uses the sample standard deviation (where the denominator is the number of observations minus 1). The degrees-of-freedom correction (ddof) is applied in order to obtain an unbiased estimate of the population variance. NumPy and scikit-learn use ddof=0 by default, whereas pandas uses ddof=1 (see the docs).
As a result, the results above differ slightly.
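The difference can be seen directly by computing the standard deviation of one column with both conventions (a small sketch):
import numpy as np
print(iris_drop['sepal_length'].std())                # pandas default: ddof=1 (sample std)
print(iris_drop['sepal_length'].std(ddof=0))          # population std
print(np.std(iris_drop['sepal_length'].to_numpy()))   # NumPy default ddof=0, matches the line above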
In the next example, we can simply apply any binary arithmetic operation (+, -, *, /) to an entire DataFrame.
iris_drop**2
sepal_length | sepal_width | petal_length | petal_width | |
---|---|---|---|---|
0 | 26.01 | 12.25 | 1.96 | 0.04 |
1 | 24.01 | 9.00 | 1.96 | 0.04 |
2 | 22.09 | 10.24 | 1.69 | 0.04 |
3 | 21.16 | 9.61 | 2.25 | 0.04 |
4 | 25.00 | 12.96 | 1.96 | 0.04 |
... | ... | ... | ... | ... |
145 | 44.89 | 9.00 | 27.04 | 5.29 |
146 | 39.69 | 6.25 | 25.00 | 3.61 |
147 | 42.25 | 9.00 | 27.04 | 4.00 |
148 | 38.44 | 11.56 | 29.16 | 5.29 |
149 | 34.81 | 9.00 | 26.01 | 3.24 |
150 rows × 4 columns
Any function can be applied to a DataFrame or Series by passing it as an argument to the apply method. For example, in the following code, we use the NumPy library’s floor function to return the floor of the input, element-wise, for each value in the DataFrame.
import numpy as np
iris_drop.apply(np.floor)
sepal_length | sepal_width | petal_length | petal_width | |
---|---|---|---|---|
0 | 5.0 | 3.0 | 1.0 | 0.0 |
1 | 4.0 | 3.0 | 1.0 | 0.0 |
2 | 4.0 | 3.0 | 1.0 | 0.0 |
3 | 4.0 | 3.0 | 1.0 | 0.0 |
4 | 5.0 | 3.0 | 1.0 | 0.0 |
... | ... | ... | ... | ... |
145 | 6.0 | 3.0 | 5.0 | 2.0 |
146 | 6.0 | 2.0 | 5.0 | 1.0 |
147 | 6.0 | 3.0 | 5.0 | 2.0 |
148 | 6.0 | 3.0 | 5.0 | 2.0 |
149 | 5.0 | 3.0 | 5.0 | 1.0 |
150 rows × 4 columns
If we need to design a specific function to apply it, we can write an in-line function, commonly known as a \(\lambda\)-function. A \(\lambda\)-function is a function without a name.
It is only necessary to specify the parameters it receives, between the lambda keyword and the colon (:).
In the next example, only one parameter is needed, which will be the value of each element in the DataFrame.
iris_drop.apply(lambda x: np.log10(x))
sepal_length | sepal_width | petal_length | petal_width | |
---|---|---|---|---|
0 | 0.707570 | 0.544068 | 0.146128 | -0.698970 |
1 | 0.690196 | 0.477121 | 0.146128 | -0.698970 |
2 | 0.672098 | 0.505150 | 0.113943 | -0.698970 |
3 | 0.662758 | 0.491362 | 0.176091 | -0.698970 |
4 | 0.698970 | 0.556303 | 0.146128 | -0.698970 |
... | ... | ... | ... | ... |
145 | 0.826075 | 0.477121 | 0.716003 | 0.361728 |
146 | 0.799341 | 0.397940 | 0.698970 | 0.278754 |
147 | 0.812913 | 0.477121 | 0.716003 | 0.301030 |
148 | 0.792392 | 0.531479 | 0.732394 | 0.361728 |
149 | 0.770852 | 0.477121 | 0.707570 | 0.255273 |
150 rows × 4 columns
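A \(\lambda\)-function can also be combined with axis=1 to compute a quantity per row; for example (a sketch), the range (max minus min) of the four measurements for each flower:
# Range of the four measurements, computed row by row
iris_drop.apply(lambda row: row.max() - row.min(), axis=1).head()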
3.4.1. Add A New Column To An Existing Pandas DataFrame¶
Adding new values to our DataFrame is another simple manipulation technique. This can be done directly on a DataFrame with the assignment operator (=).
For example, we can assign a Series to a column name that does not yet exist to add a new column to a DataFrame.
Be aware that the previous values will be overwritten if a column with the same name already exists.
In the following example, we create a new column entitled sepal_length_normalized containing the standardized values of the sepal_length column.
iris['sepal_length_normalized'] = (iris['sepal_length'] - iris['sepal_length'].mean()) / iris['sepal_length'].std()
iris.head()
sepal_length | sepal_width | petal_length | petal_width | class | sepal_length_normalized | |
---|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa | -0.897674 |
1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa | -1.139200 |
2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa | -1.380727 |
3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa | -1.501490 |
4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa | -1.018437 |
iris['sepal_length_normalized'] = (iris['sepal_length'] - iris['sepal_length'].mean()) / iris['sepal_length'].std(ddof=0)
iris.head()
sepal_length | sepal_width | petal_length | petal_width | class | sepal_length_normalized | |
---|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa | -0.900681 |
1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa | -1.143017 |
2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa | -1.385353 |
3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa | -1.506521 |
4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa | -1.021849 |
Alternatively, we can use the concat function to add a Series (s) to a pandas DataFrame (df) as a new column with the argument axis = 1. The name of the new column can be set by using Series.rename, as in the following command:
df = pd.concat((df, s.rename('CoolColumnName')), axis=1)
https://stackoverflow.com/questions/39047915/concat-series-onto-dataframe-with-column-name
pd.concat([iris, (iris['sepal_length']-1).rename('new_name') ], axis = 1)
sepal_length | sepal_width | petal_length | petal_width | class | sepal_length_normalized | new_name | |
---|---|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa | -0.900681 | 4.1 |
1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa | -1.143017 | 3.9 |
2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa | -1.385353 | 3.7 |
3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa | -1.506521 | 3.6 |
4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa | -1.021849 | 4.0 |
... | ... | ... | ... | ... | ... | ... | ... |
145 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica | 1.038005 | 5.7 |
146 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica | 0.553333 | 5.3 |
147 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica | 0.795669 | 5.5 |
148 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica | 0.432165 | 5.2 |
149 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica | 0.068662 | 4.9 |
150 rows × 7 columns
We can now use the drop function to delete a column from the DataFrame; it removes the indicated rows if axis=0, or the indicated columns if axis=1.
All pandas functions that change the contents of a DataFrame, such as the drop function, return a copy of the modified data rather than overwriting the DataFrame, so the original DataFrame is preserved. If you do not wish to keep the original data, set the keyword inplace to True. This keyword is set to False by default, which means that a copy of the data is returned.
The following commands remove the column, namely ‘sepal_length_normalized’, we have just added to the Iris dataset.
iris.columns
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class',
'sepal_length_normalized'],
dtype='object')
print(iris.drop('sepal_length_normalized', axis = 1).head())
print(iris.head())
sepal_length sepal_width petal_length petal_width class
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
sepal_length sepal_width petal_length petal_width class \
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
sepal_length_normalized
0 -0.900681
1 -1.143017
2 -1.385353
3 -1.506521
4 -1.021849
iris.drop('sepal_length_normalized', axis = 1, inplace = True)
print(iris.head())
sepal_length sepal_width petal_length petal_width class
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
3.4.2. Appending a row to a dataframe and specify its index label¶
In this section, we will learn how to add and remove rows and how to remove missing values. We will again use the property_data.csv dataset.
A general solution for appending a row to a DataFrame and specifying its index label is to create the row, transform the new row data into a pandas Series, name it with the index label you want, and then append it to the DataFrame. Don’t forget to overwrite the original DataFrame with the one containing the appended row. The following commands produce the required result.
See https://stackoverflow.com/questions/16824607/pandas-appending-a-row-to-a-dataframe-and-specify-its-index-label for more details.
See also https://www.geeksforgeeks.org/python-pandas-dataframe-append/ and https://thispointer.com/python-pandas-how-to-add-rows-in-a-dataframe-using-dataframe-append-loc-iloc/#3 .
import pandas as pd
filepath = '/Users/Kaemyuijang/SCMA248/Data/property_data.csv'
df = pd.read_csv(filepath, index_col = 0)
df.head()
ST_NUM | ST_NAME | OWN_OCCUPIED | NUM_BEDROOMS | NUM_BATH | SQ_FT | |
---|---|---|---|---|---|---|
PID | ||||||
10101.0 | 104.0 | Khao San | Y | 3 | 1 | 1000 |
10102.0 | 197.0 | Silom | N | 3 | 1.5 | -- |
10103.0 | NaN | Silom | N | NaN | 1 | 850 |
10104.0 | 201.0 | Sukhumvit | 12 | 1 | NaN | 700 |
NaN | 203.0 | Sukhumvit | Y | 3 | 2 | 1600 |
url = 'https://raw.githubusercontent.com/pairote-sat/SCMA248/main/Data/property_data.csv'
df = pd.read_csv(url, index_col = 0)
df.head()
ST_NUM | ST_NAME | OWN_OCCUPIED | NUM_BEDROOMS | NUM_BATH | SQ_FT | |
---|---|---|---|---|---|---|
PID | ||||||
10101.0 | 104.0 | Khao San | Y | 3 | 1 | 1000 |
10102.0 | 197.0 | Silom | N | 3 | 1.5 | -- |
10103.0 | NaN | Silom | N | NaN | 1 | 850 |
10104.0 | 201.0 | Sukhumvit | 12 | 1 | NaN | 700 |
NaN | 203.0 | Sukhumvit | Y | 3 | 2 | 1600 |
df.index
Float64Index([10101.0, 10102.0, 10103.0, 10104.0, nan, 10106.0, 10107.0,
10108.0, 10109.0],
dtype='float64', name='PID')
#new_observation = {'ST_NUM': 555 , 'OWN_OCCUPIED': 'Y', 'NUM_BEDROOMS': 2,'NUM_BATH': 1,'SQ_FT': 200}
new_observation = pd.Series({'ST_NUM': 555 , 'OWN_OCCUPIED': 'Y', 'NUM_BEDROOMS': 2,'NUM_BATH': 1,'SQ_FT': 200}, name = 10110.0)
new_observation
ST_NUM 555
OWN_OCCUPIED Y
NUM_BEDROOMS 2
NUM_BATH 1
SQ_FT 200
Name: 10110.0, dtype: object
df.append(new_observation)
ST_NUM | ST_NAME | OWN_OCCUPIED | NUM_BEDROOMS | NUM_BATH | SQ_FT | |
---|---|---|---|---|---|---|
PID | ||||||
10101.0 | 104.0 | Khao San | Y | 3 | 1 | 1000 |
10102.0 | 197.0 | Silom | N | 3 | 1.5 | -- |
10103.0 | NaN | Silom | N | NaN | 1 | 850 |
10104.0 | 201.0 | Sukhumvit | 12 | 1 | NaN | 700 |
NaN | 203.0 | Sukhumvit | Y | 3 | 2 | 1600 |
10106.0 | 207.0 | Sukhumvit | Y | NaN | 1 | 800 |
10107.0 | NaN | Thonglor | NaN | 2 | HURLEY | 950 |
10108.0 | 213.0 | Rama 1 | Y | -- | 1 | NaN |
10109.0 | 215.0 | Rama 1 | Y | na | 2 | 1800 |
10110.0 | 555.0 | NaN | Y | 2 | 1 | 200 |
Note: if we do not give our pandas Series a name, i.e. we simply define next_observation = pd.Series({'ST_NUM': 999 , 'OWN_OCCUPIED': 'Y', 'NUM_BEDROOMS': 2,'NUM_BATH': 1,'SQ_FT': 200}) without the name argument, then we must set the ignore_index flag in the append method to True; otherwise the command produces an error, as follows:
next_observation = pd.Series({'ST_NUM': 999 , 'OWN_OCCUPIED': 'N', 'NUM_BEDROOMS': 5,'NUM_BATH': 4,'SQ_FT': 3500})
print(next_observation)
print(df)
ST_NUM 999
OWN_OCCUPIED N
NUM_BEDROOMS 5
NUM_BATH 4
SQ_FT 3500
dtype: object
ST_NUM ST_NAME OWN_OCCUPIED NUM_BEDROOMS NUM_BATH SQ_FT
PID
10101.0 104.0 Khao San Y 3 1 1000
10102.0 197.0 Silom N 3 1.5 --
10103.0 NaN Silom N NaN 1 850
10104.0 201.0 Sukhumvit 12 1 NaN 700
NaN 203.0 Sukhumvit Y 3 2 1600
10106.0 207.0 Sukhumvit Y NaN 1 800
10107.0 NaN Thonglor NaN 2 HURLEY 950
10108.0 213.0 Rama 1 Y -- 1 NaN
10109.0 215.0 Rama 1 Y na 2 1800
# Without setting the ignore_index flag in the append method to True, this produces an error.
df.append(next_observation)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/var/folders/kl/h_r05n_j76n32kt0dwy7kynw0000gn/T/ipykernel_1214/4163196459.py in <module>
1 # Without setting the ignore_index flag in the append method to True, this produces an eror.
2
----> 3 df.append(next_observation)
~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in append(self, other, ignore_index, verify_integrity, sort)
8933 if other.name is None and not ignore_index:
8934 raise TypeError(
-> 8935 "Can only append a Series if ignore_index=True "
8936 "or if the Series has a name"
8937 )
TypeError: Can only append a Series if ignore_index=True or if the Series has a name
# Setting the ignore_index flag in the append method to True
df.append(next_observation,ignore_index=True)
ST_NUM | ST_NAME | OWN_OCCUPIED | NUM_BEDROOMS | NUM_BATH | SQ_FT | |
---|---|---|---|---|---|---|
0 | 104.0 | Khao San | Y | 3 | 1 | 1000 |
1 | 197.0 | Silom | N | 3 | 1.5 | -- |
2 | NaN | Silom | N | NaN | 1 | 850 |
3 | 201.0 | Sukhumvit | 12 | 1 | NaN | 700 |
4 | 203.0 | Sukhumvit | Y | 3 | 2 | 1600 |
5 | 207.0 | Sukhumvit | Y | NaN | 1 | 800 |
6 | NaN | Thonglor | NaN | 2 | HURLEY | 950 |
7 | 213.0 | Rama 1 | Y | -- | 1 | NaN |
8 | 215.0 | Rama 1 | Y | na | 2 | 1800 |
9 | 999.0 | NaN | N | 5 | 4 | 3500 |
Note: the resulting (new) DataFrame’s index is not the same as the original DataFrame’s index, because ignore_index was passed as True to the append() function.
The next complete example illustrates how to add multiple rows to a DataFrame.
print(new_observation)
print(next_observation)
ST_NUM 555
OWN_OCCUPIED Y
NUM_BEDROOMS 2
NUM_BATH 1
SQ_FT 200
Name: 10110.0, dtype: object
ST_NUM 999
OWN_OCCUPIED N
NUM_BEDROOMS 5
NUM_BATH 4
SQ_FT 3500
dtype: object
listOfSeries = [new_observation,next_observation]
new_df = df.append(listOfSeries, ignore_index = True)
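Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; pd.concat is the recommended replacement. A minimal sketch of the equivalent operation:
# Build a one-row-per-Series DataFrame and concatenate it below the original
rows = pd.DataFrame([new_observation, next_observation])
new_df = pd.concat([df, rows], ignore_index=True)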
Finally, to remove the row(s), we can apply the drop method once more. Now we must set the axis to 0 and specify the row index we want to remove.
new_df.drop([9,10], axis = 0, inplace = True)
new_df
ST_NUM | ST_NAME | OWN_OCCUPIED | NUM_BEDROOMS | NUM_BATH | SQ_FT | |
---|---|---|---|---|---|---|
0 | 104.0 | Khao San | Y | 3 | 1 | 1000 |
1 | 197.0 | Silom | N | 3 | 1.5 | -- |
2 | NaN | Silom | N | NaN | 1 | 850 |
3 | 201.0 | Sukhumvit | 12 | 1 | NaN | 700 |
4 | 203.0 | Sukhumvit | Y | 3 | 2 | 1600 |
5 | 207.0 | Sukhumvit | Y | NaN | 1 | 800 |
6 | NaN | Thonglor | NaN | 2 | HURLEY | 950 |
7 | 213.0 | Rama 1 | Y | -- | 1 | NaN |
8 | 215.0 | Rama 1 | Y | na | 2 | 1800 |
Missing values can be removed by applying the drop() function to the index labels where isnull() returns True. This has the same effect as filtering out the NaN values, as explained above, but instead of returning a view, a copy of the DataFrame without the NaN rows is returned.
index = new_df.index[new_df['ST_NUM'].isnull()]
# Alternatively, one can obtain the index
new_index = new_df['ST_NUM'].index[new_df['ST_NUM'].apply(np.isnan)]
print(new_df.drop(index, axis = 0))
print(new_df.drop(new_index, axis = 0))
ST_NUM ST_NAME OWN_OCCUPIED NUM_BEDROOMS NUM_BATH SQ_FT
0 104.0 Khao San Y 3 1 1000
1 197.0 Silom N 3 1.5 --
3 201.0 Sukhumvit 12 1 NaN 700
4 203.0 Sukhumvit Y 3 2 1600
5 207.0 Sukhumvit Y NaN 1 800
7 213.0 Rama 1 Y -- 1 NaN
8 215.0 Rama 1 Y na 2 1800
ST_NUM ST_NAME OWN_OCCUPIED NUM_BEDROOMS NUM_BATH SQ_FT
0 104.0 Khao San Y 3 1 1000
1 197.0 Silom N 3 1.5 --
3 201.0 Sukhumvit 12 1 NaN 700
4 203.0 Sukhumvit Y 3 2 1600
5 207.0 Sukhumvit Y NaN 1 800
7 213.0 Rama 1 Y -- 1 NaN
8 215.0 Rama 1 Y na 2 1800
Instead of using the generic drop function, we can use the particular dropna()
function to remove NaN values. We must set the how
keyword to any
if we wish to delete any record that includes a NaN value. We can use the subset
keyword to limit it to a specific subset of columns. The effect will be the same as if we used the drop function, as shown below:
Note: the parameter how (‘any’ or ‘all’), default ‘any’, determines whether a row or column is removed from the DataFrame when we have at least one NA value or all NA values.
‘any’: if any NA values are present, drop that row or column.
‘all’: if all values are NA, drop that row or column.
df.dropna(how = 'any', subset = ['ST_NUM'])
ST_NUM | ST_NAME | OWN_OCCUPIED | NUM_BEDROOMS | NUM_BATH | SQ_FT | |
---|---|---|---|---|---|---|
PID | ||||||
10101.0 | 104.0 | Khao San | Y | 3 | 1 | 1000 |
10102.0 | 197.0 | Silom | N | 3 | 1.5 | -- |
10104.0 | 201.0 | Sukhumvit | 12 | 1 | NaN | 700 |
NaN | 203.0 | Sukhumvit | Y | 3 | 2 | 1600 |
10106.0 | 207.0 | Sukhumvit | Y | NaN | 1 | 800 |
10108.0 | 213.0 | Rama 1 | Y | -- | 1 | NaN |
10109.0 | 215.0 | Rama 1 | Y | na | 2 | 1800 |
# Setting subset keyword with a subset of columns
df.dropna(how = 'any', subset = ['ST_NUM','NUM_BATH'])
ST_NUM | ST_NAME | OWN_OCCUPIED | NUM_BEDROOMS | NUM_BATH | SQ_FT | |
---|---|---|---|---|---|---|
PID | ||||||
10101.0 | 104.0 | Khao San | Y | 3 | 1 | 1000 |
10102.0 | 197.0 | Silom | N | 3 | 1.5 | -- |
NaN | 203.0 | Sukhumvit | Y | 3 | 2 | 1600 |
10106.0 | 207.0 | Sukhumvit | Y | NaN | 1 | 800 |
10108.0 | 213.0 | Rama 1 | Y | -- | 1 | NaN |
10109.0 | 215.0 | Rama 1 | Y | na | 2 | 1800 |
If we wish to fill the rows containing NaN with another value instead of removing them, we may use the fillna()
method and specify the value to use. If we only want to fill certain columns, we must pass a dictionary as an argument to the fillna()
function, with the names of the columns as the key and the character to use for filling as the value.
df.fillna(value = {'ST_NUM':0})
ST_NUM | ST_NAME | OWN_OCCUPIED | NUM_BEDROOMS | NUM_BATH | SQ_FT | |
---|---|---|---|---|---|---|
PID | ||||||
10101.0 | 104.0 | Khao San | Y | 3 | 1 | 1000 |
10102.0 | 197.0 | Silom | N | 3 | 1.5 | -- |
10103.0 | 0.0 | Silom | N | NaN | 1 | 850 |
10104.0 | 201.0 | Sukhumvit | 12 | 1 | NaN | 700 |
NaN | 203.0 | Sukhumvit | Y | 3 | 2 | 1600 |
10106.0 | 207.0 | Sukhumvit | Y | NaN | 1 | 800 |
10107.0 | 0.0 | Thonglor | NaN | 2 | HURLEY | 950 |
10108.0 | 213.0 | Rama 1 | Y | -- | 1 | NaN |
10109.0 | 215.0 | Rama 1 | Y | na | 2 | 1800 |
df.fillna(value = {'ST_NUM': -1, 'NUM_BEDROOMS': -1 })
ST_NUM | ST_NAME | OWN_OCCUPIED | NUM_BEDROOMS | NUM_BATH | SQ_FT | |
---|---|---|---|---|---|---|
PID | ||||||
10101.0 | 104.0 | Khao San | Y | 3 | 1 | 1000 |
10102.0 | 197.0 | Silom | N | 3 | 1.5 | -- |
10103.0 | -1.0 | Silom | N | -1 | 1 | 850 |
10104.0 | 201.0 | Sukhumvit | 12 | 1 | NaN | 700 |
NaN | 203.0 | Sukhumvit | Y | 3 | 2 | 1600 |
10106.0 | 207.0 | Sukhumvit | Y | -1 | 1 | 800 |
10107.0 | -1.0 | Thonglor | NaN | 2 | HURLEY | 950 |
10108.0 | 213.0 | Rama 1 | Y | -- | 1 | NaN |
10109.0 | 215.0 | Rama 1 | Y | na | 2 | 1800 |
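Instead of a fixed constant, we can also fill with a statistic computed from the column itself, for example the median street number (a sketch, not part of the original example):
# Fill missing street numbers with the median of the observed street numbers
df['ST_NUM'].fillna(df['ST_NUM'].median())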
3.4.3. Sorting¶
Sorting by columns is another important feature we will need while examining our data. Using the sort_values method, we can sort a DataFrame by any column. We only need to execute the following command to see the data sorted in descending order (i.e., from the largest to the smallest values) by the sepal_length column:
Here we will work with the Iris dataset.
print(iris.columns)
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class'], dtype='object')
iris.sort_values(by = ['sepal_length'], ascending = False)
sepal_length | sepal_width | petal_length | petal_width | class | |
---|---|---|---|---|---|
131 | 7.9 | 3.8 | 6.4 | 2.0 | Iris-virginica |
135 | 7.7 | 3.0 | 6.1 | 2.3 | Iris-virginica |
122 | 7.7 | 2.8 | 6.7 | 2.0 | Iris-virginica |
117 | 7.7 | 3.8 | 6.7 | 2.2 | Iris-virginica |
118 | 7.7 | 2.6 | 6.9 | 2.3 | Iris-virginica |
... | ... | ... | ... | ... | ... |
41 | 4.5 | 2.3 | 1.3 | 0.3 | Iris-setosa |
42 | 4.4 | 3.2 | 1.3 | 0.2 | Iris-setosa |
38 | 4.4 | 3.0 | 1.3 | 0.2 | Iris-setosa |
8 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
13 | 4.3 | 3.0 | 1.1 | 0.1 | Iris-setosa |
150 rows × 5 columns
The following command sorts by the sepal_length column and then by the petal_length column.
iris.sort_values(by = ['sepal_length','petal_length'], ascending = False)
sepal_length | sepal_width | petal_length | petal_width | class | |
---|---|---|---|---|---|
131 | 7.9 | 3.8 | 6.4 | 2.0 | Iris-virginica |
118 | 7.7 | 2.6 | 6.9 | 2.3 | Iris-virginica |
117 | 7.7 | 3.8 | 6.7 | 2.2 | Iris-virginica |
122 | 7.7 | 2.8 | 6.7 | 2.0 | Iris-virginica |
135 | 7.7 | 3.0 | 6.1 | 2.3 | Iris-virginica |
... | ... | ... | ... | ... | ... |
41 | 4.5 | 2.3 | 1.3 | 0.3 | Iris-setosa |
8 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
38 | 4.4 | 3.0 | 1.3 | 0.2 | Iris-setosa |
42 | 4.4 | 3.2 | 1.3 | 0.2 | Iris-setosa |
13 | 4.3 | 3.0 | 1.1 | 0.1 | Iris-setosa |
150 rows × 5 columns
Note
The inplace keyword indicates that the DataFrame will be overwritten, hence no new DataFrame is returned.
When ascending = True is used instead of ascending = False, the values are sorted in ascending order (i.e., from the smallest to the largest values).
If we wish to restore the original order, we can use the sort_index method and sort by the index.
# restore the original order
iris.sort_index(axis = 0, ascending = True, inplace = True)
iris.head()
sepal_length | sepal_width | petal_length | petal_width | class | |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
# Importing sales dataset for the next section
filepath = '/Users/Kaemyuijang/SCMA248/Data/SalesData.csv'
# pd.read_csv(filepath, header = None, skiprows = 1, names = ['Trans_no','Name','Date','Product','Units','Dollars','Location'])
# header = 0 means that the first row is used as the column names
sales = pd.read_csv(filepath, header = 0, index_col = 0)
# https://stackoverflow.com/questions/25015711/time-data-does-not-match-format
sales['Date'] = pd.to_datetime(sales['Date'], format='%d/%m/%Y')
sales.head()
Name | Date | Product | Units | Dollars | Location | |
---|---|---|---|---|---|---|
Trans_Number | ||||||
1 | Betsy | 2004-04-01 | lip gloss | 45 | 137.20 | south |
2 | Hallagan | 2004-03-10 | foundation | 50 | 152.01 | midwest |
3 | Ashley | 2005-02-25 | lipstick | 9 | 28.72 | midwest |
4 | Hallagan | 2006-05-22 | lip gloss | 55 | 167.08 | west |
5 | Zaret | 2004-06-17 | lip gloss | 43 | 130.60 | midwest |
3.4.4. Grouping Data¶
Another useful method for inspecting data is to group it according to certain criteria. For the sales dataset, it would be useful to categorize all of the data by location, independent of the year. The groupby function in pandas allows us to accomplish just that. This function returns a special grouped DataFrame, so an aggregation function must then be applied to obtain a regular DataFrame; the aggregation is applied to all values within each group.
For instance, in our scenario, we can get a DataFrame that shows the number (count) of transactions for each location across all years by grouping by location and using count as the aggregation function for each group. The result is a DataFrame with locations as the index and transaction counts as the column:
print(type(sales[['Location','Dollars']].groupby('Location')))
sales[['Location','Dollars']].groupby('Location').count()
<class 'pandas.core.groupby.generic.DataFrameGroupBy'>
Dollars | |
---|---|
Location | |
east | 456 |
midwest | 424 |
south | 521 |
west | 490 |
# a DataFrame with locations as indexes and mean of sales income as
# the column would be created
sales[['Location','Dollars']].groupby('Location').mean()
Dollars | |
---|---|
Location | |
east | 125.815965 |
midwest | 129.258113 |
south | 123.409616 |
west | 129.467082 |
sales[['Location','Dollars','Units']].groupby('Location').mean()
Dollars | Units | |
---|---|---|
Location | ||
east | 125.815965 | 41.267544 |
midwest | 129.258113 | 42.417453 |
south | 123.409616 | 40.466411 |
west | 129.467082 | 42.491837 |
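If we need a different statistic for each column, the grouped object also accepts a dictionary through the agg method. The particular choice of statistics below is only for illustration:

# sum the revenue but average the units within each location
sales.groupby('Location').agg({'Dollars': 'sum', 'Units': 'mean'})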
3.4.5. Rearranging Data¶
Until now, our index was merely a numerical label for the rows, carrying little meaning. We can reorganize the data by redistributing indexes and columns, which often makes further processing easier. The `pivot_table` function lets us rearrange our data: we can specify which columns become the new index, the values, and the columns.
3.4.5.1. Simplest Pivot table¶
An `index` is required even for the smallest pivot table. Let us utilize the location as our index in this case. By default, it uses the `'mean'` aggregation function on all available numerical columns.
sales.pivot_table(index='Location')
Dollars | Units | |
---|---|---|
Location | ||
east | 125.815965 | 41.267544 |
midwest | 129.258113 | 42.417453 |
south | 123.409616 | 40.466411 |
west | 129.467082 | 42.491837 |
To use multiple levels in the index, we can pass a list to `index`:
sales.pivot_table(index=['Location','Product'])
Dollars | Units | ||
---|---|---|---|
Location | Product | ||
east | eye liner | 132.521304 | 43.513043 |
foundation | 120.755248 | 39.584158 | |
lip gloss | 119.001134 | 38.989691 | |
lipstick | 136.096279 | 44.697674 | |
mascara | 125.406000 | 41.120000 | |
midwest | eye liner | 127.295980 | 41.745098 |
foundation | 139.133333 | 45.712644 | |
lip gloss | 130.765316 | 42.936709 | |
lipstick | 136.465472 | 44.811321 | |
mascara | 117.995340 | 38.669903 | |
south | eye liner | 121.481947 | 39.831858 |
foundation | 121.983852 | 39.977778 | |
lip gloss | 121.492632 | 39.824561 | |
lipstick | 115.709273 | 37.909091 | |
mascara | 133.528462 | 43.846154 | |
west | eye liner | 136.215000 | 44.735849 |
foundation | 137.521553 | 45.194175 | |
lip gloss | 115.428197 | 37.795082 | |
lipstick | 132.699643 | 43.589286 | |
mascara | 129.339223 | 42.446602 |
In a pivot table, the columns passed to index are the keys to group by. To obtain a different view, you can change their order. For example, we can look at the average values grouped by product first and then by location.
sales.pivot_table(index=['Product','Location'])
Dollars | Units | ||
---|---|---|---|
Product | Location | ||
eye liner | east | 132.521304 | 43.513043 |
midwest | 127.295980 | 41.745098 | |
south | 121.481947 | 39.831858 | |
west | 136.215000 | 44.735849 | |
foundation | east | 120.755248 | 39.584158 |
midwest | 139.133333 | 45.712644 | |
south | 121.983852 | 39.977778 | |
west | 137.521553 | 45.194175 | |
lip gloss | east | 119.001134 | 38.989691 |
midwest | 130.765316 | 42.936709 | |
south | 121.492632 | 39.824561 | |
west | 115.428197 | 37.795082 | |
lipstick | east | 136.096279 | 44.697674 |
midwest | 136.465472 | 44.811321 | |
south | 115.709273 | 37.909091 | |
west | 132.699643 | 43.589286 | |
mascara | east | 125.406000 | 41.120000 |
midwest | 117.995340 | 38.669903 | |
south | 133.528462 | 43.846154 | |
west | 129.339223 | 42.446602 |
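The same reordering can also be obtained from the first pivot table without recomputing it, by swapping the two index levels and re-sorting; a short sketch:

# swap the Location and Product index levels, then sort the new index
pt = sales.pivot_table(index=['Location', 'Product'])
pt.swaplevel(0, 1).sort_index()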
3.4.5.2. Specifying values and performing aggregation¶
The mean aggregation function is applied to all numerical columns by default, and the result is returned. Use the `values` argument to specify the columns we are interested in.
sales.pivot_table(index=['Location'],
values = ['Dollars'])
Dollars | |
---|---|
Location | |
east | 125.815965 |
midwest | 129.258113 |
south | 123.409616 |
west | 129.467082 |
We can pass a valid function name as a string to `aggfunc` to perform an aggregation other than the mean, for example, a sum:
sales.pivot_table(index=['Location'],
values = ['Dollars'], aggfunc = 'sum')
Dollars | |
---|---|
Location | |
east | 57372.08 |
midwest | 54805.44 |
south | 64296.41 |
west | 63438.87 |
`aggfunc` can be a dict, and below is the dict equivalent.
sales.pivot_table(index=['Location'],
values = ['Dollars'], aggfunc = {'Dollars': 'sum'})
Dollars | |
---|---|
Location | |
east | 57372.08 |
midwest | 54805.44 |
south | 64296.41 |
west | 63438.87 |
`aggfunc` can also be a list of functions; below is an example that displays both the `sum` and the `count`.
sales.pivot_table(index=['Location'],
values = ['Dollars'], aggfunc = ['sum','count'])
sum | count | |
---|---|---|
Dollars | Dollars | |
Location | ||
east | 57372.08 | 456 |
midwest | 54805.44 | 424 |
south | 64296.41 | 521 |
west | 63438.87 | 490 |
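In recent pandas versions the dictionary form can also mix single functions and lists of functions per column; as a sketch (the combination below is chosen only for illustration):

# sum and average Dollars, but only sum Units
sales.pivot_table(index=['Location'],
                  values=['Dollars', 'Units'],
                  aggfunc={'Dollars': ['sum', 'mean'], 'Units': 'sum'})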
3.4.5.3. Breaking down the results using columns¶
If we would like to see sales broken down by product, the `columns` argument allows us to do that:
sales.pivot_table(index=['Location'],
values = ['Dollars'],
aggfunc = 'sum',
columns='Product')
Dollars | |||||
---|---|---|---|---|---|
Product | eye liner | foundation | lip gloss | lipstick | mascara |
Location | |||||
east | 15239.95 | 12196.28 | 11543.11 | 5852.14 | 12540.60 |
midwest | 12984.19 | 12104.60 | 10330.46 | 7232.67 | 12153.52 |
south | 13727.46 | 16467.82 | 13850.16 | 6364.01 | 13886.96 |
west | 14438.79 | 14164.72 | 14082.24 | 7431.18 | 13321.94 |
Note: If there are missing values and we want to replace them, we can use the `fill_value` argument, for example, to set `NaN` to a specific value.
sales.pivot_table(index=['Location'],
values = ['Dollars'],
aggfunc = 'sum',
columns='Product',
fill_value = 0)
Dollars | |||||
---|---|---|---|---|---|
Product | eye liner | foundation | lip gloss | lipstick | mascara |
Location | |||||
east | 15239.95 | 12196.28 | 11543.11 | 5852.14 | 12540.60 |
midwest | 12984.19 | 12104.60 | 10330.46 | 7232.67 | 12153.52 |
south | 13727.46 | 16467.82 | 13850.16 | 6364.01 | 13886.96 |
west | 14438.79 | 14164.72 | 14082.24 | 7431.18 | 13321.94 |
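For comparison (a sketch, not required for what follows), the same breakdown can be produced with groupby followed by unstack, which moves the Product level from the index into the columns:

# equivalent to the pivot table above, with missing combinations filled with 0
sales.groupby(['Location', 'Product'])['Dollars'].sum().unstack(fill_value=0)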
3.4.6. Exercise¶
Apply `pivot_table` (in new worksheets) to answer the following questions.
The number of sales transactions for the given salesperson
For the given salesperson, the total revenue by the given product
Total revenue generated by the given salesperson broken down by the given location
Total revenue by the salesperson and the given year
3.4.7. Solutions to Exercise¶
The number of sales transactions for the given salesperson
sales.columns
Index(['Name', 'Date', 'Product', 'Units', 'Dollars', 'Location'], dtype='object')
sales.pivot_table(index=['Name'],
values=['Dollars'], aggfunc='count')
Dollars | |
---|---|
Name | |
Ashley | 197 |
Betsy | 217 |
Cici | 230 |
Colleen | 206 |
Cristina | 207 |
Emilee | 203 |
Hallagan | 200 |
Jen | 217 |
Zaret | 214 |
sales.pivot_table(index=['Name'],
values=['Dollars'], aggfunc = 'count').sort_values(by = 'Dollars')
Dollars | |
---|---|
Name | |
Ashley | 197 |
Hallagan | 200 |
Emilee | 203 |
Colleen | 206 |
Cristina | 207 |
Zaret | 214 |
Betsy | 217 |
Jen | 217 |
Cici | 230 |
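Since aggfunc='count' here simply counts the rows per salesperson (assuming there are no missing Dollars values), the same numbers can be obtained directly from the Name column; a quick sketch:

# count transactions per salesperson and sort from fewest to most
sales['Name'].value_counts().sort_values()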
For the given salesperson, the total revenue by the given product
sales.pivot_table(index='Name',
values='Dollars',
columns = 'Product', aggfunc = 'sum', margins=True)
Product | eye liner | foundation | lip gloss | lipstick | mascara | All |
---|---|---|---|---|---|---|
Name | ||||||
Ashley | 5844.95 | 4186.06 | 6053.67 | 3245.46 | 6617.10 | 25947.24 |
Betsy | 6046.51 | 8043.48 | 5675.65 | 3968.62 | 4827.22 | 28561.48 |
Cici | 5982.83 | 6198.22 | 5199.96 | 3148.83 | 7060.71 | 27590.55 |
Colleen | 3389.61 | 6834.76 | 5573.35 | 2346.39 | 6746.54 | 24890.65 |
Cristina | 5397.30 | 5290.97 | 5298.00 | 2401.68 | 5461.64 | 23849.59 |
Emilee | 7587.37 | 5313.82 | 5270.26 | 2189.15 | 4719.34 | 25079.94 |
Hallagan | 6964.64 | 6985.80 | 5603.10 | 3177.88 | 5703.32 | 28434.74 |
Jen | 7010.45 | 5628.65 | 5461.65 | 3953.28 | 6887.21 | 28941.24 |
Zaret | 8166.73 | 6451.66 | 5670.33 | 2448.71 | 3879.94 | 26617.37 |
All | 56390.39 | 54933.42 | 49805.97 | 26880.00 | 51903.02 | 239912.80 |
The result above also shows the totals. In pandas pivot_table(), we simply pass `margins=True`.
Total revenue generated by the given salesperson broken down by the given location
sales.pivot_table(index=['Name','Location'],
values='Dollars', aggfunc = 'sum').head(10)
Dollars | ||
---|---|---|
Name | Location | |
Ashley | east | 7772.70 |
midwest | 4985.90 | |
south | 7398.58 | |
west | 5790.06 | |
Betsy | east | 8767.41 |
midwest | 4878.07 | |
south | 7732.04 | |
west | 7183.96 | |
Cici | east | 5956.31 |
midwest | 8129.60 |
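The same breakdown can also be shown in a wide layout, with one column per location, by moving Location to the columns argument; a short sketch:

# total revenue per salesperson, one column per location
sales.pivot_table(index='Name', columns='Location',
                  values='Dollars', aggfunc='sum')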
Total revenue by the salesperson and the given year.
To generate a yearly report with pandas pivot_table(), here are the steps:
Define a grouping instruction using Grouper() with key='Date' and freq='Y'.
Apply pivot_table.
Note: To group our data by a specified frequency on a specified column, we use pd.Grouper(key=<INPUT COLUMN>, freq=<DESIRED FREQUENCY>). The frequency in our situation is 'Y' and the relevant column is 'Date'.
Different standard frequencies, such as 'D', 'W', 'M', or 'Q', can be used instead of 'Y'. See https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases for a list of less common usable frequencies.
year_gp = pd.Grouper(key='Date',freq='Y')
print(year_gp)
sales.pivot_table(index='Name', columns=year_gp, values='Dollars',aggfunc='sum')
TimeGrouper(key='Date', freq=<YearEnd: month=12>, axis=0, sort=True, closed='right', label='right', how='mean', convention='e', origin='start_day')
Date | 2004-12-31 | 2005-12-31 | 2006-12-31 |
---|---|---|---|
Name | |||
Ashley | 9495.09 | 9547.53 | 6904.62 |
Betsy | 9420.24 | 9788.70 | 9352.54 |
Cici | 8965.25 | 9024.96 | 9600.34 |
Colleen | 9361.37 | 7996.81 | 7532.47 |
Cristina | 9132.12 | 7976.34 | 6741.13 |
Emilee | 7805.66 | 9326.44 | 7947.84 |
Hallagan | 10676.88 | 9102.52 | 8655.34 |
Jen | 9049.31 | 8920.30 | 10971.63 |
Zaret | 9078.51 | 8639.68 | 8899.18 |
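An alternative to pd.Grouper, shown here only as a sketch, is to add an explicit Year column derived from the dates and pivot on it; the column name Year is introduced just for this example:

# derive the calendar year from Date and use it as the pivot columns
sales.assign(Year=sales['Date'].dt.year).pivot_table(index='Name', columns='Year',
                                                     values='Dollars', aggfunc='sum')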
Exercise
Apply pivot_table (in new worksheets) to answer the following question.
How many salespersons are there? Hint: use `groupby` to create a DataFrame grouped by salesperson and then call `groups` on the grouped object, which returns the row indices for every group.
grouped_person = sales.groupby('Name')
print(grouped_person.groups.keys())
len(grouped_person.groups.keys())
dict_keys(['Ashley', 'Betsy', 'Cici', 'Colleen', 'Cristina', 'Emilee', 'Hallagan', 'Jen', 'Zaret'])
9
grouped_person.size()
Name
Ashley 197
Betsy 217
Cici 230
Colleen 206
Cristina 207
Emilee 203
Hallagan 200
Jen 217
Zaret 214
dtype: int64
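As a quick cross-check (a sketch), the number of distinct salespersons can also be read off directly:

# number of unique names in the Name column
sales['Name'].nunique()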