import numpy
Lab: Python for Data Science
ACTL3143 & ACTL5111 Deep Learning for Actuaries
Data Science Libraries
A couple of fundamental data science packages in Python are NumPy and Pandas. NumPy is a package for handling matrices and vector math, while Pandas handles dataframes and data wrangling.
Libraries are imported using the import
keyword:
You can set an alias to the libraries you are importing. Usually this is done to simplify the name of a long library.
import numpy as np
import pandas as pd
You can also import specific functions from a library by using the from
keyword:
from sklearn.preprocessing import StandardScaler
In this lab, we will be working with two libraries used for data processing, NumPy and Pandas.
A Note on Installing Libraries
If you have successfully installed Anaconda onto your system, you should already have NumPy and Pandas installed as well. However, if for some reason you do not have a particular library installed, or you would like to update a particular library, you can use the command line to install new packages.
You can either open up Command Prompt/Terminal and type:
pip install numpy
The pip
method will also work on Anaconda Prompt. This will install the libraries onto your machine. When installing libraries, it is highly recommended that you create a Conda environment, as this allows you to install and manage separate sets of libraries for each Python project you are working on.
For a tutorial on how to set up your own environments, see https://docs.conda.io/projects/conda/en/latest/user-guide/concepts/environments.html
NumPy
NumPy is a package used for scientific computing in Python, with the ability to perform advanced mathematical operations, linear algebra, and vectorisation. Core to the NumPy package is the NumPy array.
NumPy 1D arrays
Unlike lists in base Python, NumPy arrays can only work with numerical data. NumPy arrays are also faster and consumes less memory than Python lists (source: numpy.org/doc/stable/user/absolute_beginners.html).
= [1,1,1]
l1 = [2,2,2]
l2
= np.array(l1)
a1 = np.array(l2)
a2
#What do you notice?
print(l1 + l2)
print(a1 + a2)
[1, 1, 1, 2, 2, 2]
[3 3 3]
As you can see in the above code snippet, NumPy arrays are designed for linear algebra operations.
Other operations you can do include adding and multiplying arrays by a constant, calculating determinants of matrices, and even calculating eigenvalues and eigenvectors:
+ 3 #adds 3 to each element of the array, returns an error if done to a list
a1 * 3 #multiplies each element by 3 a1
array([3, 3, 3])
= np.array([[2,4],[1,3]]) #creating a 2D array, i.e. a matrix
m1 print(m1)
print(np.linalg.det(m1)) #Determinant
print(np.linalg.eig(m1)) #Eigenvalues and eigenvectors
[[2 4]
[1 3]]
2.0
EigResult(eigenvalues=array([0.43844719, 4.56155281]), eigenvectors=array([[-0.93153209, -0.84212294],
[ 0.36365914, -0.5392856 ]]))
You can create arrays using ranges or linearly spaced sequences:
= np.arange(5)
array_range = np.linspace(start = 0, stop = 1, num = 6)
array_lin
print(array_range)
print(array_lin)
[0 1 2 3 4]
[0. 0.2 0.4 0.6 0.8 1. ]
NumPy 2D arrays
As mentioned beforehand, you can create a matrix by feeding a list of lists into np.array()
:
= np.array([[2,4],[1,3]]) m1
You can also create matrices of zeroes and identity matrices:
= np.zeros([3,3]) #3 x 3 matrix
m_zero print(m_zero)
= np.ones([3,3])
m_ones print(m_ones)
= np.identity(3)
m_id print(m_id)
[[0. 0. 0.]
[0. 0. 0.]
[0. 0. 0.]]
[[1. 1. 1.]
[1. 1. 1.]
[1. 1. 1.]]
[[1. 0. 0.]
[0. 1. 0.]
[0. 0. 1.]]
Pandas
Pandas is a Python library used for working with tabular data. It contains tools for data manipulation, time series, and data visualisation. Pandas can be considered a Python equivalent to dplyr
, and core to Pandas is the DataFrame
object, which is analogous to R’s data.frame
type.
import pandas as pd
DataFrames
For this lab we will be working with the Titanic machine learning dataset - a legendary dataset in the data science community. It is available at https://www.kaggle.com/competitions/titanic/data, and we will specifically be using train.csv
.
To use the dataset in Google Colab, we need to upload and then import it. To see which datasets are available in Google Colab, click the folder icon on the sidebar. Here, you can see the datasets you have uploaded, as well as any sample datasets that are already built into Google Colab. To upload files, click the upload icon that appears and select the file that you want to upload.
We will import the dataset using Pandas’ read_csv()
function.
= pd.read_csv("train.csv") titanic
This creates a DataFrame
object, which is a 2-dimensional, tabular data structure.
There are a number of methods available in Pandas to inspect your data, including .head()
and .info()
.
#much like the head() function in R, this method prints the first 5 rows of the dataset. titanic.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
10) # Prints last 10 rows titanic.tail(
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
881 | 882 | 0 | 3 | Markun, Mr. Johann | male | 33.0 | 0 | 0 | 349257 | 7.8958 | NaN | S |
882 | 883 | 0 | 3 | Dahlberg, Miss. Gerda Ulrika | female | 22.0 | 0 | 0 | 7552 | 10.5167 | NaN | S |
883 | 884 | 0 | 2 | Banfield, Mr. Frederick James | male | 28.0 | 0 | 0 | C.A./SOTON 34068 | 10.5000 | NaN | S |
884 | 885 | 0 | 3 | Sutehall, Mr. Henry Jr | male | 25.0 | 0 | 0 | SOTON/OQ 392076 | 7.0500 | NaN | S |
885 | 886 | 0 | 3 | Rice, Mrs. William (Margaret Norton) | female | 39.0 | 0 | 5 | 382652 | 29.1250 | NaN | Q |
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
# Gives a list of columns, their counts and their types, akin to the str() function in R. titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
Selecting columns of a Pandas DataFrame is done using square brackets notation:
"Age"] # Selecting "Age" column from dataset titanic[
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
...
886 27.0
887 19.0
888 NaN
889 26.0
890 32.0
Name: Age, Length: 891, dtype: float64
"Sex","Age"]] # Selecting multiple columns titanic[[
Sex | Age | |
---|---|---|
0 | male | 22.0 |
1 | female | 38.0 |
2 | female | 26.0 |
3 | female | 35.0 |
4 | male | 35.0 |
... | ... | ... |
886 | male | 27.0 |
887 | female | 19.0 |
888 | female | NaN |
889 | male | 26.0 |
890 | male | 32.0 |
891 rows × 2 columns
There are several ways of selecting rows in a DataFrame, including selecting by row number using the square bracket notation or the .iloc
method, or selecting by row name using the .loc
method.
4:9] # Selecting rows by the index (can be different to row number) titanic[
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
4:9] # Selecting rows by their row numbers titanic.iloc[
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
"Name", inplace=True) #sets the "Name" column as the index
titanic.set_index(# By setting inplace = True, we modify the existing DataFrame rather than creating a new one.
# In other words, we do not need to assign it back to the titanic variable.
# Selecting rows using .loc
"Allen, Mr. William Henry", "Moran, Mr. James"]] titanic.loc[[
PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|
Name | |||||||||||
Allen, Mr. William Henry | 5 | 0 | 3 | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
Moran, Mr. James | 6 | 0 | 3 | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
When selecting both rows and columns, using .loc
or .iloc
is necessary:
4:9, [0, 3]] # Selecting rows 4 to 8, and columns 0 and 3 titanic.iloc[
PassengerId | Sex | |
---|---|---|
Name | ||
Allen, Mr. William Henry | 5 | male |
Moran, Mr. James | 6 | male |
McCarthy, Mr. Timothy J | 7 | male |
Palsson, Master. Gosta Leonard | 8 | male |
Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | 9 | female |
"McCarthy, Mr. Timothy J", "Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)"], "Age"] titanic.loc[[
Name
McCarthy, Mr. Timothy J 54.0
Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) 27.0
Name: Age, dtype: float64
You can use the bracket notation to filter the dataset:
"Age"] >= 18] titanic[titanic[
PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|
Name | |||||||||||
Braund, Mr. Owen Harris | 1 | 0 | 3 | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
Cumings, Mrs. John Bradley (Florence Briggs Thayer) | 2 | 1 | 1 | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
Heikkinen, Miss. Laina | 3 | 1 | 3 | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
Futrelle, Mrs. Jacques Heath (Lily May Peel) | 4 | 1 | 1 | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
Allen, Mr. William Henry | 5 | 0 | 3 | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Rice, Mrs. William (Margaret Norton) | 886 | 0 | 3 | female | 39.0 | 0 | 5 | 382652 | 29.1250 | NaN | Q |
Montvila, Rev. Juozas | 887 | 0 | 2 | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
Graham, Miss. Margaret Edith | 888 | 1 | 1 | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
Behr, Mr. Karl Howell | 890 | 1 | 1 | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
Dooley, Mr. Patrick | 891 | 0 | 3 | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
601 rows × 11 columns
This has reduced the dataset from 891 rows to 601.
If we wanted to combine multiple conditions together, we can use conditional operators. However, Python’s usual conditional operators (and
, or
, not
) will not work here, and instead we will need to use symbols (&
, |
, !
).
# Selecting passengers whose ages are 18 and above and are in passenger class 3.
"Age"] >= 18) & (titanic["Pclass"] == 3)] #Note that we need to wrap each conditional statement in parentheses. titanic[(titanic[
PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|
Name | |||||||||||
Braund, Mr. Owen Harris | 1 | 0 | 3 | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
Heikkinen, Miss. Laina | 3 | 1 | 3 | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
Allen, Mr. William Henry | 5 | 0 | 3 | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | 9 | 1 | 3 | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
Saundercock, Mr. William Henry | 13 | 0 | 3 | male | 20.0 | 0 | 0 | A/5. 2151 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Markun, Mr. Johann | 882 | 0 | 3 | male | 33.0 | 0 | 0 | 349257 | 7.8958 | NaN | S |
Dahlberg, Miss. Gerda Ulrika | 883 | 0 | 3 | female | 22.0 | 0 | 0 | 7552 | 10.5167 | NaN | S |
Sutehall, Mr. Henry Jr | 885 | 0 | 3 | male | 25.0 | 0 | 0 | SOTON/OQ 392076 | 7.0500 | NaN | S |
Rice, Mrs. William (Margaret Norton) | 886 | 0 | 3 | female | 39.0 | 0 | 5 | 382652 | 29.1250 | NaN | Q |
Dooley, Mr. Patrick | 891 | 0 | 3 | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
277 rows × 11 columns
That line of code is quite longwinded, so if you wanted to filter your DataFrame in a more concise way, you can use the .query()
method:
"Age >= 18 & Pclass == 3") titanic.query(
PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|
Name | |||||||||||
Braund, Mr. Owen Harris | 1 | 0 | 3 | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
Heikkinen, Miss. Laina | 3 | 1 | 3 | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
Allen, Mr. William Henry | 5 | 0 | 3 | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | 9 | 1 | 3 | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
Saundercock, Mr. William Henry | 13 | 0 | 3 | male | 20.0 | 0 | 0 | A/5. 2151 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Markun, Mr. Johann | 882 | 0 | 3 | male | 33.0 | 0 | 0 | 349257 | 7.8958 | NaN | S |
Dahlberg, Miss. Gerda Ulrika | 883 | 0 | 3 | female | 22.0 | 0 | 0 | 7552 | 10.5167 | NaN | S |
Sutehall, Mr. Henry Jr | 885 | 0 | 3 | male | 25.0 | 0 | 0 | SOTON/OQ 392076 | 7.0500 | NaN | S |
Rice, Mrs. William (Margaret Norton) | 886 | 0 | 3 | female | 39.0 | 0 | 5 | 382652 | 29.1250 | NaN | Q |
Dooley, Mr. Patrick | 891 | 0 | 3 | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
277 rows × 11 columns
In Pandas you can aggregate datasets using the .groupby()
method:
"Pclass").sum()["Survived"] titanic.groupby(
Pclass
1 136
2 87
3 119
Name: Survived, dtype: int64
Notice in the above line of code, we combined two methods. In Pandas, you can chain multiple methods together, much like dplyr’s pipline operator (%>%
) in R.
# Select the names of passengers in class 3 who are 65 years of age or older.
"Pclass == 3 & Age >= 65")["Name"] titanic.reset_index().query(
116 Connors, Mr. Patrick
280 Duane, Mr. Frank
851 Svensson, Mr. Johan
Name: Name, dtype: object
Exercises
- Filter the dataset to people where
Embarked
is Q. - Filter the dataset to people 18 years or older, and
Fare
is less than 10 - Filter the dataset to people with an above-median age.
- What is the highest value of
Fare
for female passengers in class 2?
Series
Let’s select the Ticket
column:
"Ticket"] titanic[
Name
Braund, Mr. Owen Harris A/5 21171
Cumings, Mrs. John Bradley (Florence Briggs Thayer) PC 17599
Heikkinen, Miss. Laina STON/O2. 3101282
Futrelle, Mrs. Jacques Heath (Lily May Peel) 113803
Allen, Mr. William Henry 373450
...
Montvila, Rev. Juozas 211536
Graham, Miss. Margaret Edith 112053
Johnston, Miss. Catherine Helen "Carrie" W./C. 6607
Behr, Mr. Karl Howell 111369
Dooley, Mr. Patrick 370376
Name: Ticket, Length: 891, dtype: object
When selecting a single column of the DataFrame, Pandas returns what is known as a Series. This is a data structure used to represent one-dimensional data, much like a list or NumPy array. They are more flexible than NumPy arrays because they can hold non-numeric data types. However, they are not as flexible as lists because they can only hold one datatype at a time. If you try to create a Series with values of different data types, Pandas will convert all the elements of the Series into strings.
# You can create series from lists, tuples, and NumPy arrays
= ["The", "quick", "brown", "fox"]
l = (3,1,4,1,5,9)
t = np.array(t)
a = ["this", 3, "will", True, "convert"]
mix
print(pd.Series(l))
print(pd.Series(t))
print(pd.Series(a))
print(pd.Series(mix)) #converted into strings
0 The
1 quick
2 brown
3 fox
dtype: object
0 3
1 1
2 4
3 1
4 5
5 9
dtype: int64
0 3
1 1
2 4
3 1
4 5
5 9
dtype: int64
0 this
1 3
2 will
3 True
4 convert
dtype: object