Introduction to Pandas
- 61 minsThis is the first tutorial of the Explained! series.
I will be cataloging all the work I do with regards to PyLibraries and will share it here or on my Github.
I will also be updating this post as and when I work on Pandas.
That being said, Dive in!
Series
import pandas as pd
import numpy as np
import random
first_series = pd.Series([1,2,3, np.nan ,"hello"])
first_series
0 1
1 2
2 3
3 NaN
4 hello
dtype: object
series = pd.Series([1,2,3, np.nan ,"hello"], index = ['A','B','C','Unknown','String'])
series
#indexing the Series with custom values
A 1
B 2
C 3
Unknown NaN
String hello
dtype: object
dict = {"Python": "Fun", "C++": "Outdated","Coding":"Hmm.."}
series = pd.Series(dict)
series
# Dict to pandas Series
Python Fun
C++ Outdated
Coding Hmm..
dtype: object
series[['Coding','Python']]
Coding Hmm..
Python Fun
dtype: object
series.index
Index(['Python', 'C++', 'Coding'], dtype='object')
series.values
array(['Fun', 'Outdated', 'Hmm..'], dtype=object)
series.describe()
count 3
unique 3
top Outdated
freq 1
dtype: object
#Series is a mutable data structures and you can easily change any item’s value:
series['Coding'] = 'Awesome'
series
Python Fun
C++ Outdated
Coding Awesome
dtype: object
# add new values:
series['Java'] = 'Okay'
series
Python Fun
C++ Outdated
Coding Awesome
Java Okay
dtype: object
# If it is necessary to apply any mathematical operation to Series items, you may done it like below:
num_series = pd.Series([1,2,3,4,5,6,None])
num_series_changed = num_series/2
num_series_changed
0 0.5
1 1.0
2 1.5
3 2.0
4 2.5
5 3.0
6 NaN
dtype: float64
# NULL/NaN checking can be performed with isnull() and notnull().
print(series.isnull())
print(num_series.notnull())
print(num_series_changed.notnull())
Python False
C++ False
Coding False
Java False
dtype: bool
0 True
1 True
2 True
3 True
4 True
5 True
6 False
dtype: bool
0 True
1 True
2 True
3 True
4 True
5 True
6 False
dtype: bool
DataFrames
data = {'year': [1990, 1994, 1998, 2002, 2006, 2010, 2014],
'winner': ['Germany', 'Brazil', 'France', 'Brazil','Italy', 'Spain', 'Germany'],
'runner-up': ['Argentina', 'Italy', 'Brazil','Germany', 'France', 'Netherlands', 'Argentina'],
'final score': ['1-0', '0-0 (pen)', '3-0', '2-0', '1-1 (pen)', '1-0', '1-0'] }
world_cup = pd.DataFrame(data, columns=['year', 'winner', 'runner-up', 'final score'])
world_cup
year | winner | runner-up | final score | |
---|---|---|---|---|
0 | 1990 | Germany | Argentina | 1-0 |
1 | 1994 | Brazil | Italy | 0-0 (pen) |
2 | 1998 | France | Brazil | 3-0 |
3 | 2002 | Brazil | Germany | 2-0 |
4 | 2006 | Italy | France | 1-1 (pen) |
5 | 2010 | Spain | Netherlands | 1-0 |
6 | 2014 | Germany | Argentina | 1-0 |
# Another way to set a DataFrame is the using of Python list of dictionaries:
data_2 = [{'year': 1990, 'winner': 'Germany', 'runner-up': 'Argentina', 'final score': '1-0'},
{'year': 1994, 'winner': 'Brazil', 'runner-up': 'Italy', 'final score': '0-0 (pen)'},
{'year': 1998, 'winner': 'France', 'runner-up': 'Brazil', 'final score': '3-0'},
{'year': 2002, 'winner': 'Brazil', 'runner-up': 'Germany', 'final score': '2-0'},
{'year': 2006, 'winner': 'Italy','runner-up': 'France', 'final score': '1-1 (pen)'},
{'year': 2010, 'winner': 'Spain', 'runner-up': 'Netherlands', 'final score': '1-0'},
{'year': 2014, 'winner': 'Germany', 'runner-up': 'Argentina', 'final score': '1-0'}
]
world_cup = pd.DataFrame(data_2)
world_cup
final score | runner-up | winner | year | |
---|---|---|---|---|
0 | 1-0 | Argentina | Germany | 1990 |
1 | 0-0 (pen) | Italy | Brazil | 1994 |
2 | 3-0 | Brazil | France | 1998 |
3 | 2-0 | Germany | Brazil | 2002 |
4 | 1-1 (pen) | France | Italy | 2006 |
5 | 1-0 | Netherlands | Spain | 2010 |
6 | 1-0 | Argentina | Germany | 2014 |
print("First 2 Rows: ",end="\n\n")
print (world_cup.head(2),end="\n\n")
print ("Last 2 Rows : ",end="\n\n")
print (world_cup.tail(2),end="\n\n")
print("Using slicing : ",end="\n\n")
print (world_cup[2:4])
First 2 Rows:
final score runner-up winner year
0 1-0 Argentina Germany 1990
1 0-0 (pen) Italy Brazil 1994
Last 2 Rows :
final score runner-up winner year
5 1-0 Netherlands Spain 2010
6 1-0 Argentina Germany 2014
Using slicing :
final score runner-up winner year
2 3-0 Brazil France 1998
3 2-0 Germany Brazil 2002
CSV
Reading:
df = pd.read_csv("path\to\the\csv\file\for\reading")
Writing:
df.to_csv("path\to\the\folder\where\you\want\save\csv\file")
TXT file(s)
(txt file can be read as a CSV file with other separator (delimiter); we suppose below that columns are separated by tabulation):
Reading:
df = pd.read_csv("path\to\the\txt\file\for\reading", sep='\t')
Writing:
df.to_csv("path\to\the\folder\where\you\want\save\txt\file", sep='\t')
JSON files
(an open-standard format that uses human-readable text to transmit data objects consisting of attribute–value pairs. It is the most common data format used for asynchronous browser/server communication. By its view it is very similar to Python dictionary)
Reading:
df = pd.read_json("path\to\the\json\file\for\reading", sep='\t')
Writing:
df.to_json("path\to\the\folder\where\you\want\save\json\file", sep='\t')
# To write world_cup Dataframe to a CSV File
world_cup.to_csv("worldcup.csv")
# To save CSV file without index use index=False attribute
print("File Written!",end="\n\n")
#To check if it was written
import os
print(os.path.exists('worldcup.csv'))
# reading from it in a new dataframe df
df = pd.read_csv('worldcup.csv')
print(df.head())
File Written!
True
Unnamed: 0 final score runner-up winner year
0 0 1-0 Argentina Germany 1990
1 1 0-0 (pen) Italy Brazil 1994
2 2 3-0 Brazil France 1998
3 3 2-0 Germany Brazil 2002
4 4 1-1 (pen) France Italy 2006
# We can also load the data without index as :
df = pd.read_csv('worldcup.csv',index_col=0)
print(df)
final score runner-up winner year
0 1-0 Argentina Germany 1990
1 0-0 (pen) Italy Brazil 1994
2 3-0 Brazil France 1998
3 2-0 Germany Brazil 2002
4 1-1 (pen) France Italy 2006
5 1-0 Netherlands Spain 2010
6 1-0 Argentina Germany 2014
You can find the dataset used in the following code here.
movies=pd.read_csv("data/movies.csv",encoding = "ISO-8859-1")
# encoding is added only for this specific dataset because it gave error with utf-8
movies['release_date'] = movies['release_date'].map(pd.to_datetime)
print(movies.head(20))
#print(movies.describe())
user_id movie_id rating timestamp age gender occupation zip_code \
0 196 242 3 881250949 49.0 M writer 55105
1 305 242 5 886307828 23.0 M programmer 94086
2 6 242 4 883268170 42.0 M executive 98101
3 234 242 4 891033261 60.0 M retired 94702
4 63 242 3 875747190 31.0 M marketing 75240
5 181 242 1 878961814 26.0 M executive 21218
6 201 242 4 884110598 27.0 M writer E2A4H
7 249 242 5 879571438 25.0 M student 84103
8 13 242 2 881515193 47.0 M educator 29206
9 279 242 3 877756647 33.0 M programmer 85251
10 145 242 5 875269755 31.0 M entertainment V3N4P
11 90 242 4 891382267 60.0 M educator 78155
12 271 242 4 885844495 51.0 M engineer 22932
13 18 242 5 880129305 35.0 F other 37212
14 1 242 5 889751633 NaN M NaN 85711
15 207 242 4 890793823 39.0 M marketing 92037
16 14 242 4 876964570 45.0 M scientist 55106
17 113 242 2 875075887 47.0 M executive 95032
18 123 242 5 879809053 NaN F artist 20008
19 296 242 4 884196057 43.0 F administrator 16803
movie_title release_date ... Fantasy Film-Noir Horror Musical \
0 Kolya (1996) 1997-01-24 ... 0 0 0 0
1 Kolya (1996) 1997-01-24 ... 0 0 0 0
2 Kolya (1996) 1997-01-24 ... 0 0 0 0
3 Kolya (1996) 1997-01-24 ... 0 0 0 0
4 Kolya (1996) 1997-01-24 ... 0 0 0 0
5 Kolya (1996) 1997-01-24 ... 0 0 0 0
6 Kolya (1996) 1997-01-24 ... 0 0 0 0
7 Kolya (1996) 1997-01-24 ... 0 0 0 0
8 Kolya (1996) 1997-01-24 ... 0 0 0 0
9 Kolya (1996) 1997-01-24 ... 0 0 0 0
10 Kolya (1996) 1997-01-24 ... 0 0 0 0
11 Kolya (1996) 1997-01-24 ... 0 0 0 0
12 Kolya (1996) 1997-01-24 ... 0 0 0 0
13 Kolya (1996) 1997-01-24 ... 0 0 0 0
14 Kolya (1996) 1997-01-24 ... 0 0 0 0
15 Kolya (1996) 1997-01-24 ... 0 0 0 0
16 Kolya (1996) 1997-01-24 ... 0 0 0 0
17 Kolya (1996) 1997-01-24 ... 0 0 0 0
18 Kolya (1996) 1997-01-24 ... 0 0 0 0
19 Kolya (1996) 1997-01-24 ... 0 0 0 0
Mystery Romance Sci-Fi Thriller War Western
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
4 0 0 0 0 0 0
5 0 0 0 0 0 0
6 0 0 0 0 0 0
7 0 0 0 0 0 0
8 0 0 0 0 0 0
9 0 0 0 0 0 0
10 0 0 0 0 0 0
11 0 0 0 0 0 0
12 0 0 0 0 0 0
13 0 0 0 0 0 0
14 0 0 0 0 0 0
15 0 0 0 0 0 0
16 0 0 0 0 0 0
17 0 0 0 0 0 0
18 0 0 0 0 0 0
19 0 0 0 0 0 0
[20 rows x 30 columns]
movies_rating = movies['rating']
# Here we are showing only one column, i.e. a Series
print ('type:', type(movies_rating))
movies_rating.head()
type: <class 'pandas.core.series.Series'>
0 3
1 5
2 4
3 4
4 3
Name: rating, dtype: int64
# Filtering data
# Let's display only women
movies_user_female = movies[movies['gender']=='F']
print(movies_user_female.head())
user_id movie_id rating timestamp age gender occupation zip_code \
13 18 242 5 880129305 35.0 F other 37212
18 123 242 5 879809053 NaN F artist 20008
19 296 242 4 884196057 43.0 F administrator 16803
21 270 242 5 876953744 18.0 F student 63119
22 240 242 5 885775683 23.0 F educator 20784
movie_title release_date ... Fantasy Film-Noir Horror Musical \
13 Kolya (1996) 1997-01-24 ... 0 0 0 0
18 Kolya (1996) 1997-01-24 ... 0 0 0 0
19 Kolya (1996) 1997-01-24 ... 0 0 0 0
21 Kolya (1996) 1997-01-24 ... 0 0 0 0
22 Kolya (1996) 1997-01-24 ... 0 0 0 0
Mystery Romance Sci-Fi Thriller War Western
13 0 0 0 0 0 0
18 0 0 0 0 0 0
19 0 0 0 0 0 0
21 0 0 0 0 0 0
22 0 0 0 0 0 0
[5 rows x 30 columns]
#to see all the different values possible for a given column
occupation_list = movies['occupation']
print(occupation_list)
0 writer
1 programmer
2 executive
3 retired
4 marketing
5 executive
6 writer
7 student
8 educator
9 programmer
10 entertainment
11 educator
12 engineer
13 other
14 NaN
15 marketing
16 scientist
17 executive
18 artist
19 administrator
20 student
21 student
22 educator
23 NaN
24 writer
25 NaN
26 NaN
27 marketing
28 administrator
29 student
...
99970 educator
99971 other
99972 other
99973 other
99974 administrator
99975 artist
99976 artist
99977 artist
99978 artist
99979 artist
99980 artist
99981 entertainment
99982 student
99983 student
99984 artist
99985 artist
99986 artist
99987 student
99988 librarian
99989 writer
99990 NaN
99991 artist
99992 other
99993 other
99994 student
99995 student
99996 student
99997 student
99998 writer
99999 engineer
Name: occupation, Length: 100000, dtype: object
Work with indexes and MultiIndex option
import random
indexes = [random.randrange(0,100) for i in range(5)]
data = [{i:random.randint(0,10) for i in 'ABCDE'} for i in range(5)]
df = pd.DataFrame(data, index=[1,2,3,4,5])
df
A | B | C | D | E | |
---|---|---|---|---|---|
1 | 5 | 8 | 6 | 4 | 2 |
2 | 1 | 3 | 9 | 8 | 2 |
3 | 8 | 7 | 2 | 1 | 2 |
4 | 3 | 8 | 1 | 3 | 8 |
5 | 0 | 7 | 4 | 4 | 2 |
movies_user_gender_male = movies[movies['gender']=='M']
movies_user_gender_male_dup = movies_user_gender_male.drop_duplicates(keep=False)
print(movies_user_gender_male.head())
# From this we can clearly see age has missing value and that from 100,000 the data reduced to 74260,
# due to filtering and removing duplicates
user_id movie_id rating timestamp age gender occupation zip_code \
0 196 242 3 881250949 49.0 M writer 55105
1 305 242 5 886307828 23.0 M programmer 94086
2 6 242 4 883268170 42.0 M executive 98101
3 234 242 4 891033261 60.0 M retired 94702
4 63 242 3 875747190 31.0 M marketing 75240
movie_title release_date ... Fantasy Film-Noir Horror Musical \
0 Kolya (1996) 1997-01-24 ... 0 0 0 0
1 Kolya (1996) 1997-01-24 ... 0 0 0 0
2 Kolya (1996) 1997-01-24 ... 0 0 0 0
3 Kolya (1996) 1997-01-24 ... 0 0 0 0
4 Kolya (1996) 1997-01-24 ... 0 0 0 0
Mystery Romance Sci-Fi Thriller War Western
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
4 0 0 0 0 0 0
[5 rows x 30 columns]
#gender = female and age between 30 and 40
gender_required = ['F']
filtered_df = movies[((movies['gender'] == 'F') & (movies['age'] > 30) & (movies['age'] <40))]
filtered_df
user_id | movie_id | rating | timestamp | age | gender | occupation | zip_code | movie_title | release_date | ... | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
13 | 18 | 242 | 5 | 880129305 | 35.0 | F | other | 37212 | Kolya (1996) | 1997-01-24 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
27 | 129 | 242 | 4 | 883243972 | 36.0 | F | marketing | 07039 | Kolya (1996) | 1997-01-24 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
39 | 34 | 242 | 5 | 888601628 | 38.0 | F | administrator | 42141 | Kolya (1996) | 1997-01-24 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
42 | 209 | 242 | 4 | 883589606 | 33.0 | F | educator | 85710 | Kolya (1996) | 1997-01-24 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
110 | 861 | 242 | 5 | 881274504 | 38.0 | F | NaN | 14085 | Kolya (1996) | 1997-01-24 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
147 | 11 | 393 | 4 | 891905222 | 39.0 | F | other | 30329 | Mrs. Doubtfire (1993) | 1993-01-01 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
157 | 269 | 393 | 1 | 891451036 | 31.0 | F | librarian | 43201 | Mrs. Doubtfire (1993) | 1993-01-01 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
160 | 5 | 393 | 2 | 875636265 | 33.0 | F | other | 15213 | Mrs. Doubtfire (1993) | 1993-01-01 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
161 | 18 | 393 | 3 | 880130930 | 35.0 | F | NaN | 37212 | Mrs. Doubtfire (1993) | 1993-01-01 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
167 | 151 | 393 | 2 | 879528692 | 38.0 | F | administrator | 48103 | Mrs. Doubtfire (1993) | 1993-01-01 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
178 | 152 | 393 | 5 | 884018430 | 33.0 | F | educator | 68767 | Mrs. Doubtfire (1993) | 1993-01-01 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
187 | 330 | 393 | 4 | 876547004 | 35.0 | F | educator | 33884 | Mrs. Doubtfire (1993) | 1993-01-01 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
210 | 450 | 393 | 4 | 882812349 | 35.0 | F | educator | 11758 | Mrs. Doubtfire (1993) | 1993-01-01 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
213 | 457 | 393 | 3 | 882548583 | 33.0 | F | salesman | 30011 | Mrs. Doubtfire (1993) | 1993-01-01 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
241 | 577 | 393 | 4 | 880475363 | 36.0 | F | student | 77845 | Mrs. Doubtfire (1993) | 1993-01-01 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
246 | 593 | 393 | 4 | 886194041 | 31.0 | F | educator | 68767 | Mrs. Doubtfire (1993) | 1993-01-01 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
267 | 716 | 393 | 3 | 879796596 | 36.0 | F | administrator | 44265 | Mrs. Doubtfire (1993) | 1993-01-01 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
280 | 796 | 393 | 4 | 893218933 | 32.0 | F | writer | 33755 | Mrs. Doubtfire (1993) | 1993-01-01 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
325 | 264 | 381 | 4 | 886123596 | 36.0 | F | writer | 90064 | Muriel's Wedding (1994) | 1994-01-01 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
328 | 5 | 381 | 1 | 875636540 | 33.0 | F | other | 15213 | Muriel's Wedding (1994) | 1994-01-01 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
329 | 18 | 381 | 4 | 880131474 | 35.0 | F | other | 37212 | Muriel's Wedding (1994) | 1994-01-01 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
330 | 256 | 381 | 5 | 882165135 | 35.0 | F | none | 39042 | Muriel's Wedding (1994) | 1994-01-01 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
331 | 151 | 381 | 5 | 879528754 | 38.0 | F | administrator | 48103 | Muriel's Wedding (1994) | 1994-01-01 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
358 | 450 | 381 | 2 | 882374497 | 35.0 | F | educator | 11758 | Muriel's Wedding (1994) | 1994-01-01 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
384 | 716 | 381 | 4 | 879795644 | 36.0 | F | administrator | 44265 | Muriel's Wedding (1994) | 1994-01-01 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
387 | 786 | 381 | 3 | 882843397 | 36.0 | F | engineer | 01754 | Muriel's Wedding (1994) | 1994-01-01 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
388 | 796 | 381 | 3 | 893047208 | 32.0 | F | writer | 33755 | Muriel's Wedding (1994) | 1994-01-01 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
397 | 861 | 381 | 4 | 881274780 | 38.0 | F | student | 14085 | Muriel's Wedding (1994) | 1994-01-01 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
406 | 911 | 381 | 5 | 892839846 | 37.0 | F | writer | 53210 | Muriel's Wedding (1994) | 1994-01-01 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
423 | 79 | 251 | 5 | 891271545 | 39.0 | F | administrator | 03755 | Shall We Dance? (1996) | 1997-07-11 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
99139 | 5 | 374 | 3 | 875636905 | 33.0 | F | other | 15213 | Mighty Morphin Power Rangers: The Movie (1995) | 1995-01-01 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
99146 | 911 | 374 | 1 | 892841118 | 37.0 | F | writer | 53210 | Mighty Morphin Power Rangers: The Movie (1995) | 1995-01-01 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
99189 | 18 | 958 | 5 | 880129731 | 35.0 | F | other | 37212 | To Live (Huozhe) (1994) | 1994-01-01 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
99235 | 450 | 1249 | 3 | 882812821 | 35.0 | F | educator | 11758 | For Love or Money (1993) | 1993-01-01 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
99249 | 796 | 1055 | 3 | 893188577 | 32.0 | F | writer | 33755 | Simple Twist of Fate, A (1994) | 1994-01-01 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
99255 | 264 | 1474 | 2 | 886123728 | 36.0 | F | writer | 90064 | Nina Takes a Lover (1994) | 1994-01-01 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
99261 | 264 | 1475 | 2 | 886123596 | 36.0 | F | writer | 90064 | Bhaji on the Beach (1993) | 1993-01-01 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
99279 | 938 | 1254 | 1 | 891357019 | 38.0 | F | technician | 55038 | Gone Fishin' (1997) | 1997-05-30 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
99294 | 269 | 1479 | 2 | 891451111 | 31.0 | F | librarian | 43201 | Reckless (1995) | 1995-01-01 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
99298 | 450 | 1479 | 3 | 882377479 | 35.0 | F | educator | 11758 | Reckless (1995) | 1995-01-01 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
99313 | 5 | 267 | 4 | 875635064 | 33.0 | F | other | 15213 | unknown | NaT | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
99324 | 18 | 113 | 5 | 880129628 | 35.0 | F | other | 37212 | Horseman on the Roof, The (Hussard sur le toit... | 1996-04-19 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
99333 | 18 | 973 | 3 | 880129595 | 35.0 | F | other | 37212 | Grateful Dead (1995) | 1996-10-18 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
99352 | 151 | 1297 | 1 | 879542847 | 38.0 | F | administrator | 48103 | Love Affair (1994) | 1994-01-01 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
99359 | 450 | 1297 | 4 | 882812635 | 35.0 | F | educator | 11758 | Love Affair (1994) | 1994-01-01 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
99361 | 796 | 1297 | 2 | 893047504 | 32.0 | F | writer | 33755 | Love Affair (1994) | 1994-01-01 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
99364 | 151 | 1299 | 4 | 879543423 | 38.0 | F | administrator | 48103 | Penny Serenade (1941) | 1941-01-01 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
99370 | 796 | 1299 | 2 | 892676043 | 32.0 | F | writer | 33755 | Penny Serenade (1941) | 1941-01-01 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
99452 | 688 | 1127 | 5 | 884153606 | 37.0 | F | administrator | 60476 | Truman Show, The (1998) | 1998-01-01 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
99477 | 34 | 1024 | 5 | 888602618 | 38.0 | F | administrator | 42141 | Mrs. Dalloway (1997) | 1997-01-01 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
99501 | 129 | 1176 | 4 | 883244059 | 36.0 | F | marketing | 07039 | Welcome To Sarajevo (1997) | 1997-01-01 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
99541 | 356 | 1294 | 4 | 891405721 | 32.0 | F | homemaker | 92688 | Ayn Rand: A Sense of Life (1997) | 1998-02-13 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
99561 | 796 | 1415 | 3 | 893219254 | 32.0 | F | writer | 33755 | Next Karate Kid, The (1994) | 1994-01-01 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
99630 | 450 | 1518 | 4 | 887138957 | 35.0 | F | educator | 11758 | Losing Isaiah (1995) | 1995-01-01 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
99642 | 577 | 1517 | 3 | 880475644 | 36.0 | F | student | 77845 | Race the Sun (1996) | 1996-01-01 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
99648 | 450 | 1521 | 3 | 882812350 | 35.0 | F | educator | 11758 | Mr. Wonderful (1993) | 1993-01-01 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
99660 | 796 | 1522 | 3 | 893194740 | 32.0 | F | writer | 33755 | Trial by Jury (1994) | 1994-01-01 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
99695 | 577 | 1531 | 4 | 880475408 | 36.0 | F | student | 77845 | Far From Home: The Adventures of Yellow Dog (1... | 1995-01-01 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
99852 | 450 | 1603 | 3 | 887139728 | 35.0 | F | educator | 11758 | Angela (1995) | 1996-02-16 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
99981 | 839 | 1664 | 1 | 875752902 | 38.0 | F | entertainment | 90814 | 8 Heads in a Duffel Bag (1997) | 1997-04-18 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5183 rows × 30 columns
Note
In the above fragment you HAVE TO ADD parantheses to each and every argument that is being compared else you will get an error.
As you can see after filtering result tables (i.e. DataFrames) have non-ordered indexes. To fix this trouble you may write the following:
filtered_df = filtered_df.reset_index()
filtered_df.head(10)
index | user_id | movie_id | rating | timestamp | age | gender | occupation | zip_code | movie_title | ... | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 13 | 18 | 242 | 5 | 880129305 | 35.0 | F | other | 37212 | Kolya (1996) | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 27 | 129 | 242 | 4 | 883243972 | 36.0 | F | marketing | 07039 | Kolya (1996) | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 39 | 34 | 242 | 5 | 888601628 | 38.0 | F | administrator | 42141 | Kolya (1996) | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 42 | 209 | 242 | 4 | 883589606 | 33.0 | F | educator | 85710 | Kolya (1996) | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 110 | 861 | 242 | 5 | 881274504 | 38.0 | F | NaN | 14085 | Kolya (1996) | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 | 147 | 11 | 393 | 4 | 891905222 | 39.0 | F | other | 30329 | Mrs. Doubtfire (1993) | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
6 | 157 | 269 | 393 | 1 | 891451036 | 31.0 | F | librarian | 43201 | Mrs. Doubtfire (1993) | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
7 | 160 | 5 | 393 | 2 | 875636265 | 33.0 | F | other | 15213 | Mrs. Doubtfire (1993) | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
8 | 161 | 18 | 393 | 3 | 880130930 | 35.0 | F | NaN | 37212 | Mrs. Doubtfire (1993) | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
9 | 167 | 151 | 393 | 2 | 879528692 | 38.0 | F | administrator | 48103 | Mrs. Doubtfire (1993) | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
10 rows × 31 columns
# set 'user_id' 'movie_id' as index
filtered_df_new = filtered_df.set_index(['user_id','movie_id'])
filtered_df_new.head(10)
# Note that set_index takes only a list as an argument to it.
# if you remove the [] then only the first argument is set as the index.
index | rating | timestamp | age | gender | occupation | zip_code | movie_title | release_date | IMDb_URL | ... | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
user_id | movie_id | |||||||||||||||||||||
18 | 242 | 13 | 5 | 880129305 | 35.0 | F | other | 37212 | Kolya (1996) | 1997-01-24 | http://us.imdb.com/M/title-exact?Kolya%20(1996) | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
129 | 242 | 27 | 4 | 883243972 | 36.0 | F | marketing | 07039 | Kolya (1996) | 1997-01-24 | http://us.imdb.com/M/title-exact?Kolya%20(1996) | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
34 | 242 | 39 | 5 | 888601628 | 38.0 | F | administrator | 42141 | Kolya (1996) | 1997-01-24 | http://us.imdb.com/M/title-exact?Kolya%20(1996) | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
209 | 242 | 42 | 4 | 883589606 | 33.0 | F | educator | 85710 | Kolya (1996) | 1997-01-24 | http://us.imdb.com/M/title-exact?Kolya%20(1996) | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
861 | 242 | 110 | 5 | 881274504 | 38.0 | F | NaN | 14085 | Kolya (1996) | 1997-01-24 | http://us.imdb.com/M/title-exact?Kolya%20(1996) | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
11 | 393 | 147 | 4 | 891905222 | 39.0 | F | other | 30329 | Mrs. Doubtfire (1993) | 1993-01-01 | http://us.imdb.com/M/title-exact?Mrs.%20Doubtf... | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
269 | 393 | 157 | 1 | 891451036 | 31.0 | F | librarian | 43201 | Mrs. Doubtfire (1993) | 1993-01-01 | http://us.imdb.com/M/title-exact?Mrs.%20Doubtf... | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 | 393 | 160 | 2 | 875636265 | 33.0 | F | other | 15213 | Mrs. Doubtfire (1993) | 1993-01-01 | http://us.imdb.com/M/title-exact?Mrs.%20Doubtf... | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
18 | 393 | 161 | 3 | 880130930 | 35.0 | F | NaN | 37212 | Mrs. Doubtfire (1993) | 1993-01-01 | http://us.imdb.com/M/title-exact?Mrs.%20Doubtf... | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
151 | 393 | 167 | 2 | 879528692 | 38.0 | F | administrator | 48103 | Mrs. Doubtfire (1993) | 1993-01-01 | http://us.imdb.com/M/title-exact?Mrs.%20Doubtf... | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
10 rows × 29 columns
# By default, `set_index()` returns a new DataFrame.
# so you’ll have to specify if you’d like the changes to occur in place.
# Here we used filtered_df_new to get the new dataframe and now see the type of filtererd_df_new
print(type(filtered_df_new.index))
<class 'pandas.core.indexes.multi.MultiIndex'>
Notice here that we now have a new sort of ‘index’ which is MultiIndex
, which contains information about indexing of DataFrame and allows manipulating with this data.
filtered_df_new.index.names
# Gives you the names of the two index values we set as a FrozenList
FrozenList(['user_id', 'movie_id'])
Method get_level_values()
allows to get all values for the corresponding index level. get_level_values(0)
corresponds to ‘user_id’ and get_level_values(1)
corresponds to ‘movie_id’
print(filtered_df_new.index.get_level_values(0))
print(filtered_df_new.index.get_level_values(1))
Int64Index([ 18, 129, 34, 209, 861, 11, 269, 5, 18, 151,
...
129, 356, 796, 450, 577, 450, 796, 577, 450, 839],
dtype='int64', name='user_id', length=5183)
Int64Index([ 242, 242, 242, 242, 242, 393, 393, 393, 393, 393,
...
1176, 1294, 1415, 1518, 1517, 1521, 1522, 1531, 1603, 1664],
dtype='int64', name='movie_id', length=5183)
Selection by label and position
Object selection in pandas is now supported by three types of multi-axis indexing.
-
.loc
works on labels in the index; -
.iloc
works on the positions in the index (so it only takes integers);
The sequence of the following examples demonstrates how we can manipulate with DataFrame’s rows. At first let’s get the first row of movies:
movies.loc[0]
user_id 196
movie_id 242
rating 3
timestamp 881250949
age 49
gender M
occupation writer
zip_code 55105
movie_title Kolya (1996)
release_date 1997-01-24 00:00:00
IMDb_URL http://us.imdb.com/M/title-exact?Kolya%20(1996)
unknown 0
Action 0
Adventure 0
Animation 0
Childrens 0
Comedy 1
Crime 0
Documentary 0
Drama 0
Fantasy 0
Film-Noir 0
Horror 0
Musical 0
Mystery 0
Romance 0
Sci-Fi 0
Thriller 0
War 0
Western 0
Name: 0, dtype: object
movies.loc[1:3]
user_id | movie_id | rating | timestamp | age | gender | occupation | zip_code | movie_title | release_date | ... | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 305 | 242 | 5 | 886307828 | 23.0 | M | programmer | 94086 | Kolya (1996) | 1997-01-24 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 6 | 242 | 4 | 883268170 | 42.0 | M | executive | 98101 | Kolya (1996) | 1997-01-24 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 234 | 242 | 4 | 891033261 | 60.0 | M | retired | 94702 | Kolya (1996) | 1997-01-24 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 rows × 30 columns
If you want to return specific columns then you have to specify them as a separate argument of .loc
movies.loc[1:3 , 'movie_title']
1 Kolya (1996)
2 Kolya (1996)
3 Kolya (1996)
Name: movie_title, dtype: object
movies.loc[1:5 , ['movie_title','age','gender']]
# If more than one column is to be selected then you have to give the second argument of .loc as a list
movie_title | age | gender | |
---|---|---|---|
1 | Kolya (1996) | 23.0 | M |
2 | Kolya (1996) | 42.0 | M |
3 | Kolya (1996) | 60.0 | M |
4 | Kolya (1996) | 31.0 | M |
5 | Kolya (1996) | 26.0 | M |
# movies.iloc[1:5 , ['movie_title','age','gender']]
# Gives error as iloc only uses integer values
movies.iloc[0]
user_id 196
movie_id 242
rating 3
timestamp 881250949
age 49
gender M
occupation writer
zip_code 55105
movie_title Kolya (1996)
release_date 1997-01-24 00:00:00
IMDb_URL http://us.imdb.com/M/title-exact?Kolya%20(1996)
unknown 0
Action 0
Adventure 0
Animation 0
Childrens 0
Comedy 1
Crime 0
Documentary 0
Drama 0
Fantasy 0
Film-Noir 0
Horror 0
Musical 0
Mystery 0
Romance 0
Sci-Fi 0
Thriller 0
War 0
Western 0
Name: 0, dtype: object
movies.iloc[1:5]
user_id | movie_id | rating | timestamp | age | gender | occupation | zip_code | movie_title | release_date | ... | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 305 | 242 | 5 | 886307828 | 23.0 | M | programmer | 94086 | Kolya (1996) | 1997-01-24 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 6 | 242 | 4 | 883268170 | 42.0 | M | executive | 98101 | Kolya (1996) | 1997-01-24 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 234 | 242 | 4 | 891033261 | 60.0 | M | retired | 94702 | Kolya (1996) | 1997-01-24 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 63 | 242 | 3 | 875747190 | 31.0 | M | marketing | 75240 | Kolya (1996) | 1997-01-24 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 rows × 30 columns
# movies.select(lambda x: x%2==0).head() is the same as :
movies.loc[movies.index.map(lambda x: x%2==0)].head()
# .select() has been deprecated for now and will be completely removed in future updates so use .loc
user_id | movie_id | rating | timestamp | age | gender | occupation | zip_code | movie_title | release_date | ... | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 196 | 242 | 3 | 881250949 | 49.0 | M | writer | 55105 | Kolya (1996) | 1997-01-24 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 6 | 242 | 4 | 883268170 | 42.0 | M | executive | 98101 | Kolya (1996) | 1997-01-24 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 63 | 242 | 3 | 875747190 | 31.0 | M | marketing | 75240 | Kolya (1996) | 1997-01-24 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
6 | 201 | 242 | 4 | 884110598 | 27.0 | M | writer | E2A4H | Kolya (1996) | 1997-01-24 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
8 | 13 | 242 | 2 | 881515193 | 47.0 | M | educator | 29206 | Kolya (1996) | 1997-01-24 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 30 columns
Working with Missing Data
Pandas primarily uses the value np.nan to represent missing data (in table missed/empty value are marked by NaN). It is by default not included in computations. Missing data creates many issues at mathematical or computational tasks with DataFrames and Series and it’s important to know how fight with these values.
ages = movies['age']
sum(ages)
nan
This is because there are so many cases where Age isn’t given and hence takes on the value of np.nan. We can use fillna()
a very effecient pandas method for filling missing values
ages = movies['age'].fillna(0)
sum(ages)
3089983.0
This fills all the values with 0 and calculates the sum. To remain only rows with non-null values you can use method dropna()
ages = movies['age'].dropna()
sum(ages)
3089983.0
movies_nonnull = movies.dropna()
movies_nonnull.head(20)
#14th value was dropped because it had a missing value in a column
user_id | movie_id | rating | timestamp | age | gender | occupation | zip_code | movie_title | release_date | ... | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 196 | 242 | 3 | 881250949 | 49.0 | M | writer | 55105 | Kolya (1996) | 1997-01-24 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 305 | 242 | 5 | 886307828 | 23.0 | M | programmer | 94086 | Kolya (1996) | 1997-01-24 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 6 | 242 | 4 | 883268170 | 42.0 | M | executive | 98101 | Kolya (1996) | 1997-01-24 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 234 | 242 | 4 | 891033261 | 60.0 | M | retired | 94702 | Kolya (1996) | 1997-01-24 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 63 | 242 | 3 | 875747190 | 31.0 | M | marketing | 75240 | Kolya (1996) | 1997-01-24 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 | 181 | 242 | 1 | 878961814 | 26.0 | M | executive | 21218 | Kolya (1996) | 1997-01-24 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
6 | 201 | 242 | 4 | 884110598 | 27.0 | M | writer | E2A4H | Kolya (1996) | 1997-01-24 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
7 | 249 | 242 | 5 | 879571438 | 25.0 | M | student | 84103 | Kolya (1996) | 1997-01-24 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
8 | 13 | 242 | 2 | 881515193 | 47.0 | M | educator | 29206 | Kolya (1996) | 1997-01-24 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
9 | 279 | 242 | 3 | 877756647 | 33.0 | M | programmer | 85251 | Kolya (1996) | 1997-01-24 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
10 | 145 | 242 | 5 | 875269755 | 31.0 | M | entertainment | V3N4P | Kolya (1996) | 1997-01-24 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
11 | 90 | 242 | 4 | 891382267 | 60.0 | M | educator | 78155 | Kolya (1996) | 1997-01-24 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
12 | 271 | 242 | 4 | 885844495 | 51.0 | M | engineer | 22932 | Kolya (1996) | 1997-01-24 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
13 | 18 | 242 | 5 | 880129305 | 35.0 | F | other | 37212 | Kolya (1996) | 1997-01-24 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
15 | 207 | 242 | 4 | 890793823 | 39.0 | M | marketing | 92037 | Kolya (1996) | 1997-01-24 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
16 | 14 | 242 | 4 | 876964570 | 45.0 | M | scientist | 55106 | Kolya (1996) | 1997-01-24 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
17 | 113 | 242 | 2 | 875075887 | 47.0 | M | executive | 95032 | Kolya (1996) | 1997-01-24 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
19 | 296 | 242 | 4 | 884196057 | 43.0 | F | administrator | 16803 | Kolya (1996) | 1997-01-24 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
20 | 154 | 242 | 3 | 879138235 | 25.0 | M | student | 53703 | Kolya (1996) | 1997-01-24 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
21 | 270 | 242 | 5 | 876953744 | 18.0 | F | student | 63119 | Kolya (1996) | 1997-01-24 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
20 rows × 30 columns
movies_notnull = movies.dropna(how='all',subset=['age','occupation'])
#Drops all nan values from movies belonging to age and occupation
movies_notnull.info()
#Notice how age and occupation now have nearly 6000 lesser values
<class 'pandas.core.frame.DataFrame'>
Int64Index: 99616 entries, 0 to 99999
Data columns (total 30 columns):
user_id 99616 non-null int64
movie_id 99616 non-null int64
rating 99616 non-null int64
timestamp 99616 non-null int64
age 93731 non-null float64
gender 99616 non-null object
occupation 93806 non-null object
zip_code 99616 non-null object
movie_title 99616 non-null object
release_date 99607 non-null datetime64[ns]
IMDb_URL 99603 non-null object
unknown 99616 non-null int64
Action 99616 non-null int64
Adventure 99616 non-null int64
Animation 99616 non-null int64
Childrens 99616 non-null int64
Comedy 99616 non-null int64
Crime 99616 non-null int64
Documentary 99616 non-null int64
Drama 99616 non-null int64
Fantasy 99616 non-null int64
Film-Noir 99616 non-null int64
Horror 99616 non-null int64
Musical 99616 non-null int64
Mystery 99616 non-null int64
Romance 99616 non-null int64
Sci-Fi 99616 non-null int64
Thriller 99616 non-null int64
War 99616 non-null int64
Western 99616 non-null int64
dtypes: datetime64[ns](1), float64(1), int64(23), object(5)
memory usage: 23.6+ MB
Thus, if how='all'
, we get DataFrame, where all values in both columns from subset are NaN.
If how='any'
, we get DataFrame, where at least one contains NaN.
movies.describe()
user_id | movie_id | rating | timestamp | age | unknown | Action | Adventure | Animation | Childrens | ... | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 100000.00000 | 100000.000000 | 100000.000000 | 1.000000e+05 | 93731.000000 | 100000.0000 | 100000.000000 | 100000.000000 | 100000.000000 | 100000.000000 | ... | 100000.000000 | 100000.000000 | 100000.000000 | 100000.000000 | 100000.000000 | 100000.000000 | 100000.00000 | 100000.00000 | 100000.000000 | 100000.000000 |
mean | 462.48475 | 425.530130 | 3.529860 | 8.835289e+08 | 32.966500 | 0.0001 | 0.255890 | 0.137530 | 0.036050 | 0.071820 | ... | 0.013520 | 0.017330 | 0.053170 | 0.049540 | 0.052450 | 0.194610 | 0.12730 | 0.21872 | 0.093980 | 0.018540 |
std | 266.61442 | 330.798356 | 1.125674 | 5.343856e+06 | 11.561809 | 0.0100 | 0.436362 | 0.344408 | 0.186416 | 0.258191 | ... | 0.115487 | 0.130498 | 0.224373 | 0.216994 | 0.222934 | 0.395902 | 0.33331 | 0.41338 | 0.291802 | 0.134894 |
min | 1.00000 | 1.000000 | 1.000000 | 8.747247e+08 | 7.000000 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.00000 | 0.000000 | 0.000000 |
25% | 254.00000 | 175.000000 | 3.000000 | 8.794487e+08 | 24.000000 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.00000 | 0.000000 | 0.000000 |
50% | 447.00000 | 322.000000 | 4.000000 | 8.828269e+08 | 30.000000 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.00000 | 0.000000 | 0.000000 |
75% | 682.00000 | 631.000000 | 4.000000 | 8.882600e+08 | 40.000000 | 0.0000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.00000 | 0.000000 | 0.000000 |
max | 943.00000 | 1682.000000 | 5.000000 | 8.932866e+08 | 73.000000 | 1.0000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.00000 | 1.00000 | 1.000000 | 1.000000 |
8 rows × 24 columns
At first, let’s find all unique dates in ‘release_date’
column of movies
and then select only dates in range lower than 1995.
movies['release_date'] = movies['release_date'].map(pd.to_datetime)
# We map it to_datetime as pandas has a set way to deal with dates and then we can effectively work with dates.
unique_dates = movies['release_date'].drop_duplicates().dropna()
# Drops duplicates and nan values
unique_dates
0 1997-01-24
117 1993-01-01
309 1994-01-01
409 1997-07-11
455 1986-01-01
785 1997-01-01
881 1987-01-01
1137 1979-01-01
1253 1996-04-26
1525 1995-01-01
1557 1996-03-08
1850 1996-11-15
2331 1990-01-01
2851 1971-01-01
2972 1978-01-01
3432 1997-07-04
3735 1996-04-12
4269 1996-12-18
4347 1996-04-23
4583 1996-10-04
4745 1997-06-27
4751 1997-01-31
4798 1996-06-28
4961 1988-01-01
5208 1995-10-30
5392 1996-02-09
5863 1996-09-28
6861 1997-05-09
7058 1996-10-11
7186 1997-08-15
...
96679 1996-09-24
96855 1997-01-29
96948 1996-09-04
97195 1996-09-16
97434 1997-12-18
97639 1998-03-17
97816 1996-06-05
98068 1996-12-15
98546 1998-04-03
98574 1996-05-17
98590 1998-03-10
98739 1996-10-26
98748 1998-01-23
98786 1998-03-14
98856 1932-01-01
98969 1996-01-15
99205 1996-04-02
99280 1998-02-20
99321 1997-04-22
99598 1998-10-09
99650 1998-02-01
99702 1996-07-22
99737 1926-01-01
99813 1998-01-21
99885 1998-02-11
99938 1986-04-26
99940 1998-03-06
99958 1996-09-18
99967 1996-02-28
99977 1997-04-30
Name: release_date, Length: 240, dtype: datetime64[ns]
# find dates with year lower/equal than 1995
unique_dates_1 = filter(lambda x: x.year <= 1995, unique_dates)
# filter() takes two arguments. First one should return only boolean values and the second one is the variable over which ititerates over.
# This basically takes unique_dates and uses the lambda function (here, it returns bool values) and filters True cases.
unique_dates_1
<filter at 0x1187af6a0>
Here we have used drop_duplicates()
method to select only unique
Series values. Then we can filter movies
with respect to release_date
condition. Each datetime
Python object possesses with attributes year
, month
, day
, etc. allowing to extract values of year, month, day, etc. from the date. We call the new DataFrame as old_movies
.
old_movies = movies[movies['release_date'].isin(unique_dates_1)]
old_movies.head()
user_id | movie_id | rating | timestamp | age | gender | occupation | zip_code | movie_title | release_date | ... | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
117 | 196 | 393 | 4 | 881251863 | 49.0 | M | writer | 55105 | Mrs. Doubtfire (1993) | 1993-01-01 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
118 | 22 | 393 | 4 | 878886989 | 25.0 | M | writer | 40206 | Mrs. Doubtfire (1993) | 1993-01-01 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
119 | 244 | 393 | 3 | 880607365 | 28.0 | M | technician | 80525 | Mrs. Doubtfire (1993) | 1993-01-01 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
120 | 298 | 393 | 4 | 884183099 | 44.0 | M | executive | 01581 | Mrs. Doubtfire (1993) | 1993-01-01 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
121 | 286 | 393 | 4 | 877534481 | 27.0 | M | student | 15217 | Mrs. Doubtfire (1993) | 1993-01-01 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 30 columns
Now we may filter DataFrame old_movies
by age
and rating
. Lets’ drop timestamp
, zip_code
# get all users with age less than 25 that rated old movies higher than 3
old_movies_watch = old_movies[(old_movies['age']<25) & (old_movies['rating']>3)]
# Drop timestamp and zip_code
old_movies_watch = old_movies_watch.drop(['timestamp', 'zip_code'],axis=1)
old_movies_watch.head()
user_id | movie_id | rating | age | gender | occupation | movie_title | release_date | IMDb_URL | unknown | ... | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
124 | 303 | 393 | 4 | 19.0 | M | student | Mrs. Doubtfire (1993) | 1993-01-01 | http://us.imdb.com/M/title-exact?Mrs.%20Doubtf... | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
135 | 276 | 393 | 4 | 21.0 | M | student | Mrs. Doubtfire (1993) | 1993-01-01 | http://us.imdb.com/M/title-exact?Mrs.%20Doubtf... | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
153 | 128 | 393 | 4 | 24.0 | F | marketing | Mrs. Doubtfire (1993) | 1993-01-01 | http://us.imdb.com/M/title-exact?Mrs.%20Doubtf... | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
162 | 130 | 393 | 5 | 20.0 | M | none | Mrs. Doubtfire (1993) | 1993-01-01 | http://us.imdb.com/M/title-exact?Mrs.%20Doubtf... | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
183 | 314 | 393 | 4 | 20.0 | F | student | Mrs. Doubtfire (1993) | 1993-01-01 | http://us.imdb.com/M/title-exact?Mrs.%20Doubtf... | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 28 columns
Pandas
has support for accelerating certain types of binary numerical and boolean operations using the numexpr
library (it uses smart chunking, caching, and multiple cores) and the bottleneck
libraries (is a set of specialized cython routines that are especially fast when dealing with arrays that have NaNs). It allows one to increase pandas functionality a lot. This advantage is shown for some boolean and calculation operations. To count the time elapsed on operation performing we will use the decorator
# this function counts the time for a particular operation
def timer(func):
from datetime import datetime
def wrapper(*args):
start = datetime.now()
func(*args)
end = datetime.now()
return 'elapsed time = {' + str(end - start)+'}'
return wrapper
import random
n = 100
# generate rangon datasets
df_1 = pd.DataFrame({'col :'+str(i):[random.randint(-100,100) for j in range(n)]for i in range(n)})
# here we pass a dictionary to the DataFrame() constructor.
# The key is "col : i" where i can take random values and the value for those keys is i.
df_2 = pd.DataFrame({'col :'+str(i):[random.randint(-100,100) for j in range(n)] for i in range(n)})
@timer
def direct_comparison(df_1, df_2):
bool_df = pd.DataFrame({'col_{}'.format(i): [True for j in range(n)] for i in range(n)})
for i in range(len(df_1.index)):
for j in range(len(df_1.loc[i])):
if df_1.loc[i, df_1.columns[j]] >= df_2.loc[i, df_2.columns[j]]:
bool_df.loc[i,bool_df.columns[j]] = False
return bool_df
@timer
def pandas_comparison(df_1, df_2):
return df_1 < df_2
print ('direct_comparison:', (direct_comparison(df_1, df_2)))
print ('pandas_comparison:', (pandas_comparison(df_1, df_2)))
direct_comparison: elapsed time = {0:00:03.362719}
pandas_comparison: elapsed time = {0:00:00.029600}
As you can see, the difference in speed is too noticeable.
Besides, pandas possesses methods eq
(equal), ne
(not equal), lt
(less then), gt
(greater than), le
(less or equal) and ge
(greater or equal) for simplifying boolean comparison
Matrix Addition
df = pd.DataFrame({'A':[1,2,3],'B':[-2,-3,-4],"C":[7,8,9]})
dfa = pd.DataFrame({'A':[1,2,3],'D':[6,7,8],"C":[12,12,12]})
dfc = df + dfa
dfc
A | B | C | D | |
---|---|---|---|---|
0 | 2 | NaN | 19 | NaN |
1 | 4 | NaN | 20 | NaN |
2 | 6 | NaN | 21 | NaN |
df.le(dfa)
A | B | C | D | |
---|---|---|---|---|
0 | True | False | True | False |
1 | True | False | True | False |
2 | True | False | True | False |
You can also apply the reductions: empty
, any()
, all()
, and bool()
to provide a way to summarize a boolean result:
(df<0).all()
A False
B True
C False
dtype: bool
# here horyzontal direction for comparison is taking into account and we check all row’s items
(df < 0).all(axis=1)
0 False
1 False
2 False
dtype: bool
# here vertical direction for comparison is taking into
# account and we check if just one column’s item satisfies the condition
(df < 0).any()
A False
B True
C False
dtype: bool
# here we check if all DataFrame's items satisfy the condition
(df < 0).any().any()
True
# here we check if DataFrame no one element
df.empty
False
Descriptive Statistics
Function | Description |
---|---|
abs | absolute value |
count | number of non-null observations |
cumsum | cumulative sum (a sequence of partial sums of a given sequence) |
sum | sum of values |
mean | mean of values |
mad | mean absolute deviation |
median | arithmetic median of values |
min | minimum value |
max | maximum value |
mode | mode |
prod | product of values |
std | unbiased standard deviation |
var | unbiased variance |
print("Sum : ", movies['age'].sum())
Sum : 3089983.0
print(df)
A B C
0 1 -2 7
1 2 -3 8
2 3 -4 9
print("Mean : ")
print(df.mean())
print("\nMean of all Mean Values: ")
print(df.mean().mean())
print("\nMedian: ")
print(df.median())
print("\nStandard Deviation: ")
print(df.std())
print("\nVariance: ")
print(df.var())
print("\nMax: ")
print(df.max())
Mean :
A 2.0
B -3.0
C 8.0
dtype: float64
Mean of all Mean Values:
2.3333333333333335
Median:
A 2.0
B -3.0
C 8.0
dtype: float64
Standard Deviation:
A 1.0
B 1.0
C 1.0
dtype: float64
Variance:
A 1.0
B 1.0
C 1.0
dtype: float64
Max:
A 3
B -2
C 9
dtype: int64
Function Applications
When you need to make some transformations with some column’s or row’s elements, then method map
will be helpful (it works like pure Python function map()
). But there is also possibility to apply some function to each DataFrame element (not to a column or a row) – method apply(map)
aids in this case.
movies.loc[:, (movies.dtypes == np.int64) | (movies.dtypes == np.float64)].apply(np.mean)
# This calculates the mean of all the columns present in movies
user_id 4.624848e+02
movie_id 4.255301e+02
rating 3.529860e+00
timestamp 8.835289e+08
age 3.296650e+01
unknown 1.000000e-04
Action 2.558900e-01
Adventure 1.375300e-01
Animation 3.605000e-02
Childrens 7.182000e-02
Comedy 2.983200e-01
Crime 8.055000e-02
Documentary 7.580000e-03
Drama 3.989500e-01
Fantasy 1.352000e-02
Film-Noir 1.733000e-02
Horror 5.317000e-02
Musical 4.954000e-02
Mystery 5.245000e-02
Romance 1.946100e-01
Sci-Fi 1.273000e-01
Thriller 2.187200e-01
War 9.398000e-02
Western 1.854000e-02
dtype: float64
# to print mean of all row values in movies :
movies.loc[:,(movies.dtypes==np.int64) | (movies.dtypes==np.float64)].apply(np.mean, axis = 1)
0 3.671881e+07
1 3.692952e+07
2 3.680285e+07
3 3.712641e+07
4 3.648948e+07
5 3.662343e+07
6 3.683796e+07
7 3.664883e+07
8 3.672981e+07
9 3.657322e+07
10 3.646959e+07
11 3.714094e+07
12 3.691021e+07
13 3.667207e+07
14 3.868486e+07
15 3.711643e+07
16 3.654020e+07
17 3.646151e+07
18 3.825258e+07
19 3.684153e+07
20 3.663078e+07
21 3.653976e+07
22 3.690734e+07
23 3.700433e+07
24 3.804139e+07
25 3.704913e+07
26 3.715335e+07
27 3.680185e+07
28 3.682009e+07
29 3.682872e+07
...
99970 3.836551e+07
99971 3.702680e+07
99972 3.703066e+07
99973 3.705424e+07
99974 3.661341e+07
99975 3.714594e+07
99976 3.714594e+07
99977 3.714592e+07
99978 3.714594e+07
99979 3.714594e+07
99980 3.714592e+07
99981 3.648981e+07
99982 3.869825e+07
99983 3.720672e+07
99984 3.714594e+07
99985 3.714584e+07
99986 3.876098e+07
99987 3.704094e+07
99988 3.712668e+07
99989 3.696509e+07
99990 3.712652e+07
99991 3.713393e+07
99992 3.807540e+07
99993 3.684269e+07
99994 3.678404e+07
99995 3.705384e+07
99996 3.866487e+07
99997 3.705384e+07
99998 3.696514e+07
99999 3.670202e+07
Length: 100000, dtype: float64
Remember
The attribute axis define the horizontal (axis=1)
or vertical direction for calculations (axis=0)
Groupby with Dictionary
Find more about this at my GeeksForGeeks Publication.
Breaking up a String into columns using regex
Find more about this at my GeeksForGeeks Publication.
Ranking Rows in Pandas
Find more about this at my GeeksForGeeks Publication.
Find more at my Github repository Explained.
Show some by ing it.