Introduction to Pandas

Introduction to Pandas

- 61 mins

This is the first tutorial of the Explained! series.

I will be cataloging all the work I do with regards to PyLibraries and will share it here or on my Github.

I will also be updating this post as and when I work on Pandas.

That being said, Dive in!

Series

import pandas as pd
import numpy as np
import random
first_series = pd.Series([1,2,3, np.nan ,"hello"])
first_series
0        1
1        2
2        3
3      NaN
4    hello
dtype: object
series = pd.Series([1,2,3, np.nan ,"hello"], index = ['A','B','C','Unknown','String'])
series
#indexing the Series with custom values
A              1
B              2
C              3
Unknown      NaN
String     hello
dtype: object
dict = {"Python": "Fun", "C++": "Outdated","Coding":"Hmm.."}
series = pd.Series(dict)
series
# Dict to pandas Series
Python         Fun
C++       Outdated
Coding       Hmm..
dtype: object
series[['Coding','Python']]
Coding    Hmm..
Python      Fun
dtype: object
series.index
Index(['Python', 'C++', 'Coding'], dtype='object')
series.values
array(['Fun', 'Outdated', 'Hmm..'], dtype=object)
series.describe()
count            3
unique           3
top       Outdated
freq             1
dtype: object
#Series is a mutable data structures and you can easily change any item’s value:
series['Coding'] = 'Awesome'
series
Python         Fun
C++       Outdated
Coding     Awesome
dtype: object
# add new values:
series['Java'] = 'Okay'
series
Python         Fun
C++       Outdated
Coding     Awesome
Java          Okay
dtype: object
# If it is necessary to apply any mathematical operation to Series items, you may done it like below:
num_series = pd.Series([1,2,3,4,5,6,None])
num_series_changed = num_series/2
num_series_changed
0    0.5
1    1.0
2    1.5
3    2.0
4    2.5
5    3.0
6    NaN
dtype: float64
# NULL/NaN checking can be performed with isnull() and notnull().
print(series.isnull())
print(num_series.notnull())
print(num_series_changed.notnull())
Python    False
C++       False
Coding    False
Java      False
dtype: bool
0     True
1     True
2     True
3     True
4     True
5     True
6    False
dtype: bool
0     True
1     True
2     True
3     True
4     True
5     True
6    False
dtype: bool

DataFrames

data = {'year': [1990, 1994, 1998, 2002, 2006, 2010, 2014],
        'winner': ['Germany', 'Brazil', 'France', 'Brazil','Italy', 'Spain', 'Germany'],
        'runner-up': ['Argentina', 'Italy', 'Brazil','Germany', 'France', 'Netherlands', 'Argentina'],
        'final score': ['1-0', '0-0 (pen)', '3-0', '2-0', '1-1 (pen)', '1-0', '1-0'] }
world_cup = pd.DataFrame(data, columns=['year', 'winner', 'runner-up', 'final score'])
world_cup
year winner runner-up final score
0 1990 Germany Argentina 1-0
1 1994 Brazil Italy 0-0 (pen)
2 1998 France Brazil 3-0
3 2002 Brazil Germany 2-0
4 2006 Italy France 1-1 (pen)
5 2010 Spain Netherlands 1-0
6 2014 Germany Argentina 1-0
# Another way to set a DataFrame is the using of Python list of dictionaries:

data_2 = [{'year': 1990, 'winner': 'Germany', 'runner-up': 'Argentina', 'final score': '1-0'},
          {'year': 1994, 'winner': 'Brazil', 'runner-up': 'Italy', 'final score': '0-0 (pen)'},
          {'year': 1998, 'winner': 'France', 'runner-up': 'Brazil', 'final score': '3-0'},
          {'year': 2002, 'winner': 'Brazil', 'runner-up': 'Germany', 'final score': '2-0'},
          {'year': 2006, 'winner': 'Italy','runner-up': 'France', 'final score': '1-1 (pen)'},
          {'year': 2010, 'winner': 'Spain', 'runner-up': 'Netherlands', 'final score': '1-0'},
          {'year': 2014, 'winner': 'Germany', 'runner-up': 'Argentina', 'final score': '1-0'}
         ]
world_cup = pd.DataFrame(data_2)
world_cup
final score runner-up winner year
0 1-0 Argentina Germany 1990
1 0-0 (pen) Italy Brazil 1994
2 3-0 Brazil France 1998
3 2-0 Germany Brazil 2002
4 1-1 (pen) France Italy 2006
5 1-0 Netherlands Spain 2010
6 1-0 Argentina Germany 2014
print("First 2 Rows: ",end="\n\n")
print (world_cup.head(2),end="\n\n")
print ("Last 2 Rows : ",end="\n\n")
print (world_cup.tail(2),end="\n\n")
print("Using slicing : ",end="\n\n")
print (world_cup[2:4])
First 2 Rows:

  final score  runner-up   winner  year
0         1-0  Argentina  Germany  1990
1   0-0 (pen)      Italy   Brazil  1994

Last 2 Rows :

  final score    runner-up   winner  year
5         1-0  Netherlands    Spain  2010
6         1-0    Argentina  Germany  2014

Using slicing :

  final score runner-up  winner  year
2         3-0    Brazil  France  1998
3         2-0   Germany  Brazil  2002

CSV

Reading:

df = pd.read_csv("path\to\the\csv\file\for\reading")

Writing:

df.to_csv("path\to\the\folder\where\you\want\save\csv\file")

TXT file(s)

(txt file can be read as a CSV file with other separator (delimiter); we suppose below that columns are separated by tabulation):

Reading:

df = pd.read_csv("path\to\the\txt\file\for\reading", sep='\t')

Writing:

df.to_csv("path\to\the\folder\where\you\want\save\txt\file", sep='\t')

JSON files

(an open-standard format that uses human-readable text to transmit data objects consisting of attribute–value pairs. It is the most common data format used for asynchronous browser/server communication. By its view it is very similar to Python dictionary)

Reading:

df = pd.read_json("path\to\the\json\file\for\reading", sep='\t')

Writing:

df.to_json("path\to\the\folder\where\you\want\save\json\file", sep='\t')

# To write world_cup Dataframe to a CSV File
world_cup.to_csv("worldcup.csv")
# To save CSV file without index use index=False attribute

print("File Written!",end="\n\n")

#To check if it was written
import os
print(os.path.exists('worldcup.csv'))

# reading from it in a new dataframe df
df = pd.read_csv('worldcup.csv')
print(df.head())


File Written!

True
   Unnamed: 0 final score  runner-up   winner  year
0           0         1-0  Argentina  Germany  1990
1           1   0-0 (pen)      Italy   Brazil  1994
2           2         3-0     Brazil   France  1998
3           3         2-0    Germany   Brazil  2002
4           4   1-1 (pen)     France    Italy  2006
# We can also load the data without index as :
df = pd.read_csv('worldcup.csv',index_col=0)
print(df)
  final score    runner-up   winner  year
0         1-0    Argentina  Germany  1990
1   0-0 (pen)        Italy   Brazil  1994
2         3-0       Brazil   France  1998
3         2-0      Germany   Brazil  2002
4   1-1 (pen)       France    Italy  2006
5         1-0  Netherlands    Spain  2010
6         1-0    Argentina  Germany  2014

You can find the dataset used in the following code here.

movies=pd.read_csv("data/movies.csv",encoding = "ISO-8859-1")
# encoding is added only for this specific dataset because it gave error with utf-8
movies['release_date'] = movies['release_date'].map(pd.to_datetime)
print(movies.head(20))

#print(movies.describe())
    user_id  movie_id  rating  timestamp   age gender     occupation zip_code  \
0       196       242       3  881250949  49.0      M         writer    55105   
1       305       242       5  886307828  23.0      M     programmer    94086   
2         6       242       4  883268170  42.0      M      executive    98101   
3       234       242       4  891033261  60.0      M        retired    94702   
4        63       242       3  875747190  31.0      M      marketing    75240   
5       181       242       1  878961814  26.0      M      executive    21218   
6       201       242       4  884110598  27.0      M         writer    E2A4H   
7       249       242       5  879571438  25.0      M        student    84103   
8        13       242       2  881515193  47.0      M       educator    29206   
9       279       242       3  877756647  33.0      M     programmer    85251   
10      145       242       5  875269755  31.0      M  entertainment    V3N4P   
11       90       242       4  891382267  60.0      M       educator    78155   
12      271       242       4  885844495  51.0      M       engineer    22932   
13       18       242       5  880129305  35.0      F          other    37212   
14        1       242       5  889751633   NaN      M            NaN    85711   
15      207       242       4  890793823  39.0      M      marketing    92037   
16       14       242       4  876964570  45.0      M      scientist    55106   
17      113       242       2  875075887  47.0      M      executive    95032   
18      123       242       5  879809053   NaN      F         artist    20008   
19      296       242       4  884196057  43.0      F  administrator    16803   

     movie_title release_date   ...    Fantasy  Film-Noir  Horror  Musical  \
0   Kolya (1996)   1997-01-24   ...          0          0       0        0   
1   Kolya (1996)   1997-01-24   ...          0          0       0        0   
2   Kolya (1996)   1997-01-24   ...          0          0       0        0   
3   Kolya (1996)   1997-01-24   ...          0          0       0        0   
4   Kolya (1996)   1997-01-24   ...          0          0       0        0   
5   Kolya (1996)   1997-01-24   ...          0          0       0        0   
6   Kolya (1996)   1997-01-24   ...          0          0       0        0   
7   Kolya (1996)   1997-01-24   ...          0          0       0        0   
8   Kolya (1996)   1997-01-24   ...          0          0       0        0   
9   Kolya (1996)   1997-01-24   ...          0          0       0        0   
10  Kolya (1996)   1997-01-24   ...          0          0       0        0   
11  Kolya (1996)   1997-01-24   ...          0          0       0        0   
12  Kolya (1996)   1997-01-24   ...          0          0       0        0   
13  Kolya (1996)   1997-01-24   ...          0          0       0        0   
14  Kolya (1996)   1997-01-24   ...          0          0       0        0   
15  Kolya (1996)   1997-01-24   ...          0          0       0        0   
16  Kolya (1996)   1997-01-24   ...          0          0       0        0   
17  Kolya (1996)   1997-01-24   ...          0          0       0        0   
18  Kolya (1996)   1997-01-24   ...          0          0       0        0   
19  Kolya (1996)   1997-01-24   ...          0          0       0        0   

    Mystery  Romance  Sci-Fi  Thriller  War  Western  
0         0        0       0         0    0        0  
1         0        0       0         0    0        0  
2         0        0       0         0    0        0  
3         0        0       0         0    0        0  
4         0        0       0         0    0        0  
5         0        0       0         0    0        0  
6         0        0       0         0    0        0  
7         0        0       0         0    0        0  
8         0        0       0         0    0        0  
9         0        0       0         0    0        0  
10        0        0       0         0    0        0  
11        0        0       0         0    0        0  
12        0        0       0         0    0        0  
13        0        0       0         0    0        0  
14        0        0       0         0    0        0  
15        0        0       0         0    0        0  
16        0        0       0         0    0        0  
17        0        0       0         0    0        0  
18        0        0       0         0    0        0  
19        0        0       0         0    0        0  

[20 rows x 30 columns]
movies_rating = movies['rating']
# Here we are showing only one column, i.e. a Series
print ('type:', type(movies_rating))
movies_rating.head()
type: <class 'pandas.core.series.Series'>





0    3
1    5
2    4
3    4
4    3
Name: rating, dtype: int64
# Filtering data
# Let's display only women
movies_user_female = movies[movies['gender']=='F']
print(movies_user_female.head())
    user_id  movie_id  rating  timestamp   age gender     occupation zip_code  \
13       18       242       5  880129305  35.0      F          other    37212   
18      123       242       5  879809053   NaN      F         artist    20008   
19      296       242       4  884196057  43.0      F  administrator    16803   
21      270       242       5  876953744  18.0      F        student    63119   
22      240       242       5  885775683  23.0      F       educator    20784   

     movie_title release_date   ...    Fantasy  Film-Noir  Horror  Musical  \
13  Kolya (1996)   1997-01-24   ...          0          0       0        0   
18  Kolya (1996)   1997-01-24   ...          0          0       0        0   
19  Kolya (1996)   1997-01-24   ...          0          0       0        0   
21  Kolya (1996)   1997-01-24   ...          0          0       0        0   
22  Kolya (1996)   1997-01-24   ...          0          0       0        0   

    Mystery  Romance  Sci-Fi  Thriller  War  Western  
13        0        0       0         0    0        0  
18        0        0       0         0    0        0  
19        0        0       0         0    0        0  
21        0        0       0         0    0        0  
22        0        0       0         0    0        0  

[5 rows x 30 columns]
#to see all the different values possible for a given column
occupation_list = movies['occupation']
print(occupation_list)
0               writer
1           programmer
2            executive
3              retired
4            marketing
5            executive
6               writer
7              student
8             educator
9           programmer
10       entertainment
11            educator
12            engineer
13               other
14                 NaN
15           marketing
16           scientist
17           executive
18              artist
19       administrator
20             student
21             student
22            educator
23                 NaN
24              writer
25                 NaN
26                 NaN
27           marketing
28       administrator
29             student
             ...      
99970         educator
99971            other
99972            other
99973            other
99974    administrator
99975           artist
99976           artist
99977           artist
99978           artist
99979           artist
99980           artist
99981    entertainment
99982          student
99983          student
99984           artist
99985           artist
99986           artist
99987          student
99988        librarian
99989           writer
99990              NaN
99991           artist
99992            other
99993            other
99994          student
99995          student
99996          student
99997          student
99998           writer
99999         engineer
Name: occupation, Length: 100000, dtype: object

Work with indexes and MultiIndex option

import random
indexes = [random.randrange(0,100) for i in range(5)]
data = [{i:random.randint(0,10) for i in 'ABCDE'} for i in range(5)]
df = pd.DataFrame(data, index=[1,2,3,4,5])
df
A B C D E
1 5 8 6 4 2
2 1 3 9 8 2
3 8 7 2 1 2
4 3 8 1 3 8
5 0 7 4 4 2
movies_user_gender_male = movies[movies['gender']=='M']
movies_user_gender_male_dup = movies_user_gender_male.drop_duplicates(keep=False)
print(movies_user_gender_male.head())
# From this we can clearly see age has missing value and that from 100,000 the data reduced to 74260,
# due to filtering and removing duplicates

   user_id  movie_id  rating  timestamp   age gender  occupation zip_code  \
0      196       242       3  881250949  49.0      M      writer    55105   
1      305       242       5  886307828  23.0      M  programmer    94086   
2        6       242       4  883268170  42.0      M   executive    98101   
3      234       242       4  891033261  60.0      M     retired    94702   
4       63       242       3  875747190  31.0      M   marketing    75240   

    movie_title release_date   ...    Fantasy  Film-Noir  Horror  Musical  \
0  Kolya (1996)   1997-01-24   ...          0          0       0        0   
1  Kolya (1996)   1997-01-24   ...          0          0       0        0   
2  Kolya (1996)   1997-01-24   ...          0          0       0        0   
3  Kolya (1996)   1997-01-24   ...          0          0       0        0   
4  Kolya (1996)   1997-01-24   ...          0          0       0        0   

   Mystery  Romance  Sci-Fi  Thriller  War  Western  
0        0        0       0         0    0        0  
1        0        0       0         0    0        0  
2        0        0       0         0    0        0  
3        0        0       0         0    0        0  
4        0        0       0         0    0        0  

[5 rows x 30 columns]
#gender = female and age between 30 and 40
gender_required = ['F']
filtered_df = movies[((movies['gender'] == 'F') & (movies['age'] > 30) & (movies['age'] <40))]
filtered_df
user_id movie_id rating timestamp age gender occupation zip_code movie_title release_date ... Fantasy Film-Noir Horror Musical Mystery Romance Sci-Fi Thriller War Western
13 18 242 5 880129305 35.0 F other 37212 Kolya (1996) 1997-01-24 ... 0 0 0 0 0 0 0 0 0 0
27 129 242 4 883243972 36.0 F marketing 07039 Kolya (1996) 1997-01-24 ... 0 0 0 0 0 0 0 0 0 0
39 34 242 5 888601628 38.0 F administrator 42141 Kolya (1996) 1997-01-24 ... 0 0 0 0 0 0 0 0 0 0
42 209 242 4 883589606 33.0 F educator 85710 Kolya (1996) 1997-01-24 ... 0 0 0 0 0 0 0 0 0 0
110 861 242 5 881274504 38.0 F NaN 14085 Kolya (1996) 1997-01-24 ... 0 0 0 0 0 0 0 0 0 0
147 11 393 4 891905222 39.0 F other 30329 Mrs. Doubtfire (1993) 1993-01-01 ... 0 0 0 0 0 0 0 0 0 0
157 269 393 1 891451036 31.0 F librarian 43201 Mrs. Doubtfire (1993) 1993-01-01 ... 0 0 0 0 0 0 0 0 0 0
160 5 393 2 875636265 33.0 F other 15213 Mrs. Doubtfire (1993) 1993-01-01 ... 0 0 0 0 0 0 0 0 0 0
161 18 393 3 880130930 35.0 F NaN 37212 Mrs. Doubtfire (1993) 1993-01-01 ... 0 0 0 0 0 0 0 0 0 0
167 151 393 2 879528692 38.0 F administrator 48103 Mrs. Doubtfire (1993) 1993-01-01 ... 0 0 0 0 0 0 0 0 0 0
178 152 393 5 884018430 33.0 F educator 68767 Mrs. Doubtfire (1993) 1993-01-01 ... 0 0 0 0 0 0 0 0 0 0
187 330 393 4 876547004 35.0 F educator 33884 Mrs. Doubtfire (1993) 1993-01-01 ... 0 0 0 0 0 0 0 0 0 0
210 450 393 4 882812349 35.0 F educator 11758 Mrs. Doubtfire (1993) 1993-01-01 ... 0 0 0 0 0 0 0 0 0 0
213 457 393 3 882548583 33.0 F salesman 30011 Mrs. Doubtfire (1993) 1993-01-01 ... 0 0 0 0 0 0 0 0 0 0
241 577 393 4 880475363 36.0 F student 77845 Mrs. Doubtfire (1993) 1993-01-01 ... 0 0 0 0 0 0 0 0 0 0
246 593 393 4 886194041 31.0 F educator 68767 Mrs. Doubtfire (1993) 1993-01-01 ... 0 0 0 0 0 0 0 0 0 0
267 716 393 3 879796596 36.0 F administrator 44265 Mrs. Doubtfire (1993) 1993-01-01 ... 0 0 0 0 0 0 0 0 0 0
280 796 393 4 893218933 32.0 F writer 33755 Mrs. Doubtfire (1993) 1993-01-01 ... 0 0 0 0 0 0 0 0 0 0
325 264 381 4 886123596 36.0 F writer 90064 Muriel's Wedding (1994) 1994-01-01 ... 0 0 0 0 0 1 0 0 0 0
328 5 381 1 875636540 33.0 F other 15213 Muriel's Wedding (1994) 1994-01-01 ... 0 0 0 0 0 1 0 0 0 0
329 18 381 4 880131474 35.0 F other 37212 Muriel's Wedding (1994) 1994-01-01 ... 0 0 0 0 0 1 0 0 0 0
330 256 381 5 882165135 35.0 F none 39042 Muriel's Wedding (1994) 1994-01-01 ... 0 0 0 0 0 1 0 0 0 0
331 151 381 5 879528754 38.0 F administrator 48103 Muriel's Wedding (1994) 1994-01-01 ... 0 0 0 0 0 1 0 0 0 0
358 450 381 2 882374497 35.0 F educator 11758 Muriel's Wedding (1994) 1994-01-01 ... 0 0 0 0 0 1 0 0 0 0
384 716 381 4 879795644 36.0 F administrator 44265 Muriel's Wedding (1994) 1994-01-01 ... 0 0 0 0 0 1 0 0 0 0
387 786 381 3 882843397 36.0 F engineer 01754 Muriel's Wedding (1994) 1994-01-01 ... 0 0 0 0 0 1 0 0 0 0
388 796 381 3 893047208 32.0 F writer 33755 Muriel's Wedding (1994) 1994-01-01 ... 0 0 0 0 0 1 0 0 0 0
397 861 381 4 881274780 38.0 F student 14085 Muriel's Wedding (1994) 1994-01-01 ... 0 0 0 0 0 1 0 0 0 0
406 911 381 5 892839846 37.0 F writer 53210 Muriel's Wedding (1994) 1994-01-01 ... 0 0 0 0 0 1 0 0 0 0
423 79 251 5 891271545 39.0 F administrator 03755 Shall We Dance? (1996) 1997-07-11 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
99139 5 374 3 875636905 33.0 F other 15213 Mighty Morphin Power Rangers: The Movie (1995) 1995-01-01 ... 0 0 0 0 0 0 0 0 0 0
99146 911 374 1 892841118 37.0 F writer 53210 Mighty Morphin Power Rangers: The Movie (1995) 1995-01-01 ... 0 0 0 0 0 0 0 0 0 0
99189 18 958 5 880129731 35.0 F other 37212 To Live (Huozhe) (1994) 1994-01-01 ... 0 0 0 0 0 0 0 0 0 0
99235 450 1249 3 882812821 35.0 F educator 11758 For Love or Money (1993) 1993-01-01 ... 0 0 0 0 0 0 0 0 0 0
99249 796 1055 3 893188577 32.0 F writer 33755 Simple Twist of Fate, A (1994) 1994-01-01 ... 0 0 0 0 0 0 0 0 0 0
99255 264 1474 2 886123728 36.0 F writer 90064 Nina Takes a Lover (1994) 1994-01-01 ... 0 0 0 0 0 1 0 0 0 0
99261 264 1475 2 886123596 36.0 F writer 90064 Bhaji on the Beach (1993) 1993-01-01 ... 0 0 0 0 0 0 0 0 0 0
99279 938 1254 1 891357019 38.0 F technician 55038 Gone Fishin' (1997) 1997-05-30 ... 0 0 0 0 0 0 0 0 0 0
99294 269 1479 2 891451111 31.0 F librarian 43201 Reckless (1995) 1995-01-01 ... 0 0 0 0 0 0 0 0 0 0
99298 450 1479 3 882377479 35.0 F educator 11758 Reckless (1995) 1995-01-01 ... 0 0 0 0 0 0 0 0 0 0
99313 5 267 4 875635064 33.0 F other 15213 unknown NaT ... 0 0 0 0 0 0 0 0 0 0
99324 18 113 5 880129628 35.0 F other 37212 Horseman on the Roof, The (Hussard sur le toit... 1996-04-19 ... 0 0 0 0 0 0 0 0 0 0
99333 18 973 3 880129595 35.0 F other 37212 Grateful Dead (1995) 1996-10-18 ... 0 0 0 0 0 0 0 0 0 0
99352 151 1297 1 879542847 38.0 F administrator 48103 Love Affair (1994) 1994-01-01 ... 0 0 0 0 0 1 0 0 0 0
99359 450 1297 4 882812635 35.0 F educator 11758 Love Affair (1994) 1994-01-01 ... 0 0 0 0 0 1 0 0 0 0
99361 796 1297 2 893047504 32.0 F writer 33755 Love Affair (1994) 1994-01-01 ... 0 0 0 0 0 1 0 0 0 0
99364 151 1299 4 879543423 38.0 F administrator 48103 Penny Serenade (1941) 1941-01-01 ... 0 0 0 0 0 1 0 0 0 0
99370 796 1299 2 892676043 32.0 F writer 33755 Penny Serenade (1941) 1941-01-01 ... 0 0 0 0 0 1 0 0 0 0
99452 688 1127 5 884153606 37.0 F administrator 60476 Truman Show, The (1998) 1998-01-01 ... 0 0 0 0 0 0 0 0 0 0
99477 34 1024 5 888602618 38.0 F administrator 42141 Mrs. Dalloway (1997) 1997-01-01 ... 0 0 0 0 0 1 0 0 0 0
99501 129 1176 4 883244059 36.0 F marketing 07039 Welcome To Sarajevo (1997) 1997-01-01 ... 0 0 0 0 0 0 0 0 1 0
99541 356 1294 4 891405721 32.0 F homemaker 92688 Ayn Rand: A Sense of Life (1997) 1998-02-13 ... 0 0 0 0 0 0 0 0 0 0
99561 796 1415 3 893219254 32.0 F writer 33755 Next Karate Kid, The (1994) 1994-01-01 ... 0 0 0 0 0 0 0 0 0 0
99630 450 1518 4 887138957 35.0 F educator 11758 Losing Isaiah (1995) 1995-01-01 ... 0 0 0 0 0 0 0 0 0 0
99642 577 1517 3 880475644 36.0 F student 77845 Race the Sun (1996) 1996-01-01 ... 0 0 0 0 0 0 0 0 0 0
99648 450 1521 3 882812350 35.0 F educator 11758 Mr. Wonderful (1993) 1993-01-01 ... 0 0 0 0 0 1 0 0 0 0
99660 796 1522 3 893194740 32.0 F writer 33755 Trial by Jury (1994) 1994-01-01 ... 0 0 0 0 0 0 0 1 0 0
99695 577 1531 4 880475408 36.0 F student 77845 Far From Home: The Adventures of Yellow Dog (1... 1995-01-01 ... 0 0 0 0 0 0 0 0 0 0
99852 450 1603 3 887139728 35.0 F educator 11758 Angela (1995) 1996-02-16 ... 0 0 0 0 0 0 0 0 0 0
99981 839 1664 1 875752902 38.0 F entertainment 90814 8 Heads in a Duffel Bag (1997) 1997-04-18 ... 0 0 0 0 0 0 0 0 0 0

5183 rows × 30 columns

Note

In the above fragment you HAVE TO ADD parantheses to each and every argument that is being compared else you will get an error.

As you can see after filtering result tables (i.e. DataFrames) have non-ordered indexes. To fix this trouble you may write the following:

filtered_df = filtered_df.reset_index()
filtered_df.head(10)
index user_id movie_id rating timestamp age gender occupation zip_code movie_title ... Fantasy Film-Noir Horror Musical Mystery Romance Sci-Fi Thriller War Western
0 13 18 242 5 880129305 35.0 F other 37212 Kolya (1996) ... 0 0 0 0 0 0 0 0 0 0
1 27 129 242 4 883243972 36.0 F marketing 07039 Kolya (1996) ... 0 0 0 0 0 0 0 0 0 0
2 39 34 242 5 888601628 38.0 F administrator 42141 Kolya (1996) ... 0 0 0 0 0 0 0 0 0 0
3 42 209 242 4 883589606 33.0 F educator 85710 Kolya (1996) ... 0 0 0 0 0 0 0 0 0 0
4 110 861 242 5 881274504 38.0 F NaN 14085 Kolya (1996) ... 0 0 0 0 0 0 0 0 0 0
5 147 11 393 4 891905222 39.0 F other 30329 Mrs. Doubtfire (1993) ... 0 0 0 0 0 0 0 0 0 0
6 157 269 393 1 891451036 31.0 F librarian 43201 Mrs. Doubtfire (1993) ... 0 0 0 0 0 0 0 0 0 0
7 160 5 393 2 875636265 33.0 F other 15213 Mrs. Doubtfire (1993) ... 0 0 0 0 0 0 0 0 0 0
8 161 18 393 3 880130930 35.0 F NaN 37212 Mrs. Doubtfire (1993) ... 0 0 0 0 0 0 0 0 0 0
9 167 151 393 2 879528692 38.0 F administrator 48103 Mrs. Doubtfire (1993) ... 0 0 0 0 0 0 0 0 0 0

10 rows × 31 columns

# set 'user_id' 'movie_id' as index
filtered_df_new = filtered_df.set_index(['user_id','movie_id'])
filtered_df_new.head(10)

# Note that set_index takes only a list as an argument to it.
# if you remove the [] then only the first argument is set as the index.
index rating timestamp age gender occupation zip_code movie_title release_date IMDb_URL ... Fantasy Film-Noir Horror Musical Mystery Romance Sci-Fi Thriller War Western
user_id movie_id
18 242 13 5 880129305 35.0 F other 37212 Kolya (1996) 1997-01-24 http://us.imdb.com/M/title-exact?Kolya%20(1996) ... 0 0 0 0 0 0 0 0 0 0
129 242 27 4 883243972 36.0 F marketing 07039 Kolya (1996) 1997-01-24 http://us.imdb.com/M/title-exact?Kolya%20(1996) ... 0 0 0 0 0 0 0 0 0 0
34 242 39 5 888601628 38.0 F administrator 42141 Kolya (1996) 1997-01-24 http://us.imdb.com/M/title-exact?Kolya%20(1996) ... 0 0 0 0 0 0 0 0 0 0
209 242 42 4 883589606 33.0 F educator 85710 Kolya (1996) 1997-01-24 http://us.imdb.com/M/title-exact?Kolya%20(1996) ... 0 0 0 0 0 0 0 0 0 0
861 242 110 5 881274504 38.0 F NaN 14085 Kolya (1996) 1997-01-24 http://us.imdb.com/M/title-exact?Kolya%20(1996) ... 0 0 0 0 0 0 0 0 0 0
11 393 147 4 891905222 39.0 F other 30329 Mrs. Doubtfire (1993) 1993-01-01 http://us.imdb.com/M/title-exact?Mrs.%20Doubtf... ... 0 0 0 0 0 0 0 0 0 0
269 393 157 1 891451036 31.0 F librarian 43201 Mrs. Doubtfire (1993) 1993-01-01 http://us.imdb.com/M/title-exact?Mrs.%20Doubtf... ... 0 0 0 0 0 0 0 0 0 0
5 393 160 2 875636265 33.0 F other 15213 Mrs. Doubtfire (1993) 1993-01-01 http://us.imdb.com/M/title-exact?Mrs.%20Doubtf... ... 0 0 0 0 0 0 0 0 0 0
18 393 161 3 880130930 35.0 F NaN 37212 Mrs. Doubtfire (1993) 1993-01-01 http://us.imdb.com/M/title-exact?Mrs.%20Doubtf... ... 0 0 0 0 0 0 0 0 0 0
151 393 167 2 879528692 38.0 F administrator 48103 Mrs. Doubtfire (1993) 1993-01-01 http://us.imdb.com/M/title-exact?Mrs.%20Doubtf... ... 0 0 0 0 0 0 0 0 0 0

10 rows × 29 columns

# By default, `set_index()` returns a new DataFrame.
# so you’ll have to specify if you’d like the changes to occur in place.
# Here we used filtered_df_new to get the new dataframe and now see the type of filtererd_df_new

print(type(filtered_df_new.index))
<class 'pandas.core.indexes.multi.MultiIndex'>

Notice here that we now have a new sort of ‘index’ which is MultiIndex, which contains information about indexing of DataFrame and allows manipulating with this data.

filtered_df_new.index.names
# Gives you the names of the two index values we set as a FrozenList
FrozenList(['user_id', 'movie_id'])

Method get_level_values() allows to get all values for the corresponding index level. get_level_values(0) corresponds to ‘user_id’ and get_level_values(1) corresponds to ‘movie_id’

print(filtered_df_new.index.get_level_values(0))
print(filtered_df_new.index.get_level_values(1))
Int64Index([ 18, 129,  34, 209, 861,  11, 269,   5,  18, 151,
            ...
            129, 356, 796, 450, 577, 450, 796, 577, 450, 839],
           dtype='int64', name='user_id', length=5183)
Int64Index([ 242,  242,  242,  242,  242,  393,  393,  393,  393,  393,
            ...
            1176, 1294, 1415, 1518, 1517, 1521, 1522, 1531, 1603, 1664],
           dtype='int64', name='movie_id', length=5183)

Selection by label and position

Object selection in pandas is now supported by three types of multi-axis indexing.

The sequence of the following examples demonstrates how we can manipulate with DataFrame’s rows. At first let’s get the first row of movies:

movies.loc[0]
user_id                                                     196
movie_id                                                    242
rating                                                        3
timestamp                                             881250949
age                                                          49
gender                                                        M
occupation                                               writer
zip_code                                                  55105
movie_title                                        Kolya (1996)
release_date                                1997-01-24 00:00:00
IMDb_URL        http://us.imdb.com/M/title-exact?Kolya%20(1996)
unknown                                                       0
Action                                                        0
Adventure                                                     0
Animation                                                     0
Childrens                                                     0
Comedy                                                        1
Crime                                                         0
Documentary                                                   0
Drama                                                         0
Fantasy                                                       0
Film-Noir                                                     0
Horror                                                        0
Musical                                                       0
Mystery                                                       0
Romance                                                       0
Sci-Fi                                                        0
Thriller                                                      0
War                                                           0
Western                                                       0
Name: 0, dtype: object
movies.loc[1:3]
user_id movie_id rating timestamp age gender occupation zip_code movie_title release_date ... Fantasy Film-Noir Horror Musical Mystery Romance Sci-Fi Thriller War Western
1 305 242 5 886307828 23.0 M programmer 94086 Kolya (1996) 1997-01-24 ... 0 0 0 0 0 0 0 0 0 0
2 6 242 4 883268170 42.0 M executive 98101 Kolya (1996) 1997-01-24 ... 0 0 0 0 0 0 0 0 0 0
3 234 242 4 891033261 60.0 M retired 94702 Kolya (1996) 1997-01-24 ... 0 0 0 0 0 0 0 0 0 0

3 rows × 30 columns

If you want to return specific columns then you have to specify them as a separate argument of .loc

movies.loc[1:3 , 'movie_title']
1    Kolya (1996)
2    Kolya (1996)
3    Kolya (1996)
Name: movie_title, dtype: object
movies.loc[1:5 , ['movie_title','age','gender']]
# If more than one column is to be selected then you have to give the second argument of .loc as a list
movie_title age gender
1 Kolya (1996) 23.0 M
2 Kolya (1996) 42.0 M
3 Kolya (1996) 60.0 M
4 Kolya (1996) 31.0 M
5 Kolya (1996) 26.0 M
# movies.iloc[1:5 , ['movie_title','age','gender']]
# Gives error as iloc only uses integer values
movies.iloc[0]
user_id                                                     196
movie_id                                                    242
rating                                                        3
timestamp                                             881250949
age                                                          49
gender                                                        M
occupation                                               writer
zip_code                                                  55105
movie_title                                        Kolya (1996)
release_date                                1997-01-24 00:00:00
IMDb_URL        http://us.imdb.com/M/title-exact?Kolya%20(1996)
unknown                                                       0
Action                                                        0
Adventure                                                     0
Animation                                                     0
Childrens                                                     0
Comedy                                                        1
Crime                                                         0
Documentary                                                   0
Drama                                                         0
Fantasy                                                       0
Film-Noir                                                     0
Horror                                                        0
Musical                                                       0
Mystery                                                       0
Romance                                                       0
Sci-Fi                                                        0
Thriller                                                      0
War                                                           0
Western                                                       0
Name: 0, dtype: object
movies.iloc[1:5]
user_id movie_id rating timestamp age gender occupation zip_code movie_title release_date ... Fantasy Film-Noir Horror Musical Mystery Romance Sci-Fi Thriller War Western
1 305 242 5 886307828 23.0 M programmer 94086 Kolya (1996) 1997-01-24 ... 0 0 0 0 0 0 0 0 0 0
2 6 242 4 883268170 42.0 M executive 98101 Kolya (1996) 1997-01-24 ... 0 0 0 0 0 0 0 0 0 0
3 234 242 4 891033261 60.0 M retired 94702 Kolya (1996) 1997-01-24 ... 0 0 0 0 0 0 0 0 0 0
4 63 242 3 875747190 31.0 M marketing 75240 Kolya (1996) 1997-01-24 ... 0 0 0 0 0 0 0 0 0 0

4 rows × 30 columns

# movies.select(lambda x: x%2==0).head() is the same as :
movies.loc[movies.index.map(lambda x: x%2==0)].head()

# .select() has been deprecated for now and will be completely removed in future updates so use .loc
user_id movie_id rating timestamp age gender occupation zip_code movie_title release_date ... Fantasy Film-Noir Horror Musical Mystery Romance Sci-Fi Thriller War Western
0 196 242 3 881250949 49.0 M writer 55105 Kolya (1996) 1997-01-24 ... 0 0 0 0 0 0 0 0 0 0
2 6 242 4 883268170 42.0 M executive 98101 Kolya (1996) 1997-01-24 ... 0 0 0 0 0 0 0 0 0 0
4 63 242 3 875747190 31.0 M marketing 75240 Kolya (1996) 1997-01-24 ... 0 0 0 0 0 0 0 0 0 0
6 201 242 4 884110598 27.0 M writer E2A4H Kolya (1996) 1997-01-24 ... 0 0 0 0 0 0 0 0 0 0
8 13 242 2 881515193 47.0 M educator 29206 Kolya (1996) 1997-01-24 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 30 columns


Working with Missing Data

Pandas primarily uses the value np.nan to represent missing data (in table missed/empty value are marked by NaN). It is by default not included in computations. Missing data creates many issues at mathematical or computational tasks with DataFrames and Series and it’s important to know how fight with these values.

ages = movies['age']
sum(ages)
nan

This is because there are so many cases where Age isn’t given and hence takes on the value of np.nan. We can use fillna()a very effecient pandas method for filling missing values

ages = movies['age'].fillna(0)
sum(ages)
3089983.0

This fills all the values with 0 and calculates the sum. To remain only rows with non-null values you can use method dropna()

ages = movies['age'].dropna()
sum(ages)
3089983.0
movies_nonnull = movies.dropna()
movies_nonnull.head(20)
#14th value was dropped because it had a missing value in a column
user_id movie_id rating timestamp age gender occupation zip_code movie_title release_date ... Fantasy Film-Noir Horror Musical Mystery Romance Sci-Fi Thriller War Western
0 196 242 3 881250949 49.0 M writer 55105 Kolya (1996) 1997-01-24 ... 0 0 0 0 0 0 0 0 0 0
1 305 242 5 886307828 23.0 M programmer 94086 Kolya (1996) 1997-01-24 ... 0 0 0 0 0 0 0 0 0 0
2 6 242 4 883268170 42.0 M executive 98101 Kolya (1996) 1997-01-24 ... 0 0 0 0 0 0 0 0 0 0
3 234 242 4 891033261 60.0 M retired 94702 Kolya (1996) 1997-01-24 ... 0 0 0 0 0 0 0 0 0 0
4 63 242 3 875747190 31.0 M marketing 75240 Kolya (1996) 1997-01-24 ... 0 0 0 0 0 0 0 0 0 0
5 181 242 1 878961814 26.0 M executive 21218 Kolya (1996) 1997-01-24 ... 0 0 0 0 0 0 0 0 0 0
6 201 242 4 884110598 27.0 M writer E2A4H Kolya (1996) 1997-01-24 ... 0 0 0 0 0 0 0 0 0 0
7 249 242 5 879571438 25.0 M student 84103 Kolya (1996) 1997-01-24 ... 0 0 0 0 0 0 0 0 0 0
8 13 242 2 881515193 47.0 M educator 29206 Kolya (1996) 1997-01-24 ... 0 0 0 0 0 0 0 0 0 0
9 279 242 3 877756647 33.0 M programmer 85251 Kolya (1996) 1997-01-24 ... 0 0 0 0 0 0 0 0 0 0
10 145 242 5 875269755 31.0 M entertainment V3N4P Kolya (1996) 1997-01-24 ... 0 0 0 0 0 0 0 0 0 0
11 90 242 4 891382267 60.0 M educator 78155 Kolya (1996) 1997-01-24 ... 0 0 0 0 0 0 0 0 0 0
12 271 242 4 885844495 51.0 M engineer 22932 Kolya (1996) 1997-01-24 ... 0 0 0 0 0 0 0 0 0 0
13 18 242 5 880129305 35.0 F other 37212 Kolya (1996) 1997-01-24 ... 0 0 0 0 0 0 0 0 0 0
15 207 242 4 890793823 39.0 M marketing 92037 Kolya (1996) 1997-01-24 ... 0 0 0 0 0 0 0 0 0 0
16 14 242 4 876964570 45.0 M scientist 55106 Kolya (1996) 1997-01-24 ... 0 0 0 0 0 0 0 0 0 0
17 113 242 2 875075887 47.0 M executive 95032 Kolya (1996) 1997-01-24 ... 0 0 0 0 0 0 0 0 0 0
19 296 242 4 884196057 43.0 F administrator 16803 Kolya (1996) 1997-01-24 ... 0 0 0 0 0 0 0 0 0 0
20 154 242 3 879138235 25.0 M student 53703 Kolya (1996) 1997-01-24 ... 0 0 0 0 0 0 0 0 0 0
21 270 242 5 876953744 18.0 F student 63119 Kolya (1996) 1997-01-24 ... 0 0 0 0 0 0 0 0 0 0

20 rows × 30 columns

movies_notnull = movies.dropna(how='all',subset=['age','occupation'])
#Drops all nan values from movies belonging to age and occupation
movies_notnull.info()
#Notice how age and occupation now have nearly 6000 lesser values
<class 'pandas.core.frame.DataFrame'>
Int64Index: 99616 entries, 0 to 99999
Data columns (total 30 columns):
user_id         99616 non-null int64
movie_id        99616 non-null int64
rating          99616 non-null int64
timestamp       99616 non-null int64
age             93731 non-null float64
gender          99616 non-null object
occupation      93806 non-null object
zip_code        99616 non-null object
movie_title     99616 non-null object
release_date    99607 non-null datetime64[ns]
IMDb_URL        99603 non-null object
unknown         99616 non-null int64
Action          99616 non-null int64
Adventure       99616 non-null int64
Animation       99616 non-null int64
Childrens       99616 non-null int64
Comedy          99616 non-null int64
Crime           99616 non-null int64
Documentary     99616 non-null int64
Drama           99616 non-null int64
Fantasy         99616 non-null int64
Film-Noir       99616 non-null int64
Horror          99616 non-null int64
Musical         99616 non-null int64
Mystery         99616 non-null int64
Romance         99616 non-null int64
Sci-Fi          99616 non-null int64
Thriller        99616 non-null int64
War             99616 non-null int64
Western         99616 non-null int64
dtypes: datetime64[ns](1), float64(1), int64(23), object(5)
memory usage: 23.6+ MB

Thus, if how='all', we get DataFrame, where all values in both columns from subset are NaN.

If how='any', we get DataFrame, where at least one contains NaN.

movies.describe()
user_id movie_id rating timestamp age unknown Action Adventure Animation Childrens ... Fantasy Film-Noir Horror Musical Mystery Romance Sci-Fi Thriller War Western
count 100000.00000 100000.000000 100000.000000 1.000000e+05 93731.000000 100000.0000 100000.000000 100000.000000 100000.000000 100000.000000 ... 100000.000000 100000.000000 100000.000000 100000.000000 100000.000000 100000.000000 100000.00000 100000.00000 100000.000000 100000.000000
mean 462.48475 425.530130 3.529860 8.835289e+08 32.966500 0.0001 0.255890 0.137530 0.036050 0.071820 ... 0.013520 0.017330 0.053170 0.049540 0.052450 0.194610 0.12730 0.21872 0.093980 0.018540
std 266.61442 330.798356 1.125674 5.343856e+06 11.561809 0.0100 0.436362 0.344408 0.186416 0.258191 ... 0.115487 0.130498 0.224373 0.216994 0.222934 0.395902 0.33331 0.41338 0.291802 0.134894
min 1.00000 1.000000 1.000000 8.747247e+08 7.000000 0.0000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.00000 0.000000 0.000000
25% 254.00000 175.000000 3.000000 8.794487e+08 24.000000 0.0000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.00000 0.000000 0.000000
50% 447.00000 322.000000 4.000000 8.828269e+08 30.000000 0.0000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.00000 0.000000 0.000000
75% 682.00000 631.000000 4.000000 8.882600e+08 40.000000 0.0000 1.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.00000 0.000000 0.000000
max 943.00000 1682.000000 5.000000 8.932866e+08 73.000000 1.0000 1.000000 1.000000 1.000000 1.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000 1.00000 1.000000 1.000000

8 rows × 24 columns

At first, let’s find all unique dates in ‘release_date’ column of movies and then select only dates in range lower than 1995.

movies['release_date'] = movies['release_date'].map(pd.to_datetime)
# We map it to_datetime as pandas has a set way to deal with dates and then we can effectively work with dates.
unique_dates = movies['release_date'].drop_duplicates().dropna()
# Drops duplicates and nan values
unique_dates
0       1997-01-24
117     1993-01-01
309     1994-01-01
409     1997-07-11
455     1986-01-01
785     1997-01-01
881     1987-01-01
1137    1979-01-01
1253    1996-04-26
1525    1995-01-01
1557    1996-03-08
1850    1996-11-15
2331    1990-01-01
2851    1971-01-01
2972    1978-01-01
3432    1997-07-04
3735    1996-04-12
4269    1996-12-18
4347    1996-04-23
4583    1996-10-04
4745    1997-06-27
4751    1997-01-31
4798    1996-06-28
4961    1988-01-01
5208    1995-10-30
5392    1996-02-09
5863    1996-09-28
6861    1997-05-09
7058    1996-10-11
7186    1997-08-15
           ...    
96679   1996-09-24
96855   1997-01-29
96948   1996-09-04
97195   1996-09-16
97434   1997-12-18
97639   1998-03-17
97816   1996-06-05
98068   1996-12-15
98546   1998-04-03
98574   1996-05-17
98590   1998-03-10
98739   1996-10-26
98748   1998-01-23
98786   1998-03-14
98856   1932-01-01
98969   1996-01-15
99205   1996-04-02
99280   1998-02-20
99321   1997-04-22
99598   1998-10-09
99650   1998-02-01
99702   1996-07-22
99737   1926-01-01
99813   1998-01-21
99885   1998-02-11
99938   1986-04-26
99940   1998-03-06
99958   1996-09-18
99967   1996-02-28
99977   1997-04-30
Name: release_date, Length: 240, dtype: datetime64[ns]
# find dates with year lower/equal than 1995
unique_dates_1 = filter(lambda x: x.year <= 1995, unique_dates)
# filter() takes two arguments. First one should return only boolean values and the second one is the variable over which ititerates over.
# This basically takes unique_dates and uses the lambda function (here, it returns bool values) and filters True cases.

unique_dates_1
<filter at 0x1187af6a0>

Here we have used drop_duplicates() method to select only unique Series values. Then we can filter movies with respect to release_date condition. Each datetime Python object possesses with attributes year, month, day, etc. allowing to extract values of year, month, day, etc. from the date. We call the new DataFrame as old_movies.

old_movies = movies[movies['release_date'].isin(unique_dates_1)]
old_movies.head()
user_id movie_id rating timestamp age gender occupation zip_code movie_title release_date ... Fantasy Film-Noir Horror Musical Mystery Romance Sci-Fi Thriller War Western
117 196 393 4 881251863 49.0 M writer 55105 Mrs. Doubtfire (1993) 1993-01-01 ... 0 0 0 0 0 0 0 0 0 0
118 22 393 4 878886989 25.0 M writer 40206 Mrs. Doubtfire (1993) 1993-01-01 ... 0 0 0 0 0 0 0 0 0 0
119 244 393 3 880607365 28.0 M technician 80525 Mrs. Doubtfire (1993) 1993-01-01 ... 0 0 0 0 0 0 0 0 0 0
120 298 393 4 884183099 44.0 M executive 01581 Mrs. Doubtfire (1993) 1993-01-01 ... 0 0 0 0 0 0 0 0 0 0
121 286 393 4 877534481 27.0 M student 15217 Mrs. Doubtfire (1993) 1993-01-01 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 30 columns

Now we may filter DataFrame old_movies by age and rating. Lets’ drop timestamp, zip_code

# get all users with age less than 25 that rated old movies higher than 3
old_movies_watch = old_movies[(old_movies['age']<25) & (old_movies['rating']>3)]
# Drop timestamp and zip_code
old_movies_watch = old_movies_watch.drop(['timestamp', 'zip_code'],axis=1)

old_movies_watch.head()
user_id movie_id rating age gender occupation movie_title release_date IMDb_URL unknown ... Fantasy Film-Noir Horror Musical Mystery Romance Sci-Fi Thriller War Western
124 303 393 4 19.0 M student Mrs. Doubtfire (1993) 1993-01-01 http://us.imdb.com/M/title-exact?Mrs.%20Doubtf... 0 ... 0 0 0 0 0 0 0 0 0 0
135 276 393 4 21.0 M student Mrs. Doubtfire (1993) 1993-01-01 http://us.imdb.com/M/title-exact?Mrs.%20Doubtf... 0 ... 0 0 0 0 0 0 0 0 0 0
153 128 393 4 24.0 F marketing Mrs. Doubtfire (1993) 1993-01-01 http://us.imdb.com/M/title-exact?Mrs.%20Doubtf... 0 ... 0 0 0 0 0 0 0 0 0 0
162 130 393 5 20.0 M none Mrs. Doubtfire (1993) 1993-01-01 http://us.imdb.com/M/title-exact?Mrs.%20Doubtf... 0 ... 0 0 0 0 0 0 0 0 0 0
183 314 393 4 20.0 F student Mrs. Doubtfire (1993) 1993-01-01 http://us.imdb.com/M/title-exact?Mrs.%20Doubtf... 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 28 columns


Pandas has support for accelerating certain types of binary numerical and boolean operations using the numexpr library (it uses smart chunking, caching, and multiple cores) and the bottleneck libraries (is a set of specialized cython routines that are especially fast when dealing with arrays that have NaNs). It allows one to increase pandas functionality a lot. This advantage is shown for some boolean and calculation operations. To count the time elapsed on operation performing we will use the decorator

# this function counts the time for a particular operation

def timer(func):
    from datetime import datetime
    def wrapper(*args):
        start = datetime.now()
        func(*args)
        end = datetime.now()
        return 'elapsed time = {' + str(end - start)+'}'
    return wrapper

import random
n = 100
# generate rangon datasets
df_1 = pd.DataFrame({'col :'+str(i):[random.randint(-100,100) for j in range(n)]for i in range(n)})
# here we pass a dictionary to the DataFrame() constructor.
# The key is "col : i" where i can take random values and the value for those keys is i.

df_2 = pd.DataFrame({'col :'+str(i):[random.randint(-100,100) for j in range(n)] for i in range(n)})

@timer
def direct_comparison(df_1, df_2):
    bool_df = pd.DataFrame({'col_{}'.format(i): [True for j in range(n)] for i in range(n)})
    for i in range(len(df_1.index)):
        for j in range(len(df_1.loc[i])):
            if df_1.loc[i, df_1.columns[j]] >= df_2.loc[i, df_2.columns[j]]:
                bool_df.loc[i,bool_df.columns[j]] = False
    return bool_df

@timer
def pandas_comparison(df_1, df_2):
    return df_1 < df_2

print ('direct_comparison:', (direct_comparison(df_1, df_2)))
print ('pandas_comparison:', (pandas_comparison(df_1, df_2)))
direct_comparison: elapsed time = {0:00:03.362719}
pandas_comparison: elapsed time = {0:00:00.029600}

As you can see, the difference in speed is too noticeable.

Besides, pandas possesses methods eq (equal), ne (not equal), lt (less then), gt (greater than), le (less or equal) and ge (greater or equal) for simplifying boolean comparison


Matrix Addition

df = pd.DataFrame({'A':[1,2,3],'B':[-2,-3,-4],"C":[7,8,9]})
dfa = pd.DataFrame({'A':[1,2,3],'D':[6,7,8],"C":[12,12,12]})
dfc = df + dfa
dfc
A B C D
0 2 NaN 19 NaN
1 4 NaN 20 NaN
2 6 NaN 21 NaN
df.le(dfa)
A B C D
0 True False True False
1 True False True False
2 True False True False

You can also apply the reductions: empty, any(), all(), and bool() to provide a way to summarize a boolean result:

(df<0).all()
A    False
B     True
C    False
dtype: bool
# here horyzontal direction for comparison is taking into account and we check all row’s items
(df < 0).all(axis=1)
0    False
1    False
2    False
dtype: bool
# here vertical direction for comparison is taking into
# account and we check if just one column’s item satisfies the condition
(df < 0).any()
A    False
B     True
C    False
dtype: bool
# here we check if all DataFrame's items satisfy the condition
(df < 0).any().any()
True
# here we check if DataFrame no one element
df.empty
False

Descriptive Statistics

Function Description
abs absolute value
count number of non-null observations
cumsum cumulative sum (a sequence of partial sums of a given sequence)
sum sum of values
mean mean of values
mad mean absolute deviation
median arithmetic median of values
min minimum value
max maximum value
mode mode
prod product of values
std unbiased standard deviation
var unbiased variance
print("Sum : ", movies['age'].sum())
Sum :  3089983.0
print(df)
   A  B  C
0  1 -2  7
1  2 -3  8
2  3 -4  9
print("Mean : ")
print(df.mean())

print("\nMean of all Mean Values: ")
print(df.mean().mean())

print("\nMedian: ")
print(df.median())

print("\nStandard Deviation: ")
print(df.std())

print("\nVariance: ")
print(df.var())

print("\nMax: ")
print(df.max())
Mean :
A    2.0
B   -3.0
C    8.0
dtype: float64

Mean of all Mean Values:
2.3333333333333335

Median:
A    2.0
B   -3.0
C    8.0
dtype: float64

Standard Deviation:
A    1.0
B    1.0
C    1.0
dtype: float64

Variance:
A    1.0
B    1.0
C    1.0
dtype: float64

Max:
A    3
B   -2
C    9
dtype: int64

Function Applications

When you need to make some transformations with some column’s or row’s elements, then method map will be helpful (it works like pure Python function map() ). But there is also possibility to apply some function to each DataFrame element (not to a column or a row) – method apply(map) aids in this case.

movies.loc[:, (movies.dtypes == np.int64) | (movies.dtypes == np.float64)].apply(np.mean)
# This calculates the mean of all the columns present in movies
user_id        4.624848e+02
movie_id       4.255301e+02
rating         3.529860e+00
timestamp      8.835289e+08
age            3.296650e+01
unknown        1.000000e-04
Action         2.558900e-01
Adventure      1.375300e-01
Animation      3.605000e-02
Childrens      7.182000e-02
Comedy         2.983200e-01
Crime          8.055000e-02
Documentary    7.580000e-03
Drama          3.989500e-01
Fantasy        1.352000e-02
Film-Noir      1.733000e-02
Horror         5.317000e-02
Musical        4.954000e-02
Mystery        5.245000e-02
Romance        1.946100e-01
Sci-Fi         1.273000e-01
Thriller       2.187200e-01
War            9.398000e-02
Western        1.854000e-02
dtype: float64
# to print mean of all row values in movies :
movies.loc[:,(movies.dtypes==np.int64) | (movies.dtypes==np.float64)].apply(np.mean, axis = 1)
0        3.671881e+07
1        3.692952e+07
2        3.680285e+07
3        3.712641e+07
4        3.648948e+07
5        3.662343e+07
6        3.683796e+07
7        3.664883e+07
8        3.672981e+07
9        3.657322e+07
10       3.646959e+07
11       3.714094e+07
12       3.691021e+07
13       3.667207e+07
14       3.868486e+07
15       3.711643e+07
16       3.654020e+07
17       3.646151e+07
18       3.825258e+07
19       3.684153e+07
20       3.663078e+07
21       3.653976e+07
22       3.690734e+07
23       3.700433e+07
24       3.804139e+07
25       3.704913e+07
26       3.715335e+07
27       3.680185e+07
28       3.682009e+07
29       3.682872e+07
             ...     
99970    3.836551e+07
99971    3.702680e+07
99972    3.703066e+07
99973    3.705424e+07
99974    3.661341e+07
99975    3.714594e+07
99976    3.714594e+07
99977    3.714592e+07
99978    3.714594e+07
99979    3.714594e+07
99980    3.714592e+07
99981    3.648981e+07
99982    3.869825e+07
99983    3.720672e+07
99984    3.714594e+07
99985    3.714584e+07
99986    3.876098e+07
99987    3.704094e+07
99988    3.712668e+07
99989    3.696509e+07
99990    3.712652e+07
99991    3.713393e+07
99992    3.807540e+07
99993    3.684269e+07
99994    3.678404e+07
99995    3.705384e+07
99996    3.866487e+07
99997    3.705384e+07
99998    3.696514e+07
99999    3.670202e+07
Length: 100000, dtype: float64

Remember

The attribute axis define the horizontal (axis=1) or vertical direction for calculations (axis=0)


Groupby with Dictionary

Find more about this at my GeeksForGeeks Publication.


Breaking up a String into columns using regex

Find more about this at my GeeksForGeeks Publication.


Ranking Rows in Pandas

Find more about this at my GeeksForGeeks Publication.


Follow @RohitMidha23

Find more at my Github repository Explained.

Show some :heart: by :star:ing it.

Star

rss facebook twitter github gitlab youtube mail spotify lastfm instagram linkedin google google-plus pinterest medium vimeo stackoverflow reddit quora quora