Introduction to Pandas

Friday. January 11, 2019 - 61 mins

This is the first tutorial of the Explained! series.

I will be cataloging all the work I do with regards to PyLibraries and will share it here or on my Github.

I will also be updating this post as and when I work on Pandas.

That being said, Dive in!

Series

import pandas as pd
import numpy as np
import random

first_series = pd.Series([1,2,3, np.nan ,"hello"])
first_series

      1
      2
      3
    NaN
  hello
dtype: object

series = pd.Series([1,2,3, np.nan ,"hello"], index = ['A','B','C','Unknown','String'])
series
#indexing the Series with custom values

A              1
B              2
C              3
Unknown      NaN
String     hello
dtype: object

dict = {"Python": "Fun", "C++": "Outdated","Coding":"Hmm.."}
series = pd.Series(dict)
series
# Dict to pandas Series

Python         Fun
C++       Outdated
Coding       Hmm..
dtype: object

series[['Coding','Python']]

Coding    Hmm..
Python      Fun
dtype: object

series.index

Index(['Python', 'C++', 'Coding'], dtype='object')

series.values

array(['Fun', 'Outdated', 'Hmm..'], dtype=object)

series.describe()

count            3
unique           3
top       Outdated
freq             1
dtype: object

#Series is a mutable data structures and you can easily change any item’s value:
series['Coding'] = 'Awesome'
series

Python         Fun
C++       Outdated
Coding     Awesome
dtype: object

# add new values:
series['Java'] = 'Okay'
series

Python         Fun
C++       Outdated
Coding     Awesome
Java          Okay
dtype: object

# If it is necessary to apply any mathematical operation to Series items, you may done it like below:
num_series = pd.Series([1,2,3,4,5,6,None])
num_series_changed = num_series/2
num_series_changed

  0.5
  1.0
  1.5
  2.0
  2.5
  3.0
  NaN
dtype: float64

# NULL/NaN checking can be performed with isnull() and notnull().
print(series.isnull())
print(num_series.notnull())
print(num_series_changed.notnull())

Python    False
C++       False
Coding    False
Java      False
dtype: bool
0     True
1     True
2     True
3     True
4     True
5     True
6    False
dtype: bool
0     True
1     True
2     True
3     True
4     True
5     True
6    False
dtype: bool

DataFrames

data = {'year': [1990, 1994, 1998, 2002, 2006, 2010, 2014],
        'winner': ['Germany', 'Brazil', 'France', 'Brazil','Italy', 'Spain', 'Germany'],
        'runner-up': ['Argentina', 'Italy', 'Brazil','Germany', 'France', 'Netherlands', 'Argentina'],
        'final score': ['1-0', '0-0 (pen)', '3-0', '2-0', '1-1 (pen)', '1-0', '1-0'] }
world_cup = pd.DataFrame(data, columns=['year', 'winner', 'runner-up', 'final score'])
world_cup

	year	winner	runner-up	final score
0	1990	Germany	Argentina	1-0
1	1994	Brazil	Italy	0-0 (pen)
2	1998	France	Brazil	3-0
3	2002	Brazil	Germany	2-0
4	2006	Italy	France	1-1 (pen)
5	2010	Spain	Netherlands	1-0
6	2014	Germany	Argentina	1-0

# Another way to set a DataFrame is the using of Python list of dictionaries:

data_2 = [{'year': 1990, 'winner': 'Germany', 'runner-up': 'Argentina', 'final score': '1-0'},
          {'year': 1994, 'winner': 'Brazil', 'runner-up': 'Italy', 'final score': '0-0 (pen)'},
          {'year': 1998, 'winner': 'France', 'runner-up': 'Brazil', 'final score': '3-0'},
          {'year': 2002, 'winner': 'Brazil', 'runner-up': 'Germany', 'final score': '2-0'},
          {'year': 2006, 'winner': 'Italy','runner-up': 'France', 'final score': '1-1 (pen)'},
          {'year': 2010, 'winner': 'Spain', 'runner-up': 'Netherlands', 'final score': '1-0'},
          {'year': 2014, 'winner': 'Germany', 'runner-up': 'Argentina', 'final score': '1-0'}
         ]
world_cup = pd.DataFrame(data_2)
world_cup

	final score	runner-up	winner	year
0	1-0	Argentina	Germany	1990
1	0-0 (pen)	Italy	Brazil	1994
2	3-0	Brazil	France	1998
3	2-0	Germany	Brazil	2002
4	1-1 (pen)	France	Italy	2006
5	1-0	Netherlands	Spain	2010
6	1-0	Argentina	Germany	2014

print("First 2 Rows: ",end="\n\n")
print (world_cup.head(2),end="\n\n")
print ("Last 2 Rows : ",end="\n\n")
print (world_cup.tail(2),end="\n\n")
print("Using slicing : ",end="\n\n")
print (world_cup[2:4])

First 2 Rows:

  final score  runner-up   winner  year
0         1-0  Argentina  Germany  1990
1   0-0 (pen)      Italy   Brazil  1994

Last 2 Rows :

  final score    runner-up   winner  year
5         1-0  Netherlands    Spain  2010
6         1-0    Argentina  Germany  2014

Using slicing :

  final score runner-up  winner  year
2         3-0    Brazil  France  1998
3         2-0   Germany  Brazil  2002

CSV

Reading:

df = pd.read_csv("path\to\the\csv\file\for\reading")

Writing:

df.to_csv("path\to\the\folder\where\you\want\save\csv\file")

TXT file(s)

(txt file can be read as a CSV file with other separator (delimiter); we suppose below that columns are separated by tabulation):

Reading:

df = pd.read_csv("path\to\the\txt\file\for\reading", sep='\t')

Writing:

df.to_csv("path\to\the\folder\where\you\want\save\txt\file", sep='\t')

JSON files

(an open-standard format that uses human-readable text to transmit data objects consisting of attribute–value pairs. It is the most common data format used for asynchronous browser/server communication. By its view it is very similar to Python dictionary)

Reading:

df = pd.read_json("path\to\the\json\file\for\reading", sep='\t')

Writing:

df.to_json("path\to\the\folder\where\you\want\save\json\file", sep='\t')

# To write world_cup Dataframe to a CSV File
world_cup.to_csv("worldcup.csv")
# To save CSV file without index use index=False attribute

print("File Written!",end="\n\n")

#To check if it was written
import os
print(os.path.exists('worldcup.csv'))

# reading from it in a new dataframe df
df = pd.read_csv('worldcup.csv')
print(df.head())

File Written!

True
   Unnamed: 0 final score  runner-up   winner  year
0           0         1-0  Argentina  Germany  1990
1           1   0-0 (pen)      Italy   Brazil  1994
2           2         3-0     Brazil   France  1998
3           3         2-0    Germany   Brazil  2002
4           4   1-1 (pen)     France    Italy  2006

# We can also load the data without index as :
df = pd.read_csv('worldcup.csv',index_col=0)
print(df)

  final score    runner-up   winner  year
       1-0    Argentina  Germany  1990
 0-0 (pen)        Italy   Brazil  1994
       3-0       Brazil   France  1998
       2-0      Germany   Brazil  2002
 1-1 (pen)       France    Italy  2006
       1-0  Netherlands    Spain  2010
       1-0    Argentina  Germany  2014

You can find the dataset used in the following code here.

movies=pd.read_csv("data/movies.csv",encoding = "ISO-8859-1")
# encoding is added only for this specific dataset because it gave error with utf-8

movies['release_date'] = movies['release_date'].map(pd.to_datetime)
print(movies.head(20))

#print(movies.describe())

    user_id  movie_id  rating  timestamp   age gender     occupation zip_code  \
     196       242       3  881250949  49.0      M         writer    55105   
     305       242       5  886307828  23.0      M     programmer    94086   
       6       242       4  883268170  42.0      M      executive    98101   
     234       242       4  891033261  60.0      M        retired    94702   
      63       242       3  875747190  31.0      M      marketing    75240   
     181       242       1  878961814  26.0      M      executive    21218   
     201       242       4  884110598  27.0      M         writer    E2A4H   
     249       242       5  879571438  25.0      M        student    84103   
      13       242       2  881515193  47.0      M       educator    29206   
     279       242       3  877756647  33.0      M     programmer    85251   
    145       242       5  875269755  31.0      M  entertainment    V3N4P   
     90       242       4  891382267  60.0      M       educator    78155   
    271       242       4  885844495  51.0      M       engineer    22932   
     18       242       5  880129305  35.0      F          other    37212   
      1       242       5  889751633   NaN      M            NaN    85711   
    207       242       4  890793823  39.0      M      marketing    92037   
     14       242       4  876964570  45.0      M      scientist    55106   
    113       242       2  875075887  47.0      M      executive    95032   
    123       242       5  879809053   NaN      F         artist    20008   
    296       242       4  884196057  43.0      F  administrator    16803   

     movie_title release_date   ...    Fantasy  Film-Noir  Horror  Musical  \
 Kolya (1996)   1997-01-24   ...          0          0       0        0   
 Kolya (1996)   1997-01-24   ...          0          0       0        0   
 Kolya (1996)   1997-01-24   ...          0          0       0        0   
 Kolya (1996)   1997-01-24   ...          0          0       0        0   
 Kolya (1996)   1997-01-24   ...          0          0       0        0   
 Kolya (1996)   1997-01-24   ...          0          0       0        0   
 Kolya (1996)   1997-01-24   ...          0          0       0        0   
 Kolya (1996)   1997-01-24   ...          0          0       0        0   
 Kolya (1996)   1997-01-24   ...          0          0       0        0   
 Kolya (1996)   1997-01-24   ...          0          0       0        0   
Kolya (1996)   1997-01-24   ...          0          0       0        0   
Kolya (1996)   1997-01-24   ...          0          0       0        0   
Kolya (1996)   1997-01-24   ...          0          0       0        0   
Kolya (1996)   1997-01-24   ...          0          0       0        0   
Kolya (1996)   1997-01-24   ...          0          0       0        0   
Kolya (1996)   1997-01-24   ...          0          0       0        0   
Kolya (1996)   1997-01-24   ...          0          0       0        0   
Kolya (1996)   1997-01-24   ...          0          0       0        0   
Kolya (1996)   1997-01-24   ...          0          0       0        0   
Kolya (1996)   1997-01-24   ...          0          0       0        0   

    Mystery  Romance  Sci-Fi  Thriller  War  Western  
       0        0       0         0    0        0  
       0        0       0         0    0        0  
       0        0       0         0    0        0  
       0        0       0         0    0        0  
       0        0       0         0    0        0  
       0        0       0         0    0        0  
       0        0       0         0    0        0  
       0        0       0         0    0        0  
       0        0       0         0    0        0  
       0        0       0         0    0        0  
      0        0       0         0    0        0  
      0        0       0         0    0        0  
      0        0       0         0    0        0  
      0        0       0         0    0        0  
      0        0       0         0    0        0  
      0        0       0         0    0        0  
      0        0       0         0    0        0  
      0        0       0         0    0        0  
      0        0       0         0    0        0  
      0        0       0         0    0        0  

[20 rows x 30 columns]

movies_rating = movies['rating']
# Here we are showing only one column, i.e. a Series
print ('type:', type(movies_rating))
movies_rating.head()

type: <class 'pandas.core.series.Series'>

  3
  5
  4
  4
  3
Name: rating, dtype: int64

# Filtering data
# Let's display only women
movies_user_female = movies[movies['gender']=='F']
print(movies_user_female.head())

    user_id  movie_id  rating  timestamp   age gender     occupation zip_code  \
     18       242       5  880129305  35.0      F          other    37212   
    123       242       5  879809053   NaN      F         artist    20008   
    296       242       4  884196057  43.0      F  administrator    16803   
    270       242       5  876953744  18.0      F        student    63119   
    240       242       5  885775683  23.0      F       educator    20784   

     movie_title release_date   ...    Fantasy  Film-Noir  Horror  Musical  \
Kolya (1996)   1997-01-24   ...          0          0       0        0   
Kolya (1996)   1997-01-24   ...          0          0       0        0   
Kolya (1996)   1997-01-24   ...          0          0       0        0   
Kolya (1996)   1997-01-24   ...          0          0       0        0   
Kolya (1996)   1997-01-24   ...          0          0       0        0   

    Mystery  Romance  Sci-Fi  Thriller  War  Western  
      0        0       0         0    0        0  
      0        0       0         0    0        0  
      0        0       0         0    0        0  
      0        0       0         0    0        0  
      0        0       0         0    0        0  

[5 rows x 30 columns]

#to see all the different values possible for a given column
occupation_list = movies['occupation']
print(occupation_list)

             writer
         programmer
          executive
            retired
          marketing
          executive
             writer
            student
           educator
         programmer
     entertainment
          educator
          engineer
             other
               NaN
         marketing
         scientist
         executive
            artist
     administrator
           student
           student
          educator
               NaN
            writer
               NaN
               NaN
         marketing
     administrator
           student
             ...      
       educator
          other
          other
          other
  administrator
         artist
         artist
         artist
         artist
         artist
         artist
  entertainment
        student
        student
         artist
         artist
         artist
        student
      librarian
         writer
            NaN
         artist
          other
          other
        student
        student
        student
        student
         writer
       engineer
Name: occupation, Length: 100000, dtype: object

Work with indexes and MultiIndex option

import random
indexes = [random.randrange(0,100) for i in range(5)]
data = [{i:random.randint(0,10) for i in 'ABCDE'} for i in range(5)]
df = pd.DataFrame(data, index=[1,2,3,4,5])
df

	A	B	C	D	E
1	5	8	6	4	2
2	1	3	9	8	2
3	8	7	2	1	2
4	3	8	1	3	8
5	0	7	4	4	2

movies_user_gender_male = movies[movies['gender']=='M']
movies_user_gender_male_dup = movies_user_gender_male.drop_duplicates(keep=False)
print(movies_user_gender_male.head())
# From this we can clearly see age has missing value and that from 100,000 the data reduced to 74260,
# due to filtering and removing duplicates

   user_id  movie_id  rating  timestamp   age gender  occupation zip_code  \
    196       242       3  881250949  49.0      M      writer    55105   
    305       242       5  886307828  23.0      M  programmer    94086   
      6       242       4  883268170  42.0      M   executive    98101   
    234       242       4  891033261  60.0      M     retired    94702   
     63       242       3  875747190  31.0      M   marketing    75240   

    movie_title release_date   ...    Fantasy  Film-Noir  Horror  Musical  \
Kolya (1996)   1997-01-24   ...          0          0       0        0   
Kolya (1996)   1997-01-24   ...          0          0       0        0   
Kolya (1996)   1997-01-24   ...          0          0       0        0   
Kolya (1996)   1997-01-24   ...          0          0       0        0   
Kolya (1996)   1997-01-24   ...          0          0       0        0   

   Mystery  Romance  Sci-Fi  Thriller  War  Western  
      0        0       0         0    0        0  
      0        0       0         0    0        0  
      0        0       0         0    0        0  
      0        0       0         0    0        0  
      0        0       0         0    0        0  

[5 rows x 30 columns]

#gender = female and age between 30 and 40
gender_required = ['F']
filtered_df = movies[((movies['gender'] == 'F') & (movies['age'] > 30) & (movies['age'] <40))]
filtered_df

	user_id	movie_id	rating	timestamp	age	gender	occupation	zip_code	movie_title	release_date	...	Fantasy	Film-Noir	Horror	Musical	Mystery	Romance	Sci-Fi	Thriller	War	Western
13	18	242	5	880129305	35.0	F	other	37212	Kolya (1996)	1997-01-24	...	0	0	0	0	0	0	0	0	0	0
27	129	242	4	883243972	36.0	F	marketing	07039	Kolya (1996)	1997-01-24	...	0	0	0	0	0	0	0	0	0	0
39	34	242	5	888601628	38.0	F	administrator	42141	Kolya (1996)	1997-01-24	...	0	0	0	0	0	0	0	0	0	0
42	209	242	4	883589606	33.0	F	educator	85710	Kolya (1996)	1997-01-24	...	0	0	0	0	0	0	0	0	0	0
110	861	242	5	881274504	38.0	F	NaN	14085	Kolya (1996)	1997-01-24	...	0	0	0	0	0	0	0	0	0	0
147	11	393	4	891905222	39.0	F	other	30329	Mrs. Doubtfire (1993)	1993-01-01	...	0	0	0	0	0	0	0	0	0	0
157	269	393	1	891451036	31.0	F	librarian	43201	Mrs. Doubtfire (1993)	1993-01-01	...	0	0	0	0	0	0	0	0	0	0
160	5	393	2	875636265	33.0	F	other	15213	Mrs. Doubtfire (1993)	1993-01-01	...	0	0	0	0	0	0	0	0	0	0
161	18	393	3	880130930	35.0	F	NaN	37212	Mrs. Doubtfire (1993)	1993-01-01	...	0	0	0	0	0	0	0	0	0	0
167	151	393	2	879528692	38.0	F	administrator	48103	Mrs. Doubtfire (1993)	1993-01-01	...	0	0	0	0	0	0	0	0	0	0
178	152	393	5	884018430	33.0	F	educator	68767	Mrs. Doubtfire (1993)	1993-01-01	...	0	0	0	0	0	0	0	0	0	0
187	330	393	4	876547004	35.0	F	educator	33884	Mrs. Doubtfire (1993)	1993-01-01	...	0	0	0	0	0	0	0	0	0	0
210	450	393	4	882812349	35.0	F	educator	11758	Mrs. Doubtfire (1993)	1993-01-01	...	0	0	0	0	0	0	0	0	0	0
213	457	393	3	882548583	33.0	F	salesman	30011	Mrs. Doubtfire (1993)	1993-01-01	...	0	0	0	0	0	0	0	0	0	0
241	577	393	4	880475363	36.0	F	student	77845	Mrs. Doubtfire (1993)	1993-01-01	...	0	0	0	0	0	0	0	0	0	0
246	593	393	4	886194041	31.0	F	educator	68767	Mrs. Doubtfire (1993)	1993-01-01	...	0	0	0	0	0	0	0	0	0	0
267	716	393	3	879796596	36.0	F	administrator	44265	Mrs. Doubtfire (1993)	1993-01-01	...	0	0	0	0	0	0	0	0	0	0
280	796	393	4	893218933	32.0	F	writer	33755	Mrs. Doubtfire (1993)	1993-01-01	...	0	0	0	0	0	0	0	0	0	0
325	264	381	4	886123596	36.0	F	writer	90064	Muriel's Wedding (1994)	1994-01-01	...	0	0	0	0	0	1	0	0	0	0
328	5	381	1	875636540	33.0	F	other	15213	Muriel's Wedding (1994)	1994-01-01	...	0	0	0	0	0	1	0	0	0	0
329	18	381	4	880131474	35.0	F	other	37212	Muriel's Wedding (1994)	1994-01-01	...	0	0	0	0	0	1	0	0	0	0
330	256	381	5	882165135	35.0	F	none	39042	Muriel's Wedding (1994)	1994-01-01	...	0	0	0	0	0	1	0	0	0	0
331	151	381	5	879528754	38.0	F	administrator	48103	Muriel's Wedding (1994)	1994-01-01	...	0	0	0	0	0	1	0	0	0	0
358	450	381	2	882374497	35.0	F	educator	11758	Muriel's Wedding (1994)	1994-01-01	...	0	0	0	0	0	1	0	0	0	0
384	716	381	4	879795644	36.0	F	administrator	44265	Muriel's Wedding (1994)	1994-01-01	...	0	0	0	0	0	1	0	0	0	0
387	786	381	3	882843397	36.0	F	engineer	01754	Muriel's Wedding (1994)	1994-01-01	...	0	0	0	0	0	1	0	0	0	0
388	796	381	3	893047208	32.0	F	writer	33755	Muriel's Wedding (1994)	1994-01-01	...	0	0	0	0	0	1	0	0	0	0
397	861	381	4	881274780	38.0	F	student	14085	Muriel's Wedding (1994)	1994-01-01	...	0	0	0	0	0	1	0	0	0	0
406	911	381	5	892839846	37.0	F	writer	53210	Muriel's Wedding (1994)	1994-01-01	...	0	0	0	0	0	1	0	0	0	0
423	79	251	5	891271545	39.0	F	administrator	03755	Shall We Dance? (1996)	1997-07-11	...	0	0	0	0	0	0	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
99139	5	374	3	875636905	33.0	F	other	15213	Mighty Morphin Power Rangers: The Movie (1995)	1995-01-01	...	0	0	0	0	0	0	0	0	0	0
99146	911	374	1	892841118	37.0	F	writer	53210	Mighty Morphin Power Rangers: The Movie (1995)	1995-01-01	...	0	0	0	0	0	0	0	0	0	0
99189	18	958	5	880129731	35.0	F	other	37212	To Live (Huozhe) (1994)	1994-01-01	...	0	0	0	0	0	0	0	0	0	0
99235	450	1249	3	882812821	35.0	F	educator	11758	For Love or Money (1993)	1993-01-01	...	0	0	0	0	0	0	0	0	0	0
99249	796	1055	3	893188577	32.0	F	writer	33755	Simple Twist of Fate, A (1994)	1994-01-01	...	0	0	0	0	0	0	0	0	0	0
99255	264	1474	2	886123728	36.0	F	writer	90064	Nina Takes a Lover (1994)	1994-01-01	...	0	0	0	0	0	1	0	0	0	0
99261	264	1475	2	886123596	36.0	F	writer	90064	Bhaji on the Beach (1993)	1993-01-01	...	0	0	0	0	0	0	0	0	0	0
99279	938	1254	1	891357019	38.0	F	technician	55038	Gone Fishin' (1997)	1997-05-30	...	0	0	0	0	0	0	0	0	0	0
99294	269	1479	2	891451111	31.0	F	librarian	43201	Reckless (1995)	1995-01-01	...	0	0	0	0	0	0	0	0	0	0
99298	450	1479	3	882377479	35.0	F	educator	11758	Reckless (1995)	1995-01-01	...	0	0	0	0	0	0	0	0	0	0
99313	5	267	4	875635064	33.0	F	other	15213	unknown	NaT	...	0	0	0	0	0	0	0	0	0	0
99324	18	113	5	880129628	35.0	F	other	37212	Horseman on the Roof, The (Hussard sur le toit...	1996-04-19	...	0	0	0	0	0	0	0	0	0	0
99333	18	973	3	880129595	35.0	F	other	37212	Grateful Dead (1995)	1996-10-18	...	0	0	0	0	0	0	0	0	0	0
99352	151	1297	1	879542847	38.0	F	administrator	48103	Love Affair (1994)	1994-01-01	...	0	0	0	0	0	1	0	0	0	0
99359	450	1297	4	882812635	35.0	F	educator	11758	Love Affair (1994)	1994-01-01	...	0	0	0	0	0	1	0	0	0	0
99361	796	1297	2	893047504	32.0	F	writer	33755	Love Affair (1994)	1994-01-01	...	0	0	0	0	0	1	0	0	0	0
99364	151	1299	4	879543423	38.0	F	administrator	48103	Penny Serenade (1941)	1941-01-01	...	0	0	0	0	0	1	0	0	0	0
99370	796	1299	2	892676043	32.0	F	writer	33755	Penny Serenade (1941)	1941-01-01	...	0	0	0	0	0	1	0	0	0	0
99452	688	1127	5	884153606	37.0	F	administrator	60476	Truman Show, The (1998)	1998-01-01	...	0	0	0	0	0	0	0	0	0	0
99477	34	1024	5	888602618	38.0	F	administrator	42141	Mrs. Dalloway (1997)	1997-01-01	...	0	0	0	0	0	1	0	0	0	0
99501	129	1176	4	883244059	36.0	F	marketing	07039	Welcome To Sarajevo (1997)	1997-01-01	...	0	0	0	0	0	0	0	0	1	0
99541	356	1294	4	891405721	32.0	F	homemaker	92688	Ayn Rand: A Sense of Life (1997)	1998-02-13	...	0	0	0	0	0	0	0	0	0	0
99561	796	1415	3	893219254	32.0	F	writer	33755	Next Karate Kid, The (1994)	1994-01-01	...	0	0	0	0	0	0	0	0	0	0
99630	450	1518	4	887138957	35.0	F	educator	11758	Losing Isaiah (1995)	1995-01-01	...	0	0	0	0	0	0	0	0	0	0
99642	577	1517	3	880475644	36.0	F	student	77845	Race the Sun (1996)	1996-01-01	...	0	0	0	0	0	0	0	0	0	0
99648	450	1521	3	882812350	35.0	F	educator	11758	Mr. Wonderful (1993)	1993-01-01	...	0	0	0	0	0	1	0	0	0	0
99660	796	1522	3	893194740	32.0	F	writer	33755	Trial by Jury (1994)	1994-01-01	...	0	0	0	0	0	0	0	1	0	0
99695	577	1531	4	880475408	36.0	F	student	77845	Far From Home: The Adventures of Yellow Dog (1...	1995-01-01	...	0	0	0	0	0	0	0	0	0	0
99852	450	1603	3	887139728	35.0	F	educator	11758	Angela (1995)	1996-02-16	...	0	0	0	0	0	0	0	0	0	0
99981	839	1664	1	875752902	38.0	F	entertainment	90814	8 Heads in a Duffel Bag (1997)	1997-04-18	...	0	0	0	0	0	0	0	0	0	0

5183 rows × 30 columns

Note

In the above fragment you HAVE TO ADD parantheses to each and every argument that is being compared else you will get an error.

As you can see after filtering result tables (i.e. DataFrames) have non-ordered indexes. To fix this trouble you may write the following:

filtered_df = filtered_df.reset_index()
filtered_df.head(10)

	index	user_id	movie_id	rating	timestamp	age	gender	occupation	zip_code	movie_title	...
0	13	18	242	5	880129305	35.0	F	other	37212	Kolya (1996)	...
1	27	129	242	4	883243972	36.0	F	marketing	07039	Kolya (1996)	...
2	39	34	242	5	888601628	38.0	F	administrator	42141	Kolya (1996)	...
3	42	209	242	4	883589606	33.0	F	educator	85710	Kolya (1996)	...
4	110	861	242	5	881274504	38.0	F	NaN	14085	Kolya (1996)	...
5	147	11	393	4	891905222	39.0	F	other	30329	Mrs. Doubtfire (1993)	...
6	157	269	393	1	891451036	31.0	F	librarian	43201	Mrs. Doubtfire (1993)	...
7	160	5	393	2	875636265	33.0	F	other	15213	Mrs. Doubtfire (1993)	...
8	161	18	393	3	880130930	35.0	F	NaN	37212	Mrs. Doubtfire (1993)	...
9	167	151	393	2	879528692	38.0	F	administrator	48103	Mrs. Doubtfire (1993)	...

10 rows × 31 columns

# set 'user_id' 'movie_id' as index
filtered_df_new = filtered_df.set_index(['user_id','movie_id'])
filtered_df_new.head(10)

# Note that set_index takes only a list as an argument to it.
# if you remove the [] then only the first argument is set as the index.

		index	rating	timestamp	age	gender	occupation	zip_code	movie_title	release_date	IMDb_URL	...	Fantasy	Film-Noir	Horror	Musical	Mystery	Romance	Sci-Fi	Thriller	War	Western
user_id	movie_id
18	242	13	5	880129305	35.0	F	other	37212	Kolya (1996)	1997-01-24	http://us.imdb.com/M/title-exact?Kolya%20(1996)	...	0	0	0	0	0	0	0	0	0	0
129	242	27	4	883243972	36.0	F	marketing	07039	Kolya (1996)	1997-01-24	http://us.imdb.com/M/title-exact?Kolya%20(1996)	...	0	0	0	0	0	0	0	0	0	0
34	242	39	5	888601628	38.0	F	administrator	42141	Kolya (1996)	1997-01-24	http://us.imdb.com/M/title-exact?Kolya%20(1996)	...	0	0	0	0	0	0	0	0	0	0
209	242	42	4	883589606	33.0	F	educator	85710	Kolya (1996)	1997-01-24	http://us.imdb.com/M/title-exact?Kolya%20(1996)	...	0	0	0	0	0	0	0	0	0	0
861	242	110	5	881274504	38.0	F	NaN	14085	Kolya (1996)	1997-01-24	http://us.imdb.com/M/title-exact?Kolya%20(1996)	...	0	0	0	0	0	0	0	0	0	0
11	393	147	4	891905222	39.0	F	other	30329	Mrs. Doubtfire (1993)	1993-01-01	http://us.imdb.com/M/title-exact?Mrs.%20Doubtf...	...	0	0	0	0	0	0	0	0	0	0
269	393	157	1	891451036	31.0	F	librarian	43201	Mrs. Doubtfire (1993)	1993-01-01	http://us.imdb.com/M/title-exact?Mrs.%20Doubtf...	...	0	0	0	0	0	0	0	0	0	0
5	393	160	2	875636265	33.0	F	other	15213	Mrs. Doubtfire (1993)	1993-01-01	http://us.imdb.com/M/title-exact?Mrs.%20Doubtf...	...	0	0	0	0	0	0	0	0	0	0
18	393	161	3	880130930	35.0	F	NaN	37212	Mrs. Doubtfire (1993)	1993-01-01	http://us.imdb.com/M/title-exact?Mrs.%20Doubtf...	...	0	0	0	0	0	0	0	0	0	0
151	393	167	2	879528692	38.0	F	administrator	48103	Mrs. Doubtfire (1993)	1993-01-01	http://us.imdb.com/M/title-exact?Mrs.%20Doubtf...	...	0	0	0	0	0	0	0	0	0	0

10 rows × 29 columns

# By default, `set_index()` returns a new DataFrame.
# so you’ll have to specify if you’d like the changes to occur in place.
# Here we used filtered_df_new to get the new dataframe and now see the type of filtererd_df_new

print(type(filtered_df_new.index))

<class 'pandas.core.indexes.multi.MultiIndex'>

Notice here that we now have a new sort of ‘index’ which is MultiIndex, which contains information about indexing of DataFrame and allows manipulating with this data.

filtered_df_new.index.names
# Gives you the names of the two index values we set as a FrozenList

FrozenList(['user_id', 'movie_id'])

Method get_level_values() allows to get all values for the corresponding index level. get_level_values(0) corresponds to ‘user_id’ and get_level_values(1) corresponds to ‘movie_id’

print(filtered_df_new.index.get_level_values(0))
print(filtered_df_new.index.get_level_values(1))

Int64Index([ 18, 129,  34, 209, 861,  11, 269,   5,  18, 151,
            ...
            129, 356, 796, 450, 577, 450, 796, 577, 450, 839],
           dtype='int64', name='user_id', length=5183)
Int64Index([ 242,  242,  242,  242,  242,  393,  393,  393,  393,  393,
            ...
            1176, 1294, 1415, 1518, 1517, 1521, 1522, 1531, 1603, 1664],
           dtype='int64', name='movie_id', length=5183)

Selection by label and position

Object selection in pandas is now supported by three types of multi-axis indexing.

.loc works on labels in the index;
.iloc works on the positions in the index (so it only takes integers);

The sequence of the following examples demonstrates how we can manipulate with DataFrame’s rows. At first let’s get the first row of movies:

movies.loc[0]

user_id                                                     196
movie_id                                                    242
rating                                                        3
timestamp                                             881250949
age                                                          49
gender                                                        M
occupation                                               writer
zip_code                                                  55105
movie_title                                        Kolya (1996)
release_date                                1997-01-24 00:00:00
IMDb_URL        http://us.imdb.com/M/title-exact?Kolya%20(1996)
unknown                                                       0
Action                                                        0
Adventure                                                     0
Animation                                                     0
Childrens                                                     0
Comedy                                                        1
Crime                                                         0
Documentary                                                   0
Drama                                                         0
Fantasy                                                       0
Film-Noir                                                     0
Horror                                                        0
Musical                                                       0
Mystery                                                       0
Romance                                                       0
Sci-Fi                                                        0
Thriller                                                      0
War                                                           0
Western                                                       0
Name: 0, dtype: object

movies.loc[1:3]

	user_id	movie_id	rating	timestamp	age	gender	occupation	zip_code	movie_title	release_date	...
1	305	242	5	886307828	23.0	M	programmer	94086	Kolya (1996)	1997-01-24	...
2	6	242	4	883268170	42.0	M	executive	98101	Kolya (1996)	1997-01-24	...
3	234	242	4	891033261	60.0	M	retired	94702	Kolya (1996)	1997-01-24	...

3 rows × 30 columns

If you want to return specific columns then you have to specify them as a separate argument of .loc

movies.loc[1:3 , 'movie_title']

  Kolya (1996)
  Kolya (1996)
  Kolya (1996)
Name: movie_title, dtype: object

movies.loc[1:5 , ['movie_title','age','gender']]
# If more than one column is to be selected then you have to give the second argument of .loc as a list

	movie_title	age	gender
1	Kolya (1996)	23.0	M
2	Kolya (1996)	42.0	M
3	Kolya (1996)	60.0	M
4	Kolya (1996)	31.0	M
5	Kolya (1996)	26.0	M

# movies.iloc[1:5 , ['movie_title','age','gender']]
# Gives error as iloc only uses integer values

movies.iloc[0]

user_id                                                     196
movie_id                                                    242
rating                                                        3
timestamp                                             881250949
age                                                          49
gender                                                        M
occupation                                               writer
zip_code                                                  55105
movie_title                                        Kolya (1996)
release_date                                1997-01-24 00:00:00
IMDb_URL        http://us.imdb.com/M/title-exact?Kolya%20(1996)
unknown                                                       0
Action                                                        0
Adventure                                                     0
Animation                                                     0
Childrens                                                     0
Comedy                                                        1
Crime                                                         0
Documentary                                                   0
Drama                                                         0
Fantasy                                                       0
Film-Noir                                                     0
Horror                                                        0
Musical                                                       0
Mystery                                                       0
Romance                                                       0
Sci-Fi                                                        0
Thriller                                                      0
War                                                           0
Western                                                       0
Name: 0, dtype: object

movies.iloc[1:5]

	user_id	movie_id	rating	timestamp	age	gender	occupation	zip_code	movie_title	release_date	...
1	305	242	5	886307828	23.0	M	programmer	94086	Kolya (1996)	1997-01-24	...
2	6	242	4	883268170	42.0	M	executive	98101	Kolya (1996)	1997-01-24	...
3	234	242	4	891033261	60.0	M	retired	94702	Kolya (1996)	1997-01-24	...
4	63	242	3	875747190	31.0	M	marketing	75240	Kolya (1996)	1997-01-24	...

4 rows × 30 columns

# movies.select(lambda x: x%2==0).head() is the same as :
movies.loc[movies.index.map(lambda x: x%2==0)].head()

# .select() has been deprecated for now and will be completely removed in future updates so use .loc

	user_id	movie_id	rating	timestamp	age	gender	occupation	zip_code	movie_title	release_date	...
0	196	242	3	881250949	49.0	M	writer	55105	Kolya (1996)	1997-01-24	...
2	6	242	4	883268170	42.0	M	executive	98101	Kolya (1996)	1997-01-24	...
4	63	242	3	875747190	31.0	M	marketing	75240	Kolya (1996)	1997-01-24	...
6	201	242	4	884110598	27.0	M	writer	E2A4H	Kolya (1996)	1997-01-24	...
8	13	242	2	881515193	47.0	M	educator	29206	Kolya (1996)	1997-01-24	...

5 rows × 30 columns

Working with Missing Data

Pandas primarily uses the value np.nan to represent missing data (in table missed/empty value are marked by NaN). It is by default not included in computations. Missing data creates many issues at mathematical or computational tasks with DataFrames and Series and it’s important to know how fight with these values.

ages = movies['age']
sum(ages)

nan

This is because there are so many cases where Age isn’t given and hence takes on the value of np.nan. We can use fillna()a very effecient pandas method for filling missing values

ages = movies['age'].fillna(0)
sum(ages)

3089983.0

This fills all the values with 0 and calculates the sum. To remain only rows with non-null values you can use method dropna()

ages = movies['age'].dropna()
sum(ages)

3089983.0

movies_nonnull = movies.dropna()
movies_nonnull.head(20)
#14th value was dropped because it had a missing value in a column

	user_id	movie_id	rating	timestamp	age	gender	occupation	zip_code	movie_title	release_date	...
0	196	242	3	881250949	49.0	M	writer	55105	Kolya (1996)	1997-01-24	...
1	305	242	5	886307828	23.0	M	programmer	94086	Kolya (1996)	1997-01-24	...
2	6	242	4	883268170	42.0	M	executive	98101	Kolya (1996)	1997-01-24	...
3	234	242	4	891033261	60.0	M	retired	94702	Kolya (1996)	1997-01-24	...
4	63	242	3	875747190	31.0	M	marketing	75240	Kolya (1996)	1997-01-24	...
5	181	242	1	878961814	26.0	M	executive	21218	Kolya (1996)	1997-01-24	...
6	201	242	4	884110598	27.0	M	writer	E2A4H	Kolya (1996)	1997-01-24	...
7	249	242	5	879571438	25.0	M	student	84103	Kolya (1996)	1997-01-24	...
8	13	242	2	881515193	47.0	M	educator	29206	Kolya (1996)	1997-01-24	...
9	279	242	3	877756647	33.0	M	programmer	85251	Kolya (1996)	1997-01-24	...
10	145	242	5	875269755	31.0	M	entertainment	V3N4P	Kolya (1996)	1997-01-24	...
11	90	242	4	891382267	60.0	M	educator	78155	Kolya (1996)	1997-01-24	...
12	271	242	4	885844495	51.0	M	engineer	22932	Kolya (1996)	1997-01-24	...
13	18	242	5	880129305	35.0	F	other	37212	Kolya (1996)	1997-01-24	...
15	207	242	4	890793823	39.0	M	marketing	92037	Kolya (1996)	1997-01-24	...
16	14	242	4	876964570	45.0	M	scientist	55106	Kolya (1996)	1997-01-24	...
17	113	242	2	875075887	47.0	M	executive	95032	Kolya (1996)	1997-01-24	...
19	296	242	4	884196057	43.0	F	administrator	16803	Kolya (1996)	1997-01-24	...
20	154	242	3	879138235	25.0	M	student	53703	Kolya (1996)	1997-01-24	...
21	270	242	5	876953744	18.0	F	student	63119	Kolya (1996)	1997-01-24	...

20 rows × 30 columns

movies_notnull = movies.dropna(how='all',subset=['age','occupation'])
#Drops all nan values from movies belonging to age and occupation
movies_notnull.info()
#Notice how age and occupation now have nearly 6000 lesser values

<class 'pandas.core.frame.DataFrame'>
Int64Index: 99616 entries, 0 to 99999
Data columns (total 30 columns):
user_id         99616 non-null int64
movie_id        99616 non-null int64
rating          99616 non-null int64
timestamp       99616 non-null int64
age             93731 non-null float64
gender          99616 non-null object
occupation      93806 non-null object
zip_code        99616 non-null object
movie_title     99616 non-null object
release_date    99607 non-null datetime64[ns]
IMDb_URL        99603 non-null object
unknown         99616 non-null int64
Action          99616 non-null int64
Adventure       99616 non-null int64
Animation       99616 non-null int64
Childrens       99616 non-null int64
Comedy          99616 non-null int64
Crime           99616 non-null int64
Documentary     99616 non-null int64
Drama           99616 non-null int64
Fantasy         99616 non-null int64
Film-Noir       99616 non-null int64
Horror          99616 non-null int64
Musical         99616 non-null int64
Mystery         99616 non-null int64
Romance         99616 non-null int64
Sci-Fi          99616 non-null int64
Thriller        99616 non-null int64
War             99616 non-null int64
Western         99616 non-null int64
dtypes: datetime64[ns](1), float64(1), int64(23), object(5)
memory usage: 23.6+ MB

Thus, if how='all', we get DataFrame, where all values in both columns from subset are NaN.

If how='any', we get DataFrame, where at least one contains NaN.

movies.describe()

	user_id	movie_id	rating	timestamp	age	unknown	Action	Adventure	Animation	Childrens	...	Fantasy	Film-Noir	Horror	Musical	Mystery	Romance	Sci-Fi	Thriller	War	Western
count	100000.00000	100000.000000	100000.000000	1.000000e+05	93731.000000	100000.0000	100000.000000	100000.000000	100000.000000	100000.000000	...	100000.000000	100000.000000	100000.000000	100000.000000	100000.000000	100000.000000	100000.00000	100000.00000	100000.000000	100000.000000
mean	462.48475	425.530130	3.529860	8.835289e+08	32.966500	0.0001	0.255890	0.137530	0.036050	0.071820	...	0.013520	0.017330	0.053170	0.049540	0.052450	0.194610	0.12730	0.21872	0.093980	0.018540
std	266.61442	330.798356	1.125674	5.343856e+06	11.561809	0.0100	0.436362	0.344408	0.186416	0.258191	...	0.115487	0.130498	0.224373	0.216994	0.222934	0.395902	0.33331	0.41338	0.291802	0.134894
min	1.00000	1.000000	1.000000	8.747247e+08	7.000000	0.0000	0.000000	0.000000	0.000000	0.000000	...	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.00000	0.00000	0.000000	0.000000
25%	254.00000	175.000000	3.000000	8.794487e+08	24.000000	0.0000	0.000000	0.000000	0.000000	0.000000	...	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.00000	0.00000	0.000000	0.000000
50%	447.00000	322.000000	4.000000	8.828269e+08	30.000000	0.0000	0.000000	0.000000	0.000000	0.000000	...	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.00000	0.00000	0.000000	0.000000
75%	682.00000	631.000000	4.000000	8.882600e+08	40.000000	0.0000	1.000000	0.000000	0.000000	0.000000	...	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.00000	0.00000	0.000000	0.000000
max	943.00000	1682.000000	5.000000	8.932866e+08	73.000000	1.0000	1.000000	1.000000	1.000000	1.000000	...	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.00000	1.00000	1.000000	1.000000

8 rows × 24 columns

At first, let’s find all unique dates in ‘release_date’ column of movies and then select only dates in range lower than 1995.

movies['release_date'] = movies['release_date'].map(pd.to_datetime)
# We map it to_datetime as pandas has a set way to deal with dates and then we can effectively work with dates.
unique_dates = movies['release_date'].drop_duplicates().dropna()
# Drops duplicates and nan values
unique_dates

     1997-01-24
   1993-01-01
   1994-01-01
   1997-07-11
   1986-01-01
   1997-01-01
   1987-01-01
  1979-01-01
  1996-04-26
  1995-01-01
  1996-03-08
  1996-11-15
  1990-01-01
  1971-01-01
  1978-01-01
  1997-07-04
  1996-04-12
  1996-12-18
  1996-04-23
  1996-10-04
  1997-06-27
  1997-01-31
  1996-06-28
  1988-01-01
  1995-10-30
  1996-02-09
  1996-09-28
  1997-05-09
  1996-10-11
  1997-08-15
           ...    
 1996-09-24
 1997-01-29
 1996-09-04
 1996-09-16
 1997-12-18
 1998-03-17
 1996-06-05
 1996-12-15
 1998-04-03
 1996-05-17
 1998-03-10
 1996-10-26
 1998-01-23
 1998-03-14
 1932-01-01
 1996-01-15
 1996-04-02
 1998-02-20
 1997-04-22
 1998-10-09
 1998-02-01
 1996-07-22
 1926-01-01
 1998-01-21
 1998-02-11
 1986-04-26
 1998-03-06
 1996-09-18
 1996-02-28
 1997-04-30
Name: release_date, Length: 240, dtype: datetime64[ns]

# find dates with year lower/equal than 1995
unique_dates_1 = filter(lambda x: x.year <= 1995, unique_dates)
# filter() takes two arguments. First one should return only boolean values and the second one is the variable over which ititerates over.
# This basically takes unique_dates and uses the lambda function (here, it returns bool values) and filters True cases.

unique_dates_1

<filter at 0x1187af6a0>

Here we have used drop_duplicates() method to select only unique Series values. Then we can filter movies with respect to release_date condition. Each datetime Python object possesses with attributes year, month, day, etc. allowing to extract values of year, month, day, etc. from the date. We call the new DataFrame as old_movies.

old_movies = movies[movies['release_date'].isin(unique_dates_1)]
old_movies.head()

	user_id	movie_id	rating	timestamp	age	gender	occupation	zip_code	movie_title	release_date	...
117	196	393	4	881251863	49.0	M	writer	55105	Mrs. Doubtfire (1993)	1993-01-01	...
118	22	393	4	878886989	25.0	M	writer	40206	Mrs. Doubtfire (1993)	1993-01-01	...
119	244	393	3	880607365	28.0	M	technician	80525	Mrs. Doubtfire (1993)	1993-01-01	...
120	298	393	4	884183099	44.0	M	executive	01581	Mrs. Doubtfire (1993)	1993-01-01	...
121	286	393	4	877534481	27.0	M	student	15217	Mrs. Doubtfire (1993)	1993-01-01	...

5 rows × 30 columns

Now we may filter DataFrame old_movies by age and rating. Lets’ drop timestamp, zip_code

# get all users with age less than 25 that rated old movies higher than 3
old_movies_watch = old_movies[(old_movies['age']<25) & (old_movies['rating']>3)]
# Drop timestamp and zip_code
old_movies_watch = old_movies_watch.drop(['timestamp', 'zip_code'],axis=1)

old_movies_watch.head()

	user_id	movie_id	rating	age	gender	occupation	movie_title	release_date	IMDb_URL	...
124	303	393	4	19.0	M	student	Mrs. Doubtfire (1993)	1993-01-01	http://us.imdb.com/M/title-exact?Mrs.%20Doubtf...	...
135	276	393	4	21.0	M	student	Mrs. Doubtfire (1993)	1993-01-01	http://us.imdb.com/M/title-exact?Mrs.%20Doubtf...	...
153	128	393	4	24.0	F	marketing	Mrs. Doubtfire (1993)	1993-01-01	http://us.imdb.com/M/title-exact?Mrs.%20Doubtf...	...
162	130	393	5	20.0	M	none	Mrs. Doubtfire (1993)	1993-01-01	http://us.imdb.com/M/title-exact?Mrs.%20Doubtf...	...
183	314	393	4	20.0	F	student	Mrs. Doubtfire (1993)	1993-01-01	http://us.imdb.com/M/title-exact?Mrs.%20Doubtf...	...

5 rows × 28 columns

Pandas has support for accelerating certain types of binary numerical and boolean operations using the numexpr library (it uses smart chunking, caching, and multiple cores) and the bottleneck libraries (is a set of specialized cython routines that are especially fast when dealing with arrays that have NaNs). It allows one to increase pandas functionality a lot. This advantage is shown for some boolean and calculation operations. To count the time elapsed on operation performing we will use the decorator

# this function counts the time for a particular operation

def timer(func):
    from datetime import datetime
    def wrapper(*args):
        start = datetime.now()
        func(*args)
        end = datetime.now()
        return 'elapsed time = {' + str(end - start)+'}'
    return wrapper

import random
n = 100
# generate rangon datasets
df_1 = pd.DataFrame({'col :'+str(i):[random.randint(-100,100) for j in range(n)]for i in range(n)})
# here we pass a dictionary to the DataFrame() constructor.
# The key is "col : i" where i can take random values and the value for those keys is i.

df_2 = pd.DataFrame({'col :'+str(i):[random.randint(-100,100) for j in range(n)] for i in range(n)})

@timer
def direct_comparison(df_1, df_2):
    bool_df = pd.DataFrame({'col_{}'.format(i): [True for j in range(n)] for i in range(n)})
    for i in range(len(df_1.index)):
        for j in range(len(df_1.loc[i])):
            if df_1.loc[i, df_1.columns[j]] >= df_2.loc[i, df_2.columns[j]]:
                bool_df.loc[i,bool_df.columns[j]] = False
    return bool_df

@timer
def pandas_comparison(df_1, df_2):
    return df_1 < df_2

print ('direct_comparison:', (direct_comparison(df_1, df_2)))
print ('pandas_comparison:', (pandas_comparison(df_1, df_2)))

direct_comparison: elapsed time = {0:00:03.362719}
pandas_comparison: elapsed time = {0:00:00.029600}

As you can see, the difference in speed is too noticeable.

Besides, pandas possesses methods eq (equal), ne (not equal), lt (less then), gt (greater than), le (less or equal) and ge (greater or equal) for simplifying boolean comparison

Matrix Addition

df = pd.DataFrame({'A':[1,2,3],'B':[-2,-3,-4],"C":[7,8,9]})
dfa = pd.DataFrame({'A':[1,2,3],'D':[6,7,8],"C":[12,12,12]})
dfc = df + dfa
dfc

	A	B	C	D
0	2	NaN	19	NaN
1	4	NaN	20	NaN
2	6	NaN	21	NaN

df.le(dfa)

	A	B	C	D
0	True	False	True	False
1	True	False	True	False
2	True	False	True	False

You can also apply the reductions: empty, any(), all(), and bool() to provide a way to summarize a boolean result:

(df<0).all()

A    False
B     True
C    False
dtype: bool

# here horyzontal direction for comparison is taking into account and we check all row’s items
(df < 0).all(axis=1)

  False
  False
  False
dtype: bool

# here vertical direction for comparison is taking into
# account and we check if just one column’s item satisfies the condition
(df < 0).any()

A    False
B     True
C    False
dtype: bool

# here we check if all DataFrame's items satisfy the condition
(df < 0).any().any()

True

# here we check if DataFrame no one element
df.empty

False

Descriptive Statistics

Function	Description
abs	absolute value
count	number of non-null observations
cumsum	cumulative sum (a sequence of partial sums of a given sequence)
sum	sum of values
mean	mean of values
mad	mean absolute deviation
median	arithmetic median of values
min	minimum value
max	maximum value
mode	mode
prod	product of values
std	unbiased standard deviation
var	unbiased variance

print("Sum : ", movies['age'].sum())

Sum :  3089983.0

print(df)

print("Mean : ")
print(df.mean())

print("\nMean of all Mean Values: ")
print(df.mean().mean())

print("\nMedian: ")
print(df.median())

print("\nStandard Deviation: ")
print(df.std())

print("\nVariance: ")
print(df.var())

print("\nMax: ")
print(df.max())

Mean :
A    2.0
B   -3.0
C    8.0
dtype: float64

Mean of all Mean Values:
2.3333333333333335

Median:
A    2.0
B   -3.0
C    8.0
dtype: float64

Standard Deviation:
A    1.0
B    1.0
C    1.0
dtype: float64

Variance:
A    1.0
B    1.0
C    1.0
dtype: float64

Max:
A    3
B   -2
C    9
dtype: int64

Function Applications

When you need to make some transformations with some column’s or row’s elements, then method map will be helpful (it works like pure Python function map() ). But there is also possibility to apply some function to each DataFrame element (not to a column or a row) – method apply(map) aids in this case.

movies.loc[:, (movies.dtypes == np.int64) | (movies.dtypes == np.float64)].apply(np.mean)
# This calculates the mean of all the columns present in movies

user_id        4.624848e+02
movie_id       4.255301e+02
rating         3.529860e+00
timestamp      8.835289e+08
age            3.296650e+01
unknown        1.000000e-04
Action         2.558900e-01
Adventure      1.375300e-01
Animation      3.605000e-02
Childrens      7.182000e-02
Comedy         2.983200e-01
Crime          8.055000e-02
Documentary    7.580000e-03
Drama          3.989500e-01
Fantasy        1.352000e-02
Film-Noir      1.733000e-02
Horror         5.317000e-02
Musical        4.954000e-02
Mystery        5.245000e-02
Romance        1.946100e-01
Sci-Fi         1.273000e-01
Thriller       2.187200e-01
War            9.398000e-02
Western        1.854000e-02
dtype: float64

# to print mean of all row values in movies :
movies.loc[:,(movies.dtypes==np.int64) | (movies.dtypes==np.float64)].apply(np.mean, axis = 1)

      3.671881e+07
      3.692952e+07
      3.680285e+07
      3.712641e+07
      3.648948e+07
      3.662343e+07
      3.683796e+07
      3.664883e+07
      3.672981e+07
      3.657322e+07
     3.646959e+07
     3.714094e+07
     3.691021e+07
     3.667207e+07
     3.868486e+07
     3.711643e+07
     3.654020e+07
     3.646151e+07
     3.825258e+07
     3.684153e+07
     3.663078e+07
     3.653976e+07
     3.690734e+07
     3.700433e+07
     3.804139e+07
     3.704913e+07
     3.715335e+07
     3.680185e+07
     3.682009e+07
     3.682872e+07
             ...     
  3.836551e+07
  3.702680e+07
  3.703066e+07
  3.705424e+07
  3.661341e+07
  3.714594e+07
  3.714594e+07
  3.714592e+07
  3.714594e+07
  3.714594e+07
  3.714592e+07
  3.648981e+07
  3.869825e+07
  3.720672e+07
  3.714594e+07
  3.714584e+07
  3.876098e+07
  3.704094e+07
  3.712668e+07
  3.696509e+07
  3.712652e+07
  3.713393e+07
  3.807540e+07
  3.684269e+07
  3.678404e+07
  3.705384e+07
  3.866487e+07
  3.705384e+07
  3.696514e+07
  3.670202e+07
Length: 100000, dtype: float64

Remember

The attribute axis define the horizontal (axis=1) or vertical direction for calculations (axis=0)

Groupby with Dictionary

Find more about this at my GeeksForGeeks Publication.

Breaking up a String into columns using regex

Find more about this at my GeeksForGeeks Publication.

Ranking Rows in Pandas

Find more about this at my GeeksForGeeks Publication.

Follow @RohitMidha23

Find more at my Github repository Explained.

Show some by ing it.

Star

Introduction to Pandas

Series

DataFrames

CSV

Reading:

Writing:

TXT file(s)

Reading:

Writing:

JSON files

Reading:

Writing:

Work with indexes and MultiIndex option

Note

Selection by label and position

Working with Missing Data

Matrix Addition

Descriptive Statistics

Function Applications

Remember

Groupby with Dictionary

Breaking up a String into columns using regex

Ranking Rows in Pandas

Related Posts