Tutorial 2 - Pandas

Usando Pandas com Movielens

Agora vamos realizar algumas análises mais interessantes.

Você pode baixar os dados aqui.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Lendo os dados

users = pd.read_csv('pydata2/users.csv')

users.head()
user_id age sex occupation zip_code
0 1 24 M technician 85711
1 2 53 F other 94043
2 3 23 M writer 32067
3 4 24 M technician 43537
4 5 33 F other 15213
users.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 943 entries, 0 to 942
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   user_id     943 non-null    int64 
 1   age         943 non-null    int64 
 2   sex         943 non-null    object
 3   occupation  943 non-null    object
 4   zip_code    943 non-null    object
dtypes: int64(2), object(3)
memory usage: 37.0+ KB
movies = pd.read_csv('pydata2/movies.csv')
movies.head()
movie_id title release_date video_release_date imdb_url
0 2 GoldenEye (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?GoldenEye%20(...
1 3 Four Rooms (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Four%20Rooms%...
2 4 Get Shorty (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Get%20Shorty%...
3 5 Copycat (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Copycat%20(1995)
4 6 Shanghai Triad (Yao a yao yao dao waipo qiao) ... 01-Jan-1995 NaN http://us.imdb.com/Title?Yao+a+yao+yao+dao+wai...
movies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1681 entries, 0 to 1680
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   movie_id            1681 non-null   int64  
 1   title               1681 non-null   object 
 2   release_date        1680 non-null   object 
 3   video_release_date  0 non-null      float64
 4   imdb_url            1678 non-null   object 
dtypes: float64(1), int64(1), object(3)
memory usage: 65.8+ KB
ratings = pd.read_csv('pydata2/ratings.csv')
ratings.head()
user_id movie_id rating unix_timestamp
0 196 242 3 881250949
1 186 302 3 891717742
2 22 377 1 878887116
3 244 51 2 880606923
4 166 346 1 886397596
ratings.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 4 columns):
 #   Column          Non-Null Count   Dtype
---  ------          --------------   -----
 0   user_id         100000 non-null  int64
 1   movie_id        100000 non-null  int64
 2   rating          100000 non-null  int64
 3   unix_timestamp  100000 non-null  int64
dtypes: int64(4)
memory usage: 3.1 MB

Juntando os dados

movie_ratings = pd.merge(movies, ratings, how='right')
lens = pd.merge(movie_ratings, users)
lens.head()
movie_id title release_date video_release_date imdb_url user_id rating unix_timestamp age sex occupation zip_code
0 242 Kolya (1996) 24-Jan-1997 NaN http://us.imdb.com/M/title-exact?Kolya%20(1996) 196 3 881250949 49 M writer 55105
1 393 Mrs. Doubtfire (1993) 01-Jan-1993 NaN http://us.imdb.com/M/title-exact?Mrs.%20Doubtf... 196 4 881251863 49 M writer 55105
2 381 Muriel's Wedding (1994) 01-Jan-1994 NaN http://us.imdb.com/M/title-exact?Muriel's%20We... 196 4 881251728 49 M writer 55105
3 251 Shall We Dance? (1996) 11-Jul-1997 NaN http://us.imdb.com/M/title-exact?Shall%20we%20... 196 3 881251274 49 M writer 55105
4 655 Stand by Me (1986) 01-Jan-1986 NaN http://us.imdb.com/M/title-exact?Stand%20by%20... 196 5 881251793 49 M writer 55105
lens.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 12 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   movie_id            100000 non-null  int64  
 1   title               99548 non-null   object 
 2   release_date        99539 non-null   object 
 3   video_release_date  0 non-null       float64
 4   imdb_url            99535 non-null   object 
 5   user_id             100000 non-null  int64  
 6   rating              100000 non-null  int64  
 7   unix_timestamp      100000 non-null  int64  
 8   age                 100000 non-null  int64  
 9   sex                 100000 non-null  object 
 10  occupation          100000 non-null  object 
 11  zip_code            100000 non-null  object 
dtypes: float64(1), int64(5), object(6)
memory usage: 9.2+ MB

Análise Exploratória de Dados

Quais são os 25 filmes mais avaliados?

best = lens.groupby('title').count().reset_index()[['title','movie_id']].rename(columns={'movie_id':'count'}).sort_values('count',
                ascending=False)
                
best.head(25)
title count
1398 Star Wars (1977) 583
333 Contact (1997) 509
498 Fargo (1996) 508
1234 Return of the Jedi (1983) 507
860 Liar Liar (1997) 485
460 English Patient, The (1996) 481
1284 Scream (1996) 478
32 Air Force One (1997) 431
744 Independence Day (ID4) (1996) 429
1205 Raiders of the Lost Ark (1981) 420
612 Godfather, The (1972) 413
1190 Pulp Fiction (1994) 394
1542 Twelve Monkeys (1995) 392
1329 Silence of the Lambs, The (1991) 390
780 Jerry Maguire (1996) 384
293 Chasing Amy (1997) 379
1251 Rock, The (1996) 378
456 Empire Strikes Back, The (1980) 367
1394 Star Trek: First Contact (1996) 365
113 Back to the Future (1985) 350
1500 Titanic (1997) 350
987 Mission: Impossible (1996) 344
570 Fugitive, The (1993) 336
747 Indiana Jones and the Last Crusade (1989) 331
1632 Willy Wonka and the Chocolate Factory (1971) 326

Outra forma de fazer o mesmo (porém, neste caso retorna uma Pandas series:

best2 = lens.title.value_counts()[:25]
best2
title
Star Wars (1977)                                583
Contact (1997)                                  509
Fargo (1996)                                    508
Return of the Jedi (1983)                       507
Liar Liar (1997)                                485
English Patient, The (1996)                     481
Scream (1996)                                   478
Air Force One (1997)                            431
Independence Day (ID4) (1996)                   429
Raiders of the Lost Ark (1981)                  420
Godfather, The (1972)                           413
Pulp Fiction (1994)                             394
Twelve Monkeys (1995)                           392
Silence of the Lambs, The (1991)                390
Jerry Maguire (1996)                            384
Chasing Amy (1997)                              379
Rock, The (1996)                                378
Empire Strikes Back, The (1980)                 367
Star Trek: First Contact (1996)                 365
Back to the Future (1985)                       350
Titanic (1997)                                  350
Mission: Impossible (1996)                      344
Fugitive, The (1993)                            336
Indiana Jones and the Last Crusade (1989)       331
Willy Wonka and the Chocolate Factory (1971)    326
Name: count, dtype: int64

Quais são os filmes com as maiores notas?

movie_stats = lens.groupby('title').agg({'occupation':'count',
'rating':'mean','age':'mean'}).sort_values('rating', ascending=False).rename(columns={'occupation':'count','age':'mean_age'}).round(decimals=2)

movie_stats
count rating mean_age
title
Santa with Muscles (1996) 2 5.0 26.50
Saint of Fort Washington, The (1993) 2 5.0 24.00
Prefontaine (1997) 3 5.0 29.00
Marlene Dietrich: Shadow and Light (1996) 1 5.0 60.00
Someone Else's America (1995) 1 5.0 27.00
... ... ... ...
Man from Down Under, The (1943) 1 1.0 22.00
Good Morning (1971) 1 1.0 26.00
Bird of Prey (1996) 1 1.0 26.00
Gordy (1995) 3 1.0 28.33
Power 98 (1995) 1 1.0 47.00

1663 rows × 3 columns

Os filmes acima são avaliados tão raramente, que nem podemos qualificá-los como top filmes. Que tal apenas olharmos os filmes que foram avaliados pelo menos 100 vezes.

movie_stats.query('count >= 100')
count rating mean_age
title
Close Shave, A (1995) 112 4.49 31.44
Schindler's List (1993) 298 4.47 33.45
Wrong Trousers, The (1993) 118 4.47 31.55
Casablanca (1942) 243 4.46 35.90
Shawshank Redemption, The (1994) 283 4.45 32.81
... ... ... ...
Spawn (1997) 143 2.62 28.29
Event Horizon (1997) 127 2.57 31.08
Crash (1996) 128 2.55 32.45
Jungle2Jungle (1997) 132 2.44 32.27
Cable Guy, The (1996) 106 2.34 28.43

337 rows × 3 columns

outra forma de fazer o mesmo:

atleast_100 = movie_stats['count'] >= 100

atleast_100_movies = movie_stats[atleast_100]

atleast_100_movies
count rating mean_age
title
Close Shave, A (1995) 112 4.49 31.44
Schindler's List (1993) 298 4.47 33.45
Wrong Trousers, The (1993) 118 4.47 31.55
Casablanca (1942) 243 4.46 35.90
Shawshank Redemption, The (1994) 283 4.45 32.81
... ... ... ...
Spawn (1997) 143 2.62 28.29
Event Horizon (1997) 127 2.57 31.08
Crash (1996) 128 2.55 32.45
Jungle2Jungle (1997) 132 2.44 32.27
Cable Guy, The (1996) 106 2.34 28.43

337 rows × 3 columns

Como se distribuem as idades?

Vamos criar um gráfico mostrando a distribuição das idades.

pandas tem um integração nativa com a biblioteca matplotlib que permite a plotagem de Series/DataFrames ficarem mais fáceis ainda. Neste caso apenas chamamos o método histo na coluna que queremos produzir o histograma. Nós podemos também chamar o matplotlib.pyplot para customizar o nosso gráfico um pouco. (sempre lembrar de rotular seus eixos!)

users.age.hist(bins=30)
# Add title and labels
plt.title('Distribuição das idades')
plt.xlabel('Idades')
plt.ylabel('Frequencia')

# Show the plot
plt.show()

Eu não acho muito legal comparar nossas idades individuais, portanto vamos binarizar nossos usuários em groupos por idade usando o pd.cut.

labels = ['0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79']
lens['age_group'] = pd.cut(lens.age, range(0, 81, 10), right=False, labels=labels)

lens[['title','age','age_group']]
title age age_group
0 Kolya (1996) 49 40-49
1 Mrs. Doubtfire (1993) 49 40-49
2 Muriel's Wedding (1994) 49 40-49
3 Shall We Dance? (1996) 49 40-49
4 Stand by Me (1986) 49 40-49
... ... ... ...
99995 City of Lost Children, The (1995) 20 20-29
99996 Heat (1995) 20 20-29
99997 NaN 20 20-29
99998 Liar Liar (1997) 20 20-29
99999 Waiting for Guffman (1996) 20 20-29

100000 rows × 3 columns

Vimos que há uma valor ‘NaN’ no título, vamos conferior melhor:

lens.isna().sum()
movie_id                   0
title                    452
release_date             461
video_release_date    100000
imdb_url                 465
user_id                    0
rating                     0
unix_timestamp             0
age                        0
sex                        0
occupation                 0
zip_code                   0
age_group                  0
dtype: int64

Vamos eliminar os ‘NaN’:

lens = lens.dropna(subset=['title'])
lens
movie_id title release_date video_release_date imdb_url user_id rating unix_timestamp age sex occupation zip_code age_group
0 242 Kolya (1996) 24-Jan-1997 NaN http://us.imdb.com/M/title-exact?Kolya%20(1996) 196 3 881250949 49 M writer 55105 40-49
1 393 Mrs. Doubtfire (1993) 01-Jan-1993 NaN http://us.imdb.com/M/title-exact?Mrs.%20Doubtf... 196 4 881251863 49 M writer 55105 40-49
2 381 Muriel's Wedding (1994) 01-Jan-1994 NaN http://us.imdb.com/M/title-exact?Muriel's%20We... 196 4 881251728 49 M writer 55105 40-49
3 251 Shall We Dance? (1996) 11-Jul-1997 NaN http://us.imdb.com/M/title-exact?Shall%20we%20... 196 3 881251274 49 M writer 55105 40-49
4 655 Stand by Me (1986) 01-Jan-1986 NaN http://us.imdb.com/M/title-exact?Stand%20by%20... 196 5 881251793 49 M writer 55105 40-49
... ... ... ... ... ... ... ... ... ... ... ... ... ...
99994 300 Air Force One (1997) 01-Jan-1997 NaN http://us.imdb.com/M/title-exact?Air+Force+One... 941 4 875048495 20 M student 97229 20-29
99995 919 City of Lost Children, The (1995) 01-Jan-1995 NaN http://us.imdb.com/Title?Cit%E9+des+enfants+pe... 941 5 875048887 20 M student 97229 20-29
99996 273 Heat (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Heat%20(1995) 941 3 875049038 20 M student 97229 20-29
99998 294 Liar Liar (1997) 21-Mar-1997 NaN http://us.imdb.com/Title?Liar+Liar+(1997) 941 4 875048532 20 M student 97229 20-29
99999 1007 Waiting for Guffman (1996) 31-Jan-1997 NaN http://us.imdb.com/M/title-exact?Waiting%20for... 941 4 875049077 20 M student 97229 20-29

99548 rows × 13 columns

Vamos verificar a contagem e o rating médio por cada faixa etária:

lens_age = lens.groupby('age_group').agg({'rating': 'mean', 'title':'count'})

lens_age
rating title
age_group
0-9 3.767442 43
10-19 3.485511 8144
20-29 3.465155 39346
30-39 3.552180 25575
40-49 3.591265 14951
50-59 3.635389 8675
60-69 3.649351 2618
70-79 3.642857 196

Jovens usuários tendem ser mais críticos que os outros grupos por faixa etária.

E vamos fazer um gráfico com estas informações.

plt.bar(lens_age.index,lens_age['rating'])
plt.title("Rating médio por faixa etária")
plt.show()

Vamos olhar apenas para os 50 filmes mais avaliados e ver como eles são vistos por cada grupo.

best50 = atleast_100_movies.head(50)
best50.info()
<class 'pandas.core.frame.DataFrame'>
Index: 50 entries, Close Shave, A (1995) to Apt Pupil (1998)
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   count     50 non-null     int64  
 1   rating    50 non-null     float64
 2   mean_age  50 non-null     float64
dtypes: float64(2), int64(1)
memory usage: 1.6+ KB
best50.head(5)
count rating mean_age
title
Close Shave, A (1995) 112 4.49 31.44
Schindler's List (1993) 298 4.47 33.45
Wrong Trousers, The (1993) 118 4.47 31.55
Casablanca (1942) 243 4.46 35.90
Shawshank Redemption, The (1994) 283 4.45 32.81
lens50 = lens.set_index('title')
lens50.info()
<class 'pandas.core.frame.DataFrame'>
Index: 99548 entries, Kolya (1996) to Waiting for Guffman (1996)
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   movie_id            99548 non-null  int64   
 1   release_date        99539 non-null  object  
 2   video_release_date  0 non-null      float64 
 3   imdb_url            99535 non-null  object  
 4   user_id             99548 non-null  int64   
 5   rating              99548 non-null  int64   
 6   unix_timestamp      99548 non-null  int64   
 7   age                 99548 non-null  int64   
 8   sex                 99548 non-null  object  
 9   occupation          99548 non-null  object  
 10  zip_code            99548 non-null  object  
 11  age_group           99548 non-null  category
dtypes: category(1), float64(1), int64(5), object(5)
memory usage: 9.2+ MB
by_age = lens50.loc[best50.index]

by_age.reset_index()
title movie_id release_date video_release_date imdb_url user_id rating unix_timestamp age sex occupation zip_code age_group
0 Close Shave, A (1995) 408 28-Apr-1996 NaN http://us.imdb.com/M/title-exact?Close%20Shave... 305 5 886323189 23 M programmer 94086 20-29
1 Close Shave, A (1995) 408 28-Apr-1996 NaN http://us.imdb.com/M/title-exact?Close%20Shave... 6 4 883599075 42 M executive 98101 40-49
2 Close Shave, A (1995) 408 28-Apr-1996 NaN http://us.imdb.com/M/title-exact?Close%20Shave... 286 4 875806800 27 M student 15217 20-29
3 Close Shave, A (1995) 408 28-Apr-1996 NaN http://us.imdb.com/M/title-exact?Close%20Shave... 303 4 879467035 19 M student 14853 10-19
4 Close Shave, A (1995) 408 28-Apr-1996 NaN http://us.imdb.com/M/title-exact?Close%20Shave... 299 4 877877847 29 M doctor 63108 20-29
... ... ... ... ... ... ... ... ... ... ... ... ... ...
11173 Apt Pupil (1998) 315 23-Oct-1998 NaN http://us.imdb.com/Title?Apt+Pupil+(1998) 931 5 891037577 60 M educator 33556 60-69
11174 Apt Pupil (1998) 315 23-Oct-1998 NaN http://us.imdb.com/Title?Apt+Pupil+(1998) 934 4 891188403 61 M engineer 22902 60-69
11175 Apt Pupil (1998) 315 23-Oct-1998 NaN http://us.imdb.com/Title?Apt+Pupil+(1998) 940 4 884801125 32 M administrator 02215 30-39
11176 Apt Pupil (1998) 315 23-Oct-1998 NaN http://us.imdb.com/Title?Apt+Pupil+(1998) 942 4 891282355 48 F librarian 78209 40-49
11177 Apt Pupil (1998) 315 23-Oct-1998 NaN http://us.imdb.com/Title?Apt+Pupil+(1998) 926 4 888351623 49 M entertainment 01701 40-49

11178 rows × 13 columns

by_age.groupby(['title','age_group']).agg({'rating': 'mean'})
rating
title age_group
12 Angry Men (1957) 0-9 NaN
10-19 4.500000
20-29 4.230769
30-39 4.382353
40-49 4.500000
... ... ...
Wrong Trousers, The (1993) 30-39 4.558824
40-49 4.538462
50-59 4.625000
60-69 5.000000
70-79 NaN

400 rows × 1 columns

Quais são os filmes que mulheres e homens mais se diferem em gostos ?

lens.index
Index([    0,     1,     2,     3,     4,     5,     6,     7,     8,     9,
       ...
       99989, 99990, 99991, 99992, 99993, 99994, 99995, 99996, 99998, 99999],
      dtype='int64', length=99548)
pivoted = lens.pivot_table(index=['title'], 
                           columns=['sex'], 
                           values='rating', 
                           fill_value=0)
                           
pivoted.head(5)
sex F M
title
'Til There Was You (1997) 2.200000 2.500000
1-900 (1994) 1.000000 3.000000
101 Dalmatians (1996) 3.116279 2.772727
12 Angry Men (1957) 4.269231 4.363636
187 (1997) 3.500000 2.870968

Vamos calcular a diferença entre as notas de homens e mulheres:

pivoted['diff']=pivoted.M - pivoted.F

pivoted.head(5)
sex F M diff
title
'Til There Was You (1997) 2.200000 2.500000 0.300000
1-900 (1994) 1.000000 3.000000 2.000000
101 Dalmatians (1996) 3.116279 2.772727 -0.343552
12 Angry Men (1957) 4.269231 4.363636 0.094406
187 (1997) 3.500000 2.870968 -0.629032
disagreements = pivoted[pivoted.index.isin(best50.index)]['diff']
disagreements_sorted = disagreements.sort_values()

disagreements_sorted
title
Sling Blade (1996)                                                            -0.284314
Philadelphia Story, The (1940)                                                -0.239583
Schindler's List (1993)                                                       -0.226519
It's a Wonderful Life (1946)                                                  -0.224182
Secrets & Lies (1996)                                                         -0.212963
Graduate, The (1967)                                                          -0.198571
Close Shave, A (1995)                                                         -0.169213
Henry V (1989)                                                                -0.167255
Shawshank Redemption, The (1994)                                              -0.151541
Good Will Hunting (1997)                                                      -0.132911
To Kill a Mockingbird (1962)                                                  -0.081159
Taxi Driver (1976)                                                            -0.061303
Manchurian Candidate, The (1962)                                              -0.054348
Titanic (1997)                                                                -0.047139
Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963)   -0.044608
Silence of the Lambs, The (1991)                                              -0.040690
Psycho (1960)                                                                 -0.038637
Wrong Trousers, The (1993)                                                     0.028083
Braveheart (1995)                                                              0.031136
Rear Window (1954)                                                             0.048148
Usual Suspects, The (1995)                                                     0.065728
Cinema Paradiso (1988)                                                         0.071970
Casablanca (1942)                                                              0.073404
Chinatown (1974)                                                               0.087179
Killing Fields, The (1984)                                                     0.091039
12 Angry Men (1957)                                                            0.094406
Citizen Kane (1941)                                                            0.109337
Mr. Smith Goes to Washington (1939)                                            0.110000
As Good As It Gets (1997)                                                      0.119048
Boot, Das (1981)                                                               0.129814
L.A. Confidential (1997)                                                       0.146463
Star Wars (1977)                                                               0.153115
Amadeus (1984)                                                                 0.172094
Princess Bride, The (1987)                                                     0.176333
Great Escape, The (1963)                                                       0.182092
One Flew Over the Cuckoo's Nest (1975)                                         0.183333
Fargo (1996)                                                                   0.185044
Blade Runner (1982)                                                            0.192640
North by Northwest (1959)                                                      0.200366
Godfather, The (1972)                                                          0.201082
Raiders of the Lost Ark (1981)                                                 0.217328
Vertigo (1958)                                                                 0.231177
African Queen, The (1951)                                                      0.245614
Apt Pupil (1998)                                                               0.263323
Empire Strikes Back, The (1980)                                                0.300964
Godfather: Part II, The (1974)                                                 0.323521
Maltese Falcon, The (1941)                                                     0.355844
Lawrence of Arabia (1962)                                                      0.398715
Raging Bull (1980)                                                             0.471655
Bridge on the River Kwai, The (1957)                                           0.621978
Name: diff, dtype: float64
# Create the horizontal bar plot
disagreements_sorted.plot(kind='barh', figsize=[9, 15])

# Add title and labels
plt.title('Diferença Média no Rating\n(Diferença > 0 = Favoritado por homens)')
plt.ylabel('Título')
plt.xlabel('Diferença Média no Rating')


# Show the plot
plt.show()