Você pode baixar os dados aqui.
Tutorial 1 - Pandas
O Pandas é uma biblioteca Python fundamental para análise de dados. Ela fornece estruturas de dados eficientes e ferramentas para manipulação, limpeza e análise de dados.
import pandas as pd
Criando um dataframe
= {'name': ['Alice', 'Bob', 'Charlie', 'Alice', 'David', 'Bob','Camille'],
data 'age': [25, 30, 35, 25, 40, 30,20],
'title':['Sherlock','The Walking Dead','Dark',
'Friends','Orange Is the New Black',
'The Walking Dead','Narcos']}
= pd.DataFrame(data) df
print(df)
name age title
0 Alice 25 Sherlock
1 Bob 30 The Walking Dead
2 Charlie 35 Dark
3 Alice 25 Friends
4 David 40 Orange Is the New Black
5 Bob 30 The Walking Dead
6 Camille 20 Narcos
Carregando os dados
= pd.read_excel("data1/netflix_series_limpo.xlsx")
netflix = pd.read_excel("data1/imdb_series.xlsx") imdb
print(netflix.head)
<bound method NDFrame.head of series_title season episode Date
0 Away Season 1 Home 2020-10-06
1 Away Season 1 Spektr 2020-10-06
2 Away Season 1 Vital Signs 2020-10-05
3 Away Season 1 Goodnight Mars 2020-10-05
4 Away Season 1 A Little Faith 2020-10-05
.. ... ... ... ...
918 Orphan Black Season 1 Natural Selection 2015-08-21
919 Bates Motel Season 2 The Immutable Truth 2015-08-15
920 Bates Motel Season 2 The Box 2015-08-15
921 Bates Motel Season 2 Meltdown 2015-08-14
922 Bates Motel Season 2 Presumed Innocent 2015-08-13
[923 rows x 4 columns]>
Também podemos ver rapidamente a estrutura de cada dataset, utilizando a função dtypes e describe().
netflix.describe()
Date | |
---|---|
count | 923 |
mean | 2018-05-29 23:58:26.392199424 |
min | 2015-08-13 00:00:00 |
25% | 2017-01-06 00:00:00 |
50% | 2018-09-10 00:00:00 |
75% | 2019-07-30 00:00:00 |
max | 2020-10-06 00:00:00 |
netflix.dtypes
series_title object
season object
episode object
Date datetime64[ns]
dtype: object
Vamos converter a coluna season
em variável categórica.
'season'].astype('category') netflix[
0 Season 1
1 Season 1
2 Season 1
3 Season 1
4 Season 1
...
918 Season 1
919 Season 2
920 Season 2
921 Season 2
922 Season 2
Name: season, Length: 923, dtype: category
Categories (32, object): [' 1ª temporada', ' Back in Business', ' Berry Bitty Adventures', ' Chapter Eight', ..., ' Turma da Mônica', ' Volume 1', ' Volume 2', ' Welcome to Ever After High']
Podemos ver as dimensões do nosso dataframe:
netflix.shape
(923, 4)
Selecionando colunas
Para selecionar colunas utilizamos a forma df[['coluna']]
onde df é o nome do dataframe e ‘coluna’ é o nome da coluna ou colunas que queremos selecionar.
'series_title','season']] netflix[[
series_title | season | |
---|---|---|
0 | Away | Season 1 |
1 | Away | Season 1 |
2 | Away | Season 1 |
3 | Away | Season 1 |
4 | Away | Season 1 |
... | ... | ... |
918 | Orphan Black | Season 1 |
919 | Bates Motel | Season 2 |
920 | Bates Motel | Season 2 |
921 | Bates Motel | Season 2 |
922 | Bates Motel | Season 2 |
923 rows × 2 columns
Ordenando os dados
Para ordenar as linhas, podemos utilizar a função sort_values()
de forma a termos, por exemplo, uma lista de maior a menor de um determinado valor. Vamos usar o dataframe imdb
para exemplificar, ordenando as linhas por ordem crescente de UserRating
:
imdb.head()
series_name | Episode | series_ep | season | season_ep | url | UserRating | UserVotes | r1 | r2 | r3 | r4 | r5 | r6 | r7 | r8 | r9 | r10 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 13 Reasons Why | Tape 1, Side A | 1 | 1 | 1 | http://www.imdb.com/title/tt5174246/?ref_=ttep... | 8.3 | 7016 | 0.055445 | 0.004418 | 0.004704 | 0.005844 | 0.014681 | 0.032640 | 0.105188 | 0.237030 | 0.246579 | 0.293472 |
1 | 13 Reasons Why | Tape 1, Side B | 2 | 1 | 2 | http://www.imdb.com/title/tt5174248/?ref_=ttep... | 8.0 | 5859 | 0.056665 | 0.004438 | 0.005803 | 0.008022 | 0.016726 | 0.045400 | 0.138420 | 0.311145 | 0.171872 | 0.241509 |
2 | 13 Reasons Why | Tape 2, Side A | 3 | 1 | 3 | http://www.imdb.com/title/tt5174250/?ref_=ttep... | 7.9 | 5509 | 0.058813 | 0.003993 | 0.005627 | 0.009984 | 0.022509 | 0.051734 | 0.160828 | 0.302414 | 0.140860 | 0.243238 |
3 | 13 Reasons Why | Tape 2, Side B | 4 | 1 | 4 | http://www.imdb.com/title/tt5174252/?ref_=ttep... | 8.1 | 5309 | 0.063477 | 0.003956 | 0.005086 | 0.007911 | 0.019213 | 0.045960 | 0.129215 | 0.298550 | 0.171219 | 0.255415 |
4 | 13 Reasons Why | Tape 3, Side A | 5 | 1 | 5 | http://www.imdb.com/title/tt5174254/?ref_=ttep... | 8.2 | 5252 | 0.066832 | 0.002666 | 0.004570 | 0.008378 | 0.018850 | 0.037129 | 0.114242 | 0.257997 | 0.203542 | 0.285796 |
= imdb.sort_values('UserRating',
imdb =False) ascending
Agora vamos ver como ficou:
'series_name','UserRating']] imdb[[
series_name | UserRating | |
---|---|---|
2359 | Lúcifer | 9.8 |
134 | Dark | 9.7 |
1404 | The Walking Dead | 9.7 |
1183 | Friends | 9.7 |
2371 | Lúcifer | 9.7 |
... | ... | ... |
773 | Pânico: A Série de TV | 5.5 |
774 | Pânico: A Série de TV | 5.2 |
1908 | Dracula | 5.2 |
42 | 13 Reasons Why | 5.2 |
137 | Dracula | 5.2 |
2597 rows × 2 columns
Filtrando Linhas
Podemos filtrar linhas utilizando o método df.query()
onde df é o nome do dataframe.
'UserVotes > 10000') imdb.query(
series_name | Episode | series_ep | season | season_ep | url | UserRating | UserVotes | r1 | r2 | r3 | r4 | r5 | r6 | r7 | r8 | r9 | r10 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2359 | Lúcifer | A Devil of My Word | 55 | 3 | 24 | http://www.imdb.com/title/tt8253126/?ref_=ttep... | 9.8 | 10712 | 0.021004 | 0.001587 | 0.000840 | 0.000747 | 0.002054 | 0.002801 | 0.007468 | 0.018764 | 0.063107 | 0.881628 |
134 | Dark | The Paradise | 26 | 3 | 8 | http://www.imdb.com/title/tt12557704/?ref_=tte... | 9.7 | 17468 | 0.020151 | 0.002633 | 0.003263 | 0.002805 | 0.004351 | 0.006698 | 0.012308 | 0.022556 | 0.059824 | 0.865411 |
1404 | The Walking Dead | No Way Out | 57 | 6 | 9 | http://www.imdb.com/title/tt4575388/?ref_=ttep... | 9.7 | 24358 | 0.027424 | 0.003490 | 0.002094 | 0.002669 | 0.005665 | 0.007390 | 0.015519 | 0.033008 | 0.101979 | 0.800764 |
1371 | The Walking Dead | Too Far Gone | 24 | 4 | 8 | http://www.imdb.com/title/tt2948638/?ref_=ttep... | 9.7 | 22061 | 0.044876 | 0.002584 | 0.001541 | 0.001496 | 0.004624 | 0.006029 | 0.013100 | 0.039255 | 0.126286 | 0.760210 |
985 | The Walking Dead | No Way Out | 57 | 6 | 9 | http://www.imdb.com/title/tt4575388/?ref_=ttep... | 9.7 | 24358 | 0.027424 | 0.003490 | 0.002094 | 0.002669 | 0.005665 | 0.007390 | 0.015519 | 0.033008 | 0.101979 | 0.800764 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1411 | The Walking Dead | Last Day on Earth | 64 | 6 | 16 | http://www.imdb.com/title/tt4589574/?ref_=ttep... | 6.6 | 26981 | 0.263667 | 0.036878 | 0.037285 | 0.037211 | 0.052259 | 0.049590 | 0.068826 | 0.090693 | 0.104221 | 0.259368 |
992 | The Walking Dead | Last Day on Earth | 64 | 6 | 16 | http://www.imdb.com/title/tt4589574/?ref_=ttep... | 6.6 | 26981 | 0.263667 | 0.036878 | 0.037285 | 0.037211 | 0.052259 | 0.049590 | 0.068826 | 0.090693 | 0.104221 | 0.259368 |
1337 | Stranger Things | Chapter Seven: The Lost Sister | 15 | 2 | 7 | http://www.imdb.com/title/tt6020810/?ref_=ttep... | 6.1 | 21371 | 0.166955 | 0.045716 | 0.047869 | 0.059239 | 0.100791 | 0.118198 | 0.157784 | 0.116794 | 0.058958 | 0.127696 |
1417 | The Walking Dead | Swear | 70 | 7 | 6 | http://www.imdb.com/title/tt5207734/?ref_=ttep... | 5.6 | 11971 | 0.197978 | 0.042686 | 0.047615 | 0.060647 | 0.102164 | 0.140506 | 0.133823 | 0.087127 | 0.037591 | 0.149862 |
998 | The Walking Dead | Swear | 70 | 7 | 6 | http://www.imdb.com/title/tt5207734/?ref_=ttep... | 5.6 | 11971 | 0.197978 | 0.042686 | 0.047615 | 0.060647 | 0.102164 | 0.140506 | 0.133823 | 0.087127 | 0.037591 | 0.149862 |
124 rows × 18 columns
Se quisermos combinar filtrado de linhas com seleção de colunas:
= imdb.query('UserVotes > 10000')[['series_name','Episode','UserRating','UserVotes']] imdb_10
10) imdb_10.head(
series_name | Episode | UserRating | UserVotes | |
---|---|---|---|---|
2359 | Lúcifer | A Devil of My Word | 9.8 | 10712 |
134 | Dark | The Paradise | 9.7 | 17468 |
1404 | The Walking Dead | No Way Out | 9.7 | 24358 |
1371 | The Walking Dead | Too Far Gone | 9.7 | 22061 |
985 | The Walking Dead | No Way Out | 9.7 | 24358 |
1314 | Sherlock | The Reichenbach Fall | 9.7 | 33904 |
1307 | Friends | The Last One | 9.7 | 11634 |
952 | The Walking Dead | Too Far Gone | 9.7 | 22061 |
961 | The Walking Dead | No Sanctuary | 9.6 | 23608 |
133 | Dark | Between the Time | 9.6 | 13301 |
imdb_10.shape
(124, 4)
vamos identificar se há duplicados.
= imdb_10.duplicated()
duplicates print(duplicates)
2359 False
134 False
1404 False
1371 False
985 True
...
1411 False
992 True
1337 False
1417 False
998 True
Length: 124, dtype: bool
Notamos que existem duplicados, portanto precisamos removê-los utilizando o método drop_duplicates()
.
= imdb_10.drop_duplicates(subset=['Episode']) imdb_10
Vamos ver quantas linhas temos agora.
imdb_10.shape
(91, 4)
também podemos filtrar por strings. Por exemplo, vamos filtrar pelo seriado “Dark”:
'series_name'].isin(['Dark']) imdb[
2359 False
134 True
1404 False
1183 False
2371 False
...
773 False
774 False
1908 False
42 False
137 False
Name: series_name, Length: 2597, dtype: bool
Vemos que o código acima retorna uma série com valores lógicos (True-False) para cada linha. Para efetivamente ver o resultado do seriado Dark devemos:
'series_name'].isin(['Dark'])] imdb[imdb[
series_name | Episode | series_ep | season | season_ep | url | UserRating | UserVotes | r1 | r2 | r3 | r4 | r5 | r6 | r7 | r8 | r9 | r10 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
134 | Dark | The Paradise | 26 | 3 | 8 | http://www.imdb.com/title/tt12557704/?ref_=tte... | 9.7 | 17468 | 0.020151 | 0.002633 | 0.003263 | 0.002805 | 0.004351 | 0.006698 | 0.012308 | 0.022556 | 0.059824 | 0.865411 |
133 | Dark | Between the Time | 25 | 3 | 7 | http://www.imdb.com/title/tt12557700/?ref_=tte... | 9.6 | 13301 | 0.020374 | 0.001880 | 0.002556 | 0.002707 | 0.004285 | 0.006842 | 0.014811 | 0.031577 | 0.073904 | 0.841065 |
124 | Dark | An Endless Cycle | 16 | 2 | 6 | http://www.imdb.com/title/tt10454734/?ref_=tte... | 9.6 | 11149 | 0.007983 | 0.000807 | 0.001256 | 0.001435 | 0.001884 | 0.005292 | 0.016055 | 0.045834 | 0.124765 | 0.794690 |
131 | Dark | Life and Death | 23 | 3 | 5 | http://www.imdb.com/title/tt12557688/?ref_=tte... | 9.5 | 10359 | 0.021238 | 0.001834 | 0.002510 | 0.002993 | 0.005502 | 0.010908 | 0.020948 | 0.044985 | 0.123661 | 0.765421 |
126 | Dark | Endings and Beginnings | 18 | 2 | 8 | http://www.imdb.com/title/tt10457654/?ref_=tte... | 9.5 | 11379 | 0.012040 | 0.002109 | 0.001846 | 0.002373 | 0.004482 | 0.009667 | 0.020301 | 0.045698 | 0.123473 | 0.778012 |
122 | Dark | The Travelers | 14 | 2 | 4 | http://www.imdb.com/title/tt10454726/?ref_=tte... | 9.5 | 9968 | 0.007925 | 0.000903 | 0.001204 | 0.001505 | 0.002307 | 0.006120 | 0.019161 | 0.056180 | 0.166132 | 0.738563 |
123 | Dark | Lost and Found | 15 | 2 | 5 | http://www.imdb.com/title/tt10454732/?ref_=tte... | 9.4 | 9210 | 0.007600 | 0.000869 | 0.001303 | 0.001629 | 0.002606 | 0.007166 | 0.023127 | 0.075570 | 0.197720 | 0.682410 |
125 | Dark | The White Devil | 17 | 2 | 7 | http://www.imdb.com/title/tt10457652/?ref_=tte... | 9.3 | 9050 | 0.008066 | 0.001657 | 0.001878 | 0.001768 | 0.003757 | 0.009392 | 0.026851 | 0.083315 | 0.204972 | 0.658343 |
132 | Dark | Light and Shadow | 24 | 3 | 6 | http://www.imdb.com/title/tt12557694/?ref_=tte... | 9.3 | 10060 | 0.024453 | 0.001889 | 0.002783 | 0.004175 | 0.006362 | 0.010934 | 0.025547 | 0.063221 | 0.127137 | 0.733499 |
116 | Dark | As You Sow, so You Shall Reap | 8 | 1 | 8 | http://www.imdb.com/title/tt7313316/?ref_=ttep... | 9.2 | 9294 | 0.007209 | 0.001076 | 0.000753 | 0.001829 | 0.003443 | 0.011298 | 0.030773 | 0.101033 | 0.257586 | 0.585001 |
118 | Dark | Alpha and Omega | 10 | 1 | 10 | http://www.imdb.com/title/tt7313322/?ref_=ttep... | 9.2 | 9758 | 0.010248 | 0.001947 | 0.001230 | 0.003279 | 0.005944 | 0.011273 | 0.032589 | 0.086801 | 0.229248 | 0.617442 |
121 | Dark | Ghosts | 13 | 2 | 3 | http://www.imdb.com/title/tt10454722/?ref_=tte... | 9.2 | 9138 | 0.007004 | 0.000766 | 0.001970 | 0.002079 | 0.004487 | 0.008645 | 0.030422 | 0.100897 | 0.226855 | 0.616875 |
114 | Dark | Sic Mundus Creatus Est | 6 | 1 | 6 | http://www.imdb.com/title/tt7313312/?ref_=ttep... | 9.1 | 9434 | 0.006996 | 0.000742 | 0.001272 | 0.001272 | 0.004240 | 0.014204 | 0.035616 | 0.115222 | 0.264151 | 0.556286 |
130 | Dark | The Origin | 22 | 3 | 4 | http://www.imdb.com/title/tt12557686/?ref_=tte... | 9.1 | 9529 | 0.022248 | 0.003883 | 0.004722 | 0.004198 | 0.006611 | 0.013118 | 0.033582 | 0.082695 | 0.163816 | 0.665128 |
113 | Dark | Truths | 5 | 1 | 5 | http://www.imdb.com/title/tt7313308/?ref_=ttep... | 9.0 | 9563 | 0.005751 | 0.000837 | 0.001046 | 0.002301 | 0.004183 | 0.014744 | 0.045070 | 0.134790 | 0.278992 | 0.512287 |
120 | Dark | Dark Matter | 12 | 2 | 2 | http://www.imdb.com/title/tt10454716/?ref_=tte... | 9.0 | 9221 | 0.006941 | 0.001084 | 0.001193 | 0.002603 | 0.003579 | 0.010303 | 0.034920 | 0.122655 | 0.247479 | 0.569244 |
127 | Dark | Deja-vu | 19 | 3 | 1 | http://www.imdb.com/title/tt10414808/?ref_=tte... | 9.0 | 10609 | 0.022622 | 0.003205 | 0.003959 | 0.004430 | 0.006221 | 0.014704 | 0.039872 | 0.098407 | 0.202658 | 0.603921 |
129 | Dark | Adam and Eva | 21 | 3 | 3 | http://www.imdb.com/title/tt12557682/?ref_=tte... | 8.9 | 9364 | 0.022747 | 0.002777 | 0.004806 | 0.004058 | 0.008330 | 0.016339 | 0.040474 | 0.109141 | 0.194682 | 0.596647 |
128 | Dark | The Survivors | 20 | 3 | 2 | http://www.imdb.com/title/tt12557670/?ref_=tte... | 8.9 | 9666 | 0.022967 | 0.003311 | 0.004138 | 0.004552 | 0.006104 | 0.017381 | 0.045520 | 0.119284 | 0.202773 | 0.573971 |
119 | Dark | Beginnings and Endings | 11 | 2 | 1 | http://www.imdb.com/title/tt7787482/?ref_=ttep... | 8.9 | 9969 | 0.007122 | 0.001505 | 0.001404 | 0.003611 | 0.004213 | 0.013442 | 0.049955 | 0.154780 | 0.238239 | 0.525730 |
115 | Dark | Crossroads | 7 | 1 | 7 | http://www.imdb.com/title/tt7305824/?ref_=ttep... | 8.8 | 8725 | 0.005845 | 0.001261 | 0.001032 | 0.002636 | 0.005501 | 0.015931 | 0.051920 | 0.194269 | 0.275186 | 0.446418 |
117 | Dark | Everything Is Now | 9 | 1 | 9 | http://www.imdb.com/title/tt7313320/?ref_=ttep... | 8.8 | 8599 | 0.006978 | 0.001744 | 0.001512 | 0.001279 | 0.006280 | 0.018025 | 0.055472 | 0.185719 | 0.260147 | 0.462845 |
111 | Dark | Past and Present | 3 | 1 | 3 | http://www.imdb.com/title/tt7305820/?ref_=ttep... | 8.7 | 9585 | 0.006051 | 0.001148 | 0.001669 | 0.002608 | 0.006886 | 0.023996 | 0.067710 | 0.208242 | 0.255921 | 0.425769 |
109 | Dark | Secrets | 1 | 1 | 1 | http://www.imdb.com/title/tt6305578/?ref_=ttep... | 8.3 | 11242 | 0.007917 | 0.001245 | 0.003202 | 0.003914 | 0.011564 | 0.030422 | 0.106298 | 0.274239 | 0.194094 | 0.367105 |
112 | Dark | Double Lives | 4 | 1 | 4 | http://www.imdb.com/title/tt7305818/?ref_=ttep... | 8.3 | 9088 | 0.005722 | 0.001761 | 0.001540 | 0.003631 | 0.009793 | 0.029820 | 0.102993 | 0.281690 | 0.180238 | 0.382812 |
110 | Dark | Lies | 2 | 1 | 2 | http://www.imdb.com/title/tt7305776/?ref_=ttep... | 8.2 | 9903 | 0.006160 | 0.001616 | 0.003130 | 0.002827 | 0.011108 | 0.034939 | 0.119661 | 0.293547 | 0.165303 | 0.361709 |
vamos guardar este dataframe com o nome darkdf
:
= imdb[imdb['series_name'].isin(['Dark'])] darkdf
Agrupando dados por categorias
Vamos agrupar os dados de imdb
e calcular algumas informações: a contagem de capítulos por seriado, a média do UserRating
por seriado e a soma total dos votos.
= imdb.groupby(['series_name']).agg({'series_name':"count",'UserRating':'mean','UserVotes':'sum'}) imdb_summary
imdb_summary
series_name | UserRating | UserVotes | |
---|---|---|---|
series_name | |||
13 Reasons Why | 49 | 7.248980 | 151452 |
Arquivo X | 356 | 7.988202 | 969628 |
Away | 10 | 7.080000 | 5388 |
Como Defender um Assassino | 180 | 8.568889 | 228352 |
Dark | 26 | 9.076923 | 264631 |
Dracula | 6 | 6.966667 | 28858 |
Era Uma Vez | 312 | 8.266026 | 397332 |
Friends | 235 | 8.451915 | 853287 |
Greenleaf | 60 | 7.586667 | 2777 |
Grimm: Contos de Terror | 123 | 8.435772 | 104967 |
Hemlock Grove | 33 | 7.490909 | 12538 |
La Casa de Papel | 31 | 8.209677 | 161782 |
Lúcifer | 75 | 8.744000 | 298412 |
Motel Bates | 50 | 8.592000 | 78773 |
Narcos | 30 | 8.760000 | 122160 |
O Bom Lugar | 50 | 8.272000 | 98562 |
Orange Is the New Black | 91 | 8.172527 | 156565 |
Orphan Black | 50 | 8.528000 | 60532 |
Os Originais | 184 | 8.828261 | 179686 |
Pânico: A Série de TV | 29 | 7.537931 | 28078 |
Ratched | 8 | 7.812500 | 8234 |
Sherlock | 15 | 8.800000 | 373114 |
Sleepy Hollow | 62 | 7.680645 | 33079 |
Sobrenatural | 211 | 8.389573 | 499046 |
Stranger Things | 25 | 8.676000 | 391288 |
The Walking Dead | 288 | 8.006944 | 2394586 |
The Witcher | 8 | 8.487500 | 122656 |
Podemos fazer o mesmo para o dataframe darkdf
:
= darkdf.groupby(['series_name']).count()
darkdf_grouped
darkdf_grouped
Episode | series_ep | season | season_ep | url | UserRating | UserVotes | r1 | r2 | r3 | r4 | r5 | r6 | r7 | r8 | r9 | r10 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
series_name | |||||||||||||||||
Dark | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 |
Uma boa prática é retornar o nome da coluna, neste caso series_name
para uma nova coluna, e desta forma, deixar o index
novamente numérico, para isto utilizamos a função reset_index
.
darkdf_grouped.reset_index()
series_name | Episode | series_ep | season | season_ep | url | UserRating | UserVotes | r1 | r2 | r3 | r4 | r5 | r6 | r7 | r8 | r9 | r10 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Dark | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 |
Vamos fazer o mesmo para o imdb_summary
:
= imdb_summary.rename(columns={'series_name':'Episodes'})
imdb_summary2 =True)
imdb_summary2.reset_index(inplace
7) imdb_summary2.head(
series_name | Episodes | UserRating | UserVotes | |
---|---|---|---|---|
0 | 13 Reasons Why | 49 | 7.248980 | 151452 |
1 | Arquivo X | 356 | 7.988202 | 969628 |
2 | Away | 10 | 7.080000 | 5388 |
3 | Como Defender um Assassino | 180 | 8.568889 | 228352 |
4 | Dark | 26 | 9.076923 | 264631 |
5 | Dracula | 6 | 6.966667 | 28858 |
6 | Era Uma Vez | 312 | 8.266026 | 397332 |
Modificando ou criando novas colunas
Similar à função mutate() no R, podemos usar o método df.assign
em Python.
= imdb.assign(r_total=imdb['r1']+imdb['r2']+imdb['r3']+imdb['r4']+imdb['r5']+imdb['r6']+imdb['r7']+imdb['r8']+imdb['r9']+imdb['r10']) imdb_rtotal
Agora podemos selecionar apenas as colunas que nos interessam.
'series_name','Episode','r_total']] imdb_rtotal[[
series_name | Episode | r_total | |
---|---|---|---|
2359 | Lúcifer | A Devil of My Word | 1.0 |
134 | Dark | The Paradise | 1.0 |
1404 | The Walking Dead | No Way Out | 1.0 |
1183 | Friends | The One Where Everybody Finds Out | 1.0 |
2371 | Lúcifer | Who's da New King of Hell? | 1.0 |
... | ... | ... | ... |
773 | Pânico: A Série de TV | Blindspots | 1.0 |
774 | Pânico: A Série de TV | Endgame | 1.0 |
1908 | Dracula | The Dark Compass | 1.0 |
42 | 13 Reasons Why | Senior Camping Trip | 1.0 |
137 | Dracula | The Dark Compass | 1.0 |
2597 rows × 3 columns
também podemos criar novas colunas utilizando outra notação, vamos assumir que queremos criar uma nova coluna no dataframe imdb_summary
:
'VotesPerEpisode']=imdb_summary2['UserVotes']/imdb_summary2['Episodes']
imdb_summary2[
8) imdb_summary2.head(
series_name | Episodes | UserRating | UserVotes | VotesPerEpisode | |
---|---|---|---|---|---|
0 | 13 Reasons Why | 49 | 7.248980 | 151452 | 3090.857143 |
1 | Arquivo X | 356 | 7.988202 | 969628 | 2723.674157 |
2 | Away | 10 | 7.080000 | 5388 | 538.800000 |
3 | Como Defender um Assassino | 180 | 8.568889 | 228352 | 1268.622222 |
4 | Dark | 26 | 9.076923 | 264631 | 10178.115385 |
5 | Dracula | 6 | 6.966667 | 28858 | 4809.666667 |
6 | Era Uma Vez | 312 | 8.266026 | 397332 | 1273.500000 |
7 | Friends | 235 | 8.451915 | 853287 | 3631.008511 |
e agora vamos ordenar pela nova coluna VotesPerEpisode
e arredondar para dois decimais:
'VotesPerEpisode',
imdb_summary2.sort_values(=False).round(decimals=2) ascending
series_name | Episodes | UserRating | UserVotes | VotesPerEpisode | |
---|---|---|---|---|---|
21 | Sherlock | 15 | 8.80 | 373114 | 24874.27 |
24 | Stranger Things | 25 | 8.68 | 391288 | 15651.52 |
26 | The Witcher | 8 | 8.49 | 122656 | 15332.00 |
4 | Dark | 26 | 9.08 | 264631 | 10178.12 |
25 | The Walking Dead | 288 | 8.01 | 2394586 | 8314.53 |
11 | La Casa de Papel | 31 | 8.21 | 161782 | 5218.77 |
5 | Dracula | 6 | 6.97 | 28858 | 4809.67 |
14 | Narcos | 30 | 8.76 | 122160 | 4072.00 |
12 | Lúcifer | 75 | 8.74 | 298412 | 3978.83 |
7 | Friends | 235 | 8.45 | 853287 | 3631.01 |
0 | 13 Reasons Why | 49 | 7.25 | 151452 | 3090.86 |
1 | Arquivo X | 356 | 7.99 | 969628 | 2723.67 |
23 | Sobrenatural | 211 | 8.39 | 499046 | 2365.15 |
15 | O Bom Lugar | 50 | 8.27 | 98562 | 1971.24 |
16 | Orange Is the New Black | 91 | 8.17 | 156565 | 1720.49 |
13 | Motel Bates | 50 | 8.59 | 78773 | 1575.46 |
6 | Era Uma Vez | 312 | 8.27 | 397332 | 1273.50 |
3 | Como Defender um Assassino | 180 | 8.57 | 228352 | 1268.62 |
17 | Orphan Black | 50 | 8.53 | 60532 | 1210.64 |
20 | Ratched | 8 | 7.81 | 8234 | 1029.25 |
18 | Os Originais | 184 | 8.83 | 179686 | 976.55 |
19 | Pânico: A Série de TV | 29 | 7.54 | 28078 | 968.21 |
9 | Grimm: Contos de Terror | 123 | 8.44 | 104967 | 853.39 |
2 | Away | 10 | 7.08 | 5388 | 538.80 |
22 | Sleepy Hollow | 62 | 7.68 | 33079 | 533.53 |
10 | Hemlock Grove | 33 | 7.49 | 12538 | 379.94 |
8 | Greenleaf | 60 | 7.59 | 2777 | 46.28 |
Juntando dois ou mais df
merge
une dois DataFrames (df1
e df2
) com base em colunas especificadas (on, left_on, right_on). O tipo de junção (how
) define como as tabelas serão combinadas:
'inner'
: Mantém apenas as linhas onde há correspondência nas colunas de junção (interseção).'left'
: Mantém todas as linhas do DataFrame esquerdo (df1
) e as correspondentes do direito (df2
).'right'
: Mantém todas as linhas do DataFrame direito (df2
) e as correspondentes do esquerdo (df1
).'outer'
: Mantém todas as linhas de ambos os DataFrames, preenchendo com NaN onde não há correspondência (união).
Use left_on
e right_on
quando as colunas de junção têm nomes diferentes nos DataFrames. Vamos a usar o pequeno df
que foi criado no início deste tutorial para verificar o funcionamento do merge
.
df
name | age | title | |
---|---|---|---|
0 | Alice | 25 | Sherlock |
1 | Bob | 30 | The Walking Dead |
2 | Charlie | 35 | Dark |
3 | Alice | 25 | Friends |
4 | David | 40 | Orange Is the New Black |
5 | Bob | 30 | The Walking Dead |
6 | Camille | 20 | Narcos |
e vamos criar um pequeno df
com dados do Netflix:
= {'season': ['Season 1', 'Season 2', 'Season 3', 'Season 4', 'Season 5', 'Part 1','Part 2'],
netflix_5 'title': ['Sherlock', 'The Walking Dead', 'Dark', 'Friends', 'Orange Is the New Black', 'The Walking Dead','Narcos']}
= pd.DataFrame(netflix_5)
netflix_5_df
netflix_5_df
season | title | |
---|---|---|
0 | Season 1 | Sherlock |
1 | Season 2 | The Walking Dead |
2 | Season 3 | Dark |
3 | Season 4 | Friends |
4 | Season 5 | Orange Is the New Black |
5 | Part 1 | The Walking Dead |
6 | Part 2 | Narcos |
Vamos realizar o merge
:
='title', how='left') netflix_5_df.merge(df, on
season | title | name | age | |
---|---|---|---|---|
0 | Season 1 | Sherlock | Alice | 25 |
1 | Season 2 | The Walking Dead | Bob | 30 |
2 | Season 2 | The Walking Dead | Bob | 30 |
3 | Season 3 | Dark | Charlie | 35 |
4 | Season 4 | Friends | Alice | 25 |
5 | Season 5 | Orange Is the New Black | David | 40 |
6 | Part 1 | The Walking Dead | Bob | 30 |
7 | Part 1 | The Walking Dead | Bob | 30 |
8 | Part 2 | Narcos | Camille | 20 |