Você pode baixar os dados aqui.
Tutorial 1 - Pandas
O Pandas é uma biblioteca Python fundamental para análise de dados. Ela fornece estruturas de dados eficientes e ferramentas para manipulação, limpeza e análise de dados.
import pandas as pdCriando um dataframe
data = {'name': ['Alice', 'Bob', 'Charlie', 'Alice', 'David', 'Bob','Camille'],
'age': [25, 30, 35, 25, 40, 30,20],
'title':['Sherlock','The Walking Dead','Dark',
'Friends','Orange Is the New Black',
'The Walking Dead','Narcos']}
df = pd.DataFrame(data)print(df) name age title
0 Alice 25 Sherlock
1 Bob 30 The Walking Dead
2 Charlie 35 Dark
3 Alice 25 Friends
4 David 40 Orange Is the New Black
5 Bob 30 The Walking Dead
6 Camille 20 Narcos
Carregando os dados
netflix = pd.read_excel("data1/netflix_series_limpo.xlsx")
imdb = pd.read_excel("data1/imdb_series.xlsx")print(netflix.head)<bound method NDFrame.head of series_title season episode Date
0 Away Season 1 Home 2020-10-06
1 Away Season 1 Spektr 2020-10-06
2 Away Season 1 Vital Signs 2020-10-05
3 Away Season 1 Goodnight Mars 2020-10-05
4 Away Season 1 A Little Faith 2020-10-05
.. ... ... ... ...
918 Orphan Black Season 1 Natural Selection 2015-08-21
919 Bates Motel Season 2 The Immutable Truth 2015-08-15
920 Bates Motel Season 2 The Box 2015-08-15
921 Bates Motel Season 2 Meltdown 2015-08-14
922 Bates Motel Season 2 Presumed Innocent 2015-08-13
[923 rows x 4 columns]>
Também podemos ver rapidamente a estrutura de cada dataset, utilizando a função dtypes e describe().
netflix.describe()| Date | |
|---|---|
| count | 923 |
| mean | 2018-05-29 23:58:26.392199424 |
| min | 2015-08-13 00:00:00 |
| 25% | 2017-01-06 00:00:00 |
| 50% | 2018-09-10 00:00:00 |
| 75% | 2019-07-30 00:00:00 |
| max | 2020-10-06 00:00:00 |
netflix.dtypesseries_title object
season object
episode object
Date datetime64[ns]
dtype: object
Vamos converter a coluna season em variável categórica.
netflix['season'].astype('category')0 Season 1
1 Season 1
2 Season 1
3 Season 1
4 Season 1
...
918 Season 1
919 Season 2
920 Season 2
921 Season 2
922 Season 2
Name: season, Length: 923, dtype: category
Categories (32, object): [' 1ª temporada', ' Back in Business', ' Berry Bitty Adventures', ' Chapter Eight', ..., ' Turma da Mônica', ' Volume 1', ' Volume 2', ' Welcome to Ever After High']
Podemos ver as dimensões do nosso dataframe:
netflix.shape(923, 4)
Selecionando colunas
Para selecionar colunas utilizamos a forma df[['coluna']] onde df é o nome do dataframe e ‘coluna’ é o nome da coluna ou colunas que queremos selecionar.
netflix[['series_title','season']]| series_title | season | |
|---|---|---|
| 0 | Away | Season 1 |
| 1 | Away | Season 1 |
| 2 | Away | Season 1 |
| 3 | Away | Season 1 |
| 4 | Away | Season 1 |
| ... | ... | ... |
| 918 | Orphan Black | Season 1 |
| 919 | Bates Motel | Season 2 |
| 920 | Bates Motel | Season 2 |
| 921 | Bates Motel | Season 2 |
| 922 | Bates Motel | Season 2 |
923 rows × 2 columns
Ordenando os dados
Para ordenar as linhas, podemos utilizar a função sort_values() de forma a termos, por exemplo, uma lista de maior a menor de um determinado valor. Vamos usar o dataframe imdb para exemplificar, ordenando as linhas por ordem crescente de UserRating:
imdb.head()| series_name | Episode | series_ep | season | season_ep | url | UserRating | UserVotes | r1 | r2 | r3 | r4 | r5 | r6 | r7 | r8 | r9 | r10 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13 Reasons Why | Tape 1, Side A | 1 | 1 | 1 | http://www.imdb.com/title/tt5174246/?ref_=ttep... | 8.3 | 7016 | 0.055445 | 0.004418 | 0.004704 | 0.005844 | 0.014681 | 0.032640 | 0.105188 | 0.237030 | 0.246579 | 0.293472 |
| 1 | 13 Reasons Why | Tape 1, Side B | 2 | 1 | 2 | http://www.imdb.com/title/tt5174248/?ref_=ttep... | 8.0 | 5859 | 0.056665 | 0.004438 | 0.005803 | 0.008022 | 0.016726 | 0.045400 | 0.138420 | 0.311145 | 0.171872 | 0.241509 |
| 2 | 13 Reasons Why | Tape 2, Side A | 3 | 1 | 3 | http://www.imdb.com/title/tt5174250/?ref_=ttep... | 7.9 | 5509 | 0.058813 | 0.003993 | 0.005627 | 0.009984 | 0.022509 | 0.051734 | 0.160828 | 0.302414 | 0.140860 | 0.243238 |
| 3 | 13 Reasons Why | Tape 2, Side B | 4 | 1 | 4 | http://www.imdb.com/title/tt5174252/?ref_=ttep... | 8.1 | 5309 | 0.063477 | 0.003956 | 0.005086 | 0.007911 | 0.019213 | 0.045960 | 0.129215 | 0.298550 | 0.171219 | 0.255415 |
| 4 | 13 Reasons Why | Tape 3, Side A | 5 | 1 | 5 | http://www.imdb.com/title/tt5174254/?ref_=ttep... | 8.2 | 5252 | 0.066832 | 0.002666 | 0.004570 | 0.008378 | 0.018850 | 0.037129 | 0.114242 | 0.257997 | 0.203542 | 0.285796 |
imdb = imdb.sort_values('UserRating',
ascending=False)Agora vamos ver como ficou:
imdb[['series_name','UserRating']]| series_name | UserRating | |
|---|---|---|
| 2359 | Lúcifer | 9.8 |
| 134 | Dark | 9.7 |
| 1404 | The Walking Dead | 9.7 |
| 1183 | Friends | 9.7 |
| 2371 | Lúcifer | 9.7 |
| ... | ... | ... |
| 773 | Pânico: A Série de TV | 5.5 |
| 774 | Pânico: A Série de TV | 5.2 |
| 1908 | Dracula | 5.2 |
| 42 | 13 Reasons Why | 5.2 |
| 137 | Dracula | 5.2 |
2597 rows × 2 columns
Filtrando Linhas
Podemos filtrar linhas utilizando o método df.query() onde df é o nome do dataframe.
imdb.query('UserVotes > 10000')| series_name | Episode | series_ep | season | season_ep | url | UserRating | UserVotes | r1 | r2 | r3 | r4 | r5 | r6 | r7 | r8 | r9 | r10 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2359 | Lúcifer | A Devil of My Word | 55 | 3 | 24 | http://www.imdb.com/title/tt8253126/?ref_=ttep... | 9.8 | 10712 | 0.021004 | 0.001587 | 0.000840 | 0.000747 | 0.002054 | 0.002801 | 0.007468 | 0.018764 | 0.063107 | 0.881628 |
| 134 | Dark | The Paradise | 26 | 3 | 8 | http://www.imdb.com/title/tt12557704/?ref_=tte... | 9.7 | 17468 | 0.020151 | 0.002633 | 0.003263 | 0.002805 | 0.004351 | 0.006698 | 0.012308 | 0.022556 | 0.059824 | 0.865411 |
| 1404 | The Walking Dead | No Way Out | 57 | 6 | 9 | http://www.imdb.com/title/tt4575388/?ref_=ttep... | 9.7 | 24358 | 0.027424 | 0.003490 | 0.002094 | 0.002669 | 0.005665 | 0.007390 | 0.015519 | 0.033008 | 0.101979 | 0.800764 |
| 1371 | The Walking Dead | Too Far Gone | 24 | 4 | 8 | http://www.imdb.com/title/tt2948638/?ref_=ttep... | 9.7 | 22061 | 0.044876 | 0.002584 | 0.001541 | 0.001496 | 0.004624 | 0.006029 | 0.013100 | 0.039255 | 0.126286 | 0.760210 |
| 985 | The Walking Dead | No Way Out | 57 | 6 | 9 | http://www.imdb.com/title/tt4575388/?ref_=ttep... | 9.7 | 24358 | 0.027424 | 0.003490 | 0.002094 | 0.002669 | 0.005665 | 0.007390 | 0.015519 | 0.033008 | 0.101979 | 0.800764 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1411 | The Walking Dead | Last Day on Earth | 64 | 6 | 16 | http://www.imdb.com/title/tt4589574/?ref_=ttep... | 6.6 | 26981 | 0.263667 | 0.036878 | 0.037285 | 0.037211 | 0.052259 | 0.049590 | 0.068826 | 0.090693 | 0.104221 | 0.259368 |
| 992 | The Walking Dead | Last Day on Earth | 64 | 6 | 16 | http://www.imdb.com/title/tt4589574/?ref_=ttep... | 6.6 | 26981 | 0.263667 | 0.036878 | 0.037285 | 0.037211 | 0.052259 | 0.049590 | 0.068826 | 0.090693 | 0.104221 | 0.259368 |
| 1337 | Stranger Things | Chapter Seven: The Lost Sister | 15 | 2 | 7 | http://www.imdb.com/title/tt6020810/?ref_=ttep... | 6.1 | 21371 | 0.166955 | 0.045716 | 0.047869 | 0.059239 | 0.100791 | 0.118198 | 0.157784 | 0.116794 | 0.058958 | 0.127696 |
| 1417 | The Walking Dead | Swear | 70 | 7 | 6 | http://www.imdb.com/title/tt5207734/?ref_=ttep... | 5.6 | 11971 | 0.197978 | 0.042686 | 0.047615 | 0.060647 | 0.102164 | 0.140506 | 0.133823 | 0.087127 | 0.037591 | 0.149862 |
| 998 | The Walking Dead | Swear | 70 | 7 | 6 | http://www.imdb.com/title/tt5207734/?ref_=ttep... | 5.6 | 11971 | 0.197978 | 0.042686 | 0.047615 | 0.060647 | 0.102164 | 0.140506 | 0.133823 | 0.087127 | 0.037591 | 0.149862 |
124 rows × 18 columns
Se quisermos combinar filtrado de linhas com seleção de colunas:
imdb_10 = imdb.query('UserVotes > 10000')[['series_name','Episode','UserRating','UserVotes']]imdb_10.head(10)| series_name | Episode | UserRating | UserVotes | |
|---|---|---|---|---|
| 2359 | Lúcifer | A Devil of My Word | 9.8 | 10712 |
| 134 | Dark | The Paradise | 9.7 | 17468 |
| 1404 | The Walking Dead | No Way Out | 9.7 | 24358 |
| 1371 | The Walking Dead | Too Far Gone | 9.7 | 22061 |
| 985 | The Walking Dead | No Way Out | 9.7 | 24358 |
| 1314 | Sherlock | The Reichenbach Fall | 9.7 | 33904 |
| 1307 | Friends | The Last One | 9.7 | 11634 |
| 952 | The Walking Dead | Too Far Gone | 9.7 | 22061 |
| 961 | The Walking Dead | No Sanctuary | 9.6 | 23608 |
| 133 | Dark | Between the Time | 9.6 | 13301 |
imdb_10.shape(124, 4)
vamos identificar se há duplicados.
duplicates = imdb_10.duplicated()
print(duplicates)2359 False
134 False
1404 False
1371 False
985 True
...
1411 False
992 True
1337 False
1417 False
998 True
Length: 124, dtype: bool
Notamos que existem duplicados, portanto precisamos removê-los utilizando o método drop_duplicates().
imdb_10 = imdb_10.drop_duplicates(subset=['Episode'])Vamos ver quantas linhas temos agora.
imdb_10.shape(91, 4)
também podemos filtrar por strings. Por exemplo, vamos filtrar pelo seriado “Dark”:
imdb['series_name'].isin(['Dark'])2359 False
134 True
1404 False
1183 False
2371 False
...
773 False
774 False
1908 False
42 False
137 False
Name: series_name, Length: 2597, dtype: bool
Vemos que o código acima retorna uma série com valores lógicos (True-False) para cada linha. Para efetivamente ver o resultado do seriado Dark devemos:
imdb[imdb['series_name'].isin(['Dark'])]| series_name | Episode | series_ep | season | season_ep | url | UserRating | UserVotes | r1 | r2 | r3 | r4 | r5 | r6 | r7 | r8 | r9 | r10 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 134 | Dark | The Paradise | 26 | 3 | 8 | http://www.imdb.com/title/tt12557704/?ref_=tte... | 9.7 | 17468 | 0.020151 | 0.002633 | 0.003263 | 0.002805 | 0.004351 | 0.006698 | 0.012308 | 0.022556 | 0.059824 | 0.865411 |
| 133 | Dark | Between the Time | 25 | 3 | 7 | http://www.imdb.com/title/tt12557700/?ref_=tte... | 9.6 | 13301 | 0.020374 | 0.001880 | 0.002556 | 0.002707 | 0.004285 | 0.006842 | 0.014811 | 0.031577 | 0.073904 | 0.841065 |
| 124 | Dark | An Endless Cycle | 16 | 2 | 6 | http://www.imdb.com/title/tt10454734/?ref_=tte... | 9.6 | 11149 | 0.007983 | 0.000807 | 0.001256 | 0.001435 | 0.001884 | 0.005292 | 0.016055 | 0.045834 | 0.124765 | 0.794690 |
| 131 | Dark | Life and Death | 23 | 3 | 5 | http://www.imdb.com/title/tt12557688/?ref_=tte... | 9.5 | 10359 | 0.021238 | 0.001834 | 0.002510 | 0.002993 | 0.005502 | 0.010908 | 0.020948 | 0.044985 | 0.123661 | 0.765421 |
| 126 | Dark | Endings and Beginnings | 18 | 2 | 8 | http://www.imdb.com/title/tt10457654/?ref_=tte... | 9.5 | 11379 | 0.012040 | 0.002109 | 0.001846 | 0.002373 | 0.004482 | 0.009667 | 0.020301 | 0.045698 | 0.123473 | 0.778012 |
| 122 | Dark | The Travelers | 14 | 2 | 4 | http://www.imdb.com/title/tt10454726/?ref_=tte... | 9.5 | 9968 | 0.007925 | 0.000903 | 0.001204 | 0.001505 | 0.002307 | 0.006120 | 0.019161 | 0.056180 | 0.166132 | 0.738563 |
| 123 | Dark | Lost and Found | 15 | 2 | 5 | http://www.imdb.com/title/tt10454732/?ref_=tte... | 9.4 | 9210 | 0.007600 | 0.000869 | 0.001303 | 0.001629 | 0.002606 | 0.007166 | 0.023127 | 0.075570 | 0.197720 | 0.682410 |
| 125 | Dark | The White Devil | 17 | 2 | 7 | http://www.imdb.com/title/tt10457652/?ref_=tte... | 9.3 | 9050 | 0.008066 | 0.001657 | 0.001878 | 0.001768 | 0.003757 | 0.009392 | 0.026851 | 0.083315 | 0.204972 | 0.658343 |
| 132 | Dark | Light and Shadow | 24 | 3 | 6 | http://www.imdb.com/title/tt12557694/?ref_=tte... | 9.3 | 10060 | 0.024453 | 0.001889 | 0.002783 | 0.004175 | 0.006362 | 0.010934 | 0.025547 | 0.063221 | 0.127137 | 0.733499 |
| 116 | Dark | As You Sow, so You Shall Reap | 8 | 1 | 8 | http://www.imdb.com/title/tt7313316/?ref_=ttep... | 9.2 | 9294 | 0.007209 | 0.001076 | 0.000753 | 0.001829 | 0.003443 | 0.011298 | 0.030773 | 0.101033 | 0.257586 | 0.585001 |
| 118 | Dark | Alpha and Omega | 10 | 1 | 10 | http://www.imdb.com/title/tt7313322/?ref_=ttep... | 9.2 | 9758 | 0.010248 | 0.001947 | 0.001230 | 0.003279 | 0.005944 | 0.011273 | 0.032589 | 0.086801 | 0.229248 | 0.617442 |
| 121 | Dark | Ghosts | 13 | 2 | 3 | http://www.imdb.com/title/tt10454722/?ref_=tte... | 9.2 | 9138 | 0.007004 | 0.000766 | 0.001970 | 0.002079 | 0.004487 | 0.008645 | 0.030422 | 0.100897 | 0.226855 | 0.616875 |
| 114 | Dark | Sic Mundus Creatus Est | 6 | 1 | 6 | http://www.imdb.com/title/tt7313312/?ref_=ttep... | 9.1 | 9434 | 0.006996 | 0.000742 | 0.001272 | 0.001272 | 0.004240 | 0.014204 | 0.035616 | 0.115222 | 0.264151 | 0.556286 |
| 130 | Dark | The Origin | 22 | 3 | 4 | http://www.imdb.com/title/tt12557686/?ref_=tte... | 9.1 | 9529 | 0.022248 | 0.003883 | 0.004722 | 0.004198 | 0.006611 | 0.013118 | 0.033582 | 0.082695 | 0.163816 | 0.665128 |
| 113 | Dark | Truths | 5 | 1 | 5 | http://www.imdb.com/title/tt7313308/?ref_=ttep... | 9.0 | 9563 | 0.005751 | 0.000837 | 0.001046 | 0.002301 | 0.004183 | 0.014744 | 0.045070 | 0.134790 | 0.278992 | 0.512287 |
| 120 | Dark | Dark Matter | 12 | 2 | 2 | http://www.imdb.com/title/tt10454716/?ref_=tte... | 9.0 | 9221 | 0.006941 | 0.001084 | 0.001193 | 0.002603 | 0.003579 | 0.010303 | 0.034920 | 0.122655 | 0.247479 | 0.569244 |
| 127 | Dark | Deja-vu | 19 | 3 | 1 | http://www.imdb.com/title/tt10414808/?ref_=tte... | 9.0 | 10609 | 0.022622 | 0.003205 | 0.003959 | 0.004430 | 0.006221 | 0.014704 | 0.039872 | 0.098407 | 0.202658 | 0.603921 |
| 129 | Dark | Adam and Eva | 21 | 3 | 3 | http://www.imdb.com/title/tt12557682/?ref_=tte... | 8.9 | 9364 | 0.022747 | 0.002777 | 0.004806 | 0.004058 | 0.008330 | 0.016339 | 0.040474 | 0.109141 | 0.194682 | 0.596647 |
| 128 | Dark | The Survivors | 20 | 3 | 2 | http://www.imdb.com/title/tt12557670/?ref_=tte... | 8.9 | 9666 | 0.022967 | 0.003311 | 0.004138 | 0.004552 | 0.006104 | 0.017381 | 0.045520 | 0.119284 | 0.202773 | 0.573971 |
| 119 | Dark | Beginnings and Endings | 11 | 2 | 1 | http://www.imdb.com/title/tt7787482/?ref_=ttep... | 8.9 | 9969 | 0.007122 | 0.001505 | 0.001404 | 0.003611 | 0.004213 | 0.013442 | 0.049955 | 0.154780 | 0.238239 | 0.525730 |
| 115 | Dark | Crossroads | 7 | 1 | 7 | http://www.imdb.com/title/tt7305824/?ref_=ttep... | 8.8 | 8725 | 0.005845 | 0.001261 | 0.001032 | 0.002636 | 0.005501 | 0.015931 | 0.051920 | 0.194269 | 0.275186 | 0.446418 |
| 117 | Dark | Everything Is Now | 9 | 1 | 9 | http://www.imdb.com/title/tt7313320/?ref_=ttep... | 8.8 | 8599 | 0.006978 | 0.001744 | 0.001512 | 0.001279 | 0.006280 | 0.018025 | 0.055472 | 0.185719 | 0.260147 | 0.462845 |
| 111 | Dark | Past and Present | 3 | 1 | 3 | http://www.imdb.com/title/tt7305820/?ref_=ttep... | 8.7 | 9585 | 0.006051 | 0.001148 | 0.001669 | 0.002608 | 0.006886 | 0.023996 | 0.067710 | 0.208242 | 0.255921 | 0.425769 |
| 109 | Dark | Secrets | 1 | 1 | 1 | http://www.imdb.com/title/tt6305578/?ref_=ttep... | 8.3 | 11242 | 0.007917 | 0.001245 | 0.003202 | 0.003914 | 0.011564 | 0.030422 | 0.106298 | 0.274239 | 0.194094 | 0.367105 |
| 112 | Dark | Double Lives | 4 | 1 | 4 | http://www.imdb.com/title/tt7305818/?ref_=ttep... | 8.3 | 9088 | 0.005722 | 0.001761 | 0.001540 | 0.003631 | 0.009793 | 0.029820 | 0.102993 | 0.281690 | 0.180238 | 0.382812 |
| 110 | Dark | Lies | 2 | 1 | 2 | http://www.imdb.com/title/tt7305776/?ref_=ttep... | 8.2 | 9903 | 0.006160 | 0.001616 | 0.003130 | 0.002827 | 0.011108 | 0.034939 | 0.119661 | 0.293547 | 0.165303 | 0.361709 |
vamos guardar este dataframe com o nome darkdf:
darkdf = imdb[imdb['series_name'].isin(['Dark'])]Agrupando dados por categorias
Vamos agrupar os dados de imdb e calcular algumas informações: a contagem de capítulos por seriado, a média do UserRating por seriado e a soma total dos votos.
imdb_summary = imdb.groupby(['series_name']).agg({'series_name':"count",'UserRating':'mean','UserVotes':'sum'})imdb_summary| series_name | UserRating | UserVotes | |
|---|---|---|---|
| series_name | |||
| 13 Reasons Why | 49 | 7.248980 | 151452 |
| Arquivo X | 356 | 7.988202 | 969628 |
| Away | 10 | 7.080000 | 5388 |
| Como Defender um Assassino | 180 | 8.568889 | 228352 |
| Dark | 26 | 9.076923 | 264631 |
| Dracula | 6 | 6.966667 | 28858 |
| Era Uma Vez | 312 | 8.266026 | 397332 |
| Friends | 235 | 8.451915 | 853287 |
| Greenleaf | 60 | 7.586667 | 2777 |
| Grimm: Contos de Terror | 123 | 8.435772 | 104967 |
| Hemlock Grove | 33 | 7.490909 | 12538 |
| La Casa de Papel | 31 | 8.209677 | 161782 |
| Lúcifer | 75 | 8.744000 | 298412 |
| Motel Bates | 50 | 8.592000 | 78773 |
| Narcos | 30 | 8.760000 | 122160 |
| O Bom Lugar | 50 | 8.272000 | 98562 |
| Orange Is the New Black | 91 | 8.172527 | 156565 |
| Orphan Black | 50 | 8.528000 | 60532 |
| Os Originais | 184 | 8.828261 | 179686 |
| Pânico: A Série de TV | 29 | 7.537931 | 28078 |
| Ratched | 8 | 7.812500 | 8234 |
| Sherlock | 15 | 8.800000 | 373114 |
| Sleepy Hollow | 62 | 7.680645 | 33079 |
| Sobrenatural | 211 | 8.389573 | 499046 |
| Stranger Things | 25 | 8.676000 | 391288 |
| The Walking Dead | 288 | 8.006944 | 2394586 |
| The Witcher | 8 | 8.487500 | 122656 |
Podemos fazer o mesmo para o dataframe darkdf:
darkdf_grouped = darkdf.groupby(['series_name']).count()
darkdf_grouped| Episode | series_ep | season | season_ep | url | UserRating | UserVotes | r1 | r2 | r3 | r4 | r5 | r6 | r7 | r8 | r9 | r10 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| series_name | |||||||||||||||||
| Dark | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 |
Uma boa prática é retornar o nome da coluna, neste caso series_name para uma nova coluna, e desta forma, deixar o index novamente numérico, para isto utilizamos a função reset_index.
darkdf_grouped.reset_index()| series_name | Episode | series_ep | season | season_ep | url | UserRating | UserVotes | r1 | r2 | r3 | r4 | r5 | r6 | r7 | r8 | r9 | r10 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Dark | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 |
Vamos fazer o mesmo para o imdb_summary:
imdb_summary2 = imdb_summary.rename(columns={'series_name':'Episodes'})
imdb_summary2.reset_index(inplace=True)
imdb_summary2.head(7)| series_name | Episodes | UserRating | UserVotes | |
|---|---|---|---|---|
| 0 | 13 Reasons Why | 49 | 7.248980 | 151452 |
| 1 | Arquivo X | 356 | 7.988202 | 969628 |
| 2 | Away | 10 | 7.080000 | 5388 |
| 3 | Como Defender um Assassino | 180 | 8.568889 | 228352 |
| 4 | Dark | 26 | 9.076923 | 264631 |
| 5 | Dracula | 6 | 6.966667 | 28858 |
| 6 | Era Uma Vez | 312 | 8.266026 | 397332 |
Modificando ou criando novas colunas
Similar à função mutate() no R, podemos usar o método df.assign em Python.
imdb_rtotal = imdb.assign(r_total=imdb['r1']+imdb['r2']+imdb['r3']+imdb['r4']+imdb['r5']+imdb['r6']+imdb['r7']+imdb['r8']+imdb['r9']+imdb['r10'])Agora podemos selecionar apenas as colunas que nos interessam.
imdb_rtotal[['series_name','Episode','r_total']]| series_name | Episode | r_total | |
|---|---|---|---|
| 2359 | Lúcifer | A Devil of My Word | 1.0 |
| 134 | Dark | The Paradise | 1.0 |
| 1404 | The Walking Dead | No Way Out | 1.0 |
| 1183 | Friends | The One Where Everybody Finds Out | 1.0 |
| 2371 | Lúcifer | Who's da New King of Hell? | 1.0 |
| ... | ... | ... | ... |
| 773 | Pânico: A Série de TV | Blindspots | 1.0 |
| 774 | Pânico: A Série de TV | Endgame | 1.0 |
| 1908 | Dracula | The Dark Compass | 1.0 |
| 42 | 13 Reasons Why | Senior Camping Trip | 1.0 |
| 137 | Dracula | The Dark Compass | 1.0 |
2597 rows × 3 columns
também podemos criar novas colunas utilizando outra notação, vamos assumir que queremos criar uma nova coluna no dataframe imdb_summary :
imdb_summary2['VotesPerEpisode']=imdb_summary2['UserVotes']/imdb_summary2['Episodes']
imdb_summary2.head(8)| series_name | Episodes | UserRating | UserVotes | VotesPerEpisode | |
|---|---|---|---|---|---|
| 0 | 13 Reasons Why | 49 | 7.248980 | 151452 | 3090.857143 |
| 1 | Arquivo X | 356 | 7.988202 | 969628 | 2723.674157 |
| 2 | Away | 10 | 7.080000 | 5388 | 538.800000 |
| 3 | Como Defender um Assassino | 180 | 8.568889 | 228352 | 1268.622222 |
| 4 | Dark | 26 | 9.076923 | 264631 | 10178.115385 |
| 5 | Dracula | 6 | 6.966667 | 28858 | 4809.666667 |
| 6 | Era Uma Vez | 312 | 8.266026 | 397332 | 1273.500000 |
| 7 | Friends | 235 | 8.451915 | 853287 | 3631.008511 |
e agora vamos ordenar pela nova coluna VotesPerEpisode e arredondar para dois decimais:
imdb_summary2.sort_values('VotesPerEpisode',
ascending=False).round(decimals=2)| series_name | Episodes | UserRating | UserVotes | VotesPerEpisode | |
|---|---|---|---|---|---|
| 21 | Sherlock | 15 | 8.80 | 373114 | 24874.27 |
| 24 | Stranger Things | 25 | 8.68 | 391288 | 15651.52 |
| 26 | The Witcher | 8 | 8.49 | 122656 | 15332.00 |
| 4 | Dark | 26 | 9.08 | 264631 | 10178.12 |
| 25 | The Walking Dead | 288 | 8.01 | 2394586 | 8314.53 |
| 11 | La Casa de Papel | 31 | 8.21 | 161782 | 5218.77 |
| 5 | Dracula | 6 | 6.97 | 28858 | 4809.67 |
| 14 | Narcos | 30 | 8.76 | 122160 | 4072.00 |
| 12 | Lúcifer | 75 | 8.74 | 298412 | 3978.83 |
| 7 | Friends | 235 | 8.45 | 853287 | 3631.01 |
| 0 | 13 Reasons Why | 49 | 7.25 | 151452 | 3090.86 |
| 1 | Arquivo X | 356 | 7.99 | 969628 | 2723.67 |
| 23 | Sobrenatural | 211 | 8.39 | 499046 | 2365.15 |
| 15 | O Bom Lugar | 50 | 8.27 | 98562 | 1971.24 |
| 16 | Orange Is the New Black | 91 | 8.17 | 156565 | 1720.49 |
| 13 | Motel Bates | 50 | 8.59 | 78773 | 1575.46 |
| 6 | Era Uma Vez | 312 | 8.27 | 397332 | 1273.50 |
| 3 | Como Defender um Assassino | 180 | 8.57 | 228352 | 1268.62 |
| 17 | Orphan Black | 50 | 8.53 | 60532 | 1210.64 |
| 20 | Ratched | 8 | 7.81 | 8234 | 1029.25 |
| 18 | Os Originais | 184 | 8.83 | 179686 | 976.55 |
| 19 | Pânico: A Série de TV | 29 | 7.54 | 28078 | 968.21 |
| 9 | Grimm: Contos de Terror | 123 | 8.44 | 104967 | 853.39 |
| 2 | Away | 10 | 7.08 | 5388 | 538.80 |
| 22 | Sleepy Hollow | 62 | 7.68 | 33079 | 533.53 |
| 10 | Hemlock Grove | 33 | 7.49 | 12538 | 379.94 |
| 8 | Greenleaf | 60 | 7.59 | 2777 | 46.28 |
Juntando dois ou mais df
merge une dois DataFrames (df1 e df2) com base em colunas especificadas (on, left_on, right_on). O tipo de junção (how) define como as tabelas serão combinadas:
'inner': Mantém apenas as linhas onde há correspondência nas colunas de junção (interseção).'left': Mantém todas as linhas do DataFrame esquerdo (df1) e as correspondentes do direito (df2).'right': Mantém todas as linhas do DataFrame direito (df2) e as correspondentes do esquerdo (df1).'outer': Mantém todas as linhas de ambos os DataFrames, preenchendo com NaN onde não há correspondência (união).
Use left_on e right_on quando as colunas de junção têm nomes diferentes nos DataFrames. Vamos a usar o pequeno df que foi criado no início deste tutorial para verificar o funcionamento do merge.
df| name | age | title | |
|---|---|---|---|
| 0 | Alice | 25 | Sherlock |
| 1 | Bob | 30 | The Walking Dead |
| 2 | Charlie | 35 | Dark |
| 3 | Alice | 25 | Friends |
| 4 | David | 40 | Orange Is the New Black |
| 5 | Bob | 30 | The Walking Dead |
| 6 | Camille | 20 | Narcos |
e vamos criar um pequeno df com dados do Netflix:
netflix_5 = {'season': ['Season 1', 'Season 2', 'Season 3', 'Season 4', 'Season 5', 'Part 1','Part 2'],
'title': ['Sherlock', 'The Walking Dead', 'Dark', 'Friends', 'Orange Is the New Black', 'The Walking Dead','Narcos']}
netflix_5_df = pd.DataFrame(netflix_5)
netflix_5_df| season | title | |
|---|---|---|
| 0 | Season 1 | Sherlock |
| 1 | Season 2 | The Walking Dead |
| 2 | Season 3 | Dark |
| 3 | Season 4 | Friends |
| 4 | Season 5 | Orange Is the New Black |
| 5 | Part 1 | The Walking Dead |
| 6 | Part 2 | Narcos |
Vamos realizar o merge:
netflix_5_df.merge(df, on='title', how='left')| season | title | name | age | |
|---|---|---|---|---|
| 0 | Season 1 | Sherlock | Alice | 25 |
| 1 | Season 2 | The Walking Dead | Bob | 30 |
| 2 | Season 2 | The Walking Dead | Bob | 30 |
| 3 | Season 3 | Dark | Charlie | 35 |
| 4 | Season 4 | Friends | Alice | 25 |
| 5 | Season 5 | Orange Is the New Black | David | 40 |
| 6 | Part 1 | The Walking Dead | Bob | 30 |
| 7 | Part 1 | The Walking Dead | Bob | 30 |
| 8 | Part 2 | Narcos | Camille | 20 |