Tutorial 1 - Pandas

Pandas is a fundamental Python library for data analysis. It provides efficient data structures and tools for manipulating, cleaning, and analyzing data.

You can download the data here.

import pandas as pd

Creating a dataframe

data = {'name': ['Alice', 'Bob', 'Charlie', 'Alice', 'David', 'Bob','Camille'],
        'age': [25, 30, 35, 25, 40, 30,20],
        'title':['Sherlock','The Walking Dead','Dark',
        'Friends','Orange Is the New Black',
        'The Walking Dead','Narcos']}
df = pd.DataFrame(data)

print(df)

      name  age                    title
0    Alice   25                 Sherlock
1      Bob   30         The Walking Dead
2  Charlie   35                     Dark
3    Alice   25                  Friends
4    David   40  Orange Is the New Black
5      Bob   30         The Walking Dead
6  Camille   20                   Narcos

Loading the data

netflix = pd.read_excel("data1/netflix_series_limpo.xlsx")
imdb = pd.read_excel("data1/imdb_series.xlsx")

print(netflix.head())

   series_title     season          episode       Date
0          Away   Season 1             Home 2020-10-06
1          Away   Season 1           Spektr 2020-10-06
2          Away   Season 1      Vital Signs 2020-10-05
3          Away   Season 1   Goodnight Mars 2020-10-05
4          Away   Season 1   A Little Faith 2020-10-05

We can also quickly inspect the structure of each dataset using the describe() method and the dtypes attribute.

netflix.describe()

	Date
count	923
mean	2018-05-29 23:58:26.392199424
min	2015-08-13 00:00:00
25%	2017-01-06 00:00:00
50%	2018-09-10 00:00:00
75%	2019-07-30 00:00:00
max	2020-10-06 00:00:00

netflix.dtypes

series_title            object
season                  object
episode                 object
Date            datetime64[ns]
dtype: object
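The describe() output above covers only the Date column because text (object) columns are skipped by default. As a minimal sketch, using a small made-up frame rather than the real file (the rows below are illustrative only), include='all' adds count, unique, top and freq for the text columns, and info() lists dtypes together with non-null counts:

```python
import pandas as pd

# Hypothetical mini version of the netflix frame (illustrative values only)
sample = pd.DataFrame({
    'series_title': ['Away', 'Away', 'Bates Motel'],
    'season': ['Season 1', 'Season 1', 'Season 2'],
    'Date': pd.to_datetime(['2020-10-06', '2020-10-05', '2015-08-15']),
})

# include='all' summarises every column, not only numeric/datetime ones
all_summary = sample.describe(include='all')
print(all_summary)

# info() prints the dtype and the non-null count of each column
sample.info()
```

Statistics that do not apply to a column (for example mean on a text column) are shown as NaN.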

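Let's convert the season column to a categorical variable. Note that astype() returns a new Series and leaves netflix unchanged unless the result is assigned back.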

netflix['season'].astype('category')

0       Season 1
1       Season 1
2       Season 1
3       Season 1
4       Season 1
         ...    
918     Season 1
919     Season 2
920     Season 2
921     Season 2
922     Season 2
Name: season, Length: 923, dtype: category
Categories (32, object): [' 1ª temporada', ' Back in Business', ' Berry Bitty Adventures', ' Chapter Eight', ..., ' Turma da Mônica', ' Volume 1', ' Volume 2', ' Welcome to Ever After High']
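The category labels above start with a space, which suggests the raw values carry leading whitespace. A minimal sketch, assuming we want clean labels and want the conversion to persist (the frame below is hypothetical and only mimics that whitespace):

```python
import pandas as pd

# Hypothetical values mimicking the leading spaces seen in the categories
seasons = pd.DataFrame({'season': [' Season 1', ' Season 1', ' Season 2']})

# str.strip() removes surrounding whitespace; assigning the result back
# is what actually changes the column's dtype in the dataframe
seasons['season'] = seasons['season'].str.strip().astype('category')

print(seasons['season'].dtype)
print(list(seasons['season'].cat.categories))
```

The same pattern would apply to netflix['season'] if the whitespace turns out to be unwanted.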

We can check the dimensions of our dataframe:

netflix.shape

(923, 4)

Selecting columns

To select columns we use the form df[['column']], where df is the name of the dataframe and 'column' is the name of the column (or a comma-separated list of columns) we want to select.

netflix[['series_title','season']]

	series_title	season
0	Away	Season 1
1	Away	Season 1
2	Away	Season 1
3	Away	Season 1
4	Away	Season 1
...	...	...
918	Orphan Black	Season 1
919	Bates Motel	Season 2
920	Bates Motel	Season 2
921	Bates Motel	Season 2
922	Bates Motel	Season 2

923 rows × 2 columns
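The number of brackets matters. As a small sketch on a made-up frame (the names are illustrative), single brackets return a Series while double brackets return a DataFrame, even for a single column:

```python
import pandas as pd

people = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]})

as_series = people['name']     # single brackets -> Series
as_frame = people[['name']]    # double brackets -> one-column DataFrame

print(type(as_series).__name__)
print(type(as_frame).__name__)
print(as_frame.shape)
```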

Sorting the data

To sort rows we can use the sort_values() method, for example to list the rows from the highest to the lowest value of a given column. We'll use the imdb dataframe as an example: after a quick look at its first rows, we sort them by UserRating in descending order (ascending=False):

imdb.head()

	series_name	Episode	series_ep	season	season_ep	url	UserRating	UserVotes	r1	r2	r3	r4	r5	r6	r7	r8	r9	r10
0	13 Reasons Why	Tape 1, Side A	1	1	1	http://www.imdb.com/title/tt5174246/?ref_=ttep...	8.3	7016	0.055445	0.004418	0.004704	0.005844	0.014681	0.032640	0.105188	0.237030	0.246579	0.293472
1	13 Reasons Why	Tape 1, Side B	2	1	2	http://www.imdb.com/title/tt5174248/?ref_=ttep...	8.0	5859	0.056665	0.004438	0.005803	0.008022	0.016726	0.045400	0.138420	0.311145	0.171872	0.241509
2	13 Reasons Why	Tape 2, Side A	3	1	3	http://www.imdb.com/title/tt5174250/?ref_=ttep...	7.9	5509	0.058813	0.003993	0.005627	0.009984	0.022509	0.051734	0.160828	0.302414	0.140860	0.243238
3	13 Reasons Why	Tape 2, Side B	4	1	4	http://www.imdb.com/title/tt5174252/?ref_=ttep...	8.1	5309	0.063477	0.003956	0.005086	0.007911	0.019213	0.045960	0.129215	0.298550	0.171219	0.255415
4	13 Reasons Why	Tape 3, Side A	5	1	5	http://www.imdb.com/title/tt5174254/?ref_=ttep...	8.2	5252	0.066832	0.002666	0.004570	0.008378	0.018850	0.037129	0.114242	0.257997	0.203542	0.285796

imdb = imdb.sort_values('UserRating',
                ascending=False)

Now let's see the result:

imdb[['series_name','UserRating']]

	series_name	UserRating
2359	Lúcifer	9.8
134	Dark	9.7
1404	The Walking Dead	9.7
1183	Friends	9.7
2371	Lúcifer	9.7
...	...	...
773	Pânico: A Série de TV	5.5
774	Pânico: A Série de TV	5.2
1908	Dracula	5.2
42	13 Reasons Why	5.2
137	Dracula	5.2

2597 rows × 2 columns
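Several episodes share the same rating, so it can help to break ties with a second column. A minimal sketch on made-up rows (not taken from the imdb file), passing lists to sort_values() so that each column gets its own direction:

```python
import pandas as pd

# Hypothetical ratings, for illustration only
ratings = pd.DataFrame({
    'series_name': ['Dark', 'Friends', 'Dark', 'Away'],
    'UserRating': [9.7, 9.7, 8.2, 7.1],
    'UserVotes': [17468, 11634, 9903, 500],
})

# Rating from highest to lowest; ties ordered by votes from lowest to highest
ordered = ratings.sort_values(['UserRating', 'UserVotes'],
                              ascending=[False, True])
print(ordered)
```

Note that the original index labels travel with the rows, which is why the index in the output above is no longer in order.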

Filtering rows

We can filter rows with the df.query() method, where df is the name of the dataframe. The condition is passed as a string.

imdb.query('UserVotes > 10000')

	series_name	Episode	series_ep	season	season_ep	url	UserRating	UserVotes	r1	r2	r3	r4	r5	r6	r7	r8	r9	r10
2359	Lúcifer	A Devil of My Word	55	3	24	http://www.imdb.com/title/tt8253126/?ref_=ttep...	9.8	10712	0.021004	0.001587	0.000840	0.000747	0.002054	0.002801	0.007468	0.018764	0.063107	0.881628
134	Dark	The Paradise	26	3	8	http://www.imdb.com/title/tt12557704/?ref_=tte...	9.7	17468	0.020151	0.002633	0.003263	0.002805	0.004351	0.006698	0.012308	0.022556	0.059824	0.865411
1404	The Walking Dead	No Way Out	57	6	9	http://www.imdb.com/title/tt4575388/?ref_=ttep...	9.7	24358	0.027424	0.003490	0.002094	0.002669	0.005665	0.007390	0.015519	0.033008	0.101979	0.800764
1371	The Walking Dead	Too Far Gone	24	4	8	http://www.imdb.com/title/tt2948638/?ref_=ttep...	9.7	22061	0.044876	0.002584	0.001541	0.001496	0.004624	0.006029	0.013100	0.039255	0.126286	0.760210
985	The Walking Dead	No Way Out	57	6	9	http://www.imdb.com/title/tt4575388/?ref_=ttep...	9.7	24358	0.027424	0.003490	0.002094	0.002669	0.005665	0.007390	0.015519	0.033008	0.101979	0.800764
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1411	The Walking Dead	Last Day on Earth	64	6	16	http://www.imdb.com/title/tt4589574/?ref_=ttep...	6.6	26981	0.263667	0.036878	0.037285	0.037211	0.052259	0.049590	0.068826	0.090693	0.104221	0.259368
992	The Walking Dead	Last Day on Earth	64	6	16	http://www.imdb.com/title/tt4589574/?ref_=ttep...	6.6	26981	0.263667	0.036878	0.037285	0.037211	0.052259	0.049590	0.068826	0.090693	0.104221	0.259368
1337	Stranger Things	Chapter Seven: The Lost Sister	15	2	7	http://www.imdb.com/title/tt6020810/?ref_=ttep...	6.1	21371	0.166955	0.045716	0.047869	0.059239	0.100791	0.118198	0.157784	0.116794	0.058958	0.127696
1417	The Walking Dead	Swear	70	7	6	http://www.imdb.com/title/tt5207734/?ref_=ttep...	5.6	11971	0.197978	0.042686	0.047615	0.060647	0.102164	0.140506	0.133823	0.087127	0.037591	0.149862
998	The Walking Dead	Swear	70	7	6	http://www.imdb.com/title/tt5207734/?ref_=ttep...	5.6	11971	0.197978	0.042686	0.047615	0.060647	0.102164	0.140506	0.133823	0.087127	0.037591	0.149862

124 rows × 18 columns
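query() also accepts combined conditions and can reference Python variables with the @ prefix. The sketch below, on made-up rows, shows this next to the equivalent boolean-mask form; the variable name min_votes is just an illustration:

```python
import pandas as pd

# Hypothetical rows, for illustration only
episodes = pd.DataFrame({
    'series_name': ['Dark', 'Dark', 'Away', 'Sherlock'],
    'UserRating': [9.7, 8.2, 7.1, 9.7],
    'UserVotes': [17468, 9903, 500, 33904],
})

min_votes = 10000  # local variable, referenced as @min_votes inside query()

by_query = episodes.query('UserVotes > @min_votes and UserRating >= 9.5')

# Same filter written as a boolean mask; each condition needs parentheses
by_mask = episodes[(episodes['UserVotes'] > min_votes)
                   & (episodes['UserRating'] >= 9.5)]

print(by_query.equals(by_mask))
```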

To combine row filtering with column selection:

imdb_10 = imdb.query('UserVotes > 10000')[['series_name','Episode','UserRating','UserVotes']]

imdb_10.head(10)

	series_name	Episode	UserRating	UserVotes
2359	Lúcifer	A Devil of My Word	9.8	10712
134	Dark	The Paradise	9.7	17468
1404	The Walking Dead	No Way Out	9.7	24358
1371	The Walking Dead	Too Far Gone	9.7	22061
985	The Walking Dead	No Way Out	9.7	24358
1314	Sherlock	The Reichenbach Fall	9.7	33904
1307	Friends	The Last One	9.7	11634
952	The Walking Dead	Too Far Gone	9.7	22061
961	The Walking Dead	No Sanctuary	9.6	23608
133	Dark	Between the Time	9.6	13301

imdb_10.shape

(124, 4)

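Let's check whether there are duplicates. Called without arguments, duplicated() marks a row as True when all of its columns repeat an earlier row.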

duplicates = imdb_10.duplicated()
print(duplicates)

2359    False
134     False
1404    False
1371    False
985      True
        ...  
1411    False
992      True
1337    False
1417    False
998      True
Length: 124, dtype: bool

There are duplicates, so we need to remove them with the drop_duplicates() method. Here we deduplicate on the Episode column only, which assumes that no two series share an episode title.

imdb_10 = imdb_10.drop_duplicates(subset=['Episode'])

Let's see how many rows we have now.

imdb_10.shape

(91, 4)
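Deduplicating on the episode title alone would also drop episodes from different series that happen to share a title. A hedged sketch of the difference, using hypothetical show names and titles rather than the real data:

```python
import pandas as pd

# Hypothetical rows: one true duplicate, plus two different shows
# that share the same episode title
eps = pd.DataFrame({
    'series_name': ['Show A', 'Show A', 'Show B', 'Show C'],
    'Episode': ['No Way Out', 'No Way Out', 'Pilot', 'Pilot'],
})

by_title = eps.drop_duplicates(subset=['Episode'])
by_series_and_title = eps.drop_duplicates(subset=['series_name', 'Episode'])

print(len(by_title))             # Show C's 'Pilot' is lost as well
print(len(by_series_and_title))  # only the true duplicate is removed
```

Whether this matters for imdb_10 depends on the data; comparing both row counts is a quick way to find out.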

We can also filter by strings. For example, let's filter for the series "Dark":

imdb['series_name'].isin(['Dark'])

2359    False
134      True
1404    False
1183    False
2371    False
        ...  
773     False
774     False
1908    False
42      False
137     False
Name: series_name, Length: 2597, dtype: bool

The code above returns a Series of boolean values (True/False), one per row. To actually see the rows for Dark, we use that Series as a mask on the dataframe:

imdb[imdb['series_name'].isin(['Dark'])]

	series_name	Episode	series_ep	season	season_ep	url	UserRating	UserVotes	r1	r2	r3	r4	r5	r6	r7	r8	r9	r10
134	Dark	The Paradise	26	3	8	http://www.imdb.com/title/tt12557704/?ref_=tte...	9.7	17468	0.020151	0.002633	0.003263	0.002805	0.004351	0.006698	0.012308	0.022556	0.059824	0.865411
133	Dark	Between the Time	25	3	7	http://www.imdb.com/title/tt12557700/?ref_=tte...	9.6	13301	0.020374	0.001880	0.002556	0.002707	0.004285	0.006842	0.014811	0.031577	0.073904	0.841065
124	Dark	An Endless Cycle	16	2	6	http://www.imdb.com/title/tt10454734/?ref_=tte...	9.6	11149	0.007983	0.000807	0.001256	0.001435	0.001884	0.005292	0.016055	0.045834	0.124765	0.794690
131	Dark	Life and Death	23	3	5	http://www.imdb.com/title/tt12557688/?ref_=tte...	9.5	10359	0.021238	0.001834	0.002510	0.002993	0.005502	0.010908	0.020948	0.044985	0.123661	0.765421
126	Dark	Endings and Beginnings	18	2	8	http://www.imdb.com/title/tt10457654/?ref_=tte...	9.5	11379	0.012040	0.002109	0.001846	0.002373	0.004482	0.009667	0.020301	0.045698	0.123473	0.778012
122	Dark	The Travelers	14	2	4	http://www.imdb.com/title/tt10454726/?ref_=tte...	9.5	9968	0.007925	0.000903	0.001204	0.001505	0.002307	0.006120	0.019161	0.056180	0.166132	0.738563
123	Dark	Lost and Found	15	2	5	http://www.imdb.com/title/tt10454732/?ref_=tte...	9.4	9210	0.007600	0.000869	0.001303	0.001629	0.002606	0.007166	0.023127	0.075570	0.197720	0.682410
125	Dark	The White Devil	17	2	7	http://www.imdb.com/title/tt10457652/?ref_=tte...	9.3	9050	0.008066	0.001657	0.001878	0.001768	0.003757	0.009392	0.026851	0.083315	0.204972	0.658343
132	Dark	Light and Shadow	24	3	6	http://www.imdb.com/title/tt12557694/?ref_=tte...	9.3	10060	0.024453	0.001889	0.002783	0.004175	0.006362	0.010934	0.025547	0.063221	0.127137	0.733499
116	Dark	As You Sow, so You Shall Reap	8	1	8	http://www.imdb.com/title/tt7313316/?ref_=ttep...	9.2	9294	0.007209	0.001076	0.000753	0.001829	0.003443	0.011298	0.030773	0.101033	0.257586	0.585001
118	Dark	Alpha and Omega	10	1	10	http://www.imdb.com/title/tt7313322/?ref_=ttep...	9.2	9758	0.010248	0.001947	0.001230	0.003279	0.005944	0.011273	0.032589	0.086801	0.229248	0.617442
121	Dark	Ghosts	13	2	3	http://www.imdb.com/title/tt10454722/?ref_=tte...	9.2	9138	0.007004	0.000766	0.001970	0.002079	0.004487	0.008645	0.030422	0.100897	0.226855	0.616875
114	Dark	Sic Mundus Creatus Est	6	1	6	http://www.imdb.com/title/tt7313312/?ref_=ttep...	9.1	9434	0.006996	0.000742	0.001272	0.001272	0.004240	0.014204	0.035616	0.115222	0.264151	0.556286
130	Dark	The Origin	22	3	4	http://www.imdb.com/title/tt12557686/?ref_=tte...	9.1	9529	0.022248	0.003883	0.004722	0.004198	0.006611	0.013118	0.033582	0.082695	0.163816	0.665128
113	Dark	Truths	5	1	5	http://www.imdb.com/title/tt7313308/?ref_=ttep...	9.0	9563	0.005751	0.000837	0.001046	0.002301	0.004183	0.014744	0.045070	0.134790	0.278992	0.512287
120	Dark	Dark Matter	12	2	2	http://www.imdb.com/title/tt10454716/?ref_=tte...	9.0	9221	0.006941	0.001084	0.001193	0.002603	0.003579	0.010303	0.034920	0.122655	0.247479	0.569244
127	Dark	Deja-vu	19	3	1	http://www.imdb.com/title/tt10414808/?ref_=tte...	9.0	10609	0.022622	0.003205	0.003959	0.004430	0.006221	0.014704	0.039872	0.098407	0.202658	0.603921
129	Dark	Adam and Eva	21	3	3	http://www.imdb.com/title/tt12557682/?ref_=tte...	8.9	9364	0.022747	0.002777	0.004806	0.004058	0.008330	0.016339	0.040474	0.109141	0.194682	0.596647
128	Dark	The Survivors	20	3	2	http://www.imdb.com/title/tt12557670/?ref_=tte...	8.9	9666	0.022967	0.003311	0.004138	0.004552	0.006104	0.017381	0.045520	0.119284	0.202773	0.573971
119	Dark	Beginnings and Endings	11	2	1	http://www.imdb.com/title/tt7787482/?ref_=ttep...	8.9	9969	0.007122	0.001505	0.001404	0.003611	0.004213	0.013442	0.049955	0.154780	0.238239	0.525730
115	Dark	Crossroads	7	1	7	http://www.imdb.com/title/tt7305824/?ref_=ttep...	8.8	8725	0.005845	0.001261	0.001032	0.002636	0.005501	0.015931	0.051920	0.194269	0.275186	0.446418
117	Dark	Everything Is Now	9	1	9	http://www.imdb.com/title/tt7313320/?ref_=ttep...	8.8	8599	0.006978	0.001744	0.001512	0.001279	0.006280	0.018025	0.055472	0.185719	0.260147	0.462845
111	Dark	Past and Present	3	1	3	http://www.imdb.com/title/tt7305820/?ref_=ttep...	8.7	9585	0.006051	0.001148	0.001669	0.002608	0.006886	0.023996	0.067710	0.208242	0.255921	0.425769
109	Dark	Secrets	1	1	1	http://www.imdb.com/title/tt6305578/?ref_=ttep...	8.3	11242	0.007917	0.001245	0.003202	0.003914	0.011564	0.030422	0.106298	0.274239	0.194094	0.367105
112	Dark	Double Lives	4	1	4	http://www.imdb.com/title/tt7305818/?ref_=ttep...	8.3	9088	0.005722	0.001761	0.001540	0.003631	0.009793	0.029820	0.102993	0.281690	0.180238	0.382812
110	Dark	Lies	2	1	2	http://www.imdb.com/title/tt7305776/?ref_=ttep...	8.2	9903	0.006160	0.001616	0.003130	0.002827	0.011108	0.034939	0.119661	0.293547	0.165303	0.361709

Let's store this dataframe under the name darkdf:

darkdf = imdb[imdb['series_name'].isin(['Dark'])]
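isin() is most useful with several values. For a single exact match, a plain == comparison gives the same mask, and str.contains() covers partial matches. A short sketch on a made-up column:

```python
import pandas as pd

shows = pd.DataFrame(
    {'series_name': ['Dark', 'The Walking Dead', 'Dracula', 'Dark']})

exact = shows[shows['series_name'] == 'Dark']                  # one value
several = shows[shows['series_name'].isin(['Dark', 'Dracula'])]  # any of a list
# regex=False treats the pattern as plain text instead of a regular expression
partial = shows[shows['series_name'].str.contains('Dead', regex=False)]

print(len(exact), len(several), len(partial))
```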

Grouping data by category

Let's group the imdb data and compute a few summaries: the number of episodes per series, the mean UserRating per series, and the total number of votes.

imdb_summary = imdb.groupby(['series_name']).agg({'series_name':"count",'UserRating':'mean','UserVotes':'sum'})

imdb_summary

	series_name	UserRating	UserVotes
series_name
13 Reasons Why	49	7.248980	151452
Arquivo X	356	7.988202	969628
Away	10	7.080000	5388
Como Defender um Assassino	180	8.568889	228352
Dark	26	9.076923	264631
Dracula	6	6.966667	28858
Era Uma Vez	312	8.266026	397332
Friends	235	8.451915	853287
Greenleaf	60	7.586667	2777
Grimm: Contos de Terror	123	8.435772	104967
Hemlock Grove	33	7.490909	12538
La Casa de Papel	31	8.209677	161782
Lúcifer	75	8.744000	298412
Motel Bates	50	8.592000	78773
Narcos	30	8.760000	122160
O Bom Lugar	50	8.272000	98562
Orange Is the New Black	91	8.172527	156565
Orphan Black	50	8.528000	60532
Os Originais	184	8.828261	179686
Pânico: A Série de TV	29	7.537931	28078
Ratched	8	7.812500	8234
Sherlock	15	8.800000	373114
Sleepy Hollow	62	7.680645	33079
Sobrenatural	211	8.389573	499046
Stranger Things	25	8.676000	391288
The Walking Dead	288	8.006944	2394586
The Witcher	8	8.487500	122656
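In the table above the episode count ends up in a column called series_name, the same name as the index, which is easy to misread. One alternative, sketched here on made-up rows, is named aggregation: each output column gets its own name, and as_index=False keeps the grouping key as a regular column. The output name Episodes is our choice, not something required by pandas:

```python
import pandas as pd

# Hypothetical rows, for illustration only
votes = pd.DataFrame({
    'series_name': ['Dark', 'Dark', 'Away'],
    'Episode': ['Secrets', 'Lies', 'Go'],
    'UserRating': [9.0, 8.0, 7.0],
    'UserVotes': [100, 200, 50],
})

# output_name=(input_column, aggregation)
summary = (votes.groupby('series_name', as_index=False)
                .agg(Episodes=('Episode', 'count'),
                     UserRating=('UserRating', 'mean'),
                     UserVotes=('UserVotes', 'sum')))
print(summary)
```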

We can group the darkdf dataframe in a similar way, here counting the non-null values in each column:

darkdf_grouped = darkdf.groupby(['series_name']).count()

darkdf_grouped

	Episode	series_ep	season	season_ep	url	UserRating	UserVotes	r1	r2	r3	r4	r5	r6	r7	r8	r9	r10
series_name
Dark	26	26	26	26	26	26	26	26	26	26	26	26	26	26	26	26	26

It is good practice to move the grouping key, in this case series_name, back into a regular column so that the index is numeric again. The reset_index() method does this.

darkdf_grouped.reset_index()

	series_name	Episode	series_ep	season	season_ep	url	UserRating	UserVotes	r1	r2	r3	r4	r5	r6	r7	r8	r9	r10
0	Dark	26	26	26	26	26	26	26	26	26	26	26	26	26	26	26	26	26

Let's do the same for imdb_summary, first renaming the count column to Episodes:

imdb_summary2 = imdb_summary.rename(columns={'series_name':'Episodes'})
imdb_summary2.reset_index(inplace=True)

imdb_summary2.head(7)

	series_name	Episodes	UserRating	UserVotes
0	13 Reasons Why	49	7.248980	151452
1	Arquivo X	356	7.988202	969628
2	Away	10	7.080000	5388
3	Como Defender um Assassino	180	8.568889	228352
4	Dark	26	9.076923	264631
5	Dracula	6	6.966667	28858
6	Era Uma Vez	312	8.266026	397332

Modifying or creating new columns

Similar to the mutate() function in R (dplyr), pandas offers the df.assign() method, which returns a new dataframe with the added column.

imdb_rtotal = imdb.assign(r_total=imdb['r1']+imdb['r2']+imdb['r3']+imdb['r4']+imdb['r5']+imdb['r6']+imdb['r7']+imdb['r8']+imdb['r9']+imdb['r10'])

Now we can select only the columns we are interested in.

imdb_rtotal[['series_name','Episode','r_total']]

	series_name	Episode	r_total
2359	Lúcifer	A Devil of My Word	1.0
134	Dark	The Paradise	1.0
1404	The Walking Dead	No Way Out	1.0
1183	Friends	The One Where Everybody Finds Out	1.0
2371	Lúcifer	Who's da New King of Hell?	1.0
...	...	...	...
773	Pânico: A Série de TV	Blindspots	1.0
774	Pânico: A Série de TV	Endgame	1.0
1908	Dracula	The Dark Compass	1.0
42	13 Reasons Why	Senior Camping Trip	1.0
137	Dracula	The Dark Compass	1.0

2597 rows × 3 columns
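Adding ten columns by hand is easy to get wrong. A minimal sketch of the same idea with a row-wise sum, on a made-up frame with only three share columns (in the tutorial's data the list would run from r1 to r10):

```python
import pandas as pd

# Hypothetical rating shares; each row is meant to add up to 1
shares = pd.DataFrame({'r1': [0.1, 0.2], 'r2': [0.3, 0.3], 'r3': [0.6, 0.5]})

rating_cols = [f'r{i}' for i in range(1, 4)]   # ['r1', 'r2', 'r3']

# axis=1 sums across the selected columns within each row
shares = shares.assign(r_total=shares[rating_cols].sum(axis=1))
print(shares)
```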

We can also create new columns with bracket assignment. Suppose we want to add a new column to the imdb_summary2 dataframe:

imdb_summary2['VotesPerEpisode']=imdb_summary2['UserVotes']/imdb_summary2['Episodes']

imdb_summary2.head(8)

	series_name	Episodes	UserRating	UserVotes	VotesPerEpisode
0	13 Reasons Why	49	7.248980	151452	3090.857143
1	Arquivo X	356	7.988202	969628	2723.674157
2	Away	10	7.080000	5388	538.800000
3	Como Defender um Assassino	180	8.568889	228352	1268.622222
4	Dark	26	9.076923	264631	10178.115385
5	Dracula	6	6.966667	28858	4809.666667
6	Era Uma Vez	312	8.266026	397332	1273.500000
7	Friends	235	8.451915	853287	3631.008511

Now let's sort by the new VotesPerEpisode column and round to two decimal places:

imdb_summary2.sort_values('VotesPerEpisode',
                ascending=False).round(decimals=2)

	series_name	Episodes	UserRating	UserVotes	VotesPerEpisode
21	Sherlock	15	8.80	373114	24874.27
24	Stranger Things	25	8.68	391288	15651.52
26	The Witcher	8	8.49	122656	15332.00
4	Dark	26	9.08	264631	10178.12
25	The Walking Dead	288	8.01	2394586	8314.53
11	La Casa de Papel	31	8.21	161782	5218.77
5	Dracula	6	6.97	28858	4809.67
14	Narcos	30	8.76	122160	4072.00
12	Lúcifer	75	8.74	298412	3978.83
7	Friends	235	8.45	853287	3631.01
0	13 Reasons Why	49	7.25	151452	3090.86
1	Arquivo X	356	7.99	969628	2723.67
23	Sobrenatural	211	8.39	499046	2365.15
15	O Bom Lugar	50	8.27	98562	1971.24
16	Orange Is the New Black	91	8.17	156565	1720.49
13	Motel Bates	50	8.59	78773	1575.46
6	Era Uma Vez	312	8.27	397332	1273.50
3	Como Defender um Assassino	180	8.57	228352	1268.62
17	Orphan Black	50	8.53	60532	1210.64
20	Ratched	8	7.81	8234	1029.25
18	Os Originais	184	8.83	179686	976.55
19	Pânico: A Série de TV	29	7.54	28078	968.21
9	Grimm: Contos de Terror	123	8.44	104967	853.39
2	Away	10	7.08	5388	538.80
22	Sleepy Hollow	62	7.68	33079	533.53
10	Hemlock Grove	33	7.49	12538	379.94
8	Greenleaf	60	7.59	2777	46.28

Joining two or more dataframes

merge joins two DataFrames (df1 and df2) based on the specified columns (on, left_on, right_on). The join type (how) defines how the tables are combined:

'inner': Keeps only the rows that have a match in the join columns (intersection).
'left': Keeps all rows of the left DataFrame (df1) and the matching rows of the right one (df2).
'right': Keeps all rows of the right DataFrame (df2) and the matching rows of the left one (df1).
'outer': Keeps all rows of both DataFrames, filling with NaN where there is no match (union).
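The four join types listed above can be compared side by side. The sketch below uses two tiny made-up frames that share only one title; indicator=True adds a _merge column showing which side each row came from:

```python
import pandas as pd

# Hypothetical frames with one title in common ('Dark')
left = pd.DataFrame({'title': ['Dark', 'Narcos'], 'age': [35, 20]})
right = pd.DataFrame({'title': ['Dark', 'Friends'],
                      'season': ['Season 3', 'Season 4']})

row_counts = {}
for how in ['inner', 'left', 'right', 'outer']:
    row_counts[how] = len(left.merge(right, on='title', how=how))
print(row_counts)

# _merge is 'both', 'left_only' or 'right_only' for each row
outer = left.merge(right, on='title', how='outer', indicator=True)
print(outer)
```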

Use left_on and right_on when the join columns have different names in the two DataFrames. Let's use the small df created at the start of this tutorial to see how merge works.

df

	name	age	title
0	Alice	25	Sherlock
1	Bob	30	The Walking Dead
2	Charlie	35	Dark
3	Alice	25	Friends
4	David	40	Orange Is the New Black
5	Bob	30	The Walking Dead
6	Camille	20	Narcos

And let's create a small dataframe with Netflix data:

netflix_5 = {'season': ['Season 1', 'Season 2', 'Season 3', 'Season 4', 'Season 5', 'Part 1','Part 2'],
        'title': ['Sherlock', 'The Walking Dead', 'Dark', 'Friends', 'Orange Is the New Black', 'The Walking Dead','Narcos']}

netflix_5_df = pd.DataFrame(netflix_5)

netflix_5_df

	season	title
0	Season 1	Sherlock
1	Season 2	The Walking Dead
2	Season 3	Dark
3	Season 4	Friends
4	Season 5	Orange Is the New Black
5	Part 1	The Walking Dead
6	Part 2	Narcos

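Let's perform the merge. Because The Walking Dead appears twice in each dataframe, the join returns every pairing of those rows (2 × 2 = 4), so the result has 9 rows rather than 7: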

netflix_5_df.merge(df, on='title', how='left')

	season	title	name	age
0	Season 1	Sherlock	Alice	25
1	Season 2	The Walking Dead	Bob	30
2	Season 2	The Walking Dead	Bob	30
3	Season 3	Dark	Charlie	35
4	Season 4	Friends	Alice	25
5	Season 5	Orange Is the New Black	David	40
6	Part 1	The Walking Dead	Bob	30
7	Part 1	The Walking Dead	Bob	30
8	Part 2	Narcos	Camille	20
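When the join key repeats on both sides, merge returns every combination of the matching rows. A hedged sketch, assuming we want at most one row per title on the right-hand side: drop the repeated rows first, and let validate='many_to_one' raise an error if the right-hand keys are still not unique. The frames below are a cut-down, made-up version of the ones above:

```python
import pandas as pd

left = pd.DataFrame({'title': ['The Walking Dead', 'The Walking Dead', 'Narcos'],
                     'season': ['Season 2', 'Part 1', 'Part 2']})
right = pd.DataFrame({'title': ['The Walking Dead', 'The Walking Dead', 'Narcos'],
                      'name': ['Bob', 'Bob', 'Camille']})

# The repeated title matches 2 x 2 times, plus one row for Narcos
many_to_many = left.merge(right, on='title', how='left')
print(len(many_to_many))

# Remove fully repeated rows on the right, then check key uniqueness
right_unique = right.drop_duplicates()
one_per_title = left.merge(right_unique, on='title', how='left',
                           validate='many_to_one')
print(len(one_per_title))
```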