The raw Netflix dataset was loaded into a Jupyter Notebook using pandas. Data cleaning tasks included handling missing values, removing duplicates, and ensuring data consistency.
Exploratory Data Analysis (EDA) was conducted to understand the structure of the dataset. Summary statistics were calculated for the numeric and categorical columns to identify central tendencies and outliers.
Using the Plotly library, interactive charts were created to illustrate trends and patterns in the data, covering content type, maturity ratings, countries, directors, genres, and release years.
Interpretations and conclusions were drawn from the analyses and visualizations. The findings describe the composition of the catalogue: the split between movies and TV shows, the most common genres and ratings, and how the number of titles varies by release year.
import numpy as np
import pandas as pd
import seaborn as sb
import plotly.express as px
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from plotly.offline import iplot, plot
from plotly.subplots import make_subplots
data = pd.read_csv('netflix_titles.csv')
data.shape
(8807, 12)
print(f"Number of Rows : {data.shape[0]} \nNumber of Columns : {data.shape[1]}")
Number of Rows : 8807
Number of Columns : 12
data.columns
Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added', 'release_year', 'rating', 'duration', 'listed_in', 'description'], dtype='object')
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB
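The `info()` output shows `date_added` stored as `object` (plain strings), so any analysis by date added would need it parsed first. The following is a minimal sketch on toy rows, not the real file; the `date_added_parsed` and `year_added` column names and the whitespace stripping are assumptions for illustration.

```python
import pandas as pd

# Toy rows in the same "Month D, YYYY" style as date_added (not the real dataset).
sample = pd.DataFrame({
    "title": ["A", "B", "C"],
    "date_added": ["September 25, 2021", " July 1, 2019", None],
})

# Strip stray whitespace, then parse; missing or unparseable values become NaT.
sample["date_added_parsed"] = pd.to_datetime(
    sample["date_added"].str.strip(), format="%B %d, %Y", errors="coerce"
)

# Hypothetical helper column: the calendar year in which the title was added.
sample["year_added"] = sample["date_added_parsed"].dt.year
print(sample[["title", "date_added_parsed", "year_added"]])
```

Parsing should happen before any blanket `fillna` with a text placeholder, otherwise the placeholder strings are simply coerced to `NaT` again.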
data.describe()
| | release_year |
---|---|
count | 8807.000000 |
mean | 2014.180198 |
std | 8.819312 |
min | 1925.000000 |
25% | 2013.000000 |
50% | 2017.000000 |
75% | 2019.000000 |
max | 2021.000000 |
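
The quartiles in this summary can be turned into a simple outlier rule. The sketch below applies Tukey's 1.5 × IQR fences to the `release_year` quartiles reported above; the choice of that rule, the `tukey_fences` helper, and the toy years are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences for the k * IQR outlier rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# 25% and 75% values copied from the describe() output.
lower, upper = tukey_fences(2013.0, 2019.0)
print(lower, upper)

# Toy release years to show which values the rule would flag.
toy_years = pd.Series([1925, 2007, 2017, 2021])
flagged = toy_years[(toy_years < lower) | (toy_years > upper)]
print(flagged.tolist())
```

With an interquartile range of 6 years, the lower fence falls at 2004, so under this rule every title released before 2004 would count as an outlier.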
data.describe(exclude=np.number)
| | show_id | type | title | director | cast | country | date_added | rating | duration | listed_in | description |
---|---|---|---|---|---|---|---|---|---|---|---|
count | 8807 | 8807 | 8807 | 6173 | 7982 | 7976 | 8797 | 8803 | 8804 | 8807 | 8807 |
unique | 8807 | 2 | 8807 | 4528 | 7692 | 748 | 1767 | 17 | 220 | 514 | 8775 |
top | s1 | Movie | Dick Johnson Is Dead | Rajiv Chilaka | David Attenborough | United States | January 1, 2020 | TV-MA | 1 Season | Dramas, International Movies | Paranormal activity at a lush, abandoned prope... |
freq | 1 | 6131 | 1 | 19 | 19 | 2818 | 109 | 3207 | 1793 | 362 | 4 |
data.sample(5)
| | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
---|---|---|---|---|---|---|---|---|---|---|---|---|
3748 | s3749 | TV Show | El desconocido | Not Given | Guillermo Iván, César Manjarrez, Estrella Solí... | Not Given | June 14, 2019 | 2019 | TV-MA | 2 Seasons | Crime TV Shows, International TV Shows, Spanis... | Based on real events, the fictional story of M... |
7223 | s7224 | Movie | Kon-Tiki | Joachim Rønning, Espen Sandberg | Pål Sverre Hagen, Anders Baasmo Christiansen, ... | United Kingdom, Norway, Denmark, Germany, Sweden | April 26, 2019 | 2012 | PG-13 | 96 min | Action & Adventure, Dramas, International Movies | With five loyal friends in tow, explorer Thor ... |
6389 | s6390 | TV Show | Bure Kaam Bura Natija, Kyun Bhai Chacha Haan B... | Not Given | Not Given | Not Given | March 31, 2018 | 2017 | TV-PG | 1 Season | Kids' TV | A clever uncle-nephew duo solves mysteries, cr... |
2810 | s2811 | Movie | Bypass Road | Naman Nitin Mukesh | Neil Nitin Mukesh, Adah Sharma, Rajit Kapoor, ... | India | March 15, 2020 | 2019 | TV-14 | 135 min | International Movies, Thrillers | On the night his ex-lover mysteriously dies, a... |
4083 | s4084 | Movie | Bert Kreischer: The Machine | Ryan Polito | Bert Kreischer | United States | February 22, 2019 | 2016 | TV-MA | 70 min | Stand-Up Comedy | From his run-in with a grizzly bear to partyin... |
data.head()
| | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | s1 | Movie | Dick Johnson Is Dead | Kirsten Johnson | NaN | United States | September 25, 2021 | 2020 | PG-13 | 90 min | Documentaries | As her father nears the end of his life, filmm... |
1 | s2 | TV Show | Blood & Water | NaN | Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban... | South Africa | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, TV Dramas, TV Mysteries | After crossing paths at a party, a Cape Town t... |
2 | s3 | TV Show | Ganglands | Julien Leclercq | Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi... | NaN | September 24, 2021 | 2021 | TV-MA | 1 Season | Crime TV Shows, International TV Shows, TV Act... | To protect his family from a powerful drug lor... |
3 | s4 | TV Show | Jailbirds New Orleans | NaN | NaN | NaN | September 24, 2021 | 2021 | TV-MA | 1 Season | Docuseries, Reality TV | Feuds, flirtations and toilet talk go down amo... |
4 | s5 | TV Show | Kota Factory | NaN | Mayur More, Jitendra Kumar, Ranjan Raj, Alam K... | India | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, Romantic TV Shows, TV ... | In a city of coaching centers known to train I... |
data.tail()
| | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
---|---|---|---|---|---|---|---|---|---|---|---|---|
8802 | s8803 | Movie | Zodiac | David Fincher | Mark Ruffalo, Jake Gyllenhaal, Robert Downey J... | United States | November 20, 2019 | 2007 | R | 158 min | Cult Movies, Dramas, Thrillers | A political cartoonist, a crime reporter and a... |
8803 | s8804 | TV Show | Zombie Dumb | NaN | NaN | NaN | July 1, 2019 | 2018 | TV-Y7 | 2 Seasons | Kids' TV, Korean TV Shows, TV Comedies | While living alone in a spooky town, a young g... |
8804 | s8805 | Movie | Zombieland | Ruben Fleischer | Jesse Eisenberg, Woody Harrelson, Emma Stone, ... | United States | November 1, 2019 | 2009 | R | 88 min | Comedies, Horror Movies | Looking to survive in a world taken over by zo... |
8805 | s8806 | Movie | Zoom | Peter Hewitt | Tim Allen, Courteney Cox, Chevy Chase, Kate Ma... | United States | January 11, 2020 | 2006 | PG | 88 min | Children & Family Movies, Comedies | Dragged from civilian life, a former superhero... |
8806 | s8807 | Movie | Zubaan | Mozez Singh | Vicky Kaushal, Sarah-Jane Dias, Raaghav Chanan... | India | March 2, 2019 | 2015 | TV-14 | 111 min | Dramas, International Movies, Music & Musicals | A scrappy but poor boy worms his way into a ty... |
data.isnull().sum()
show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64
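Raw null counts are easier to compare as a share of all rows; for example, 2634 missing directors out of 8807 rows is about 29.9%. A minimal sketch on a toy frame (the `toy` data and the `missing_pct` name are illustrative assumptions):

```python
import pandas as pd

# Toy frame standing in for the Netflix data.
toy = pd.DataFrame({
    "director": ["X", None, None, "Y"],
    "country": ["India", "Japan", None, "Spain"],
    "title": ["a", "b", "c", "d"],
})

# isnull().mean() is the fraction of missing values per column.
missing_pct = (toy.isnull().mean() * 100).round(2).sort_values(ascending=False)
print(missing_pct)
```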
data.fillna('Not Given',inplace=True)
data.isnull().sum()
show_id         0
type            0
title           0
director        0
cast            0
country         0
date_added      0
release_year    0
rating          0
duration        0
listed_in       0
description     0
dtype: int64
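A blanket `fillna('Not Given')` also writes the placeholder into columns such as `date_added`, `rating`, and `duration`. One possible alternative, sketched here on toy data, is to pass a per-column mapping so that only selected text columns are filled; which columns belong in `text_cols` is an assumption.

```python
import pandas as pd

toy = pd.DataFrame({
    "director": ["X", None],
    "country": [None, "India"],
    "date_added": ["July 1, 2019", None],
})

# Hypothetical choice: only descriptive text columns receive the placeholder.
text_cols = ["director", "country"]
filled = toy.fillna({col: "Not Given" for col in text_cols})
print(filled)
```

Columns left out of the mapping keep their nulls, so they can still be handled by date parsing or row-level fixes later.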
data.head(5)
| | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | s1 | Movie | Dick Johnson Is Dead | Kirsten Johnson | Not Given | United States | September 25, 2021 | 2020 | PG-13 | 90 min | Documentaries | As her father nears the end of his life, filmm... |
1 | s2 | TV Show | Blood & Water | Not Given | Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban... | South Africa | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, TV Dramas, TV Mysteries | After crossing paths at a party, a Cape Town t... |
2 | s3 | TV Show | Ganglands | Julien Leclercq | Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi... | Not Given | September 24, 2021 | 2021 | TV-MA | 1 Season | Crime TV Shows, International TV Shows, TV Act... | To protect his family from a powerful drug lor... |
3 | s4 | TV Show | Jailbirds New Orleans | Not Given | Not Given | Not Given | September 24, 2021 | 2021 | TV-MA | 1 Season | Docuseries, Reality TV | Feuds, flirtations and toilet talk go down amo... |
4 | s5 | TV Show | Kota Factory | Not Given | Mayur More, Jitendra Kumar, Ranjan Raj, Alam K... | India | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, Romantic TV Shows, TV ... | In a city of coaching centers known to train I... |
data.duplicated().sum()
0
data.duplicated(subset=['title']).sum()
0
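`duplicated()` compares values exactly, so titles that differ only in letter case or surrounding whitespace would not be caught by the checks above. A hedged sketch of a stricter check on toy data (the normalization steps are an assumption about what should count as a duplicate):

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Zoom", "zoom ", "Zodiac"],
    "type": ["Movie", "Movie", "Movie"],
})

# Exact comparison finds nothing.
exact_dupes = int(toy.duplicated(subset=["title"]).sum())

# Normalize case and whitespace before comparing.
normalized = toy["title"].str.strip().str.casefold()
near_dupes = int(normalized.duplicated().sum())
print(exact_dupes, near_dupes)
```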
del data['description']
data.head()
| | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | s1 | Movie | Dick Johnson Is Dead | Kirsten Johnson | Not Given | United States | September 25, 2021 | 2020 | PG-13 | 90 min | Documentaries |
1 | s2 | TV Show | Blood & Water | Not Given | Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban... | South Africa | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, TV Dramas, TV Mysteries |
2 | s3 | TV Show | Ganglands | Julien Leclercq | Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi... | Not Given | September 24, 2021 | 2021 | TV-MA | 1 Season | Crime TV Shows, International TV Shows, TV Act... |
3 | s4 | TV Show | Jailbirds New Orleans | Not Given | Not Given | Not Given | September 24, 2021 | 2021 | TV-MA | 1 Season | Docuseries, Reality TV |
4 | s5 | TV Show | Kota Factory | Not Given | Mayur More, Jitendra Kumar, Ranjan Raj, Alam K... | India | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, Romantic TV Shows, TV ... |
print(f"Number of Rows : {data.shape[0]}\nNumber of Columns : {data.shape[1]}")
Number of Rows : 8807
Number of Columns : 11
type_count = data['type'].value_counts()
type_count
type
Movie      6131
TV Show    2676
Name: count, dtype: int64
type_count.index
Index(['Movie', 'TV Show'], dtype='object', name='type')
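The same counts can be expressed as percentages before plotting. This small sketch rebuilds the two counts shown above by hand rather than reading the CSV; the `share` name is illustrative.

```python
import pandas as pd

# Counts copied from the value_counts() output.
type_count = pd.Series({"Movie": 6131, "TV Show": 2676}, name="count")

# Share of each type in percent, rounded to one decimal place.
share = (type_count / type_count.sum() * 100).round(1)
print(share)
```

On the full DataFrame, `data['type'].value_counts(normalize=True)` gives the same proportions directly.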
barChart = px.bar(
    type_count,
    text_auto=True,
    title='Type Wise Count of Titles',
    labels=dict(type="Type", value="Count"),
    color=type_count.index,
    color_discrete_map={'Movie': '#189ad3', 'TV Show': '#e9724d'},
)
barChart.show()
# Take the slice names from the index so labels always match the counts
pieChart = px.pie(
    values=type_count,
    names=type_count.index,
    title='Type Wise Count of Titles',
    color=type_count.index,
    color_discrete_map={'Movie': '#189ad3', 'TV Show': '#e9724d'},
)
pieChart.update_traces(textinfo='label+value+percent')
pieChart.show()
data['rating'].unique()
array(['PG-13', 'TV-MA', 'PG', 'TV-14', 'TV-PG', 'TV-Y', 'TV-Y7', 'R', 'TV-G', 'G', 'NC-17', '74 min', '84 min', '66 min', 'NR', nan, 'TV-Y7-FV', 'UR'], dtype=object)
ratings = data.groupby(data['rating']).size()
print(ratings)
rating
66 min         1
74 min         1
84 min         1
G             41
NC-17          3
NR            80
PG           287
PG-13        490
R            799
TV-14       2160
TV-G         220
TV-MA       3207
TV-PG        863
TV-Y         307
TV-Y7        334
TV-Y7-FV       6
UR             3
dtype: int64
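# '66 min', '74 min' and '84 min' are durations, not ratings; drop all three
# (and the 'Not Given' placeholder, if present) before plotting
ratings = ratings.drop(['66 min', '74 min', '84 min', 'Not Given'], errors='ignore')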
ratings
rating
G             41
NC-17          3
NR            80
PG           287
PG-13        490
R            799
TV-14       2160
TV-G         220
TV-MA       3207
TV-PG        863
TV-Y         307
TV-Y7        334
TV-Y7-FV       6
UR             3
dtype: int64
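The three `... min` values look like durations that ended up in the `rating` column, and the null counts earlier show three missing `duration` values. Rather than only dropping them from the counts, one option is to move them back at row level. This is a sketch on toy rows; the regular expression and the choice of 'Not Given' as the replacement rating are assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["T1", "T2", "T3"],
    "rating": ["74 min", "TV-MA", "PG"],
    "duration": [None, "1 Season", "90 min"],
})

# Rows whose rating looks like a runtime, e.g. "74 min".
misplaced = toy["rating"].str.fullmatch(r"\d+ min", na=False)

# Move the value into duration where duration is missing ...
needs_duration = misplaced & toy["duration"].isna()
toy.loc[needs_duration, "duration"] = toy.loc[needs_duration, "rating"]

# ... then mark the rating itself as unknown.
toy.loc[misplaced, "rating"] = "Not Given"
print(toy)
```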
Rating | Description |
---|---|
G | Suitable for General Audiences |
NC-17 | Inappropriate for ages 17 and under |
NR | Not Rated |
PG | Parental Guidance suggested |
PG-13 | Parents strongly cautioned. May be inappropriate for ages under 13 |
R | Restricted. May be inappropriate for ages under 17 |
TV-14 | Parents strongly cautioned. May not be suitable for ages under 14 |
TV-G | Suitable for General Audiences |
TV-MA | For Mature Audiences |
TV-PG | Parental Guidance suggested |
TV-Y | Designed to be appropriate for all children |
TV-Y7 | Suitable for ages 7 and up |
TV-Y7-FV | Suitable for ages 7 and up and contains Fantasy Violence |
UR | Unrated |
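
The rating table can also be kept as a lookup in code, for example to attach a readable description to each bar. The sketch below uses the descriptions from the table and three counts copied from the output above; the `rating_descriptions` and `labelled` names are illustrative assumptions.

```python
import pandas as pd

# Descriptions taken from the rating table.
rating_descriptions = {
    "G": "Suitable for General Audiences",
    "NC-17": "Inappropriate for ages 17 and under",
    "NR": "Not Rated",
    "PG": "Parental Guidance suggested",
    "PG-13": "Parents strongly cautioned. May be inappropriate for ages under 13",
    "R": "Restricted. May be inappropriate for ages under 17",
    "TV-14": "Parents strongly cautioned. May not be suitable for ages under 14",
    "TV-G": "Suitable for General Audiences",
    "TV-MA": "For Mature Audiences",
    "TV-PG": "Parental Guidance suggested",
    "TV-Y": "Designed to be appropriate for all children",
    "TV-Y7": "Suitable for ages 7 and up",
    "TV-Y7-FV": "Suitable for ages 7 and up and contains Fantasy Violence",
    "UR": "Unrated",
}

# A few counts copied from the ratings output (not the full series).
counts = pd.Series({"TV-MA": 3207, "TV-14": 2160, "R": 799}, name="titles")

labelled = counts.to_frame()
# Fall back to "Unknown" for any code missing from the lookup.
labelled["description"] = [rating_descriptions.get(r, "Unknown") for r in labelled.index]
print(labelled)
```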
ratings_bar = px.bar(
    ratings,
    text_auto=True,
    title='Distribution of Content based on Maturity Ratings',
    color=ratings.index,
    labels=dict(rating="Rating", value="No. of Titles"),
)
ratings_bar.show()
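df = data['country'].value_counts()
# Drop the 'Not Given' placeholder (if present) so it is not ranked as a country
df = df.drop('Not Given', errors='ignore')
top10 = df.head(10)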
top10
country
United States     2818
India              972
United Kingdom     419
Japan              245
South Korea        199
Canada             181
Spain              145
France             124
Mexico             110
Egypt              106
Name: count, dtype: int64
bar = px.bar(
    top10,
    text_auto=True,
    title='Top 10 Countries based on number of titles',
    labels=dict(country='Country', value="No. of Titles"),
    color=top10.index,
)
bar.show()
pie = px.pie(
    values=top10,
    title='Top 10 Countries based on number of titles',
    color=top10.index,
    color_discrete_sequence=px.colors.sequential.Plasma,
)
pie.update_traces(textinfo='value+percent')
pie.show()
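`value_counts()` on `country` treats each distinct string as one category, so a co-production such as "United Kingdom, Norway, Denmark, Germany, Sweden" is counted as its own entry rather than once per country. A hedged sketch of per-country counting, using a few rows copied from the tables above as toy data:

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Kon-Tiki", "Zodiac", "Zombieland", "Zubaan"],
    "country": [
        "United Kingdom, Norway, Denmark, Germany, Sweden",
        "United States",
        "United States",
        "India",
    ],
})

# Split the comma-separated string, give each country its own row, then count.
per_country = (
    toy["country"]
    .str.split(",")
    .explode()
    .str.strip()
    .value_counts()
)
print(per_country)
```

The same split-and-explode pattern would apply to other multi-valued columns such as `director`, `cast`, and `listed_in`.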
# Exclude the 'Not Given' placeholder so it is not ranked as a director
world_directors = data[data['director'] != 'Not Given'].groupby('director').size()
sorted_dirs = world_directors.sort_values(ascending=False)
sorted_dirs
director
Rajiv Chilaka                             19
Raúl Campos, Jan Suter                    18
Suhas Kadav                               16
Marcus Raboy                              16
Jay Karas                                 14
                                          ..
Jos Humphrey                               1
Jose Gomez                                 1
Jose Javier Reyes                          1
Joseduardo Giordano, Sergio Goyri Jr.      1
Khaled Youssef                             1
Length: 4528, dtype: int64
top10_dirs = sorted_dirs.head(10)
top10_dirs
director
Rajiv Chilaka             19
Raúl Campos, Jan Suter    18
Suhas Kadav               16
Marcus Raboy              16
Jay Karas                 14
Cathy Garcia-Molina       13
Jay Chapman               12
Youssef Chahine           12
Martin Scorsese           12
Steven Spielberg          11
dtype: int64
top10_dirs_bar = px.bar(
    top10_dirs,
    text_auto=True,
    orientation='h',
    title="Top 10 Directors in World",
    labels=dict(director='Director', value="Total Content"),
    color=top10_dirs.index,
)
top10_dirs_bar.show()
indian_content = data[(data['country']=='India') & (data['director']!= 'Not Given')]
indian_directors = indian_content.groupby(indian_content['director']).size()
sorted_ind_dir = indian_directors.sort_values(ascending=False)
top10_ind_dir = sorted_ind_dir.head(10)
top10_ind_dir.head(10)
director
David Dhawan          9
Ram Gopal Varma       7
Imtiaz Ali            6
Sooraj R. Barjatya    6
Anees Bazmee          6
Rajkumar Santoshi     6
Anurag Kashyap        5
Prakash Jha           5
Umesh Mehra           5
Madhur Bhandarkar     5
dtype: int64
top_ind_dir_bar = px.bar(
    top10_ind_dir,
    text_auto=True,
    orientation='h',
    color=top10_ind_dir.index,
    color_discrete_sequence=px.colors.sequential.YlOrRd_r,
    title='Top 10 Indian Directors',
    labels=dict(director='Director', value='Total Content'),
)
top_ind_dir_bar.show()
all_gen = data['listed_in'].str.split(', ',expand=True)
b = all_gen.melt(value_name='category').dropna()
top10_gen = b['category'].value_counts().head(10)
top10_gen
category
International Movies        2752
Dramas                      2427
Comedies                    1674
International TV Shows      1351
Documentaries                869
Action & Adventure           859
TV Dramas                    763
Independent Movies           756
Children & Family Movies     641
Romantic Movies              616
Name: count, dtype: int64
top10_gen_bar = px.bar(
    top10_gen,
    text_auto=True,
    orientation='h',
    color=top10_gen.index,
    color_discrete_sequence=px.colors.sequential.Plasma,
    title='Top 10 Popular Genres',
    labels=dict(category='Genre', value='No. of Shows'),
)
top10_gen_bar.show()
release_year = data.groupby(data['release_year']).size()
release_year
release_year
1925       1
1942       2
1943       3
1944       3
1945       4
        ...
2017    1032
2018    1147
2019    1030
2020     953
2021     592
Length: 74, dtype: int64
iplot(px.area(
    release_year,
    x=release_year.index,
    y=release_year,
    labels=dict(release_year='Release Year', y='No. of Shows'),
))
df = data[['release_year','type']]
prod2 = df.groupby(['release_year','type']).size().reset_index(name='Total Content')
prod2 = prod2[prod2['release_year']>=2010]
prod2 = prod2.rename(columns={'release_year':'Release Year'})
prod2
| | Release Year | type | Total Content |
---|---|---|---|
95 | 2010 | Movie | 154 |
96 | 2010 | TV Show | 40 |
97 | 2011 | Movie | 145 |
98 | 2011 | TV Show | 40 |
99 | 2012 | Movie | 173 |
100 | 2012 | TV Show | 64 |
101 | 2013 | Movie | 225 |
102 | 2013 | TV Show | 63 |
103 | 2014 | Movie | 264 |
104 | 2014 | TV Show | 88 |
105 | 2015 | Movie | 398 |
106 | 2015 | TV Show | 162 |
107 | 2016 | Movie | 658 |
108 | 2016 | TV Show | 244 |
109 | 2017 | Movie | 767 |
110 | 2017 | TV Show | 265 |
111 | 2018 | Movie | 767 |
112 | 2018 | TV Show | 380 |
113 | 2019 | Movie | 633 |
114 | 2019 | TV Show | 397 |
115 | 2020 | Movie | 517 |
116 | 2020 | TV Show | 436 |
117 | 2021 | Movie | 277 |
118 | 2021 | TV Show | 315 |
prod_trend = px.line(
    prod2,
    x="Release Year",
    y="Total Content",
    color='type',
    title='Trend of content produced over the years on Netflix',
)
prod_trend.show()
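The line chart shows absolute counts; the relative shift between movies and TV shows is easier to read as a share per year. The sketch below rebuilds the last three years of `prod2` by hand from the table above and pivots to a wide layout; the `wide` frame and the `TV Show share (%)` column name are illustrative assumptions.

```python
import pandas as pd

# Rows for 2019 to 2021 copied from the prod2 table.
prod2_tail = pd.DataFrame({
    "Release Year": [2019, 2019, 2020, 2020, 2021, 2021],
    "type": ["Movie", "TV Show"] * 3,
    "Total Content": [633, 397, 517, 436, 277, 315],
})

# One row per year, one column per type.
wide = prod2_tail.pivot(index="Release Year", columns="type", values="Total Content")

# Percentage of each year's titles that are TV shows.
wide["TV Show share (%)"] = (
    wide["TV Show"] / (wide["Movie"] + wide["TV Show"]) * 100
).round(1)
print(wide)
```

For these three years the TV show share rises each year, and for titles released in 2021 TV shows outnumber movies (315 against 277).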