Netflix Data Analysis¶

By Anubhav Verma¶

This project involves the analysis of Netflix data using Python and various data science libraries such as pandas, numpy, and Plotly. The primary goals were to clean and preprocess the raw data, perform insightful analyses, and create meaningful visualizations to uncover patterns and trends in the dataset.¶


Objectives¶

1. Data Cleaning:¶

The raw Netflix dataset was loaded into a Jupyter Notebook using pandas. Data cleaning tasks included handling missing values, removing duplicates, and ensuring data consistency.

2. Data Analysis:¶

Exploratory Data Analysis (EDA) was conducted to gain a deeper understanding of the dataset. Summary statistics were calculated to describe the central tendency and range of the values.

3. Data Visualization:¶

Using the Plotly library, interactive visualizations were created to illustrate trends and patterns in the data. The charts cover content types, maturity ratings, countries, directors, genres, and release years.

4. Insights and Conclusions:¶

Interpretations and conclusions were drawn from the analyses and visualizations. The findings cover the mix of content types, popular genres, and temporal patterns.


Importing Needed Libraries¶

In [2]:
import numpy as np
import pandas as pd
import seaborn as sb
import plotly.express as px
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from plotly.offline import iplot, plot
from plotly.subplots import make_subplots

Loading Dataset¶

In [4]:
data = pd.read_csv('netflix_titles.csv')

Checking Data Info¶

1. Shape of Dataset¶

In [6]:
data.shape
Out[6]:
(8807, 12)

The data has 8807 Rows and 12 Columns¶

In [12]:
print(f"Number of Rows : {data.shape[0]} \nNumber of Columns : {data.shape[1]}")
Number of Rows : 8807 
Number of Columns : 12

2. Column Names¶

In [28]:
data.columns
Out[28]:
Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'],
      dtype='object')

3. Summary of Data¶

In [15]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB

4. Description of Data¶

  • The describe() function returns summary statistics for the dataframe.
  • Its percentiles, include, and exclude parameters are keyword arguments. (The datetime_is_numeric parameter was removed in pandas 2.0.)
In [18]:
data.describe()
Out[18]:
       release_year
count   8807.000000
mean    2014.180198
std        8.819312
min     1925.000000
25%     2013.000000
50%     2017.000000
75%     2019.000000
max     2021.000000
In [31]:
data.describe(exclude=np.number)
Out[31]:
show_id type title director cast country date_added rating duration listed_in description
count 8807 8807 8807 6173 7982 7976 8797 8803 8804 8807 8807
unique 8807 2 8807 4528 7692 748 1767 17 220 514 8775
top s1 Movie Dick Johnson Is Dead Rajiv Chilaka David Attenborough United States January 1, 2020 TV-MA 1 Season Dramas, International Movies Paranormal activity at a lush, abandoned prope...
freq 1 6131 1 19 19 2818 109 3207 1793 362 4
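
As a minimal sketch of those keyword arguments, the example below calls describe() on a small made-up frame. The rows are illustrative and are not taken from the dataset; only the column names mirror it.

```python
import numpy as np
import pandas as pd

# Toy frame with the same kinds of columns as the Netflix data (values are made up).
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "release_year": [2015, 2018, 2020, 2021],
})

# percentiles: choose which quantiles appear in the summary (the median is always included).
numeric_summary = toy.describe(percentiles=[0.1, 0.9])

# exclude=np.number: summarize only the non-numeric columns (count, unique, top, freq).
text_summary = toy.describe(exclude=np.number)

print(numeric_summary)
print(text_summary)
```

The numeric summary gains 10% and 90% rows next to the default 50% row, while the second call reports the most frequent value of each text column.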

5. Showing some random rows¶

In [64]:
data.sample(5)
Out[64]:
show_id type title director cast country date_added release_year rating duration listed_in description
3748 s3749 TV Show El desconocido Not Given Guillermo Iván, César Manjarrez, Estrella Solí... Not Given June 14, 2019 2019 TV-MA 2 Seasons Crime TV Shows, International TV Shows, Spanis... Based on real events, the fictional story of M...
7223 s7224 Movie Kon-Tiki Joachim Rønning, Espen Sandberg Pål Sverre Hagen, Anders Baasmo Christiansen, ... United Kingdom, Norway, Denmark, Germany, Sweden April 26, 2019 2012 PG-13 96 min Action & Adventure, Dramas, International Movies With five loyal friends in tow, explorer Thor ...
6389 s6390 TV Show Bure Kaam Bura Natija, Kyun Bhai Chacha Haan B... Not Given Not Given Not Given March 31, 2018 2017 TV-PG 1 Season Kids' TV A clever uncle-nephew duo solves mysteries, cr...
2810 s2811 Movie Bypass Road Naman Nitin Mukesh Neil Nitin Mukesh, Adah Sharma, Rajit Kapoor, ... India March 15, 2020 2019 TV-14 135 min International Movies, Thrillers On the night his ex-lover mysteriously dies, a...
4083 s4084 Movie Bert Kreischer: The Machine Ryan Polito Bert Kreischer United States February 22, 2019 2016 TV-MA 70 min Stand-Up Comedy From his run-in with a grizzly bear to partyin...

6. Showing the first 5 and last 5 rows¶

In [35]:
data.head()
Out[35]:
show_id type title director cast country date_added release_year rating duration listed_in description
0 s1 Movie Dick Johnson Is Dead Kirsten Johnson NaN United States September 25, 2021 2020 PG-13 90 min Documentaries As her father nears the end of his life, filmm...
1 s2 TV Show Blood & Water NaN Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban... South Africa September 24, 2021 2021 TV-MA 2 Seasons International TV Shows, TV Dramas, TV Mysteries After crossing paths at a party, a Cape Town t...
2 s3 TV Show Ganglands Julien Leclercq Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi... NaN September 24, 2021 2021 TV-MA 1 Season Crime TV Shows, International TV Shows, TV Act... To protect his family from a powerful drug lor...
3 s4 TV Show Jailbirds New Orleans NaN NaN NaN September 24, 2021 2021 TV-MA 1 Season Docuseries, Reality TV Feuds, flirtations and toilet talk go down amo...
4 s5 TV Show Kota Factory NaN Mayur More, Jitendra Kumar, Ranjan Raj, Alam K... India September 24, 2021 2021 TV-MA 2 Seasons International TV Shows, Romantic TV Shows, TV ... In a city of coaching centers known to train I...
In [36]:
data.tail()
Out[36]:
show_id type title director cast country date_added release_year rating duration listed_in description
8802 s8803 Movie Zodiac David Fincher Mark Ruffalo, Jake Gyllenhaal, Robert Downey J... United States November 20, 2019 2007 R 158 min Cult Movies, Dramas, Thrillers A political cartoonist, a crime reporter and a...
8803 s8804 TV Show Zombie Dumb NaN NaN NaN July 1, 2019 2018 TV-Y7 2 Seasons Kids' TV, Korean TV Shows, TV Comedies While living alone in a spooky town, a young g...
8804 s8805 Movie Zombieland Ruben Fleischer Jesse Eisenberg, Woody Harrelson, Emma Stone, ... United States November 1, 2019 2009 R 88 min Comedies, Horror Movies Looking to survive in a world taken over by zo...
8805 s8806 Movie Zoom Peter Hewitt Tim Allen, Courteney Cox, Chevy Chase, Kate Ma... United States January 11, 2020 2006 PG 88 min Children & Family Movies, Comedies Dragged from civilian life, a former superhero...
8806 s8807 Movie Zubaan Mozez Singh Vicky Kaushal, Sarah-Jane Dias, Raaghav Chanan... India March 2, 2019 2015 TV-14 111 min Dramas, International Movies, Music & Musicals A scrappy but poor boy worms his way into a ty...

Data Cleaning¶

  • Checking Null Values¶

In [43]:
data.isnull().sum()
Out[43]:
show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64

From above we can see the total number of NULL values in each column.¶

  • Populating NULL Values¶

Now we will populate the NULL values with 'Not Given'¶

In [52]:
data.fillna('Not Given',inplace=True)

Now you can see that there are no NULL values in the data.¶

In [53]:
data.isnull().sum()
Out[53]:
show_id         0
type            0
title           0
director        0
cast            0
country         0
date_added      0
release_year    0
rating          0
duration        0
listed_in       0
description     0
dtype: int64

All the NULL values are replaced with 'Not Given'¶

In [61]:
data.head(5)
Out[61]:
show_id type title director cast country date_added release_year rating duration listed_in description
0 s1 Movie Dick Johnson Is Dead Kirsten Johnson Not Given United States September 25, 2021 2020 PG-13 90 min Documentaries As her father nears the end of his life, filmm...
1 s2 TV Show Blood & Water Not Given Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban... South Africa September 24, 2021 2021 TV-MA 2 Seasons International TV Shows, TV Dramas, TV Mysteries After crossing paths at a party, a Cape Town t...
2 s3 TV Show Ganglands Julien Leclercq Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi... Not Given September 24, 2021 2021 TV-MA 1 Season Crime TV Shows, International TV Shows, TV Act... To protect his family from a powerful drug lor...
3 s4 TV Show Jailbirds New Orleans Not Given Not Given Not Given September 24, 2021 2021 TV-MA 1 Season Docuseries, Reality TV Feuds, flirtations and toilet talk go down amo...
4 s5 TV Show Kota Factory Not Given Mayur More, Jitendra Kumar, Ranjan Raj, Alam K... India September 24, 2021 2021 TV-MA 2 Seasons International TV Shows, Romantic TV Shows, TV ... In a city of coaching centers known to train I...
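
Filling every column with the same string also writes 'Not Given' into date_added, rating, and duration, which keeps date_added from being parsed as a date. One possible alternative, sketched here on a made-up frame, is to fill only the descriptive text columns and convert date_added to a datetime. The choice of columns and the date format string are assumptions based on the values shown above.

```python
import numpy as np
import pandas as pd

# Toy rows mimicking the dataset's columns (values are illustrative).
toy = pd.DataFrame({
    "director": ["Kirsten Johnson", np.nan],
    "country": ["United States", np.nan],
    "date_added": ["September 25, 2021", np.nan],
})

# Fill only the descriptive text columns; the dict maps column name -> fill value.
toy = toy.fillna({"director": "Not Given", "country": "Not Given"})

# Leave date_added missing and parse it, so date arithmetic stays possible.
# errors="coerce" turns unparseable entries into NaT instead of raising.
toy["date_added"] = pd.to_datetime(
    toy["date_added"].str.strip(), format="%B %d, %Y", errors="coerce"
)

print(toy.dtypes)
print(toy)
```

With this approach, isnull().sum() would still report the missing dates, which may or may not be what the analysis wants.
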
  • Checking Duplicates¶

In [71]:
data.duplicated().sum()
Out[71]:
0
In [72]:
data.duplicated(subset=['title']).sum()
Out[72]:
0

There are no duplicates in the dataset.¶

  • Removing Unneeded Columns¶

We will remove the 'description' column from the data because we don't need this column.¶

In [77]:
del data['description']

You can see that the description column is removed.¶

In [78]:
data.head()
Out[78]:
show_id type title director cast country date_added release_year rating duration listed_in
0 s1 Movie Dick Johnson Is Dead Kirsten Johnson Not Given United States September 25, 2021 2020 PG-13 90 min Documentaries
1 s2 TV Show Blood & Water Not Given Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban... South Africa September 24, 2021 2021 TV-MA 2 Seasons International TV Shows, TV Dramas, TV Mysteries
2 s3 TV Show Ganglands Julien Leclercq Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi... Not Given September 24, 2021 2021 TV-MA 1 Season Crime TV Shows, International TV Shows, TV Act...
3 s4 TV Show Jailbirds New Orleans Not Given Not Given Not Given September 24, 2021 2021 TV-MA 1 Season Docuseries, Reality TV
4 s5 TV Show Kota Factory Not Given Mayur More, Jitendra Kumar, Ranjan Raj, Alam K... India September 24, 2021 2021 TV-MA 2 Seasons International TV Shows, Romantic TV Shows, TV ...

Shape of dataset after removing the column¶

In [85]:
print(f"Number of Rows : {data.shape[0]}\nNumber of Columns : {data.shape[1]}")
Number of Rows : 8807
Number of Columns : 11

Data Analysis and Visualization¶

  • Type wise count of Titles¶

The plots below show the total number of 'Movies' and 'TV Shows'¶

In [5]:
type_count = data['type'].value_counts()
type_count
Out[5]:
type
Movie      6131
TV Show    2676
Name: count, dtype: int64
In [6]:
type_count.index
Out[6]:
Index(['Movie', 'TV Show'], dtype='object', name='type')
In [7]:
barChart = px.bar(type_count,
                  text_auto = True,
                  title='Type Wise Count of Titles',
                  labels=dict(type="Type",value="Count"),
                  color = type_count.index,
                  color_discrete_map = {'Movie' : '#189ad3', 'TV Show' : '#e9724d'})
In [8]:
barChart.show()
In [9]:
pieChart = px.pie(values = type_count, 
                  names = type_count.index,
                  title = 'Type Wise Count of Titles',
                  color = type_count.index,
                  color_discrete_map = {'Movie' : '#189ad3', 'TV Show' : '#e9724d'}
                  )
pieChart.update_traces(textinfo='label+value+percent')
pieChart.show()
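
The percentages drawn on the pie chart can also be computed directly. The sketch below rebuilds type_count from the two counts printed above instead of reading the CSV, so it runs on its own.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
type_count = pd.Series({"Movie": 6131, "TV Show": 2676}, name="count")

# Share of each type as a percentage of all titles, rounded to one decimal.
type_share = (type_count / type_count.sum() * 100).round(1)
print(type_share)
```

On the full dataframe, data['type'].value_counts(normalize=True) returns the same shares as fractions.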

  • Distribution of Content based on Maturity Ratings¶

Types of Maturity Ratings¶

In [39]:
data['rating'].unique()
Out[39]:
array(['PG-13', 'TV-MA', 'PG', 'TV-14', 'TV-PG', 'TV-Y', 'TV-Y7', 'R',
       'TV-G', 'G', 'NC-17', '74 min', '84 min', '66 min', 'NR', nan,
       'TV-Y7-FV', 'UR'], dtype=object)
In [10]:
ratings = data.groupby(data['rating']).size()
print(ratings)
rating
66 min         1
74 min         1
84 min         1
G             41
NC-17          3
NR            80
PG           287
PG-13        490
R            799
TV-14       2160
TV-G         220
TV-MA       3207
TV-PG        863
TV-Y         307
TV-Y7        334
TV-Y7-FV       6
UR             3
dtype: int64
In [16]:
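# '66 min', '74 min' and '84 min' are durations, not ratings, so drop all three
ratings = ratings.drop(['66 min', '74 min', '84 min'])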
In [17]:
ratings
Out[17]:
rating
G             41
NC-17          3
NR            80
PG           287
PG-13        490
R            799
TV-14       2160
TV-G         220
TV-MA       3207
TV-PG        863
TV-Y         307
TV-Y7        334
TV-Y7-FV       6
UR             3
dtype: int64
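
The three 'min' entries look like durations that were entered in the rating column, and data.info() reports exactly three missing values in duration. Assuming these are the same three rows (not verified here), the values could be moved back instead of being discarded. The sketch below shows the idea on a made-up frame.

```python
import numpy as np
import pandas as pd

# Toy rows: the second one has a duration stored in the rating column (illustrative values).
toy = pd.DataFrame({
    "title": ["Title A", "Title B", "Title C"],
    "rating": ["TV-MA", "74 min", "PG"],
    "duration": ["90 min", np.nan, "2 Seasons"],
})

# Rows whose rating looks like "<digits> min"; na=False treats missing ratings as non-matches.
misplaced = toy["rating"].str.fullmatch(r"\d+ min", na=False)

# Move the value into duration only where duration is actually missing,
# then blank out the rating for every misplaced row.
to_move = misplaced & toy["duration"].isna()
toy.loc[to_move, "duration"] = toy.loc[to_move, "rating"]
toy.loc[misplaced, "rating"] = np.nan

print(toy)
```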

Netflix Maturity Ratings¶

Rating     Description
G          Suitable for General Audiences
NC-17      Inappropriate for ages 17 and under
NR         Not Rated
PG         Parental Guidance suggested
PG-13      Parents strongly cautioned. May be inappropriate for ages under 13
R          Restricted. May be inappropriate for ages under 17
TV-14      Parents strongly cautioned. May not be suitable for ages under 14
TV-G       Suitable for General Audiences
TV-MA      For Mature Audiences
TV-PG      Parental Guidance suggested
TV-Y       Designed to be appropriate for all children
TV-Y7      Suitable for ages 7 and up
TV-Y7-FV   Suitable for ages 7 and up and contains Fantasy Violence
UR         Unrated
In [18]:
ratings_bar = px.bar(ratings,
                     text_auto = True,
                     title = 'Distribution of Content based on Maturity Ratings',
                     color = ratings.index,
                     labels=dict(rating="Rating",value="No. of Titles")
                    )
ratings_bar.show()

The graph above shows that TV-MA (mature audiences) is the most common rating on Netflix with 3207 titles, followed by TV-14 with 2160.¶


  • Top 10 Countries based on number of Titles¶

In [19]:
df = data['country'].value_counts()
# del df['Not Given']
top10 = df.head(10)
top10
Out[19]:
country
United States     2818
India              972
United Kingdom     419
Japan              245
South Korea        199
Canada             181
Spain              145
France             124
Mexico             110
Egypt              106
Name: count, dtype: int64
In [20]:
bar = px.bar(top10,
             text_auto = True,
             title='Top 10 Countries based on number of titles',
             labels=dict(country='Country',value="No. of Titles"),
             color = top10.index
            )
bar.show()
In [21]:
pie = px.pie(values = top10,
            names = top10.index,
            title='Top 10 Countries based on number of titles',
            color = top10.index,
            color_discrete_sequence = px.colors.sequential.Plasma)
pie.update_traces(textinfo='value+percent')
pie.show()
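
Note that value_counts() on the raw country column counts each distinct string, so a co-production such as 'United Kingdom, Norway, Denmark, Germany, Sweden' (seen in the sample rows) forms its own category and adds nothing to the single-country totals. A possible way to count every country separately is sketched below on a made-up Series.

```python
import numpy as np
import pandas as pd

# Toy country column; co-productions are comma-separated, as in the dataset (values illustrative).
country = pd.Series(
    ["United States", "India", "United Kingdom, Norway", "United States, India", np.nan],
    name="country",
)

# Split each entry into a list, give every country its own row, trim stray spaces.
per_country = country.dropna().str.split(",").explode().str.strip()

# Drop empty strings that a leading or trailing comma would leave behind, then count.
per_country = per_country[per_country != ""]
country_counts = per_country.value_counts()
print(country_counts)
```

Counting this way would change the totals in the table above, so the two results should not be mixed.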

  • Top 10 Directors in World¶

In [22]:
world_directors = data.groupby(data['director']).size()
In [23]:
sorted_dirs = world_directors.sort_values(ascending=False)
sorted_dirs
Out[23]:
director
Rajiv Chilaka                            19
Raúl Campos, Jan Suter                   18
Suhas Kadav                              16
Marcus Raboy                             16
Jay Karas                                14
                                         ..
Jos Humphrey                              1
Jose Gomez                                1
Jose Javier Reyes                         1
Joseduardo Giordano, Sergio Goyri Jr.     1
Khaled Youssef                            1
Length: 4528, dtype: int64
In [24]:
top10_dirs = sorted_dirs.head(10)
top10_dirs
Out[24]:
director
Rajiv Chilaka             19
Raúl Campos, Jan Suter    18
Suhas Kadav               16
Marcus Raboy              16
Jay Karas                 14
Cathy Garcia-Molina       13
Jay Chapman               12
Youssef Chahine           12
Martin Scorsese           12
Steven Spielberg          11
dtype: int64
In [25]:
top10_dirs_bar = px.bar(top10_dirs,
                        text_auto = True,
                        orientation = 'h',
                        title = "Top 10 Directors in World",
                        labels = dict(director='Director',value="Total Content"),
                        color = top10_dirs.index)
top10_dirs_bar.show()
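
The group, sort, and head steps above can be collapsed into a single value_counts() call, the same method already used for the country column. A small sketch with placeholder names:

```python
import pandas as pd

# Toy director column (names are placeholders, not from the dataset).
director = pd.Series(["A", "B", "A", "C", "A", "B"], name="director")

# Long form used above: group, count, sort descending, keep the top rows.
long_form = director.groupby(director).size().sort_values(ascending=False).head(2)

# Short form: value_counts() already counts and sorts in descending order.
short_form = director.value_counts().head(2)

print(short_form)
```

Both forms give the same counts, although directors with equal counts may appear in a different order.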

  • Top 10 Indian Directors¶

In [26]:
indian_content = data[(data['country']=='India') & (data['director']!= 'Not Given')]
In [27]:
indian_directors = indian_content.groupby(indian_content['director']).size()
In [28]:
sorted_ind_dir = indian_directors.sort_values(ascending=False)
In [29]:
top10_ind_dir = sorted_ind_dir.head(10)
top10_ind_dir.head(10)
Out[29]:
director
David Dhawan          9
Ram Gopal Varma       7
Imtiaz Ali            6
Sooraj R. Barjatya    6
Anees Bazmee          6
Rajkumar Santoshi     6
Anurag Kashyap        5
Prakash Jha           5
Umesh Mehra           5
Madhur Bhandarkar     5
dtype: int64
In [30]:
top_ind_dir_bar = px.bar(top10_ind_dir,
                           text_auto=True,
                           orientation='h',
                           color=top10_ind_dir.index,
                           color_discrete_sequence=px.colors.sequential.YlOrRd_r,
                           title = 'Top 10 Indian Directors',
                           labels = dict(director='Director',value='Total Content')
                          )
top_ind_dir_bar.show()
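
The filter data['country']=='India' keeps only titles whose country is exactly 'India', so co-productions that list India alongside other countries are left out. If those should be included, one option is to test for membership in the split country list. The sketch below uses a made-up frame and assumes missing directors are still NaN.

```python
import numpy as np
import pandas as pd

# Toy frame (names are placeholders).
toy = pd.DataFrame({
    "director": ["Dir A", "Dir B", "Dir C", np.nan],
    "country": ["India", "United States, India", "Indonesia", "India"],
})

# Split the country string into a list of trimmed names; missing values become empty lists.
country_lists = toy["country"].fillna("").apply(
    lambda s: [c.strip() for c in s.split(",") if c.strip()]
)

# True where "India" is one of the listed countries (exact name, so "Indonesia" does not match).
is_indian = country_lists.apply(lambda names: "India" in names)

# Use toy["director"] != "Not Given" instead of notna() if the placeholder was filled in.
indian_content = toy[is_indian & toy["director"].notna()]
print(indian_content)
```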

  • Top 10 Popular Genres¶

In [31]:
all_gen = data['listed_in'].str.split(', ',expand=True)
b = all_gen.melt(value_name='category').dropna()
top10_gen = b['category'].value_counts().head(10)
top10_gen
Out[31]:
category
International Movies        2752
Dramas                      2427
Comedies                    1674
International TV Shows      1351
Documentaries                869
Action & Adventure           859
TV Dramas                    763
Independent Movies           756
Children & Family Movies     641
Romantic Movies              616
Name: count, dtype: int64
In [32]:
top10_gen_bar = px.bar(top10_gen,
               text_auto=True,
               orientation='h',
               color=top10_gen.index,
               color_discrete_sequence=px.colors.sequential.Plasma,
               title = 'Top 10 Popular Genres',
               labels = dict(category='Genre',value='No. of Shows')
              )
top10_gen_bar.show()
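
The split and melt steps used for the genre counts can also be written with explode(), which keeps the genres in a single column. The sketch below compares both approaches on a made-up listed_in column.

```python
import pandas as pd

# Toy listed_in column in the dataset's comma-separated style (values illustrative).
listed_in = pd.Series(
    ["Dramas, International Movies", "Dramas", "Comedies, Dramas"],
    name="listed_in",
)

# Approach used above: one column per genre position, then melt into a single column.
melted = listed_in.str.split(", ", expand=True).melt(value_name="category").dropna()
counts_melt = melted["category"].value_counts()

# Alternative: keep the lists in one column and explode them into rows.
counts_explode = listed_in.str.split(", ").explode().value_counts()

print(counts_explode)
```

Both give the same counts; explode() avoids building a wide intermediate frame that is mostly empty.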

  • Year Wise Production¶

In [33]:
release_year = data.groupby(data['release_year']).size()
In [34]:
release_year
Out[34]:
release_year
1925       1
1942       2
1943       3
1944       3
1945       4
        ... 
2017    1032
2018    1147
2019    1030
2020     953
2021     592
Length: 74, dtype: int64
In [35]:
iplot(px.area(release_year,
             x=release_year.index,
             y=release_year,
             labels = dict(release_year='Release Year',y='No. of Shows')))

The plot above shows that the number of titles by release year increased sharply after 2011 and peaked in 2018 with 1147 titles.¶
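
The peak can also be read from the counts without the plot. The sketch below uses only the last five rows printed above, so it finds the peak among those years; on the full release_year Series the same two calls apply.

```python
import pandas as pd

# Last five rows of the release_year counts printed above.
release_year = pd.Series(
    [1032, 1147, 1030, 953, 592],
    index=pd.Index([2017, 2018, 2019, 2020, 2021], name="release_year"),
)

# idxmax() returns the index label (the year) of the largest value; max() returns the value.
peak_year = release_year.idxmax()
peak_count = release_year.max()
print(peak_year, peak_count)
```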


  • Trend of production since 2010¶

In [36]:
df = data[['release_year','type']]
prod2 = df.groupby(['release_year','type']).size().reset_index(name='Total Content')
prod2 = prod2[prod2['release_year']>=2010]
prod2 = prod2.rename(columns={'release_year':'Release Year'})
prod2
Out[36]:
     Release Year     type  Total Content
95           2010    Movie            154
96           2010  TV Show             40
97           2011    Movie            145
98           2011  TV Show             40
99           2012    Movie            173
100          2012  TV Show             64
101          2013    Movie            225
102          2013  TV Show             63
103          2014    Movie            264
104          2014  TV Show             88
105          2015    Movie            398
106          2015  TV Show            162
107          2016    Movie            658
108          2016  TV Show            244
109          2017    Movie            767
110          2017  TV Show            265
111          2018    Movie            767
112          2018  TV Show            380
113          2019    Movie            633
114          2019  TV Show            397
115          2020    Movie            517
116          2020  TV Show            436
117          2021    Movie            277
118          2021  TV Show            315
In [37]:
prod_trend = px.line(prod2, 
                     x="Release Year", 
                     y="Total Content", 
                     color='type',
                     title='Trend of content produced over the years on Netflix')
prod_trend.show()

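The plot above shows a decline in the number of Movies after 2018 (767 in 2018, 633 in 2019, 517 in 2020). The 2021 figure (277) likely reflects a partial year, since the most recent date_added shown earlier is September 25, 2021.¶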
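
To put numbers on the trend, the long table can be pivoted into one column per type and differenced year over year. The sketch below copies the 2017 to 2021 rows from the prod2 table above so that it runs on its own.

```python
import pandas as pd

# Rows for 2017-2021 copied from the prod2 table above.
prod2 = pd.DataFrame({
    "Release Year": [2017, 2017, 2018, 2018, 2019, 2019, 2020, 2020, 2021, 2021],
    "type": ["Movie", "TV Show"] * 5,
    "Total Content": [767, 265, 767, 380, 633, 397, 517, 436, 277, 315],
})

# One row per year, one column per type.
wide = prod2.pivot(index="Release Year", columns="type", values="Total Content")

# Year-over-year change; the first year has no predecessor, so it is NaN.
yoy = wide.diff()
print(yoy)
```

For Movies the change is 0 in 2018, then -134, -116, and -240 in the following years, while TV Shows keep growing until 2020.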


Insights and Conclusions¶

  • The dataset contains a total of 6131 Movies and 2676 TV Shows, with release years up to 2021.¶

  • TV-MA (mature audiences) is the most common maturity rating on Netflix, with 3207 titles.¶

  • The United States has the most titles on Netflix, and India is second.¶

  • David Dhawan is the top Indian director in the dataset, with 9 titles.¶

  • The top 3 genres on Netflix are International Movies, Dramas, and Comedies.¶

  • The number of titles by release year increased sharply after 2011 and peaked in 2018
    with 1147 titles.

  • The number of Movies by release year declines after 2018.¶