The art of becoming Sherlock Holmes: A comprehensive guide to Pandas
A patient needs a doctor, the hungry need food, a victim needs a Sherlock Holmes, and an organisation needs YOU (a data analyst), period.
The era of mining fossil fuels has long since been overtaken by the practice of data mining. In just one internet minute, loads of valuable data are generated which, if dug through carefully with a business objective in mind, can bring high profit to an organisation.
The picture on the left illustrates the huge amount of data generated per minute in 2018. Technology giants (in the picture) maintain a record of each and every activity of yours (with your consent), store it as raw data, convert that data into valuable information, and keep earning profit. But who derives such valuable insights from raw data? Did these firms get the chance to hire some kind of Sherlock Holmes?
Well, maybe YES!
So, let’s learn how to become The Sherlock (or at least a confident beginner) and detect hidden answers in raw data.
The In and Out
This tutorial teaches you the skills necessary to deliver valuable insights from data using Python’s data analysis library, Pandas. The data set used in this tutorial is taken from Kaggle.
So what’re the prerequisites?
- Mandatory — Working knowledge of Python (3.x).
- Mandatory — Jupyter notebook provided by Anaconda.
- Optional — Working knowledge of basic Python data science libraries like NumPy, SciPy, Matplotlib, and Seaborn.
What will you gain?
You will get a complete hands-on guide to the Pandas methods and attributes listed below, and learn not just the syntax but how to use them in a real scenario to perform extensive exploratory data analysis.
Note — Throughout the tutorial, you can expect hands-on questions and tryouts of Pandas functions, plus the use of methods borrowed from the optional libraries mentioned above.
Enough comments are provided along the way to understand the code, but you can always refer to the documentation for a more elaborate description.
All the code used in this tutorial is available as a single Jupyter notebook; the link is provided in the References section.
Go Panda!
Case objective
To determine what it takes for an app to get a 5-star rating on the Google Play Store.
The tutorial has been partitioned into two sections —
I. Getting the case ready — here you’ll tidy the data using basic Pandas functionality: cleaning the data and handling missing values, duplicates, and outliers.
II. Answering the mysteries — here you’ll extract information and insights using Pandas visualisation (plus visualisations from other libraries) to verify rumours and arrive at a decision.
I. Getting the case ready
Consider the data set named Google Play Store Apps, available on Kaggle. Download and extract it, which will give you two CSV files.
Next, we need to load the file (googleplaystore.csv) into the Python environment using Pandas. Think of Pandas as an alternative to MS Excel available within Python, but with more advanced features and of course great community support, not to mention the Stack Overflow community ;). Pandas provides two main objects, namely Series and DataFrame. A DataFrame holds a 2D table while a Series holds a 1D array. Since we have a CSV file to load, we’ll rely on a DataFrame.
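For intuition, here is a tiny, self-contained sketch contrasting the two objects (the app names and values are made up purely for illustration):
import pandas as pd

# A Series holds a single labelled column of values (1D)
ratings = pd.Series([4.5, 3.0, 5.0], name='Rating')

# A DataFrame holds a 2D table whose columns are Series
apps = pd.DataFrame({'App': ['Alpha', 'Beta', 'Gamma'],
                     'Rating': [4.5, 3.0, 5.0]})

type(ratings)         # <- pandas.core.series.Series
type(apps)            # <- pandas.core.frame.DataFrame
type(apps['Rating'])  # <- selecting one column gives back a Series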
1. Loading the data set into a Pandas DataFrame
Open a new Jupyter notebook, import the necessary libraries and load the googleplaystore.csv file into a DataFrame df using the read_csv function, as shown:
# Importing necessary libraries for tutorial
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Loading the data into a dataframe df
df = pd.read_csv('googleplaystore.csv')
Now, Sherlock has a case to solve. Let’s observe the raw data using df.info(memory_usage='deep'). The memory_usage='deep' argument retrieves the actual space consumed by object (alias for string) datatype columns.
df.info(memory_usage='deep')

# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 10841 entries, 0 to 10840
# Data columns (total 13 columns):
# App 10841 non-null object
# Category 10841 non-null object
# Rating 9367 non-null float64
# Reviews 10841 non-null object
# Size 10841 non-null object
# Installs 10841 non-null object
# Type 10840 non-null object
# Price 10841 non-null object
# Content Rating 10840 non-null object
# Genres 10841 non-null object
# Last Updated 10841 non-null object
# Current Ver 10833 non-null object
# Android Ver 10838 non-null object
# dtypes: float64(1), object(12)
# memory usage: 8.3 MB
We can observe that there are 10,841 rows and 13 columns; the feature names and datatypes are listed above. float64 means 64-bit floating-point values and, to point it out again, object here means the string datatype. Overall, the whole data set consumes 8.3 MB of memory.
To access just the column names, the index values, the main body of the dataframe, or the datatype of each column, one can use the following commands:
df.columns # <- retrieves all column names
df.index # <- retrieves all index values
df.values # <- retrieves main body of df
df.dtypes # <- retrieves datatype of each column
Question 1 — Rename the following columns:
FROM TO
Content Rating ContentRating
Last Updated LastUpdated
Current Ver CurrentVer
Android Ver AndroidVer
Hint: Use pandas.DataFrame.rename. You can pass the inplace argument, which applies the change to the dataframe without assigning the result back to df. Hence, with inplace=True, there is no need for df = df.rename(…). One possible answer is sketched below.
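A minimal sketch of one possible answer, assuming the column names exactly match those shown by info() above:
# One possible answer to Question 1 — renaming columns in place
df.rename(columns={'Content Rating': 'ContentRating',
                   'Last Updated': 'LastUpdated',
                   'Current Ver': 'CurrentVer',
                   'Android Ver': 'AndroidVer'},
          inplace=True)

df.columns # <- verify the new column names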
2. Data cleaning
As we have observed from info(), there is just one numeric column and all the rest are strings. However, columns such as Reviews (i.e. number of reviews), Size, Installs and Price are expected to be numeric rather than object. So let us investigate the matter.
Let’s fetch the rows that contain non-numeric characters in each of these columns -
# Fetching the entries which contain strings

df.Reviews[df.Reviews.str.contains('[^0-9]')]
10472    3.0M
Name: Reviews, dtype: object

df.Size[df.Size.str.contains('[^0-9]')].head()
0     19M
1     14M
2    8.7M
3     25M
4    2.8M
Name: Size, dtype: object

df.Installs[df.Installs.str.contains('[a-z]')]
10472    Free
Name: Installs, dtype: object

df.Price[df.Price.str.contains('[a-z]')]
10472    Everyone
Name: Price, dtype: object
Here, regex comes as a saviour, pointing us in the right direction. Observe that the column Size has almost all of its elements as strings; however, the other three columns have just one problematic row, with index 10472. So let us drop this row and convert the datatype of these columns back to numeric.
# Dropping row 10472
df.drop(10472, inplace=True)

# Changing the datatype of Reviews column back to numeric
df.Reviews = df.Reviews.astype(int)
Beware! The columns Installs and Price have a special format which needs to be fixed before you can convert them to numeric. You can observe their current format using loc or iloc. loc is used to access elements by their labels, whereas iloc is used to access elements by their integer positions.
df.loc[5:10, 'Installs']
# 5 50,000+
# 6 50,000+
# 7 1,000,000+
# 8 1,000,000+
# 9 10,000+
# 10 1,000,000+
# Name: Installs, dtype: object

df.iloc[1832:1839, 7] # <- Here, 7 is the positional index of the Price column
# 1832 $1.99
# 1833 $4.99
# 1834 $4.99
# 1835 $4.99
# 1836 $5.99
# 1837 $6.99
# 1838 $9.99
# Name: Price, dtype: object
So, let us convert the format of the Installs column from 12,345+ to 12345 and the Price column from $1.23 to 1.23.
# Installs column
df.Installs = df.Installs.str.replace(',','') # Replacing ',' with ''
df.Installs.head()
# 0 10000+
# 1 500000+
# 2 5000000+
# 3 50000000+
# 4 100000+
# Name: Installs, dtype: object

df.Installs = df.Installs.str.replace('+', '', regex=False).astype(int) # Replacing '+' with '' (as a literal, not a regex) and converting datatype back to numeric
df.Installs.head()
# 0 10000
# 1 500000
# 2 5000000
# 3 50000000
# 4 100000
# Name: Installs, dtype: int32 # <- observe the new datatype
# ---------------------------------------------------------------

# Price column
df.Price = df.Price.str.lstrip('$').astype(np.float32) # Using left strip to cut '$' and then converting the column to 32-bit float
df.iloc[1832:1839, 7]
# 1832 1.99
# 1833 4.99
# 1834 4.99
# 1835 4.99
# 1836 5.99
# 1837 6.99
# 1838 9.99
# Name: Price, dtype: float32 # <- observe the new datatype
Question 2 — Reformatting the Size column
The Size column has a mix of value formats. Convert the values from the current format to the new format as shown:
FROM                    TO
19M                     19000000
23.5k                   23500
Varies with device      Varies with device (no change)
Note — Values suffixed with M or k can also be decimals, e.g. 1.9M, so handle them accordingly. One possible approach is sketched below.
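Here is one possible approach — a sketch that assumes every Size value either ends in M, ends in k, or is exactly 'Varies with device' (the helper name convert_size is made up for illustration):
# One possible answer to Question 2 — reformatting the Size column
def convert_size(s):
    if s.endswith('M'):                    # e.g. '19M' or '1.9M' -> 19000000 / 1900000
        return int(float(s[:-1]) * 1000000)
    elif s.endswith('k'):                  # e.g. '23.5k' -> 23500
        return int(float(s[:-1]) * 1000)
    return s                               # 'Varies with device' is left unchanged

df.Size = df.Size.apply(convert_size)
df.Size.head()
Note that Size remains an object column because of the remaining 'Varies with device' strings, which matches the info() output shown next.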
And now our newbie Sherlock knows how to start on a dirty case. Here’s a glimpse of what you have achieved so far (note the changed datatypes) -
df.info()
# <class 'pandas.core.frame.DataFrame'>
# Int64Index: 10840 entries, 0 to 10840
# Data columns (total 13 columns):
# App 10840 non-null object
# Category 10840 non-null object
# Rating 9366 non-null float64
# Reviews 10840 non-null int32
# Size 10840 non-null object
# Installs 10840 non-null int32
# Type 10839 non-null object
# Price 10840 non-null float32
# ContentRating 10840 non-null object
# Genres 10840 non-null object
# LastUpdated 10840 non-null object
# CurrentVer 10832 non-null object
# AndroidVer 10838 non-null object
# dtypes: float32(1), float64(1), int32(2), object(9)
# memory usage: 1.0+ MB
3. Dealing with missing values
Now, let us observe the amount of missing data in each column and decide how to deal with it. This information can be retrieved textually as the percentage of missing values per column. However, if there are many columns, visualising it is recommended. Both ways are shown here.
actual_values = df.count() # <- counts non-missing values per col
total_values = df.shape[0] # <- shape results in (n_rows, n_cols)

# Rounding to 2 decimal places using np.round(val_to_round, 2)
miss_perct = np.round((1 - (actual_values/total_values)) * 100, 2)
miss_perct

# App               0.00
# Category 0.00
# Rating 13.60
# Reviews 0.00
# Size 0.00
# Installs 0.00
# Type 0.01
# Price 0.00
# ContentRating 0.00
# Genres 0.00
# LastUpdated 0.00
# CurrentVer 0.07
# AndroidVer 0.02
# dtype: float64
The variable miss_perct holds the percentage of missing values in each column, rounded to 2 decimal places. But wait! Do you observe something new? Try type(miss_perct). Hasn’t our Sherlock just self-discovered the Pandas Series ;)
A tryout
1. Converting the above Series values into a dataframe named df_miss with a column named percent.
# Method 1
df_miss = pd.DataFrame(miss_perct.values, index=miss_perct.index, columns=['percent'])

# Method 2
df_miss = pd.DataFrame(miss_perct, columns=['percent'])

df_miss

#                  percent
# App 0.00
# Category 0.00
# Rating 13.60
# Reviews 0.00
# Size 0.00
# Installs 0.00
# Type 0.01
# Price 0.00
# ContentRating 0.00
# Genres 0.00
# LastUpdated 0.00
# CurrentVer 0.07
# AndroidVer 0.02
2. Adding another column to the dataframe df_miss, named miss_count, which represents the number of missing values per column.
# Number of missing values per column
df_miss['miss_count'] = len(df) - df.count() # <- instead of df.shape[0], another way of finding the total dataframe length is len(df)

df_miss.head() # <- prints the top 5 rows if no number is passed

#                  percent  miss_count
# App 0.00 0
# Category 0.00 0
# Rating 13.60 1474
# Reviews 0.00 0
# Size                0.00           0

df_miss.tail(8) # <- prints the last 8 rows

#                  percent  miss_count
# Installs 0.00 0
# Type 0.01 1
# Price 0.00 0
# ContentRating 0.00 0
# Genres 0.00 0
# LastUpdated 0.00 0
# CurrentVer 0.07 8
# AndroidVer 0.02 2
Now that we have covered retrieving the percentage of missing values textually, let us gather the same information visually. We will use the df_miss dataframe and plot the percent column to visualise the percentage of missing values per column.
# Adding index values as a new column to make a tidy dataframe for input to seaborn
df_miss = df_miss.reset_index() # <- with reset_index(drop=True) the index column would be dropped instead

df_miss.head(3)

#   index    percent  miss_count
# 0 App 0.0 0
# 1 Category 0.0 0
# 2  Rating     13.6        1474

plt.figure(figsize=(8,5))
sns.barplot(x="percent", y="index", data=df_miss, palette='icefire')
sns.despine()
plt.show()
Since we have seen both ways to fetch the percentage of missing values, let us now handle them. One way is to fill them with the median if the feature (column) is quantitative (numeric), or with the mode if the feature is qualitative (categorical). Other possible ways include dropping the feature entirely if its missing-value percentage is too high, interpolating the values, and so on; a sketch of these alternatives follows.
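For reference, here is a hedged sketch of those alternative strategies; neither is applied to our dataframe, and the 50% threshold below is an arbitrary number chosen purely for illustration:
# Alternative ways of handling missing data (illustration only, not used in this tutorial)

# a) Drop columns with too many missing values:
#    dropna(axis=1, thresh=N) keeps only columns having at least N non-missing values
df_dropped = df.dropna(axis=1, thresh=int(0.5 * len(df)))

# b) Interpolate missing values of a numeric column from neighbouring rows
df_interp = df.copy()
df_interp['Rating'] = df_interp['Rating'].interpolate()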
Since only the Rating feature has 13.6% missing values and the rest have only a minute percentage, let us go with the first way, viz. filling missing values of quantitative features with the median and of qualitative features with the mode.
# Filling missing values in Quantitative feature by median
for col in df.select_dtypes('float64'):
    df.loc[:, col].fillna(df.loc[:, col].median(), inplace=True)

# Filling missing values in Qualitative features by mode
for col in df.select_dtypes('object'):
    df.loc[:, col].fillna(sp.stats.mode(df.loc[:, col].astype(str))[0][0], inplace=True)

df.info()

# <class 'pandas.core.frame.DataFrame'>
# Int64Index: 10840 entries, 0 to 10840
# Data columns (total 13 columns):
# App 10840 non-null object
# Category 10840 non-null object
# Rating 10840 non-null float64
# Reviews 10840 non-null int32
# Size 10840 non-null object
# Installs 10840 non-null int32
# Type 10840 non-null object
# Price 10840 non-null float32
# ContentRating 10840 non-null object
# Genres 10840 non-null object
# LastUpdated 10840 non-null object
# CurrentVer 10840 non-null object
# AndroidVer 10840 non-null object
# dtypes: float32(1), float64(1), int32(2), object(9)
# memory usage: 1.0+ MB
Now, we have a dataframe with no missing values as illustrated above.
4. Checking duplicate entries
Sometimes duplicate instances fill a data set; they can be handled by removing them, provided the remaining sample size isn’t too small. So, let us count the number of duplicate instances in our data -
total_duplicates = len(df[df.duplicated()])
total_duplicates
# 483

print('The data has ' + str(np.round(total_duplicates/len(df) * 100, 2)) + '% of duplicate data')
# The data has 4.46% of duplicate data
Let’s proceed with the removal of these instances to arrive at only unique rows.
df.drop_duplicates(inplace=True)
print('New shape of data set: ' + str(df.shape))
# New shape of data set: (10357, 13)
Also, let us check whether certain App records are repeated.
app_duplicates = len(df[df.App.duplicated()])
app_duplicates
# 698

print('The data has ' + str(np.round(app_duplicates/len(df) * 100, 2)) + '% of duplicate app records')
# The data has 6.74% of duplicate app records
As can be seen, several apps appear more than once, so let’s drop the duplicates and keep only unique instances.
df = df[~df.App.duplicated()].reset_index(drop=True)
df.shape
# (9659, 13)
5. Checking for outliers
Outliers are a main concern while dealing with dirty data. They are easy to detect using a box plot (or a boxenplot, for large data). Let us look for outliers in the following features — Rating, Reviews, Installs, Price and Size (replacing ‘Varies with device’ with 0).
# Selecting required features and storing them into a new dataframe
df_temp = df.loc[:, ['Rating', 'Reviews', 'Size', 'Installs', 'Price']]

# Replacing 'Varies with device' values under the Size column with 0
df_temp.Size.replace('Varies with device', 0, inplace=True)

# Normalizing values to visualize every feature in one plot
from sklearn.preprocessing import MinMaxScaler

temp = pd.melt(pd.DataFrame(MinMaxScaler().fit_transform(df_temp)))

plt.figure(figsize=(10,6))
sns.boxenplot(x='variable', y='value', data=temp, palette='Blues')
sns.stripplot(x='variable', y='value', data=temp, size=0.5, jitter=True, color="gray")
plt.xticks(np.arange(df_temp.shape[1]), df_temp.columns)
plt.yticks([])
plt.xlabel('')
plt.ylabel('')
sns.despine(left=True,trim=True)
plt.show()
An alternative to the boxenplot is the boxplot, as shown:
plt.figure(figsize=(10,6))
sns.boxplot(x='variable', y='value', data=temp, palette='Blues')
sns.stripplot(x='variable', y='value', data=temp, size=0.5, jitter=True, color="gray")
plt.xticks(np.arange(df_temp.shape[1]), df_temp.columns)
plt.yticks([])
plt.xlabel('')
plt.ylabel('')
sns.despine(left=True,trim=True)
plt.show()
Each feature carries a multitude of outliers; for this case, let us not disturb the features by removing them.
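If you ever do want to flag or remove them, one common approach (a sketch only, not applied here) is the 1.5×IQR rule:
# Flagging outliers in Reviews with the 1.5*IQR rule (illustration only — we keep all rows)
q1 = df.Reviews.quantile(0.25)
q3 = df.Reviews.quantile(0.75)
iqr = q3 - q1
outlier_mask = (df.Reviews < q1 - 1.5 * iqr) | (df.Reviews > q3 + 1.5 * iqr)
print('Rows flagged as Reviews outliers:', outlier_mask.sum())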
II. Answering the mysteries
With the data cleaned and explored, let us now bring insights out of the table.
1. Uncovering rumours
New information always comes with rumours. Suppose your manager gives you the opportunity to transfer permanently to a new country, with only two available options, say country A and country B. Assuming you have never visited either country before, your decision will most probably be based on some people’s recommendations, common statistics and known rumours.
The same goes for a data set. Say we need to find the peak hours when the maximum number of working people chat on WhatsApp between Monday and Friday. As per common understanding (essentially a rumour, hence differing from person to person), a random guess would be the late-night hours. It’s a random guess, but for decision making you still need a valid analysis before taking further action.
Therefore, let us validate some commonly held assumptions about our data set:
First, let’s figure out the relationships between the available quantitative features using a pairplot or a correlation heatmap.
You can observe from the heatmap that there is a good amount of correlation between the number of installs of an app and the number of reviews given to it. Also, if the outliers were removed, we might see good correlations among the other features too, as can be observed in the pairplot.
Here’s the code for the following two figures —
# Heatmap of correlation plot
plt.figure(figsize=(7,5))
sns.heatmap(df_temp.corr(), cmap='Blues')

plt.show()
# Pairplot
plt.figure(figsize=(7, 5))
sns.pairplot(df_temp, kind='reg', diag_kind='kde', markers='+')

plt.show()
Now, let us list down some known information and validate it —
- Division of apps by rating, i.e. extracting the number of apps per 0.5-step rating.
- Ratio of app type per app rating, i.e. how many free and paid apps are available in each 0.5-step rating.
- Ratio of rating per content type, i.e. for each content category, the ratio of 0.5-step ratings.
- Ratio of rating per top 6 genres, i.e. selecting only the top 6 genres by app count and getting the ratio of 0.5-step ratings.
- Number of apps per category.
- Number of apps updated yearly.
- Number of apps updated monthly.
- Number of apps updated in each year’s months.
Take-aways from the above figure —
- The maximum number of apps are rated 4.5 on a scale of 5 with a step size of 0.5.
- The number of free apps is higher than the number of paid apps at every rating level.
- Apps meant for Everyone lead in number of apps, followed by Teen and Mature 17+. The rating per content type is illustrated above.
- The top Genres include Tools, Entertainment and Education, each having 4.5 as the most frequent rating.
- The top three categories with the maximum number of apps are Family, Game and Tools.
- More and more apps get updated as the years pass by; hence an upward trend.
- Combining all years’ data, the maximum number of apps are updated during summer, i.e. July.
- Each year, a high update frequency is observed after the month of April.
Here is the code for the above figure.
# 1. Division of apps rating
rate = df.Rating.apply(lambda x: np.round(x * 2)/2) # Rounding to the nearest 0.5
# Retrieving the frequency of each unique rating
val, cnt = np.unique(rate, return_counts=True)
df_rate = pd.DataFrame([val, cnt]).T
df_rate.columns = ['val', 'cnt']

# 2. Ratio of app type as per app rating
df_rating_type = pd.DataFrame([rate, df.Type]).T

# 3. App rating per type of content
df_rating_content = pd.DataFrame([rate, df.ContentRating]).T

# 4. Ratio of rating per top 6 genres
df_rating_genres = pd.DataFrame([rate, df.Genres]).T
# Finding the top 6 Genres with the highest number of apps
val_g, cnt_g = np.unique(df_rating_genres.Genres, return_counts=True)
df_genres = pd.DataFrame([val_g, cnt_g]).T
df_genres.columns = ['Genres', 'Count']
df_genres.sort_values('Count', ascending=False, inplace=True)
df_genres = df_genres.iloc[:6,:].reset_index(drop=True)
# Subsetting dataframe for rows containing only these 6 genres
df_rating_genres = df_rating_genres[df_rating_genres.Genres.isin(df_genres.Genres)]

# 5. Number of apps per category
ratio_cat_app = df.groupby('Category')['App'].count().reset_index()
ratio_cat_app.sort_values('App', ascending=False, inplace=True)

# 6. Number of updates in each month, year and each year's months
# Converting the LastUpdated column to a date column
df.LastUpdated = pd.to_datetime(df.LastUpdated, format='%B %d, %Y')
df.LastUpdated.head(2)
# 0   2018-01-07
# 1   2018-01-15
# Name: LastUpdated, dtype: datetime64[ns]

val_y, cnt_y = np.unique(df.LastUpdated.dt.year, return_counts=True)
val_m, cnt_m = np.unique(df.LastUpdated.dt.month, return_counts=True)

df_date = pd.DataFrame(df.LastUpdated.astype(str).str.split('-').tolist(), columns=list('YMD'))
df_date.head(3)
#       Y   M   D
# 0  2018  01  07
# 1  2018  01  15
# 2  2018  08  01

date_temp = df_date.groupby(['M','Y']).apply(lambda x: len(x)).unstack()
date_temp.fillna(0, inplace=True)

# Generating sequence of Months
from datetime import datetime
months_seq = [datetime.strftime(datetime.strptime(str(_), "%m"), "%b") for _ in range(1,13)]

# ===============PANDAS PLOTTING===================

# 1
plt.figure(figsize=(20, 20))
sns.set_style('dark')
row_mask = 6
col_mask = 2

ax1 = plt.subplot2grid((row_mask, col_mask), (0,0))
sns.barplot(x='val', y='cnt', data=df_rate, palette='Blues', ax=ax1)
sns.despine(ax=ax1)
ax1.set_ylabel('Count')
ax1.set_xlabel('Ratings')
ax1.set_title('Division of apps rating', weight='bold')

# 2
ax2 = plt.subplot2grid((row_mask, col_mask), (0,1))
df_rating_type.groupby(['Rating', 'Type']).\
apply(lambda x: len(x)).unstack().\
plot.bar(stacked=True,
rot=0, color= ['lightblue', 'darkblue'], ax=ax2)
sns.despine(ax=ax2)
ax2.legend(loc='upper left')
ax2.set_xlabel('Rating')
ax2.set_ylabel('Count')
ax2.set_title('Ratio of app type as per app rating', weight='bold')

# 3
ax3 = plt.subplot2grid((row_mask, col_mask), (1,0))
df_rating_content.groupby(['ContentRating', 'Rating']).\
apply(lambda x: len(x)).unstack().\
plot(kind='barh', stacked=True,
rot=0, colormap='Blues', ax=ax3)
sns.despine(ax=ax3)
ax3.legend(loc='upper right', ncol=3)
ax3.set_xlabel('Rating')
ax3.set_ylabel('Content Type')
ax3.set_title('Ratio of rating per content type', weight='bold')

# 4
ax4 = plt.subplot2grid((row_mask, col_mask), (1,1))
df_rating_genres.groupby(['Genres', 'Rating']).\
apply(lambda x: len(x)).unstack().\
plot(kind='barh', stacked=True,
rot=0, colormap='Blues', ax=ax4)
sns.despine(ax=ax4)
ax4.legend(loc='lower right', ncol=3)
ax4.set_xlabel('Rating')
ax4.set_title('Ratio of rating per top 6 genres', weight='bold')

# 5
ax5 = plt.subplot2grid((row_mask, col_mask), (2,0), rowspan=2, colspan=2)
sns.barplot(x='App', y='Category', data=ratio_cat_app, palette='Blues_r', ax=ax5)
sns.despine(ax=ax5)
ax5.set_title('Number of apps per category', weight='bold')

# 6a
ax6a = plt.subplot2grid((row_mask, col_mask), (4,0))
ax6a.plot(val_y, cnt_y, '--o', linewidth=2)
sns.despine(trim=True, ax=ax6a)
ax6a.set_title('An increasing trend in yearly app update', weight='bold')

# 6b
ax6b = plt.subplot2grid((row_mask, col_mask), (4,1))
ax6b.plot(months_seq, cnt_m, '--o', linewidth=2)
ax6b.set_title('An upward rising and falling trend in monthly app update', weight='bold')

# 6c
ax6c = plt.subplot2grid((row_mask, col_mask), (5,0), colspan=2)
# Normalizing values so that all values can be visualized without any bias
pd.DataFrame(MinMaxScaler().fit_transform(date_temp), index=months_seq, columns=date_temp.columns).\
plot(kind='line', style='--o', colormap='Blues', xticks=np.arange(12), ax=ax6c)
ax6c.set_title('Number of updates (norm) in each year\'s months', weight='bold')
ax6c.legend(loc='upper left', ncol=5)
sns.despine(ax=ax6c)

plt.tight_layout()
plt.show()
Let us also bring up a visualisation of the top apps as per maximum rating, reviews, size and installs, using a wordcloud.
Here’s the code for wordcloud subplots —
from wordcloud import WordCloud
import random

# Function to create a gray wordcloud.
# Credit: https://amueller.github.io/word_cloud/auto_examples/a_new_hope.html
def grey_color_func(word, font_size, position, orientation, random_state=None,
**kwargs):
return "hsl(0, 0%%, %d%%)" % random.randint(60, 100)# Creating dictionary of features against app names
wc_rating = pd.DataFrame(df_temp.Rating).set_index(df.App).to_dict()['Rating']
wc_reviews = pd.DataFrame(df_temp.Reviews).set_index(df.App).to_dict()['Reviews']
wc_size = pd.DataFrame(df_temp.Size).set_index(df.App).to_dict()['Size']
wc_installs = pd.DataFrame(df_temp.Installs).set_index(df.App).to_dict()['Installs']

plt.figure(figsize=(14, 8))

r_c = 8

ax7 = plt.subplot2grid((r_c, r_c), (0,0), colspan=4, rowspan=4)
wc1 = WordCloud()
wc1.generate_from_frequencies(frequencies=wc_rating)
ax7.imshow(wc1.recolor(color_func=grey_color_func, random_state=3), interpolation="bilinear")
ax7.set_title('Rating', weight='bold', size=20)
ax7.axis("off")ax8 = plt.subplot2grid((r_c, r_c), (0,4), colspan=4, rowspan=4)
wc2 = WordCloud()
wc2.generate_from_frequencies(frequencies=wc_reviews)
ax8.imshow(wc2.recolor(color_func=grey_color_func, random_state=3), interpolation="bilinear")
ax8.set_title('Reviews', weight='bold', size=20)
ax8.axis("off")ax9 = plt.subplot2grid((r_c, r_c), (4,0), colspan=4, rowspan=4)
wc3 = WordCloud()
wc3.generate_from_frequencies(frequencies=wc_size)
ax9.imshow(wc3.recolor(color_func=grey_color_func, random_state=3), interpolation="bilinear")
ax9.set_title('Size', weight='bold', size=20)
ax9.axis("off")ax10 = plt.subplot2grid((r_c, r_c), (4,4), colspan=4, rowspan=4)
wc4 = WordCloud()
wc4.generate_from_frequencies(frequencies=wc_installs)
ax10.imshow(wc4.recolor(color_func=grey_color_func, random_state=3), interpolation="bilinear")
ax10.set_title('Installs', weight='bold', size=20)
ax10.axis("off")plt.show()
2. Mining insights
So far, we have learnt various attributes and methods available in Pandas and verified several commonly assumed facts. Now let us fetch the insights, i.e. previously unknown information, which adds value to further decision making.
The goal of this case is to understand —
What does it take for an app to be rated 5 stars on the Google Play Store?
So let us try to answer it.
print('Number of unique apps with 5 star rating: ', len(df[df.Rating == 5.0]))
# Number of unique apps with 5 star rating:  271
Out of these 271 apps, let us fetch the most frequent value of each major feature to deliver a decision.
df_5star = df[df.Rating == 5.0].reset_index() # Saving the index to access Size from df_temp

features = [
    df_5star.Category,
    df_5star.Reviews,
    df_temp.iloc[df_5star.iloc[:, 0], :].Size,
    df_5star.Installs,
    df_5star.Type,
    df_5star.ContentRating,
    df_5star.Genres,
    df_5star.LastUpdated,
    df_5star.AndroidVer
]

df_expect = pd.DataFrame(index=['Category', 'Reviews', 'Size', 'Installs',
                                'Type', 'Content Rating', 'Genres', 'Last Updated', 'Android Version'],
                         columns=['Values'])

cnt = 0
for fea in features:
    val_t, cnt_t = np.unique(fea, return_counts=True)
    df_expect.iloc[cnt, 0] = pd.DataFrame([val_t, cnt_t]).T.sort_values(1, ascending=False).iloc[0, 0]
    cnt += 1

df_expect

+-----------------+---------------------+
| | Values |
+-----------------+---------------------+
| Category | FAMILY |
| Reviews | 1 |
| Size | 11000000 |
| Installs | 100 |
| Type | Free |
| Content Rating | Everyone |
| Genres | Education |
| Last Updated | 2018-07-24 00:00:00 |
| Android Version | 4.1 and up |
+-----------------+---------------------+
Therefore, according to our analysis, our newbie Sherlock can point out that the above values are the keys for an app to earn a 5-star rating.
In summary, a 5-star app should be free of cost, built for everyone, and should emphasise education and family. It should also maintain at least 1 review and 100 installs, and can consume nearly 11 MB of disk space, so it needs to be lightweight. Finally, it should be frequently updated and support Android version 4.1 and up.
This completes our guide to Pandas. Congratulate yourself on completing this sweet journey. You have learnt the use of the major Pandas methods and attributes.
For further learning, refer to the official Pandas website, or join an interesting online course such as the Applied AI course, Coursera, Udemy, Udacity, etc.
References
GitHub — Jupyter notebook (holds complete code used in this tutorial)
Applied AI course, December 15th, 2018
Your comments are always welcome, as are any suggestions and feedback. You can also reach me at kumar.vivek@acm.org for any discussion. Happy learning!