1. Mean Rating: Calculate the mean rating for each movie, order with the highest rating listed first, and submit the top three (along with the mean scores for the top two).
import pandas as pd
import numpy as np
calculationCSV = pd.read_csv(r'D:\BaiduNetdiskDownload\Recommender Systems 专项课程\1、Introduction to Recommender Systems\03_non-personalized-and-stereotype-based-recommenders\05_module-assessments\HW1-data.csv')
print(calculationCSV.shape)
print(list(calculationCSV.columns))
calculation = calculationCSV.drop(['User','Gender (1 =F, 0=M)'], axis=1)
print(calculation)
#mean rating
mean = calculation.mean()
print(mean.sort_values(axis = 0, ascending=False).head(3))
2. Rating Count (popularity): Count the number of ratings for each movie, order with the largest number of ratings first, and submit the top three (along with the counts for the top two).
#rating count
count = calculation.count()
print(count.sort_values(axis = 0, ascending=False).head(3))
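To see why the mean and the count can rank movies differently, here is a small sketch on made-up data (the frame `toy` and its columns A, B, C are hypothetical, not taken from HW1-data.csv). It relies on the fact that pandas `mean()` and `count()` both skip missing ratings.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings: rows are users, columns are movies, NaN = not rated.
toy = pd.DataFrame({
    "A": [5.0, 4.0, np.nan, 3.0],
    "B": [np.nan, 2.0, np.nan, 4.0],
    "C": [5.0, np.nan, np.nan, np.nan],
})

toy_mean = toy.mean()    # per-movie mean over the non-missing ratings only
toy_count = toy.count()  # per-movie number of non-missing ratings

summary = pd.DataFrame({"mean": toy_mean, "count": toy_count})
print(summary.sort_values("mean", ascending=False))
```

Movie C comes first by mean even though only one user rated it, while movie A comes first by count.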
3. % of ratings 4+ (liking): Calculate the percentage of ratings for each movie that are 4 or higher. Order with the highest percentage first, and submit the top three (along with the percentage for the top two). Notice that the three different measures of “best” reflect different priorities and give different results; this should help you see why you need to be thoughtful about what metrics you use.
#% of ratings 4+
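def ifgreaterthan4(x):
    # 1 if the rating is 4 or higher, otherwise 0 (a missing rating, NaN, also gives 0)
    if (x >= 4):
        return 1
    else:
        return 0
cols = calculation.columns
print(cols)
for col in cols:
    calculation[col] = calculation[col].apply(lambda x: ifgreaterthan4(x))
# number of 4+ ratings per movie
liked = calculation.sum()
print(liked)
# divide by the rating counts taken before the ratings were overwritten with 0/1
liking = liked / count
print(liking.sort_values(ascending=False).head(3))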
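The same percentage can also be computed without overwriting the ratings. The sketch below is one possible vectorized version on made-up data; the helper name `share_liked` and the toy frame are my own, not part of the assignment.

```python
import numpy as np
import pandas as pd

def share_liked(ratings: pd.DataFrame, threshold: float = 4.0) -> pd.Series:
    """Fraction of each movie's ratings that are >= threshold (NaN = not rated)."""
    liked = (ratings >= threshold).sum()  # NaN >= 4 is False, so unrated cells add 0
    rated = ratings.count()               # only non-missing ratings in the denominator
    return (liked / rated).sort_values(ascending=False)

# Hypothetical toy ratings.
toy = pd.DataFrame({
    "A": [5.0, 4.0, np.nan, 3.0],
    "B": [np.nan, 2.0, np.nan, 4.0],
})
result = share_liked(toy)
print(result)
```

Because the original frame is left untouched, the denominator cannot accidentally count the zeros written for missing ratings.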
4. Top movies for someone who has seen Toy Story: Calculate movies that most often occur with Movie #1: Toy Story, using the (x+y)/x method described in class. In other words, for each movie, calculate the percentage of Toy Story raters who also rated that movie. Order with the highest percentage first, and submit the top 3 (along with the correlations for the top two). Note, you will have ties - to break the ties, use the lowest-numbered movie as the higher-ranked one. In other words, if Movies 541 and 318 are tied, then 318 gets the higher rank.
#top movies for those who have seen toy story
newCalculation = calculationCSV.drop(['User','Gender (1 =F, 0=M)'], axis=1)
# keep only the users who rated Toy Story
newCalculation = newCalculation.dropna(subset=['1: Toy Story (1995)'])
nC = newCalculation.drop(['1: Toy Story (1995)'], axis=1)
# share of Toy Story raters who also rated each movie
# (ties are not broken by movie number here, so tied rows still need a manual check)
cooccurrence = nC.count() / len(nC)
print(cooccurrence.sort_values(ascending=False).head(5))
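The tie-break rule (lowest-numbered movie first) can be handled in code as well. Below is a rough sketch on made-up data; `cooccurrence_table` is a hypothetical helper, the placeholder titles are invented, and it assumes every column name starts with the movie number followed by a colon, as in "1: Toy Story (1995)".

```python
import numpy as np
import pandas as pd

def cooccurrence_table(ratings: pd.DataFrame, target: str) -> pd.DataFrame:
    """Share of `target` raters who also rated each other movie, ties by lowest ID."""
    raters = ratings[ratings[target].notna()]   # users who rated the target movie
    others = raters.drop(columns=target)
    share = others.count() / len(raters)        # raters of both / raters of target
    # Assumption: the text before the first colon is the numeric movie ID.
    movie_ids = [int(name.split(":")[0]) for name in share.index]
    table = pd.DataFrame(
        {"movie_id": movie_ids, "share": share.to_numpy()}, index=share.index
    )
    # Highest share first; among equal shares the lowest movie ID wins.
    return table.sort_values(["share", "movie_id"], ascending=[False, True])

# Hypothetical toy ratings; movies 541 and 318 end up tied.
toy = pd.DataFrame({
    "1: Toy Story (1995)": [5.0, 4.0, np.nan, 3.0],
    "541: Placeholder A": [4.0, np.nan, 5.0, 2.0],
    "318: Placeholder B": [np.nan, 5.0, 4.0, 4.0],
    "50: Placeholder C": [5.0, 4.0, np.nan, 3.0],
})
ranked = cooccurrence_table(toy, "1: Toy Story (1995)")
print(ranked)
```

Sorting on two keys avoids depending on the column order of the CSV, which may not follow the movie numbers.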
5. Correlation with Toy Story: Calculate the correlation between the vectors of ratings for Toy Story and each other movie. You can use the built-in CORREL() function. Order by the highest correlation (positive only) and submit the top 3 along with the correlation values for the top 2. Notice the differences between co-occurrence and correlation; these metrics are showing different types of relationships.
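# correlation
# Pearson correlation with Toy Story; Toy Story itself (correlation 1.0) is dropped from the ranking
toyStoryCorr = newCalculation.corr()['1: Toy Story (1995)'].drop('1: Toy Story (1995)')
print(toyStoryCorr.sort_values(ascending=False).head(3))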
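It may help to check what `corr()` does with missing ratings. As far as I understand, pandas computes each pairwise Pearson correlation only over the rows where both columns have a value. The sketch below checks that on made-up data (the column names `target`, `similar`, `opposite` are hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings with gaps.
toy = pd.DataFrame({
    "target": [5.0, 4.0, 2.0, np.nan, 1.0],
    "similar": [4.0, 5.0, 1.0, 3.0, np.nan],
    "opposite": [1.0, np.nan, 4.0, 2.0, 5.0],
})

# Correlation of every other column with "target".
corr = toy.corr()["target"].drop("target")
positive = corr[corr > 0].sort_values(ascending=False)  # positive correlations only
print(positive)

# Cross-check one value by hand, using only the rows where both movies are rated.
both = toy[["target", "similar"]].dropna()
manual = np.corrcoef(both["target"], both["similar"])[0, 1]
print(manual)
```

A movie can co-occur with the target often and still correlate negatively, which is why the two rankings differ.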
6. Mean rating difference by gender: First, recompute the mean rating for each movie separately for males and for females. Also calculate the overall mean rating (across all ratings) for males and females. Submit the two movies that have the greatest differences (one where men are most above women, and one where women are most above men) along with the differences in average. Also submit the difference in overall rating averages (female average - male average).
# mean rating difference by gender
#compute overall mean rating for males and females
cal = calculationCSV.drop(['User'], axis=1)
# per the column name, 1 = female and 0 = male
# (assumes read_csv parsed the column as numbers, so compare with 0/1 rather than '0'/'1')
male = cal[cal['Gender (1 =F, 0=M)'] == 0].copy()
print(male)
mean_male = male.mean()
print(mean_male.sort_values(ascending=False))
female = cal[cal['Gender (1 =F, 0=M)'] == 1].copy()
print(female)
mean_female = female.mean()
print(mean_female.sort_values(ascending=False))
#submit two movies that have the greatest differences
#difference in mean rating (female - male) for every movie
difference = (mean_female - mean_male).drop('Gender (1 =F, 0=M)')
#movie where women are most above men
print(difference.idxmax(), difference.max())
#movie where men are most above women
print(difference.idxmin(), difference.min())
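The per-movie gender difference can also be written with `groupby`. This is only a sketch on made-up data: the column name `gender` and the movies A and B are hypothetical, with 1 = female and 0 = male as in the homework file.

```python
import numpy as np
import pandas as pd

GENDER = "gender"  # hypothetical column name: 1 = female, 0 = male

toy = pd.DataFrame({
    GENDER: [1, 1, 0, 0],
    "A": [5.0, 4.0, 2.0, np.nan],
    "B": [2.0, np.nan, 4.0, 5.0],
})

by_gender = toy.groupby(GENDER).mean()      # one row per gender, one column per movie
diff = by_gender.loc[1] - by_gender.loc[0]  # female mean - male mean for the same movie

print("women most above men:", diff.idxmax(), diff.max())
print("men most above women:", diff.idxmin(), diff.min())
```

The subtraction is always between the two means of the same movie, so `idxmax` and `idxmin` give the two movies to submit.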
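#compute overall female average - male average
#the overall average is taken across all individual ratings, not across the per-movie means
male_ratings = male.drop(['Gender (1 =F, 0=M)'], axis=1)
female_ratings = female.drop(['Gender (1 =F, 0=M)'], axis=1)
avg_male = np.nanmean(male_ratings.to_numpy(dtype=float))
print(avg_male)
avg_female = np.nanmean(female_ratings.to_numpy(dtype=float))
print(avg_female)
print(avg_female - avg_male)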
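The phrase "across all ratings" matters here: averaging the per-movie means is not the same as averaging every individual rating, because movies with few ratings get the same weight as movies with many. A tiny sketch on made-up numbers (the frame `toy` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings: movie A has four ratings, movie B only one.
toy = pd.DataFrame({
    "A": [5.0, 5.0, 5.0, 5.0],
    "B": [1.0, np.nan, np.nan, np.nan],
})

mean_of_movie_means = toy.mean().mean()           # (5 + 1) / 2
mean_of_all_ratings = np.nanmean(toy.to_numpy())  # (5 * 4 + 1) / 5

print(mean_of_movie_means)
print(mean_of_all_ratings)
```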
7. Next, compute the % of ratings 4+ separately for males and females. You’ll be asked to submit two movies as above (largest difference in each direction). And again you’ll indicate whether men or women are more likely to rate movies 4 stars or above.
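#4+ for male and female
genderCol = 'Gender (1 =F, 0=M)'
male_ratings = male.drop([genderCol], axis=1)
female_ratings = female.drop([genderCol], axis=1)
#a missing rating compares as False and count() skips it, so only real ratings enter the ratio
liking_male = (male_ratings >= 4).sum() / male_ratings.count()
print(liking_male.sort_values(ascending=False))
liking_female = (female_ratings >= 4).sum() / female_ratings.count()
print(liking_female.sort_values(ascending=False))
#largest difference in each direction (female - male)
liking_diff = liking_female - liking_male
print(liking_diff.idxmax(), liking_diff.max())
print(liking_diff.idxmin(), liking_diff.min())
#overall share of 4+ ratings for each gender
print((male_ratings >= 4).sum().sum() / male_ratings.count().sum())
print((female_ratings >= 4).sum().sum() / female_ratings.count().sum())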
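As a cross-check, the share of 4+ ratings per gender can be computed in one pass with `groupby`. This is a sketch under the same assumptions as before: made-up data, a hypothetical `gender` column with 1 = female and 0 = male.

```python
import numpy as np
import pandas as pd

GENDER = "gender"  # hypothetical column name: 1 = female, 0 = male

toy = pd.DataFrame({
    GENDER: [1, 1, 0, 0],
    "A": [5.0, 4.0, 3.0, 2.0],
    "B": [2.0, np.nan, 4.0, 5.0],
})

movies = toy.drop(columns=GENDER)
liked = (movies >= 4).groupby(toy[GENDER]).sum()    # 4+ ratings per gender and movie
rated = movies.notna().groupby(toy[GENDER]).sum()   # actual ratings per gender and movie

share = liked / rated
diff = share.loc[1] - share.loc[0]                  # female share - male share per movie
overall = liked.sum(axis=1) / rated.sum(axis=1)     # share of 4+ over all ratings, per gender

print(diff.sort_values(ascending=False))
print(overall)
```

The `overall` series answers the last question directly: whichever gender has the larger value is more likely to rate a movie 4 stars or above.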