This repository contains an AI movie recommendation system based on the k-means clustering algorithm, exposed through Flask-RESTful APIs. An associated article, "AI Movies Recommendation System Based on K-Means Clustering Algorithm", is published on Medium. You can also download the Jupyter Notebook of the same name from this repository; it walks through the whole project step by step and covers the same material as the Medium article.
This section documents how to use the model and its APIs. The model exposes several APIs and dataset analysis methods, and it can work with both csv files and databases. By default it reads a csv file in the format of the MovieLens dataset; the sections below describe how to use other sources.
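The following Python libraries must be available in your working environment:
- Pandas version: 0.25.1
- NumPy version: 1.16.5
- Matplotlib version: 3.1.1
- Scikit-Learn version: 0.21.3
- Pickle version: 4.0 (part of the Python standard library)
- Sys version: 3.7.7 (part of the Python standard library; this is the Python interpreter version)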
To run this app, first install all required libraries, then run the file `app.py` in the root directory, which is a Flask app. APIs for training and recommendations are already designed. The app requires the data file `~/Prepairing Data/From Data/ratings.csv`. The `ratings.csv` file can be downloaded from The Movies Dataset | Kaggle. If you have your own dataset, see `dataEngineering` -> `loadUsersData()` -> csv file format for the required format. If you want to load data from a database, edit the method `dataEngineering` -> `loadUsersData()` in its allowed edit section so that it returns a DataFrame in the format described in csv file format. If you're using a fresh model, first run the training API to train the model and save its files to the directory, so that it is ready to make recommendations.
The project is organized into the following modules; their APIs are documented below:
- dataEngineering
- elbowMethod
- saveLoadFiles
- kmeansModel
- userRequestedFor
Import it as `from Modules.dataEngineering import dataEngineering`
Purpose: `dataEngineering` is a class used to prepare data for the k-means clustering model.
Create an instance of `dataEngineering` as
`YOUR_VAR_NAME = dataEngineering()`
- Arguments: None
- Attributes:
  - `users` -> Default: None | A list of user IDs in descending order.
  - `users_movies_list` -> Default: None | A list of strings, where each string contains one user's movie IDs separated by commas ",".
- `loadUsersData(from_loc)`:
  - Arguments:
    - `from_loc` -> Default: `'./Prepairing Data/From Data/filtered_ratings.csv'` | Must be a string with a valid location of the data csv file, which must follow the csv file format below.
    - csv file format -> Number of Rows: Any | Columns: `['userId', 'movieId']`, where the `'userId'` column contains user IDs and the `'movieId'` column contains the ID of a movie that the user has added to their favorites, has watched, or has interacted with in whatever way you want to base recommendations on. The `'userId'` column may contain multiple entries of the same ID.
  - Purpose: Loads users data from the given location and creates the list of user IDs found in it.
  - Attribute Updates: None
  - Associated Method: None
  - Return: A Python `list` of length 2.
    - Index 0: `True` if the data loaded successfully; `False` if any error arises.
    - Index 1:
      - If index 0 is `True`, a `dict` with the following format: `{'users_data': A Pandas DataFrame containing data loaded from from_loc, 'users_list': A list of users IDs in descending order extracted from users_data}`
      - If index 0 is `False`, a `str` containing error information.
  - Note: If you load data in any other way, edit this method to fetch your data and return it in the same format as described in csv file format. Please write your code inside the indicated area.
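The csv file format and the `[status, payload]` return convention described for `loadUsersData(from_loc)` can be illustrated with a minimal sketch. It is not the repository's implementation: the function name `load_users_data_sketch` is hypothetical, and to stay dependency-free it returns the rows as a plain list of tuples where the real method returns a Pandas DataFrame.

```python
import csv
import io


def load_users_data_sketch(csv_file):
    """Hypothetical stand-in for loadUsersData(); returns [status, payload]."""
    try:
        reader = csv.DictReader(csv_file)
        # The documented format requires exactly these two columns.
        missing = {"userId", "movieId"} - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"missing columns: {sorted(missing)}")
        # Assumption: IDs are integers, as in the MovieLens ratings files.
        rows = [(int(r["userId"]), int(r["movieId"])) for r in reader]
        # Unique user IDs in descending order, as the docs describe.
        users = sorted({user for user, _ in rows}, reverse=True)
        return [True, {"users_data": rows, "users_list": users}]
    except Exception as err:
        return [False, str(err)]


# In-memory files stand in for a csv file on disk.
sample = io.StringIO("userId,movieId\n1,10\n2,10\n1,20\n")
ok, payload = load_users_data_sketch(sample)

bad_ok, bad_payload = load_users_data_sketch(io.StringIO("user,movie\n1,10\n"))
```

Callers are expected to check index 0 before touching index 1, since index 1 is either the data `dict` or an error string.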
- `moviesListForUsers(from_loc)`:
  - Arguments:
    - `from_loc` -> Passed to the associated method `loadUsersData(from_loc)`. See the `loadUsersData(from_loc)` docs for details on the `from_loc` argument.
  - Purpose: Creates a list of strings, each containing the movie IDs of one user (obtained from users data) separated by commas ",". The list is in the same order as the list of users obtained from `loadUsersData`, which is in descending order.
  - Attribute Updates: `users`, `users_movies_list`
  - Associated Method: `loadUsersData(from_loc)`
  - Return: None
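The grouping that `moviesListForUsers` describes can be sketched as follows. This is an illustration under assumptions, not the repository's code: `movies_list_for_users_sketch` is a hypothetical name, and the input is a plain list of `(userId, movieId)` pairs instead of a DataFrame.

```python
from collections import defaultdict


def movies_list_for_users_sketch(rows):
    """rows: iterable of (userId, movieId) pairs (hypothetical input shape)."""
    movies_by_user = defaultdict(list)
    for user_id, movie_id in rows:
        movies_by_user[user_id].append(str(movie_id))
    # Users in descending order, matching the order documented for loadUsersData.
    users = sorted(movies_by_user, reverse=True)
    # One comma-separated string of movie IDs per user, in the same order as users.
    users_movies_list = [",".join(movies_by_user[u]) for u in users]
    return users, users_movies_list


users, users_movies_list = movies_list_for_users_sketch(
    [(1, 10), (2, 10), (1, 20), (3, 30)]
)
# users             -> [3, 2, 1]
# users_movies_list -> ["30", "10", "10,20"]
```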
- `prepSparseMatrix(from_loc)`:
  - Arguments:
    - `from_loc` -> Passed to the associated method `moviesListForUsers(from_loc)`. Required only if the attribute `users_movies_list` has not been updated and is still `None`. See the `loadUsersData(from_loc)` docs for details on the `from_loc` argument.
  - Purpose: Creates a sparse matrix (NumPy array) with dimensions `(Number of Users, Number of Movies)`, with value `1` if the user has the movie in their list and `0` otherwise.
  - Attribute Updates: None
  - Associated Method: `moviesListForUsers(from_loc)` -> If the attribute `users_movies_list` is `None`.
  - Return: A Python `list` of length 2.
    - Index 0: `True` if the method runs successfully.
    - Index 1: A `dict` with the following format: `{'sparse_matrix': Required sparse matrix as described in purpose, 'feature_names': It will be an array containing all movies IDs in the same order as the columns in the sparse matrix}`
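To make the shape of this matrix concrete, here is a small sketch that builds the same kind of 0/1 user-by-movie table from the comma-separated strings. It is an assumption-laden illustration: `prep_sparse_matrix_sketch` is a hypothetical name, nested lists stand in for the NumPy array, and ordering the columns by numeric movie ID is a choice made here, not something the docs specify.

```python
def prep_sparse_matrix_sketch(users_movies_list):
    """users_movies_list: one comma-separated string of movie IDs per user."""
    # Column order: every distinct movie ID, sorted numerically (assumption).
    feature_names = sorted(
        {movie for movies in users_movies_list for movie in movies.split(",")},
        key=int,
    )
    column = {movie: idx for idx, movie in enumerate(feature_names)}
    matrix = []
    for movies in users_movies_list:
        row = [0] * len(feature_names)
        for movie in movies.split(","):
            row[column[movie]] = 1  # 1 means this user has this movie
        matrix.append(row)
    return {"sparse_matrix": matrix, "feature_names": feature_names}


result = prep_sparse_matrix_sketch(["30", "10", "10,20"])
# feature_names -> ["10", "20", "30"]
# sparse_matrix -> [[0, 0, 1], [1, 0, 0], [1, 1, 0]]
```

Each row corresponds to one user and each column to one movie, so row order must match the `users` attribute for the matrix to be interpretable.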
- `showSparseMatrix(sparseMatrix, feature_names, users)`:
  - Arguments:
    - `sparseMatrix` -> A sparse matrix obtained from the `prepSparseMatrix(from_loc)` method.
    - `feature_names` -> An array of feature names obtained from the `prepSparseMatrix(from_loc)` method.
    - `users` -> An array of user IDs, as saved in the `users` attribute if it is not `None`.
  - Attribute Updates: None
  - Associated Method: None
  - Return: A pandas DataFrame presenting the sparse matrix, with user IDs as the index and movie IDs as the columns.
Import it as `from Modules.elbowMethod import elbowMethod`
Purpose: Used to analyze the optimal number of clusters for the k-means algorithm. It does not run with the app; it is intended for standalone analysis only.
Create an instance of `elbowMethod` as
`YOUR_VAR_NAME = elbowMethod(sparseMatrix)`
- Arguments:
  - `sparseMatrix`: A sparse matrix obtained from the `dataEngineering` module method `prepSparseMatrix()`.
- Attributes:
  - `sparseMatrix` -> Default: The sparse matrix given by the argument `sparseMatrix`.
  - `wcss` -> Default: `list()` | A list that will contain the WCSS (within-cluster sum of squares) values obtained from the k-means algorithm.
  - `differences` -> Default: `list()` | A list that will contain the difference between each two consecutive WCSS values.
- `run(init, upto, max_iterations = 300)`:
  - Arguments:
    - `init` -> Default: None | Initial number of clusters.
    - `upto` -> Default: None | Final number of clusters.
    - `max_iterations` -> Default: 300 | Any positive `int`; sets the maximum number of KMeans iterations during clustering.
  - Purpose: Calculates the WCSS values, and the differences between them, for each number of clusters from `init` to `upto`.
  - Attribute Updates: `sparseMatrix`, `wcss` and `differences`
  - Associated Method: None
  - Return: None
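The two quantities tracked by `run` can be illustrated without any clustering library. The sketch below computes WCSS from its definition (sum of squared distances of each point to its cluster centroid) and then the differences between consecutive values. The function names are hypothetical, the cluster labels are supplied by hand, and the sign convention (previous minus current) is an assumption; with scikit-learn, the WCSS of a fitted `KMeans` model is available as its `inertia_` attribute.

```python
def wcss_sketch(points, labels):
    """Within-cluster sum of squares for a given cluster assignment."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        # Centroid = per-coordinate mean of the cluster's points.
        centroid = [sum(axis) / len(members) for axis in zip(*members)]
        total += sum(
            (x - c) ** 2 for point in members for x, c in zip(point, centroid)
        )
    return total


def wcss_differences_sketch(wcss):
    """Drop in WCSS between each pair of consecutive cluster counts."""
    return [previous - current for previous, current in zip(wcss, wcss[1:])]


points = [(0, 0), (0, 2), (10, 0), (10, 2)]
wcss = [
    wcss_sketch(points, [0, 0, 0, 0]),  # one cluster
    wcss_sketch(points, [0, 0, 1, 1]),  # two clusters
]
differences = wcss_differences_sketch(wcss)
# wcss        -> [104.0, 4.0]
# differences -> [100.0]
```

A large drop followed by small ones is the "elbow" that suggests a reasonable number of clusters.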
- `showPlot(boundary = 500, upto_cluster = None)`:
  - Arguments:
    - `boundary` -> Default: 500 | A boundary that you want to set for the minimum WCSS value.
    - `upto_cluster` -> Default: None | Shows the plot up to a specific number of clusters, e.g. if `upto_cluster = 10`, it returns the plot for clusters 1-10 only.
  - Purpose: Shows plots of the elbow method and of the WCSS differences, for analyzing the number of clusters.
  - Attribute Updates: None
  - Associated Method: None
  - Return: Matplotlib plots.
Import it as `from Modules.saveLoadFiles import saveLoadFiles`
Purpose: To save and load local files by using the `pickle` library.
Create an instance of `saveLoadFiles` as
`YOUR_VAR_NAME = saveLoadFiles()`
- Arguments: None
- Attributes: None
- `save(filename, data)`:
  - Arguments:
    - `filename` -> Default: None | A string containing the pickle filename (without the pkl extension) to write the data to, inside the directory `~/datasets/`
    - `data` -> Default: None | The data to save in `filename`.
  - Purpose: Saves (writes) data to the pkl file.
  - Attribute Updates: None
  - Associated Method: None
  - Return: A `list` containing one of the following:
    - `[True]`: If the file was saved successfully.
    - `[False, err]`: If the file was not saved; `err` is a string containing error information.
- `load(filename)`:
  - Arguments:
    - `filename` -> Default: None | A string containing the pickle filename (without the pkl extension) to load from the directory `~/datasets/`
  - Purpose: Loads (reads) data from the pkl file.
  - Attribute Updates: None
  - Associated Method: None
  - Return: One of the following:
    - `data`: If the file was loaded successfully, the data read from `filename`.
    - `[False, err]`: If the file was not loaded; `err` is a string containing error information.
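Because `load` returns the data itself on success but a `[False, err]` list on failure, callers need to tell the two apart. The sketch below mirrors the documented return conventions with the standard `pickle` module; the function names, the explicit `directory` parameter, and the `is_load_error` helper are all hypothetical and not part of the module.

```python
import os
import pickle
import tempfile


def save_sketch(directory, filename, data):
    """Write data to <directory>/<filename>.pkl; returns [True] or [False, err]."""
    try:
        with open(os.path.join(directory, filename + ".pkl"), "wb") as handle:
            pickle.dump(data, handle)
        return [True]
    except Exception as err:
        return [False, str(err)]


def load_sketch(directory, filename):
    """Return the unpickled data, or [False, err] if reading fails."""
    try:
        with open(os.path.join(directory, filename + ".pkl"), "rb") as handle:
            return pickle.load(handle)
    except Exception as err:
        return [False, str(err)]


def is_load_error(result):
    """Hypothetical helper: recognise the [False, err] failure shape."""
    return (
        isinstance(result, list)
        and len(result) == 2
        and result[0] is False
        and isinstance(result[1], str)
    )


# A temporary directory stands in for the ~/datasets/ directory.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_sketch(tmp, "users_clusters", {"1": 0, "2": 1})
    loaded = load_sketch(tmp, "users_clusters")
    missing = load_sketch(tmp, "does_not_exist")
```

Only unpickle files you created yourself, since loading a pickle from an untrusted source can execute arbitrary code.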
- `saveClusterMoviesDataset(data)`:
  - Arguments:
    - `data` -> Default: None | The data to save at the location `~/datasets/clusters_movies_dataset.pkl`.
  - Purpose: Saves data at the location `~/datasets/clusters_movies_dataset.pkl`. It is designed to save the list of cluster movies DataFrames.
  - Attribute Updates: None
  - Associated Method: `save(filename, data)`
  - Return: A `list` containing one of the following:
    - `[True]`: If the file was saved successfully.
    - `[False, err]`: If the file was not saved; `err` is a string containing error information.
- `loadClusterMoviesDataset()`:
  - Arguments: None
  - Purpose: Loads data from the location `~/datasets/clusters_movies_dataset.pkl`. It is designed to load the list of cluster movies DataFrames.
  - Attribute Updates: None
  - Associated Method: `load(filename)`
  - Return: One of the following:
    - `data`: If the file was loaded successfully, the data read from the file.
    - `[False, err]`: If the file was not loaded; `err` is a string containing error information.
- `saveUsersClusters(data)`:
  - Arguments:
    - `data` -> Default: None | The data to save at the location `~/datasets/users_clusters.pkl`.
  - Purpose: Saves data at the location `~/datasets/users_clusters.pkl`. It is designed to save the DataFrame of users clusters.
  - Attribute Updates: None
  - Associated Method: `save(filename, data)`
  - Return: A `list` containing one of the following:
    - `[True]`: If the file was saved successfully.
    - `[False, err]`: If the file was not saved; `err` is a string containing error information.
- `loadUsersClusters()`:
  - Arguments: None
  - Purpose: Loads data from the location `~/datasets/users_clusters.pkl`. It is designed to load the DataFrame of users clusters.
  - Attribute Updates: None
  - Associated Method: `load(filename)`
  - Return: One of the following:
    - `data`: If the file was loaded successfully, the data read from the file.
    - `[False, err]`: If the file was not loaded; `err` is a string containing error information.
Import it as `from Modules.kmeansModel import kmeansModel`
Purpose: Used to build clusters of users and the movie list of each cluster, and provides methods to fix small clusters.
Inherits: This class inherits the `KMeans` and `saveLoadFiles` classes, so it has all the properties of the `KMeans` algorithm of the `sklearn` library and of the `saveLoadFiles` module.
Create an instance of `kmeansModel` as
`YOUR_VAR_NAME = kmeansModel()`
- Arguments: None
- Attributes:
  - It inherits all the attributes of the `KMeans` class of `sklearn`.
  - `users_cluster` -> Default: None | A pandas DataFrame of users clusters with structure `(Rows: The number of Users, Columns: ['userId', 'Cluster'])`.
  - `clusters_movies_df` -> Default: None | A list of pandas DataFrames, one per cluster's movie list, with the structure `[dataframe_of_cluster_1, dataframe_of_cluster_2, ..., dataframe_of_cluster_N]`, where each cluster DataFrame has the structure `(Rows: The number of movies in cluster, Columns: ['movieId', 'Counts'])` and `Counts` is the number of users in the cluster who have that movie in their list.
It inherits all the methods of `saveLoadFiles`.
- `clustersMovies(users_cluster, users_data)`:
  - Arguments:
    - `users_cluster` -> Default: None | A pandas DataFrame containing users clusters, as described in Attributes.
    - `users_data` -> Default: None | A pandas DataFrame containing users data, as described in Module: `dataEngineering` -> Method: `loadUsersData(from_loc)` -> Arguments: `from_loc` -> csv file format.
  - Purpose: Prepares a list of pandas DataFrames containing each cluster's movies, with the structure described in Attributes -> `clusters_movies_df`.
  - Attribute Updates: None
  - Associated Method: None
  - Return: A list of pandas DataFrames containing each cluster's movies, with the structure described in Attributes -> `clusters_movies_df`.
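The per-cluster `['movieId', 'Counts']` tables can be pictured with a short sketch. It uses plain Python containers in place of DataFrames, the name `clusters_movies_sketch` is hypothetical, and sorting each cluster's movies by count is an assumption made here for readability.

```python
from collections import Counter


def clusters_movies_sketch(users_cluster, users_data):
    """
    users_cluster: {userId: cluster number}
    users_data:    iterable of (userId, movieId) pairs
    Returns one list of (movieId, count) pairs per cluster, ordered by cluster number.
    """
    counts = {}
    for user_id, movie_id in users_data:
        cluster = users_cluster[user_id]
        # Count how many users of this cluster have this movie in their list.
        counts.setdefault(cluster, Counter())[movie_id] += 1
    # most_common() orders movies by descending count (assumption, see lead-in).
    return [counts[cluster].most_common() for cluster in sorted(counts)]


clusters_movies = clusters_movies_sketch(
    users_cluster={1: 0, 2: 0, 3: 1},
    users_data=[(1, 10), (1, 20), (2, 10), (3, 30)],
)
# clusters_movies -> [[(10, 2), (20, 1)], [(30, 1)]]
```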
- `fixClusters(clusters_movies_dataframes, users_cluster_dataframe, users_data, smallest_cluster_size = 11)`:
  - Arguments:
    - `clusters_movies_dataframes` -> Default: None | The list of pandas DataFrames obtained from the `clustersMovies` method.
    - `users_cluster_dataframe` -> Default: None | A pandas DataFrame with the structure and information described in Attributes -> `users_cluster`.
    - `users_data` -> Default: None | A pandas DataFrame of users details. For the structure, see Module: `dataEngineering` -> Method: `loadUsersData(from_loc)` -> Arguments: `from_loc` -> csv file format.
    - `smallest_cluster_size` -> Default: 11 | An `int` value indicating the smallest cluster size. See Purpose below.
  - Purpose: Fixes small clusters whose sizes are less than `smallest_cluster_size`. The small clusters are deleted, and the users belonging to them are moved to the other clusters that most probably contain the most relevant data and the users with the most similar taste. The clusters that receive these users are also updated with the records of the moved users.
  - Attribute Updates: None
  - Associated Method: `getMyMovies()` from `userRequestedFor`. `userRequestedFor` is a module; see its docs below in section 5.
  - Return: A `tuple` containing the following:
    - Updated and fixed `clusters_movies_dataframes`
    - Updated and fixed `users_cluster_dataframe`
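The docs do not spell out how the target cluster is chosen, so the following is only one plausible reading, stated as an assumption: each user of an undersized cluster is moved to the remaining cluster whose movies overlap most with that user's own list. The name `fix_clusters_sketch` and the dictionary inputs are hypothetical, and the sketch only reassigns users; it does not update per-cluster movie counts as the real method is documented to do.

```python
def fix_clusters_sketch(users_cluster, users_movies, smallest_cluster_size):
    """
    users_cluster: {userId: cluster number}
    users_movies:  {userId: set of movieIds}
    Returns a new {userId: cluster number} with small clusters dissolved.
    """
    members = {}
    for user_id, cluster in users_cluster.items():
        members.setdefault(cluster, []).append(user_id)

    kept = {c for c, users in members.items() if len(users) >= smallest_cluster_size}
    if not kept:
        # No cluster is large enough to absorb users; leave everything as-is.
        return dict(users_cluster)

    # Movies present in each kept cluster, computed before any user is moved.
    cluster_movies = {
        c: set().union(*(users_movies[u] for u in members[c])) for c in kept
    }

    fixed = dict(users_cluster)
    for cluster, users in members.items():
        if cluster in kept:
            continue
        for user_id in users:
            # Assumed rule: pick the kept cluster with the largest movie overlap;
            # ties go to the lowest cluster number because of sorted().
            fixed[user_id] = max(
                sorted(kept),
                key=lambda c: len(users_movies[user_id] & cluster_movies[c]),
            )
    return fixed


fixed = fix_clusters_sketch(
    users_cluster={1: 0, 2: 0, 3: 1, 4: 1, 5: 2},
    users_movies={1: {10, 20}, 2: {10}, 3: {30, 40}, 4: {40}, 5: {40, 50}},
    smallest_cluster_size=2,
)
# User 5 was alone in cluster 2 and shares movie 40 with cluster 1.
# fixed -> {1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
```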
- `run_model(sparseMatrix = None, fix_clusters = True, smallest_cluster = 6)`:
  - Arguments:
    - `sparseMatrix` -> Default: None | A sparse matrix, which can be obtained from `dataEngineering().prepSparseMatrix()`. See the `dataEngineering` module. Note: If not given, the method computes it by itself using the default location of `from_loc`. If you use a different type of data source to load data for the model, build the matrix yourself and pass it in.
    - `fix_clusters` -> Default: True | If True (the default), the `fixClusters` method is called to fix small clusters that are too small for making recommendations.
    - `smallest_cluster` -> Default: 6 | Needed only if `fix_clusters` is True. The smallest cluster size that we want.
  - Purpose: Runs the K-Means model to build the users clusters and each cluster's movie collection, based on the matrix provided in `sparseMatrix`. This method itself calls `loadUsersData()` from `dataEngineering` to load users data, as described for the method `loadUsersData()`.
  - Attribute Updates: `users_cluster` and `clusters_movies_df`.
  - Associated Method: `clustersMovies`, `fixClusters`, `prepSparseMatrix()` from `dataEngineering` and `loadUsersData()` from `dataEngineering`.
  - Return: A `list` of length 2.
    - Index 0:
      - `True`: If the model ran successfully.
      - `False`: If any error arises.
    - Index 1:
      - If index 0 is `True`: A `dict` with the following format: `{'users_cluster': users_cluster, 'clusters_movies_df': clusters_movies_df}`
      - If index 0 is `False`: A `str` containing error information.
- `saveFiles()`:
  - Arguments: None
  - Purpose: Saves the training data, after a call to `run_model`, into files at the default locations provided in the `saveLoadFiles` module.
  - Attribute Updates: None
  - Associated Method: `saveClusterMoviesDataset(data)` and `saveUsersClusters(data)`
  - Return: A `list` of length 2.
    - Index 0:
      - `True`: If the files were saved successfully.
      - `False`: Otherwise.
    - Index 1:
      - A `str` containing information about the success or the error.
Import it as `from Modules.userRequestedFor import userRequestedFor`
Purpose: Used for different types of requests on users data, e.g. getting a user's movies list, making recommendations, or updating a user's movies list.
Create an instance of `userRequestedFor` as
`YOUR_VAR_NAME = userRequestedFor(user_id, users_data, making_recommendations = False)`
- Arguments:
  - `user_id`: Default: None | The ID of the user for whom the request is being sent.
  - `users_data`: Default: None | A pandas DataFrame containing users data, as described in Module: `dataEngineering` -> Method: `loadUsersData(from_loc)` -> Arguments: `from_loc` -> csv file format.
  - `making_recommendations`: Default: False | Set it to True if the request is for one of the recommendation methods, otherwise False. If True, the object loads the trained data saved in the default directories; see Module: `saveLoadFiles` -> Methods: `loadClusterMoviesDataset()` and `loadUsersClusters()`.
- Attributes:
  - `user_cluster` -> Default: None | If `making_recommendations = True`, the cluster number of the user for whom the request is made. The associated method for this attribute is `loadUsersClusters()` from `saveLoadFiles`.
  - `movies_list` -> Default: None | If `making_recommendations = True`, a list containing the DataFrame of each cluster's movies. For further detail, see Module: `kmeansModel` -> Attributes: `clusters_movies_df`. The associated method for this attribute is `loadClusterMoviesDataset()` from `saveLoadFiles`.
  - `cluster_movies` -> Default: None | If `making_recommendations = True`, a pandas DataFrame of the movies in the user's cluster, obtained from the attribute `movies_list`.
  - `cluster_movies_list` -> Default: None | If `making_recommendations = True`, a list containing all the movie IDs in the `cluster_movies` DataFrame.
- `getMyMovies()`:
  - Arguments: None
  - Purpose: Gets the list of all movies of the user identified by the attribute `user_id`, taken from the attribute `users_data`.
  - Attribute Updates: None
  - Associated Method: None
  - Return: The list of the user's movies, as described in Purpose.
- `updatedFavouriteMoviesList(new_movie_Id)`: Should be called only if `making_recommendations = True`.
  - Arguments:
    - `new_movie_Id` -> Default: None | The ID of a new movie of the user identified by the attribute `user_id`, which we want to add to the movies DataFrame of the user's cluster.
  - Purpose: Call this method when `users_data` has been updated with a new movie entry for a user, i.e. when a user adds another movie to their favourites list (or to whatever list the recommendation analysis is based on). It updates the movies data of the user's cluster that was already obtained from KMeans.
  - Attribute Updates: `cluster_movies`, `movies_list`
  - Associated Method: `saveClusterMoviesDataset()` from `saveLoadFiles`
  - Return: None
- `recommendMostFavouriteMovies()`: Should be called only if `making_recommendations = True`.
  - Arguments: None
  - Purpose: Makes recommendations for the user identified by the attribute `user_id`, drawn from the movies data of the user's cluster and limited to movies that the user does not already have in `users_data`.
  - Attribute Updates: None
  - Associated Method: None
  - Return: A `list` of length 2.
    - Index 0: `True` if the method runs successfully; `False` if any error arises.
    - Index 1:
      - If index 0 is `True`, a `list` of movie IDs recommended for the user.
      - If index 0 is `False`, a `str` containing error information.
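The recommendation step described above can be sketched as: take the movies of the user's cluster, drop the ones the user already has, and return the rest. Ranking by `Counts` (most popular in the cluster first) is an assumption suggested by the method name, and `recommend_sketch` and its `limit` parameter are hypothetical names that do not exist in the module.

```python
def recommend_sketch(cluster_movies, my_movies, limit=None):
    """
    cluster_movies: list of (movieId, count) pairs for the user's cluster
    my_movies:      movie IDs the user already has
    Returns [True, recommendations] or [False, error message].
    """
    try:
        seen = set(my_movies)
        # Most popular movies in the cluster first (assumed ranking).
        ranked = sorted(cluster_movies, key=lambda item: item[1], reverse=True)
        picks = [movie_id for movie_id, _ in ranked if movie_id not in seen]
        return [True, picks if limit is None else picks[:limit]]
    except Exception as err:
        return [False, str(err)]


ok, recommendations = recommend_sketch(
    [(10, 5), (20, 3), (30, 8), (40, 1)], my_movies=[10]
)
# recommendations -> [30, 20, 40]

failed, message = recommend_sketch(None, my_movies=[])
# failed -> False, message holds the error text
```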