StackOverflow Question-based User Clustering

Using the Python driver for the Stack Exchange API, questions are fetched for each user and stored in a database. Users are then clustered on the basis of these questions with the K-Means implementation in the scikit-learn Python library, and the top keywords are displayed for each cluster. Each cluster thus approximates a software domain in which its users are active, so the clusters reveal useful information about each user's interests.

The workflow is as follows:

Extract user IDs from Stack Overflow URLs.
Use the Py-StackExchange API wrapper to fetch all questions asked by each user.
Use NLTK for stemming and tokenizing the questions.
Create a TF-IDF matrix, treating each user's set of questions as a separate document.
Run K-Means on this TF-IDF matrix to obtain clusters.
Display the top keywords in each cluster and the users assigned to it.
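
The first step of the workflow, pulling user IDs out of profile URLs, might look like the sketch below. It assumes profile URLs of the form `https://stackoverflow.com/users/<id>/<name>`; the `extract_user_ids` helper is hypothetical and is not taken from `SOextract.py`.

```python
import re

# Assumed URL shape: .../users/<numeric id>/<optional display name>
_USER_ID = re.compile(r"stackoverflow\.com/users/(\d+)")


def extract_user_ids(urls):
    """Return the unique numeric user IDs found in the URLs, in first-seen order."""
    ids = []
    for url in urls:
        match = _USER_ID.search(url)
        if match is None:
            continue  # not a user profile URL, e.g. a question link
        user_id = int(match.group(1))
        if user_id not in ids:
            ids.append(user_id)
    return ids


if __name__ == "__main__":
    sample = [
        "https://stackoverflow.com/users/22656/jon-skeet",
        "https://stackoverflow.com/questions/1",
        "http://stackoverflow.com/users/22656",
    ]
    print(extract_user_ids(sample))
```

The resulting IDs can then be passed to the Stack Exchange API client to fetch each user's questions.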

Repository files:

StackOverflow Data
.DS_Store
README.md
SOextract.py
basic-bar.html
kmeans.py
output.txt

Nishad94/StackOverflow-Question-based-User-Clustering
