Adult Income RFClassifier

1. Proposal

Americans' income-to-expense ratio varies by income level.
What factors will allow us to accurately predict the annual income of adults in the US?
If we can predict income, we can predict expenditure.

2. Methodology

Clean the data into categorical features.
Build a classification model with the RandomForest algorithm.
Train the model to correctly categorize the income level of adults.

3. Cleaning the Data

Binary Income:
- 0 assigned to <= $50k/year
- 1 assigned to > $50k/year
Education cleaned into 3 categories
- Levels at the lower and higher extremes show little difference in income and have little representation
Marital status simplified to binary
- Little variation in income via technical status (ex: “widowed” vs. “never married” have similar income ratios)
Race & native country: not enough representation in the data to be reliable in a categorical model.
Gender and relationship show strong trends in relation to income.
- Relationship refers to the person's role relative to others in the home.
Age shows a strong correlation with income, with a tail of outlier ages of 70+.
- Removed the 1% outliers
Hours-per-week centers on an average of 40, as expected.
- < 40 hours/week not conducive to making > $50k?
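
The cleaning steps above could be sketched in pandas roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names, the category labels, the `education_tier` helper and its three groups, and the choice to trim the upper age tail are all assumptions, and the tiny inline DataFrame stands in for adult.csv.

```python
import pandas as pd

# Tiny stand-in for adult.csv; column names and category labels are assumptions.
df = pd.DataFrame({
    "age": [25, 38, 52, 90, 41, 33],
    "education": ["HS-grad", "Bachelors", "Masters", "11th",
                  "Some-college", "Doctorate"],
    "marital-status": ["Never-married", "Married-civ-spouse", "Divorced",
                       "Widowed", "Married-civ-spouse", "Never-married"],
    "income": ["<=50K", ">50K", ">50K", "<=50K", ">50K", "<=50K"],
})

# Binary income: 0 for <= $50k/year, 1 for > $50k/year.
df["income"] = (df["income"].str.strip() == ">50K").astype(int)


def education_tier(level: str) -> str:
    """Hypothetical 3-way grouping; the original cut points are not stated."""
    if level in {"Bachelors", "Masters", "Doctorate", "Prof-school"}:
        return "degree"
    if level in {"HS-grad", "Some-college", "Assoc-voc", "Assoc-acdm"}:
        return "high_school_or_some_college"
    return "below_high_school"


df["education"] = df["education"].map(education_tier)

# Binary marital status: 1 if the label starts with "Married", else 0.
df["married"] = df["marital-status"].str.startswith("Married").astype(int)

# Drop rows above the 99th age percentile (assumes the 1% outliers
# are the upper age tail).
age_cutoff = df["age"].quantile(0.99)
cleaned = df[df["age"] <= age_cutoff].drop(columns=["marital-status"])

print(cleaned)
```

On the real dataset the same pattern would apply after `pd.read_csv("adult.csv")`, with the groupings adjusted to whatever the exploratory analysis supports.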

4. Feature Identification

No single feature has an importance score above roughly 30% for predicting income.
As predicted, race had little value in predicting income from this dataset.
Top 5 features:
- Age ~ 30%
- Hours-per-week ~ 15%
- Marital Status ~ 12%
- Education ~ 12%
- Occupation ~ 12%
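
Scores like the ones listed above can be read from a fitted scikit-learn random forest through its `feature_importances_` attribute. The sketch below assumes that is how such scores would be obtained; the data is synthetic, the three feature names are placeholders, and the target rule is made up so that the "married" column carries no signal.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in features; real work would use the cleaned adult.csv columns.
X = np.column_stack([
    rng.integers(18, 70, n),  # age
    rng.integers(10, 60, n),  # hours-per-week
    rng.integers(0, 2, n),    # married (binary)
])
feature_names = ["age", "hours-per-week", "married"]

# Made-up target that depends only on age and hours, so "married" is pure noise.
y = ((X[:, 0] > 40) & (X[:, 1] > 30)).astype(int)

# 80% training / 20% testing split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Impurity-based importances; scikit-learn normalizes them to sum to 1.
importances = dict(zip(feature_names, model.feature_importances_))
for name, score in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.1%}")
```

Impurity-based importances tend to favor features with many distinct values, such as age, so they are best read as a rough ranking rather than exact shares of influence.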

5. Results of Random Forest Model

Trained on 80% of data, tested on 20%.
- Training accuracy: 85%
- Testing Accuracy: 83.1%
Recall or True Positive Rate
- TP/(TP+FN): 83%, better on the <= $50k class
Precision
- TP/(TP+FP): 82%, better on the <= $50k class
F1-score
- 2PR/(P+R): 82%
- It is the harmonic mean of precision (P) and recall (R)
Accuracy
- (TP+TN)/(TP+TN+FP+FN): Overall ~83%
- Percentage of total items classified correctly
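
As a worked illustration of how these four metrics come out of a binary confusion matrix, here is a small sketch. The counts are hypothetical and chosen for easy arithmetic; they are not the model's actual confusion matrix, and `classification_metrics` is an illustrative helper name.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute recall, precision, F1 and accuracy from confusion-matrix counts."""
    recall = tp / (tp + fn)        # share of actual positives that were found
    precision = tp / (tp + fp)     # share of predicted positives that were right
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of P and R
    accuracy = (tp + tn) / (tp + fp + fn + tn)           # share of all items correct
    return {
        "recall": recall,
        "precision": precision,
        "f1": f1,
        "accuracy": accuracy,
    }


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=60, fp=20, fn=15, tn=105)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```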

Great? Good? Bad?!
ROC-AUC Score:
- Probability that a randomly chosen positive case is ranked above a randomly chosen negative case by the classifier.
- Our model’s score: 88.4%, pretty good.
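
The ranking interpretation of ROC-AUC can be checked directly by comparing every positive/negative pair. The sketch below does that on five made-up examples; `pairwise_auc` is an illustrative helper, and counting ties as half a win is the usual convention assumed here.

```python
from itertools import product


def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p, n in product(pos, neg):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Made-up labels and classifier scores for illustration only.
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.3, 0.6, 0.7, 0.8]
auc = pairwise_auc(labels, scores)
print(f"AUC: {auc:.3f}")
```

Here five of the six positive/negative pairs are ordered correctly, so the score is 5/6. The double loop is quadratic in the number of examples, so it is only meant to show the definition, not to replace a library implementation on a full dataset.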

6. Conclusions

We can predict whether an adult in the US earns more than $50k/year with about 83% test accuracy and a ROC-AUC score of 88.4%.
- Allows us to predict expenses, savings, and taxes.
- Enables tailored marketing to demographics according to expenses.
For the future:
- Need data that better represents minorities & immigrants.
- Consider adding more income tiers, moving from binary classification to a multi-class model.

Shane-McCallum/Adult-Income-EDA-and-RF-Classifier