Commit
add xgboost link
ChenglongChen committed Jul 12, 2015
1 parent 129c5cd commit 32e3bb3
Showing 6 changed files with 30 additions and 27 deletions.
31 changes: 14 additions & 17 deletions Doc/Kaggle_CrowdFlower_ChenglongChen.log
@@ -1,4 +1,4 @@
This is pdfTeX, Version 3.1415926-2.3-1.40.12 (MiKTeX 2.9) (preloaded format=pdflatex 2013.11.4) 12 JUL 2015 19:01
This is pdfTeX, Version 3.1415926-2.3-1.40.12 (MiKTeX 2.9) (preloaded format=pdflatex 2013.11.4) 12 JUL 2015 21:56
entering extended mode
**Kaggle_CrowdFlower_ChenglongChen.tex
(F:\CrowdFolwer\cleanup\Doc\Kaggle_CrowdFlower_ChenglongChen.tex
@@ -799,7 +799,7 @@ Overfull \hbox (18.3055pt too wide) in paragraph at lines 336--337
[]

[11] [12]
Overfull \hbox (6.46352pt too wide) in paragraph at lines 394--395
Overfull \hbox (6.46352pt too wide) in paragraph at lines 392--393
\OT1/cmr/m/n/10.95 Since the rel-e-vance s-core is in $\OMS/cmsy/m/n/10.95 f\OT
1/cmr/m/n/10.95 1\OML/cmm/m/it/10.95 ; \OT1/cmr/m/n/10.95 2\OML/cmm/m/it/10.95
; \OT1/cmr/m/n/10.95 3\OML/cmm/m/it/10.95 ; \OT1/cmr/m/n/10.95 4\OMS/cmsy/m/n/1
@@ -808,13 +808,13 @@ tion
[]


Overfull \hbox (5.09244pt too wide) in paragraph at lines 403--404
Overfull \hbox (5.09244pt too wide) in paragraph at lines 401--402
[]\OT1/cmr/m/n/10.95 This al-so ap-plies to One-Against-All (OAA) clas-si-fi-ca
-tion, e.g., \OT1/cmtt/m/n/10.95 LogisticRegression
[]

[13]
Overfull \hbox (6.27373pt too wide) in paragraph at lines 423--424
Overfull \hbox (6.27373pt too wide) in paragraph at lines 421--422
\OT1/cmr/m/n/10.95 ing class prob-a-bil-i-ties.[][][] The ob-jec-tive is in fil
e \OT1/cmtt/m/n/10.95 ./Code/Model/utils.py\OT1/cmr/m/n/10.95 : \OT1/cmtt/m/n/1
0.95 softkappaObj\OT1/cmr/m/n/10.95 .
@@ -823,37 +823,37 @@ e \OT1/cmtt/m/n/10.95 ./Code/Model/utils.py\OT1/cmr/m/n/10.95 : \OT1/cmtt/m/n/1
[14] [15]
PGFPlots: reading {35lb_subs.txt}
[16]
Overfull \hbox (0.73491pt too wide) in paragraph at lines 545--546
Overfull \hbox (0.73491pt too wide) in paragraph at lines 543--544
[]\OT1/cmr/bx/n/10.95 combine[]feat[][LSA[]and[]stats[]feat[]Jun09][][Low].py\O
T1/cmr/m/n/10.95 : This file gen-er-ates one
[]

[17]
Overfull \hbox (1.45724pt too wide) in paragraph at lines 572--573
Overfull \hbox (1.45724pt too wide) in paragraph at lines 570--571
\OT1/cmr/m/n/10.95 pa. It is adopt-ed from []$\OT1/cmtt/m/n/10.95 https : / / g
ithub . com / benhamner / Metrics / tree / master /$
[]

[18]
Overfull \hbox (32.26485pt too wide) in paragraph at lines 593--594
Overfull \hbox (32.26485pt too wide) in paragraph at lines 591--592
[]\OT1/cmr/m/n/10.95 XGBoost-0.4.0 (Win-dows Ex-e-cutable, []$\OT1/cmtt/m/n/10.
95 https : / / github . com / dmlc / XGBoost / releases /$
[]


Overfull \hbox (11.5833pt too wide) in paragraph at lines 594--595
Overfull \hbox (11.5833pt too wide) in paragraph at lines 592--593
[]\OT1/cmr/m/n/10.95 ml[]metrics ([]$\OT1/cmtt/m/n/10.95 https : / / github . c
om / benhamner / Metrics / tree / master / Python / ml _$
[]


Overfull \hbox (12.11351pt too wide) in paragraph at lines 598--599
Overfull \hbox (12.11351pt too wide) in paragraph at lines 596--597
[]\OT1/cmr/m/n/10.95 rgf1.2 (Win-dows Ex-e-cutable, []$\OT1/cmtt/m/n/10.95 http
: / / stat . rutgers . edu / home / tzhang / software /$
[]


Overfull \hbox (2.8642pt too wide) in paragraph at lines 602--602
Overfull \hbox (2.8642pt too wide) in paragraph at lines 600--600
[]\OT1/cmr/bx/n/17.28 How To Gen-er-ate the So-lu-tion (a-ka README
[]

@@ -882,12 +882,12 @@ ication / forums / t / 13863 /$
[]

[20])
Package atveryend Info: Empty hook `BeforeClearDocument' on input line 637.
Package atveryend Info: Empty hook `BeforeClearDocument' on input line 635.
[21]
Package atveryend Info: Empty hook `AfterLastShipout' on input line 637.
Package atveryend Info: Empty hook `AfterLastShipout' on input line 635.
(F:\CrowdFolwer\cleanup\Doc\Kaggle_CrowdFlower_ChenglongChen.aux)
Package atveryend Info: Executing hook `AtVeryEndDocument' on input line 637.
Package atveryend Info: Executing hook `AtEndAfterFileList' on input line 637.
Package atveryend Info: Executing hook `AtVeryEndDocument' on input line 635.
Package atveryend Info: Executing hook `AtEndAfterFileList' on input line 635.


Package rerunfilecheck Warning: File `Kaggle_CrowdFlower_ChenglongChen.out' has
@@ -899,9 +899,6 @@ Package rerunfilecheck Info: Checksums for `Kaggle_CrowdFlower_ChenglongChen.ou
t':
(rerunfilecheck) Before: A531A6FD907444ED35169CFAC17839EF;10184
(rerunfilecheck) After: 5835D4BCA4B1BF337073CA56FA26B04F;3336.

LaTeX Warning: Label(s) may have changed. Rerun to get cross-references right.

)
Here is how much of TeX's memory you used:
21906 strings out of 495354
Binary file modified Doc/Kaggle_CrowdFlower_ChenglongChen.pdf
Binary file not shown.
Binary file modified Doc/Kaggle_CrowdFlower_ChenglongChen.synctex.gz
Binary file not shown.
10 changes: 4 additions & 6 deletions Doc/Kaggle_CrowdFlower_ChenglongChen.tex
@@ -300,7 +300,7 @@ \subsubsection{Basic TF-IDF Features}
\item \textbf{Cosine Similarity Based on SVD Reduced Features}\\
We computed cosine similarity based on SVD reduced features (using common SVD).
\item \textbf{Statistical Cosine Similarity Based on SVD Reduced Features}\\
We computed statistical cosine similarity based on SVD Reduced Features.
We computed statistical cosine similarity based on SVD reduced features.
\end{itemize}
\subsubsection{Cooccurrence TF-IDF Features}
\label{subsubsec:Cooccurrence_TFIDF_Features}
@@ -341,7 +341,7 @@ \subsubsection{Query Id}
one-hot encoding of the \texttt{query} (generated via \texttt{genFeat\_id\_feat.py})

\subsection{Feature Selection}
For feature selection, we adopted the idea of ``untuned modeling'' as used in Marios Michailidis and Gert Jacobusse's 2nd place solution \cite{malware_2nd} to Microsoft Malware Classification Challenge. The same model is always used to perform cross validation on a (combined) set of features to test whether it improves the
For feature selection, we adopted the idea of ``untuned modeling'' as used in Marios Michailidis and Gert Jacobusse's 2nd place solution \cite{malware_2nd} to Microsoft Malware Classification Challenge on Kaggle. The same model is always used to perform cross validation on a (combined) set of features to test whether it improves the
score compared to earlier feature sets. For features of high dimension (denoted as ``High''), e.g., feature set including raw TF-IDF features, we used XGBoost with linear booster (MSE objective); otherwise, we used \texttt{ExtraTreesRegressor} in Sklearn for features of low dimension (denoted as ``Low'').

Note that with ensemble selection, one can train a model library with various feature sets and rely on ensemble selection to pick out the best ensemble within the model library. However, feature selection is still helpful. Using the above feature selection method, one can first identify some (possibly) well-performing feature sets, and then train the model library with them. This helps to reduce the computation burden to some extent.
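A minimal sketch of this kind of untuned-modeling feature selection might look as follows; the feature blocks, model settings, and scoring below are illustrative assumptions, not the project's actual ./Feat or ./Model code.

```python
# Sketch of "untuned modeling" feature selection: one fixed model is
# cross-validated on each candidate (combined) feature set, and a block is
# kept only if it improves the CV score. All names here are hypothetical.
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import cross_val_score

def cv_mse(X, y):
    # Fixed, untuned model for low-dimensional ("Low") feature sets.
    model = ExtraTreesRegressor(n_estimators=300, random_state=2015)
    return -cross_val_score(model, X, y,
                            scoring="neg_mean_squared_error", cv=3).mean()

def select_feature_blocks(candidate_blocks, y):
    """candidate_blocks: list of (name, 2-D array); keep blocks that help."""
    kept, best = [], np.inf
    for name, block in candidate_blocks:
        trial = np.hstack([b for _, b in kept] + [block])
        score = cv_mse(trial, y)
        if score < best:  # lower MSE is better
            best, kept = score, kept + [(name, block)]
    return [name for name, _ in kept]
```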
@@ -353,9 +353,7 @@ \subsubsection{The Split}
\subsubsection{Following the Same Logic}
Since this is an NLP-related competition, it's common to use TF-IDF features. We have seen a few people fitting a TF-IDF transformer on the stacked training and testing set, and then transforming the training and testing set, respectively. They then use such feature vectors (\textbf{they are fixed}) for cross validation or grid search for the best parameters. They call such a method semi-supervised learning. In our opinion, if one takes such a method, he should refit the transformer using only the whole training set in CV, following the same logic.

On other hand, if one fit the transformer on the training set (for the final model building), then in CV, he should also refit the transformer on the training fold only. This is the method we used. Not only for TF-IDF transformer, but also for other transformations, e.g., normalization and SVD, one should make sure he is following the same logic in both CV and the final model building.


On the other hand, if one fit the transformer on the training set (for the final model building), then in CV, he should also refit the transformer on the training fold only. This is the method we used. Not only for TF-IDF transformer, but also for other transformations, e.g., normalization and SVD, one should make sure he is following the same logic in both CV and the final model building.
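A hedged sketch of following the same logic in CV is given below: a generic sklearn pipeline with assumed parameters (not the repository's actual CV scripts) that refits TF-IDF, SVD, and scaling on the training fold only.

```python
# Sketch: every transformer (TF-IDF, SVD, scaler) is refitted on the
# training fold only and then applied to the validation fold, mirroring
# the final model building. Parameters and fold count are assumptions.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

def cv_same_logic(texts, y, n_splits=3, seed=2015):
    texts, y = list(texts), np.asarray(y, dtype=float)
    scores = []
    for tr, va in KFold(n_splits=n_splits, shuffle=True, random_state=seed).split(texts):
        pipe = make_pipeline(
            TfidfVectorizer(ngram_range=(1, 3), min_df=3),
            TruncatedSVD(n_components=100, random_state=seed),
            StandardScaler(),
            Ridge(alpha=1.0),
        )
        # Fit on the training fold only; the validation fold is held out,
        # exactly as the test set is held out for the final model.
        pipe.fit([texts[i] for i in tr], y[tr])
        pred = pipe.predict([texts[i] for i in va])
        scores.append(mean_squared_error(y[va], pred))
    return float(np.mean(scores))
```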

\begin{table}[t]
\centering
@@ -603,7 +601,7 @@ \section{How To Generate the Solution (aka README file)}
\begin{enumerate}
\item download data from the competition website and put all the data into folder \texttt{./Data}.
\item run \texttt{python ./Feat/run\_all.py} to generate feature set. This will take a few hours.
\item run \texttt{python ./Model/generate\_best\_single\_model.py} to generate the best single model submission. In my experience, it only takes a few trials to generate model of best performance or similar performance. See the training log in \\ \texttt{./Output/Log/[Pre@solution]\_[Feat@svd100\_and\_bow\_Jun27]\\
\item run \texttt{python ./Model/generate\_best\_single\_model.py} to generate the best single model submission. In our experience, it only takes a few trials to generate model of best performance or similar performance. See the training log in \\ \texttt{./Output/Log/[Pre@solution]\_[Feat@svd100\_and\_bow\_Jun27]\\
\_[Model@reg\_xgb\_linear]\_hyperopt.log} for example.
\item run \texttt{python ./Model/generate\_model\_library.py} to generate model library. This is quite time consuming. \textbf{But you don't have to wait for this script to finish: you can run the next step once you have some models trained.}
\item run \texttt{python ./Model/generate\_ensemble\_submission.py} to generate submission via ensemble selection.
14 changes: 10 additions & 4 deletions Doc/Kaggle_CrowdFlower_ChenglongChen.tex.bak
@@ -57,7 +57,7 @@ Our solution consisted of two parts: feature engineering and model ensembling. W
\item distance features
\item TF-IDF features
\end{itemize}
Before generating features, we have found that it's helpful to process the text of the data with spelling correction, synonym replacement, and stemming. Model ensembling consisted of two main steps, Firstly, we trained model library using different models, different parameter settings, and different subsets of the features. Secondly, we generated ensemble submission from the model library predictions using bagged ensemble selection. Performance was estimated using cross validation within the training set. No external data sources were used in our winning submission.
Before generating features, we have found that it's helpful to process the text of the data with spelling correction, synonym replacement, and stemming. Model ensembling consisted of two main steps, Firstly, we trained model library using different models, different parameter settings, and different subsets of the features. Secondly, we generated ensemble submission from the model library predictions using bagged ensemble selection. Performance was estimated using cross validation within the training set. No external data sources were used in our winning submission. The flowchart of our method is shown in Figure \ref{fig:Flowchart}.
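A rough sketch of the bagged greedy ensemble selection step is shown below. It is simplified: a generic score function and an in-memory library of validation predictions are assumed, whereas the real implementation under ./Model optimizes quadratic weighted kappa.

```python
# Sketch of bagged greedy ensemble selection over a model library's
# validation predictions. preds maps a (hypothetical) model name to a 1-D
# numpy array of predictions; y is the corresponding target array.
import random
import numpy as np

def greedy_selection(preds, y, score_fn, n_iter=50):
    ensemble, blend = [], np.zeros_like(y, dtype=float)
    for _ in range(n_iter):
        best_name, best_score = None, None
        for name, p in preds.items():
            # Candidate blend if this model were added (with replacement).
            cand = (blend * len(ensemble) + p) / (len(ensemble) + 1)
            s = score_fn(y, cand)
            if best_score is None or s > best_score:  # higher is better
                best_name, best_score = name, s
        ensemble.append(best_name)
        blend = (blend * (len(ensemble) - 1) + preds[best_name]) / len(ensemble)
    return ensemble

def bagged_selection(preds, y, score_fn, n_bags=10, frac=0.7, seed=2015):
    random.seed(seed)
    names, picks = list(preds), []
    for _ in range(n_bags):
        subset = {n: preds[n] for n in random.sample(names, int(frac * len(names)))}
        picks.extend(greedy_selection(subset, y, score_fn))
    # A model's final weight is how often it was selected across bags.
    return {n: picks.count(n) / len(picks) for n in set(picks)}
```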

The best single model we have obtained during the competition was an XGBoost model with linear booster of Public LB score \textbf{0.69322} and Private LB score \textbf{0.70768}. Our final winning submission was a median ensemble of 35 best Public LB submissions. This submission scored \textbf{0.70807} on Public LB (our second best Public LB score) and \textbf{0.72189} on Private LB. \footnote{The best Public LB score was \textbf{0.70849} with corresponding Private LB score \textbf{0.72134}. It's a mean ensemble version of those 35 LB submissions.}
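The final median ensembling of submission files could be sketched roughly as below; the file pattern and column names are assumptions about the submission format, not the actual script.

```python
# Sketch: median-ensemble a set of submission files. Predictions are
# combined sample-wise via the median and rounded back to the {1,2,3,4}
# relevance labels. Paths and column names are hypothetical.
import glob
import numpy as np
import pandas as pd

def median_ensemble(pattern="./Output/Subm/*.csv", out="median_ensemble.csv"):
    subs = [pd.read_csv(f) for f in sorted(glob.glob(pattern))]
    ids = subs[0]["id"]
    preds = np.vstack([s["prediction"].values for s in subs])  # (n_subs, n_samples)
    final = np.clip(np.round(np.median(preds, axis=0)), 1, 4).astype(int)
    pd.DataFrame({"id": ids, "prediction": final}).to_csv(out, index=False)
```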

@@ -70,6 +70,12 @@ The best single model we have obtained during the competition was an XGBoost mod
\end{tikzpicture}
\end{figure}
\end{comment}
\begin{figure}[!htb]
\centering
\includegraphics[width=0.9\textwidth]{./FlowChart.pdf}
\caption{The flowchart of our method.}
\label{fig:Flowchart}
\end{figure}

\section{Preprocessing}
A few steps were performed to clean up the text.
@@ -294,7 +300,7 @@ We fit a SVD transformer for TF-IDF vectors of $\{q_i, t_i, d_i\}$, separately.
\item \textbf{Cosine Similarity Based on SVD Reduced Features}\\
We computed cosine similarity based on SVD reduced features (using common SVD).
\item \textbf{Statistical Cosine Similarity Based on SVD Reduced Features}\\
We computed statistical cosine similarity based on SVD Reduced Features.
We computed statistical cosine similarity based on SVD reduced features.
\end{itemize}
\subsubsection{Cooccurrence TF-IDF Features}
\label{subsubsec:Cooccurrence_TFIDF_Features}
Expand Down Expand Up @@ -455,8 +461,8 @@ Package & \multicolumn{2}{c|}{Model} & Feature & Weighting\\
& & Softkappa & &\\ \cline{3-3}
\hline
\multirow{7}*{Sklearn} & \multicolumn{2}{c|}{\texttt{GradientBoostingRegressor}} & Low & Yes\\ \cline{2-5}
& \multicolumn{2}{c|}{\texttt{RandomForestRegressor}} & Low & Yes\\ \cline{2-5}
& \multicolumn{2}{c|}{\texttt{ExtraTreesRegressor}} & Low & Yes\\ \cline{2-5}
& \multicolumn{2}{c|}{\texttt{ExtraTreesRegressor}} & Low & Yes\\ \cline{2-5}
& \multicolumn{2}{c|}{\texttt{RandomForestRegressor}} & Low & Yes\\ \cline{2-5}
& \multicolumn{2}{c|}{\texttt{SVR}} & Low & Yes\\ \cline{2-5}
& \multicolumn{2}{c|}{\texttt{Ridge}} & High/Low & Yes\\ \cline{2-5}
& \multicolumn{2}{c|}{\texttt{Lasso}} & High/Low & No\\ \cline{2-5}
2 changes: 2 additions & 0 deletions README.md
@@ -3,6 +3,8 @@

1st Place Solution for Search Results Relevance Competition on Kaggle (https://www.kaggle.com/c/crowdflower-search-relevance)

The best single model we have obtained during the competition was an [XGBoost](https://github.com/dmlc/xgboost) model with linear booster of Public LB score **0.69322** and Private LB score **0.70768**. Our final winning submission was a median ensemble of 35 best Public LB submissions. This submission scored **0.70807** on Public LB and **0.72189** on Private LB.
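For reference, a minimal sketch of an XGBoost linear-booster regressor is given below; the regularization values, rounds, and data splits are illustrative assumptions, not the tuned settings behind the scores above.

```python
# Minimal sketch of an XGBoost model with the linear booster; parameters
# and splits here are illustrative, not the competition settings.
import xgboost as xgb

def train_linear_booster(X_train, y_train, X_valid, y_valid):
    params = {
        "booster": "gblinear",
        "objective": "reg:linear",  # plain regression on the relevance score
        "lambda": 5.0,              # L2 regularization
        "alpha": 0.5,               # L1 regularization
    }
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dvalid = xgb.DMatrix(X_valid, label=y_valid)
    return xgb.train(params, dtrain, num_boost_round=500,
                     evals=[(dtrain, "train"), (dvalid, "valid")],
                     early_stopping_rounds=20)
```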

See `./Doc/Kaggle_CrowdFlower_ChenglongChen.pdf` for documentation.

## Instruction
