Commit
add xgboost link
ChenglongChen committed Jul 12, 2015
1 parent 129c5cd commit 32e3bb3
Showing 6 changed files with 30 additions and 27 deletions.
31 changes: 14 additions & 17 deletions Doc/Kaggle_CrowdFlower_ChenglongChen.log
@@ -1,4 +1,4 @@
This is pdfTeX, Version 3.1415926-2.3-1.40.12 (MiKTeX 2.9) (preloaded format=pdflatex 2013.11.4) 12 JUL 2015 19:01
This is pdfTeX, Version 3.1415926-2.3-1.40.12 (MiKTeX 2.9) (preloaded format=pdflatex 2013.11.4) 12 JUL 2015 21:56
entering extended mode
**Kaggle_CrowdFlower_ChenglongChen.tex
(F:\CrowdFolwer\cleanup\Doc\Kaggle_CrowdFlower_ChenglongChen.tex
@@ -799,7 +799,7 @@ Overfull \hbox (18.3055pt too wide) in paragraph at lines 336--337
[]

[11] [12]
Overfull \hbox (6.46352pt too wide) in paragraph at lines 394--395
Overfull \hbox (6.46352pt too wide) in paragraph at lines 392--393
\OT1/cmr/m/n/10.95 Since the rel-e-vance s-core is in $\OMS/cmsy/m/n/10.95 f\OT
1/cmr/m/n/10.95 1\OML/cmm/m/it/10.95 ; \OT1/cmr/m/n/10.95 2\OML/cmm/m/it/10.95
; \OT1/cmr/m/n/10.95 3\OML/cmm/m/it/10.95 ; \OT1/cmr/m/n/10.95 4\OMS/cmsy/m/n/1
@@ -808,13 +808,13 @@ tion
[]


Overfull \hbox (5.09244pt too wide) in paragraph at lines 403--404
Overfull \hbox (5.09244pt too wide) in paragraph at lines 401--402
[]\OT1/cmr/m/n/10.95 This al-so ap-plies to One-Against-All (OAA) clas-si-fi-ca
-tion, e.g., \OT1/cmtt/m/n/10.95 LogisticRegression
[]

[13]
Overfull \hbox (6.27373pt too wide) in paragraph at lines 423--424
Overfull \hbox (6.27373pt too wide) in paragraph at lines 421--422
\OT1/cmr/m/n/10.95 ing class prob-a-bil-i-ties.[][][] The ob-jec-tive is in fil
e \OT1/cmtt/m/n/10.95 ./Code/Model/utils.py\OT1/cmr/m/n/10.95 : \OT1/cmtt/m/n/1
0.95 softkappaObj\OT1/cmr/m/n/10.95 .
@@ -823,37 +823,37 @@ e \OT1/cmtt/m/n/10.95 ./Code/Model/utils.py\OT1/cmr/m/n/10.95 : \OT1/cmtt/m/n/1
[14] [15]
PGFPlots: reading {35lb_subs.txt}
[16]
Overfull \hbox (0.73491pt too wide) in paragraph at lines 545--546
Overfull \hbox (0.73491pt too wide) in paragraph at lines 543--544
[]\OT1/cmr/bx/n/10.95 combine[]feat[][LSA[]and[]stats[]feat[]Jun09][][Low].py\O
T1/cmr/m/n/10.95 : This file gen-er-ates one
[]

[17]
Overfull \hbox (1.45724pt too wide) in paragraph at lines 572--573
Overfull \hbox (1.45724pt too wide) in paragraph at lines 570--571
\OT1/cmr/m/n/10.95 pa. It is adopt-ed from []$\OT1/cmtt/m/n/10.95 https : / / g
ithub . com / benhamner / Metrics / tree / master /$
[]

[18]
Overfull \hbox (32.26485pt too wide) in paragraph at lines 593--594
Overfull \hbox (32.26485pt too wide) in paragraph at lines 591--592
[]\OT1/cmr/m/n/10.95 XGBoost-0.4.0 (Win-dows Ex-e-cutable, []$\OT1/cmtt/m/n/10.
95 https : / / github . com / dmlc / XGBoost / releases /$
[]


Overfull \hbox (11.5833pt too wide) in paragraph at lines 594--595
Overfull \hbox (11.5833pt too wide) in paragraph at lines 592--593
[]\OT1/cmr/m/n/10.95 ml[]metrics ([]$\OT1/cmtt/m/n/10.95 https : / / github . c
om / benhamner / Metrics / tree / master / Python / ml _$
[]


Overfull \hbox (12.11351pt too wide) in paragraph at lines 598--599
Overfull \hbox (12.11351pt too wide) in paragraph at lines 596--597
[]\OT1/cmr/m/n/10.95 rgf1.2 (Win-dows Ex-e-cutable, []$\OT1/cmtt/m/n/10.95 http
: / / stat . rutgers . edu / home / tzhang / software /$
[]


Overfull \hbox (2.8642pt too wide) in paragraph at lines 602--602
Overfull \hbox (2.8642pt too wide) in paragraph at lines 600--600
[]\OT1/cmr/bx/n/17.28 How To Gen-er-ate the So-lu-tion (a-ka README
[]

@@ -882,12 +882,12 @@ ication / forums / t / 13863 /$
[]

[20])
Package atveryend Info: Empty hook `BeforeClearDocument' on input line 637.
Package atveryend Info: Empty hook `BeforeClearDocument' on input line 635.
[21]
Package atveryend Info: Empty hook `AfterLastShipout' on input line 637.
Package atveryend Info: Empty hook `AfterLastShipout' on input line 635.
(F:\CrowdFolwer\cleanup\Doc\Kaggle_CrowdFlower_ChenglongChen.aux)
Package atveryend Info: Executing hook `AtVeryEndDocument' on input line 637.
Package atveryend Info: Executing hook `AtEndAfterFileList' on input line 637.
Package atveryend Info: Executing hook `AtVeryEndDocument' on input line 635.
Package atveryend Info: Executing hook `AtEndAfterFileList' on input line 635.


Package rerunfilecheck Warning: File `Kaggle_CrowdFlower_ChenglongChen.out' has
@@ -899,9 +899,6 @@ Package rerunfilecheck Info: Checksums for `Kaggle_CrowdFlower_ChenglongChen.ou
t':
(rerunfilecheck) Before: A531A6FD907444ED35169CFAC17839EF;10184
(rerunfilecheck) After: 5835D4BCA4B1BF337073CA56FA26B04F;3336.

LaTeX Warning: Label(s) may have changed. Rerun to get cross-references right.

)
Here is how much of TeX's memory you used:
21906 strings out of 495354
Binary file modified Doc/Kaggle_CrowdFlower_ChenglongChen.pdf
Binary file not shown.
Binary file modified Doc/Kaggle_CrowdFlower_ChenglongChen.synctex.gz
Binary file not shown.
10 changes: 4 additions & 6 deletions Doc/Kaggle_CrowdFlower_ChenglongChen.tex
@@ -300,7 +300,7 @@ \subsubsection{Basic TF-IDF Features}
\item \textbf{Cosine Similarity Based on SVD Reduced Features}\\
We computed cosine similarity based on SVD reduced features (using common SVD).
\item \textbf{Statistical Cosine Similarity Based on SVD Reduced Features}\\
We computed statistical cosine similarity based on SVD Reduced Features.
We computed statistical cosine similarity based on SVD reduced features.
\end{itemize}
\subsubsection{Cooccurrence TF-IDF Features}
\label{subsubsec:Cooccurrence_TFIDF_Features}
@@ -341,7 +341,7 @@ \subsubsection{Query Id}
one-hot encoding of the \texttt{query} (generated via \texttt{genFeat\_id\_feat.py})

\subsection{Feature Selection}
For feature selection, we adopted the idea of ``untuned modeling'' as used in Marios Michailidis and Gert Jacobusse's 2nd place solution \cite{malware_2nd} to Microsoft Malware Classification Challenge. The same model is always used to perform cross validation on a (combined) set of features to test whether it improves the
For feature selection, we adopted the idea of ``untuned modeling'' as used in Marios Michailidis and Gert Jacobusse's 2nd place solution \cite{malware_2nd} to Microsoft Malware Classification Challenge on Kaggle. The same model is always used to perform cross validation on a (combined) set of features to test whether it improves the
score compared to earlier feature sets. For features of high dimension (denoted as ``High''), e.g., feature set including raw TF-IDF features, we used XGBoost with linear booster (MSE objective); otherwise, we used \texttt{ExtraTreesRegressor} in Sklearn for features of low dimension (denoted as ``Low'').

Note that with ensemble selection, one can train a model library with various feature sets and rely on ensemble selection to pick out the best ensemble within the model library. However, feature selection is still helpful. Using the above feature selection method, one can first identify some (possibly) well-performing feature sets, and then train the model library with them. This helps to reduce the computation burden to some extent.
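A minimal sketch of this kind of untuned-modeling feature selection might look as follows; the feature blocks, model settings, and scoring below are illustrative assumptions, not the project's actual ./Feat or ./Model code.

```python
# Sketch of "untuned modeling" feature selection: one fixed model is
# cross-validated on each candidate (combined) feature set, and a block is
# kept only if it improves the CV score. All names here are hypothetical.
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import cross_val_score

def cv_mse(X, y):
    # Fixed, untuned model for low-dimensional ("Low") feature sets.
    model = ExtraTreesRegressor(n_estimators=300, random_state=2015)
    return -cross_val_score(model, X, y,
                            scoring="neg_mean_squared_error", cv=3).mean()

def select_feature_blocks(candidate_blocks, y):
    """candidate_blocks: list of (name, 2-D array); keep blocks that help."""
    kept, best = [], np.inf
    for name, block in candidate_blocks:
        trial = np.hstack([b for _, b in kept] + [block])
        score = cv_mse(trial, y)
        if score < best:  # lower MSE is better
            best, kept = score, kept + [(name, block)]
    return [name for name, _ in kept]
```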
@@ -353,9 +353,7 @@ \subsubsection{The Split}
\subsubsection{Following the Same Logic}
Since this is an NLP-related competition, it's common to use TF-IDF features. We have seen a few people fitting a TF-IDF transformer on the stacked training and testing set, and then transforming the training and testing set, respectively. They then use such feature vectors (\textbf{they are fixed}) for cross validation or grid search for the best parameters. They call such a method semi-supervised learning. In our opinion, if one takes such a method, he should refit the transformer using only the whole training set in CV, following the same logic.

On other hand, if one fit the transformer on the training set (for the final model building), then in CV, he should also refit the transformer on the training fold only. This is the method we used. Not only for TF-IDF transformer, but also for other transformations, e.g., normalization and SVD, one should make sure he is following the same logic in both CV and the final model building.


On the other hand, if one fit the transformer on the training set (for the final model building), then in CV, he should also refit the transformer on the training fold only. This is the method we used. Not only for TF-IDF transformer, but also for other transformations, e.g., normalization and SVD, one should make sure he is following the same logic in both CV and the final model building.
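A hedged sketch of following the same logic in CV is given below: a generic sklearn pipeline with assumed parameters (not the repository's actual CV scripts) that refits TF-IDF, SVD, and scaling on the training fold only.

```python
# Sketch: every transformer (TF-IDF, SVD, scaler) is refitted on the
# training fold only and then applied to the validation fold, mirroring
# the final model building. Parameters and fold count are assumptions.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

def cv_same_logic(texts, y, n_splits=3, seed=2015):
    texts, y = list(texts), np.asarray(y, dtype=float)
    scores = []
    for tr, va in KFold(n_splits=n_splits, shuffle=True, random_state=seed).split(texts):
        pipe = make_pipeline(
            TfidfVectorizer(ngram_range=(1, 3), min_df=3),
            TruncatedSVD(n_components=100, random_state=seed),
            StandardScaler(),
            Ridge(alpha=1.0),
        )
        # Fit on the training fold only; the validation fold is held out,
        # exactly as the test set is held out for the final model.
        pipe.fit([texts[i] for i in tr], y[tr])
        pred = pipe.predict([texts[i] for i in va])
        scores.append(mean_squared_error(y[va], pred))
    return float(np.mean(scores))
```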

\begin{table}[t]
\centering
@@ -603,7 +601,7 @@ \section{How To Generate the Solution (aka README file)}
\begin{enumerate}
\item download data from the competition website and put all the data into folder \texttt{./Data}.
\item run \texttt{python ./Feat/run\_all.py} to generate feature set. This will take a few hours.
\item run \texttt{python ./Model/generate\_best\_single\_model.py} to generate the best single model submission. In my experience, it only takes a few trials to generate model of best performance or similar performance. See the training log in \\ \texttt{./Output/Log/[Pre@solution]\_[Feat@svd100\_and\_bow\_Jun27]\\
\item run \texttt{python ./Model/generate\_best\_single\_model.py} to generate the best single model submission. In our experience, it only takes a few trials to generate model of best performance or similar performance. See the training log in \\ \texttt{./Output/Log/[Pre@solution]\_[Feat@svd100\_and\_bow\_Jun27]\\
\_[Model@reg\_xgb\_linear]\_hyperopt.log} for example.
\item run \texttt{python ./Model/generate\_model\_library.py} to generate model library. This is quite time consuming. \textbf{But you don't have to wait for this script to finish: you can run the next step once you have some models trained.}
\item run \texttt{python ./Model/generate\_ensemble\_submission.py} to generate submission via ensemble selection.
14 changes: 10 additions & 4 deletions Doc/Kaggle_CrowdFlower_ChenglongChen.tex.bak
@@ -57,7 +57,7 @@ Our solution consisted of two parts: feature engineering and model ensembling. W
\item distance features
\item TF-IDF features
\end{itemize}
Before generating features, we have found that it's helpful to process the text of the data with spelling correction, synonym replacement, and stemming. Model ensembling consisted of two main steps, Firstly, we trained model library using different models, different parameter settings, and different subsets of the features. Secondly, we generated ensemble submission from the model library predictions using bagged ensemble selection. Performance was estimated using cross validation within the training set. No external data sources were used in our winning submission.
Before generating features, we have found that it's helpful to process the text of the data with spelling correction, synonym replacement, and stemming. Model ensembling consisted of two main steps, Firstly, we trained model library using different models, different parameter settings, and different subsets of the features. Secondly, we generated ensemble submission from the model library predictions using bagged ensemble selection. Performance was estimated using cross validation within the training set. No external data sources were used in our winning submission. The flowchart of our method is shown in Figure \ref{fig:Flowchart}.
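A rough sketch of the bagged greedy ensemble selection step is shown below. It is simplified: a generic score function and an in-memory library of validation predictions are assumed, whereas the real implementation under ./Model optimizes quadratic weighted kappa.

```python
# Sketch of bagged greedy ensemble selection over a model library's
# validation predictions. preds maps a (hypothetical) model name to a 1-D
# numpy array of predictions; y is the corresponding target array.
import random
import numpy as np

def greedy_selection(preds, y, score_fn, n_iter=50):
    ensemble, blend = [], np.zeros_like(y, dtype=float)
    for _ in range(n_iter):
        best_name, best_score = None, None
        for name, p in preds.items():
            # Candidate blend if this model were added (with replacement).
            cand = (blend * len(ensemble) + p) / (len(ensemble) + 1)
            s = score_fn(y, cand)
            if best_score is None or s > best_score:  # higher is better
                best_name, best_score = name, s
        ensemble.append(best_name)
        blend = (blend * (len(ensemble) - 1) + preds[best_name]) / len(ensemble)
    return ensemble

def bagged_selection(preds, y, score_fn, n_bags=10, frac=0.7, seed=2015):
    random.seed(seed)
    names, picks = list(preds), []
    for _ in range(n_bags):
        subset = {n: preds[n] for n in random.sample(names, int(frac * len(names)))}
        picks.extend(greedy_selection(subset, y, score_fn))
    # A model's final weight is how often it was selected across bags.
    return {n: picks.count(n) / len(picks) for n in set(picks)}
```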

The best single model we have obtained during the competition was an XGBoost model with linear booster of Public LB score \textbf{0.69322} and Private LB score \textbf{0.70768}. Our final winning submission was a median ensemble of 35 best Public LB submissions. This submission scored \textbf{0.70807} on Public LB (our second best Public LB score) and \textbf{0.72189} on Private LB. \footnote{The best Public LB score was \textbf{0.70849} with corresponding Private LB score \textbf{0.72134}. It's a mean ensemble version of those 35 LB submissions.}
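The final median ensembling of submission files could be sketched roughly as below; the file pattern and column names are assumptions about the submission format, not the actual script.

```python
# Sketch: median-ensemble a set of submission files. Predictions are
# combined sample-wise via the median and rounded back to the {1,2,3,4}
# relevance labels. Paths and column names are hypothetical.
import glob
import numpy as np
import pandas as pd

def median_ensemble(pattern="./Output/Subm/*.csv", out="median_ensemble.csv"):
    subs = [pd.read_csv(f) for f in sorted(glob.glob(pattern))]
    ids = subs[0]["id"]
    preds = np.vstack([s["prediction"].values for s in subs])  # (n_subs, n_samples)
    final = np.clip(np.round(np.median(preds, axis=0)), 1, 4).astype(int)
    pd.DataFrame({"id": ids, "prediction": final}).to_csv(out, index=False)
```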

@@ -70,6 +70,12 @@ The best single model we have obtained during the competition was an XGBoost mod
\end{tikzpicture}
\end{figure}
\end{comment}
\begin{figure}[!htb]
\centering
\includegraphics[width=0.9\textwidth]{./FlowChart.pdf}
\caption{The flowchart of our method.}
\label{fig:Flowchart}
\end{figure}

\section{Preprocessing}
A few steps were performed to clean up the text.
@@ -294,7 +300,7 @@ We fit a SVD transformer for TF-IDF vectors of $\{q_i, t_i, d_i\}$, separately.
\item \textbf{Cosine Similarity Based on SVD Reduced Features}\\
We computed cosine similarity based on SVD reduced features (using common SVD).
\item \textbf{Statistical Cosine Similarity Based on SVD Reduced Features}\\
We computed statistical cosine similarity based on SVD Reduced Features.
We computed statistical cosine similarity based on SVD reduced features.
\end{itemize}
\subsubsection{Cooccurrence TF-IDF Features}
\label{subsubsec:Cooccurrence_TFIDF_Features}
Expand Down Expand Up @@ -455,8 +461,8 @@ Package & \multicolumn{2}{c|}{Model} & Feature & Weighting\\
& & Softkappa & &\\ \cline{3-3}
\hline
\multirow{7}*{Sklearn} & \multicolumn{2}{c|}{\texttt{GradientBoostingRegressor}} & Low & Yes\\ \cline{2-5}
& \multicolumn{2}{c|}{\texttt{RandomForestRegressor}} & Low & Yes\\ \cline{2-5}
& \multicolumn{2}{c|}{\texttt{ExtraTreesRegressor}} & Low & Yes\\ \cline{2-5}
& \multicolumn{2}{c|}{\texttt{ExtraTreesRegressor}} & Low & Yes\\ \cline{2-5}
& \multicolumn{2}{c|}{\texttt{RandomForestRegressor}} & Low & Yes\\ \cline{2-5}
& \multicolumn{2}{c|}{\texttt{SVR}} & Low & Yes\\ \cline{2-5}
& \multicolumn{2}{c|}{\texttt{Ridge}} & High/Low & Yes\\ \cline{2-5}
& \multicolumn{2}{c|}{\texttt{Lasso}} & High/Low & No\\ \cline{2-5}
2 changes: 2 additions & 0 deletions README.md
@@ -3,6 +3,8 @@

1st Place Solution for Search Results Relevance Competition on Kaggle (https://www.kaggle.com/c/crowdflower-search-relevance)

The best single model we have obtained during the competition was an [XGBoost](https://github.com/dmlc/xgboost) model with linear booster of Public LB score **0.69322** and Private LB score **0.70768**. Our final winning submission was a median ensemble of 35 best Public LB submissions. This submission scored **0.70807** on Public LB and **0.72189** on Private LB.
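For reference, a minimal sketch of an XGBoost linear-booster regressor is given below; the regularization values, rounds, and data splits are illustrative assumptions, not the tuned settings behind the scores above.

```python
# Minimal sketch of an XGBoost model with the linear booster; parameters
# and splits here are illustrative, not the competition settings.
import xgboost as xgb

def train_linear_booster(X_train, y_train, X_valid, y_valid):
    params = {
        "booster": "gblinear",
        "objective": "reg:linear",  # plain regression on the relevance score
        "lambda": 5.0,              # L2 regularization
        "alpha": 0.5,               # L1 regularization
    }
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dvalid = xgb.DMatrix(X_valid, label=y_valid)
    return xgb.train(params, dtrain, num_boost_round=500,
                     evals=[(dtrain, "train"), (dvalid, "valid")],
                     early_stopping_rounds=20)
```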

See `./Doc/Kaggle_CrowdFlower_ChenglongChen.pdf` for documentation.

## Instruction
