
Commit

update training instruction
ChenglongChen committed Jul 12, 2015
1 parent 7dae17c commit a066248
Showing 7 changed files with 28 additions and 25 deletions.
6 changes: 3 additions & 3 deletions Doc/Kaggle_CrowdFlower_ChenglongChen.aux
@@ -89,13 +89,13 @@
\bibstyle{plain}
\bibdata{reference}
\bibcite{owen}{1}
\@writefile{toc}{\contentsline {section}{\numberline {8}Additional Comments and Observations}{19}{section.8}}
\@writefile{toc}{\contentsline {section}{\numberline {9}Simple Features and Methods}{19}{section.9}}
\@writefile{toc}{\contentsline {section}{\numberline {10}Acknowledgement}{19}{section.10}}
\bibcite{Otto_1st}{2}
\bibcite{malware_2nd}{3}
\bibcite{hyperopt_url}{4}
\bibcite{hyperopt}{5}
\@writefile{toc}{\contentsline {section}{\numberline {8}Additional Comments and Observations}{19}{section.8}}
\@writefile{toc}{\contentsline {section}{\numberline {9}Simple Features and Methods}{19}{section.9}}
\@writefile{toc}{\contentsline {section}{\numberline {10}Acknowledgement}{19}{section.10}}
\bibcite{ebc}{6}
\bibcite{ensemble_selection}{7}
\bibcite{NLTK_Cookbook}{8}
18 changes: 9 additions & 9 deletions Doc/Kaggle_CrowdFlower_ChenglongChen.log
@@ -1,4 +1,4 @@
This is pdfTeX, Version 3.1415926-2.3-1.40.12 (MiKTeX 2.9) (preloaded format=pdflatex 2013.11.4) 12 JUL 2015 14:29
This is pdfTeX, Version 3.1415926-2.3-1.40.12 (MiKTeX 2.9) (preloaded format=pdflatex 2013.11.4) 12 JUL 2015 16:34
entering extended mode
**Kaggle_CrowdFlower_ChenglongChen.tex
(F:\CrowdFolwer\cleanup\Doc\Kaggle_CrowdFlower_ChenglongChen.tex
@@ -858,7 +858,7 @@ Overfull \hbox (44.12091pt too wide) in paragraph at lines 4--5
[]kaggle-[]data-[]scientist-[]owen-[]zhang/$[]\OT1/cmr/m/n/10.95 .
[]


[19]
Overfull \hbox (5.08711pt too wide) in paragraph at lines 7--8
[][]$\OT1/cmtt/m/n/10.95 https : / / www . kaggle . com / c / otto-[]group-[]pr
oduct-[]classification-[]challenge /$
@@ -876,13 +876,13 @@ Underfull \hbox (badness 10000) in paragraph at lines 11--12
ication / forums / t / 13863 /$
[]

[19])
Package atveryend Info: Empty hook `BeforeClearDocument' on input line 630.
)
Package atveryend Info: Empty hook `BeforeClearDocument' on input line 631.
[20]
Package atveryend Info: Empty hook `AfterLastShipout' on input line 630.
Package atveryend Info: Empty hook `AfterLastShipout' on input line 631.
(F:\CrowdFolwer\cleanup\Doc\Kaggle_CrowdFlower_ChenglongChen.aux)
Package atveryend Info: Executing hook `AtVeryEndDocument' on input line 630.
Package atveryend Info: Executing hook `AtEndAfterFileList' on input line 630.
Package atveryend Info: Executing hook `AtVeryEndDocument' on input line 631.
Package atveryend Info: Executing hook `AtEndAfterFileList' on input line 631.


Package rerunfilecheck Warning: File `Kaggle_CrowdFlower_ChenglongChen.out' has
@@ -898,7 +898,7 @@ t':
Here is how much of TeX's memory you used:
21898 strings out of 495354
471435 string characters out of 3183859
685467 words of memory out of 3000000
684467 words of memory out of 3000000
24507 multiletter control sequences out of 15000+200000
23519 words of font info for 91 fonts, out of 3000000 for 9000
14 hyphenation exceptions out of 8191
@@ -915,7 +915,7 @@ c/amsfonts/cm/cmr17.pfb><D:/CTEX/MiKTeX/fonts/type1/public/amsfonts/cm/cmr7.pfb
type1/public/amsfonts/cm/cmsy10.pfb><D:/CTEX/MiKTeX/fonts/type1/public/amsfonts
/cm/cmti10.pfb><D:/CTEX/MiKTeX/fonts/type1/public/amsfonts/cm/cmtt10.pfb><D:/CT
EX/MiKTeX/fonts/type1/public/amsfonts/cm/cmtt12.pfb>
Output written on Kaggle_CrowdFlower_ChenglongChen.pdf (20 pages, 334775 bytes)
Output written on Kaggle_CrowdFlower_ChenglongChen.pdf (20 pages, 335464 bytes)
.
PDF statistics:
516 PDF objects out of 1000 (max. 8388607)
Binary file modified Doc/Kaggle_CrowdFlower_ChenglongChen.pdf
Binary file modified Doc/Kaggle_CrowdFlower_ChenglongChen.synctex.gz
15 changes: 8 additions & 7 deletions Doc/Kaggle_CrowdFlower_ChenglongChen.tex
@@ -59,7 +59,7 @@ \section{Summary}
\end{itemize}
Before generating features, we have found that it's helpful to process the text of the data with spelling correction, synonym replacement, and stemming. Model ensembling consisted of two main steps. First, we trained a model library using different models, different parameter settings, and different subsets of the features. Second, we generated an ensemble submission from the model library predictions using bagged ensemble selection. Performance was estimated using cross validation within the training set. No external data sources were used in our winning submission.
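As a rough, hedged illustration of those three text-processing steps (the replacement tables below are hypothetical stand-ins; the actual lists were hand-crafted in the feature-generation code):

```python
# Illustrative sketch only -- hypothetical spelling/synonym tables; the
# real preprocessing lives in the feature-generation code.
import re
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
SPELLING_FIXES = {"speeker": "speaker"}  # hypothetical entries
SYNONYMS = {"notebook": "laptop"}        # hypothetical entries

def process_text(text):
    tokens = re.split(r"\s+", text.lower().strip())
    tokens = [SPELLING_FIXES.get(t, t) for t in tokens]  # spelling correction
    tokens = [SYNONYMS.get(t, t) for t in tokens]        # synonym replacement
    tokens = [stemmer.stem(t) for t in tokens]           # stemming
    return " ".join(tokens)

print(process_text("Notebook Speeker"))  # -> "laptop speaker"
```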

The best single model we have obtained during the competition was an XGBoost model with linear booster of Public LB score \textbf{0.69322} and Private LB score \textbf{0.70768}. Our final winning submission was a median ensemble of 35 best Public LB submissions. This submission scored \textbf{0.70807} on Public LB (my second best Public LB score) and \textbf{0.72189} on Private LB. \footnote{The best Public LB score was \textbf{0.70849} with corresponding Private LB score \textbf{0.72134}. It's a mean version of those 35 LB submissions.}
The best single model we have obtained during the competition was an XGBoost model with linear booster of Public LB score \textbf{0.69322} and Private LB score \textbf{0.70768}. Our final winning submission was a median ensemble of 35 best Public LB submissions. This submission scored \textbf{0.70807} on Public LB (our second best Public LB score) and \textbf{0.72189} on Private LB. \footnote{The best Public LB score was \textbf{0.70849} with corresponding Private LB score \textbf{0.72134}. It's a mean ensemble version of those 35 LB submissions.}
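As a hedged sketch of how such a median ensemble can be assembled (the file names are placeholders, and the id/prediction columns assume the standard Kaggle submission layout):

```python
# Minimal sketch of median-combining submission files; file names are
# placeholders and the column layout is an assumption.
import numpy as np
import pandas as pd

subm_files = ["subm_01.csv", "subm_02.csv", "subm_03.csv"]  # ... 35 in total
preds = np.column_stack([pd.read_csv(f)["prediction"] for f in subm_files])

out = pd.read_csv(subm_files[0])[["id"]].copy()
# Median across submissions, rounded back to integer relevance labels 1-4.
out["prediction"] = np.round(np.median(preds, axis=1)).astype(int)
out.to_csv("median_ensemble.csv", index=False)
```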

\begin{comment}
\begin{figure}[!htb]
@@ -343,7 +343,7 @@ \subsection{Feature Selection}
\section{Modeling Techniques and Training}
\subsection{Cross Validation Methodology}
\subsubsection{The Split}
Early in the competition, we have been using \texttt{StratifiedKFold} on \texttt{median\_relevance} or \texttt{query} with $k = 5$ or $k = 10$, but there was a large gap between my CV score and Public LB score. We then changed our CV method to \texttt{StratifiedKFold} on \texttt{query} with $k = 3$, and used \emph{each 1 fold as training set} and \emph{the rest 2 folds as validation set}. This is to mimic the training-testing split of the data as pointed out by Kaggler @Silogram. With this strategy, our CV score tended to be more correlated with the Public LB score (see Table \ref{CV_LB}).
Early in the competition, we have been using \texttt{StratifiedKFold} on \texttt{median\_relevance} or \texttt{query} with $k = 5$ or $k = 10$, but there was a large gap between our CV score and Public LB score. We then changed our CV method to \texttt{StratifiedKFold} on \texttt{query} with $k = 3$, and used \emph{each 1 fold as training set} and \emph{the rest 2 folds as validation set}. This is to mimic the training-testing split of the data as pointed out by Kaggler @Silogram. With this strategy, our CV score tended to be more correlated with the Public LB score (see Table \ref{CV_LB}).
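A minimal sketch of this split, using sklearn's current API (the shuffle and seed settings are illustrative):

```python
# k = 3 StratifiedKFold on `query`: train on each single fold and
# validate on the other two -- the reverse of sklearn's usual roles.
import pandas as pd
from sklearn.model_selection import StratifiedKFold

df = pd.read_csv("./Data/train.csv")  # competition training data
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=2015)

# skf.split yields (two-fold indices, one-fold indices); swap the roles.
for valid_idx, train_idx in skf.split(df, df["query"]):
    train, valid = df.iloc[train_idx], df.iloc[valid_idx]
    # fit on `train` (1 fold), evaluate on `valid` (2 folds) ...
```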
\subsubsection{Following the Same Logic}
Since this is an NLP related competition, it's common to use TF-IDF features. We have seen a few people fit a TF-IDF transformer on the stacked training and testing set, and then transform the training and testing set, respectively. They then use such feature vectors (\textbf{which are fixed}) for cross validation or grid search for the best parameters. They call this method semi-supervised learning. In our opinion, anyone taking this approach should refit the transformer using only the training set within CV, following the same logic.
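A minimal sketch of the fold-consistent variant (the vectorizer settings are illustrative):

```python
# Refit the TF-IDF transformer on the training fold only, then transform
# the validation fold -- rather than fitting once on stacked train+test.
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_fold_features(train_text, valid_text):
    tfidf = TfidfVectorizer(ngram_range=(1, 3), min_df=3)  # illustrative settings
    X_train = tfidf.fit_transform(train_text)  # fit inside the fold
    X_valid = tfidf.transform(valid_text)      # transform only
    return X_train, X_valid
```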

@@ -435,7 +435,7 @@ \subsection{Ensemble Selection}
\subsubsection{Model Library Building via Guided Parameter Searching}
Ensemble selection needs a model library containing lots (hundreds or thousands) of models trained using different algorithms (e.g., XGBoost or NN, see Table \ref{tab:Model_Library} for the algorithms we used), different parameters (how many trees/layers/hidden units), or different feature sets. For each algorithm, we specified a parameter space and used the TPE method \cite{hyperopt} in the Hyperopt package \cite{hyperopt_url} for parameter searching. This not only finds the best parameter setting for each algorithm, but also creates a model library with various parameter settings guided or provided by Hyperopt.

During parameter searching, we trained a model with each parameter setting on training fold for each run and each fold in cross-validation, and saved the rank of the prediction to disk. Note that such rank was obtained using the corresponding decoding method as in step 2-4 of Sec. \ref{subsubsec:Classification}. They were used in ensemble selection to find the best ensemble. We also trained a model with the same parameter setting on the whole training set, and saved the rank of the prediction of the testing set. Such rank predictions were used for generating the final ensemble submission.
During parameter searching, we trained a model with each parameter setting on training fold for each run and each fold in cross-validation, and saved the rank of the prediction of the validation fold to disk. Note that such rank was obtained using the corresponding decoding method as in step 2-4 of Sec. \ref{subsubsec:Classification}. They were used in ensemble selection to find the best ensemble. We also trained a model with the same parameter setting on the whole training set, and saved the rank of the prediction of the testing set. Such rank predictions were used for generating the final ensemble submission.
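A hedged sketch of the guided search (the search space is illustrative, and cv_kappa is a hypothetical helper, not the actual training routine):

```python
# TPE-guided parameter search with Hyperopt; `cv_kappa` is a hypothetical
# helper that trains on each CV fold, saves the rank of the validation
# predictions to disk, and returns the mean CV kappa.
from hyperopt import fmin, tpe, hp, Trials

space = {
    "eta": hp.quniform("eta", 0.01, 0.3, 0.01),
    "lambda": hp.quniform("lambda", 0.0, 5.0, 0.05),
    "num_round": hp.quniform("num_round", 100, 500, 10),
}

def objective(params):
    score = cv_kappa(params)  # hypothetical: train, save ranks, score
    return -score             # hyperopt minimizes, so negate the kappa

trials = Trials()
best = fmin(objective, space, algo=tpe.suggest, max_evals=200, trials=trials)
```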
\begin{table}[t]
\centering
\caption{Model Library}
@@ -596,9 +596,10 @@ \section{Dependencies}
\section{How To Generate the Solution (aka README file)}
\begin{enumerate}
\item download data from the competition website and put all the data into folder \texttt{./Data}.
\item run \texttt{python ./Feat/run\_all.py} to generate feature set.
\item run \texttt{python ./Model/generate\_best\_single\_model.py} to generate best single model submission.
\item run \texttt{python ./Model/generate\_model\_library.py} to generate model library.
\item run \texttt{python ./Feat/run\_all.py} to generate the feature set. This will take a few hours.
\item run \texttt{python ./Model/generate\_best\_single\_model.py} to generate the best single model submission. In our experience, it only takes a few trials to generate a model with the best or similar performance. See the training log in \\ \texttt{./Output/Log/[Pre@solution]\_[Feat@svd100\_and\_bow\_Jun27]\\
\_[Model@reg\_xgb\_linear]\_hyperopt.log} for example.
\item run \texttt{python ./Model/generate\_model\_library.py} to generate the model library. This is quite time-consuming. \textbf{But you don't have to wait for this script to finish: you can run the next step once you have some models trained.}
\item run \texttt{python ./Model/generate\_ensemble\_submission.py} to generate the submission via ensemble selection (a minimal sketch of the selection idea follows this list).
\end{enumerate}
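As a hedged sketch of the greedy step at the heart of ensemble selection (the scoring function and plain averaging are simplifications of the actual rank-based implementation):

```python
# Greedy forward selection with replacement over a model library;
# `preds` maps model name -> validation predictions, and `score_fn`
# (higher is better) stands in for the decoded quadratic weighted kappa.
import numpy as np

def ensemble_selection(preds, y_valid, score_fn, n_init=5, n_iter=50):
    ranked = sorted(preds, key=lambda m: score_fn(y_valid, preds[m]), reverse=True)
    counts = {m: 0 for m in preds}
    for m in ranked[:n_init]:  # seed with the best single models
        counts[m] += 1
    current = np.mean([preds[m] for m in ranked[:n_init]], axis=0)
    size = n_init
    for _ in range(n_iter):  # repeatedly add whichever model helps most
        gains = {m: score_fn(y_valid, (current * size + preds[m]) / (size + 1))
                 for m in preds}
        best = max(gains, key=gains.get)
        current = (current * size + preds[best]) / (size + 1)
        counts[best] += 1
        size += 1
    return counts  # bagged version: repeat on random library subsets and average
```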

@@ -613,7 +614,7 @@ \section{Additional Comments and Observations}
\end{itemize}

\section{Simple Features and Methods}
Without any stacking or ensembling, the best (Public LB) single model we have obtained during the competition was an XGBoost model with linear booster. It is with Public LB score: $0.69322$ and Private LB score: $0.70768$. Apart from the counting features and distance features, it used raw basic TF-IDF and raw cooccurrence TF-IDF.
Without any stacking or ensembling, the best (Public LB) single model we have obtained during the competition was an XGBoost model with linear booster. It is with Public LB score: \textbf{0.69322} and Private LB score: \textbf{0.70768}. Apart from the counting features and distance features, it used raw basic TF-IDF and raw cooccurrence TF-IDF.

To reproduce the best single model, run\\
\texttt{> python ./Code/Feat/combine\_feat\_[svd100\_and\_bow\_Jun27].py}\\
8 changes: 5 additions & 3 deletions Doc/Kaggle_CrowdFlower_ChenglongChen.tex.bak
@@ -61,13 +61,15 @@ Before generating features, we have found that it's helpful to process the text

The best single model we have obtained during the competition was an XGBoost model with linear booster of Public LB score \textbf{0.69322} and Private LB score \textbf{0.70768}. Our final winning submission was a median ensemble of 35 best Public LB submissions. This submission scored \textbf{0.70807} on Public LB (my second best Public LB score) and \textbf{0.72189} on Private LB. \footnote{The best Public LB score was \textbf{0.70849} with corresponding Private LB score \textbf{0.72134}. It's a mean version of those 35 LB submissions.}

\begin{comment}
\begin{figure}[!htb]
\centering
\begin{tikzpicture}
\draw (-1.5,0) -- (1.5,0);
\draw (0,-1.5) -- (0,1.5);
\end{tikzpicture}
\end{figure}
\end{comment}

\section{Preprocessing}
A few steps were performed to clean up the text.
@@ -160,7 +162,7 @@ We also performed stemming before generating features (e.g., counting features a
\section{Feature Extraction/Selection}
Before proceeding to describe the features, we first introduce some notation. We use the tuple $(q_i, t_i, d_i)$ to denote the $i$-th sample in \texttt{train.csv} or \texttt{test.csv}, where $q_i$ is the \texttt{query}, $t_i$ is the \texttt{product\_title}, and $d_i$ is the \texttt{product\_description}. For \texttt{train.csv}, we further use $r_i$ and $v_i$ to denote \texttt{median\_relevance} and \texttt{relevance\_variance}\footnote{This is actually the standard deviation (std).}, respectively. We use the function $\text{ngram}(s, n)$ to extract string/sentence $s$'s $n$-grams (split by whitespace), where $n\in\{1,2,3\}$ if not specified. For example
\[
\text{ngram}(\text{bridal shower decorations}, 2) = [\text{bridal shower}, \text{shower decorations}]\footnote{Note that this is a list (e.g., \texttt{list} in python), not a set (e.g., \texttt{set} in python).}
\text{ngram}(\text{bridal shower decorations}, 2) = [\text{bridal shower}, \text{shower decorations}]\footnote{Note that this is a list (e.g., \texttt{list} in Python), not a set (e.g., \texttt{set} in Python).}
\]
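A direct sketch of this function (note that it returns a list, not a set):

```python
def ngram(s, n):
    """n-grams of whitespace-split tokens, returned as a list."""
    tokens = s.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

ngram("bridal shower decorations", 2)
# -> ['bridal shower', 'shower decorations']
```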

\textbf{All the features are extracted for each run (i.e., repeated time) and fold (used in cross-validation and ensembling), and for the entire training and testing set (used in final model building and generating submission).}
@@ -611,7 +613,7 @@ Some interesting insights we got during the competition:
\end{itemize}

\section{Simple Features and Methods}
Without any stacking or ensembling, the best (Public LB) single model we have obtained during the competition was an XGBoost model with linear booster. It is with Public LB score: $0.69322$ and Private LB score: $0.70768$. Apart from the counting features and distance features, it used raw basic TF-IDF and raw cooccurrence TF-IDF.
Without any stacking or ensembling, the best (Public LB) single model we have obtained during the competition was an XGBoost model with linear booster. It is with Public LB score: \textbf{0.69322} and Private LB score: \textbf{0.70768}. Apart from the counting features and distance features, it used raw basic TF-IDF and raw cooccurrence TF-IDF.

To reproduce the best single model, run\\
\texttt{> python ./Code/Feat/combine\_feat\_[svd100\_and\_bow\_Jun27].py}\\
@@ -620,7 +622,7 @@ to generate the feature set we used, and\\
to train the XGBoost model with linear booster. Note that due to randomness in the Hyperopt routine, it won't generate exactly the same score, but one very similar or even better. You can also try other linear models, e.g., \texttt{Ridge} in Sklearn.
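A hedged sketch of such a model (the parameter values are illustrative rather than the tuned ones, and X_train, y_train, X_test are assumed to hold the feature set and relevance labels):

```python
# Illustrative XGBoost linear-booster setup; X_train, y_train, X_test
# are assumed to hold the combined feature set and relevance labels.
import xgboost as xgb

params = {
    "booster": "gblinear",
    "objective": "reg:linear",  # relevance treated as a regression target
    "eta": 0.1,                 # illustrative values, not the tuned ones
    "lambda": 1.0,
    "alpha": 0.1,
}
dtrain = xgb.DMatrix(X_train, label=y_train)
bst = xgb.train(params, dtrain, num_boost_round=300)
pred = bst.predict(xgb.DMatrix(X_test))  # decoded to labels downstream
```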

\section{Acknowledgement}
We would like to thank the DMLC team for developing the great machine learning package XGBoost, Fran\c{c}ois Chollet for developing package Keras, James Bergstra for developing package Hyperopt.
We would like to thank the DMLC team for developing the great machine learning package XGBoost, Fran\c{c}ois Chollet for developing package Keras, James Bergstra for developing package Hyperopt. We would also like to thank the Kaggle team and CrowdFlower for organizing this competition.

\bibliographystyle{plain}
\bibliography{reference}
6 changes: 3 additions & 3 deletions README.md
@@ -8,8 +8,8 @@ See `./Doc/Kaggle_CrowdFlower_ChenglongChen.pdf` for documentation.
## Instruction

* download data from the [competition website](https://www.kaggle.com/c/crowdflower-search-relevance/data) and put all the data into folder `./Data`.
* run `python ./Code/Feat/run_all.py` to generate features.
* run `python ./Code/Model/generate_best_single_model.py` to generate best single model submission.
* run `python ./Code/Model/generate_model_library.py` to generate model library.
* run `python ./Code/Feat/run_all.py` to generate features. This will take a few hours.
* run `python ./Code/Model/generate_best_single_model.py` to generate the best single model submission. In our experience, it only takes a few trials to generate a model with the best or similar performance. See the training log in `./Output/Log/[Pre@solution]_[Feat@svd100_and_bow_Jun27]_[Model@reg_xgb_linear]_hyperopt.log` for example.
* run `python ./Code/Model/generate_model_library.py` to generate the model library. This is quite time-consuming. **But you don't have to wait for this script to finish: you can run the next step once you have some models trained.**
* run `python ./Code/Model/generate_ensemble_submission.py` to generate submission via ensemble selection.
* if you don't want to run the code, just submit the file in `./Output/Subm`.
