diff --git a/Doc/Kaggle_CrowdFlower_ChenglongChen.aux b/Doc/Kaggle_CrowdFlower_ChenglongChen.aux
index 7fd5419..774b913 100644
--- a/Doc/Kaggle_CrowdFlower_ChenglongChen.aux
+++ b/Doc/Kaggle_CrowdFlower_ChenglongChen.aux
@@ -89,13 +89,13 @@
 \bibstyle{plain}
 \bibdata{reference}
 \bibcite{owen}{1}
+\@writefile{toc}{\contentsline {section}{\numberline {8}Additional Comments and Observations}{19}{section.8}}
+\@writefile{toc}{\contentsline {section}{\numberline {9}Simple Features and Methods}{19}{section.9}}
+\@writefile{toc}{\contentsline {section}{\numberline {10}Acknowledgement}{19}{section.10}}
 \bibcite{Otto_1st}{2}
 \bibcite{malware_2nd}{3}
 \bibcite{hyperopt_url}{4}
 \bibcite{hyperopt}{5}
-\@writefile{toc}{\contentsline {section}{\numberline {8}Additional Comments and Observations}{19}{section.8}}
-\@writefile{toc}{\contentsline {section}{\numberline {9}Simple Features and Methods}{19}{section.9}}
-\@writefile{toc}{\contentsline {section}{\numberline {10}Acknowledgement}{19}{section.10}}
 \bibcite{ebc}{6}
 \bibcite{ensemble_selection}{7}
 \bibcite{NLTK_Cookbook}{8}
diff --git a/Doc/Kaggle_CrowdFlower_ChenglongChen.log b/Doc/Kaggle_CrowdFlower_ChenglongChen.log
index e36bd11..1b67486 100644
--- a/Doc/Kaggle_CrowdFlower_ChenglongChen.log
+++ b/Doc/Kaggle_CrowdFlower_ChenglongChen.log
@@ -1,4 +1,4 @@
-This is pdfTeX, Version 3.1415926-2.3-1.40.12 (MiKTeX 2.9) (preloaded format=pdflatex 2013.11.4) 12 JUL 2015 14:29
+This is pdfTeX, Version 3.1415926-2.3-1.40.12 (MiKTeX 2.9) (preloaded format=pdflatex 2013.11.4) 12 JUL 2015 16:34
 entering extended mode
 **Kaggle_CrowdFlower_ChenglongChen.tex
 (F:\CrowdFolwer\cleanup\Doc\Kaggle_CrowdFlower_ChenglongChen.tex
@@ -858,7 +858,7 @@ Overfull \hbox (44.12091pt too wide) in paragraph at lines 4--5
 []kaggle-[]data-[]scientist-[]owen-[]zhang/$[]\OT1/cmr/m/n/10.95 .
 []
-
+[19]
 Overfull \hbox (5.08711pt too wide) in paragraph at lines 7--8
 [][]$\OT1/cmtt/m/n/10.95 https : / / www . kaggle . com / c / otto-[]group-[]pr
 oduct-[]classification-[]challenge /$
 []
@@ -876,13 +876,13 @@ Underfull \hbox (badness 10000) in paragraph at lines 11--12
 ication / forums / t / 13863 /$
 []

-[19])
-Package atveryend Info: Empty hook `BeforeClearDocument' on input line 630.
+)
+Package atveryend Info: Empty hook `BeforeClearDocument' on input line 631.
 [20]
-Package atveryend Info: Empty hook `AfterLastShipout' on input line 630.
+Package atveryend Info: Empty hook `AfterLastShipout' on input line 631.
 (F:\CrowdFolwer\cleanup\Doc\Kaggle_CrowdFlower_ChenglongChen.aux)
-Package atveryend Info: Executing hook `AtVeryEndDocument' on input line 630.
-Package atveryend Info: Executing hook `AtEndAfterFileList' on input line 630.
+Package atveryend Info: Executing hook `AtVeryEndDocument' on input line 631.
+Package atveryend Info: Executing hook `AtEndAfterFileList' on input line 631.

 Package rerunfilecheck Warning: File `Kaggle_CrowdFlower_ChenglongChen.out' has
@@ -898,7 +898,7 @@ t':
 Here is how much of TeX's memory you used:
  21898 strings out of 495354
  471435 string characters out of 3183859
- 685467 words of memory out of 3000000
+ 684467 words of memory out of 3000000
  24507 multiletter control sequences out of 15000+200000
  23519 words of font info for 91 fonts, out of 3000000 for 9000
  14 hyphenation exceptions out of 8191
@@ -915,7 +915,7 @@ c/amsfonts/cm/cmr17.pfb>
-Output written on Kaggle_CrowdFlower_ChenglongChen.pdf (20 pages, 334775 bytes)
+Output written on Kaggle_CrowdFlower_ChenglongChen.pdf (20 pages, 335464 bytes)
 .
 PDF statistics:
  516 PDF objects out of 1000 (max. 8388607)
diff --git a/Doc/Kaggle_CrowdFlower_ChenglongChen.pdf b/Doc/Kaggle_CrowdFlower_ChenglongChen.pdf
index fbad5e1..43acd86 100644
Binary files a/Doc/Kaggle_CrowdFlower_ChenglongChen.pdf and b/Doc/Kaggle_CrowdFlower_ChenglongChen.pdf differ
diff --git a/Doc/Kaggle_CrowdFlower_ChenglongChen.synctex.gz b/Doc/Kaggle_CrowdFlower_ChenglongChen.synctex.gz
index deb3c89..66385c6 100644
Binary files a/Doc/Kaggle_CrowdFlower_ChenglongChen.synctex.gz and b/Doc/Kaggle_CrowdFlower_ChenglongChen.synctex.gz differ
diff --git a/Doc/Kaggle_CrowdFlower_ChenglongChen.tex b/Doc/Kaggle_CrowdFlower_ChenglongChen.tex
index ca8f7ba..81925e2 100644
--- a/Doc/Kaggle_CrowdFlower_ChenglongChen.tex
+++ b/Doc/Kaggle_CrowdFlower_ChenglongChen.tex
@@ -59,7 +59,7 @@ \section{Summary}
 \end{itemize}
 Before generating features, we have found that it's helpful to process the text of the data with spelling correction, synonym replacement, and stemming. Model ensembling consisted of two main steps, Firstly, we trained model library using different models, different parameter settings, and different subsets of the features. Secondly, we generated ensemble submission from the model library predictions using bagged ensemble selection. Performance was estimated using cross validation within the training set. No external data sources were used in our winning submission.
-The best single model we have obtained during the competition was an XGBoost model with linear booster of Public LB score \textbf{0.69322} and Private LB score \textbf{0.70768}. Our final winning submission was a median ensemble of 35 best Public LB submissions. This submission scored \textbf{0.70807} on Public LB (my second best Public LB score) and \textbf{0.72189} on Private LB. \footnote{The best Public LB score was \textbf{0.70849} with corresponding Private LB score \textbf{0.72134}. It's a mean version of those 35 LB submissions.}
+The best single model we have obtained during the competition was an XGBoost model with linear booster of Public LB score \textbf{0.69322} and Private LB score \textbf{0.70768}. Our final winning submission was a median ensemble of 35 best Public LB submissions. This submission scored \textbf{0.70807} on Public LB (our second best Public LB score) and \textbf{0.72189} on Private LB. \footnote{The best Public LB score was \textbf{0.70849} with corresponding Private LB score \textbf{0.72134}. It's a mean ensemble version of those 35 LB submissions.}
 \begin{comment}
 \begin{figure}[!htb]
 \centering
 \begin{tikzpicture}
@@ -343,7 +343,7 @@ \subsection{Feature Selection}
 \section{Modeling Techniques and Training}
 \subsection{Cross Validation Methodology}
 \subsubsection{The Split}
-Early in the competition, we have been using \texttt{StratifiedKFold} on \texttt{median\_relevance} or \texttt{query} with $k = 5$ or $k = 10$, but there was a large gap between my CV score and Public LB score. We then changed our CV method to \texttt{StratifiedKFold} on \texttt{query} with $k = 3$, and used \emph{each 1 fold as training set} and \emph{the rest 2 folds as validation set}. This is to mimic the training-testing split of the data as pointed out by Kaggler @Silogram. With this strategy, our CV score tended to be more correlated with the Public LB score (see Table \ref{CV_LB}).
+Early in the competition, we have been using \texttt{StratifiedKFold} on \texttt{median\_relevance} or \texttt{query} with $k = 5$ or $k = 10$, but there was a large gap between our CV score and Public LB score. We then changed our CV method to \texttt{StratifiedKFold} on \texttt{query} with $k = 3$, and used \emph{each 1 fold as training set} and \emph{the rest 2 folds as validation set}. This is to mimic the training-testing split of the data as pointed out by Kaggler @Silogram. With this strategy, our CV score tended to be more correlated with the Public LB score (see Table \ref{CV_LB}).

 \subsubsection{Following the Same Logic}
 Since this is an NLP related competition, it's common to use TF-IDF features. We have seen a few people fitting a TF-IDF transformer on the stacked training and testing set, and then transforming the training and testing set, respectively. They then use such feature vectors (\textbf{they are fixed}) for cross validation or grid search for the best parameters. They call such method as semi-supervised learning. In our opinion, if one is taking such method, he should refit the transformer using only the whole training set in CV, following the same logic.
@@ -435,7 +435,7 @@ \subsection{Ensemble Selection}
 \subsubsection{Model Library Building via Guided Parameter Searching}
 Ensemble selection needs a model library contains lots (hundreds or thousands) of models trained used different algorithm (e.g., XGBoost or NN, see Table \ref{tab:Model_Library} for the algorithms we used) or different parameters (how may trees/layers/hidden units) or different feature sets. For each algorithm, we specified a parameter space, and used TPE method \cite{hyperopt} in Hyperopt package \cite{hyperopt_url} for parameter searching. It not only find the best parameter setting for each algorithm, but also create a model library with various parameter settings guided or provided by Hyperopt.

-During parameter searching, we trained a model with each parameter setting on training fold for each run and each fold in cross-validation, and saved the rank of the prediction to disk. Note that such rank was obtained using the corresponding decoding method as in step 2-4 of Sec. \ref{subsubsec:Classification}. They were used in ensemble selection to find the best ensemble. We also trained a model with the same parameter setting on the whole training set, and saved the rank of the prediction of the testing set. Such rank predictions were used for generating the final ensemble submission.
+During parameter searching, we trained a model with each parameter setting on training fold for each run and each fold in cross-validation, and saved the rank of the prediction of the validation fold to disk. Note that such rank was obtained using the corresponding decoding method as in step 2-4 of Sec. \ref{subsubsec:Classification}. They were used in ensemble selection to find the best ensemble. We also trained a model with the same parameter setting on the whole training set, and saved the rank of the prediction of the testing set. Such rank predictions were used for generating the final ensemble submission.
 \begin{table}[t]
 \centering
 \caption{Model Library}
@@ -596,9 +596,10 @@ \section{Dependencies}
 \section{How To Generate the Solution (aka README file)}
 \begin{enumerate}
 \item download data from the competition website and put all the data into folder \texttt{./Data}.
-\item run \texttt{python ./Feat/run\_all.py} to generate feature set.
-\item run \texttt{python ./Model/generate\_best\_single\_model.py} to generate best single model submission.
-\item run \texttt{python ./Model/generate\_model\_library.py} to generate model library.
+\item run \texttt{python ./Feat/run\_all.py} to generate feature set. This will take a few hours.
+\item run \texttt{python ./Model/generate\_best\_single\_model.py} to generate the best single model submission. In my experience, it only takes a few trials to generate model of best performance or similar performance. See the training log in \\ \texttt{./Output/Log/[Pre@solution]\_[Feat@svd100\_and\_bow\_Jun27]\\
+ \_[Model@reg\_xgb\_linear]\_hyperopt.log} for example.
+\item run \texttt{python ./Model/generate\_model\_library.py} to generate model library. This is quite time consuming. \textbf{But you don't have to wait for this script to finish: you can run the next step once you have some models trained.}
 \item run \texttt{python ./Model/generate\_ensemble\_submission.py} to generate submission via ensemble selection.
 \end{enumerate}

@@ -613,7 +614,7 @@ \section{Additional Comments and Observations}
 \end{itemize}

 \section{Simple Features and Methods}
-Without any stacking or ensembling, the best (Public LB) single model we have obtained during the competition was an XGBoost model with linear booster. It is with Public LB score: $0.69322$ and Private LB score: $0.70768$. Apart from the counting features and distance features, it used raw basic TF-IDF and raw cooccurrence TF-IDF.
+Without any stacking or ensembling, the best (Public LB) single model we have obtained during the competition was an XGBoost model with linear booster. It is with Public LB score: \textbf{0.69322} and Private LB score: \textbf{0.70768}. Apart from the counting features and distance features, it used raw basic TF-IDF and raw cooccurrence TF-IDF.

 To reproduce the best single model, run\\
 \texttt{> python ./Code/Feat/combine\_feat\_[svd100\_and\_bow\_Jun27].py}\\
diff --git a/Doc/Kaggle_CrowdFlower_ChenglongChen.tex.bak b/Doc/Kaggle_CrowdFlower_ChenglongChen.tex.bak
index 0eb55c9..bc5d61c 100644
--- a/Doc/Kaggle_CrowdFlower_ChenglongChen.tex.bak
+++ b/Doc/Kaggle_CrowdFlower_ChenglongChen.tex.bak
@@ -61,6 +61,7 @@ Before generating features, we have found that it's helpful to process the text
 The best single model we have obtained during the competition was an XGBoost model with linear booster of Public LB score \textbf{0.69322} and Private LB score \textbf{0.70768}. Our final winning submission was a median ensemble of 35 best Public LB submissions. This submission scored \textbf{0.70807} on Public LB (my second best Public LB score) and \textbf{0.72189} on Private LB. \footnote{The best Public LB score was \textbf{0.70849} with corresponding Private LB score \textbf{0.72134}. It's a mean version of those 35 LB submissions.}

+\begin{comment}
 \begin{figure}[!htb]
 \centering
 \begin{tikzpicture}
@@ -68,6 +69,7 @@ The best single model we have obtained during the competition was an XGBoost mod
 \draw (0,-1.5) -- (0,1.5);
 \end{tikzpicture}
 \end{figure}
+\end{comment}

 \section{Preprocessing}
 A few steps were performed to cleaning up the text.
@@ -160,7 +162,7 @@ We also performed stemming before generating features (e.g., counting features a
 \section{Feature Extraction/Selection}
 Before proceeding to describe the features, we first introduce some notations. We use tuple $(q_i, t_i, d_i)$ to denote the $i$-th sample in \texttt{train.csv} or \texttt{test.csv}, where $q_i$ is the \texttt{query}, $t_i$ is the \texttt{product\_title}, and $d_i$ is the \texttt{product\_description}. For \texttt{train.csv}, we further use $r_i$ and $v_i$ to denote \texttt{median\_relevance} and \texttt{relevance\_variance}\footnote{This is actually the standard deviation (std).}, respectively. We use function $\text{ngram}(s, n)$ to extract string/sentence $s$'s $n$-gram (splitted by whitespace), where $n\in\{1,2,3\}$ if not specified. For example
 \[
-\text{ngram}(\text{bridal shower decorations}, 2) = [\text{bridal shower}, \text{shower decorations}]\footnote{Note that this is a list (e.g., \texttt{list} in python), not a set (e.g., \texttt{set} in python).}
+\text{ngram}(\text{bridal shower decorations}, 2) = [\text{bridal shower}, \text{shower decorations}]\footnote{Note that this is a list (e.g., \texttt{list} in Python), not a set (e.g., \texttt{set} in Python).}
 \]

 \textbf{All the features are extracted for each run (i.e., repeated time) and fold (used in cross-validation and ensembling), and for the entire training and testing set (used in final model building and generating submission).}
@@ -611,7 +613,7 @@ Some interesting insights we got during the competition:
 \end{itemize}

 \section{Simple Features and Methods}
-Without any stacking or ensembling, the best (Public LB) single model we have obtained during the competition was an XGBoost model with linear booster. It is with Public LB score: $0.69322$ and Private LB score: $0.70768$. Apart from the counting features and distance features, it used raw basic TF-IDF and raw cooccurrence TF-IDF.
+Without any stacking or ensembling, the best (Public LB) single model we have obtained during the competition was an XGBoost model with linear booster. It is with Public LB score: \textbf{0.69322} and Private LB score: \textbf{0.70768}. Apart from the counting features and distance features, it used raw basic TF-IDF and raw cooccurrence TF-IDF.

 To reproduce the best single model, run\\
 \texttt{> python ./Code/Feat/combine\_feat\_[svd100\_and\_bow\_Jun27].py}\\
@@ -620,7 +622,7 @@ to generate the feature set we used, and\\
 to train the XGBoost model with linear booster. Note that due to randomness in the Hyperopt routine, it won't generate exactly the same score, but a score very similar or even better. Note that, you can also try other linear models, e.g., \texttt{Ridge} in Sklearn.

 \section{Acknowledgement}
-We would like to thank the DMLC team for developing the great machine learning package XGBoost, Fran\c{c}ois Chollet for developing package Keras, James Bergstra for developing package Hyperopt.
+We would like to thank the DMLC team for developing the great machine learning package XGBoost, Fran\c{c}ois Chollet for developing package Keras, James Bergstra for developing package Hyperopt. We would also like to thank the Kaggle team and CrowdFlower for organizing this competition.

 \bibliographystyle{plain}
 \bibliography{reference}
diff --git a/README.md b/README.md
index cda3190..8ccf323 100644
--- a/README.md
+++ b/README.md
@@ -8,8 +8,8 @@ See `./Doc/Kaggle_CrowdFlower_ChenglongChen.pdf` for documentation.
 ## Instruction

 * download data from the [competition website](https://www.kaggle.com/c/crowdflower-search-relevance/data) and put all the data into folder `./Data`.
-* run `python ./Code/Feat/run_all.py` to generate features.
-* run `python ./Code/Model/generate_best_single_model.py` to generate best single model submission.
-* run `python ./Code/Model/generate_model_library.py` to generate model library.
+* run `python ./Code/Feat/run_all.py` to generate features. This will take a few hours.
+* run `python ./Code/Model/generate_best_single_model.py` to generate best single model submission. In my experience, it only takes a few trials to generate model of best performance or similar performance. See the training log in `./Output/Log/[Pre@solution]_[Feat@svd100_and_bow_Jun27]_[Model@reg_xgb_linear]_hyperopt.log` for example.
+* run `python ./Code/Model/generate_model_library.py` to generate model library. This is quite time consuming. **But you don't have to wait for this script to finish: you can run the next step once you have some models trained.**
 * run `python ./Code/Model/generate_ensemble_submission.py` to generate submission via ensemble selection.
 * if you don't want to run the code, just submit the file in `./Output/Subm`.
\ No newline at end of file