add decoding method comparison
ChenglongChen committed Jul 16, 2015
1 parent 634802d commit efebfeb
Showing 53 changed files with 168,405 additions and 71 deletions.
46 changes: 25 additions & 21 deletions Doc/Kaggle_CrowdFlower_ChenglongChen.aux
@@ -64,33 +64,37 @@
\@writefile{toc}{\contentsline {subsubsection}{\numberline {4.2.1}Classification}{13}{subsubsection.4.2.1}}
\newlabel{subsubsec:Classification}{{4.2.1}{13}{Classification\relax }{subsubsection.4.2.1}{}}
\@writefile{toc}{\contentsline {subsubsection}{\numberline {4.2.2}Regression}{13}{subsubsection.4.2.2}}
\@writefile{toc}{\contentsline {subsubsection}{\numberline {4.2.3}Pairwise Ranking}{13}{subsubsection.4.2.3}}
\citation{ebc}
\citation{cocr}
\@writefile{lof}{\contentsline {figure}{\numberline {2}{\ignorespaces Histograms of raw prediction and predictions using various decoding methods grouped by true relevance.}}{14}{figure.2}}
\newlabel{fig:MSE_decoding}{{2}{14}{Histograms of raw prediction and predictions using various decoding methods grouped by true relevance}{figure.2}{}}
\@writefile{toc}{\contentsline {subsubsection}{\numberline {4.2.3}Pairwise Ranking}{14}{subsubsection.4.2.3}}
\@writefile{toc}{\contentsline {subsubsection}{\numberline {4.2.4}Ordinal Regression}{14}{subsubsection.4.2.4}}
\@writefile{toc}{\contentsline {subsubsection}{\numberline {4.2.5}Softkappa}{14}{subsubsection.4.2.5}}
\citation{cocr}
\@writefile{lot}{\contentsline {table}{\numberline {6}{\ignorespaces Performance of various decoding methods for MSE objective.}}{15}{table.6}}
\newlabel{tab:MSE_decoding}{{6}{15}{Performance of various decoding methods for MSE objective}{table.6}{}}
\@writefile{toc}{\contentsline {subsubsection}{\numberline {4.2.5}Softkappa}{15}{subsubsection.4.2.5}}
\citation{ensemble_selection}
\citation{hyperopt}
\citation{hyperopt_url}
\@writefile{toc}{\contentsline {subsection}{\numberline {4.3}Sample Weighting}{15}{subsection.4.3}}
\@writefile{toc}{\contentsline {subsection}{\numberline {4.4}Ensemble Selection}{15}{subsection.4.4}}
\@writefile{toc}{\contentsline {subsubsection}{\numberline {4.4.1}Model Library Building via Guided Parameter Searching}{15}{subsubsection.4.4.1}}
\@writefile{toc}{\contentsline {subsubsection}{\numberline {4.4.2}Model Weight Optimization}{15}{subsubsection.4.4.2}}
\@writefile{lot}{\contentsline {table}{\numberline {6}{\ignorespaces Model Library}}{16}{table.6}}
\newlabel{tab:Model_Library}{{6}{16}{Model Library\relax }{table.6}{}}
\@writefile{toc}{\contentsline {subsubsection}{\numberline {4.4.3}Randomized Ensemble Selection}{16}{subsubsection.4.4.3}}
\@writefile{toc}{\contentsline {section}{\numberline {5}Code Description}{16}{section.5}}
\@writefile{toc}{\contentsline {subsection}{\numberline {4.3}Sample Weighting}{16}{subsection.4.3}}
\@writefile{toc}{\contentsline {subsection}{\numberline {4.4}Ensemble Selection}{16}{subsection.4.4}}
\@writefile{toc}{\contentsline {subsubsection}{\numberline {4.4.1}Model Library Building via Guided Parameter Searching}{16}{subsubsection.4.4.1}}
\@writefile{toc}{\contentsline {subsubsection}{\numberline {4.4.2}Model Weight Optimization}{16}{subsubsection.4.4.2}}
\@writefile{lot}{\contentsline {table}{\numberline {7}{\ignorespaces Model Library}}{17}{table.7}}
\newlabel{tab:Model_Library}{{7}{17}{Model Library\relax }{table.7}{}}
\@writefile{toc}{\contentsline {subsubsection}{\numberline {4.4.3}Randomized Ensemble Selection}{17}{subsubsection.4.4.3}}
\@writefile{toc}{\contentsline {section}{\numberline {5}Code Description}{17}{section.5}}
\citation{NLTK_Cookbook}
\@writefile{lof}{\contentsline {figure}{\numberline {2}{\ignorespaces CV mean, Public LB, and Private LB scores of our 35 best Public LB submissions generated with randomized ensemble selection. One standard deviation of the CV score is plotted via error bar.}}{17}{figure.2}}
\newlabel{fig:CV_Public_Private}{{2}{17}{CV mean, Public LB, and Private LB scores of our 35 best Public LB submissions generated with randomized ensemble selection. One standard deviation of the CV score is plotted via error bar}{figure.2}{}}
\@writefile{toc}{\contentsline {subsection}{\numberline {5.1}Setting}{17}{subsection.5.1}}
\@writefile{toc}{\contentsline {subsection}{\numberline {5.2}Feature}{17}{subsection.5.2}}
\@writefile{toc}{\contentsline {subsection}{\numberline {5.3}Model}{18}{subsection.5.3}}
\@writefile{toc}{\contentsline {section}{\numberline {6}Dependencies}{19}{section.6}}
\@writefile{lof}{\contentsline {figure}{\numberline {3}{\ignorespaces CV mean, Public LB, and Private LB scores of our 35 best Public LB submissions generated with randomized ensemble selection. One standard deviation of the CV score is plotted via error bar.}}{18}{figure.3}}
\newlabel{fig:CV_Public_Private}{{3}{18}{CV mean, Public LB, and Private LB scores of our 35 best Public LB submissions generated with randomized ensemble selection. One standard deviation of the CV score is plotted via error bar}{figure.3}{}}
\@writefile{toc}{\contentsline {subsection}{\numberline {5.1}Setting}{18}{subsection.5.1}}
\@writefile{toc}{\contentsline {subsection}{\numberline {5.2}Feature}{18}{subsection.5.2}}
\@writefile{toc}{\contentsline {subsection}{\numberline {5.3}Model}{19}{subsection.5.3}}
\@writefile{toc}{\contentsline {section}{\numberline {6}Dependencies}{20}{section.6}}
\citation{wmd}
\@writefile{toc}{\contentsline {section}{\numberline {7}How To Generate the Solution (aka README file)}{20}{section.7}}
\@writefile{toc}{\contentsline {section}{\numberline {8}Additional Comments and Observations}{20}{section.8}}
\@writefile{toc}{\contentsline {section}{\numberline {9}Simple Features and Methods}{20}{section.9}}
\@writefile{toc}{\contentsline {section}{\numberline {7}How To Generate the Solution (aka README file)}{21}{section.7}}
\@writefile{toc}{\contentsline {section}{\numberline {8}Additional Comments and Observations}{21}{section.8}}
\@writefile{toc}{\contentsline {section}{\numberline {9}Simple Features and Methods}{21}{section.9}}
\bibstyle{plain}
\bibdata{reference}
\bibcite{owen}{1}
@@ -103,4 +107,4 @@
\bibcite{ensemble_selection}{8}
\bibcite{NLTK_Cookbook}{9}
\bibcite{cocr}{10}
\@writefile{toc}{\contentsline {section}{\numberline {10}Acknowledgement}{21}{section.10}}
\@writefile{toc}{\contentsline {section}{\numberline {10}Acknowledgement}{22}{section.10}}
58 changes: 32 additions & 26 deletions Doc/Kaggle_CrowdFlower_ChenglongChen.log
@@ -1,4 +1,4 @@
This is pdfTeX, Version 3.1415926-2.3-1.40.12 (MiKTeX 2.9) (preloaded format=pdflatex 2013.11.4) 13 JUL 2015 07:58
This is pdfTeX, Version 3.1415926-2.3-1.40.12 (MiKTeX 2.9) (preloaded format=pdflatex 2013.11.4) 17 JUL 2015 01:37
entering extended mode
**Kaggle_CrowdFlower_ChenglongChen.tex
(F:\CrowdFolwer\cleanup\Doc\Kaggle_CrowdFlower_ChenglongChen.tex
@@ -813,51 +813,57 @@ Overfull \hbox (5.09244pt too wide) in paragraph at lines 401--402
-tion, e.g., \OT1/cmtt/m/n/10.95 LogisticRegression
[]

[13] [14] [15]
<./compare_MSE_Decoding.pdf, id=359, 666.69077pt x 574.54652pt>
File: ./compare_MSE_Decoding.pdf Graphic file (type pdf)

<use ./compare_MSE_Decoding.pdf>
Package pdftex.def Info: ./compare_MSE_Decoding.pdf used on input line 414.
(pdftex.def) Requested size: 422.77664pt x 364.34204pt.
[13] [14 <F:/CrowdFolwer/cleanup/Doc/compare_MSE_Decoding.pdf>] [15] [16]
PGFPlots: reading {35lb_subs.txt}
[16]
Overfull \hbox (10.74371pt too wide) in paragraph at lines 555--556
[17]
Overfull \hbox (10.74371pt too wide) in paragraph at lines 583--584
\OT1/cmtt/m/n/10.95 ./Data\OT1/cmr/m/n/10.95 , i.e., \OT1/cmtt/m/n/10.95 strati
fiedKFold.query.pkl \OT1/cmr/m/n/10.95 and \OT1/cmtt/m/n/10.95 stratifiedKFold.
relevance.pkl\OT1/cmr/m/n/10.95 .
[]

[17]
Overfull \hbox (0.73491pt too wide) in paragraph at lines 565--566
[18]
Overfull \hbox (0.73491pt too wide) in paragraph at lines 593--594
[]\OT1/cmr/bx/n/10.95 combine[]feat[][LSA[]and[]stats[]feat[]Jun09][][Low].py\O
T1/cmr/m/n/10.95 : This file gen-er-ates one
[]

[18]
Overfull \hbox (1.45724pt too wide) in paragraph at lines 592--593
[19]
Overfull \hbox (1.45724pt too wide) in paragraph at lines 620--621
\OT1/cmr/m/n/10.95 pa. It is adopt-ed from []$\OT1/cmtt/m/n/10.95 https : / / g
ithub . com / benhamner / Metrics / tree / master /$
[]


Overfull \hbox (32.26485pt too wide) in paragraph at lines 613--614
Overfull \hbox (32.26485pt too wide) in paragraph at lines 641--642
[]\OT1/cmr/m/n/10.95 XGBoost-0.4.0 (Win-dows Ex-e-cutable, []$\OT1/cmtt/m/n/10.
95 https : / / github . com / dmlc / XGBoost / releases /$
[]


Overfull \hbox (11.5833pt too wide) in paragraph at lines 614--615
Overfull \hbox (11.5833pt too wide) in paragraph at lines 642--643
[]\OT1/cmr/m/n/10.95 ml[]metrics ([]$\OT1/cmtt/m/n/10.95 https : / / github . c
om / benhamner / Metrics / tree / master / Python / ml _$
[]


Overfull \hbox (12.11351pt too wide) in paragraph at lines 618--619
Overfull \hbox (12.11351pt too wide) in paragraph at lines 646--647
[]\OT1/cmr/m/n/10.95 rgf1.2 (Win-dows Ex-e-cutable, []$\OT1/cmtt/m/n/10.95 http
: / / stat . rutgers . edu / home / tzhang / software /$
[]


Overfull \hbox (2.8642pt too wide) in paragraph at lines 622--622
Overfull \hbox (2.8642pt too wide) in paragraph at lines 650--650
[]\OT1/cmr/bx/n/17.28 How To Gen-er-ate the So-lu-tion (a-ka README
[]

[19] [20] (F:\CrowdFolwer\cleanup\Doc\Kaggle_CrowdFlower_ChenglongChen.bbl
[20] [21] (F:\CrowdFolwer\cleanup\Doc\Kaggle_CrowdFlower_ChenglongChen.bbl
Overfull \hbox (49.59592pt too wide) in paragraph at lines 4--5
[][]$\OT1/cmtt/m/n/10.95 http : / / nycdatascience . com / featured-[]talk-[]1-
[]kaggle-[]data-[]scientist-[]owen-[]zhang/$[]\OT1/cmr/m/n/10.95 .
Expand All @@ -882,12 +888,12 @@ ication / forums / t / 13863 /$
[]

)
Package atveryend Info: Empty hook `BeforeClearDocument' on input line 659.
[21]
Package atveryend Info: Empty hook `AfterLastShipout' on input line 659.
Package atveryend Info: Empty hook `BeforeClearDocument' on input line 687.
[22]
Package atveryend Info: Empty hook `AfterLastShipout' on input line 687.
(F:\CrowdFolwer\cleanup\Doc\Kaggle_CrowdFlower_ChenglongChen.aux)
Package atveryend Info: Executing hook `AtVeryEndDocument' on input line 659.
Package atveryend Info: Executing hook `AtEndAfterFileList' on input line 659.
Package atveryend Info: Executing hook `AtVeryEndDocument' on input line 687.
Package atveryend Info: Executing hook `AtEndAfterFileList' on input line 687.


Package rerunfilecheck Warning: File `Kaggle_CrowdFlower_ChenglongChen.out' has
Expand All @@ -901,10 +907,10 @@ t':
(rerunfilecheck) After: 5835D4BCA4B1BF337073CA56FA26B04F;3336.
)
Here is how much of TeX's memory you used:
21915 strings out of 495354
471649 string characters out of 3183859
685259 words of memory out of 3000000
24515 multiletter control sequences out of 15000+200000
21925 strings out of 495354
471854 string characters out of 3183859
685519 words of memory out of 3000000
24521 multiletter control sequences out of 15000+200000
24220 words of font info for 93 fonts, out of 3000000 for 9000
14 hyphenation exceptions out of 8191
63i,19n,114p,722b,1949s stack positions out of 5000i,500n,10000p,200000b,50000s
Expand All @@ -922,10 +928,10 @@ m/cmsy10.pfb><D:/CTEX/MiKTeX/fonts/type1/public/amsfonts/cm/cmti10.pfb><D:/CTEX
/MiKTeX/fonts/type1/public/amsfonts/cm/cmti8.pfb><D:/CTEX/MiKTeX/fonts/type1/pu
blic/amsfonts/cm/cmtt10.pfb><D:/CTEX/MiKTeX/fonts/type1/public/amsfonts/cm/cmtt
12.pfb>
Output written on Kaggle_CrowdFlower_ChenglongChen.pdf (21 pages, 393463 bytes)
Output written on Kaggle_CrowdFlower_ChenglongChen.pdf (22 pages, 417452 bytes)
.
PDF statistics:
545 PDF objects out of 1000 (max. 8388607)
124 named destinations out of 1000 (max. 500000)
386 words of extra memory for PDF output out of 10000 (max. 10000000)
590 PDF objects out of 1000 (max. 8388607)
127 named destinations out of 1000 (max. 500000)
391 words of extra memory for PDF output out of 10000 (max. 10000000)

Binary file modified Doc/Kaggle_CrowdFlower_ChenglongChen.pdf
Binary file not shown.
Binary file modified Doc/Kaggle_CrowdFlower_ChenglongChen.synctex.gz
Binary file not shown.
36 changes: 32 additions & 4 deletions Doc/Kaggle_CrowdFlower_ChenglongChen.tex
@@ -297,10 +297,10 @@ \subsubsection{Basic TF-IDF Features}
\item \textbf{Individual SVD}\\
We fit an SVD transformer for the TF-IDF vectors of $\{q_i, t_i, d_i\}$, separately.
\end{itemize}
\item \textbf{Cosine Similarity Based on SVD Reduced Features}\\
\item \textbf{Basic Cosine Similarity Based on SVD Reduced Features}\\
We computed cosine similarity based on SVD reduced features (using common SVD); a minimal Python sketch follows this list.
\item \textbf{Statistical Cosine Similarity Based on SVD Reduced Features}\\
We computed statistical cosine similarity based on SVD reduced features.
We computed statistical cosine similarity based on SVD reduced features as in Sec. \ref{subsubsec:Statistical_Distance_Features}.
\end{itemize}
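The following is a minimal Python sketch of the basic cosine similarity feature, assuming hypothetical lists of strings queries, titles, and descriptions (aligned per sample); the vectorizer settings and the number of SVD components are illustrative, not the exact values used in our solution.

    import numpy as np
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer

    # queries, titles, descriptions: hypothetical lists of raw text strings,
    # aligned so that row i of each list belongs to the same sample.
    corpus = queries + titles + descriptions

    tfidf = TfidfVectorizer(ngram_range=(1, 3))              # common vocabulary
    tfidf.fit(corpus)

    svd = TruncatedSVD(n_components=100, random_state=2015)  # common SVD
    svd.fit(tfidf.transform(corpus))

    Qr = svd.transform(tfidf.transform(queries))
    Tr = svd.transform(tfidf.transform(titles))

    def rowwise_cosine(a, b):
        # Cosine similarity between corresponding rows of a and b.
        num = np.sum(a * b, axis=1)
        den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-12
        return num / den

    query_title_sim = rowwise_cosine(Qr, Tr)  # one feature per sample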
\subsubsection{Cooccurrence TF-IDF Features}
\label{subsubsec:Cooccurrence_TFIDF_Features}
@@ -403,7 +403,35 @@ \subsubsection{Classification}
\subsubsection{Regression}
Classification takes into account neither the weight $w_{i,j}$ in $\kappa$ nor the magnitude of the rating. Given the form of $w_{i,j}$, it is natural to apply regression (with mean-squared-error, MSE) to predict the relevance score. In the prediction phase, we can convert the raw prediction score to $\{1,2,3,4\}$ following steps 2-4 in Sec. \ref{subsubsec:Classification}.

It turns out that MSE is the best objective among all the alternatives we have tried during the competition. For this reason, we mostly used regression to predict \texttt{median\_relevance}.
Figure \ref{fig:MSE_decoding} shows histograms from our reproduced best single model for one run of CV (only one validation fold is used). Specifically, we plot histograms of 1) the raw prediction, 2) rounding decoding, 3) ceiling decoding, and 4) the CDF decoding described above, grouped by the true relevance. It is evident that both the rounding and ceiling decoding methods have difficulty predicting relevance 4.

Table \ref{tab:MSE_decoding} shows the kappa scores for each decoding method (using all 3 runs of 3-fold CV). The CDF decoding method exhibits the best performance among the three methods considered.

It turns out that MSE (with the above decoding method) is the best objective among all the alternatives we have tried during the competition. For this reason, we mostly used regression to predict \texttt{median\_relevance}.
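For concreteness, here is a minimal Python sketch of the three decoding rules, assuming a NumPy array pred of raw regression outputs and an integer array train_relevance of training labels in {1,2,3,4} (both names hypothetical).

    import numpy as np

    def rounding_decode(pred):
        # Round to the nearest integer rating, clipped to the valid range.
        return np.clip(np.round(pred), 1, 4).astype(int)

    def ceiling_decode(pred):
        # Take the ceiling instead of the nearest integer.
        return np.clip(np.ceil(pred), 1, 4).astype(int)

    def cdf_decode(pred, train_relevance):
        # Cut the sorted raw predictions at the empirical CDF of the training
        # labels, so the decoded ratings match the training distribution.
        counts = np.bincount(train_relevance, minlength=5)[1:]
        cdf = np.cumsum(counts) / counts.sum()
        cutoffs = (cdf * len(pred)).round().astype(int)
        order = np.argsort(pred)
        decoded = np.empty(len(pred), dtype=int)
        prev = 0
        for rating, cut in enumerate(cutoffs, start=1):
            decoded[order[prev:cut]] = rating
            prev = cut
        return decoded

Note that CDF decoding assumes the test ratings follow roughly the same distribution as the training ratings; when that holds, it avoids the collapse of high ratings visible in the rounding and ceiling histograms.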

\begin{figure}[t]
\centering
\includegraphics[width=0.9\textwidth]{./compare_MSE_Decoding.pdf}
\caption{Histograms of raw prediction and predictions using various decoding methods grouped by true relevance.}
\label{fig:MSE_decoding}
\end{figure}

\begin{table}[t]
\centering
\caption{Performance of various decoding methods for MSE objective.}
\label{tab:MSE_decoding}
\begin{tabular}{|c|c|c|}
\hline
Method & CV Mean & CV Std \\
\hline
Rounding & 0.404277 & 0.005069\\
\hline
Ceiling & 0.513138 & 0.006485\\
\hline
CDF & \textcolor{red}{0.681876} & 0.005259\\
\hline
\end{tabular}
\end{table}

\subsubsection{Pairwise Ranking}
We tried pairwise ranking (LambdaMART) within XGBoost, but did not obtain acceptable performance (it was worse than softmax).
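For reference, a minimal sketch of such an experiment with the XGBoost Python API, assuming a feature matrix X, relevance labels y, and a query id array qid with rows sorted by query (all names hypothetical):

    import numpy as np
    import xgboost as xgb

    dtrain = xgb.DMatrix(X, label=y)
    # Pairwise ranking needs group sizes: the number of consecutive
    # rows belonging to each query (rows must be sorted by qid).
    _, group_sizes = np.unique(qid, return_counts=True)
    dtrain.set_group(group_sizes.tolist())

    params = {"objective": "rank:pairwise", "eta": 0.1, "max_depth": 6}
    bst = xgb.train(params, dtrain, num_boost_round=200)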
@@ -491,7 +519,7 @@ \subsubsection{Model Weight Optimization}
In the original ensemble selection algorithm, a model is added to the ensemble with a hard weight of 1. However, this does not guarantee the best performance. We modified the algorithm to optimize the weight of each model as it is added to the ensemble. The weight is optimized with Hyperopt as well. This gives better performance than the hard weight of 1 in our preliminary comparison.
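A minimal sketch of the per-model weight search with Hyperopt, assuming arrays ens_pred (current ensemble prediction) and cand_pred (candidate model prediction), labels y, and decode/kappa functions defined elsewhere (all names hypothetical):

    from hyperopt import fmin, tpe, hp

    def objective(w):
        # Blend the candidate into the ensemble with weight w and score it;
        # fmin minimizes, so we negate the kappa score.
        blended = (1.0 - w) * ens_pred + w * cand_pred
        return -kappa(y, decode(blended))

    best = fmin(fn=objective,
                space=hp.uniform("w", 0.0, 1.0),
                algo=tpe.suggest,
                max_evals=100)  # best is a dict, e.g. {"w": 0.37}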

\subsubsection{Randomized Ensemble Selection}
The final method we used to generate the winning solution actually forgoes model weight optimization. Instead, we replaced weight optimization with a \textbf{random weight}. This is inspired by \texttt{ExtraTreesRegressor} and aims to reduce the model variance (or the risk of overfitting).
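A minimal sketch of one greedy pass with random weights, assuming a list preds of out-of-fold prediction arrays from the model library, labels y, and the decode/kappa functions above (bagging of the library, used in the full algorithm, is omitted for brevity):

    import numpy as np

    rng = np.random.RandomState(2015)

    def randomized_ensemble_selection(preds, y, n_iter=100):
        ensemble, total_w = np.zeros_like(preds[0]), 0.0
        for _ in range(n_iter):
            best_score, best_pick = -np.inf, None
            for p in preds:
                w = rng.uniform(0.0, 1.0)  # random weight, not optimized
                cand = (total_w * ensemble + w * p) / (total_w + w)
                score = kappa(y, decode(cand))
                if score > best_score:
                    best_score, best_pick = score, (p, w)
            p, w = best_pick
            ensemble = (total_w * ensemble + w * p) / (total_w + w)
            total_w += w
        return ensemble

Randomizing the weights trades a little greedy optimality for lower variance across runs, in the same spirit as the randomized splits in ExtraTreesRegressor.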

Figure \ref{fig:CV_Public_Private} shows the CV mean, Public LB, and Private LB scores of our 35 best Public LB submissions generated with this method. As shown, the CV score is correlated with both the Public LB and Private LB scores, and more strongly with the latter. As the competition progressed, we trained more and more diverse models, which proved helpful for ensemble selection in both CV and Private LB (as shown in Figure \ref{fig:CV_Public_Private}).

15 changes: 11 additions & 4 deletions Doc/Kaggle_CrowdFlower_ChenglongChen.tex.bak
@@ -297,10 +297,10 @@ We first concatenated the TF-IDF vectors of $\{q_i, t_i, d_i\}$ (using common vo
\item \textbf{Individual SVD}\\
We fit an SVD transformer for the TF-IDF vectors of $\{q_i, t_i, d_i\}$, separately.
\end{itemize}
\item \textbf{Cosine Similarity Based on SVD Reduced Features}\\
\item \textbf{Basic Cosine Similarity Based on SVD Reduced Features}\\
We computed cosine similarity based on SVD reduced features (using common SVD).
\item \textbf{Statistical Cosine Similarity Based on SVD Reduced Features}\\
We computed statistical cosine similarity based on SVD reduced features.
We computed statistical cosine similarity based on SVD reduced features as in Sec. \ref{subsubsec:Statistical_Distance_Features}.
\end{itemize}
\subsubsection{Cooccurrence TF-IDF Features}
\label{subsubsec:Cooccurrence_TFIDF_Features}
@@ -405,6 +405,13 @@ Classification doesn't take into account the weight $w_{i,j}$ in $\kappa$, and t

It turns out that MSE is the best objective among all the alternatives we have tried during the competition. For this reason, we mostly used regression to predict \texttt{median\_relevance}.

\begin{figure}[!htb]
\centering
\includegraphics[width=0.9\textwidth]{./FlowChart.pdf}
\caption{The flowchart of our method.}
\label{fig:Flowchart}
\end{figure}

\subsubsection{Pairwise Ranking}
We have tried pairwise ranking (LambdaMart) within XGBoost, but didn't obtain acceptable performance (it was worse than softmax).

@@ -491,11 +498,11 @@ RGF & \multicolumn{2}{c|}{Regression} & Low & No
In the original ensemble selection algorithm, a model is added to the ensemble with a hard weight of 1. However, this does not guarantee the best performance. We modified the algorithm to optimize the weight of each model as it is added to the ensemble. The weight is optimized with Hyperopt as well. This gives better performance than the hard weight of 1 in our preliminary comparison.

\subsubsection{Randomized Ensemble Selection}
The final method we used to generate the winning solution actually forgoes model weight optimization. Instead, we replaced weight optimization with a \textbf{random weight}. This is inspired by \texttt{ExtraTreesRegressor} and aims to reduce the model variance (or the risk of overfitting).

Figure \ref{fig:CV_Public_Private} shows the CV mean, Public LB, and Private LB scores of our 35 best Public LB submissions generated with this method. As shown, the CV score is correlated with both the Public LB and Private LB scores, and more strongly with the latter. As the competition progressed, we trained more and more diverse models, which proved helpful for ensemble selection in both CV and Private LB (as shown in Figure \ref{fig:CV_Public_Private}).

The winning solution that scored \textbf{0.70807} on Public LB and \textbf{0.72189} on Private LB is just a median ensemble of these 35 best Public LB submissions.
The winning submission that scored \textbf{0.70807} on Public LB and \textbf{0.72189} on Private LB is just a median ensemble of these 35 best Public LB submissions.

\begin{figure}[t]
\centering