Simple steps to start a data science competition (Bronze | Competition record of SIGNATE Student Cup 2023, 中古車の価格予測チャレンジ / Used Car Price Prediction Challenge)

date

Aug 25, 2023

Intro

I recently took part in my first serious data science competition. Although I have a deep learning research background and have taken a data science course, I still felt confused about where to start when dealing with real-world data. When I asked others for a starting point, I received answers such as "it depends on the data" and "just start with your intuition." If you are getting the same kind of answers and don't know how to start a competition, here is my experience.

Looking at my scores by date, I would separate my progress into four phases.

Rookie: Following my original "intuition"

Basic: Using the model taught in university

Competition entry level: Using an advanced framework (LightGBM) and systematizing data cleaning

Bronze: More data engineering, such as data distillation

I describe each phase in detail below. If you are not interested, you can go directly to the conclusion for the procedure I suggest, which should let you skip the rookie and basic periods and start at the competition entry level.

Rookie

During this period, I had no idea how to handle the data. I filled every missing value with zero and preprocessed the data by intuition: I normalized all continuous features into the range -1 to 1, one-hot encoded the discrete features with fewer than 5 unique values, and used feature hashing (hash embedding) for the other discrete features. After spending a lot of time on this inefficient feature engineering, I fed the data into various basic models with default settings, such as a neural network, linear regression, a decision tree, and a random forest. Unsurprisingly, the results were poor no matter how I tweaked the hyperparameters.

Basic

Something seemed wrong, but I couldn't tell what. I had done a lot of work, but was any of it necessary? To find out, I needed a baseline to compare against. So I fit a decision tree with minimal data cleaning (only filling the missing values with 0) and got a much better score than with the feature engineering based on my "intuition." It's fair to say that I wasted my time during the rookie period.
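A minimal sketch of such a baseline is shown below. The toy table, the column names, and the error metric are all hypothetical stand-ins, not the competition data or its official metric. Turning text columns into integer codes is also an assumption: scikit-learn trees need numeric input, and the exact encoding used for the original baseline is not recorded here.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor


def make_baseline_features(df: pd.DataFrame) -> pd.DataFrame:
    """Minimal preprocessing: integer-code text columns, fill the rest with 0."""
    out = df.copy()
    for col in out.select_dtypes(exclude="number").columns:
        # Assumption: plain integer codes; missing categories become -1.
        out[col] = out[col].astype("category").cat.codes
    return out.fillna(0)


# Hypothetical toy data standing in for the competition table.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "year": rng.integers(1995, 2023, size=n).astype(float),
    "odometer": rng.uniform(1e3, 2e5, size=n),
    "manufacturer": rng.choice(["toyota", "ford", "bmw"], size=n),
})
toy.loc[::10, "odometer"] = np.nan  # simulate missing values
price = 500.0 * (toy["year"] - 1990) + rng.normal(0, 1000, size=n)

X = make_baseline_features(toy)
baseline = DecisionTreeRegressor(random_state=0)
# The metric is a placeholder; swap in the one your competition uses.
scores = cross_val_score(baseline, X, price, cv=5,
                         scoring="neg_mean_absolute_error")
print("baseline MAE per fold:", -scores)
```

Any later preprocessing idea can then be judged by whether it beats these cross-validated numbers.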

Competition Entry Level

With the baseline in place, I gave up the hash embedding and the normalization, since a decision tree is insensitive to feature scale. My goal was to make minimal changes to the original data. I checked the features carefully, column by column, and found some weird values. For example, some records were from the year 2999; the manufacturer feature contained 'toyota', 'ｔｏｙｏｔａ', and 'ＴＯＹＯＴＡ', which were supposed to be the same category; and there was even an apparently non-existent brand, 'ᴄhrysler', spelled with a look-alike character in place of the plain 'c'. Also, in the size feature, 'full−size' and 'fullーsize' existed at the same time. These dirty values hurt the model badly.
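The kind of cleanup described above could be sketched with the standard library alone. This is an illustration under assumptions, not the exact code used in the competition: the code points in the look-alike table are my reading of the examples, 'full-size' with an ASCII hyphen is assumed to be the intended spelling, and the year bounds (1900 and 2023) are hypothetical.

```python
import unicodedata

# Look-alike characters that NFKC normalization does not fold to ASCII.
# Extend this table with whatever else shows up in your own data.
LOOKALIKES = str.maketrans({
    "\u1d04": "c",  # small capital C, as in the odd 'chrysler' spelling
    "\u2212": "-",  # minus sign used instead of a hyphen
    "\u30fc": "-",  # katakana prolonged sound mark used instead of a hyphen
    "\u2010": "-",  # hyphen
    "\u2013": "-",  # en dash
})


def clean_category(value: str) -> str:
    """Fold full-width forms, look-alike characters, case, and padding."""
    text = unicodedata.normalize("NFKC", value)  # full-width -> ASCII
    return text.translate(LOOKALIKES).strip().lower()


def is_plausible_year(year: int, earliest: int = 1900, latest: int = 2023) -> bool:
    """Range check; both bounds are assumptions for this sketch."""
    return earliest <= year <= latest


raw = ["toyota", "\uff54\uff4f\uff59\uff4f\uff54\uff41",
       "\uff34\uff2f\uff39\uff2f\uff34\uff21", "\u1d04hrysler"]
print(sorted({clean_category(v) for v in raw}))
print(clean_category("full\u2212size") == clean_category("full\u30fcsize"))
print(is_plausible_year(2999))
```

Printing the sorted unique values of each categorical column before and after cleaning is a quick way to spot variants the table does not cover yet.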

In addition to cleaning the data, I found some ML frameworks that are commonly used in data science competitions, namely LightGBM and XGBoost. I fed the cleaned data into these models. However, my rank was still only somewhere between the 60% mark and bronze.

Bronze

I was stuck again, so I searched Kaggle extensively to find out what might be the key to a breakthrough. Two techniques caught my attention.

Target encoding: The idea is to encode each categorical feature with a number derived from the target, such as the mean target value of each category.

Feature selection: Keeping only the features that actually help the model.
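To make target encoding concrete, here is a small sketch of a smoothed mean encoding. The function names, the `smoothing` parameter, and the toy columns are hypothetical, and this is one common variant rather than the specific recipe used in the competition. The encoding should be fitted on training rows only (ideally out of fold), otherwise the target leaks into the features.

```python
import pandas as pd


def fit_target_encoding(train: pd.DataFrame, col: str, target: str,
                        smoothing: float = 10.0):
    """Return a per-category encoding and the global mean used as fallback."""
    prior = train[target].mean()
    stats = train.groupby(col)[target].agg(["mean", "count"])
    # Blend the category mean with the global mean; rare categories are
    # pulled more strongly toward the global mean.
    encoding = ((stats["count"] * stats["mean"] + smoothing * prior)
                / (stats["count"] + smoothing))
    return encoding, prior


def apply_target_encoding(values: pd.Series, encoding: pd.Series,
                          prior: float) -> pd.Series:
    """Map categories to their encoding; unseen categories get the prior."""
    return values.map(encoding).fillna(prior)


# Hypothetical toy data.
train = pd.DataFrame({
    "manufacturer": ["toyota", "toyota", "ford", "ford", "ford", "bmw"],
    "price": [10.0, 20.0, 30.0, 30.0, 30.0, 60.0],
})
encoding, prior = fit_target_encoding(train, "manufacturer", "price",
                                      smoothing=2.0)
test_values = pd.Series(["toyota", "bmw", "tesla"])
print(apply_target_encoding(test_values, encoding, prior).tolist())
```

With a larger `smoothing` value, categories that appear only a few times contribute less of their own (noisy) mean, which tends to reduce overfitting.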

In this period, the score didn't change much, but an improvement of 0.1 can raise your rank by more than 10 places. I first applied target encoding to all the categorical features, which brought my rank to the edge of bronze. Then, by doing backward stepwise selection and removing low-variance features, I was able to raise my rank to the edge of silver.
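The two selection steps mentioned above could look roughly like this with scikit-learn. Everything in the sketch is illustrative: the data is synthetic, the feature names are hypothetical, and `LinearRegression` is only a fast stand-in for whatever model is actually being tuned.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SequentialFeatureSelector, VarianceThreshold
from sklearn.linear_model import LinearRegression

# Synthetic data: two informative features, two noise features, one constant.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "signal_a": rng.normal(size=n),
    "signal_b": rng.normal(size=n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
    "constant": np.ones(n),
})
y = 3.0 * X["signal_a"] - 2.0 * X["signal_b"] + rng.normal(scale=0.1, size=n)

# Step 1: drop features with zero variance (raise the threshold to be stricter).
variance_filter = VarianceThreshold(threshold=0.0)
variance_filter.fit(X)
kept = X.columns[variance_filter.get_support()]

# Step 2: backward stepwise selection. Start from all remaining features and
# repeatedly remove the one whose removal hurts the cross-validated score least.
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="backward", cv=5
)
selector.fit(X[kept], y)
selected = list(kept[selector.get_support()])
print("kept after variance filter:", list(kept))
print("kept after backward selection:", selected)
```

Backward selection refits the model many times, so on a real dataset it is worth running it with a cheap model or on a subsample first.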

My Opinion

Unfortunately, I didn't have time to implement further improvements such as model stacking. However, this competition helped me grow from a rookie into a beginner in the competition world. Here are some steps that I think can save time for other rookies who want to start their first competition.

Steps to Start Your First Data Science Competition

Here are the simple steps to reach the competition entry level:

Basic cleaning of the data based on data understanding.

Fill the missing values with 0.

One-hot encode all the categorical features.

Check the validity (range) of the data. For example, age should be ≥ 0 and dates should be ≤ the current date.

Try various advanced frameworks. Here are some common popular frameworks.

CatBoost

XGBoost

LightGBM
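The cleaning steps above can be sketched as one small function. The column names, the year bounds, and the ordering of the steps are assumptions made for this example: out-of-range years are marked as missing first, then categorical columns are one-hot encoded, and finally the remaining missing values are filled with 0.

```python
import numpy as np
import pandas as pd


def prepare(df: pd.DataFrame, current_year: int = 2023) -> pd.DataFrame:
    """Range check, one-hot encoding, then fill missing values with 0."""
    out = df.copy()
    # Range check: an impossible year (such as 2999) is treated as missing.
    # The lower bound of 1900 is an arbitrary choice for this sketch.
    out["year"] = out["year"].where(out["year"].between(1900, current_year))
    # One-hot encode every non-numeric column; a missing category simply
    # gets 0 in all of its indicator columns.
    cat_cols = list(out.select_dtypes(exclude="number").columns)
    out = pd.get_dummies(out, columns=cat_cols, dtype=int)
    # Fill whatever is still missing with 0.
    return out.fillna(0)


# Hypothetical raw rows.
raw = pd.DataFrame({
    "year": [2010, 2999, 2015],
    "odometer": [50000.0, np.nan, 80000.0],
    "manufacturer": ["toyota", None, "ford"],
})
prepared = prepare(raw)
print(prepared)
```

The result is fully numeric, so it can be passed to the scikit-learn style interface of any of the frameworks listed above, for example `lightgbm.LGBMRegressor().fit(X, y)`.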

What you can do if you want further improvement:

Try target encoding

Find additional data or articles targeting similar problems

Stacking frameworks

Data distillation

References

Competition link: https://signate.jp/competitions/1051

FeatureHasher (scikit-learn documentation): https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html

Target encoding (Kaggle notebook by ryanholbrook): https://www.kaggle.com/code/ryanholbrook/target-encoding#Target-Encoding