Data Science + (-Data Science) = The Formula of Participating in A Data Science Competition

The title is an oversimplification of what I wish someone had told me before I participated in a data science competition back in 2020. Then again, had I known beforehand, I might have decided not to participate, never learned what matters most about being a data scientist, and never written about those learnings in this article. This is the story of my one-time experience in a data science competition, where I learned that rocking it actually took the opposite of what I understood data science to be back then; hence, negative data science.

How it started

I was working on my undergrad thesis in the lab as usual when news of the pandemic and lockdown came. I went home, thinking it would only be two weeks until I could resume my thesis. Then another two weeks came. And another month.

Brainstorming and cramming during lockdown was hard, so we turned to Discord to cram together virtually while listening to songs 😬

The Formula

We were asked to apply data science techniques to generate solutions for problems in ASEAN related to the UN SDGs, and to present them in a deck. So here comes learning (variable) number one:

1. A data scientist is only as good as their ability to translate real-world problems into problems that data science can solve.

You probably already know this, but it is crucial. Almost all problems can be solved with data if you know where to look for the data and what to do with it. We wanted to pick a problem that we both actually cared about, so we started by identifying environmental problems. Since there were a lot of environmental issues in ASEAN and it would be inefficient to try solving everything in a single take, we chose what was most relevant and close to us: rivers. Wayan and I both lived near rivers, and both rivers were infamous for their bad impact on the environment.

2. The ability to crawl and pick the relevant data, then get any kind of insight from it, is just as important.

The crawling part is probably not relatable for data science practitioners in industries where the data most likely comes from internal or external partners' databases. But choosing the relevant data and generating insight from it is a universally important skill that we should always keep sharpening. Of course, we can (almost) always set up a null hypothesis and run a statistical analysis to check whether the data is truly relevant. But before that, domain knowledge can really help us avoid blind trial and error.
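To make the null hypothesis idea a bit more concrete, here is a minimal sketch of one way to test relevance: a permutation test on the difference in means between two groups. This is not what we ran in the competition. The function name `permutation_p_value`, the two groups, and all the numbers are made up purely for illustration.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Null hypothesis: the group labels do not matter, so shuffling them
    should produce differences as large as the observed one fairly often.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1

    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Made-up indicator values for two hypothetical groups of cities.
cities_near_polluted_river = [7.1, 6.8, 7.4, 6.9, 7.3, 7.0]
cities_near_clean_river = [5.2, 5.9, 5.5, 6.0, 5.4, 5.7]

p = permutation_p_value(cities_near_polluted_river, cities_near_clean_river)
print(f"p-value: {p:.4f}")
```

A small p-value only says the difference is unlikely under random labelling. Whether the indicator is actually meaningful for the problem is where domain knowledge comes back in.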

The title slide of the deck that we submitted for the competition.

3. Do not be afraid of getting your hands on 'dirty' data.

As in the previous example, never be afraid to extract data from formats that are unconventional to you. For industry practitioners, this could apply to uncleaned data with no governance yet. Preparing and preprocessing data is said to be one of the most important tasks of a data scientist, and that couldn't be more true. Extracting data from a PDF poster might sound challenging, but it was nothing compared to picking numerical data out of paragraphs in a report. On top of that, in the competition we were only allowed to use SAP Analytics to do the analytics and generate the graphs and charts, so the easiest way to do that was to transform the scrambled data into Excel files, no matter how much effort it took.
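If you face the same "numbers buried in paragraphs" problem today, part of it can be scripted. Below is a rough sketch, assuming the numbers appear next to a small set of known units. The sample sentence, the unit list, and the helper names `extract_measurements` and `to_csv` are all hypothetical, and real reports would still need manual checking. The output is CSV text, which Excel can open.

```python
import csv
import io
import re

# A number (with optional thousands separators or decimals),
# optionally followed by one of a few units we expect to see.
NUMBER_WITH_UNIT = re.compile(
    r"\b(?P<value>\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?)"
    r"\s*(?P<unit>%|mg/L|km|tons?)?"
)


def extract_measurements(text):
    """Return a list of {"value": float, "unit": str} found in the text."""
    rows = []
    for match in NUMBER_WITH_UNIT.finditer(text):
        value = float(match.group("value").replace(",", ""))
        rows.append({"value": value, "unit": match.group("unit") or ""})
    return rows


def to_csv(rows):
    """Serialise the extracted rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["value", "unit"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up report paragraph, for illustration only.
sample = (
    "The river is 120.5 km long. Roughly 1,250 tons of waste enter it "
    "each year, and 40% of samples exceeded 3.2 mg/L."
)

rows = extract_measurements(sample)
print(to_csv(rows))
```

The regex only captures the value and unit, not what the number describes, so someone still has to label each row by reading the surrounding sentence.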

4. Be creative in engineering the features.

This might not look too different from point 2, because feature engineering in this context means trying to get insight anyway. But what I would like to highlight here is to always experiment like crazy. Nothing is off limits: create a crazy hypothesis, then prove whether it's right or wrong. Though again, having some domain knowledge obviously helps and gives you the background or a sense of whether the crazy hypothesis is worth proving. Wayan and I started from a (rather) crazy hypothesis when we found a high correlation between river health criteria and city sustainability aspects.
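One cheap way to hunt for "crazy" hypotheses like this is to screen every pairing of two groups of features by correlation and look at the strongest ones first. The sketch below is not our competition workflow (we were limited to SAP Analytics). The feature names and values are invented for illustration, and `pearson` and `rank_pairs` are hypothetical helper names.

```python
from itertools import product
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def rank_pairs(river_features, city_features):
    """Score every (river, city) feature pair, strongest correlation first."""
    scores = [
        (river_name, city_name, pearson(river_vals, city_vals))
        for (river_name, river_vals), (city_name, city_vals)
        in product(river_features.items(), city_features.items())
    ]
    return sorted(scores, key=lambda s: abs(s[2]), reverse=True)


# Made-up values for five hypothetical cities, for illustration only.
river_health = {
    "dissolved_oxygen": [6.1, 5.4, 4.8, 3.9, 3.1],
    "turbidity": [12, 18, 25, 31, 40],
}
city_sustainability = {
    "green_space_pct": [30, 26, 21, 17, 12],
    "population_density": [4.0, 5.5, 5.1, 7.2, 8.0],
}

ranked = rank_pairs(river_health, city_sustainability)
for river_name, city_name, r in ranked:
    print(f"{river_name} vs {city_name}: r = {r:+.2f}")
```

A high correlation is only a starting point for a hypothesis, not proof of it, which is exactly why domain knowledge matters when deciding which pairs are worth chasing.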

One of the slides in our submitted deck

5. Always try to generalise the data science concepts and communicate them in high-level language.

This is the most important one, a lesson that I learned the hard way and that ran contrary to my understanding of data science at the time. When we won the national round and got to represent our country at the regional level, we thought that the way to scale up our work was to use a more sophisticated data science technique. We made our solution look advanced without explaining in detail what we were doing. That was our biggest mistake, and how we learned this negative data science variable the hard way.

The Result

Summing up, the 5 variables (learnings) are:

  1. A data scientist is only as good as their ability to translate real-world problems into problems that data science can solve.
  2. The ability to crawl and pick the relevant data, then get any kind of insight from it, is just as important.
  3. Do not be afraid of getting your hands on 'dirty' data.
  4. Be creative in engineering the features.
  5. Always try to generalise the data science concepts and communicate them in high-level language.

Wayan and I in our traditional clothing for the regional finals 😁

Thank you for reading! I hope you found this helpful :)

Here's the link to the competition:
