Data Science + (-Data Science) = The Formula of Participating in A Data Science Competition
The title is an oversimplification of what I wish someone had told me before I participated in a data science competition back in 2020. Then again, had I known beforehand, I might have decided not to participate, ended up not learning the most important things about being a data scientist, and never written about those learnings in this article. This is the story of my one-time experience of participating in a data science competition, where I learned that it actually took the opposite of my then-understanding of data science to rock it, hence the negative data science.
How it started
I was working on my undergrad thesis in the lab, as usual, when the news of the pandemic and lockdown eventually came. I went home, thinking it would only be two weeks until I could work on my thesis again. Then another two weeks came. And another month.
At first it really was a good opportunity to take online courses here and there, learning about statistics, data science techniques, and all the sophisticated perks that come along with the term 'data science'. Since I had always been interested in pursuing a career in that field and my thesis was also an implementation of one of its techniques, I didn't mind spending days and nights on the online courses and found the learning enjoyable. But coming from an engineering background, I hadn't studied data science as a field, only its implementation in my own area, so my understanding of and foundation in 'data science' were formed by those online courses. As the months went by and taking online courses started to feel dull, I became exasperated that I couldn't graduate in time, and even more desperate because I couldn't work on my thesis. Naturally, I realised that I needed to look for a different kind of distraction from online courses.
Then one day I found a data science competition poster in a group chat and immediately thought it would be a very different, and very interesting, kind of distraction. I asked two friends to be my partner in the competition and was turned down by both with the reason that they wanted to focus on their theses (only to find out that they participated with someone else in the end), until I finally partnered up with Wayan Rezaldi!
We were asked to implement data science techniques to generate solutions to problems in ASEAN related to the UN SDGs, and to present them in a deck. So here comes learning (variable) number one:
1. A data scientist is as good as how they're able to translate real world problems into problems that could be solved by data science.
You probably already know this, but it is crucial. Almost any problem can be solved with data if you know where to look for the data and what to do with it. We wanted to pick a problem that we both actually cared about, so we started by identifying problems related to the environment. Since there were a lot of environmental issues occurring in ASEAN and it would be inefficient to try to solve everything in a single take, we chose what was most relevant and closest to us: rivers. Both Wayan and I lived near rivers, and both of those rivers were famous for their bad impact on the environment.
We both knew it would be hard to collect environmental data specific to rivers in ASEAN, but it was a risk we were willing to take. That brings us to learning (variable) number two:
2. The ability to crawl and pick the relevant data, then get any kind of insight from it is just as important.
The crawling part is probably not relatable for data science practitioners in industry, where the data usually comes from internal databases or external partners. But choosing the relevant data and generating insight from it is a universally important skill that we should always sharpen. Of course, we can (almost) always formulate a null hypothesis and then run a statistical test to check whether the data is truly relevant. But prior to that, domain knowledge can be really helpful before we trial-and-error anything.
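To make the null-hypothesis idea concrete, here is a minimal sketch of one such relevance check: a permutation test on the correlation between two candidate columns. All the numbers below are made up purely for illustration; they are not from our competition data.

```python
import random
from statistics import mean

def pearson_r(xs, ys):
    # Pearson correlation coefficient between two equal-length sequences.
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def permutation_p_value(xs, ys, n_iter=2000, seed=0):
    # Null hypothesis: xs and ys are unrelated. Shuffle ys repeatedly and
    # count how often the shuffled correlation is at least as extreme
    # as the one we observed.
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    ys = list(ys)  # work on a copy so the caller's data is untouched
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(ys)
        if abs(pearson_r(xs, ys)) >= observed:
            hits += 1
    return hits / n_iter

# Hypothetical toy numbers: a pollution index against a river-health
# score for eight rivers.
pollution = [1, 2, 3, 4, 5, 6, 7, 8]
health = [9, 8, 8, 6, 5, 4, 3, 2]
r = pearson_r(pollution, health)
p = permutation_p_value(pollution, health)
```

A small p-value here says the association is unlikely to be a shuffle artefact, which is a cheap first sanity check before investing effort in a dataset.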
Wayan and I read about three exclusive reports on rivers and about five scientific journal papers on how to improve river health. We were not trying to become river environment experts in a few weeks; we just wanted enough general understanding of what we were working on. Because of those reports and journals, we could track down the related organisations and found data that was really helpful to us. One time, we found a website that described the health of Malaysia's rivers in a table, but displayed it as a single-page PDF poster. Which is why learning (variable) number three is:
3. Do not be afraid of getting your hands on 'dirty' data.
As in the previous example, never be afraid to extract data from forms that are unconventional to you. For industry practitioners, this could apply to uncleaned data with no governance yet. Preparing and preprocessing data is said to be one of the most important tasks of a data scientist, and that couldn't be more true. Extracting data from a PDF poster might sound challenging, but it was nothing compared to picking numerical data out of paragraphs in a report. Not to mention that in the competition we were only allowed to use SAP Analytics to do the analytics and generate the graphs and charts, so the easiest way to do that was to transform the scrambled data into Excel files, no matter how much effort it took.
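As an illustration of how text copied out of a one-page poster can be wrangled into spreadsheet-ready rows: the snippet below splits whitespace-separated lines into structured records and writes them out as CSV. The river names and scores are hypothetical, not the actual Malaysian data.

```python
import csv
import io

# Hypothetical text as it might come out of copy-pasting a PDF poster:
# one river per line, columns separated by arbitrary whitespace.
raw = """\
Sungai A   78  Clean
Sungai B   52  Slightly-polluted
Sungai C   31  Polluted
"""

rows = []
for line in raw.splitlines():
    parts = line.split()
    # Last token is the class, second-to-last the score; everything
    # before that is the (possibly multi-word) river name.
    name = " ".join(parts[:-2])
    rows.append({"river": name, "index": int(parts[-2]), "class": parts[-1]})

# Write out as CSV, a format spreadsheet tools readily import.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["river", "index", "class"])
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()
```

Real poster layouts are messier than this, of course, but the pattern (split, validate, write structured rows) is the same no matter how ugly the source.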
Still, there is probably some data that would not generate insight worth the effort you put into collecting it, and it's better to leave that data alone. The reason Wayan and I went to such lengths was that we had determined that this particular data was highly relevant to supporting our case, and we could be sure of that because of domain knowledge. So naturally, the fourth learning (variable) is…
4. Be creative on engineering the features.
This might not look too different from point 2, because feature engineering in this context means trying to get insight anyway. But what I would like to highlight here is: always experiment wildly. Nothing is off limits; create a crazy hypothesis, then prove it right or wrong. Then again, having some domain knowledge obviously helps and gives you a sense of whether the crazy hypothesis is worth testing. Wayan and I started from a (rather) crazy hypothesis when we found a high correlation between river health criteria and city sustainability aspects.
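A tiny sketch of what "engineering a feature" can mean in practice: deriving a new column from raw ones so that cities of very different sizes become comparable. The city names and numbers below are entirely made up for illustration.

```python
# Raw waste tonnage is dominated by city size; dividing by population
# gives a per-capita feature that is comparable across cities.
cities = [
    {"name": "City A", "waste_tons": 120, "population": 100_000},
    {"name": "City B", "waste_tons": 900, "population": 300_000},
    {"name": "City C", "waste_tons": 400, "population": 80_000},
]

for c in cities:
    c["waste_per_capita"] = c["waste_tons"] / c["population"]

# City C produces less total waste than City B, yet far more per person,
# which is the kind of signal the raw column hides.
```

Once a derived feature like this exists, you can test your crazy hypothesis against it, for example with the correlation check from earlier.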
If your crazy hypothesis is proven right, then what do you do? Explain it to the judges, the audience, or your stakeholders as is? These are the questions that led us to the fifth and last learning (variable):
5. Always try to generalise the data science concepts and communicate it in a high level language.
This is the most important lesson, one that I learned the hard way, and it ran contrary to my understanding of data science at the time. When we won the national round and got to represent our country at the regional level, we thought that the way to scale up our work was to use a more sophisticated data science technique. We made our solution look advanced without explaining in detail what we were doing. That was our biggest mistake, and how we learned this negative data science variable the hard way:
Being able to communicate what you're doing, and making sure the audience gets it, is the most important part of being a data scientist, even if it means dumbing things down and stripping away the sophisticated terms of data science concepts and techniques. The main goal is not to impress them, but rather to convince them that your work is valuable, and for them to understand why.
Being a data scientist is not always about being able to implement sophisticated algorithms, or loving to code, or even being a genius in statistics. It's about knowing what you want to solve, being unafraid and creative in dealing with any kind of data, and most importantly, comprehending the what, why, and how of your work.
Summing up, the 5 variables (learnings) are:
- A data scientist is as good as how they’re able to translate real world problems into problems that could be solved by data science.
- The ability to crawl and pick the relevant data, then get any kind of insight from it is just as important.
- Do not be afraid of getting your hands on 'dirty' data.
- Be creative on engineering the features.
- Always try to generalise the data science concepts and communicate it in a high level language.
If I had known that the 5th point was actually the most important one, I might have reconsidered participating, as I used to feel that communicating to a big audience or giving presentations was not my forte. In the end, I am very, very glad that I went through the two months of jam-packed virtual cramming sessions, realised that I actually enjoy giving presentations, got the chance to represent Indonesia in the regional finals, and was able to write all this down. Still, I couldn't have done this without the help and support of our mentors, families, and friends.
Thank you for reading! I hope you found this helpful :)
Here's the link to the competition: https://aseandse.org/
If you're interested in participating, registration usually opens around April, and the previous year's regional finalists' decks are available on the website. However, if you want to see our deck now or have any questions, please do hit us up!