Data | Gabriel Berardi

Simple Questions, Hard Answers

When business stakeholders ask data experts seemingly simple questions, they often expect a quick and straightforward answer. On the surface, it seems like a piece of cake. But in the messy reality of data, what appears to be a simple question can quickly turn into a multi-layered onion of a problem - each layer revealing increasing complexity and ambiguity. Example from the insurance domain: Let’s take a practical example from the insurance industry. A sales executive asks: ...

Data Leakage in Machine Learning

Recently, I read a thread on Twitter about several Machine Learning papers that contained severe cases of data leakage. The authors of the papers seemed unaware of this phenomenon and therefore trained models that performed exceptionally well. Unfortunately, this was mainly due to data leakage. Not many beginners are aware of this problem and, in my opinion, not many courses emphasize this issue early enough. Therefore, in this post I would like to tell you what you need to know about data leakage and some ways to prevent it. ...

Detect Forged Banknotes with a Logistic Regression

Counterfeit money is a real problem both for individuals and for businesses. Counterfeiters constantly find new ways and techniques to produce fake banknotes that are essentially indistinguishable from real money. At least to the human eye! Identifying forged banknotes is a typical example of a binary classification task in Machine Learning. If we have enough data on both real and forged banknotes, we can use this data to train a model that can classify new banknotes as either real or fake. ...

Linear and Logistic Regression

Linear and logistic regression are among the most elementary algorithms for supervised learning. Supervised learning describes the situation where we deal with labelled data, which means that we have labelled inputs and a target variable. Although both have the word “regression” in their name, only one of them is typically used for solving regression problems! Let’s see how they work!

Linear Regression

Linear regression is possibly the easiest, most intuitive way of making a quantitative prediction. The relationship between an independent and a dependent variable is assumed to be linear, meaning that the dependent variable can be predicted using a linear function of the independent variable. For example: ...

How to Create a Racing Bar Chart with Python

After reading this article from Pratap Vardhan with great interest, I wanted to build my own version of a Bar Chart Race that is smoother and a bit more beautiful. The biggest improvement is the interpolation (or augmentation) of the available data points in order to make the animation smoother. Here is the Bar Chart Race we are going to build in this article: For the purpose of this demonstration, we are going to use a GDP per capita forecast dataset provided by the OECD. You can find the original dataset here. ...
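
The excerpt does not show how the interpolation is done, so the following is only a minimal sketch of one plausible approach: linear interpolation between consecutive data points. The function name `interpolate_series`, the parameter `steps_per_interval`, and the sample values are hypothetical and not taken from the article or the OECD dataset.

```python
def interpolate_series(values, steps_per_interval):
    """Insert linearly interpolated frames between consecutive values.

    Assumption: plain linear interpolation; the original article may use
    a different method.
    """
    if steps_per_interval < 1:
        raise ValueError("steps_per_interval must be at least 1")
    if not values:
        return []

    frames = []
    for start, end in zip(values, values[1:]):
        for step in range(steps_per_interval):
            # Fraction of the way from `start` to `end` for this frame
            fraction = step / steps_per_interval
            frames.append(start + (end - start) * fraction)
    # The loop stops just before each interval's end, so add the last value
    frames.append(values[-1])
    return frames


# Hypothetical yearly values for one country (not real OECD data)
gdp_by_year = [100.0, 120.0, 150.0]
frames = interpolate_series(gdp_by_year, steps_per_interval=4)
```

With four steps per interval, the three original points become nine frames, so each bar moves in smaller increments between the original observations.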

k-Nearest Neighbors

k-Nearest Neighbors, or k-NN as I am going to call it from now on, is one of the easiest algorithms to solve classification tasks. It can be used for regression problems as well, but I am going to focus on the more common use case of classification in this post. In a nutshell, k-NN will assign a new data point to the class that the majority of its k neighbours in the training set belong to. Let’s use another coffee-related example to see how that works. ...
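
The majority vote described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the code from the post: the function `knn_predict`, the Euclidean distance metric, and the toy (price, rating) data with its labels are all assumptions made for the example.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points nearest to `query`."""
    # Euclidean distance from the query to every training point (assumed metric)
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep only the k closest neighbours
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the neighbours' labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (price, rating) pairs with made-up class labels
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
train_labels = ["budget", "budget", "budget", "premium", "premium", "premium"]

prediction = knn_predict(train_points, train_labels, (2.0, 2.0), k=3)
```

Choosing an odd k for a two-class problem avoids tied votes; with an even k, this sketch would simply return whichever tied label `Counter` encountered first.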

Simple Facial Recognition with OpenCV

Have you ever seen some cool applications of computer vision tools, like the one below? Perhaps your phone’s camera can autofocus on faces, or maybe you have uploaded a photo to a social media platform and it automatically recognized the person in the image? These are facial recognition applications and they all rely on Machine Learning. In this post, we are going to use a very easy package called OpenCV to build our own facial recognition program! ...

Where to eat in Munich?

I recently moved to a new city - Munich! I live in a very calm area, but soon realized that the neighbourhood is not really the best when it comes to eating out. So, I decided to analyse review data from the web to find out which area is most compelling for me and other foodies. I scraped online reviews, cleaned the data and then visualized it on a map, showing the average rating of restaurants in different areas of Munich. ...

Scrape a Book Shop with BeautifulSoup

Web Scraping is the automated process of extracting data from websites. This is commonly done by retrieving the HTML code of a website through a request and then extracting the information hidden in the HTML code programmatically. This is especially convenient when there is no API available to you! There has been a lot of discussion going on about the legality and ethics of Web Scraping, which I do not want to get into in this article. You can check out this Wikipedia article and this blog post, if you want to know more about that. ...
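
The extraction step can be illustrated without any network access. The sketch below is an assumption-laden stand-in: it uses Python's built-in `html.parser` rather than BeautifulSoup, and it parses a hard-coded, made-up HTML string (`SAMPLE_HTML`) in place of a page retrieved through a request. The class name `TitleExtractor` and the page structure are hypothetical.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collect the text content of every <h3> element."""

    def __init__(self):
        super().__init__()
        self._in_h3 = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self._in_h3 = True
            self.titles.append("")  # start a new title

    def handle_endtag(self, tag):
        if tag == "h3":
            self._in_h3 = False

    def handle_data(self, data):
        # Only keep text that appears inside an <h3> element
        if self._in_h3:
            self.titles[-1] += data


# Made-up HTML standing in for the response body of a real request
SAMPLE_HTML = """
<html><body>
  <article><h3>First Book</h3><p class="price">10.00</p></article>
  <article><h3>Second Book</h3><p class="price">12.50</p></article>
</body></html>
"""

parser = TitleExtractor()
parser.feed(SAMPLE_HTML)
titles = [title.strip() for title in parser.titles]
```

A library such as BeautifulSoup wraps this kind of tag handling behind a search interface, which is usually more convenient for real pages.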

k-Means Clustering

The k-means algorithm is used to divide unlabeled data into categories or classes, in order to draw useful conclusions from the resulting clusters. Let’s take a look at an imaginary dataset of n = 18 observations of different coffee brands. Note that we would never actually use the k-means algorithm on such a small data set. We plot the price of the coffee vs. the rating obtained by customers: ...
