[{"content":"In this blog post, I am going to explain how to change the Realtek 8852BE WiFi card that comes with the Geekom A5.\nMany people (myself included) experience WiFi issues when using Linux on the Geekom A5. A lot of people report that the Realtek 8852BE WiFi card works well on Windows, but on Linux, WiFi would often suddenly disconnect and it would not be possible anymore to reconnect to the internet. Only a reboot helps in this situation.\nThe only solution is to replace the Realtek 8852BE WiFi chip with a better-supported Linux WiFi card, such as the Intel AX210.\nI am writing this blog post because I was looking for something like this before I changed the WiFi card on my Geekom A5, but I couldn\u0026rsquo;t find anything. So, this is why I want to provide this information and what I learned in the process here.\nStep 1: Prepare Your Tools I recommend preparing the following tools:\nA very small screwdriver for M.2 screws - preferrably magnetic A pair of tweezers - ideally made of plastic An adjustable wrench or pliers to remove a small screw nut Step 2: Open the Case Opening the Geekom A5 is straightforward. Simply remove the four screws on the underside and lift off the cover. Carefully lift away the cover in order to prevent damaging the connection to the STATA slot.\nThere are many videos online showing this step, such as this one from Geekom themselves:\nOnce we open the case, we see this:\nStep 3: Remove the SSD The WiFi card is located under the SSD, so we need to remove that first, by simply unscewing the screw that keeps it in place and then lift the card away:\nStep 4: Remove the Plastic Cover of the WiFi Card After we have removed the SSD, we can see the Realtek 8852BE WiFi card:\nNow here is where my struggle began: on top of the WiFi card and over the two WiFi antenna cables, there is a glued on transparent plastic cover. Apparently, this is to prevent a short circuit, but maybe also to keep the antenna cables in place.\nYou’ll need to carefully peel off this plastic cover. Unfortunately, there’s no elegant way to do this - just remove it gently.\nAfter you have removed the plastic cover, you can disconnect the black and the grey antenna cables.\nStep 5: Remove the Screw Nut The screw that holds down the WiFi card is called a \u0026ldquo;combination screw\u0026rdquo;, because it simulatenously holds down the WiFi card but it also holds the screw that held down the SSD card we removed before. Use a pair of small pliers or an adjustable wrench to loosen and remove the screw carefully.\nStep 6: Swap out the WiFi card Now you can simply pull out the old WiFi card and put in your new WiFi card. Connecting the antenna cables (black to main and gray to aux) can be tricky in the Geekom A5, as the cables tend to detach easily. Here you simply need some patience to connect the cable, to put in the new WiFi card and to add the screw nut. Now you can also put back the transparent plastic cover on top of the WiFi card:\nStep 7: Reassembly After that, we only need to put the SSD back in its slot and secure it with the M.2 screw. Finally, carefully reattach the case of the Geekom A5 and tighten the screws on the underside. That\u0026rsquo;s it!\nPerfect WiFi After replacing the WiFi card, WiFi on Linux Mint worked flawlessly, with no further issues. 
For future purchases, I would try to avoid Realtek WiFi cards, if possible.\n","permalink":"http://www.gabriel-berardi.com/blog/tech/2026-03-27-how-to-change-the-wifi-card-of-the-geekom-a5/","summary":"\u003cp\u003eIn this blog post, I am going to explain how to change the Realtek 8852BE WiFi card that comes with the Geekom A5.\u003c/p\u003e\n\u003cp\u003eMany people (myself included) \u003cstrong\u003eexperience WiFi issues when using Linux on the Geekom A5\u003c/strong\u003e. A lot of people report that the Realtek 8852BE WiFi card works well on Windows, but on Linux, WiFi would often suddenly disconnect and it would no longer be possible to reconnect to the internet. Only a reboot helps in this situation.\u003c/p\u003e","title":"Changing the Geekom A5 WiFi Card"},{"content":"\nWhen business stakeholders ask data experts seemingly simple questions, they often expect a quick and straightforward answer. On the surface, it seems like a piece of cake. But in the messy reality of data, what appears to be a simple question can quickly turn into a multi-layered onion of a problem - each layer revealing increasing complexity and ambiguity.\nExample from the insurance domain:\nLet\u0026rsquo;s take a practical example from the insurance industry. A sales executive asks:\n\u0026ldquo;What was the lapse rate (churn rate) of a specific product last year?\u0026rdquo;\nIt seems like a simple question, right? Calculate how many policies lapsed as a percentage of the total policies at the beginning of the year. Easy!\nBut the moment a data analyst starts digging into this question, complications begin to surface.\nStep 1: Data Extraction At first glance, the analyst\u0026rsquo;s job might appear to be as simple as pulling the data from a clean, ready-to-use database. Unfortunately, this is rarely the case, especially in large organizations. In the insurance industry, data is often scattered across multiple operational systems - different systems for various regions, business lines, customer segments and so on.\nEach system may store the same information differently. One system might clearly record the exact moment a contract lapsed, while another might blur the lines between true customer lapses and technical migrations. Harmonizing these heterogeneous systems to get consistent data is already a significant challenge (shoutout to all data engineers working on solving these kinds of problems!).\nStep 2: Defining the KPIs Assuming the data has been extracted and cleaned, the next hurdle arises: What exactly do we mean by \u0026ldquo;lapse rate\u0026rdquo;? In business, many key performance indicators (KPIs) are not universally defined, and lapse rate is no exception.\nDo we calculate the lapse rate based on the number of policies or based on the premium volume of those policies? If we choose premium volume, do we consider premiums before or after taxes? Do we include sales commissions? Should we calculate using the premiums at the beginning of the year, the end of the year, or the average across the year? Or should we take a pro-rata approach, where each policy is weighted according to the length of time it was active at a certain premium level?\nAs you can see, what seemed like a straightforward question now opens the door to a wide range of follow-up questions that must be clarified.
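To make this concrete, here is a minimal sketch of how just two of these definitions can diverge on the same toy portfolio (the column names and figures are purely illustrative, not a real schema):

```python
import pandas as pd

# Toy portfolio: one row per policy active at the start of the year.
policies = pd.DataFrame({
    "policy_id":      [1, 2, 3, 4, 5],
    "annual_premium": [1200.0, 300.0, 800.0, 2500.0, 450.0],
    "lapsed":         [False, True, False, True, False],
})

# Definition 1: lapse rate by policy count.
count_rate = policies["lapsed"].mean()

# Definition 2: lapse rate weighted by premium volume.
premium_rate = (policies.loc[policies["lapsed"], "annual_premium"].sum()
                / policies["annual_premium"].sum())

print(f"Lapse rate by count:   {count_rate:.1%}")   # 40.0%
print(f"Lapse rate by premium: {premium_rate:.1%}") # 53.3%
```

Two perfectly defensible definitions, two noticeably different answers - which is exactly why the definition has to be agreed upon before the analysis starts.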
Step 3: Contract Timing Considerations Even if we focus purely on the number of contracts, complexities arise. For example, how do we treat contracts that lapsed early in the year versus those that lapsed at the very end? Should a contract that was active for nearly the entire year and lapsed on December 31st be treated the same as one that lapsed on January 2nd?\nSome might argue that contracts should be weighted by the time they were active during the year. A contract that lapsed in January barely impacts the year\u0026rsquo;s overall performance, while one that lapses in December may have contributed significantly.\nThe payment frequency of the contract - whether monthly, quarterly, or yearly - also plays a role. Should we treat a yearly payment plan the same as a monthly one when calculating the lapse rate?
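Again purely as an illustration (hypothetical dates, simplified day counts, and a weighting scheme chosen for demonstration rather than any industry standard), a time-weighted variant might look like this:

```python
import pandas as pd

# Toy example: weight each policy by the fraction of the year it was active.
policies = pd.DataFrame({
    "policy_id":  [1, 2, 3],
    "lapse_date": [pd.NaT, pd.Timestamp("2023-01-15"), pd.Timestamp("2023-12-20")],
})

start, end = pd.Timestamp("2023-01-01"), pd.Timestamp("2023-12-31")
active_until = policies["lapse_date"].fillna(end)
exposure = (active_until - start).dt.days / (end - start).days  # active fraction

lapsed = policies["lapse_date"].notna()
print(f"Unweighted lapse rate:        {lapsed.mean():.1%}")                       # 66.7%
print(f"Exposure-weighted lapse rate: {exposure[lapsed].sum() / exposure.sum():.1%}")  # 50.2%
```

The January lapse barely moves the weighted figure, while the December lapse dominates it - mirroring the intuition described above.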
Step 4: Managing Ambiguity By now, what seemed like a simple question has grown increasingly complex. Each answer leads to new questions, and the data analyst must navigate this ambiguity carefully.\nWhile data professionals are trained to think analytically and consider all possibilities, this can also lead to overcomplicating the analysis:\nA key part of the data analyst\u0026rsquo;s role is not only to answer these questions but also to determine when additional granularity no longer adds value to the analysis.\nThey must communicate with business stakeholders to clarify the purpose behind the question and agree on what level of detail is necessary. Here, data analysts who are more familiar with the industry they are working in have a huge advantage!\nCommunicating with Stakeholders One of the most critical aspects of data analysis is managing expectations. Analysts must not only dive into the data and address technical nuances but also communicate effectively with the business. Understanding the \u0026ldquo;why\u0026rdquo; behind the question can help analysts determine the most appropriate methodology for the analysis.\nFor example, when a stakeholder asks for a lapse rate, are they primarily interested in the overall health of the product portfolio? Or are they looking to assess the financial performance in terms of premium volume?\nHaving a clear understanding of the stakeholder\u0026rsquo;s goals will guide the analyst in making decisions about how much detail is necessary.\nBusiness stakeholders can make the job of data analysts much easier by providing as much information as possible when raising data questions.\nOpen dialogue is essential to prevent misalignment. Without it, the analyst could spend days perfecting an overly detailed calculation, only to discover that the stakeholder only needed a rough estimate to make a quick decision.\nThe Role of Data Governance Data governance plays an important role in all of this, by reducing ambiguity and facilitating more efficient data work. In an ideal scenario, data governance provides clear (but simple!) guidelines for data work, as well as definitions for KPIs that can be applied consistently across the organization.\nFor instance, a robust data governance framework would include a well-documented definition for lapse rate, outlining whether it\u0026rsquo;s calculated based on the number of contracts or premium volume, and specifying how those premiums should be treated - before or after taxes, with or without commissions, and so on.\nBy establishing these guidelines, organizations can prevent a situation where two analysts arrive at different results for the same KPI simply because they made different assumptions.\nHowever, it\u0026rsquo;s important to use data governance in moderation! Governance should aim to support and standardize analysis, not impose excessive bureaucracy.\nIn many organizations, data governance is viewed as cumbersome, earning a reputation for adding complexity rather than simplifying workflows.\nThe goal of data governance should be to strike a balance: providing clarity where needed, while allowing enough flexibility for analysts to adapt to specific business needs.\nTurning Complexity into Clarity At first glance, business questions often appear simple. But as we\u0026rsquo;ve seen, the path to answering them is often fraught with complexity, especially in the world of data. For data analysts, the challenge is not only to navigate these complexities but also to communicate effectively with stakeholders to ensure the analysis serves its purpose.\nBy encouraging communication and leveraging effective data governance practices, organizations can turn this complexity into actionable insights - helping businesses make better, more informed decisions without getting lost in the data weeds.\nIn the end, while simple questions can lead to hard answers, the value lies in finding the right balance between depth and efficiency, ensuring that data drives meaningful business outcomes.\n","permalink":"http://www.gabriel-berardi.com/blog/data/2024-09-30-simple-questions-hard-answers/","summary":"\u003cp\u003e\u003cimg loading=\"lazy\" src=\"/blog/data/2024-09-30-simple-questions-hard-answers/images/img1.webp\"\u003e\u003c/p\u003e\n\u003cp\u003eWhen business stakeholders ask data experts seemingly simple questions, they often expect a quick and straightforward answer. On the surface, it seems like a piece of cake. But in the messy reality of data, what appears to be a simple question can quickly turn into a multi-layered onion of a problem - each layer revealing increasing complexity and ambiguity.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eExample from the insurance domain:\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eLet\u0026rsquo;s take a practical example from the insurance industry. A sales executive asks:\u003c/p\u003e","title":"Simple Questions, Hard Answers"},{"content":"\nIn recent years, in light of the growing attention given to climate change and environmental concerns, insurance companies worldwide have been increasingly focused on ESG (Environmental, Social, and Governance) objectives. Indeed, according to a survey conducted by PricewaterhouseCoopers in 2022, 85% of global insurers believe that ESG will impact all functions of their business in the years to come.\nBitcoin, on the other hand, is often denounced as “terrible for the environment” and a “contributor to climate degradation”. Media outlets oftentimes quote statistics such as:\nIt’s estimated that Bitcoin consumes electricity at an annualized rate of 127 terawatt-hours (TWh). That usage exceeds the entire annual electricity consumption of Norway. In fact, Bitcoin uses 707 kilowatt-hours (kWh) of electricity per transaction, which is 11 times that of Ethereum.\nForbes\nThe general population, therefore, still believes that Bitcoin is a climate killer, and too many people wrongly believe that other cryptocurrencies, like Ethereum, alleviate this shortcoming by using less energy-intensive validation mechanisms.\nHowever, the reality is the exact opposite. As surprising as it may be to some: Bitcoin could be the missing factor to achieve the transition to renewable energy that is needed to prevent a further increase of global temperatures.
In this article, I will attempt to explain why that is the case and what implications this has for the insurance industry.\nWhat Is ESG All About? The term ‘ESG’ typically represents a set of criteria that investors and businesses use to evaluate the ethical and sustainability aspects of an investment or a company’s operations. Let’s quickly break down what each of these components means:\nEnvironmental: This aspect focuses on a company’s environmental impact, including its efforts to reduce carbon emissions, manage waste, conserve resources, and promote sustainability. Environmental criteria consider how a company’s activities affect the planet and its ecosystems. Social: The social dimension of ESG takes into account how a company manages its relationships with its employees, customers, suppliers, and the communities in which it operates. It assesses factors like labor practices, diversity and inclusion, community engagement, and product safety. Governance: Governance evaluates the way a company is managed and governed. It looks at factors such as board diversity, executive compensation, shareholder rights, and the overall transparency and accountability of the company’s leadership. Now that we have a clearer understanding of ESG, let’s explore why the Bitcoin protocol requires energy – and why that isn’t a bad thing per se.\nWhy Is Bitcoin’s Energy Consumption Important? Information stored digitally can be duplicated very easily, without any difference whatsoever between the original information and the copy. When trying to create digital money, this fact creates a problem known as the ‘double-spend problem’, which means that digital money could be copied and hence spent multiple times. This form of digital money would be inherently worthless, because anyone could create as much of it as they wanted.\nSatoshi Nakamoto, the mysterious creator of Bitcoin, solved this problem. Bitcoin uses a mechanism called ‘proof of work’ to solve the double-spend problem. Bitcoin miners need to use a lot of computational power, and therefore also a lot of electricity, to find new valid blocks in the Bitcoin blockchain.\nThere are thousands of competing miners in the world, and the Bitcoin blockchain automatically adjusts the difficulty with which new valid blocks can be found according to the competing computing power in the network. This creates an incredibly secure network! If one party wanted to change the Bitcoin blockchain, for example to steal or to double-spend Bitcoin, that party would require more than 50% of the computing power in the Bitcoin network. This has become extremely unlikely in recent years, due to the increasing computing power in the Bitcoin network.
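To see concretely why finding a valid block costs real energy, here is a toy proof-of-work loop (illustrative only: real Bitcoin mining double-hashes an 80-byte block header against a vastly higher difficulty, but the brute-force principle is the same):

```python
import hashlib

def mine(block_data: str, difficulty_bits: int) -> int:
    """Brute-force a nonce until the block hash falls below the target.
    Guessing is the only way to find a valid nonce, which is what ties
    block production to real-world energy expenditure."""
    target = 1 << (256 - difficulty_bits)
    nonce = 0
    while int(hashlib.sha256(f"{block_data}{nonce}".encode()).hexdigest(), 16) >= target:
        nonce += 1
    return nonce

# ~2**16 = 65,536 attempts on average; Bitcoin's real difficulty is
# astronomically higher, which is where the electricity goes.
print(mine("toy block with some transactions", difficulty_bits=16))
```

Doubling the difficulty doubles the expected number of guesses, and the network retunes this difficulty every 2,016 blocks so that blocks keep arriving roughly every ten minutes no matter how much hardware joins.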
In other words, Bitcoin manages to ensure real scarcity in the digital realm by using electricity in the physical world. This sounds trivial, but it is the only digital asset that manages to do this in a truly decentralized, really secure and fully trustless manner!\nIn terms of Bitcoin’s energy consumption, we can therefore conclude that Bitcoin does not use too much energy. It uses exactly as much energy as is needed to secure a network that creates real value for its users. As such, the electricity consumption of Bitcoin is not a waste. It is at least as justified as the energy consumption of Google’s data centers, the sum of all fridges in the world, or electric cars, for example.\nHow Can Bitcoin Impact ESG Goals? After we have established that Bitcoin does not waste energy, but merely needs energy to secure its network, which provides real value, let’s now take a look at the impact Bitcoin has on ESG factors.\n1. Environment Grid Balancing As electricity is naturally difficult to store, electrical grids constantly need to balance energy supply and demand. Fossil fuels offer a simple way to balance the production of electricity, for example by burning more or less coal according to demand. In contrast, many renewable energy sources, like solar or wind energy, rely on natural weather phenomena that are not controllable. Furthermore, renewable energy facilities often need to operate at their full capacity in order to meet their contractual obligations. This can result in facilities having a surplus of electricity and negative electricity prices. This means that the operator of a renewable electricity facility pays another party to consume their surplus electricity.\nBitcoin mining can utilize this surplus of electricity in a valuable manner. As a ‘buyer of last resort’ in times when electricity demand is too low, Bitcoin mining can therefore improve the business case and economics of renewable energy facilities, by providing an additional revenue stream through the mining rewards. As such, it incentivizes an increased integration of renewable energy sources into the electricity grid.
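A stylized sketch of this ‘buyer of last resort’ behavior (the breakeven price and the hourly spot prices are made-up numbers, purely for illustration):

```python
# A miner co-located with a renewable facility switches on whenever the
# spot electricity price drops below its breakeven cost (stylized numbers).
BREAKEVEN_USD_PER_MWH = 40.0  # hypothetical: depends on hardware and BTC price

hourly_spot_prices = [55.0, 42.0, 12.0, -8.0, 3.0, 61.0]  # negative = oversupply
for hour, price in enumerate(hourly_spot_prices):
    action = "mine (soak up surplus)" if price < BREAKEVEN_USD_PER_MWH else "idle"
    print(f"hour {hour}: {price:>6.1f} USD/MWh -> {action}")
```

At negative prices the miner is effectively being paid to consume, which is exactly the additional revenue stream that improves the facility’s economics.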
ℹ️ What this means for insurers: Many insurance companies invest in renewable energy facilities like wind or solar parks.1 To fulfil ESG goals, such types of investments are likely to grow in coming years. Incorporating Bitcoin mining opportunities into the business cases of these investments can increase their calculated ROI and hence further incentivize such endeavours. Furthermore, energy facilities with integrated Bitcoin mining require specialized risk coverage that can be offered by insurance companies.\nRecycled Heat Bitcoin miners use specialized hardware, so-called ‘application-specific integrated circuit’ or ‘ASIC’ computers. This type of hardware transforms its electrical energy input into heat when in use. As such, Bitcoin mining rigs could be considered sophisticated electrical heaters, with the lucrative add-on of also mining Bitcoin.\nThe heat from these mining rigs can then be recycled for various use cases, such as heating homes, temporary shelters, offices, commercial buildings, greenhouses, swimming pools and so on. Another creative idea is to use the heat to regulate the temperature of algae tanks that not only filter CO2 from the air, but can also be used for nutrition or as a biofuel.\nThere are already a number of companies working on such use cases. The Canadian startup MintGreen, for example, uses miners to warm water for a whiskey distillery, and the Austrian startup 21ENERGY offers customized Bitcoin mining heaters for home or office usage.\nℹ️ What this means for insurers: When investing into renewable energy facilities, insurers could look at the potential of recycling the produced heat in innovative ways, for example by operating a greenhouse or algae farm nearby. Furthermore, insurance companies could consider using Bitcoin mining rigs to replace carbon-based heating solutions to heat up office spaces, or to incentivize the usage of such heating solutions in the area of buildings insurance.\nMethane Reduction When we talk about climate change and the greenhouse effect, we often talk about carbon dioxide (CO2). Other greenhouse gases are frequently overlooked. Methane (CH4) is an 80 times more potent greenhouse gas than CO2 over a 20-year time frame and accounts for roughly 30% of global warming impacts.\nMethane is frequently burned in oil and gas fields. During this reaction, it turns into CO2 and water (H2O). Since CO2 is less potent than methane, this process, also known as gas flaring, is considered to be less harmful to the environment than releasing methane itself into the atmosphere. However, the heat released during flaring is often wasted. This wasted heat could be monetized by transforming it into electricity used for Bitcoin mining. One startup already working on this is Crusoe Energy. This company builds infrastructure that utilizes the energy from flaring activities to operate mobile data centers that can be used for different purposes, such as mining Bitcoin.\nLandfills around the globe are another source of methane released directly into the atmosphere. This methane can also be turned into electricity for Bitcoin mining, which would reduce the negative environmental impact of such facilities. Vespene Energy is a startup that is trying to turn this idea into reality. They have entered into a partnership with a renewable natural gas provider for an initial project aimed at fueling on-site Bitcoin mining at a municipal landfill.\nℹ️ What this means for insurers: For insurers, the reduction of methane emissions through innovative approaches such as utilizing waste energy for Bitcoin mining presents opportunities to invest in environmentally sustainable projects. By supporting methane reduction initiatives and partnering with innovative startups in this space, insurers can align with global sustainability goals while also exploring new revenue streams.\n2. Social Inclusive Insurance Inclusion is one of the main social goals of ESG, and it includes financial inclusion. According to the World Bank, financial inclusion means that “individuals and businesses have access to useful and affordable financial products and services that meet their needs – transactions, payments, savings, credit and insurance – delivered in a responsible and sustainable way.”\nHowever, many people in economically underprivileged countries still lack access to the traditional financial system. They don’t have bank accounts, they cannot get loans, they cannot save and invest money easily, and they do not enjoy insurance coverage for life’s perils.\nAs outlined in our article Microinsurance Unleashed – the Potential of Bitcoin and Lightning, Bitcoin and the Lightning network offer several ways to boost the adoption of inclusive insurance, such as their permissionless nature, low transaction fees and highly automated policies.\nℹ️ What this means for insurers: For insurers, the rise of Bitcoin and the Lightning Network as tools for financial inclusion provides an exciting avenue for innovation and expansion into new markets. The permissionless nature of these technologies enables insurers to reach populations that have historically been excluded from the financial system, without the need for intermediaries or traditional banking infrastructure. This democratization of financial services can open up a vast new customer base that is in need of affordable and accessible insurance products.\nMinigrid Electricity According to the International Energy Agency, almost 775 million people worldwide lack access to electricity.
This sets the world behind the United Nations’ Sustainable Development Goals (SDG 7) and hinders human and economic development. One of the driving forces is the lack of energy infrastructure and conventional electricity grids in developing regions, especially in rural and remote areas.\nOne way to alleviate this problem is so-called ‘minigrids’, which are small electricity grids that work independently of conventional main grids. Such minigrids can be developed and operated by state utilities, private companies, communities, non-governmental organizations, or a mix of different players. An example of a simple minigrid could be a collection of solar panels in a remote village whose electricity is consumed directly by local residents.\nOne challenge for the adoption of minigrids is the initial cost of the infrastructure setup, as well as the frequent mismatching of demand and supply. According to a report by the International Energy Agency, a majority of firms in the off-grid industry indicate they risk bankruptcy in the next three years.\nAs already described above, Bitcoin mining could help to solve this problem. Bitcoin miners can co-locate themselves within a minigrid and monetize electricity in times of oversupply, thereby incentivizing investments into such minigrids. Consequently, this leads to a more stable and cost-effective power supply for local residents, by enhancing the efficiency of the minigrid.\nGridless, a startup backed by Twitter founder and Bitcoin advocate Jack Dorsey, is already doing that. They build minigrids with renewable energy generation in rural communities in East Africa, and use unneeded electricity for Bitcoin mining. By doing that, they report to have reduced electricity prices for locals from 35 cents to 25 cents per kWh.\nℹ️ What this means for insurers: Insurers have a significant role to play in supporting the development and expansion of minigrids in underserved regions. By investing in projects that use Bitcoin mining to enhance the viability of these mini electricity grids, insurance companies can contribute to greater energy access in remote areas. Moreover, insurance products can be designed to cover risks associated with the setup and operation of minigrids, such as damage to infrastructure, or business interruption due to system failure or natural disasters. This would make investments in minigrid projects more appealing to potential stakeholders by reducing their perceived risks.\nRemittance Payments \u0026amp; Donations Another way in which Bitcoin can contribute to the social aspect of ESG goals is by facilitating international remittance payments.\nRemittance payments typically describe money transfers made by a foreign worker for household income in their home country. The annual volume of remittances is estimated to be more than $600bn, which is roughly equivalent to the gross domestic product of countries such as Sweden or Argentina. According to the World Bank, the global average cost for such remittance payments is around 6.25%, more than double the United Nations’ Sustainable Development Goal (SDG 10c) of 3% remittance costs. In Sub-Saharan Africa, the average is 8.35%, and for some countries, remittance costs can take up to 30% of the transfer amount.\nHistorically, Bitcoin transaction fees average between $0.50 and $2.50. If we take an average fee of $1.50 per transaction, a remittance amount of just $30 would incur a fee of 5% when done in Bitcoin – which is substantially lower than the global average. In the case of a transfer of $100, the Bitcoin fee would represent only 1.5%. Sending on-chain Bitcoin to one’s family would therefore already be beneficial for migrant workers.
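A quick back-of-the-envelope comparison, using only the fee figures quoted above (historical averages, not live fees):

```python
# Remittance cost comparison based on the averages quoted in the text.
AVG_ONCHAIN_FEE_USD = 1.50  # assumed flat on-chain Bitcoin fee
GLOBAL_AVG_RATE = 0.0625    # ~6.25% global average remittance cost
SSA_AVG_RATE = 0.0835       # ~8.35% average in Sub-Saharan Africa

for amount in (30, 100, 300):
    onchain_pct = AVG_ONCHAIN_FEE_USD / amount * 100
    print(f"${amount:>3}: Bitcoin on-chain ~{onchain_pct:.1f}% "
          f"| global average {GLOBAL_AVG_RATE:.2%} "
          f"| Sub-Saharan Africa {SSA_AVG_RATE:.2%}")
```

Because the on-chain fee is roughly flat, the percentage cost keeps falling as the transfer amount grows, whereas traditional remittance costs scale with the amount sent.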
However, with the Lightning network, it gets even better. The median cost of sending value across the Lightning Network is a negligible 0.0029% with a base fee of just 1 Sat.2 This makes sending money to any place in the world unprecedentedly cheap and can directly enhance the lives of billions of people who depend on such remittance payments.\nA similar benefit arises for donations made to disaster-stricken regions or for charitable causes. Traditional donation platforms often come with administrative costs and transaction fees that can eat into the amount actually received by the beneficiaries. With Bitcoin and especially the Lightning Network, these fees can be reduced to almost negligible amounts, ensuring that more of the donated funds actually reach those in need.\nℹ️ What this means for insurers: Insurance companies with a global workforce could encourage or even facilitate the use of Bitcoin and Lightning for employees wishing to send remittance payments. This could manifest as an employee benefit or service, making the remittance process easier, faster, and less costly for their international staff. Moreover, insurance companies frequently engage in philanthropic activities, particularly in responding to catastrophic events where rapid and efficient financial assistance is crucial. Utilizing Bitcoin could substantially reduce the administrative overhead and transaction fees associated with such donations. By leveraging this technology, insurers could ensure that a significantly larger percentage of their charitable contributions go directly to aid those in immediate need, rather than being diluted by operational costs.\n3. Governance As we have seen, Bitcoin can have a direct, measurable and positive impact on environmental and social factors, and help insurers to reach their ESG goals. When it comes to governance, the positive influence of Bitcoin is a bit more subtle and indirect. Nevertheless, I believe that Bitcoin possesses a set of core design rules that can act as a role model or blueprint for traditional organizations and companies. These core design rules are:\nTransparency: Bitcoin’s blockchain operates on a decentralized, open ledger that records all transactions, providing complete transparency. Furthermore, the code that defines the Bitcoin protocol is open source and can be checked by anyone. This transparency can inspire traditional organizations and companies to adopt more open and accountable practices in their governance, fostering trust among stakeholders and demonstrating their commitment to ethical operations. Immutability: Bitcoin’s blockchain is extremely resistant to changes, ensuring the integrity of the historical data recorded. In the realm of governance, this principle encourages organizations to maintain consistency and stability in their policies, making them more reliable and less susceptible to arbitrary shifts that may harm stakeholders. Equality: Bitcoin operates on the principle of decentralization, where no single entity has undue control. This aspect can encourage traditional organizations to adopt more inclusive and democratic governance models, giving a greater voice to various stakeholders and fostering a fairer decision-making process.
By considering these principles inspired by Bitcoin, traditional organizations, such as insurance companies, can better align with ESG objectives, especially in the realm of governance, by promoting transparency, immutability, and equality in their operations.\nSummary In summary, Bitcoin presents an unexpected but promising avenue for insurance companies striving to meet ESG (Environmental, Social, and Governance) objectives. Contrary to its reputation as an energy drain, Bitcoin mining can actually support renewable energy adoption by balancing the electrical grid and providing new revenue streams. It can also recycle waste heat and reduce potent methane emissions.\nOn the social front, Bitcoin and the Lightning Network offer game-changing possibilities for financial inclusion and remittance payments, widening insurers’ potential customer base and reducing transaction costs. Minigrid projects supported by Bitcoin mining can also foster sustainable development in underprivileged regions, providing both environmental and social benefits.\nAs for governance, Bitcoin’s core principles of transparency, immutability, and equality serve as a model for companies to align their governance practices with ESG goals.\nOverall, Bitcoin is not only compatible with ESG objectives, it offers innovative pathways to achieve them. For insurers committed to sustainability and social responsibility, it may be time to look past the negative headlines and consider the unique opportunities that Bitcoin presents.\nSee, for example, press releases from Allianz, Axa, Munich Re or Dai-ichi Life.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n1 Sat, also called 1 Satoshi, is equal to 0.00000001 Bitcoin.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n","permalink":"http://www.gabriel-berardi.com/blog/bitcoin/2023-10-21-bitcoins-role-in-achieving-esg-goals-for-insurers/","summary":"\u003cp\u003e\u003cimg loading=\"lazy\" src=\"/blog/bitcoin/2023-10-21-bitcoins-role-in-achieving-esg-goals-for-insurers/images/img1.webp\"\u003e\u003c/p\u003e\n\u003cp\u003eIn recent years, in light of the growing attention given to climate change and environmental concerns, insurance companies worldwide have been increasingly focused on ESG (Environmental, Social, and Governance) objectives. Indeed, according to a \u003ca href=\"https://www.pwc.com/us/en/industries/financial-services/library/next-in-insurance-top-issues/esg-insurance-industry.html\" target=\"_blank\" \u003esurvey\u003c/a\u003e conducted by PricewaterhouseCoopers in 2022, 85% of global insurers believe that ESG will impact all functions of their business in the years to come.\u003c/p\u003e\n\u003cp\u003eBitcoin, on the other hand, is often denounced as “terrible for the environment” and a “\u003ca href=\"https://www.greenpeace.org/usa/news/climate-pledges-vs-bitcoin-investments-unveiling-the-1-5-billion-dollar-climate-bomb/\" target=\"_blank\" \u003econtributor to climate degradation\u003c/a\u003e”. Media outlets oftentimes quote statistics such as:\u003c/p\u003e","title":"Bitcoin’s Role in Achieving ESG Goals for Insurers"},{"content":"\nInsurance is a market-based risk management tool that is fundamentally based on sharing and diversifying risks among a community of policyholders. It creates a safety net against unpredictable events, protecting individuals and societies from potentially hazardous financial losses. This invaluable system not only provides a sense of security to its members but also fosters innovation and exploration, encouraging humans to embrace calculated risks.
Thus, insurance makes our world more resilient, fostering both innovation and stability.\nRegrettably, a significant portion of the global population lacks adequate access to insurance, a deficiency most acutely felt by those already struggling economically:\nConsider a Swiss farmer experiencing a drought that eliminates an entire year’s crops. Although this is a devastating event, it isn’t life-threatening given Switzerland’s excellent social security system. The most serious consequence for this farmer might be seeking governmental aid or using up some of their savings. Contrast this with a similar drought striking a farmer in an economically disadvantaged country. Such an event could lead to total loss of property, necessitate high-interest loans, or even drive them towards criminal activities. In extreme cases, this farmer faces dire outcomes like starvation, disease, or even death. The stark contrast between these two farmers underscores the crucial role insurance plays in providing economic stability and security.\nThere are many reasons why insurance penetration is low in some countries, including underdeveloped financial systems, inadequate infrastructure, corruption, and low levels of education, among others. Microinsurance, also referred to as inclusive insurance, tries to reach people with a low income by offering tailored insurance products with much lower premiums than typical insurance policies.\nAlthough microinsurance has made significant strides, it has yet to reach its full potential. One intriguing avenue for potential expansion could lie with Bitcoin and the Lightning network. In this article, we will explore how these cutting-edge technologies could potentially foster the continued growth and development of the microinsurance market.\nUnderstanding Microinsurance So what exactly is microinsurance? Wikipedia offers a good definition:\nMicroinsurance is the protection of low-income people […] against specific perils in exchange for regular premium payment proportionate to the likelihood and cost of the risks involved.\nWhile this definition aligns with that of traditional insurance, it distinctly targets low-income individuals. Notably, ‘low-income’ here does not refer to the underprivileged proportion of citizens in developed nations, but to the 50% of the global population surviving on less than $6.25 per day. This demographic is often overlooked by mainstream commercial and social insurance schemes, or lacks adequate access to insurance products.\nJust like conventional insurance, microinsurance can cover a broad spectrum of potential hazards. This encompasses everything from health-related risks like illness, accidents, or mortality to property-related risks such as damage or loss. There is a diverse array of microinsurance solutions available, ranging from crop and livestock insurance to theft or fire protection, health coverage, term life policies, death benefits, disability coverage, and safeguards against natural calamities.\nMicroinsurance holds the potential to address some of the world’s most pressing challenges – poverty, hunger, and gender disparity – and to foster economic development. For both individuals and their communities, access to microinsurance can make a transformative difference.\nHowever, the expansion of microinsurance is met with several impediments.
Let’s delve into these challenges.\nCurrent Challenges in Microinsurance As per the “2022 Landscape of Microinsurance” report published by the Microinsurance Network, microinsurance has barely scratched the surface of its target market with a mere 8% penetration. This statistic highlights an untapped potential of 92% within the target demographic. Furthermore, the estimated market value of current microinsurance premiums sits at $30.9 billion, against a possible market worth $441 billion, underscoring vast areas for expansion in the microinsurance sector.\nSo, what are the primary hurdles? We can broadly identify the following microinsurance-specific challenges and roadblocks:\nEducation and Distribution: Financial literacy emerges as a fundamental issue. Insurance, being a complex product, requires a good grasp of financial concepts, which requires education to be understood by potential customers. In economically underdeveloped countries, many people – especially women – do not have the opportunity to attend schools. Literacy rates are still low in developing countries. Consequently, insurance companies and intermediaries face considerable difficulty in articulating the nuances of an insurance contract to customers and justifying regular premium payments.\nPayment: According to the Microinsurance Network, payment of microinsurance premiums was done in the following ways: Bank transfers (28%); Cash (28%); Mobile money (12%); Credit/loan (10%); Free/subsidized (3%); Other (20%). If we take a look at bank transfers, cash and mobile money – we can see that each of these payment methods presents unique challenges for low-income individuals:\nBank transfers can be costly for low-income individuals, due to their low-creditworthiness and subsequent high bank maintenance fees. In remote or rural areas, access to banking infrastructure might be entirely unavailable. Payment in cash involves a number of risks for insureds by itself, including theft and fraud. Furthermore, managing cash payments incurs extra costs for insurance companies, which are invariably passed onto the insured parties. Mobile money – while playing an important role in the financial development of developing nations – relies on trust to the operator of the mobile money system as a central authority. Furthermore, mobile money transactions aren’t cheap, especially for the poorest members of society who transact small amounts of money. According to an article by The East African newspaper, the average cost of sending $1 through the mobile money platform to a user on the same network is 9.5% of the value of the transaction, while the cost of sending $20 is 2.6%. Also, the mobile money transaction network across Africa is scattered with very low cross-system interoperability. Legislature: In many countries, especially developing ones, there is a lack of specific legislation for microinsurance. The rules and regulations designed for traditional insurance products are not always suitable for microinsurance, which could result in regulatory barriers. 
There are also challenges related to the enforcement of contracts, dispute resolution, and fraud.\nWith these challenges identified, let\u0026rsquo;s now analyze how Bitcoin and the Lightning network could address the three challenges outlined above, with the most significant potential lying in the realm of premium payments for microinsurance:\nFirstly, Bitcoin and Lightning allow any individual with a phone to open a wallet without needing authorization from any institution. This is transformative for the millions lacking access to traditional banking, as it allows them to store and transact value, and also enables them to manage and pay for their own microinsurance products. Secondly, payments in the Bitcoin Lightning network are simple to handle, have virtually no transaction fees and are settled within seconds. As such, Lightning combines the low transaction fees of cash payments with the accessibility and speed of mobile money, while removing any entrance barrier for people. This makes the Lightning network a perfect fit for microinsurance. Finally, Bitcoin and Lightning present innovative avenues for insurance companies to provide highly automated, programmable microinsurance policies, as sketched below. For instance, natural disasters could be insured automatically without transferring distribution and operating costs to the customer. Premium payments could be collected via Bitcoin Lightning, with agreed insurance payouts automatically triggered by certain metrics, like a specified amount of rainfall within a given period. This enables the construction of so-called ‘smart contracts’ with Bitcoin and Lightning.
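Purely as an illustration of such a programmable policy (hypothetical names and numbers; premium collection and settlement over Lightning are assumed, not shown), a parametric rainfall cover might be modeled like this:

```python
from dataclasses import dataclass

# Illustrative parametric microinsurance policy: the payout is triggered
# automatically by an observed metric, so no manual claims process is needed.
@dataclass
class RainfallPolicy:
    premium_sats: int      # premium collected via Lightning (hypothetical)
    payout_sats: int       # fixed payout if the trigger fires
    threshold_mm: float    # agreed rainfall threshold for the period

    def settle(self, observed_rainfall_mm: float) -> int:
        """Return the payout owed for this period, in satoshis."""
        return self.payout_sats if observed_rainfall_mm < self.threshold_mm else 0

policy = RainfallPolicy(premium_sats=5_000, payout_sats=400_000, threshold_mm=20.0)
print(policy.settle(observed_rainfall_mm=12.5))  # drought period -> 400000
print(policy.settle(observed_rainfall_mm=48.0))  # normal rainfall -> 0
```

In practice, the rainfall reading would come from an agreed oracle (for example, a weather station feed), which is where most of the real-world design effort would lie.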
Regarding education and distribution, Bitcoin and Lightning could enhance the accessibility of microinsurance products via the internet, allowing potential customers or their representative intermediaries to purchase insurance directly online. This would increase productivity and reduce operational costs, freeing up time for agents and brokers to focus on educating customers about the benefits of insurance.\nIn terms of legislation, Bitcoin and Lightning might present new opportunities. There is, as of now, a limited but growing body of legislation concerning Bitcoin, with its interpretation and enforcement varying greatly from jurisdiction to jurisdiction. While this will initially present a challenge in establishing a uniform regulatory environment for the use of Bitcoin and Lightning in microinsurance, the open, transparent, and programmable nature of Bitcoin can also present opportunities for regulatory innovation. For instance, smart contracts on Bitcoin can be transparently audited and automatically enforced, which might facilitate dispute resolution and deter fraudulent activity.\nSummary Bitcoin and the Lightning network could bring substantial benefits to the microinsurance sector by addressing key challenges. They could revolutionize the way premiums are paid, removing the barriers posed by traditional banking systems and mobile money platforms. They have the potential to automate microinsurance policies, bringing efficiency and cost-effectiveness. Online accessibility can enhance the distribution of microinsurance products, while also enabling better focus on the critical aspect of customer education.\nFurthermore, as Bitcoin and Lightning become more mainstream, they could foster a more conducive legislative environment that is adapted to the specific needs of microinsurance. With transparent and programmable financial instruments such as smart contracts, dispute resolution and fraud prevention can be significantly enhanced, fostering trust among customers and regulators.\nNevertheless, the potential of Bitcoin and Lightning should not be overestimated. While they offer promising solutions, they do not solve all the challenges facing microinsurance. Issues related to education and literacy, infrastructure, corruption, and unstable political systems remain significant hurdles that need to be addressed through a combination of interventions.\n","permalink":"http://www.gabriel-berardi.com/blog/bitcoin/2023-08-30-microinsurance-unleashed-the-potential-of-bitcoin-and-lightning/","summary":"\u003cp\u003e\u003cimg loading=\"lazy\" src=\"/blog/bitcoin/2023-08-30-microinsurance-unleashed-the-potential-of-bitcoin-and-lightning/images/img1.png\"\u003e\u003c/p\u003e\n\u003cp\u003eInsurance is a market-based risk management tool that is fundamentally based on sharing and diversifying risks among a community of policyholders. It creates a safety net against unpredictable events, protecting individuals and societies from potentially hazardous financial losses. This invaluable system not only provides a sense of security to its members but also fosters innovation and exploration, encouraging humans to embrace calculated risks. Thus, insurance makes our world more resilient, fostering both innovation and stability.\u003c/p\u003e","title":"Microinsurance – the Potential of Bitcoin and Lightning"},{"content":"\nIn a recent article and also on a Twitter thread, American entrepreneur, investor, and influencer Anthony Pompliano has called Bitcoin ‘the largest insurance company in the world’:\nHis argument in the article is quite straightforward:\nBitcoin can protect its buyers from currency debasement, sovereign default or undisciplined monetary and fiscal policy. In order to receive this risk protection, someone has to pay the current Bitcoin price, which could be considered a one-time insurance premium. The earlier they buy this insurance, the cheaper it is. Instead of relying on an insurance company to honor their policy during a crisis, Bitcoin offers a digitized solution that doesn’t require you to file a claim, eliminating the need for trust. Pompliano even goes further in his article. According to him, Bitcoin offers insurance for previously uninsurable risks, like high inflation or government seizure of assets. In summary, Bitcoin as a $500+ billion insurance product could be considered the largest insurance in the world. Bitcoin as an insurance? Even the largest insurance in the world? It is an intriguing thought, which many in the Bitcoin community seem to share. Let’s take a closer look at this.\nBitcoin Is a Hedge, Not an Insurance In our opinion, calling Bitcoin an insurance, never mind an insurance company, is flawed.\nFirst and foremost: Bitcoin is not a company. Of course, this is absolutely clear to anyone in the Bitcoin community, but we should be careful not to use these kinds of words mindlessly in the context of Bitcoin, especially because most other cryptosecurities are, in fact, companies. But let’s not focus on this aspect too much. Is Bitcoin an insurance then?\nIn our view, Bitcoin is not an insurance, but rather a hedge. Let me explain.\nBoth insurance and hedges serve as methods for risk management, aiming to mitigate potential financial losses.
However, there are some key differences between these two tools:\nPredetermined Compensation: In insurance, there is typically a contract that specifies the coverage and the amount of compensation to be provided in the event of a covered incident. This predetermined compensation provides foreseeable financial certainty to the buyer of the insurance policy. On the other hand, a hedge does not guarantee a specific compensation amount, but rather aims to minimize potential losses or offset risks by using financial instruments or strategies. Specific Event Coverage: Insurance is designed to cover specific events or risks, such as accidents, theft, or property damage. The compensation is triggered by the occurrence of these predetermined events. In contrast, a hedge is more general and focuses on reducing potential losses or managing risks across various aspects of an investment portfolio or financial position. Contractual Obligations: Insurance involves a contractual agreement between an insurer and a policyholder. The insurer agrees to provide coverage and compensation in exchange for premium payments. Hedging, on the other hand, does not necessarily involve contractual obligations between two parties. It often involves taking positions in financial instruments or strategies to offset risks. Timeframe: Insurance policies typically have defined terms and coverage periods. The compensation is provided during the term of the policy if the specific event occurs. Hedging, however, can be an ongoing strategy employed to manage risks and protect against potential losses over an extended period. Looking at all these points, Bitcoin obviously shares more characteristics with a hedge than with an insurance: there is no predetermined compensation, there is no specific event coverage, there is no contract between a risk carrier and an insured, and there are no defined terms and coverage periods.\nHowever, we might say that the general population uses terms like ‘insurance’ or ‘hedge’ interchangeably, and most people would probably not care too much about this level of detail. Let’s therefore suppose Bitcoin could be considered an insurance – how would its size compare to the insurance industry?\nBitcoin Would Be a Massive Insurer The following chart shows the market capitalization of Bitcoin at the time of writing this article in comparison to the world’s 50 largest insurance companies:\nSource: Coinmarketcap.com and Companiesmarketcap.com\nAs can be seen from this chart, in terms of market capitalization, Bitcoin is larger than any insurance company in the world! At its 2021 all-time high of $65,000, Bitcoin had a market capitalization of over $1,250bn, which was bigger than the combined market cap of the 30 largest insurance companies in the world, which employ millions of people!\nLet’s now take a look at how this relates to the whole global insurance market:\nSource: Researchandmarkets.com\nThe insurance industry is vast and encompasses numerous companies providing a wide range of coverage. To me, it is absolutely mind-boggling to see that Bitcoin already represents roughly 1/11 of the worldwide insurance market, at the time of writing this article!\nAs I’ve discussed above, insurance and hedges are two tools for risk management. In this regard, we consider Bitcoin a hedge.
However, it is important to keep in mind that Bitcoin’s role and potential extend far beyond mere risk management, and it offers several unique characteristics that set it apart from traditional insurance, hedges, or any other existing financial instruments.\nFor a growing number of people, Bitcoin is a long-awaited alternative to the fiat money standard we are currently living under. An alternative to inflationary money defined by unelected central and private bankers. An alternative to the centralized control of monetary policy. An alternative to the incentivizing of speculation and corruption. And an alternative to financial exclusion and political manipulation. As Michael Saylor, the founder and chairman of the software company MicroStrategy, put it:\n“Bitcoin is a bank in cyberspace, run by incorruptible software, offering a global, affordable, simple, and secure savings account to billions of people that don’t have the option or desire to run their own hedge fund.”\nSummary Whether Bitcoin is an insurance or not remains to be decided by the reader. To me, it would be an imprecise and flawed description of Bitcoin. In terms of risk management, Bitcoin shares more characteristics with a hedge. However, reducing Bitcoin to a mere risk management tool does not do justice to the full potential that lies in Bitcoin. It presents a groundbreaking alternative to the prevailing fiat money system, with unique characteristics that distinguish it from traditional financial instruments or assets. It offers an escape from inflationary money, centralized control of monetary policy, speculation-induced corruption, financial exclusion, and political manipulation.\n","permalink":"http://www.gabriel-berardi.com/blog/bitcoin/2023-07-30-is-bitcoin-the-largest-insurance-in-the-world/","summary":"\u003cp\u003e\u003cimg loading=\"lazy\" src=\"/blog/bitcoin/2023-07-30-is-bitcoin-the-largest-insurance-in-the-world/images/img1.png\"\u003e\u003c/p\u003e\n\u003cp\u003eIn a recent \u003ca href=\"https://pomp.substack.com/p/is-bitcoin-the-largest-insurance#details\" target=\"_blank\" \u003earticle\u003c/a\u003e and also on a Twitter thread, American entrepreneur, investor, and influencer Anthony Pompliano has called Bitcoin ‘the largest insurance company in the world’:\u003c/p\u003e\n\u003cp\u003e\u003cimg loading=\"lazy\" src=\"/blog/bitcoin/2023-07-30-is-bitcoin-the-largest-insurance-in-the-world/images/tweet.png\"\u003e\u003c/p\u003e\n\u003cp\u003eHis argument in the article is quite straightforward:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eBitcoin can protect its buyers from currency debasement, sovereign default or undisciplined monetary and fiscal policy. In order to receive this risk protection, someone has to pay the current Bitcoin price, which could be considered a one-time insurance premium. The earlier they buy this insurance, the cheaper it is.\u003c/li\u003e\n\u003cli\u003eInstead of relying on an insurance company to honor their policy during a crisis, Bitcoin offers a digitized solution that doesn’t require you to file a claim, eliminating the need for trust.\u003c/li\u003e\n\u003cli\u003ePompliano even goes further in his article. According to him, Bitcoin offers insurance for previously uninsurable risks, like high inflation or government seizure of assets.\u003c/li\u003e\n\u003cli\u003eIn summary, Bitcoin as a $500+ billion insurance product could be considered the largest insurance in the world.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eBitcoin as an insurance? Even the largest insurance in the world? It is an intriguing thought, which many in the Bitcoin community seem to share. Let’s take a closer look at this.\u003c/p\u003e
","title":"Is Bitcoin the Largest Insurance in the World?"},{"content":"\nThe first Bitcoin was mined in 2009 – only 14 years ago. In these 14 years, Bitcoin has developed from a curious experiment run by a small group of cypherpunks into a global phenomenon. Its journey from obscurity to mainstream recognition has been nothing short of remarkable, as reflected not only in its price, but also in the growing computing power securing this global network. Bitcoin is on its way to disrupting traditional financial systems, challenging the centralized control of money and empowering individuals with financial sovereignty.\nWhile much of the discussion in the Bitcoin community has focused on the impact of Bitcoin on central monetary policy and the private banking sector, there has been a surprising lack of discourse about the intersection between Bitcoin and insurance.\nOn the other hand, the insurance industry has kept itself busy discussing the usage of blockchain, web3 and cryptocurrencies in general. Few in the insurance industry have realized that Bitcoin – not crypto, not blockchain, not web3 – will have a disruptive impact on their business.\nThis article aims to shed light on why the insurance industry should care about Bitcoin and why, likewise, Bitcoiners should also consider the potential of insurance.\nThe Bitcoin Revolution and the Insurance Industry The impact of Bitcoin on traditional financial systems extends beyond just banking and monetary transactions. As the Bitcoin revolution gains momentum, it is important to explore its potential implications for all sectors, including the insurance industry.\nTo begin with, here are some high-level questions that leaders in the insurance sector should explore regarding the intersection of these two fields:\nHow could insurance companies utilize Bitcoin as a payment vehicle? What investment opportunities arise from the Bitcoin ecosystem? Which new products could insurance carriers develop as Bitcoin grows in importance? How can Bitcoin help to improve outdated processes within the insurance industry? By actively exploring and embracing Bitcoin’s potential, insurance companies can position themselves at the forefront of innovation, adapt to evolving customer needs, and drive positive transformation within the industry.\nLikewise, Bitcoiners should ask themselves:\nWhat would small and large scale risk diversification look like under a Bitcoin standard? What are the potential risk implications for Bitcoin mining companies? How can the insurance industry help Bitcoin based companies to be successful and grow adoption? How could insurance products tailored to the Bitcoin market benefit both the industry and individual Bitcoin users alike? Let’s now look more closely at the benefits that Bitcoin can bring to the insurance industry and how, vice versa, insurance can help the Bitcoin network.\nBenefits of Bitcoin for Insurance Companies Bitcoin offers several advantages that insurance companies can leverage. Among these advantages are:\nEfficient Payments: Bitcoin allows for faster and more cost-effective payments. Transactions on the main layer have an average historical cost of between $0.50 and $2.50, and transactions on the second layer are virtually free of charge, with average transaction fees of 0.0029%.
Insurance companies can make use of this for the payment of insurance premiums and the settlement of claims, but also for large-volume cross-border transactions between corporate entities.
Financial Inclusion: Around 20% of the world's population does not have access to the global financial system and hence lacks insurance coverage. These people have no way to store, invest or transfer their excess productivity, which massively hinders their personal and economic development. Bitcoin offers a game-changing solution. By allowing any person with a smartphone to store and transact value, Bitcoin has the potential to provide financial inclusion on a global scale. For insurance companies, this opens up new possibilities in thus far untapped markets, by providing risk coverage in these countries and regions.
Innovative Products: Individuals and businesses involved in Bitcoin, including Bitcoin holders and miners, are exposed to various risks, including theft, cyberattacks, and hardware damage. These risks can lead to significant financial losses. Specialized insurance products tailored for Bitcoin can mitigate these risks, offering protection and providing reassurance to the participants of the Bitcoin network.
Investment Opportunities: Finally, insurance companies can explore Bitcoin as an alternative investment asset, allowing them to diversify their portfolios and potentially achieve higher returns for their customers and shareholders. In the 12 years between 2011 and 2022, Bitcoin outperformed every other asset class in 9 out of 12 years. Cumulatively, the return of Bitcoin between 2011 and 2022 was around 1,500,000 %. The following table shows the annual return of the best-performing asset since 2011:
Year | Asset Class | Return
2011 | Bitcoin | 1,473 %
2012 | Bitcoin | 186 %
2013 | Bitcoin | 5,507 %
2014 | REIT | 28 %
2015 | Bitcoin | 35 %
2016 | Bitcoin | 125 %
2017 | Bitcoin | 1,331 %
2018 | Cash | 1.8 %
2019 | Bitcoin | 95 %
2020 | Bitcoin | 301 %
2021 | Bitcoin | 90 %
2022 | Cash | 1.6 %
Sources: Novel Investor and Good Financial Cents
With the growing importance of Bitcoin, insurance companies need to understand its implications for their core business and investment activities. By actively exploring its potential and managing its risk, insurance companies have the opportunity to benefit from this revolutionary digital, decentralized asset.
The Role of Insurance in a Bitcoin Standard Let's now shift our focus from the insurance industry to the Bitcoin community. Bitcoiners, who naturally feel skeptical towards traditional financial intermediaries, should recognize the role insurance companies can play in a Bitcoin standard world. Unlike banks, insurance companies provide risk mitigation and diversification that enable resilient societies and innovative economies.
Key considerations for Bitcoiners include:
Coverage for Bitcoin custody: As the adoption of Bitcoin grows, individuals and businesses will have a need for insurance coverage to protect their asset holdings from theft, loss, or other risks. Bitcoiners should consider partnering with insurance companies to develop tailored coverage options specifically designed for Bitcoin self-custody and third-party custody.
Bitcoin mining insurance: Bitcoin mining is the backbone of the Bitcoin blockchain and hence crucial for the security of the network. Bitcoin mining companies face risks such as physical damage or loss of their expensive hardware, as well as cybersecurity risks.
Insurance companies can support Bitcoin mining operators in managing these risks through tailored insurance products.
Business continuity and resilience: Insurance companies can support the Bitcoin community by offering coverage for businesses that rely on Bitcoin for their operations. This includes coverage for physical infrastructure, as well as coverage for business interruption or loss of funds due to unforeseen events. These insurance products can help ensure the resilience and continuity of Bitcoin-related businesses and organizations.
Cybersecurity protection: With the increasing use of Bitcoin, the need for robust cybersecurity measures becomes vital. Insurance companies specializing in cybersecurity insurance can work closely with the Bitcoin community to develop comprehensive coverage options that protect against cyber threats, hacking incidents, and data breaches.
Regulatory compliance and legal protection: Insurance companies possess expertise in navigating complex regulatory environments and can provide legal protection through insurance coverage for potential disputes or legal actions related to Bitcoin transactions.
Hedging against price volatility: Bitcoin is a very young currency and has naturally experienced a lot of volatility in recent years. This can pose risks for individuals and businesses holding significant amounts of Bitcoin or using Bitcoin as a payment method in their business. Insurance companies can offer innovative products that enable Bitcoiners to hedge against price fluctuations, providing a level of stability and risk management in a Bitcoin standard world.
By embracing the benefits of insurance, Bitcoiners can strengthen the resilience and stability of the Bitcoin ecosystem. These partnerships can address the unique risks and challenges faced by the Bitcoin community, providing them with peace of mind and fostering trust in the broader adoption of Bitcoin.
Summary As Bitcoin continues to disrupt the financial services industry, it is crucial for the insurance industry and Bitcoiners to recognize the potential synergies and opportunities that lie at the intersection of insurance and Bitcoin. By embracing this convergence, insurance companies can leverage Bitcoin's advantages, create innovative products, and streamline processes. Simultaneously, Bitcoiners can utilize insurance as a tool for risk diversification and protection. Together, the insurance industry and the Bitcoin community can shape the future of finance and redefine the boundaries of possibility.
","permalink":"http://www.gabriel-berardi.com/blog/bitcoin/2023-06-14-bitcoin-and-insurance-why-is-no-one-talking-about-this/","title":"Bitcoin and Insurance – Why Is No One Talking About This?!"},{"content":"
Recently, I read a thread on Twitter about several Machine Learning papers that contained severe cases of data leakage. The authors of the papers seemed unaware of this phenomenon and therefore trained models that performed exceptionally well. Unfortunately, this was mainly due to data leakage.
Not many beginners are aware of this problem and in my opinion, not many courses emphasize this issue early enough. Therefore, I would like to tell you all the things you need to know about data leakage and some ways to prevent it in this post.
What is Data Leakage? In Machine Learning, we usually split our data into a training and a test set. We use the training set to train our model and review the quality of our model using the test set. This should be very clear to everyone. If not, check out this post.
However, some people tend to forget what the test set actually represents: new, unseen data. The whole point of the test set is that it can be used as if it were unknown data on which we want to make predictions using our trained model.
Therefore, it is absolutely crucial to separate our training and our test set from each other and avoid leakage from one set to the other. We want to train our model solely based on the training set's data.
This may sound a bit abstract at first. To clear things up, let's talk about a few common data leakage mistakes you should avoid.
Common Types of Data Leakage
Feature Leakage Imagine this: you try to predict the annual income of a person based on certain features like age, education background, industry, years of working experience and so on. You divide the data into input features and the target feature, namely the column of the data set that contains the annual income for each person, do your train/test split, train the model and get an accuracy of 99.99 %.
Unfortunately, you overlooked that one of the input features you fed into the model is 'annual_tax', while another feature is 'tax_quota'. With these two features alone, it is an easy job to estimate the annual income of any person, but obviously, this information wouldn't be available in a production environment and your model is likely to perform very poorly without these two features!
In other words: you included input features that indirectly represent the target feature and wouldn't be known ex-ante. This is called feature leakage or column-wise leakage!
Scaling Leakage Scaling your data means applying a certain function to all entries of the dataset, in order to eliminate differences in the order of magnitude, while still keeping the relative information of these data points. We want to keep the original distribution of the data, squeezed into a certain range like 0 to 1 or -1 to 1. Two very common methods of scaling data are min-max Normalization and Z-score Normalization.
Scaling leakage happens if we include the test set when determining the parameters to scale the data. In the case of Z-score Normalization, for example, we need to calculate both the mean and the standard deviation of the data. Then, we subtract the mean from every data point and divide by the standard deviation.
If we include the test set in the scaling process, we would assume prior knowledge of the test set's distribution, which, of course, is a mistake. Therefore, you always need to fit the scaler on the training set and only transform the test set using the scaler!
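To make the right order concrete, here is a minimal sketch, assuming scikit-learn and an existing feature matrix X with labels y:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# split first, then learn the scaling parameters from the training data only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

scaler = StandardScaler()
scaler.fit(X_train)                  # mean and standard deviation come from the training set
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)    # the test set is only transformed, never fitted on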
Time Leakage Working with time-series data can be tricky for different reasons. Time leakage is just one of them. Imagine you work with data of a company's stock price. For simplicity, we only look at the stock price of three days. The stock price at the end of these three days is as follows:
D1: 110 EUR
D2: 120 EUR
D3: 130 EUR
Now let's say D1 and D3 end up in the training set, while we test on D2. This would represent data leakage because we trained our model on future information relative to the test data. We would already know that the stock price at D3 is higher than on D1, therefore it is more likely that the stock price on D2 was higher than on D1.
So, when dealing with time-series data, we should always use the more recent data for testing, instead of splitting the data randomly, in order to acknowledge the dependencies in the data.
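A minimal sketch of such a chronological split, assuming a pandas DataFrame df that is sorted by date in ascending order:

# use the oldest 80% of observations for training and the most recent 20% for testing
split_point = int(len(df) * 0.8)
train = df.iloc[:split_point]
test = df.iloc[split_point:]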
Missing Data Leakage When dealing with missing data, we can either remove the affected entries, delete the whole feature, or find a meaningful way to augment the data. For example, we might replace the missing entries with the mode or mean of the other entries with respect to that particular feature. But if we do that before splitting the data, we yet again introduce leakage into our training set, because we transfer knowledge of the test set into the training set.
Therefore, we should first do the train/test split, and then compute the mode or mean on the training set alone and use these values to fill in the missing data in both the training set and the test set.
Duplicate Leakage A final example of data leakage can be found in identical or nearly identical entries in a dataset. If such entries exist, and you randomly split them into training and test data, chances are that some identical entries can be found both in the training and the test set, which represents data leakage!
An example of such duplicate entries could be web-scraped customer reviews: Some customers might accidentally publish the same review more than once. An example of nearly duplicate entries would be the over- or bootstrap-sampling of a dataset before splitting the data into a training and test set.
To avoid data leakage from (nearly) duplicate entries, you simply need to use a method to detect and delete these entries. One such method for text data is called fuzzy matching.
Summary Data leakage can be a difficult thing that even qualified professionals struggle with at times. The above list is by no means complete and unfortunately, there are many more data leakage pitfalls out there. However, it is crucial to be aware of it and follow best practices to avoid data leakage, such as always performing feature engineering after doing the train/test split. In general, you should avoid doing anything to your training set that involves having knowledge of the test set. As a rule of thumb: split early and leave the test set untouched until you're done training your model.
If you ever feel unsure whether a certain action might introduce data leakage to your model, make sure to discuss the matter with colleagues or ask other practitioners or researchers online on websites like stats.stackexchange.com, reddit.com/r/MachineLearning or kaggle.com/discussion.
Sources and Further Material https://en.wikipedia.org/wiki/Leakage_(machine_learning) https://towardsdatascience.com/data-leakage-in-machine-learning-10bdd3eec742 https://www.youtube.com/watch?v=n9jz7G68pVg
","permalink":"http://www.gabriel-berardi.com/blog/data/2020-12-01-data-leakage-in-machine-learning/","title":"Data Leakage in Machine Learning"},{"content":"
Counterfeit money is a real problem both for individuals and for businesses. Counterfeiters constantly find new ways and techniques to produce fake banknotes that are essentially indistinguishable from real money. At least for the human eye!
Identifying forged banknotes is a typical example of a binary classification task in Machine Learning. If we have enough data on both real and forged banknotes, we can use this data to train a model that can classify new banknotes as either real or fake.
Therefore, in this post, we are going to explore how we can use a simple Logistic Regression to determine whether a banknote is real or forged!
Data Exploration I found a dataset on the UCI Machine Learning Repository that contains data of 1,372 real and forged banknotes. According to the UCI, the data was extracted from images of genuine and forged banknotes. The authors used Wavelet Transform to extract the first three features from these images. This is quite a complicated process, but broadly speaking, it means that they extracted information about the distribution of certain aspects of these images. If you are interested in the details, here's a link to a paper on this subject from the author of the dataset we are using. The fourth feature, the entropy, was obtained from the original images.
Let's spend one more minute on entropy. In general, entropy is a statistical measure of randomness. The entropy of an image can be understood as the amount of information within an image. The authors state that they have used 400x400 pixel images of the banknotes. In the paper co-written by the author of the original dataset, I found three images of a genuine banknote, a high-quality forgery and a low-quality forgery. I then used the CakeImageAnalyzer tool by Jean Vitor to check the entropy of these three images.
The first image is the genuine banknote with an entropy of 4.737. The second image is a high-quality forgery with an entropy of 4.373, while the last image is a low-quality forgery with an entropy of 4.189.
As we can see, there seems to be a connection between the entropy and the authenticity of a banknote. However, the entropy alone is not enough to reliably detect forged banknotes!
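If you would like to measure the entropy of an image yourself in Python, here is a minimal, hedged sketch using scikit-image (the file name is made up for illustration, and the exact value will depend on image resolution and preprocessing):

from skimage.io import imread
from skimage.measure import shannon_entropy

img = imread('banknote.png', as_gray=True)  # hypothetical file name
print(shannon_entropy(img))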
Now back to our task at hand. Our dataset contains these four input features:
Variance of Wavelet Transformed image
Skewness of Wavelet Transformed image
Curtosis of Wavelet Transformed image
Entropy of image
The target feature is simply 0 for real banknotes and 1 for forged banknotes.
Finally, let's start coding. First, we are going to need some modules.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
Next, we are going to read in the dataset, assign headers and check the first five rows to see if it worked.
data = pd.read_csv('data_banknote_authentication.txt', header=None)
data.columns = ['var', 'skew', 'curt', 'entr', 'auth']
print(data.head())
Alright. Now let's start exploring the dataset. First, we should check the data types and if there are any missing values.
print(data.info())
Perfect, no missing values and the datatypes of all the features are fine.
Now we can plot a pairplot to get an overview of the relationship between all features. I will also color the observations: blue for genuine banknotes and orange for forged banknotes.
sns.pairplot(data, hue='auth')
plt.show()
From this pairplot we can make several interesting observations:
The distribution for both the variance and the skewness seem to be quite different for the two target classes, while the curtosis and entropy seem more alike.
There are clear linear and non-linear trends in the input features.
Some features seem to be correlated.
Certain features seem to separate the genuine and forged banknotes quite well.
It is hard to see any correlation between the input features and the target feature, therefore I will plot a correlation heatmap.
mask = np.zeros(data.corr().shape, dtype=bool)   # hide the redundant upper triangle of the symmetric matrix
mask[np.triu_indices(len(mask))] = True
plt.figure(figsize=(7,6))
plt.title('Correlation Heatmap of All Features', size=18)
ax = sns.heatmap(data.corr(), cmap='coolwarm', vmin=-1, vmax=1, center=0, mask=mask, annot=True)
plt.show()
Interesting. We can see a rather high correlation of -0.72 between the variance and the target feature and some correlation of -0.44 between the skewness and the target feature.
It's important to keep in mind that the pd.DataFrame.corr() method uses Pearson correlation, which only measures the linear relationship between two variables. There might be other relationships in the data that cannot be observed so easily.
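As a quick additional check, pandas can also compute a rank-based Spearman correlation on the same data, which captures monotonic rather than purely linear relationships:

print(data.corr(method='spearman'))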
Last but not least, we should check if our data is balanced with regard to the target feature.
plt.figure(figsize=(8,6))
plt.title('Distribution of Target', size=18)
sns.countplot(x=data['auth'])
target_count = data.auth.value_counts()
plt.annotate(s=target_count[0], xy=(-0.04,10+target_count[0]), size=14)
plt.annotate(s=target_count[1], xy=(0.96,10+target_count[1]), size=14)
plt.ylim(0,900)
plt.show()
The dataset is fairly balanced, but I think for this project, we should balance it perfectly. So let's start the data preprocessing by doing exactly that.
Data Preprocessing First, we are going to balance the dataset. The easiest way to do this is to randomly delete a number of instances of the overrepresented target class. This is called random undersampling. In the opposite case, we could also create new synthetic data for the underrepresented target class. That would be called oversampling. You can read more on this topic here. For now, let's randomly delete 152 observations of real banknotes.
There are various ways to achieve this. I decided to first shuffle the dataset, sort it to keep all genuine banknotes at the top, and then simply slice off the first 152 rows.
nb_to_delete = target_count[0] - target_count[1]
data = data.sample(frac=1, random_state=42).sort_values(by='auth')
data = data[nb_to_delete:]
print(data['auth'].value_counts())
Perfect. Now, we have an evenly balanced dataset.
Next, we need to split our data into a training set and a test set. I decided to use 70% of the data for training and 30% for testing.
X = data.loc[:, data.columns != 'auth']
y = data.loc[:, data.columns == 'auth']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
Finally, we should scale our input features. I am using standardization here, which means that for each datapoint in a given feature, we subtract the mean and divide it by the standard deviation. It is very important to only fit the scaler on the training set, not on the test set. Then, we use the obtained parameters on the test set. This is to prevent data leakage from the test set into the training set. It basically means that we need to treat the test data set as new, unseen data and prevent any information about the test set from being used for training our model. You can read more about this here.
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
Train and Test a Model Now we only need to train our model. As mentioned before, we are going to use a Logistic Regression here. With Logistic Regression, we can not only classify new data, but we can also extract the probability that a new observation belongs to either class.
I am going to use the LogisticRegression class from sklearn and leave all parameters at their defaults.
clf = LogisticRegression(solver='lbfgs', random_state=42, multi_class='auto')
clf.fit(X_train, y_train.values.ravel())
And ultimately, we can use the test set to make some predictions and compare them with the actual target class. To see how good the model performs, we can print out a confusion matrix and calculate the accuracy.
y_pred = np.array(clf.predict(X_test))
conf_mat = pd.DataFrame(confusion_matrix(y_test, y_pred), columns=['Pred. Negative', 'Pred. Positive'], index=['Act. Negative', 'Act. Positive'])
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
accuracy = round((tn+tp)/(tn+fp+fn+tp),4)
print(conf_mat)
print(f'\nAccuracy = {round(100*accuracy,2)}%')
Neat! Our simple Logistic Regression model reached an accuracy of 98.36 %. And not only that: when the model predicted that a banknote was real (Pred. Negative), it was correct in 100 % of all cases.
One last thing we can do is to simulate the prediction of a single new banknote. All we need to do is to extract the features, scale them and feed them into our pretrained model. We can also inspect the probabilities that the banknote belongs to each target class.
new_banknote = np.array([4.5, -8.1, 2.4, 1.4], ndmin=2)
new_banknote = scaler.transform(new_banknote)
print(f'Prediction: Class {clf.predict(new_banknote)[0]}')
print(f'Probability [0/1] : {clf.predict_proba(new_banknote)[0]}')
Our model predicts that this banknote is real, but it estimates a probability of only 61%. In other words, the model is not very sure that this banknote is indeed genuine. The default threshold of the sklearn Logistic Regression to decide between class 0 (real) and class 1 (forged) is 50 %, but we could easily decrease this threshold to 30 or 40 %, in order to minimize the risk of wrongly accepting a forged banknote as a real one. A metric often used to determine the best threshold is the ROC curve and the area under it (AUC).
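As a hedged sketch of how such a lower threshold could be applied with the model trained above (the value 0.3 is chosen purely for illustration):

# flag a banknote as forged (class 1) as soon as its predicted probability exceeds 30%
y_pred_strict = (clf.predict_proba(X_test)[:, 1] >= 0.3).astype(int)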
And that's it for this post. If you want, you can take a look at the full code and try to increase the accuracy of the model or try other methods like a Random Forest or an Artificial Neural Net. Thanks for reading!
Full Code on Github Link: https://gist.github.com/gabriel-berardi/ce716edb20c032714213ed6556abf27c
# Importing required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Loading the dataset from https://archive.ics.uci.edu/ml/datasets/banknote+authentication
data = pd.read_csv('data_banknote_authentication.txt', header=None)
data.columns = ['var', 'skew', 'curt', 'entr', 'auth']
print(data.head())

# Show information about all features
print(data.info())

# Use pairplot to get an overview of the features
sns.pairplot(data, hue='auth')
plt.show()

# Display a correlation heatmap of all features
mask = np.zeros(data.corr().shape, dtype=bool)
mask[np.triu_indices(len(mask))] = True
plt.figure(figsize=(7,6))
plt.title('Correlation Heatmap of All Features', size=18)
ax = sns.heatmap(data.corr(), cmap='coolwarm', vmin=-1, vmax=1, center=0, mask=mask, annot=True)
plt.show()

# Show the distribution of the target
plt.figure(figsize=(8,6))
plt.title('Distribution of Target', size=18)
sns.countplot(x=data['auth'])
target_count = data.auth.value_counts()
plt.annotate(s=target_count[0], xy=(-0.04,10+target_count[0]), size=14)
plt.annotate(s=target_count[1], xy=(0.96,10+target_count[1]), size=14)
plt.ylim(0,900)
plt.show()

# Balance the dataset with regard to the target feature
nb_to_delete = target_count[0] - target_count[1]
data = data.sample(frac=1, random_state=42).sort_values(by='auth')
data = data[nb_to_delete:]
print(data['auth'].value_counts())

# Split our data into a training and test data set
X = data.loc[:, data.columns != 'auth']
y = data.loc[:, data.columns == 'auth']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale the features. Note: only fit the scaler on training data to prevent data leakage
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Train a Logistic Regression model
clf = LogisticRegression(solver='lbfgs', random_state=42, multi_class='auto')
clf.fit(X_train, y_train.values.ravel())

# Make predictions on the test data
y_pred = np.array(clf.predict(X_test))

# Print a confusion matrix and calculate accuracy
conf_mat = pd.DataFrame(confusion_matrix(y_test, y_pred), columns=['Pred. Negative', 'Pred. Positive'], index=['Act. Negative', 'Act. Positive'])
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
accuracy = round((tn+tp)/(tn+fp+fn+tp),4)
print(conf_mat)
print(f'\nAccuracy = {round(100*accuracy,2)}%')

# Simulate the prediction of a single new banknote
new_banknote = np.array([4.5, -8.1, 2.4, 1.4], ndmin=2)
new_banknote = scaler.transform(new_banknote)
print(f'Prediction: Class {clf.predict(new_banknote)[0]}')
print(f'Probability [0/1] : {clf.predict_proba(new_banknote)[0]}')
Sources and Further Materials https://archive.ics.uci.edu/ml/datasets/banknote+authentication https://jeanvitor.com/image-entropy-value-visualization/
","permalink":"http://www.gabriel-berardi.com/blog/data/2020-09-01-detect-forged-banknotes-with-a-logistic-regression/","title":"Detect Forged Banknotes with a Logistic Regression"},{"content":"
Linear and Logistic regression are among the most elementary algorithms for supervised learning. Supervised Learning describes the situation where we deal with labelled data, which means that we have labelled inputs and a target variable.
Despite the fact that both have the word "regression" in their name, only one of them is typically used for solving regression problems!
Let's see how they work!
Linear Regression Linear regression is possibly the easiest, most intuitive way of making a quantitative prediction. The relationship between an independent and a dependent variable is assumed to be linear, meaning that the dependent variable can be predicted using a linear function of the independent variable.
For example:
ŷ = β0 + β1 · x
Where ŷ is the predicted value of the dependent variable, x is the independent variable, β0 is the y-intercept (aka bias-term) and β1 is a constant that determines the slope of the function.
In the above case, we only have one independent (aka explanatory) variable x. This case is called a "Simple Linear Regression".
If we have more than one independent explanatory variable, we talk about "Multiple Linear Regression", while the case in which we try to predict more than one dependent variable is known as "Multivariate Regression".
For now, let's focus on the Simple Linear Regression and make things a bit clearer with an example.
Take a look at this scatter plot:
Here we see data for 30 different products of filter coffee. On the x-axis we have the percentage of Caffeine and on the y-axis the price per unit in Euro.
As human beings, it is quite simple for us to quickly spot an existing trend in a two-dimensional scatter plot. It seems that the higher the caffeine content in a coffee product, the lower the price. This has something to do with the two common types of coffee: Arabica and Robusta. We'll get back to that later!
We can visualize this trend by adding a matching straight line to the scatter plot:
This yellow line seems about right. If we now know the percentage of caffeine for a new product, we can estimate its price by looking at the matching value on the yellow line or in the corresponding linear function.
But wait… How do we know this is really the best fitting line? Isn't there an infinite amount of lines we could plot that all more or less represent our data? How do we know which of the following lines is the best:
What we need is a certain criterion to measure the quality of these lines and their corresponding functions. Such a criterion is often called a "loss function" (aka cost function).
The most common criterion for regression problems is the "Root Mean Square Error (RMSE)", which represents the average mistake a model makes in its predictions, while big mistakes have a higher weight than small mistakes:
RMSE = √( (1/n) · Σᵢ (yᵢ − ŷᵢ)² )
Naturally, we are looking for a linear regression function that minimizes the RMSE.
It is easier to do this minimization for the "Mean Squared Error (MSE)" instead, and it yields the same result!
To find the best parameters for our regression function, we can use "Ordinary Least Squares", or alternatively an optimization algorithm called "Gradient Descent". You can read more about Gradient Descent here.
I will not go into more details in this post. What's important is that we can find a linear regression function that minimizes the RMSE/MSE and therefore represents the best fitting line to our data.
In the simplest terms, the goal of a linear regression is to find a function that is "closest" to as many data points as possible. In the 2-dimensional space, this function represents a line.
The GIF below visualizes this:
Let's now go back to our coffee example above. I will use the statsmodels package in Python to find the regression function for our dataset.
import statsmodels.api as sm

X = sm.add_constant(df.Caffeine, prepend=False)   # explanatory variable plus a constant term
mod = sm.OLS(df.Price, X)                         # the price is the dependent variable
res = mod.fit()
print(res.summary())
From the statsmodels summary we know that the y-intercept of the linear regression function is 6.5855 and the coefficient for our explanatory variable is -65.2992. Therefore, we can define our regression function for our example as:
ŷ = 6.5855 − 65.2992 · x
With this function, we can easily predict the price for a new product of coffee, of which we know the content of caffeine. For example, if the coffee has 1.5 % of caffeine, we predict the price to be 6.5855 − 65.2992 × 0.015 ≈ 5.61 €.
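The same prediction can also be read directly off the fitted model; a small sketch, assuming the res object from above (with prepend=False, the exogenous columns are [Caffeine, const]):

import numpy as np

print(res.predict(np.array([[0.015, 1.0]])))  # ≈ 5.61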
Logistic Regression Logistic Regression is a very powerful way to solve classification problems by assigning a certain probability of a class existing or event occurring.
In its basic form it uses a logistic function (a type of sigmoid function) to model a binary dependent variable, such as success/failure, yes/no, dog/cat etc.
A logistic function can be defined as:
f(x) = 1 / (1 + e^x)
This function has an important characteristic: No matter what value you plug in for x, f(x) is always in the range from 0 to 1, which already gives us a hint that this function can be used to estimate probabilities.
Let's plot this:
In the simplest case of having only one independent variable, the predicted probability of a logistic regression can be expressed as:
p(x) = 1 / (1 + e^(β0 + β1 · x))
So all we need to do is to find the parameters β0/β1 that modify a logistic function which can model our data. As in the case of Linear Regression, we need a loss function to find these parameters. Such a loss function should be able to penalize the assignment of wrong probabilities by the model.
Let's say for one specific training data point where y = 1, two Logistic Regression models a and b predict 0.25 and 0.75 respectively. Both models predict a probability ≠ 1 and therefore we need to assign a loss to both of them. However, the loss for model a should be much larger than the loss for model b, because assuming a threshold of 0.5, model b would still have predicted the outcome to be 1 instead of 0.
Let's use the negative logarithm (base 10 here) to assign a loss. Remember that we assume the true value y to be 1.
Model a: -log(0.25) ≈ 0.60
Model b: -log(0.75) ≈ 0.12
Let's plot a graph for all predicted probabilities:
As we clearly see, the loss for p → 0 is extremely high, while higher predicted probabilities cause a much lower loss. If we had the situation of true y = 0, we would use -log(1-p) instead.
Long story short: we choose the "Log Loss" as our loss function to find the optimal parameters for our Logistic Regression function. You can read more about Log Loss here. The optimal parameters can be found using Gradient Descent or another optimization algorithm.
Let's finally get back to our good old coffee example. As mentioned above, there are two main categories of coffee: Arabica and Robusta. Arabica generally contains less caffeine, yet is more expensive than Robusta!
Let's use Logistic Regression to train a model to determine the kind of coffee.
For our data we have the following features:
I will use sklearn to find the Logistic Regression function. I will only use the Price feature, in order to be able to plot the function.
from sklearn.linear_model import LogisticRegression

X = df.loc[:, df.columns == 'Price']
y = list(df.loc[:, df.columns == 'Type'].Type)
logreg = LogisticRegression(random_state=42, solver="lbfgs")
logreg.fit(X, y)
print(f"Beta 0 is {round(logreg.intercept_[0], 4)}")
print(f"Beta 1 is {round(logreg.coef_[0][0], 4)}")
Note that sklearn uses a slightly different form of the logistic function mentioned above, meaning that we need to take the negative of these parameters to plug into our previous logistic function.
Therefore, the logistic function for our data is:
p(x) = 1 / (1 + e^(10.5731 − 2.4588 · x))
And if we plot this:
If we want to determine the type of a new product of coffee, we simply plug the price, for example 4.75 €, into our function: 1 / (1 + e^(10.5731 − 2.4588 · 4.75)) ≈ 75%. Therefore, the model would predict that this coffee is an Arabica.
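As a quick sanity check, sklearn can produce the same probability directly from the fitted model; a minimal sketch, assuming the logreg object from above (the column order of the output follows logreg.classes_):

print(logreg.predict_proba([[4.75]]))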
Summary That's it for this post. We have seen the basics of linear and logistic regression and how they can be used.
The table below summarizes the key differences between linear and logistic regression:
Linear Regression | Logistic Regression
Models the relationship between one or more independent variables and a dependent variable | Predicts the probability of a binary outcome, using one or more independent variables
For regression problems | For classification problems
Uses a linear function | Uses a sigmoid function
Minimizes RMSE/MSE | Minimizes Log Loss
Sources and Further Material Geron, Aurélien - Hands-On Machine Learning (2017) Ng, Annalyn & Soo, Kenneth - Numsense! Data Science for the Layman (2017) https://en.wikipedia.org/wiki/Linear_regression https://en.wikipedia.org/wiki/Logistic_regression
","permalink":"http://www.gabriel-berardi.com/blog/data/2020-07-01-linear-and-logistic-regression/","title":"Linear and Logistic Regression"},{"content":"After reading this article from Pratap Vardhan with great interest, I wanted to build my own version of a Bar Chart Race that is smoother and a bit more beautiful. The biggest improvement is the interpolation (or augmentation) of the available data points in order to make the animation smoother.
Here is the Bar Chart Race we are going to build in this article:
For the purpose of this demonstration, we are going to use a GDP per capita forecast dataset provided by the OECD.
You can find the original dataset here.
You can find the full code via GitHub at the end of the article!
Importing Modules We start by importing the modules. We are going to use pandas for data handling, matplotlib for the graphs and NumPy for matrix operations. The usage of colorsys and re will be explained later in the code.
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import matplotlib.animation as animation
import matplotlib.colors as mc
import colorsys
import re
from random import randint
Data Preparation We read in the CSV file and get rid of all variables except GDP per capita. We then delete data entries that resemble regions instead of single countries and delete all columns except Country, Time and Value.
df = pd.read_csv("EO95_LTB_07012019070138125.csv")
df = df.loc[df['Variable'] == 'GDP per capita in USA 2005 PPPs']
df = df[((df.Country != 'OECD - Total') & (df.Country != 'Non-OECD Economies') & (df.Country != 'World') & (df.Country != 'Euro area (15 countries)'))]
df = df[['Country', 'Time', 'Value']]
Now, we want to interpolate the data to make the animation smooth. In order to do that, we first pivot the data frame to wide format:
df = df.pivot(index = 'Country', columns = 'Time', values = 'Value')
df = df.reset_index()
I wrote a simple for-loop which adds the average value of two columns in between two existing values and assigns a numeric name, starting with the time value of the interpolated column. I add a '^' to the name, so when I am displaying the time unit, I can easily get rid of everything after this particular character.
When the last column of the data frame is reached, the code will run into an error because the df.iloc statement won't be able to select a column (since it will be non-existent). Therefore, I use a try/except statement and indicate whenever an interpolation is done. The steps of interpolation can be adjusted by changing the n in range(n) in the for-loop.
Next, we pivot the table back to long format using the melt method.
for p in range(3):
    i = 0
    while i < len(df.columns):
        try:
            a = np.array(df.iloc[:, i + 1])
            b = np.array(df.iloc[:, i + 2])
            c = (a + b) / 2
            df.insert(i+2, str(df.iloc[:, i + 1].name) + '^' + str(len(df.columns)), c)
        except:
            print(f"\n Interpolation No. {p + 1} done...")
        i += 2
df = pd.melt(df, id_vars = 'Country', var_name = 'Time')
Defining the Frames List When creating animations with matplotlib, we always need a list of all frames, which will be passed to the core function that draws each frame. Here we create this frames list by taking all unique values of the Time column and converting them to a list. I add the last value of the Time column (the last point in time) to the frames list ten times in order to stop the animation for a few seconds at the end, before replay.
frames_list = df["Time"].unique().tolist()
for i in range(10):
    frames_list.append(df['Time'].iloc[-1])
Defining the Color Schema The next block of code assigns the colors of the bar chart. First, I use a function that can transform any color to a lighter/darker shade. I found the function in this Stackoverflow post.
It requires the colorsys module we imported at the beginning.
Next, we put all names in a list and create as many random HEX colors as there are names. The reason why we use random colors is to add more flexibility to the code. We want it to be reusable, even if the number of elements changes. Lastly, we create three lists of these colors: normal colors, slightly transparent colors and darker colors.
def transform_color(color, amount = 0.5):
    try:
        c = mc.cnames[color]
    except:
        c = color
    c = colorsys.rgb_to_hls(*mc.to_rgb(c))
    return colorsys.hls_to_rgb(c[0], 1 - amount * (1 - c[1]), c[2])

all_names = df['Country'].unique().tolist()
random_hex_colors = []
for i in range(len(all_names)):
    random_hex_colors.append('#' + '%06X' % randint(0, 0xFFFFFF))

rgb_colors = [transform_color(i, 1) for i in random_hex_colors]
rgb_colors_opacity = [rgb_colors[x] + (0.825,) for x in range(len(rgb_colors))]
rgb_colors_dark = [transform_color(i, 1.12) for i in random_hex_colors]
Now we have arrived at the core function of this code!
We define a new data frame called df_frame which contains the top elements at this point in time.
We then draw a bar chart for this particular time frame with the top elements, using the correct color from the normal_colors dictionary. In order to make the chart prettier, we draw a darker shade around each bar using the respective color from the dark_colors dictionary.
The rest of the function is simply formatting the graph. We write the name and the value next to each bar. Then, we display the time unit of each frame at the top right position. Here we make use of the '^' character we assigned earlier when we did the interpolation. Using a regular expression we can get rid of all characters after the '^' and then display the respective time unit. Here we need the re module we imported at the beginning.
Next, we add the chart title and axis label, and we format the numbers on the x-axis and display them at the top of the chart.
We get rid of the y-axis ticks and add grid lines to the chart.
Lastly, we limit the number of ticks to 4, get rid of the black frame around the chart and adjust the margin on each side.
fig, ax = plt.subplots(figsize = (36, 20))
num_of_elements = 8

def draw_barchart(Time):
    df_frame = df[df['Time'].eq(Time)].sort_values(by = 'value', ascending = True).tail(num_of_elements)
    ax.clear()
    normal_colors = dict(zip(df['Country'].unique(), rgb_colors_opacity))
    dark_colors = dict(zip(df['Country'].unique(), rgb_colors_dark))
    ax.barh(df_frame['Country'], df_frame['value'], color = [normal_colors[x] for x in df_frame['Country']], height = 0.8, edgecolor = ([dark_colors[x] for x in df_frame['Country']]), linewidth = '6')
    dx = float(df_frame['value'].max()) / 200
    for i, (value, name) in enumerate(zip(df_frame['value'], df_frame['Country'])):
        ax.text(value + dx, i + (num_of_elements / 50), '   ' + name, size = 36, weight = 'bold', ha = 'left', va = 'center', fontdict = {'fontname': 'Trebuchet MS'})
        ax.text(value + dx, i - (num_of_elements / 50), f'{value:,.0f}', size = 36, ha = 'left', va = 'center')
    time_unit_displayed = re.sub(r'\^(.*)', r'', str(Time))
    ax.text(1.0, 1.14, time_unit_displayed, transform = ax.transAxes, color = '#666666', size = 62, ha = 'right', weight = 'bold', fontdict = {'fontname': 'Trebuchet MS'})
    ax.text(-0.005, 1.06, 'GDP/capita', transform = ax.transAxes, size = 30, color = '#666666')
    ax.text(-0.005, 1.14, 'Projection of GDP/capita from 2010 to 2060', transform = ax.transAxes, size = 62, weight = 'bold', ha = 'left', fontdict = {'fontname': 'Trebuchet MS'})
    ax.xaxis.set_major_formatter(ticker.StrMethodFormatter('{x:,.0f}'))
    ax.xaxis.set_ticks_position('top')
    ax.tick_params(axis = 'x', colors = '#666666', labelsize = 28)
    ax.set_yticks([])
    ax.set_axisbelow(True)
    ax.margins(0, 0.01)
    ax.grid(which = 'major', axis = 'x', linestyle = '-')
    plt.locator_params(axis = 'x', nbins = 4)
    plt.box(False)
    plt.subplots_adjust(left = 0.075, right = 0.75, top = 0.825, bottom = 0.05, wspace = 0.2, hspace = 0.2)
Animation The last step of every Matplotlib animation is to call the FuncAnimation method.
animator = animation.FuncAnimation(fig, draw_barchart, frames = frames_list)
animator.save("Racing Bar Chart.mp4", fps = 20, bitrate = 1800)
And that's it. Feel free to play around with the code and adjust it to your needs.
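Note that saving to MP4 requires FFmpeg to be available to matplotlib. If it is not, a hedged alternative is to write a GIF with the built-in Pillow writer instead, for example:

from matplotlib.animation import PillowWriter

animator.save("Racing Bar Chart.gif", writer=PillowWriter(fps=20))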
If you have any suggestions or questions, feel free to leave a comment below!
Full Code on GitHub Link: https://gist.github.com/gabriel-berardi/2598032da5ea0453cf3385c8ce73bafc
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import matplotlib.animation as animation
import matplotlib.colors as mc
import colorsys
from random import randint
import re

df = pd.read_csv("EO95_LTB_07012019070138125.csv")
df = df.loc[df['Variable'] == 'GDP per capita in USA 2005 PPPs']
df = df[((df.Country != 'OECD - Total') & (df.Country != 'Non-OECD Economies') & (df.Country != 'World') & (df.Country != 'Euro area (15 countries)'))]
df = df[['Country', 'Time', 'Value']]
df = df.pivot(index = 'Country', columns = 'Time', values = 'Value')
df = df.reset_index()

for p in range(3):
    i = 0
    while i < len(df.columns):
        try:
            a = np.array(df.iloc[:, i + 1])
            b = np.array(df.iloc[:, i + 2])
            c = (a + b) / 2
            df.insert(i+2, str(df.iloc[:, i + 1].name) + '^' + str(len(df.columns)), c)
        except:
            print(f"\n Interpolation No. {p + 1} done...")
        i += 2
df = pd.melt(df, id_vars = 'Country', var_name = 'Time')

frames_list = df["Time"].unique().tolist()
for i in range(10):
    frames_list.append(df['Time'].iloc[-1])

def transform_color(color, amount = 0.5):
    try:
        c = mc.cnames[color]
    except:
        c = color
    c = colorsys.rgb_to_hls(*mc.to_rgb(c))
    return colorsys.hls_to_rgb(c[0], 1 - amount * (1 - c[1]), c[2])

all_names = df['Country'].unique().tolist()
random_hex_colors = []
for i in range(len(all_names)):
    random_hex_colors.append('#' + '%06X' % randint(0, 0xFFFFFF))

rgb_colors = [transform_color(i, 1) for i in random_hex_colors]
rgb_colors_opacity = [rgb_colors[x] + (0.825,) for x in range(len(rgb_colors))]
rgb_colors_dark = [transform_color(i, 1.12) for i in random_hex_colors]

fig, ax = plt.subplots(figsize = (36, 20))
num_of_elements = 8

def draw_barchart(Time):
    df_frame = df[df['Time'].eq(Time)].sort_values(by = 'value', ascending = True).tail(num_of_elements)
    ax.clear()
    normal_colors = dict(zip(df['Country'].unique(), rgb_colors_opacity))
    dark_colors = dict(zip(df['Country'].unique(), rgb_colors_dark))
    ax.barh(df_frame['Country'], df_frame['value'], color = [normal_colors[x] for x in df_frame['Country']], height = 0.8, edgecolor = ([dark_colors[x] for x in df_frame['Country']]), linewidth = '6')
    dx = float(df_frame['value'].max()) / 200
    for i, (value, name) in enumerate(zip(df_frame['value'], df_frame['Country'])):
        ax.text(value + dx, i + (num_of_elements / 50), '   ' + name, size = 36, weight = 'bold', ha = 'left', va = 'center', fontdict = {'fontname': 'Trebuchet MS'})
        ax.text(value + dx, i - (num_of_elements / 50), f' {value:,.0f}', size = 36, ha = 'left', va = 'center')
    time_unit_displayed = re.sub(r'\^(.*)', r'', str(Time))
    ax.text(1.0, 1.14, time_unit_displayed, transform = ax.transAxes, color = '#666666', size = 62, ha = 'right', weight = 'bold', fontdict = {'fontname': 'Trebuchet MS'})
    ax.text(-0.005, 1.06, 'GDP/capita', transform = ax.transAxes, size = 30, color = '#666666')
    ax.text(-0.005, 1.14, 'Projection of GDP/capita from 2010 to 2060', transform = ax.transAxes, size = 62, weight = 'bold', ha = 'left', fontdict = {'fontname': 'Trebuchet MS'})
    ax.xaxis.set_major_formatter(ticker.StrMethodFormatter('{x:,.0f}'))
    ax.xaxis.set_ticks_position('top')
    ax.tick_params(axis = 'x', colors = '#666666', labelsize = 28)
    ax.set_yticks([])
    ax.set_axisbelow(True)
    ax.margins(0, 0.01)
    ax.grid(which = 'major', axis = 'x', linestyle = '-')
    plt.locator_params(axis = 'x', nbins = 4)
    plt.box(False)
    plt.subplots_adjust(left = 0.075, right = 0.75, top = 0.825, bottom = 0.05, wspace = 0.2, hspace = 0.2)

animator = animation.FuncAnimation(fig, draw_barchart, frames = frames_list)
animator.save("Racing Bar Chart.mp4", fps = 20, bitrate = 1800)
Sources and Further Material https://www.kaggle.com/auwsom/gdp-projections-to-2060-oecd-countries-and-world https://matplotlib.org/3.1.1/api/animation_api.html
","permalink":"http://www.gabriel-berardi.com/blog/data/2020-05-01-racing-bar-chart/","title":"How to Create a Racing Bar Chart with Python"},{"content":"
k-Nearest Neighbors, or k-NN as I am going to call it from now on, is one of the easiest algorithms to solve classification tasks. It can be used for regression problems as well, but I am going to focus on the more common use case of classification in this post.
In a nutshell, k-NN will assign a new data point to the class that the majority of its k neighbours in the training set belong to. Let's use another coffee-related example to see how that works.
Imagine an assistant to a CEO who buys a coffee for his boss every morning on the way to work. Sometimes, the CEO really likes the coffee and will praise the assistant.
Other times, he dislikes the coffee and gives the assistant a hard time for the rest of the day. The assistant already knows that this has something to do with the temperature of the coffee and the number of meetings scheduled for that particular day.
The assistant decides to find a way to always predict whether his boss would like or dislike the cup of coffee, so that he could save the two bucks if the boss was going to dislike the coffee anyway. So, this assistant decides to measure the temperature of the coffee and note down the number of meetings as well as the reaction of the CEO. The assistant does that for 30 days and gathers the following data:
Looking at the data, we can easily spot that the assistant's boss likes his coffee hot and his days relaxed. But how can we predict whether he will or will not like a new coffee in the future?
Well, k-NN can do exactly that. Let's see how it works.
How Does the k-NN Algorithm Work? Here are the steps of the k-NN algorithm:
Save the location and class of all data points in the training set
Set k, which is the number of neighbours to consider when determining the class of a new data point
Assign the new data point to the class that the majority of the k nearest data points belong to
A commonly used distance metric for continuous variables is the Euclidean distance, while the Hamming distance may be used for discrete variables.
The question you should be asking yourself is how to decide what value to set for k. In general, we should consider the following three things when choosing a value for k:
A small value of k means that noise will have a higher influence on the result, which might lead to overfitting
A large value of k is computationally more expensive and might lead to underfitting
To avoid ties, k should be an odd number
Looking at the following image will help you to understand these points better. First, it is clear that we wouldn't know what class we should assign the green data point to if we chose an even number for k, such as 4 or 6. Second, we can see that one would obtain different results if k = 3 or k = 5.
Obviously, there is no one-size-fits-all answer for how to choose k. Some suggest choosing k as the square root of your n data points (adjusted to be an odd number). Another option is to use cross-validation to select the optimal k value, using a validation set.
k-NN in Action Let's continue our imaginary example and apply k-NN to our coffee data (a short code sketch follows after the limitations below).
Suppose we have a coffee with a temperature of 45 degrees Celsius, brought into the office on a day with 6 meetings, represented by the green point on our scatter plot:
If k-NN had been trained with k = 5, the algorithm would look at the class of the 5 closest data points. We can visualize this with a circle:
Since 3 out of 5 data points belong to the class "No", the k-NN algorithm would predict that the boss doesn't like this cup of coffee. It really is as simple as that!
k-NN is a so-called Lazy Learning method, which means that the majority of computation is deferred to the moment when we want to make a prediction. In other words, the k-NN algorithm simply saves the information of each data point in our training set and computes the distance to a new data point when used to make a new prediction.
Limitations and Problems of k-NN k-NN is as easy as it gets. As always, it has some limitations and problems:
If we deal with unbalanced data, k-NN might easily overlook the underrepresented class.
When we have a big dataset with many features, k-NN can be computationally expensive, because it needs to calculate the distance to each data point.
The k-NN algorithm can be quite sensitive to outliers.
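As announced above, here is a minimal sketch of the coffee example using scikit-learn's KNeighborsClassifier. All data points are made up for illustration, since the real 30-day data only exists in the plots above:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# made-up training data: [coffee temperature in °C, number of meetings]
X = [[70, 2], [68, 3], [40, 7], [45, 6], [72, 1], [38, 8], [65, 2], [42, 9], [69, 4], [36, 6]]
y = ['Yes', 'Yes', 'No', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No']

# k-NN relies on distances, so we scale the features first
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

knn = KNeighborsClassifier(n_neighbors=5)  # k = 5, as in the example above
knn.fit(X_scaled, y)

new_coffee = scaler.transform([[45, 6]])   # 45 °C and 6 meetings
print(knn.predict(new_coffee))             # prints ['No'] for this made-up data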
Limitations and Problems of k-NN k-NN is as easy as it gets. As always, it has some limitations and problems:\nIf we deal with unbalanced data, k-NN might easily overlook the underrepresented class. When we have a big dataset with many features, k-NN can be computationally expensive, because it needs to calculate the distance to each data point. The k-NN algorithm can be quite sensitive to outliers. Summary The k-NN algorithm is a very common and easy-to-understand algorithm for solving classification problems. It uses a distance metric to decide which class a new data point belongs to, by comparing it to its k nearest neighbours.\nk-NN works best for balanced data sets with relatively few features. Like many other algorithms, it also requires scaled data as training input.\nSources and Further Materials Ng, Annalyn & Soo, Kenneth - Numsense! Data Science for the Layman (2017) https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm https://www.youtube.com/watch?v=HVXime0nQeI ","permalink":"http://www.gabriel-berardi.com/blog/data/2020-10-01-k-nearest-neighbors/","summary":"\u003cp\u003e\u003cimg loading=\"lazy\" src=\"/blog/data/2020-10-01-k-nearest-neighbors/images/header.jpg\"\u003e\u003c/p\u003e\n\u003cp\u003ek-Nearest Neighbors, or k-NN as I am going to call it from now on, is one of the easiest algorithms to solve classification tasks. It can be used for regression\nproblems as well, but I am going to focus on the more common use case of classification in this post.\u003c/p\u003e\n\u003cp\u003eIn a nutshell, k-NN will assign a new data point to the class that the majority of its k neighbours in the training set belong to. Let's use another coffee-related\nexample to see how that works.\u003c/p\u003e","title":"k-Nearest Neighbors"},{"content":"\nHave you ever seen some cool applications of computer vision tools, like the one below?\nPerhaps your phone's camera can autofocus on faces, or maybe you have uploaded a photo on a social media platform and it automatically recognized the person in the image?\nThese are facial recognition applications and they all rely on Machine Learning. In this post, we are going to use a very easy-to-use package called OpenCV to build our own facial recognition program!\nWhat's OpenCV? OpenCV (aka "Open Source Computer Vision Library") is an open source computer vision and machine learning software library. The library contains more than 2,500 optimized algorithms for different computer vision tasks, such as recognizing faces and objects or removing red eyes from a photograph.\nOpenCV has C++, Python, Java and MATLAB interfaces and supports Windows, Linux, Android and Mac OS. You can find out more about OpenCV here.\nRecognizing Faces From an Image Let's say we have a photograph and want to write a program that can count the number of people in an image like this:\nThe first thing we need to do is import the needed modules. Note that I will import matplotlib.pyplot only to display images in Jupyter Notebook:\nimport cv2 as cv import matplotlib.pyplot as plt Now we need to open our image and convert it into gray-scale:\nimg = cv.imread("people.jpeg") gray_img = cv.cvtColor(img, cv.COLOR_BGR2GRAY) plt.imshow(gray_img, "gray") plt.axis('off') plt.show() The reason why we remove the colours of the image has to do with how computer vision algorithms work.
Here is a brief explanation for this:\nA colourless image can be represented as a matrix of values between 0 and 255 for each pixel in the image, where 0 represents black, 255 represents white and all values in between represent a certain shade of gray. A colourful image, however, has to be represented as a three-dimensional matrix, or three matrices stacked on top of each other, with each matrix containing values of 0 to 255 for the red, green and blue components of the colour of a pixel. Computer vision then uses these values as numerical inputs to algorithms to perform classification tasks.\nNow, some computer vision tasks might certainly perform much better when using colourful images. One might imagine that an algorithm can distinguish between an apple and a peach more easily when the images contain the colour information. Other tasks, such as facial recognition, do not require colours and can perform much more efficiently when using gray-scale images.\nIf you want to read more on this topic, check out this article.
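A quick way to convince yourself of this representation is to compare the array shapes of the colour image and its gray-scale version. This little sketch assumes the people.jpeg file from above:

import cv2 as cv

img = cv.imread("people.jpeg")
gray_img = cv.cvtColor(img, cv.COLOR_BGR2GRAY)

print(img.shape)       # (height, width, 3): one 0-255 value per colour channel
print(gray_img.shape)  # (height, width): a single 0-255 intensity per pixel
print(gray_img.min(), gray_img.max())  # both stay within the 0-255 range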
Moving on. Next, we need to find the pre-trained models from OpenCV. The easiest way to find the path to these files is to type "haarcascades" into your OS search. You should find 17 XML files, looking like this:\nYou can also find these on OpenCV's GitHub page.\nFrom their names, you can tell the use case of these different models. For this example, I used the haarcascade_frontalface_alt_tree model.\nNext, we pass this path as an argument to create a new instance of the cv.CascadeClassifier class.\nThen we use the .detectMultiScale method and pass three arguments: our gray-scale image, scaleFactor and minNeighbors. The scaleFactor parameter specifies how much the image size is reduced at each image scale. The minNeighbors argument specifies how many neighbors each candidate rectangle should have to retain it. In other words, the minNeighbors parameter will affect the quality of the detected faces. A high value results in fewer detections but with higher quality.\n.detectMultiScale will return the position of all detected faces in our image. We save these coordinates to a variable called faces.\nclassifier_path = "~/haarcascade_frontalface_alt_tree.xml" classifier = cv.CascadeClassifier(classifier_path) faces = classifier.detectMultiScale(gray_img, scaleFactor=1.05, minNeighbors=3) print(faces) Lastly, all we have to do is draw some rectangles around the detected faces on the image using the cv.rectangle method and output the image. For this, we use our original, colourful image and not the gray-scaled one, but that's really just for the sake of visualizing our results:\nc = img.copy() for face in faces: x, y, w, h = face cv.rectangle(c, (x, y), (x+w, y+h), (0, 255, 0), 10) plt.figure(figsize=(16,16)) img = cv.cvtColor(c, cv.COLOR_BGR2RGB) plt.annotate(f'Number of detected faces: {len(faces)}', xy=(0.99, 0.02), xycoords='axes fraction', fontsize=20, color='green', bbox=dict(facecolor='black', alpha=0.99), horizontalalignment='right', verticalalignment='bottom') plt.axis('off') plt.imshow(img) And there you have it. The pre-trained model from OpenCV has successfully detected and marked all faces on this photograph!\nI ran the same code above for some other images. Here are some of the results:\nThis looks pretty good. However, the model is far from perfect! In many cases it fails to recognize all the faces in an image.\nHave a look:\nThat's it for this post. In another post, I am going to demonstrate how OpenCV can be used for object recognition in videos. Stay tuned and thanks for reading!\nFull Code on GitHub Link: https://gist.github.com/gabriel-berardi/3e4aeeebbe0b27eb030e8c84738ace9a\nimport cv2 as cv import matplotlib.pyplot as plt img = cv.imread("people.jpeg") gray_img = cv.cvtColor(img, cv.COLOR_BGR2GRAY) plt.imshow(gray_img, "gray") plt.axis('off') plt.show() classifier_path = "~/haarcascade_frontalface_alt_tree.xml" classifier = cv.CascadeClassifier(classifier_path) faces = classifier.detectMultiScale(gray_img, scaleFactor=1.05, minNeighbors=3) print(faces) c = img.copy() for face in faces: x, y, w, h = face cv.rectangle(c, (x, y), (x+w, y+h), (0, 255, 0), 10) plt.figure(figsize=(16,16)) img = cv.cvtColor(c, cv.COLOR_BGR2RGB) plt.annotate(f'Number of detected faces: {len(faces)}', xy=(0.99, 0.02), xycoords='axes fraction', fontsize=20, color='green', bbox=dict(facecolor='black', alpha=0.99), horizontalalignment='right', verticalalignment='bottom') plt.axis('off') plt.imshow(img) Sources And Further Material https://docs.opencv.org/3.4/db/d28/tutorial_cascade_classifier.html https://www.pexels.com/photo/children-taking-groupie-3556662/ https://www.pexels.com/photo/group-of-people-smiling-3756513/ https://www.pexels.com/photo/man-in-suit-jacket-and-woman-in-dress-grayscale-photo-3859002/ https://www.pexels.com/photo/photo-of-man-sitting-on-wooden-chair-3617660/ https://www.pexels.com/photo/people-girl-design-happy-35188/ https://www.pexels.com/photo/people-men-women-crowd-34291/ https://www.pexels.com/photo/people-wearing-face-mask-for-protection-3957986/ https://www.pexels.com/photo/family-photo-1648358/ https://www.pexels.com/photo/multi-cultural-people-3184419/ ","permalink":"http://www.gabriel-berardi.com/blog/data/2020-08-01-simple-facial-recognition-with-opencv/","summary":"\u003cp\u003e\u003cimg loading=\"lazy\" src=\"/blog/data/2020-08-01-simple-facial-recognition-with-opencv/images/header.jpg\"\u003e\u003c/p\u003e\n\u003cp\u003eHave you ever seen some cool applications of computer vision tools, like the one below?\u003c/p\u003e\n\u003cp\u003ePerhaps your phone's camera can autofocus on faces, or maybe you have uploaded a photo on a social media platform and it automatically recognized the person in the\nimage?\u003c/p\u003e\n\u003cp\u003eThese are facial recognition applications and they all rely on Machine Learning. In this post, we are going to use a very easy-to-use package called OpenCV to build our\nown facial recognition program!\u003c/p\u003e","title":"Simple Facial Recognition with OpenCV"},{"content":"\nI recently moved to a new city - Munich! I live in a very calm area, but soon realized that the neighbourhood is not really the best when it comes to eating outside. So, I decided to try to analyse review data from the web to find out which area is most compelling for me and other foodies. 
I scraped online reviews, cleaned the data and then visualized it on a map, showing the average rating of restaurants in different areas in Munich.\nGetting the Data In order to get data about all restaurants in Munich, I decided to scrape a popular website for restaurant reviews with Python and BeautifulSoup.\nSince web scraping is still a kind of legal grey area (check out this article), I prefer not to show which website I specifically scraped, but I will still explain my code. If you scrape a website, always make sure to follow the rules set by the website host (which can be found in the robots.txt file), be courteous, and add a short break in between your requests, in order not to overburden their servers.\nAs always, we need to import a bunch of libraries:\nimport pandas as pd import requests import time import matplotlib.pyplot as plt import matplotlib.patches as mpatches import matplotlib.patheffects as pe import seaborn as sns import geopandas as gpd from bs4 import BeautifulSoup from urllib.parse import urljoin from random import randint from math import e Next, we specify the base URL, as well as the first URL extension of the website that we want to scrape. The main idea is that we are going to iterate over many pages containing information about the restaurants, where the base URL stays the same and only the URL extension changes, for example like this:\nBase URL: www.google.com/ URL extensions: page-1, page-2, page-3…\nbase_url = 'WEBSITE' first_page = 'EXTENSION.html' Now we will collect all the URL extensions for each individual restaurant in a list. We do this by looking for all URLs referring to the respective CSS class and appending them to the list. Then we look for the "Next Page" button at the end of the website and concatenate the new URL extension to our base URL. When the last page has been reached, our script won't find the "Next Page" button and will end the loop. We then have a list containing all URL extensions for 3,821 restaurants in Munich. This step of the scraping took around 30 minutes.\nnext_page = urljoin(base_url, first_page) page_exts = [] i = 0 loop = True while loop == True: i += 1 print(f'Now scraping page number {i}...') time.sleep(randint(10,15)) r = requests.get(next_page) soup = BeautifulSoup(r.text, "html.parser") for url in soup.find_all(class_="_15_ydu6b"): page_exts.append(url['href']) try: next_button = soup.find(class_='nav next rndBtn ui_button primary taLnk')['href'] next_page = urljoin(base_url, next_button) except TypeError: print('Last Page Reached...') loop = False As the final step of the web scraping process, we loop through this list of URL extensions, get the HTML code of that particular website and extract the following information about each restaurant:\nName Location/Address Rating from 1 to 5 Number of reviews Price range from $ to $$$$ Sometimes, the information we want might not be available in the form we expect. Therefore, we should use some try/except statements to keep our code from breaking. Since I added a random pause of 1, 2 or 3 seconds between all of my requests, this piece of code took almost 4 hours to finish! Therefore, I suggest using the pickle module to save intermediate results, for example after every 1,000th restaurant.
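Here is a minimal sketch of what such a checkpoint could look like. The helper name, file name and chunk size are my own choices rather than part of the original script, and the function assumes the five result lists defined just below:

import pickle

def save_checkpoint(i, filename='scrape_checkpoint.pkl'):
    # Dump the five result lists to disk after every 1,000th restaurant
    if i > 0 and i % 1000 == 0:
        with open(filename, 'wb') as f:
            pickle.dump((rest_name, rest_loc, rest_rating, rest_norat, rest_price), f)

Calling save_checkpoint(page_exts.index(page_ext)) once per iteration of the scraping loop would be enough to make a crash recoverable.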
rest_name = [] rest_loc = [] rest_rating = [] rest_norat = [] rest_price = [] for page_ext in page_exts: print(f'Now scraping restaurant number {page_exts.index(page_ext)}...') time.sleep(randint(1,3)) r = requests.get(urljoin(base_url, page_ext)) soup = BeautifulSoup(r.text, "html.parser") try: rest_name.append(soup.find(class_='_3a1XQ88S').text) except AttributeError: rest_name.append(None) try: rest_loc.append(soup.find(class_='_2saB_OSe').text) except AttributeError: rest_loc.append(None) try: rest_rating.append(soup.find(class_='r2Cf69qf').text) except AttributeError: rest_rating.append(None) try: rest_norat.append(soup.find(class_='_10Iv7dOs').text) except AttributeError: rest_norat.append(None) try: rest_price.append(soup.find(class_='_2mn01bsa').text) except AttributeError: rest_price.append(None) The last thing I did was to generate a data frame with the data and save it to a CSV file.\ndf = pd.DataFrame(data = list(zip(rest_name, rest_loc, rest_rating, rest_norat, rest_price)), columns = ['name', 'address', 'rating', 'number_reviews', 'price']) df.to_csv('restaurants.csv') This is what the first few rows of the scraped, uncleaned dataset looked like:\nCleaning the Data Next, we should clean our dataset. Looking at the scraped data, we notice the following issues:\nDue to advertising on the website, some restaurants have been scraped several times. We need to remove these duplicate entries. There are null value entries in the dataset. In the "price" column, some entries are faulty, showing different information. Also, we should change the "$" characters into numerical values to display the price level of each restaurant. Since our goal is to visualize the quality of restaurants in different areas of Munich, we should format the "Address" column and only keep the area code. Also, some area codes in the data set do not really belong to Munich. Lastly, the "number_reviews" and the "rating" columns are of the wrong data types and should be converted to integers and floats respectively. 
So, let's first load the dataset and drop the duplicate entries, keeping only the first instance.\ndf = pd.read_csv('munich restaurants.csv', index_col='Unnamed: 0') df = df.drop_duplicates(keep='first') Next, we check for null values and delete all rows containing null values.\nprint(df.info()) df = df.dropna() Then, we are going to delete all rows where the "Price Range" information is wrong, by only keeping entries that contain a "$" character.\ndf = df[df['price'].str.contains('$', regex=False)] Proceeding, we take care of the "number_reviews" and the "rating" columns.\ndf.loc[:, 'number_reviews'] = df['number_reviews'].apply(lambda x: re.sub('[^0-9]','', x)) df['number_reviews'] = df['number_reviews'].astype(int) df['rating'] = df['rating'].astype(float) We replace the price range "$" column with an integer representing the price level of each restaurant.\ndf['price'] = df['price'].replace('$', 1) df['price'] = df['price'].replace('$$ - $$$',2) df['price'] = df['price'].replace('$$$$', 3) And now we format the address column to only keep the area codes. We do this with a function and a regex using the re package.\nimport re def get_area_code(string): try: return(re.search('\\d{5}', string).group()) except AttributeError: return(None) df['area'] = df['address'].apply(lambda x: get_area_code(x)) df = df.drop('address', axis = 1) df = df.dropna() And finally, we delete all entries of restaurants that have an area code that does not really belong to Munich. Then, we save the cleaned dataset to another CSV file.\nnot_munich = ['85356', '85640', '85540', '85551', '85646', '85737','85757', '82194', '82041', '82194', '82041', '82067','82031', '82049'] df = df[~df['area'].isin(not_munich)] df.to_csv('munich restaurants cleaned.csv') And that's it for the data preparation. We now have a clean data set of 2,202 restaurants that looks like this:\nFinally, we can visualize this data on a map.\nVisualizing the Data The first thing we obviously need to do is load the cleaned data set. Next, we are going to group it by the area code and take the mean of the other variables. Because we are interested in the total number of reviews for each area code, we define this column separately. 
Then we make sure that the area code is a string data type and rearrange the columns.\ndf = pd.read_csv('munich restaurants cleaned.csv', index_col='Unnamed: 0') df_by_area = df.groupby(by='area').mean() df_by_area['number_reviews'] = df.groupby(by='area').sum()['number_reviews'] df_by_area = df_by_area.reset_index() df_by_area['area'] = df_by_area['area'].astype(str) df_by_area.columns = ['area', 'avg_rating', 'number_reviews', 'avg_price'] The data frame now looks like this:\nSince we want to produce a map, we are going to need some geospatial information about each area code. After quite some research, I found a way to add a polygon to each area code. I downloaded the "plz-gebiete.shp" shapefile, which contains geometrical polygons for most area codes in Germany, from https://www.suche-postleitzahl.org/downloads. I load this data into a geopandas data frame, drop an unnecessary column and then filter out the data for area codes starting with "8" and rename the columns.\narea_shape_df = gpd.read_file('plz-gebiete.shp', dtype={'plz': str}) area_shape_df = area_shape_df.drop('note', axis = 1) area_shape_df = area_shape_df[area_shape_df['plz'].astype(str).str.startswith('8')] area_shape_df.columns = ['area', 'geometry'] The area_shape_df looks like this:\nNext, we can simply join the two data frames on the "area" column and drop entries with missing values.\nfinal_df = pd.merge(left = area_shape_df, right = df_by_area, on = 'area') final_df = final_df.dropna() Next, we need to think about our visualization a bit more. I would like to have a map of Munich that colors different areas according to the quality of the restaurants in that area. The question is how we want to define "quality" in this context.\nYou might think that we could just take the average rating of all restaurants in every area. This is an option, but there are several things to consider. For starters, some areas have many more restaurants and therefore also many more total reviews than other areas. This makes it difficult to compare two areas solely based on their average rating.\nThis is actually a problem that we would always encounter when comparing restaurants, products, hotels and so on based on online ratings.\nSuppose you compare three movies A, B and C:\nA has 1,000 reviews with an average rating of 8 out of 10 points. B has 50 reviews with an average rating of 9 out of 10 points. C has 2 reviews with an average rating of 10 out of 10 points. Most likely, you have come up with some heuristics to help you decide what movie to watch in such a setting. If you are like me, you would probably go with movie A, since an average rating of 8 based on 1,000 reviews is more likely to indicate a good movie than movie B's average of 9 based on only 50 reviews. 
Movie C would be out of the question, because 2 reviews is simply too few and the rating could even be fake.\nThe importance we assign to the average rating and the number of reviews respectively will always be subjective, but we should at least try to come up with a way to incorporate both of these aspects into our measurement of "quality" for the restaurants in Munich.\nI did not find many different approaches to this problem, but the one that I personally liked best is from this post. The author Marc Bogaerts defines an algorithm that takes in both the average rating and the number of reviews and outputs a score. I have adapted the algorithm to our current setting and it looks like this:\nscore = 0.5 * p + 2.5 * (1 - e^(-q / Q))\nwhere p is the average rating, q is the number of reviews and Q is the median of the number of reviews. Note that in the original algorithm, Marc Bogaerts proposed to set Q equal to a number that we would consider "moderate". In my opinion, the median is a good approximation of such a "moderate" value.\nLet's take a look at the first half of the formula: 0.5 * p. This tells us that 50% of the final score is determined by the average rating. Now let's look at the second half: 2.5 * (1 - e^(-q/Q)). The expression e^(-q/Q) can take values between 0 and 1. Suppose the number of ratings is equal to 1 and the median number of reviews is 1000. e^(-1/1000) is roughly 0.999, which leads the term inside the brackets to be only slightly above 0, so the second half of the expression is almost 0 as well. Hence, such a low number of ratings punishes the score: we basically keep only half of the average rating. For more information on the formula, I suggest reading the linked post. When implemented in Python, this looks like this:\np = final_df['avg_rating'] q = final_df['number_reviews'] Q = final_df['number_reviews'].median() final_df['score'] = 0.5 * p + 2.5 * (1 - e**(-q / Q))
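As a quick sanity check, we can apply this formula to the three movies from above. The value Q = 50 is my own stand-in for a "moderate" number of reviews in that example:

from math import e

def score(p, q, Q):
    # Combine average rating p and review count q into a single score
    return 0.5 * p + 2.5 * (1 - e ** (-q / Q))

for name, p, q in [('A', 8, 1000), ('B', 9, 50), ('C', 10, 2)]:
    print(name, round(score(p, q, Q=50), 2))
# A 6.5, B 6.08, C 5.1 -> the ranking matches the intuition from above

Movie A wins despite its lower average rating, because its large number of reviews makes that rating trustworthy.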
The final data frame we will use as the basis for our visualization looks like this:\nNow let's move on to the actual visualization.\nWith the geometrical polygons, we are able to plot our data as a map, like this:\nplt.rcParams['figure.figsize'] = [48, 32] fig, ax = plt.subplots() final_df.plot(ax = ax, zorder = 0, column = 'score', categorical = False, cmap='RdYlGn') ax.set(facecolor = 'lightblue', aspect = 1.4, xticks = [], yticks = []) plt.show() In order to make the map clearer, I simply used this map as the background for the plot:\nWhile our data contains data for all area codes, this map shows the different area names of Munich. Note that one area name can contain several area codes and one area code can lie within several area names.\nThe rest of the code is simply aesthetics and annotations:\nplt.rcParams['figure.figsize'] = [48, 32] img = plt.imread('munich map.png') fig, ax = plt.subplots() final_df.plot(ax = ax, zorder = 0, column = 'score', categorical = False, cmap='RdYlGn') ax.imshow(img, zorder = 1, extent = [11.36, 11.725, 48.06, 48.250], alpha = 0.7) red_patch = mpatches.Patch(color='#bb8995', label = 'Bad') yellow_patch = mpatches.Patch(color='#d6d0b9', label = 'Okay') green_patch = mpatches.Patch(color='#89ab9a', label = 'Good') plt.legend(handles=[green_patch, yellow_patch, red_patch], facecolor = 'white', edgecolor='lightgrey', fancybox=True, framealpha=0.5, loc = 'right', bbox_to_anchor=(0.975, 0.925), ncol = 3, fontsize = 48) ax.text(0.0375, 0.925, 'Where to Eat in Munich ?', fontsize = 80, weight = 'bold', transform = ax.transAxes, bbox = dict(facecolor = 'white', edgecolor = 'lightgrey', alpha = 0.6, pad = 25)) plt.annotate('Map based on 290.522 online ratings of 2,202 restaurants in Munich.', (0, 0), (0, -20), fontsize = 38, weight = 'bold', xycoords = 'axes fraction', textcoords='offset points', va = 'top') plt.annotate('by: Gabriel Berardi', (0,0), (1960, -20), fontsize = 38, weight = 'bold', xycoords = 'axes fraction', textcoords = 'offset points', va = 'top') plt.annotate('\\nThe score of each area is calculated using the average rating and the total number of reviews:', (0, 0), (0, -70), fontsize = 38, xycoords = 'axes fraction', textcoords = 'offset points', va = 'top') plt.annotate('score = 0.5 * avg_rating + 2.5 * (1 - e^( - number_reviews / median(number_reviews))', (0,0), (0, -165), fontsize = 38, xycoords = 'axes fraction', textcoords = 'offset points', va = 'top') plt.annotate('(Formula by: Marc Bogaerts (math.stackexchange.com/users/118955))', (0, 0), (0, -220), fontsize = 32, xycoords = 'axes fraction', textcoords = 'offset points', va = 'top') ax.set(facecolor = 'lightblue', aspect = 1.4, xticks = [], yticks = []) plt.show() And that's the final visualization. As you can see, the polygons do not match up perfectly with the underlying map at the borders, but this is fine for our little project here.\nThe central areas of Munich tend to be the place to go for a dine-out, while the outskirts should be avoided. This result depends on the weights we assigned to the average rating and the number of reviews in these areas. As one would expect, there are far fewer restaurants in the outer areas of Munich.\nThat's it for this little web scraping and data visualization project. Let me know what you would have done differently or how my approach could be enhanced! 
Thanks!\nFull Code on Github Link: https://gist.github.com/gabriel-berardi/743fbcaf874badce9469e1ad41591bcb\n# Import required ibraries import pandas as pd import requests import time import re import matplotlib.pyplot as plt import matplotlib.patches as mpatches import matplotlib.patheffects as pe import seaborn as sns import geopandas as gpd from math import e from bs4 import BeautifulSoup from urllib.parse import urljoin from random import randint # Set the base url and the first page to scrape base_url = \u0026#39;URL\u0026#39; first_page = \u0026#39;Extension\u0026#39; # This code block retrieves the url extensions for all restaurants next_page = urljoin(base_url, first_page) page_exts = [] i = 0 loop = True while loop == True: i += 1 print(f\u0026#39;Now scraping page number {i}...\u0026#39;) time.sleep(randint(10,15)) r = requests.get(next_page) soup = BeautifulSoup(r.text, \u0026#34;html.parser\u0026#34;) for url in soup.find_all(class_ = \u0026#34;_15_ydu6b\u0026#34;): page_exts.append(url[\u0026#39;href\u0026#39;]) try: next_button = soup.find(class_ = \u0026#39;nav next rndBtn ui_button primary taLnk\u0026#39;)[\u0026#39;href\u0026#39;] next_page = urljoin(base_url, next_button) except TypeError: print(\u0026#39;Last Page Reached...\u0026#39;) loop = False # This code block extracts the name, location, rating, number of reviews and price range # from all the restaurants rest_name = [] rest_loc = [] rest_rating = [] rest_norat = [] rest_price = [] for page_ext in page_exts: print(f\u0026#39;Now scraping restaurant number {page_exts.index(page_ext)}...\u0026#39;) time.sleep(randint(1,3)) r = requests.get(urljoin(base_url, page_ext)) soup = BeautifulSoup(r.text, \u0026#34;html.parser\u0026#34;) try: rest_name.append(soup.find(class_ = \u0026#39;_3a1XQ88S\u0026#39;).text) except AttributeError: rest_name.append(None) try: rest_loc.append(soup.find(class_ = \u0026#39;_2saB_OSe\u0026#39;).text) except AttributeError: rest_loc.append(None) try: rest_rating.append(soup.find(class_ = \u0026#39;r2Cf69qf\u0026#39;).text) except AttributeError: rest_rating.append(None) try: rest_norat.append(soup.find(class_ = \u0026#39;_10Iv7dOs\u0026#39;).text) except AttributeError: rest_norat.append(None) try: rest_price.append(soup.find(class_ = \u0026#39;_2mn01bsa\u0026#39;).text) except AttributeError: rest_price.append(None) # Create the dataframe from the scraped data and save it to a csv file df = pd.DataFrame(data = list(zip(rest_name, rest_loc, rest_rating, rest_norat, rest_price)), columns = [\u0026#39;name\u0026#39;, \u0026#39;address\u0026#39;, \u0026#39;rating\u0026#39;, \u0026#39;number_reviews\u0026#39;, \u0026#39;price\u0026#39;]) df.to_csv(\u0026#39;restaurants.csv\u0026#39;) # Reading in the raw scraped data df = pd.read_csv(\u0026#39;munich restaurants.csv\u0026#39;, index_col = \u0026#39;Unnamed: 0\u0026#39;) # Delete all duplicate rows and only keep the first entry df = df.drop_duplicates(keep = \u0026#39;first\u0026#39;) # Checking for null values print(df.info()) # Drop all rows contain null values df = df.dropna() # Delete all rows where the \u0026#39;Price Range\u0026#39; information is wrong df = df[df[\u0026#39;price\u0026#39;].str.contains(\u0026#39;$\u0026#39;, regex = False)] # Format the \u0026#39;No. 
of Ratings\u0026#39; column df.loc[:, \u0026#39;number_reviews\u0026#39;] = df[\u0026#39;number_reviews\u0026#39;].apply(lambda x: re.sub(\u0026#39;[^0-9]\u0026#39;,\u0026#39;\u0026#39;, x)) df[\u0026#39;number_reviews\u0026#39;] = df[\u0026#39;number_reviews\u0026#39;].astype(int) # Format the \u0026#39;Rating\u0026#39; column df[\u0026#39;rating\u0026#39;] = df[\u0026#39;rating\u0026#39;].astype(float) # Format the \u0026#39;Price Range\u0026#39; column df[\u0026#39;price\u0026#39;] = df[\u0026#39;price\u0026#39;].replace(\u0026#39;$\u0026#39;, 1) df[\u0026#39;price\u0026#39;] = df[\u0026#39;price\u0026#39;].replace(\u0026#39;$$ - $$$\u0026#39;,2) df[\u0026#39;price\u0026#39;] = df[\u0026#39;price\u0026#39;].replace(\u0026#39;$$$$\u0026#39;, 3) # Format the \u0026#39;Address column\u0026#39; to only keep the area code def get_area_code(string): try: return(re.search(\u0026#39;\\d{5}\u0026#39;, string).group()) except AttributeError: return(None) df[\u0026#39;area\u0026#39;] = df[\u0026#39;address\u0026#39;].apply(lambda x: get_area_code(x)) df = df.drop(\u0026#39;address\u0026#39;, axis = 1) df = df.dropna() # Drop all areas that don\u0026#39;t belong to Munich not_munich = [\u0026#39;85356\u0026#39;, \u0026#39;85640\u0026#39;, \u0026#39;85540\u0026#39;, \u0026#39;85551\u0026#39;, \u0026#39;85646\u0026#39;, \u0026#39;85737\u0026#39;, \u0026#39;85757\u0026#39;, \u0026#39;82194\u0026#39;, \u0026#39;82041\u0026#39;, \u0026#39;82194\u0026#39;, \u0026#39;82041\u0026#39;, \u0026#39;82067\u0026#39;, \u0026#39;82031\u0026#39;, \u0026#39;82049\u0026#39;] df = df[~df[\u0026#39;area\u0026#39;].isin(not_munich)] # Saving cleaned dataset to a csv file df.to_csv(\u0026#39;munich restaurants cleaned.csv\u0026#39;) # Load the cleaned dataset df = pd.read_csv(\u0026#39;munich restaurants cleaned.csv\u0026#39;, index_col=\u0026#39;Unnamed: 0\u0026#39;) # Group the dataframe by area code df_by_area = df.groupby(by = \u0026#39;area\u0026#39;).mean() df_by_area[\u0026#39;number_reviews\u0026#39;] = df.groupby(by = \u0026#39;area\u0026#39;).sum()[\u0026#39;number_reviews\u0026#39;] df_by_area = df_by_area.reset_index() df_by_area[\u0026#39;area\u0026#39;] = df_by_area[\u0026#39;area\u0026#39;].astype(str) df_by_area.columns = [\u0026#39;area\u0026#39;, \u0026#39;avg_rating\u0026#39;, \u0026#39;number_reviews\u0026#39;, \u0026#39;avg_price\u0026#39;] # Create a dataframe with geometrical data for all area codes # Shapefile from https://www.suche-postleitzahl.org/downloads area_shape_df = gpd.read_file(\u0026#39;plz-gebiete.shp\u0026#39;, dtype = {\u0026#39;plz\u0026#39;: str}) area_shape_df = area_shape_df.drop(\u0026#39;note\u0026#39;, axis = 1) area_shape_df = area_shape_df[area_shape_df[\u0026#39;plz\u0026#39;].astype(str).str.startswith(\u0026#39;8\u0026#39;)] area_shape_df.columns = [\u0026#39;area\u0026#39;, \u0026#39;geometry\u0026#39;] # Merge the dataframes and drop missing values final_df = pd.merge(left = area_shape_df, right = df_by_area, on = \u0026#39;area\u0026#39;) final_df = final_df.dropna() # Apply a function to calculate the score of each area # https://math.stackexchange.com/questions/942738 p = final_df[\u0026#39;avg_rating\u0026#39;] q = final_df[\u0026#39;number_reviews\u0026#39;] Q = final_df[\u0026#39;number_reviews\u0026#39;].median() final_df[\u0026#39;score\u0026#39;] = 0.5 * p + 2.5 * (1 - e**(-q / Q)) # Create plot to show the map # Map from https://upload.wikimedia.org/wikipedia/commons/2/2d/Karte_der_Stadtbezirke_in_M%C3%BCnchen.png 
plt.rcParams[\u0026#39;figure.figsize\u0026#39;] = [48, 32] img = plt.imread(\u0026#39;munich map.png\u0026#39;) fig, ax = plt.subplots() final_df.plot(ax = ax, zorder = 0, column = \u0026#39;score\u0026#39;, categorical = False, cmap=\u0026#39;RdYlGn\u0026#39;) ax.imshow(img, zorder = 1, extent = [11.36, 11.725, 48.06, 48.250], alpha = 0.7) red_patch = mpatches.Patch(color=\u0026#39;#bb8995\u0026#39;, label = \u0026#39;Bad\u0026#39;) yellow_patch = mpatches.Patch(color=\u0026#39;#d6d0b9\u0026#39;, label = \u0026#39;Okay\u0026#39;) green_patch = mpatches.Patch(color=\u0026#39;#89ab9a\u0026#39;, label = \u0026#39;Good\u0026#39;) plt.legend(handles=[green_patch, yellow_patch, red_patch], facecolor = \u0026#39;white\u0026#39;, edgecolor=\u0026#39;lightgrey\u0026#39;, fancybox=True, framealpha=0.5, loc = \u0026#39;right\u0026#39;, bbox_to_anchor=(0.975, 0.925), ncol = 3, fontsize = 48) ax.text(0.0375, 0.925, \u0026#39;Where to Eat in Munich ?\u0026#39;, fontsize = 80, weight = \u0026#39;bold\u0026#39;, transform = ax.transAxes, bbox = dict(facecolor = \u0026#39;white\u0026#39;, edgecolor = \u0026#39;lightgrey\u0026#39;, alpha = 0.6, pad = 25)) plt.annotate(\u0026#39;Map based on 290.522 online ratings of 2,202 restaurants in Munich.\u0026#39;, (0, 0), (0, -20), fontsize = 38, weight = \u0026#39;bold\u0026#39;, xycoords = \u0026#39;axes fraction\u0026#39;, textcoords=\u0026#39;offset points\u0026#39;, va = \u0026#39;top\u0026#39;) plt.annotate(\u0026#39;by: Gabriel Berardi\u0026#39;, (0,0), (1960, -20), fontsize = 38, weight = \u0026#39;bold\u0026#39;, xycoords = \u0026#39;axes fraction\u0026#39;, textcoords = \u0026#39;offset points\u0026#39;, va = \u0026#39;top\u0026#39;) plt.annotate(\u0026#39;\\nThe score of each area is calculated using the average rating and the total number of reviews:\u0026#39;, (0, 0), (0, -70), fontsize = 38, xycoords = \u0026#39;axes fraction\u0026#39;, textcoords = \u0026#39;offset points\u0026#39;, va = \u0026#39;top\u0026#39;) plt.annotate(\u0026#39;score = 0.5 * avg_rating + 2.5 * (1 - e^( - number_reviews / median(number_reviews))\u0026#39;, (0,0), (0, -165), fontsize = 38, xycoords = \u0026#39;axes fraction\u0026#39;, textcoords = \u0026#39;offset points\u0026#39;, va = \u0026#39;top\u0026#39;) plt.annotate(\u0026#39;(Formula by: Marc Bogaerts (math.stackexchange.com/users/118955))\u0026#39;, (0, 0), (0, -220), fontsize = 32, xycoords = \u0026#39;axes fraction\u0026#39;, textcoords = \u0026#39;offset points\u0026#39;, va = \u0026#39;top\u0026#39;) ax.set(facecolor = \u0026#39;lightblue\u0026#39;, aspect = 1.4, xticks = [], yticks = []) plt.show() Sources and Further Material https://math.stackexchange.com/questions/942738/ https://upload.wikimedia.org/wikipedia/commons/2/2d/Karte_der_Stadtbezirke_in_M%C3%BCnchen.png ","permalink":"http://www.gabriel-berardi.com/blog/data/2020-11-01-where-to-eat-in-munich/","summary":"\u003cp\u003e\u003cimg loading=\"lazy\" src=\"/blog/data/2020-11-01-where-to-eat-in-munich/images/header.jpg\"\u003e\u003c/p\u003e\n\u003cp\u003eI recently moved to a new city - Munich! I live in a very calm area, but soon realized that the neighbourhood is not really the best when it comes to eating\noutside. So, I decided to try to analyse review data from the web to find out which area is most compelling for me and other foodies. 
I scraped online reviews,\ncleaned the data and then visualized it on a map, showing the average rating of restaurants in different areas in Munich.\u003c/p\u003e","title":"Where to eat in Munich?"},{"content":"\nWeb Scraping is the automated process of extracting data from websites. This is commonly done by retrieving the HTML code of a website through a request and then extracting the information hidden in the HTML code programmatically. This is especially convenient when there is no API available to you!\nThere has been a lot of discussion going on about the legality and ethics of Web Scraping, which I do not want to get into in this article. You can check out this Wikipedia article and this blog post, if you want to know more about that.\nSome websites simply do not want you to scrape them. A good practice is to consult the "robots.txt" file before you scrape a website. This is basically a guideline by the website, which states what automated actions they allow or disallow. You can see this file by adding "/robots.txt" to any website (e.g. https://www.ikea.com/robots.txt).\nNo matter what, if you are going to scrape a website, you should always be polite! That means you should space out your requests, in order not to overload the servers of the website you want to scrape. If you are too aggressive, your IP address might get blocked.\nLet's Scrape a Book Shop Next, let me show you a very simple example of how you might extract data from a book shop. For this, I want to take you through my code for scraping a fictional online book shop, which you can find at http://books.toscrape.com/\nThe website looks like this:\nThere are 1,000 books on this website, and we want to get data for all of them!\nImporting the Required Packages As always, we need some packages to perform the task at hand. We will use pandas to assemble the extracted data in a data frame, which we will output as a CSV or Excel file at the end. We use the requests module to get the HTML code from the website. The time package is needed to add a short break in between each request. BeautifulSoup is needed to perform the actual scraping. Finally, we need the urljoin function from the urllib.parse package in order to correctly scrape the URLs of the images.\nimport pandas as pd import requests import time from bs4 import BeautifulSoup from urllib.parse import urljoin Defining the Classes and Methods First, we define a class called CrawledBooks, which is used to model a book. We are interested in the book's title, price, rating, image and whether it is currently in stock or not. Therefore, we assign these properties to the CrawledBooks class.\nNext, we define the crawler class called BookCrawler, which has a fetch method that will perform the web scraping. The fetch method works as follows:\nWe first initialize the url variable, get the HTML code using the requests package and transform it into a BeautifulSoup object We also initialize an empty list for the books Then, we have the main loop. doc.select('.next') will be truthy as long as there is a "Next" button on the webpage that is currently scraped. 
As soon as we have reached the last page, the while loop will stop. We add 1 second between each request and print the URL that is currently scraped, just in order to see what is going on. Then, we need to redo our request each time we go through the while loop, because the URL will be different according to the current page we are on. Next, we enter into a for loop, which extracts all the data we want from the HTML code and assigns it to the variables title, price, rating, image and available. These variables are then used to create an instance of the CrawledBooks class called crawled_books. This last variable crawled_books is then appended to the books list. The following try/except statement is used to update the URL by concatenating the base of the URL (http://books.toscrape.com/) with the respective ending of the page that is to be scraped next (for example /catalogue/page-2.html). We end the fetch method by returning the books list. class CrawledBooks(): def __init__(self, title, price, rating, image, available): self.title = title self.price = price self.rating = rating self.image = image self.available = available class BookCrawler(): def fetch(self): url = 'http://books.toscrape.com/' r = requests.get(url) doc = BeautifulSoup(r.text, "html.parser") books = [] # The following while-loop is executed until the last page has been reached while doc.select('.next'): # We set a break of 1 second in between each request and print the URL that is currently scraped time.sleep(1) print(url) r = requests.get(url) doc = BeautifulSoup(r.text, "html.parser") for element in doc.select('.product_pod'): title = element.select_one('h3').text price = element.select_one('.price_color').text[2:] rating = element.select_one('p').attrs['class'][1] image = urljoin(url, element.select_one('.thumbnail').attrs['src']) available = element.select_one('.instock').text[15:23] crawled_books = CrawledBooks(title, price, rating, image, available) books.append(crawled_books) try: url = urljoin(url, doc.select_one('.next a').attrs['href']) except: print('\\n Crawling complete!') break return(books) Do the Actual Scraping The next two lines of code actually create an instance of the BookCrawler class and use its fetch method to do the actual scraping and save the result to the newly created scraped_books variable.\ncrawler = BookCrawler() scraped_books = crawler.fetch() In Jupyter Notebook, we can see the current URL that is being scraped:\nSave the Results We end our web scraping project by using list comprehension to save the results in distinct lists that we then use to create a Pandas dataframe.\nLastly, we save the result as a CSV or Excel file in the working directory.\n# Next, we save the data to variables as a list, using list comprehension all_titles = [i.title for i in scraped_books] all_prices = [i.price for i in scraped_books] all_ratings = [i.rating for i in scraped_books] all_images = [i.image for i in scraped_books] all_available = [i.available for i in scraped_books] # At last, we can assemble the gathered data in a pandas data frame and save the result # to a CSV or Excel file df = pd.DataFrame({'Title': all_titles, 'Price (£)': all_prices, 'Rating': all_ratings, 'Image': all_images, 'In Stock ?' : all_available}) df.to_csv('Scraped Books.csv') df.to_excel('Scraped Books.xlsx')
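One thing to be aware of: on books.toscrape.com the rating is encoded as a CSS class, so element.select_one('p').attrs['class'][1] returns a word like 'Three' rather than a number. If you would rather have a numeric column, a small mapping works as a post-processing step (my addition, not part of the original script):

# Turn the word-based ratings into integers
rating_map = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}
df['Rating'] = df['Rating'].map(rating_map)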
We can either look at our data frame in Jupyter Notebook…\n…or in Excel:\nAfter you have successfully extracted data from a website, you might then proceed to analyze this data and derive some meaningful insights from it. But that's for homework!\nFull Code on Github Link: https://gist.github.com/gabriel-berardi/3fc044964ed5806a78fc0a1d413afdb6\n# Importing needed packages import pandas as pd import requests import time from bs4 import BeautifulSoup from urllib.parse import urljoin # The following class enables us to access different elements of our crawled books class CrawledBooks(): def __init__(self, title, price, rating, image, available): self.title = title self.price = price self.rating = rating self.image = image self.available = available # The following class defines the crawler itself class BookCrawler(): def fetch(self): url = 'http://books.toscrape.com/' r = requests.get(url) doc = BeautifulSoup(r.text, "html.parser") books = [] # The following while-loop is executed until the last page has been reached while doc.select('.next'): # We set a break of 1 second in between each request and print the URL that is currently scraped time.sleep(1) print(url) r = requests.get(url) doc = BeautifulSoup(r.text, "html.parser") for element in doc.select('.product_pod'): title = element.select_one('h3').text price = element.select_one('.price_color').text[2:] rating = element.select_one('p').attrs['class'][1] image = urljoin(url, element.select_one('.thumbnail').attrs['src']) available = element.select_one('.instock').text[15:23] crawled_books = CrawledBooks(title, price, rating, image, available) books.append(crawled_books) # Update the URL to the next page; on the last page this raises and ends the loop try: url = urljoin(url, doc.select_one('.next a').attrs['href']) except: print('\\n Crawling complete!') break return books crawler = BookCrawler() scraped_books = crawler.fetch() # Next, we save the data to variables as a list, using list comprehension all_titles = [i.title for i in scraped_books] all_prices = [i.price for i in scraped_books] all_ratings = [i.rating for i in scraped_books] all_images = [i.image for i in scraped_books] all_available = [i.available for i in scraped_books] # At last, we can assemble the gathered data in a pandas data frame and save the result to a CSV or Excel file df = pd.DataFrame( {'Title': all_titles, 'Price (£)': all_prices, 'Rating': all_ratings, 'Image': all_images, 'In Stock ?' : all_available }) df.to_csv('Scraped Books.csv') df.to_excel('Scraped Books.xlsx') Sources and Further Material https://www.crummy.com/software/BeautifulSoup/bs4/doc/ https://2.python-requests.org/en/master/ ","permalink":"http://www.gabriel-berardi.com/blog/data/2020-03-01-scrape-bookshop-with-beautifulsoup/","summary":"\u003cp\u003e\u003cimg loading=\"lazy\" src=\"/blog/data/2020-03-01-scrape-bookshop-with-beautifulsoup/images/title.png\"\u003e\u003c/p\u003e\n\u003cp\u003eWeb Scraping is the automated process of extracting data from websites. 
This is commonly done by retrieving the HTML code of a website through a request and then extracting the information hidden in the HTML code programmatically. This is especially convenient when there is no API available to you!\u003c/p\u003e\n\u003cp\u003eThere has been a lot of discussion going on about the legality and ethics of Web Scraping, which I do not want to get into in this article. You can check out this Wikipedia article and this blog post, if you want to know more about that.\u003c/p\u003e","title":"Scrape a Book Shop with BeautifulSoup"},{"content":"\nThe k-means algorithm is used to divide unlabeled data into categories or classes, in order to draw useful conclusions from the resulting clusters.\nLet's take a look at an imaginary dataset of n = 18 observations of different coffee brands. Note that we would never actually use the k-means algorithm on such a small data set.\nWe plot the price of the coffee vs. the rating obtained by customers:\nAs coffee drinkers, we might be interested in finding certain clusters in this data, so that we might purchase the best coffee we can afford with a given budget.\nObviously, with only 2 features (price and customer rating), it is very easy for us to spot different clusters with a simple scatter plot. But imagine if we had 10 features and 10,000 observations. Clearly, finding reasonable clusters would be a very difficult task. Luckily, there is the k-means algorithm!\nHow Does the k-Means Algorithm Work? The k-means algorithm essentially follows these steps:\nDetermine k, which is the number of clusters you want to find in your data Randomly set k points within our dataset as the means (or centroids) of the k clusters Assign each data point to the closest cluster mean, usually by using the Euclidean distance Recalculate the cluster means with the newly assigned data points Repeat steps 3 and 4 until there is no change in the clusters anymore Note that, if your input features have different orders of magnitude, you should always scale them before feeding them into the k-means algorithm.\nAnd here is a simplified visualization of this process:\nLooking at these 5 steps, you might wonder what value we should set the parameter k to in the very beginning. Good question!\nSometimes, the answer to this question lies within the initial question you want to answer by using k-means. For example, imagine you want to cluster a dataset containing image information of banknotes into real and forged ones. Here, you are looking for exactly two groups, so you would set k = 2.\nIn many cases, you do not know what the best value for k would be beforehand. In this case, you can use the variability within the clusters for different values of k in order to determine the best value for it. This can be done using the Sum of Squared Distances (SSD) between each data point and its cluster centroid, which can be shown for different k's in a scree plot, also known as an elbow plot. We will look at an example for this later.\nk-Means in Action Let's see how the k-means algorithm clusters our coffee price vs. customer rating dataset. I used Python and sklearn to perform this task for k = 2, k = 3 and k = 4:
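Under the hood, this takes only a few lines of scikit-learn. Here is a minimal sketch of the same workflow, with randomly generated stand-in data (the post's 18-point coffee dataset is not reproduced here):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Fake [price, customer rating] observations around three rough groups
X = np.vstack([
    rng.normal([3.0, 2.0], 0.4, (6, 2)),   # cheap and poorly rated
    rng.normal([5.0, 4.0], 0.4, (6, 2)),   # mid-priced and well rated
    rng.normal([9.0, 4.5], 0.4, (6, 2)),   # expensive and well rated
])
X_scaled = StandardScaler().fit_transform(X)  # scale before clustering

for k in (2, 3, 4):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    print(k, round(km.inertia_, 2))  # inertia_ is the SSD used for the scree plot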
You can see the different ways the algorithm finds clusters for the different values of k. For example, in the k = 4 plot we might describe the clusters like this:\ncheap and disgusting coffee (blue) average coffee (yellow) cheap and tasty coffee (green) expensive and tasty coffee (red) Let's add the centroids to the k = 4 scatter plot to better see the results:\nNext, I plotted a scree plot for different values of k:\nImagine the scree plot being an arm. The elbow method then suggests selecting k as the corresponding value at the elbow of the arm. If we increase k beyond the elbow, the SSD only decreases slightly. However, if we look at our coffee example, we might find reasons to use k values of 2, 3 or even 4. There is not always a right or wrong answer to the question of the best value for k!\nLimitations and Problems of k-Means The k-means algorithm is very easy to understand and can be applied in many situations. However, it has certain limitations and problems. Below are just a few; check out this post for more details on these drawbacks.\nEach data point can only belong to one cluster. This raises the question of how the algorithm should deal with data points that lie exactly between two cluster means. The first initialization of the centroids is random. Running k-means on the exact same dataset can therefore result in different clusters. k-means will find (meaningless) clusters in a uniform dataset. In k-means, one cluster can never contain another cluster. In order to understand this problem, let's take a look at the following image and think about what might be a better way to cluster the data points:\nSummary As we have seen, k-means is a very straightforward algorithm for finding clusters within unlabelled data. The essence of k-means is to find k clusters and their respective centroids and assign each data point to the cluster with the shortest distance between the data point and the cluster's centroid.\nHowever, k-means has its limitations. Therefore, one might suggest using k-means during the exploratory data analysis to get a basic understanding of the structure of the data and then proceed with more sophisticated algorithms.\nSources and Further Materials Ng, Annalyn & Soo, Kenneth - Numsense! Data Science for the Layman (2017) https://en.wikipedia.org/wiki/K-means_clustering https://stats.stackexchange.com/questions/133656/how-to-understand-the-drawbacks-of-k-means https://www.youtube.com/watch?v=4b5d3muPQmA ","permalink":"http://www.gabriel-berardi.com/blog/data/2020-01-01-k-means-clustering/","summary":"\u003cp\u003e\u003cimg loading=\"lazy\" src=\"/blog/data/2020-01-01-k-means-clustering/images/header.jpg\"\u003e\u003c/p\u003e\n\u003cp\u003eThe k-means algorithm is used to divide unlabeled data into categories or classes, in order to draw useful conclusions from the resulting clusters.\u003c/p\u003e\n\u003cp\u003eLet's take a look at an imaginary dataset of n = 18 observations of different coffee brands. Note that we would never actually use the k-means algorithm on such a small data set.\u003c/p\u003e\n\u003cp\u003eWe plot the price of the coffee vs. the rating obtained by customers:\u003c/p\u003e","title":"k-Means Clustering"},{"content":"I was born and raised in Switzerland 🇨🇭, but life has brought me to Germany 🇩🇪, where I am living and working in the beautiful City of Munich. 
For the past 8 years, I have been working in different roles in Insurance - from core insurance functions in sales, to business intelligence and data analytics.\nAfter completing my Bachelor's degree in Insurance and Finance, I had developed a strong interest in all things data and tech. I had already taught myself how to code with Python, but I definitely wanted to learn more about data analytics, which is why I decided to pursue a Master's degree in Data Analytics at the University of Glasgow.\nCurrently, I am working as a Data Analyst in a German insurance company.\n💬 Want to connect? Find me on LinkedIn!\n📖 Education 2020 - 2023 MSc, Data Analytics; University of Glasgow Grade: 20/22 (with distinction) Master Thesis: Performance of Classification Models for Coronary Artery Disease with Limited Training Data and Different Synthetic Data Augmentation Techniques 2017 - 2020 BSc, Insurance and Finance; Wiesbaden Business School Grade: 20/22 Bachelor Thesis: Application Areas of Machine Learning in Insurance Pricing 💼 Work Experience 06/2024 - today Data Analyst and BI Expert; Konzern Versicherungskammer Responsible for data pipelines and data products such as BI dashboards, reports and ML models Collaboration in projects and campaigns such as sales campaigns and portfolio management measures Close coordination with business stakeholders, in particular insurance managers, underwriters and actuaries 01/2024 - 06/2024 Implementation Manager; EMIL Group GmbH Implementation of a core insurance system as a SaaS solution for MGAs Collaboration with engineering, product, and sales teams for product enhancement Assuming the role and responsibilities of interim Lead Product Manager 01/2022 - 08/2024 Data Analytics Project Manager; Allianz SE Served as a Project Manager to establish a new CV tool for automated risk extraction in property insurance Acted as Product Owner in a program to build a new ML platform Collaborated with a cross-functional team and regularly communicated with stakeholders to drive project direction and deliver a high-quality ML platform 01/2021 - 08/2023 Data Consultant; Self-employed Design and prototyping of a PostgreSQL database for medical data Data analysis in the area of customer segmentation and survey evaluation Conducted several trainings on the topic of web scraping 09/2020 - 12/2020 Financial Data Science Intern; KPMG Development of an interactive dashboard on the impact of Covid-19 on different industries Prototypes for calculating customer lifetime values in different contexts Assisted with NLP-based competition analysis 09/2020 - 12/2020 Bachelor Trainee; Allianz Germany Review of inquiries and preparation of insurance quotes for broker partners Project tasks in the areas of data analysis, controlling and seminar preparation Worked in sales, underwriting, claims and business intelligence 💻 Technical Skills Python 🔵🔵🔵🔵⚪️ SQL 🔵🔵🔵🔵⚪️ Power BI 🔵🔵🔵🔵⚪️ Excel 🔵🔵🔵🔵⚪️ VBA 🔵🔵🔵⚪️⚪️ R 🔵🔵🔵⚪️⚪️ SAS 🔵🔵⚪️⚪️⚪️ Azure 🔵🔵⚪️⚪️⚪️ 📃 Certifications 03/2026 Microsoft Certified Azure Fundamentals 06/2023 Certified Professional for Requirements Engineering 04/2023 P3.Express Practitioner 03/2023 Professional Scrum Master I 10/2022 Certified PRINCE2 Practitioner 10/2022 Certified PRINCE2 Foundation 💬 Language Skills 🇩🇪 German: native 🇨🇭 Swiss German: native 🇺🇸 English: fluent 🇨🇳 Chinese: intermediate\n","permalink":"http://www.gabriel-berardi.com/about-me/","summary":"\u003cp\u003eI was born and raised in Switzerland 🇨🇭, but life has brought me to Germany 🇩🇪, where I am living and working in the 
beautiful city of Munich. For the past 8 years, I have been working in different roles in insurance - from core insurance functions in sales to business intelligence and data analytics.\u003c/p\u003e\n\u003cp\u003eAfter completing my Bachelor\u0026rsquo;s degree in Insurance and Finance, I developed a strong interest in all things data and tech. I had already taught myself how to code in Python, but I definitely wanted to learn more about data analytics, which is why I decided to pursue a Master\u0026rsquo;s degree in Data Analytics from the University of Glasgow.\u003c/p\u003e","title":"About Me"},{"content":"Responsible for content according to § 5 TMG (Telemediengesetz):\nGabriel Berardi, Balanstr. 92, Munich\nContact:\nEmail: gabriel-berardi-anfrage (at) mailbox (dot) org (Please replace (at) and (dot) with @ and . when sending an email. This measure helps prevent spam.)\nEditorial Responsibility:\nGabriel Berardi, Balanstr. 92, Munich\nDisclaimer:\nThis newsletter and website are intended for informational purposes only and do not constitute legal, financial, or insurance advice.\nEU Dispute Resolution / Verbraucherstreitbeilegung:\nThe European Commission provides a platform for online dispute resolution (ODR): https://ec.europa.eu/consumers/odr We are neither obliged nor willing to participate in dispute settlement proceedings before a consumer arbitration board.\n","permalink":"http://www.gabriel-berardi.com/imprint/","summary":"\u003cp\u003eResponsible for content according to § 5 TMG (Telemediengesetz):\u003c/p\u003e\n\u003cp\u003eGabriel Berardi, Balanstr. 92, Munich\u003c/p\u003e\n\u003cp\u003eContact:\u003c/p\u003e\n\u003cp\u003eEmail: gabriel-berardi-anfrage (at) mailbox (dot) org\n(Please replace (at) and (dot) with @ and . when sending an email. This measure helps prevent spam.)\u003c/p\u003e\n\u003cp\u003eEditorial Responsibility:\u003c/p\u003e\n\u003cp\u003eGabriel Berardi, Balanstr. 92, Munich\u003c/p\u003e\n\u003cp\u003eDisclaimer:\u003c/p\u003e\n\u003cp\u003eThis newsletter and website are intended for informational purposes only and do not constitute legal, financial, or insurance advice.\u003c/p\u003e\n\u003cp\u003eEU Dispute Resolution / Verbraucherstreitbeilegung:\u003c/p\u003e","title":"Imprint/Impressum"},{"content":"📝 Articles Sometimes I like to write stuff, such as:\narticles on Data Topics\narticles on Technology\narticles on Bitcoin\n🔓 Recommended FOSS I am a huge fan and supporter of FOSS (Free and Open Source Software). Having a global community of highly talented people working together to develop great software tools is simply a wonderful idea.\nHere\u0026rsquo;s a list of open source software projects that I have used in the past or that I am still using and supporting:\nBitcoin Linux Mint LibreOffice Thunderbird VSCodium Cryptomator NewPipe Aegis Bitwarden F-Droid LiberaPay Aurora Store FreeFileSync LibreWolf PiHole Peergos Cryptpad\n🎙️ Podcasts I love listening to podcasts, and I used to host a podcast on the topic of personal finances many years ago.\nIn May 2024, I was also a guest on the \u0026ldquo;Data Analytics Chat\u0026rdquo; Podcast by Ben Parker. 
You can find the episode on Spotify, Apple Podcasts and other platforms.\n","permalink":"http://www.gabriel-berardi.com/ressources/","summary":"\u003ch2 id=\"-articles\"\u003e📝 Articles\u003c/h2\u003e\n\u003cp\u003eSometimes I like to write stuff, such as:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003earticles on \u003ca href=\"/categories/data/\" \u003eData Topics\u003c/a\u003e\u003c/li\u003e\n\u003cli\u003earticles on \u003ca href=\"/categories/tech/\" \u003eTechnology\u003c/a\u003e\u003c/li\u003e\n\u003cli\u003earticles on \u003ca href=\"/categories/bitcoin/\" \u003eBitcoin\u003c/a\u003e\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch2 id=\"-recommended-foss\"\u003e🔓 Recommended FOSS\u003c/h2\u003e\n\u003cp\u003eI am a huge fan and supporter of FOSS (Free and Open Source Software). Having a global community of highly talented people working together to develop great software tools is simply a wonderful idea.\u003c/p\u003e\n\u003cp\u003eHere\u0026rsquo;s a list of open source software projects that I have used in the past or that I am still using and supporting:\u003c/p\u003e","title":"Ressources"}]