Scraping Data from Multiple Sites for Powerful Analysis

 

Recently we shared a post on the World’s Top Rugby Players which included the visualisation below. As a team, we continually work together to improve our visualisations and the insights they provide viewers, but the real power behind this particular visualisation was the way Junior Data Scientist Sai Diwakar Bhrugubanda used his programming skills to scrape data from multiple sources and produce a collated, dynamic visualisation from them.

How it was done

Firstly, a main source of data was identified. This source contained the key points of information we needed to extract for the visualisation, e.g. professional rugby players’ names, heights, weights and ages.

To be able to extract the information from each player’s profile, we created a specific program: ‘Web Scraping Process’.

That code looks like this:

[Screenshot: web scraping code for player profile information]

The code is broken down into four blocks, each with the following purpose (a rough sketch of equivalent code follows the list):

1. & 2. The first two blocks ([5] and [6]) of code utilise the specific web URL and find all Sports Profiles links. Essentially, these two blocks identify the most important data within the website (rugby player profiles that we are interested in).

3. With these links recorded, the third block of code ([7]) extracts the specific link for each player identified in steps 1 & 2 above.

4. The last block of code ([8]) works through each individual player’s link, pulling out the relevant information including their height, birth date and position on the field.
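
The original screenshot isn’t reproduced here, so the following is only a rough sketch of what those four blocks might look like, assuming a Python scraper built with requests and BeautifulSoup. The URL, CSS selectors and field names are placeholders rather than the ones actually used in Sai’s program.

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.example-rugby-site.com"  # placeholder source site

# Blocks 1 & 2 ([5] and [6]): fetch the listing page and find all profile links
listing = requests.get(f"{BASE_URL}/players", timeout=30)
soup = BeautifulSoup(listing.text, "html.parser")
profile_anchors = soup.select("a.player-profile")  # assumed selector

# Block 3 ([7]): extract the specific link for each player
profile_links = [f"{BASE_URL}{a['href']}" for a in profile_anchors]

# Block 4 ([8]): visit each player's page and pull out the relevant fields
players = []
for link in profile_links:
    page = BeautifulSoup(requests.get(link, timeout=30).text, "html.parser")
    players.append({
        "name": page.select_one("h1.player-name").get_text(strip=True),
        "height": page.select_one(".height").get_text(strip=True),
        "birth_date": page.select_one(".birth-date").get_text(strip=True),
        "position": page.select_one(".position").get_text(strip=True),
    })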

But the job wasn’t finished there

The aim of this particular visualisation was to identify the birthplaces of the top rugby players. We had their positions, heights, ages and Rugby Pass Index - a measure of their overall impact and influence on games, and the key indicator of their success in this particular ranking system - but we still needed to identify their place of birth. Searching for each player’s birthplace individually would have been far too time consuming, so, using the information already collected, an Excel file was created containing each player’s name followed by the word “birthplace”. The Python code below then searched for each of those queries and extracted the Wikipedia link with the player’s details.
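As a rough illustration of that step, the lookup file could be built with pandas from the data already scraped; the column and file names below are assumptions rather than the real ones.

import pandas as pd

# 'players' is the list of dictionaries produced by the profile scrape above.
lookup = pd.DataFrame(players)
lookup["search_query"] = lookup["name"] + " birthplace"
lookup.to_excel("player_birthplace_queries.xlsx", index=False)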

Once each link was found, the Python code drilled down into the link to extract the birthplace of each individual player. That looked like this:

[Screenshot: Wikipedia data scrape code]
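
The original code isn’t shown above either, so this is only a sketch of the same idea: look up each player’s Wikipedia page and read the “Place of birth” row of its infobox. It uses Wikipedia’s public opensearch API in place of the general web search described in the post, and the infobox selectors are assumptions.

import requests
from bs4 import BeautifulSoup

def find_wikipedia_url(player_name):
    # Return the first Wikipedia page URL matching the player's name, or None.
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={"action": "opensearch", "search": player_name,
                "limit": 1, "format": "json"},
        timeout=30,
    )
    urls = resp.json()[3]  # opensearch returns [query, titles, descriptions, urls]
    return urls[0] if urls else None

def extract_birthplace(wiki_url):
    # Read the "Place of birth" value from the page's infobox, if present.
    soup = BeautifulSoup(requests.get(wiki_url, timeout=30).text, "html.parser")
    for row in soup.select("table.infobox tr"):
        header = row.find("th")
        if header and "place of birth" in header.get_text(strip=True).lower():
            cell = row.find("td")
            return cell.get_text(" ", strip=True) if cell else None
    return None

# Example for a single player
url = find_wikipedia_url("Beauden Barrett")
print(extract_birthplace(url) if url else "No Wikipedia page found")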

Note: With any automated system like this, you need some manual checks to make sure the machine is doing things correctly.

Then, using his visualisation skills, Sai produced the interactive dashboard in Tableau.

Conclusion

Collating lots of information in one place is a very useful and often overlooked data analysis step. However, being able to identify data sources that provide insightful information about your business, an event or an idea is one of the first powerful steps you can take to extract the true value from your data.

If you would like help collecting and organising your data to create insightful and dynamic dashboards and reports, please get in touch today for more information on ways we can help you.

 

To keep up with all things data and White Box, follow us on our LinkedIn page.