WEB SCRAPING & DATA INTEGRATION – Part 2

3. Scraping

Data scraping, also known as web scraping, is the process of importing data from a website into your own data system. It’s one of the most efficient ways to get data from the web, and in some cases to channel that data to another website.

To scrape external data, you have to communicate over the Internet (HTTP) by using the well-known Python library Requests. Once we retrieve the data, we will have to parse it from HTML, for which we will use an HTML parser such as Beautiful Soup (backed by lxml).
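As a minimal, generic sketch of that workflow (the URL here is just a placeholder):

import requests
from bs4 import BeautifulSoup

# Fetch a web page over HTTP with Requests
response = requests.get("http://example.com")

# Parse the returned HTML with Beautiful Soup, backed by the lxml parser
soup = BeautifulSoup(response.text, "lxml")

# Extract a piece of the document, for instance the page title
print(soup.title.text)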

Scraping represents a strong added value for any Database aiming to be extended and enriched by external data from the web. In our Data Architecture for “Le Droguier”, scraping external data from related and relevant websites is an effective way to consolidate your data sources in order to obtain a unique, large and highly condensed Database for students in Pharmacy.

1. Encyclopedia of Life – EOL

A relevant external source is the “Encyclopedia of Life” (EOL) website. www.eol.org is a free and collaborative encyclopedia which documents all of the 1.9 million living species known to science. It is compiled from existing databases and from contributions by experts and non-experts throughout the world. It aims to build one “infinitely expandable” page for each species, including video, sound, images, graphics, as well as text. In addition, the Encyclopedia incorporates content from the Biodiversity Heritage Library, which digitizes millions of pages of printed literature from the world’s major natural history libraries.

EOL has an API which lets you embed the functionality of EOL into your own website and tools, helping to make EOL an ingredient in biodiversity applications: http://eol.org/api/docs/search

As we are developers, we will obviously be using the URL:

http://eol.org/api/search/1.0.json?q=query

instead of the UI proposed on the EOL website. Moreover, we consider that the API response on its own doesn’t match our expectations for the Data Aggregation. For that reason, we won’t content ourselves with the data the API returns: instead, we will use it to jump to the species page and scrape ALL the relevant information.

EOL API Call and Response

Let’s say we need to add information for the sample ref_id: 5071, called Melittis melissophyllum.

By using the call :

http://eol.org/api/search/1.0.json?q=Melittis%20Melissophyllum

EOL replies as shown below:
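Here is an abridged, illustrative reconstruction of that JSON response (the overall structure follows the EOL search API; apart from the first id, the values are placeholders):

{
    "totalResults": 6,
    "startIndex": 1,
    "itemsPerPage": 30,
    "results": [
        {
            "id": 5381527,
            "title": "Melittis melissophyllum",
            "link": "http://eol.org/pages/5381527",
            "content": "Melittis melissophyllum; ..."
        },
        ...
    ]
}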

On the Python side, you could run a script like this one (a minimal sketch) to fetch only the ids from EOL:
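import requests

# Query the EOL search API for our sample's scientific name
url = "http://eol.org/api/search/1.0.json"
params = {"q": "Melittis melissophyllum"}
data = requests.get(url, params=params).json()

# Print only the id of each search result
for result in data["results"]:
    print(result["id"])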

which should output as follows:


5381527
5381527
36983592
5381518
5381531
45851589

The good news is that we received results. The bad news is that we received SEVERAL results… You will even see further on that some calls might return more than 100 results. We will see how to handle that.

The reason is likely that the EOL search engine also (or mainly) reads the content information, not only the title.

We could deal with these many results by linking several EOL pages to each sample of our Database, but the first goal in this use case is to aggregate 100 % relevant data. We therefore need to find a way to select the right (or the best) link related to our sample.

Aggregating by Sample Title

Our first option would be to aggregate the right data by finding the EOL sample having the same title as our sample. If we take a look at the first result of the JSON file returned by the API, we notice that it could work:

Let’s build a Pandas DataFrame containing all the fields returned by the API call:
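A minimal sketch, assuming the parsed JSON from the call above is still held in the variable data:

import pandas as pd

# One row per search result, one column per field returned by the API
df = pd.DataFrame(data["results"])
print(df)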

Let’s create a function returning a DataFrame which contains the EOL data having the same name in the title field:
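A sketch of such a function (the name match_by_title and the case-insensitive comparison are assumptions, not the original implementation):

def match_by_title(df, sample_title):
    # Keep only the rows whose title equals the sample's title
    return df[df["title"].str.lower() == sample_title.lower()]

matches = match_by_title(df, "Melittis melissophyllum")
print(matches)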

output:
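(Illustrative, reconstructed from the duplicate id in the list above; the trailing columns are truncated:)

        id                     title  ...
0  5381527  Melittis melissophyllum  ...
1  5381527  Melittis melissophyllum  ...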

We now have 2 results, but it seems that both lines have the same id. It is probably a duplicate entry in the EOL Database. We just need to prevent any duplicate result by checking whether the ids are the same or not. Let’s modify the function as follows:
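A sketch again, this time deduplicating on the id column (drop_duplicates is standard pandas; the rest keeps the assumptions above):

def match_by_title(df, sample_title):
    # Keep only the rows whose title equals the sample's title
    matches = df[df["title"].str.lower() == sample_title.lower()]
    # Keep a single row per distinct EOL id to discard duplicate entries
    return matches.drop_duplicates(subset="id")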

Let’s see in the next part how to create and update our Database with the EOL data.

<< Part 1 Part 3 >>