WEB SCRAPING & DATA INTEGRATION – PART 4

Molecular Biology – Barcode Image Scraping

In this part, let’s focus on the next HTML element we would like to scrape, namely the Molecular Biology section of the EOL detail page. Here is the HTML code:

The following script scrapes the barcode image link from the URL:

Output:

http://v2.boldsystems.org/connect/REST/getBarcodeRepForSpecies.php?taxid=291828&iwidth=400
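
As a minimal sketch of how such a link can be pulled out with BeautifulSoup, assume the barcode is an <img> tag whose src points at boldsystems.org. The HTML fragment, the SAMPLE_HTML name and the find_barcode_image helper are hypothetical and only illustrate the idea; the real EOL markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for the Molecular Biology section;
# the real EOL markup may be structured differently.
SAMPLE_HTML = """
<div id="molecular_biology">
  <h3>Barcode data</h3>
  <img src="/images/logo.png" alt="logo">
  <img src="http://v2.boldsystems.org/connect/REST/getBarcodeRepForSpecies.php?taxid=291828&amp;iwidth=400" alt="barcode">
</div>
"""


def find_barcode_image(html):
    """Return the src of the first <img> served by boldsystems.org, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # Only consider <img> tags that actually carry a src attribute.
    for img in soup.find_all("img", src=True):
        if "boldsystems.org" in img["src"]:
            return img["src"]
    return None


barcode_url = find_barcode_image(SAMPLE_HTML)
print(barcode_url)
```

Matching on the host name rather than on the image's position keeps the sketch independent of how many other images the section contains.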

Geographic Distribution

Let’s go further: we now want to aggregate data on the geographic distribution of each sample, which can be found in the Maps section of the EOL web page. Moreover, we only want to aggregate data that has been validated and carries the Trusted label.

We need to be careful here, because we cannot scrape the content directly from the page. The EOL Maps page only lists links, and a distribution map becomes visible only after following one of them.

To handle this, we first scrape the links listed on the Maps page, then launch a second scrape on each destination page. Our Python script is getting a bit more complicated, but once you grasp the logic of BeautifulSoup scraping, you will find it very easy.

 

Output:

Link found from Map Page: /data_objects/21321519
Map Image link: http://www.discoverlife.org/mp/20m?map=Melittis+melissophyllum
------------------
Link found from Map Page: /data_objects/30732673
Map Image link: http://media.eol.org/content/2011/12/25/14/02861_580_360.jpg
------------------
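
The two-step logic described above (collect the trusted links on the Maps page, then scrape each destination page for its map image) can be sketched as follows. Everything about the markup is an assumption: the "trusted" and "untrusted" class names, the "map" image class, the /pages/maps path and the /data_objects/99999999 entry are hypothetical, and fetch() is a stand-in that reads from a dictionary instead of downloading pages.

```python
from bs4 import BeautifulSoup

# Hypothetical pages standing in for the EOL Maps page and its detail pages.
# Class names and the "/pages/maps" path are assumptions for illustration.
PAGES = {
    "/pages/maps": """
        <ul>
          <li class="trusted"><a href="/data_objects/21321519">Map 1</a></li>
          <li class="untrusted"><a href="/data_objects/99999999">Map 2</a></li>
          <li class="trusted"><a href="/data_objects/30732673">Map 3</a></li>
        </ul>""",
    "/data_objects/21321519":
        '<img class="map" src="http://www.discoverlife.org/mp/20m?map=Melittis+melissophyllum">',
    "/data_objects/30732673":
        '<img class="map" src="http://media.eol.org/content/2011/12/25/14/02861_580_360.jpg">',
}


def fetch(path):
    """Stand-in for an HTTP GET; a real script would download the page."""
    return PAGES[path]


def trusted_map_links(maps_html):
    """Step 1: collect the hrefs of links listed under a trusted entry."""
    soup = BeautifulSoup(maps_html, "html.parser")
    return [a["href"]
            for li in soup.find_all("li", class_="trusted")
            for a in li.find_all("a", href=True)]


def map_image(detail_html):
    """Step 2: return the map image src on a destination page, or None."""
    soup = BeautifulSoup(detail_html, "html.parser")
    img = soup.find("img", class_="map")
    return img["src"] if img else None


# Chain the two steps: one scrape of the Maps page, one scrape per link.
results = [(link, map_image(fetch(link)))
           for link in trusted_map_links(fetch("/pages/maps"))]

for link, image in results:
    print("Link found from Map Page:", link)
    print("Map Image link:", image)
    print("------------------")
```

Keeping the download step in its own function means the parsing logic can be tried on saved HTML first and pointed at the live site afterwards.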
