WEB SCRAPING & DATA INTEGRATION – PART 3

After testing the EOL API, retrieving the ID and creating a temporary pandas DataFrame, we will now store the EOL link in our Database. We will keep using the MySQLdb Python module.

We first need to add a new column to our Database to store the EOL link we fetched with the API.
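A minimal sketch of that step could look like the following. It assumes a table named `samples` (a hypothetical name, since the schema is not shown here), a `VARCHAR(255)` column type (also an assumption) and an already open MySQLdb cursor.

```python
def add_eol_link_column(cursor, table="samples"):
    """Add an eol_link column to `table`.

    `samples` is a hypothetical table name; replace it with your own.
    `cursor` is any DB-API cursor, e.g. one obtained from MySQLdb.
    """
    # Identifiers cannot be bound as query parameters, so check the
    # name before interpolating it into the statement.
    if not table.isidentifier():
        raise ValueError("unexpected table name: %r" % table)
    sql = "ALTER TABLE {} ADD COLUMN eol_link VARCHAR(255) NULL".format(table)
    cursor.execute(sql)
    return sql


# Usage, assuming a reachable MySQL server (credentials are placeholders):
#   import MySQLdb
#   db = MySQLdb.connect(host="localhost", user="user", passwd="secret", db="mydb")
#   add_eol_link_column(db.cursor())
#   db.commit()
```

The function only needs an object with an `execute` method, so it can be tried against a test cursor before touching the real Database.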

Then we update the eol_link field of the corresponding sample in our Database with an SQL query.
Remember that the function we built in Part 2 returns a DataFrame.
To fetch the EOL ID from the DataFrame, we can just run:
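As a hedged illustration, the snippet below rebuilds a one-row DataFrame by hand instead of calling the Part 2 function; the variable name `df` is an assumption, and only the `id` column is taken from the output shown in this post.

```python
import pandas as pd

# Stand-in for the DataFrame returned by the Part 2 function;
# the variable name `df` is an assumption.
df = pd.DataFrame({"id": [5381527]})

# Selecting the column returns a pandas Series (index plus values)
print(df["id"])

# .iloc[0] takes the first row's value; int() converts the NumPy
# integer into a plain Python int, which is easier to pass to SQL.
eol_id = int(df["id"].iloc[0])
```

Printing the column shows the Series, while `eol_id` holds the bare number that the later SQL query needs.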

0 5381527
Name: id, dtype: int64

If we keep working on the same sample from our Database, we now need to start the process with the ref_id of the sample, namely 5071 (see the illustration in Part 2), and then run the update:
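One possible way to write this update is sketched below. The table name `samples` is hypothetical, the URL pattern is copied from the output shown in this post, and `cursor` is assumed to be an open MySQLdb cursor.

```python
# URL pattern taken from the stored link shown in this post
EOL_URL = "http://eol.org/{}?action=overview&controller=taxa"


def update_eol_link(cursor, ref_id, eol_id):
    """Store the EOL overview URL for the sample identified by ref_id.

    `samples` is a hypothetical table name; adapt it to your schema.
    """
    link = EOL_URL.format(eol_id)
    # %s placeholders let the driver escape the values (MySQLdb paramstyle)
    cursor.execute(
        "UPDATE samples SET eol_link = %s WHERE ref_id = %s",
        (link, ref_id),
    )
    return link


# Usage, assuming `db` is an open MySQLdb connection:
#   update_eol_link(db.cursor(), 5071, 5381527)
#   db.commit()
```

Passing the values as a separate tuple, rather than formatting them into the SQL string, leaves the quoting to the driver.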

If the SQL query went well, the field should have been updated. Let’s check :
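A check of this kind might be sketched as follows, again assuming the hypothetical `samples` table and an open MySQLdb cursor.

```python
def fetch_eol_link(cursor, ref_id):
    """Return the (ref_id, eol_link) row for one sample, or None if absent.

    `samples` is a hypothetical table name; adapt it to your schema.
    """
    cursor.execute(
        "SELECT ref_id, eol_link FROM samples WHERE ref_id = %s",
        (ref_id,),
    )
    # fetchone() returns the next row of the result set, or None
    return cursor.fetchone()


# Usage, assuming `db` is an open MySQLdb connection:
#   print(fetch_eol_link(db.cursor(), 5071))
```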

Output:

5071 http://eol.org/5381527?action=overview&controller=taxa

We will now be able to scrape all the data we want from EOL into our Database. Let’s see what we can do with BeautifulSoup, assuming that we have located the relevant data that we wish to add to our Database.

1) The taxonomy established by NCBI Taxonomy:

2) All the information from the Detail page, especially the Molecular Biology section:

As you might guess, the best tool for quickly inspecting HTML and CSS elements is the inspector provided by your web browser. For instance, let’s locate the elements with Google Chrome:
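Once the element has been located in the inspector, extracting its text with BeautifulSoup can be sketched as below. The HTML is a toy stand-in rather than the real EOL markup, and the class name `ncbi-lineage` is invented for this example; replace both with what the inspector actually shows.

```python
from bs4 import BeautifulSoup

# Toy stand-in for the EOL page: the real markup is not reproduced here,
# and the class name "ncbi-lineage" is invented for this example.
html = (
    '<div class="ncbi-lineage">'
    '<a href="#">Cellular organisms</a> +'
    '<a href="#">Eukaryota</a> +'
    '<a href="#">Viridiplantae</a>'
    '</div>'
)

soup = BeautifulSoup(html, "html.parser")

# Locate the element spotted in the browser inspector, then flatten it to text
lineage = soup.find("div", class_="ncbi-lineage")
raw_taxonomy = lineage.get_text()
print(raw_taxonomy)
```

On a real page, `soup.find` returns None when nothing matches, so it is worth checking the result before calling `get_text()`.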

Output:

'Cellular organisms +Eukaryota +Viridiplantae +Streptophyta +Streptophytina +Embryophyta +Tracheophyta +Euphyllophyta +Spermatophyta +Magnoliophyta +Mesangiospermae +Eudicotyledons +Gunneridae +Pentapetalae +Asterids +Lamiids +Lamiales +Lamiaceae +Lamioideae +Stachydeae +Melittis + Melittis melissophyllum '

Let’s clean our data by splitting the string on each ‘+’ and removing the surrounding whitespace with strip(). Then let’s store each name in a list:
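A short sketch of this cleaning step, using a shortened version of the string above; the function name `clean_taxonomy` is made up for this example.

```python
def clean_taxonomy(raw):
    """Split a '+'-separated lineage string into a list of trimmed names."""
    # split("+") cuts on every separator; strip() drops the surrounding
    # whitespace; the `if` skips any empty piece left by a stray '+'.
    return [name.strip() for name in raw.split("+") if name.strip()]


raw = "Cellular organisms +Eukaryota +Viridiplantae +Melittis + Melittis melissophyllum "
names = clean_taxonomy(raw)
print(names)
```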

['Cellular organisms',
'Eukaryota',
'Viridiplantae',
'Streptophyta',
'Streptophytina',
'Embryophyta',
'Tracheophyta',
'Euphyllophyta',
'Spermatophyta',
'Magnoliophyta',
'Mesangiospermae',
'Eudicotyledons',
'Gunneridae',
'Pentapetalae',
'Asterids',
'Lamiids',
'Lamiales',
'Lamiaceae',
'Lamioideae',
'Stachydeae',
'Melittis',
'Melittis melissophyllum']

We now have a very clean list that we can store in our Database.
