After testing the EOL API, retrieving the id and creating a temporary pandas DataFrame, we will now store the EOL link in our Database. We will still be using the MySQLdb Python module.
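As a reminder, the db object used below is an open MySQLdb connection. A minimal sketch, with hypothetical credentials and database name to adapt to your own setup:

import MySQLdb

# Hypothetical connection parameters; replace with your own
db = MySQLdb.connect(host="localhost", user="user",
                     passwd="password", db="drog_db")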
We first need to add a new column to our Database to store the EOL link we fetched with the API.
ALTER TABLE drog_collect ADD eol_link MEDIUMTEXT;
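To double-check that the column was created, one quick way is to run DESCRIBE through the same connection (a small sketch):

cursor = db.cursor()
cursor.execute("DESCRIBE drog_collect")
for column in cursor.fetchall():
    print(column)  # the new eol_link column should appear in the list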
Then we update the eol_link field for the corresponding sample with an SQL query.
Remember from Part 2 that the function we built returns a DataFrame. To fetch the EOL ID from the DataFrame, we can simply run:
eolSample = eolScraping('Melittis Melissophyllum')
eolSampleId = eolSample['id']
eolSampleId
0 5381527
Name: id, dtype: int64
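The UPDATE below needs the full EOL link rather than the bare id. If the Part 2 function does not already return it, the link can be rebuilt from the id; a sketch matching the URL format we are about to store:

# Rebuild the EOL overview URL from the numeric id
# (assumed format, taken from the link we store below)
eolSampleLink = "http://eol.org/{}?action=overview&controller=taxa".format(int(eolSampleId[0]))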
Keeping to the same sample from our Database, we start the process with the ref_id of the sample, namely 5071 (see the illustration in Part 2), and then:
# ref_id of our sample
ref_id = 5071

# Prepare a cursor object using the cursor() method
cursor = db.cursor()

# Execute the SQL command
cursor.execute("UPDATE drog_collect SET eol_link = %s WHERE ref_id = %s",
               (eolSampleLink, ref_id))

# Don't forget to call commit() when updating a table with MySQLdb
db.commit()
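Since MySQLdb only applies changes on commit(), a defensive variant wraps the update so that a failure rolls the transaction back instead (a sketch):

try:
    cursor.execute("UPDATE drog_collect SET eol_link = %s WHERE ref_id = %s",
                   (eolSampleLink, ref_id))
    db.commit()
except MySQLdb.Error:
    db.rollback()  # undo the pending change instead of committing it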
If the SQL query went well, the field should have been updated. Let’s check:
# Prepare a cursor object using the cursor() method
cursor = db.cursor()

sql = "SELECT ref_id, eol_link FROM drog_collect WHERE ref_id = 5071"

try:
    # Execute the SQL command
    cursor.execute(sql)
    # Fetch all the returned rows
    results = cursor.fetchall()
    for row in results:
        ref_id = row[0]
        eol_link = row[1]
        print(ref_id, eol_link)
except:
    print("Error: unable to fetch data")

# db.close()
OUTPUT:
5071 http://eol.org/5381527?action=overview&controller=taxa
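Since ref_id identifies a single row, cursor.fetchone() is a lighter alternative to fetchall() here; a small sketch:

cursor = db.cursor()
cursor.execute("SELECT eol_link FROM drog_collect WHERE ref_id = %s", (5071,))
row = cursor.fetchone()  # at most one row for a given ref_id
if row:
    print(row[0])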
We can now scrape all the data we want from EOL into our Database. Let’s see what we can do with BeautifulSoup, assuming we have located the relevant data we wish to add to our Database.
1) The taxonomy established by NCBI Taxonomy:
2) All the information from the Detail page, especially the Molecular Biology section:
As you might guess, the best tool for quickly inspecting HTML and CSS elements is the element inspector provided by your web browser. For instance, let’s locate the elements with Google Chrome:
import requests
from bs4 import BeautifulSoup

url = requests.get("http://eol.org/pages/5381527/overview")
soup = BeautifulSoup(url.text, "lxml")

if soup.find("div", {"id": "classification_tree"}):
    section = soup.find("div", {"id": "classification_tree"})
    for e in section.findAll("ul", {"class": "branch"})[0]:
        element = e.text.replace('\n', ' ').lstrip()

element
OUTPUT:
'Cellular organisms +Eukaryota +Viridiplantae +Streptophyta +Streptophytina +Embryophyta +Tracheophyta +Euphyllophyta +Spermatophyta +Magnoliophyta +Mesangiospermae +Eudicotyledons +Gunneridae +Pentapetalae +Asterids +Lamiids +Lamiales +Lamiaceae +Lamioideae +Stachydeae +Melittis + Melittis melissophyllum '
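As a side note, in a longer scraping run it is worth checking the HTTP response before parsing; a small sketch:

resp = requests.get("http://eol.org/pages/5381527/overview", timeout=10)
resp.raise_for_status()  # stop on 4xx/5xx instead of parsing an error page
soup = BeautifulSoup(resp.text, "lxml")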
Let’s clean our data by splitting the string on each ‘+’ and removing the whitespace with strip(), and let’s store each name in a list:
elements = element.split('+')
temp_list = []

for n in elements:
    temp_list.append(n.strip())

temp_list
['Cellular organisms',
'Eukaryota',
'Viridiplantae',
'Streptophyta',
'Streptophytina',
'Embryophyta',
'Tracheophyta',
'Euphyllophyta',
'Spermatophyta',
'Magnoliophyta',
'Mesangiospermae',
'Eudicotyledons',
'Gunneridae',
'Pentapetalae',
'Asterids',
'Lamiids',
'Lamiales',
'Lamiaceae',
'Lamioideae',
'Stachydeae',
'Melittis',
'Melittis melissophyllum']
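As a side note, the same cleaning step fits in one idiomatic list comprehension:

temp_list = [n.strip() for n in element.split('+')]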
We now have a very clean list that we can store in our Database.
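To close the loop, here is one way to store it, a sketch that reuses the eol_link pattern and assumes a hypothetical taxonomy column with ‘ > ’ as an arbitrary separator:

# Hypothetical taxonomy column, added the same way as eol_link
cursor = db.cursor()
cursor.execute("ALTER TABLE drog_collect ADD taxonomy MEDIUMTEXT")

# Store the cleaned list as one delimited string for our sample
cursor.execute("UPDATE drog_collect SET taxonomy = %s WHERE ref_id = %s",
               (' > '.join(temp_list), 5071))
db.commit()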