Web Scraping & Data Integration

I will here expose a use case whose the main idea is to consolidate a Relational Database by aggregating external data queried and fetched on the web. This project has been developed for the Pharmacology University of Montpellier in France for a Thesis collaboration directed by Thomas Citterio. We worked together for the creation, design and development of the new data system for « La Collection du Droguier », Drug Collection started in 1663 at the University of Pharmacy in Montpellier.

I will try to show and explain the Data processing based mainly based on External Data Scraping initially developped in PHP (for a Laravel front end) but converted in Python for the presentation.

1. The original Database

We will not detail how the initial Database has been created but we will start from the step when all the drug samples have been entered in a Database (MySQL). The Database structure has been designed in order to store all the features availables previously stored (… for years ! ) in small handwritten paper cards.But for information, the only content available (except the physical samples and the paper cards ) was a scanned picture of the paper card for each sample as shown on the left. We attempted to use many OCR (Optical character recognition) but as you assume, it was definitely to hard to read characters and that’s why the biggest part of the collection has been entered and checked by hand in the system for insuring and controlling the Data quality.

The MySQL Database has been as shown here below:

2. MySQL Data to Python Pandas Dataframe

Let’s now handle our Data with Python. For connecting to the MySQL Database through a MySQLConnection object , you can refer our page : http://devdataproject.com/data-pipeline-from-mysql-to-spark-with-python/ where you will have explanation on how to set the set the MySQLremote connection and how to implement MySQLdb.

For beginning, we will import Data to Python and create a Pandas Dataframe by fetching Data ref_id, nom_l, num_fam, num_genre where :

  • ref_id : is the unique ID of each sample in the Database
  • nom_l : the latin name of the sample
  • num_fam: the family number
  • num_genre : the gender family

Print the dataframe output:

Part 2 >>