Let's continue and play with the DataFrame we have just created.

A typical operation for many purposes is to export our pandas DataFrame to a CSV file. For that, we will use the pandas method to_csv. You can find the complete list of options here.
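A minimal sketch of the export (the column names come from the data shown later in this article; the values and the file name export.csv are hypothetical stand-ins):

```python
import pandas as pd

# Small stand-in for the DataFrame built earlier; column names are
# taken from the article's data, the values are illustrative.
df = pd.DataFrame({
    "user_id": [2, 9],
    "user_email": ["", ""],
    "user_info": [693881, 500913],
    "user_info2": [None, 178416],
})

# to_csv writes the DataFrame to disk. By default the row index is
# written too, which later shows up as an extra unnamed first column.
df.to_csv("export.csv")
```

Passing index=False would drop that extra index column if you do not need it.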


For the next steps, we will assume that we need to perform heavy computing tasks; our choice for that is Spark. Spark provides a faster and more general data processing platform, and lets you run programs up to 100x faster in memory, or 10x faster on disk, than Hadoop.

We are going to use Apache Spark version 2.2.1, and we will create a DataFrame instead of an RDD. Like a Spark RDD, a Spark DataFrame is an immutable distributed collection of data, but unlike an RDD, a DataFrame is organized into named columns, like a table in a relational database.

Most data engineers recommend developing either in Scala (the native Spark language) or in Python through the complete PySpark API. Using PySpark gives you the possibility to develop in a Jupyter Notebook, modifying and re-executing parts of your code in a very flexible way. Jupyter is a great tool for testing and prototyping programs.

I will not detail here how to install Spark, but you can find many resources on the web.

As shown below, you can simply load the CSV file previously exported into a Spark DataFrame. Since the Spark 2.0.0 release, a new abstraction is available: the SparkSession. It can be instantiated and called upon just like the SparkContext used previously.

I strongly encourage you to train yourself to handle Spark DataFrames. As you can notice, thanks to PySpark the operations are quite similar to those used for pandas DataFrames. You can:

– Return a defined number of rows as a list of Row objects with the command take():

[Row(_c0=None, _c1='user_id', _c2='user_email', _c3='user_info', _c4='user_info2'),
Row(_c0=0, _c1='2', _c2='', _c3='693881', _c4=None),
Row(_c0=1, _c1='9', _c2='', _c3='500913', _c4='178416'),
Row(_c0=2, _c1='15', _c2='', _c3='122067', _c4=None),
Row(_c0=3, _c1='16', _c2='', _c3='476609', _c4='973557'),
Row(_c0=4, _c1='27', _c2='', _c3='243361', _c4=None),
Row(_c0=5, _c1='38', _c2='', _c3='966782', _c4=None),
Row(_c0=6, _c1='89', _c2='', _c3='598241', _c4=None),
Row(_c0=7, _c1='101', _c2='', _c3='889013', _c4=None),
Row(_c0=8, _c1='127', _c2='', _c3='708993', _c4='672496')]

* Note that the header row counts as one row.

– Print the DataFrame schema in tree format:

root
 |-- _c0: integer (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: string (nullable = true)

– Select and print a specific column of the DataFrame:

– Select and filter data by using attribute (df._c1) or by indexing (df['_c1']):

– Use the describe() operation to calculate summary statistics of the numerical column(s) in our Spark DataFrame:

Spark SQL

Spark SQL is an Apache Spark module for working with structured data, just as in a relational database. It allows you to query structured data inside a Spark environment using either SQL or the familiar DataFrame API, and is usable in Java, Scala, Python and R.

Let's say that we need to select specific rows from our DataFrame, apply a function, and create and export a view. For that, we will create a temporary table with registerTempTable(). Pay attention that the lifetime of this temporary table is tied to the SQLContext that was used to create the DataFrame.

We are especially interested in customers having some data in the field user_info2. Our header row has been removed, and we will work directly with _c0, _c1, _c2 and _c3 as column names. We can simply send a SQL query selecting customers whose field '_c3' (previously user_info2) is not null, as shown below. Notice that the method toPandas() has been added: it returns the contents as a pandas DataFrame.

It's now time to apply our function. Let's purposefully use a simple function for the example: we want the sum of the fields user_info and user_info2 (_c2, _c3), stored in a new column of the DataFrame:

Don't forget that Spark is designed for distributed computing! It would be overkill to use it simply to read a CSV file or to apply a trivial function as in our example.

<< Part 1