REST API to Spark Dataframe

July 27, 2019

As the number of users in the digital world grows, a lot of raw data is being generated from which insights can be derived. This is where REST APIs come into the picture: they bridge the communication gap between the client (your software program) and the server (the website’s data).

Introduction

REST APIs act as a gateway for two-way communication between two software applications. A client can use one either to fetch data from a server or to create data on it, using the appropriate HTTP methods.

This post explains how to connect to a server through an API and pull data from it. We will also learn how to use Spark to perform further transformations or analysis on this type of data.

By the end of this post, you should be clear on the following areas:

  • Connecting to a REST API using Python’s requests module
  • Basic HTTP methods like GET & POST to retrieve the data
  • Python’s JSON Module to play with the API Output
  • Load the API’s data into a Spark Dataframe.

What is a REST API?

An API (Application Programming Interface), in layman’s terms, is a piece of code that facilitates the interaction between two software programs. It exposes a set of predefined protocols and definitions that other applications can build on.

REST (Representational State Transfer) is an architectural style that defines a set of constraints for building web services, which in turn act as a medium of communication between two systems. RESTful services typically use HTTP requests to GET, POST, PUT and DELETE data.
For example, say you want to retrieve the top trending tweets in your location on a daily basis. Instead of doing this manually, you write an application or a piece of code to do it for you. This is where the Twitter API comes in: it enables your software to connect to Twitter and perform the desired operations. Similar APIs exist for websites such as Facebook, Instagram and Google.

Below are some of the principles an API must follow to qualify as RESTful:

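  • Client - Server Architecture - Client and server are decoupled and independent of each other.
  • Stateless - The server does not store client-side state between requests; each request carries all the information needed to process it.
  • Cacheable - Responses indicate whether they can be cached, so that clients or intermediaries can reuse them instead of repeating the call.
  • Layered system - The service can be composed of several layers working together, which helps scalability.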

HTTP Methods - Brief Overview

HTTP (Hypertext Transfer Protocol) is designed to facilitate communication between clients and servers. In simple terms, a client makes a request and the server returns a response.
This response includes a status code and may also carry a payload (the response body).

Below are some of the key HTTP methods:

  • GET : Fetch a specific item (by id) or a collection of items
  • POST : Create new data
  • PUT : Update a specific item (by id)
  • DELETE : Remove a specific item by id

The rest of this article goes into more detail on how to use GET to retrieve data from a specific API endpoint.

Python Requests

The Python requests module is one of the simplest and most intuitive libraries for talking to an API. It lets you interact with web services and consume data from them using the standard HTTP methods.

To install requests, run the following command from your shell:
pip install requests

Requests allows you to send headers, data, params and more along with your request.
Let’s use the GET method to connect to google.com. As shown below, the status code returned is “200”, which indicates a successful request.

requests.get("https://google.com/")
<Response [200]>
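
To make the role of headers and params more concrete, here is a minimal sketch that builds a request without sending it, so the final URL and headers can be inspected offline. The endpoint and the parameter names (lang, limit) are hypothetical and chosen only for illustration.

```python
import requests

# Hypothetical endpoint and parameter names, used only for illustration.
url = "https://example.com/api/entries/cogent"
params = {"lang": "en-gb", "limit": 5}
headers = {"Accept": "application/json"}

# Build the request without sending it, so we can inspect what would be sent.
prepared = requests.Request("GET", url, params=params, headers=headers).prepare()

print(prepared.method)             # HTTP method of the prepared request
print(prepared.url)                # URL with the params encoded as a query string
print(prepared.headers["Accept"])  # header attached to the request
```

Sending the same request for real would look like requests.get(url, params=params, headers=headers, timeout=10); the timeout value here is an arbitrary choice.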

Connecting to an API using Python’s Requests

Let’s jump in with a simple example to place a GET request to fetch the available items.

For this example, we will be using the Oxford Dictionaries API to get the details of a particular word, such as its meaning and example sentences. Kindly register on https://developer.oxforddictionaries.com to get an API key, and replace the app_id and app_key values in the code below with your own credentials so that you can try this example out.

import requests
import json

def get_word_details(word):
    language = "en-gb"
    headers = {"app_id": "c7f6d128",
               "app_key": "73ea2ed8109721300050137e74044fa6"}
    url = f"https://od-api.oxforddictionaries.com:443/api/v2/entries/{language}/{word.lower()}"
    response = requests.get(url, headers=headers)
    return response

if __name__ == "__main__":
    response = get_word_details("cogent")
    print(response.text)
{
    "id": "cogent",
    "metadata": {
        "operation": "retrieve",
        "provider": "Oxford University Press",
        "schema": "RetrieveEntry"
    },
    "results": [
        {
            "id": "cogent",
            "language": "en-gb",
            "lexicalEntries": [
                {
                    "derivatives": [
                        {
                            "id": "cogently",
                            "text": "cogently"
                        }
                    ],
                    "entries": [
                        {
                            "etymologies": [
                                "mid 17th century: from Latin cogent-‘compelling’, from the verb cogere, from co-‘together’ + agere‘drive’"
                            ],
                            "senses": [
                                {
                                    "definitions": [
                                        "(of an argument or case) clear, logical, and convincing"
                                    ],
                                    "examples": [
                                        {
                                            "text": "they put forward cogent arguments for British membership"
                                        },
                                        {
                                            "text": "the newspaper's lawyers must prepare a cogent appeal"
                                        }
                                    ],
                                    "id": "m_en_gbus0197450.005",
                                    "shortDefinitions": [
                                        "clear and convincing"
                                    ],
                                    "thesaurusLinks": [
                                        {
                                            "entry_id": "cogent",
                                            "sense_id": "t_en_gb0002496.001"
                                        }
                                    ]
                                }
                            ]
                        }
                    ],
                    "language": "en-gb",
                    "lexicalCategory": {
                        "id": "adjective",
                        "text": "Adjective"
                    },
                    "pronunciations": [
                        {
                            "audioFile": "http://audio.oxforddictionaries.com/en/mp3/cogent_gb_1.mp3",
                            "dialects": [
                                "British English"
                            ],
                            "phoneticNotation": "IPA",
                            "phoneticSpelling": "ˈkəʊdʒ(ə)nt"
                        }
                    ],
                    "text": "cogent"
                }
            ],
            "type": "headword",
            "word": "cogent"
        }
    ],
    "word": "cogent"
}

API Response

Now that we’ve established a connection to the API, let’s explore some of the attributes of the response, such as its status_code, content and headers.

Status Code

This attribute tells us whether the API call succeeded. A successful GET request typically returns the status code “200”. More generally, codes in the 2xx range indicate success, 4xx codes indicate an error on the client’s side, and 5xx codes indicate an error on the server’s side.

Checking the status of your request is as simple as reading the status_code attribute of the response, as shown below:

response.status_code
200
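
As a rough illustration of how status codes group into classes, the sketch below uses Python’s standard http module. The helper name describe_status is an assumption made for this example and is not part of requests.

```python
from http import HTTPStatus

def describe_status(code):
    """Return (category, reason phrase) for an HTTP status code."""
    categories = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    # The first digit of the code identifies its class.
    category = categories.get(code // 100, "unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Codes that are not registered in the standard library.
        phrase = "Unknown"
    return category, phrase

for code in (200, 201, 404, 500):
    print(code, describe_status(code))
```

With requests itself, response.ok and response.raise_for_status() offer similar checks without inspecting the number by hand.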

Headers

This property holds the headers that the server sent back with the response.
Headers contain useful information such as the content type, the API version, the rate limit allowed for the API and other similar details.
The headers attribute returns a dictionary-like object (a CaseInsensitiveDict), so header names can be looked up regardless of case.

The below example demonstrates how to display the headers:

print(type(response.headers))
print(response.headers)
<class 'requests.structures.CaseInsensitiveDict'>
{'api_version': 'v2', 'code_version': 'v2.4.1.2-0-g284004c', 'Content-Type': 'application/json;charset=utf-8', 'Date': 'Thu, 25 Jul 2019 10:11:50 GMT', 'Server': 'openresty/1.13.6.2', 'Content-Length': '4731', 'Connection': 'Keep-Alive', 'Age': '0'}
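
The case-insensitive lookup can be tried offline by constructing the same dictionary type directly. This is a small sketch using two of the header values from the output above; it does not contact the API.

```python
from requests.structures import CaseInsensitiveDict

# Two of the headers shown in the output above, rebuilt by hand.
headers = CaseInsensitiveDict({
    "Content-Type": "application/json;charset=utf-8",
    "api_version": "v2",
})

# Lookups ignore the case of the header name.
print(headers["content-type"])
print(headers.get("API_VERSION"))

# Converting to a plain dict keeps the original spelling of the keys.
print(dict(headers))
```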

Content

This property returns the response body as raw bytes.
This is useful in certain cases, such as reconstructing an image from its raw bytes.

The body can be accessed as bytes through the .content attribute.

print(response.content)
b'{\n    "id": "cogent",\n    "metadata": {\n        "operation": "retrieve",\n        "provider": "Oxford University Press",\n        "schema": "RetrieveEntry"\n    },\n    "results": [\n        {\n            "id": "cogent",\n            "language": "en-gb",\n            "lexicalEntries": [\n                {\n                    "derivatives": [\n                        {\n                            "id": "cogently",\n                            "text": "cogently"\n                        }\n                    ],\n                    "entries": [\n                        {\n                            "etymologies": [\n                                "mid 17th century: from Latin cogent-\xe2\x80\x98compelling\xe2\x80\x99, from the verb cogere, from co-\xe2\x80\x98together\xe2\x80\x99 + agere\xe2\x80\x98drive\xe2\x80\x99"\n                            ],\n                            "senses": [\n                                {\n                                    "definitions": [\n                                        "(of an argument or case) clear, logical, and convincing"\n                                    ],\n                                    "examples": [\n                                        {\n                                            "text": "they put forward cogent arguments for British membership"\n                                        },\n                                        {\n                                            "text": "the newspaper\'s lawyers must prepare a cogent appeal"\n                                        }\n                                    ],\n                                    "id": "m_en_gbus0197450.005",\n                                    "shortDefinitions": [\n                                        "clear and convincing"\n                                    ],\n                                    "thesaurusLinks": [\n                                        {\n                                            
"entry_id": "cogent",\n                                            "sense_id": "t_en_gb0002496.001"\n                                        }\n                                    ]\n                                }\n                            ]\n                        }\n                    ],\n                    "language": "en-gb",\n                    "lexicalCategory": {\n                        "id": "adjective",\n                        "text": "Adjective"\n                    },\n                    "pronunciations": [\n                        {\n                            "audioFile": "http://audio.oxforddictionaries.com/en/mp3/cogent_gb_1.mp3",\n                            "dialects": [\n                                "British English"\n                            ],\n                            "phoneticNotation": "IPA",\n                            "phoneticSpelling": "\xcb\x88k\xc9\x99\xca\x8ad\xca\x92(\xc9\x99)nt"\n                        }\n                    ],\n                    "text": "cogent"\n                }\n            ],\n            "type": "headword",\n            "word": "cogent"\n        }\n    ],\n    "word": "cogent"\n}'

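The escape sequences such as \xe2\x80\x98 in the output above are the UTF-8 bytes of the curly quotes in the etymology text. The sketch below uses a shortened, hand-written stand-in for response.content to show how such bytes decode back to text and parse into a dictionary.

```python
import json

# A shortened stand-in for response.content (UTF-8 encoded bytes).
raw = b'{"word": "cogent", "etymology": "from Latin cogent-\xe2\x80\x98compelling\xe2\x80\x99"}'

text = raw.decode("utf-8")   # bytes -> str
record = json.loads(text)    # str -> dict

print(record["word"])
print(record["etymology"])
```

For a UTF-8 response like this one, response.text typically gives the same decoded string, and response.json() the same dictionary.
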
Loading JSON Data into Spark Dataframe

Now that we have our raw JSON data, let’s load it into a Spark DataFrame.

Spark has a read.json method to read JSON data and load it into a Spark DataFrame. The read.json method accepts a file path, a list of file paths, or an RDD of JSON strings.
For this example, we will pass an RDD as an argument to the read.json method.
The RDD can be created by calling the sc.parallelize method, as shown below. Here sc is the SparkContext and spark is the SparkSession, both of which are predefined in the pyspark shell.

json_rdd = sc.parallelize([response.text])

Now, this RDD can be passed as an argument to the read.json method to create the corresponding dataframe.

df = spark.read.json(json_rdd)
df.show(truncate=False)
+------+--------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+
|id    |metadata                                          |results                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |word  |
+------+--------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+
|cogent|[retrieve, Oxford University Press, RetrieveEntry]|[[cogent, en-gb, [[[[cogently, cogently]], [[[mid 17th century: from Latin cogent-‘compelling’, from the verb cogere, from co-‘together’ + agere‘drive’], [[[(of an argument or case) clear, logical, and convincing], [[they put forward cogent arguments for British membership], [the newspaper's lawyers must prepare a cogent appeal]], m_en_gbus0197450.005, [clear and convincing], [[cogent, t_en_gb0002496.001]]]]]], en-gb, [adjective, Adjective], [[http://audio.oxforddictionaries.com/en/mp3/cogent_gb_1.mp3, [British English], IPA, ˈkəʊdʒ(ə)nt]], cogent]], headword, cogent]]|cogent|
+------+--------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+

Loading Multiple JSON Records

The above example showed how a single JSON record can be loaded into a Spark DataFrame.

Now, let’s take it up a notch and explore how to load multiple JSON responses.
Although there are several ways to achieve this, two are discussed in this post:

  • Using a List
  • Using a File

Using a List

Now let’s extract the details for a set of words and load the combined responses into a Spark DataFrame.

First, we’ll collect all of the responses together in a single object.

Note: When reading files, Spark by default expects JSON data in the newline-delimited JSON Lines format, which means the JSON file must meet the below three requirements:

  • Each Line of the file is a JSON Record
  • Line Separator must be ‘\n’ or ‘\r\n’
  • Data must be UTF-8 Encoded
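
The three requirements above can be sketched with the standard library alone. The helper name write_json_lines, the sample records and the temporary file location are assumptions made for this illustration.

```python
import json
import os
import tempfile

def write_json_lines(records, path):
    """Write one compact JSON record per line, UTF-8 encoded, newline separated."""
    # newline="\n" keeps the separator fixed regardless of the platform.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for rec in records:
            # Without indent, json.dumps never emits a raw line break.
            f.write(json.dumps(rec, ensure_ascii=False))
            f.write("\n")

# Small made-up records, just to exercise the helper.
records = [
    {"word": "cogent", "ipa": "ˈkəʊdʒ(ə)nt"},
    {"word": "digress", "ipa": None},
]
path = os.path.join(tempfile.mkdtemp(), "words.jsonl")
write_json_lines(records, path)

# Reading it back: every line is a complete JSON record on its own.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(len(loaded))
```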

For our example, let’s create a Python list consisting of all the responses and then convert it to an RDD. This RDD can be used to create the DataFrame.

import requests

def get_word_details(word):
    language = "en-gb"
    headers = {"app_id": "c7f6d128",
               "app_key": "73ea2ed8109721300050137e74044fa6"}
    url = f"https://od-api.oxforddictionaries.com:443/api/v2/entries/{language}/{word.lower()}"
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        if response.status_code == 404:
            print("Word Definition not found for {0}".format(word))
            return None
        else:
            raise Exception("API Hit Failed - {0}".format(response.text))
    return response

if __name__ == "__main__":
    words = ["cogent", "digress", "tangible", "diligent", "mellifluous", "obscure", "intelligible"]
    response_list = []
    for word in words:
        response = get_word_details(word)
        if response is not None:  # Skip words that were not found (404)
            response_list.append(response.text)  # Appending the JSON responses to the list

We can convert this list of JSON data into an RDD using the sc.parallelize method, as shown below,

json_rdd = sc.parallelize(response_list)

Finally we can convert the above RDD into a dataframe by calling the read.json method.

json_df = spark.read.json(json_rdd)
json_df.show()
+------------+--------------------+--------------------+------------+
|          id|            metadata|             results|        word|
+------------+--------------------+--------------------+------------+
|      cogent|[retrieve, Oxford...|[[cogent, en-gb, ...|      cogent|
|     digress|[retrieve, Oxford...|[[digress, en-gb,...|     digress|
|    tangible|[retrieve, Oxford...|[[tangible, en-gb...|    tangible|
|    diligent|[retrieve, Oxford...|[[diligent, en-gb...|    diligent|
| mellifluous|[retrieve, Oxford...|[[mellifluous, en...| mellifluous|
|     obscure|[retrieve, Oxford...|[[obscure, en-gb,...|     obscure|
|intelligible|[retrieve, Oxford...|[[intelligible, e...|intelligible|
+------------+--------------------+--------------------+------------+

Using a File

The above example is a simple illustration of loading JSON data into Spark.
When dealing with large amounts of data, collecting every response in a list would be inefficient and memory heavy.
In such cases, it’s better to store the API data in an external object, such as a file on disk, before loading it into a Spark DataFrame.

Let’s explore an example to load a file consisting of JSON strings.

import requests

def get_word_details(word):
    language = "en-gb"
    headers = {"app_id": "c7f6d128",
               "app_key": "73ea2ed8109721300050137e74044fa6"}
    url = f"https://od-api.oxforddictionaries.com:443/api/v2/entries/{language}/{word.lower()}"
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        if response.status_code == 404:
            print("Word Definition not found for {0}".format(word))
            return None
        else:
            raise Exception("API Hit Failed - {0}".format(response.text))
    return response

def write_to_file(words):
    output_file = "/Users/ecom-ahmed.noufel/Documents/words_file.json"
    for word in words:
        response = get_word_details(word)
        if response:
            # Open the file in append mode
            with open(output_file, 'a') as f:
                f.write(response.text)
                f.write("\n")
    return output_file

if __name__ == "__main__":
    words = ["cogent", "digress", "tangible", "diligent", "mellifluous", "obscure", "intelligible", "awa"]
    json_file = write_to_file(words)
    print("Output JSON Data Present in : {0}".format(json_file))
Word Definition not found for awa
Output JSON Data Present in : /Users/ecom-ahmed.noufel/Documents/words_file.json

Commonly faced issue - Corrupt Record

Now that we have our JSON data in a file, we can proceed to load it into a Spark DataFrame.
However, when we view the DataFrame’s schema, we see only a single _corrupt_record column.
The most common reason for this corrupt record issue is an incorrect JSON file structure.

json_file_df = spark.read.json(json_file)
json_file_df.printSchema()
root
 |-- _corrupt_record: string (nullable = true)

Cause of Issue

If we open this JSON file, we can see that a single JSON record spans multiple lines. Since Spark treats each line as a separate record by default, it cannot tell where one JSON record ends and the next begins.
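
The effect can be reproduced without Spark. The sketch below is only an analogy for a line-by-line reader: it counts how many lines of a text are complete JSON objects on their own, for a pretty-printed record and for its compact form. The helper name parseable_lines and the shortened record are assumptions for this illustration.

```python
import json

# A shortened record, standing in for one API response.
record = {"id": "cogent", "metadata": {"provider": "Oxford University Press"}}

pretty = json.dumps(record, indent=4)  # spans several lines, like the raw API response
compact = json.dumps(record)           # a single line

def parseable_lines(text):
    """Count the lines of text that are complete JSON objects on their own."""
    count = 0
    for line in text.splitlines():
        try:
            if isinstance(json.loads(line), dict):
                count += 1
        except json.JSONDecodeError:
            # A fragment such as "{" is not valid JSON by itself.
            pass
    return count

print(len(pretty.splitlines()), parseable_lines(pretty))
print(len(compact.splitlines()), parseable_lines(compact))
```

No line of the pretty-printed text is a usable record, whereas the compact form yields exactly one record per line.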

Resolution

To fix this, let’s rewrite the file with a single JSON record per line.
To do this, we can make use of Python’s json module.
First, we use the response.json() method to obtain the API response as a dictionary object; then the json.dumps method converts this dict to a single-line JSON record.

Note: The json.loads method can also be used to convert a JSON string to a dictionary object.
It is as simple as executing the below statement:
json.loads(response.text)

import json
import requests

def get_word_details(word):
    language = "en-gb"
    headers = {"app_id": "c7f6d128",
               "app_key": "73ea2ed8109721300050137e74044fa6"}
    url = f"https://od-api.oxforddictionaries.com:443/api/v2/entries/{language}/{word.lower()}"
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        if response.status_code == 404:
            print("Word Definition not found for {0}".format(word))
            return None
        else:
            raise Exception("API Hit Failed - {0}".format(response.text))
    return response

def write_to_file(words):
    output_file = "/Users/ecom-ahmed.noufel/Documents/words_file_4.json"
    for word in words:
        response = get_word_details(word)
        if response:
            # Open the file in append mode
            with open(output_file, 'a') as f:
                f.write(json.dumps(response.json())) # Creating a JSON String from a dict object
                f.write("\n")  # Appending new line at the end of the JSON record
    return output_file

if __name__ == "__main__":
    words = ["cogent", "digress", "tangible", "diligent", "mellifluous", "obscure", "intelligible", "awa"]
    json_file = write_to_file(words)
    print("Output JSON Data Present in : {0}".format(json_file))
Word Definition not found for awa
Output JSON Data Present in : /Users/ecom-ahmed.noufel/Documents/words_file_4.json
new_json_file_df = spark.read.json("/Users/ecom-ahmed.noufel/Documents/words_file_4.json")
new_json_file_df.printSchema()
root
 |-- id: string (nullable = true)
 |-- metadata: struct (nullable = true)
 |    |-- operation: string (nullable = true)
 |    |-- provider: string (nullable = true)
 |    |-- schema: string (nullable = true)
 |-- results: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- language: string (nullable = true)
 |    |    |-- lexicalEntries: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- derivatives: array (nullable = true)
 |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |-- id: string (nullable = true)
 |    |    |    |    |    |    |-- text: string (nullable = true)
 |    |    |    |    |-- entries: array (nullable = true)
 |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |-- etymologies: array (nullable = true)
 |    |    |    |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |    |    |    |-- grammaticalFeatures: array (nullable = true)
 |    |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |    |-- id: string (nullable = true)
 |    |    |    |    |    |    |    |    |-- text: string (nullable = true)
 |    |    |    |    |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |    |    |    |-- notes: array (nullable = true)
 |    |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |    |-- text: string (nullable = true)
 |    |    |    |    |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |    |    |    |-- senses: array (nullable = true)
 |    |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |    |-- constructions: array (nullable = true)
 |    |    |    |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |    |    |    |-- text: string (nullable = true)
 |    |    |    |    |    |    |    |    |-- definitions: array (nullable = true)
 |    |    |    |    |    |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |    |    |    |    |    |-- examples: array (nullable = true)
 |    |    |    |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |    |    |    |-- text: string (nullable = true)
 |    |    |    |    |    |    |    |    |-- id: string (nullable = true)
 |    |    |    |    |    |    |    |    |-- shortDefinitions: array (nullable = true)
 |    |    |    |    |    |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |    |    |    |    |    |-- subsenses: array (nullable = true)
 |    |    |    |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |    |    |    |-- definitions: array (nullable = true)
 |    |    |    |    |    |    |    |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |    |    |    |    |    |    |    |-- domains: array (nullable = true)
 |    |    |    |    |    |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |    |    |    |    |    |-- id: string (nullable = true)
 |    |    |    |    |    |    |    |    |    |    |    |    |-- text: string (nullable = true)
 |    |    |    |    |    |    |    |    |    |    |-- examples: array (nullable = true)
 |    |    |    |    |    |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |    |    |    |    |    |-- text: string (nullable = true)
 |    |    |    |    |    |    |    |    |    |    |-- id: string (nullable = true)
 |    |    |    |    |    |    |    |    |    |    |-- shortDefinitions: array (nullable = true)
 |    |    |    |    |    |    |    |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |    |    |    |    |    |    |    |-- thesaurusLinks: array (nullable = true)
 |    |    |    |    |    |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |    |    |    |    |    |-- entry_id: string (nullable = true)
 |    |    |    |    |    |    |    |    |    |    |    |    |-- sense_id: string (nullable = true)
 |    |    |    |    |    |    |    |    |-- thesaurusLinks: array (nullable = true)
 |    |    |    |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |    |    |    |-- entry_id: string (nullable = true)
 |    |    |    |    |    |    |    |    |    |    |-- sense_id: string (nullable = true)
 |    |    |    |    |-- language: string (nullable = true)
 |    |    |    |    |-- lexicalCategory: struct (nullable = true)
 |    |    |    |    |    |-- id: string (nullable = true)
 |    |    |    |    |    |-- text: string (nullable = true)
 |    |    |    |    |-- pronunciations: array (nullable = true)
 |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |-- audioFile: string (nullable = true)
 |    |    |    |    |    |    |-- dialects: array (nullable = true)
 |    |    |    |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |    |    |    |-- phoneticNotation: string (nullable = true)
 |    |    |    |    |    |    |-- phoneticSpelling: string (nullable = true)
 |    |    |    |    |-- text: string (nullable = true)
 |    |    |-- type: string (nullable = true)
 |    |    |-- word: string (nullable = true)
 |-- word: string (nullable = true)
new_json_file_df.show()
+------------+--------------------+--------------------+------------+
|          id|            metadata|             results|        word|
+------------+--------------------+--------------------+------------+
|      cogent|[retrieve, Oxford...|[[cogent, en-gb, ...|      cogent|
|     digress|[retrieve, Oxford...|[[digress, en-gb,...|     digress|
|    tangible|[retrieve, Oxford...|[[tangible, en-gb...|    tangible|
|    diligent|[retrieve, Oxford...|[[diligent, en-gb...|    diligent|
| mellifluous|[retrieve, Oxford...|[[mellifluous, en...| mellifluous|
|     obscure|[retrieve, Oxford...|[[obscure, en-gb,...|     obscure|
|intelligible|[retrieve, Oxford...|[[intelligible, e...|intelligible|
+------------+--------------------+--------------------+------------+

Flattening the Data

Now that we have our data in a DataFrame, let’s derive a flat structure out of it.
For this illustration, let’s arrive at a structure with the below columns:
word, language, definition, example
All of these columns come from the results array. We will use Spark SQL to construct this flat structure.

new_json_file_df.createOrReplaceTempView('my_dictionary')

df5 = spark.sql("""select
  r.id as word
, r.language as language
, definitions as definition
, examples.text as example
from my_dictionary
lateral view explode(results) a as r
lateral view explode(r.lexicalEntries) b as rlE
lateral view explode(rlE.entries) c as rlEe
lateral view explode(rlEe.senses) d as rlEes
lateral view explode(rlEes.definitions) e as definitions
lateral view explode(rlEes.examples) f as examples
""")

df5.printSchema()
df5.show(20,False)
root
 |-- word: string (nullable = true)
 |-- language: string (nullable = true)
 |-- definition: string (nullable = true)
 |-- example: string (nullable = true)

+------------+--------+--------------------------------------------------------------------+-----------------------------------------------------------------+
|word        |language|definition                                                          |example                                                          |
+------------+--------+--------------------------------------------------------------------+-----------------------------------------------------------------+
|cogent      |en-gb   |(of an argument or case) clear, logical, and convincing             |they put forward cogent arguments for British membership         |
|cogent      |en-gb   |(of an argument or case) clear, logical, and convincing             |the newspaper's lawyers must prepare a cogent appeal             |
|digress     |en-gb   |leave the main subject temporarily in speech or writing             |I have digressed a little from my original plan                  |
|tangible    |en-gb   |perceptible by touch                                                |the atmosphere of neglect and abandonment was almost tangible    |
|tangible    |en-gb   |a thing that is perceptible by touch                                |these are the only tangibles upon which an assessment can be made|
|diligent    |en-gb   |having or showing care and conscientiousness in one's work or duties|after diligent searching, he found a parcel                      |
|mellifluous |en-gb   |(of a sound) pleasingly smooth and musical to hear                  |her low mellifluous voice                                        |
|obscure     |en-gb   |not discovered or known about; uncertain                            |his origins and parentage are obscure                            |
|obscure     |en-gb   |not clearly expressed or easily understood                          |obscure references to Proust                                     |
|obscure     |en-gb   |keep from being seen; conceal                                       |grey clouds obscure the sun                                      |
|intelligible|en-gb   |able to be understood; comprehensible                               |use vocabulary that is intelligible to your audience             |
|intelligible|en-gb   |able to be understood; comprehensible                               |a barely intelligible reply                                      |
+------------+--------+--------------------------------------------------------------------+-----------------------------------------------------------------+
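
To see what the chain of lateral view explode calls is doing, the same traversal can be sketched in plain Python. This is only an illustration and not how Spark executes the query: flatten_results is a made-up function name, and sample_doc is a hypothetical, heavily trimmed document in the same shape as the API response. Each nested loop plays the role of one explode. Note that a sense without examples yields no rows at all, which is also how explode behaves in the query above (lateral view outer would keep such rows with a null example).

```python
def flatten_results(doc):
    """Return (word, language, definition, example) tuples, one per definition/example pair."""
    rows = []
    for r in doc.get("results", []):
        for lexical_entry in r.get("lexicalEntries", []):
            for entry in lexical_entry.get("entries", []):
                for sense in entry.get("senses", []):
                    for definition in sense.get("definitions", []):
                        # A sense with no 'examples' key contributes no rows.
                        for example in sense.get("examples", []):
                            rows.append((r["id"], r["language"], definition, example["text"]))
    return rows


# Hypothetical, heavily trimmed document in the same shape as the API response.
sample_doc = {
    "results": [{
        "id": "placebo",
        "language": "en-gb",
        "lexicalEntries": [{
            "entries": [{
                "senses": [
                    # This sense has a definition but no examples.
                    {"definitions": ["a medicine prescribed for psychological benefit"]},
                    {"definitions": ["a measure designed merely to humour or placate someone"],
                     "examples": [{"text": "pacified by the placebos of the previous year"}]},
                ]
            }]
        }]
    }]
}

rows = flatten_results(sample_doc)
for row in rows:
    print(row)
```

Only the second sense appears in rows, because the first one has nothing to pair its definition with.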

Valuable tip to ignore unwanted fields

When you load data from an API, the response often contains fields you will never use, and loading them is wasteful.
Such fields can be dropped before the data ever reaches the Spark dataframe: Python’s json module lets us manipulate the response and load only the required JSON fields.

Let’s go through a simple example: from the API’s response, we will load only the results element.

This is as simple as selecting a key from a dictionary, as shown below.

response.json()['results']
[{'id': 'placebo',
  'language': 'en-gb',
  'lexicalEntries': [{'entries': [{'etymologies': ['late 18th century: from Latin, literally ‘I shall be acceptable or pleasing’, from placere‘to please’'],
      'senses': [{'definitions': ['a medicine or procedure prescribed for the psychological benefit to the patient rather than for any physiological effect.'],
        'domains': [{'id': 'medicine', 'text': 'Medicine'}],
        'id': 'm_en_gbus0786080.005',
        'shortDefinitions': ['medicine or procedure prescribed for psychological benefit'],
        'subsenses': [{'definitions': ['a substance that has no therapeutic effect, used as a control in testing new drugs.'],
          'domains': [{'id': 'medicine', 'text': 'Medicine'}],
          'id': 'm_en_gbus0786080.008',
          'shortDefinitions': ['substance that has no therapeutic effect, used as control in testing new drugs']},
         {'definitions': ['a measure designed merely to humour or placate someone'],
          'examples': [{'text': 'pacified by the placebos of the previous year, they claimed a moral victory'}],
          'id': 'm_en_gbus0786080.009',
          'shortDefinitions': ['measure designed merely to humour or placate someone']}],
        'thesaurusLinks': [{'entry_id': 'medicine',
          'sense_id': 't_en_gb0009310.001'}]}]}],
    'language': 'en-gb',
    'lexicalCategory': {'id': 'noun', 'text': 'Noun'},
    'pronunciations': [{'audioFile': 'http://audio.oxforddictionaries.com/en/mp3/placebo_gb_1.mp3',
      'dialects': ['British English'],
      'phoneticNotation': 'IPA',
      'phoneticSpelling': 'pləˈsiːbəʊ'}],
    'text': 'placebo'}],
  'type': 'headword',
  'word': 'placebo'}]

To load this into a Spark Dataframe, follow the steps below:

json_data = json.dumps(response.json()['results'])
results_rdd = sc.parallelize([json_data])
results_df = spark.read.json(results_rdd)
results_df.show()
+-------+--------+--------------------+--------+-------+
|     id|language|      lexicalEntries|    type|   word|
+-------+--------+--------------------+--------+-------+
|placebo|   en-gb|[[[[[late 18th ce...|headword|placebo|
+-------+--------+--------------------+--------+-------+

The above method is very handy when you want to pick some child element from the JSON.
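
If several fields need to be kept rather than a single child element, the same idea extends to a small helper. The sketch below is an illustration under assumptions: keep_fields is a hypothetical helper name, and the results list is a hand-written stand-in for response.json()['results']. It keeps only the chosen top-level keys of each result and serialises the list with json.dumps, giving a string that could be passed to sc.parallelize in the same way as in the steps above.

```python
import json


def keep_fields(records, wanted):
    """Return copies of the records that contain only the keys listed in `wanted`."""
    return [{key: rec[key] for key in wanted if key in rec} for rec in records]


# Hand-written stand-in for response.json()['results'].
results = [{
    "id": "placebo",
    "language": "en-gb",
    "lexicalEntries": [{"text": "placebo"}],
    "type": "headword",
    "word": "placebo",
}]

trimmed = keep_fields(results, ["id", "language", "lexicalEntries"])
json_data = json.dumps(trimmed)  # JSON string containing only the wanted fields
print(json_data)
```

Because the unwanted keys are removed before serialisation, they never become columns in the resulting dataframe.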

Below is another example that uses Python’s dictionary and list indexing to pick the etymologies element from the JSON.

First, the json.loads method converts the JSON string into a Python dict object. After that, nested elements can be referenced by key and list index.

json.loads(response_list[0])['results'][0]['lexicalEntries'][0]['entries'][0]['etymologies']
['mid 17th century: from Latin cogent-‘compelling’, from the verb cogere, from co-‘together’ + agere‘drive’']
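
Chained indexing like this raises a KeyError or IndexError as soon as one level is missing, and optional keys do occur (in the placebo output earlier, one subsense has no examples key). A defensive variant could look like the sketch below. Here dig is a hypothetical helper written for illustration, not part of the json module, and raw is a trimmed stand-in for one element of response_list.

```python
import json


def dig(obj, *path, default=None):
    """Walk dict keys and list indexes in `path`; return `default` if any step is missing."""
    for step in path:
        try:
            obj = obj[step]
        except (KeyError, IndexError, TypeError):
            return default
    return obj


# Trimmed stand-in for one element of response_list (a JSON string).
raw = '{"results": [{"id": "cogent", "lexicalEntries": [{"entries": [{"etymologies": ["mid 17th century: from Latin"]}]}]}]}'
doc = json.loads(raw)

etymologies = dig(doc, "results", 0, "lexicalEntries", 0, "entries", 0, "etymologies", default=[])
missing = dig(doc, "results", 0, "lexicalEntries", 0, "entries", 0, "senses", default=[])
print(etymologies)
print(missing)
```

The second call falls back to the default because the trimmed entry has no senses key, instead of stopping the whole load with an exception.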

Conclusion

By now, I hope you are comfortable pulling data from REST APIs and loading it into Spark. I recommend connecting to other, similar APIs to gain more experience.
Comments and feedback are welcome. Cheers!
