Blogs

Hive Analytic Functions

In a world where data is everything, transforming raw data into meaningful insights requires certain techniques and functions. This post focuses on the commonly used SQL analytic functions in Hive and Spark. Analytic functions offer many features, such as computing aggregates like moving sums, cumulative sums, and averages. They are also used to rank data, eliminate duplicates, and more.
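As a rough illustration of the cumulative sum and ranking mentioned above, here is a minimal sketch. It runs the window functions through Python's built-in sqlite3 module rather than Hive or Spark (this assumes a bundled SQLite of version 3.25 or later), and the sales table, its columns, and its rows are made up for the example.

```python
import sqlite3

# Hypothetical sales data, used only for illustration.
rows = [
    ("east", 1, 10),
    ("east", 2, 20),
    ("east", 3, 30),
    ("west", 1, 5),
    ("west", 2, 15),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, day INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

# SUM(...) OVER gives a cumulative sum per region, ordered by day.
# RANK() OVER ranks the rows of each region by amount, largest first.
query = """
SELECT region,
       day,
       amount,
       SUM(amount) OVER (PARTITION BY region ORDER BY day) AS running_total,
       RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS amount_rank
FROM sales
ORDER BY region, day
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

The OVER (PARTITION BY ... ORDER BY ...) clause is standard SQL, so the same shape of query should carry over to Hive and Spark SQL, though that is not verified here.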

REST API to Spark Dataframe

With the growing number of users in the digital world, a lot of raw data is being generated from which insights can be derived. This is where REST APIs come into the picture: they help bridge the communication gap between the client (your software program) and the server (the website's data). REST APIs act as a gateway for two-way communication between two software applications.

Multiprocessing in Python

With an increasing number of power-hungry applications, the demand for speed and low latency has become a challenge in certain situations. However, the availability of machines with multiple processors, or processors with multiple cores, helps us combat such situations. This post guides you through using multiprocessing in Python. Today, many CPUs are manufactured with multiple cores to boost performance by enabling parallelism and concurrency in applications.

Multithreading in Python

Often we build applications that require several tasks to run simultaneously within the same application. This is where the concept of multithreading comes into play. This post provides a comprehensive explanation of using the multithreading (threading) module in Python. Multithreading, a.k.a. threading, in Python is a concept by which multiple threads are launched in the same process to achieve concurrency and multitasking within the same application. Executing different threads is equivalent to executing different programs or different functions within the same process.
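To make the idea of launching several threads in one process concrete, here is a minimal sketch using the standard threading module. The function names fetch and run_tasks and the task names are hypothetical, and time.sleep stands in for an I/O-bound operation such as a network call.

```python
import threading
import time


def fetch(name, delay, results, lock):
    """Simulate an I/O-bound task by sleeping, then record a result."""
    time.sleep(delay)
    # The lock guards the shared dictionary against concurrent writes.
    with lock:
        results[name] = f"{name} finished"


def run_tasks(tasks):
    """Start one thread per (name, delay) task and wait for all of them."""
    results = {}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=fetch, args=(name, delay, results, lock))
        for name, delay in tasks
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()  # Block until this thread has completed.
    return results


if __name__ == "__main__":
    print(run_tasks([("task-a", 0.2), ("task-b", 0.1)]))
```

Because the threads overlap while they sleep, the total run time is close to the longest delay, not the sum of the delays. This benefit applies mainly to I/O-bound work.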

Semi-Structured Data in Spark (pyspark) - JSON

In this post we discuss how to read semi-structured data from different data sources and store it as a Spark dataframe. The Spark dataframe can in turn be used to perform aggregations and all sorts of data manipulations. Previously we saw how to create and work with Spark dataframes. In this post we discuss how to read semi-structured data from different data sources, store it as a Spark dataframe, and perform further data manipulations.

Pyspark DataFrame Operations - Basics

In this post, we discuss how to perform different dataframe operations such as aggregations, ordering, joins, and other similar data manipulations on a Spark dataframe. Spark provides the DataFrame API, a very powerful API that enables the user to perform parallel and distributed structured data processing on the input data. A Spark dataframe is a dataset with a named set of columns.

Spark Repartition & Coalesce - Explained

All data processed by Spark is stored in partitions. Today we discuss what partitions are, how partitioning works in Spark (PySpark), why it matters, and how the user can manually control partitions using repartition and coalesce for effective distributed computing. Spark is a framework that provides parallel and distributed computing on big data. To perform its parallel processing, Spark splits the data into smaller chunks (i.e. partitions) and distributes them across the nodes in the cluster so that the data can be processed in parallel.
