Memory error in Spark

Spark is a general purpose cluster computing framework used to process large datasets. It runs on Standalone, Mesos, and YARN clusters.

You often encounter memory issue like something below.

Cause:

Spark uses disk space on worker nodes to store shuffle data/files. So if the disk space is not enough to hold the data you’re processing, it will throw error.

Solution:

I used to get this error on AWS EMR while running spark jobs. By increasing the EBS volume size on core nodes using the below AWS CLI, I was able to fix the memory issue. Another solution is to increase the cluster size so the shuffle fils sizes would be smaller.

aws emr create-cluster --release-label emr-5.9.0  --use-default-roles --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=d2.xlarge 'InstanceGroupType=CORE,InstanceCount=2,InstanceType=d2.xlarge,EbsConfiguration={EbsOptimized=true,EbsBlockDeviceConfigs=[{VolumeSpecification={VolumeType=gp2,SizeInGB=100}},{VolumeSpecification={VolumeType=io1,SizeInGB=100,Iops=100},VolumesPerInstance=4}]}' --auto-terminate

Reference:

https://docs.aws.amazon.com/cli/latest/reference/emr/create-cluster.html

Advertisements

S3 and Data Transfer issue from Spark

Spark is a general purpose distributed high performance computation engine that has APIs in many major languages like Java, Scala, Python.

S3 is Amazon Simple Storage Service for storing objects in a highly durable and reliable manner at very low cost.

Spark is used in combination with S3 for reading input and saving output data.

Spark can apply many transformations on input data, and finally store the data in some bucket on S3.

While processing large input data and storing output data on S3, I found that it is very fast in processing the data, but it’s very slow in writing the output data to S3.

Instead, I found that it’s very fast storing the data first on local HDFS (on Hadoop cluster), and then copy the data back to S3 from HDFS using s3-dist-cp (Amazon version of Hadoop’s distcp).

In one of the cases, to process data of 1TB, it took about 1.5 hrs to process, but about 4 hours to copy the output data to S3.
But with the above solution, it just took less than 5 min to copy the data to S3, saving lot of time and money.

s3-dist-cp command can be run from master node using the format below.

s3-dist-cp --src /input/data --dest s3://my-bucket/output

OR

hadoop jar s3-dist-cp.jar --src /input/data --dest s3://my-bucket/output