Memory error in Spark

Spark is a general-purpose cluster computing framework used to process large datasets. It runs on Standalone, Mesos, and YARN clusters.

When running Spark jobs, you will often encounter a memory or disk space error like the one described below.

Cause:

Spark uses disk space on the worker nodes to store shuffle data/files, so if there is not enough disk space to hold the data you are processing, it will throw an error.
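
A related mitigation is to tune the shuffle settings when submitting the job so that each shuffle file is smaller. The spark-submit sketch below is only an illustration: the class name, JAR, and partition count are placeholders, and the right values depend on your data size.

# Illustration only: com.example.MyJob and my-job.jar are placeholders.
# Raising spark.sql.shuffle.partitions spreads the shuffle output over more,
# smaller files; spark.shuffle.compress (true by default) keeps them compressed.
spark-submit \
  --class com.example.MyJob \
  --conf spark.sql.shuffle.partitions=2000 \
  --conf spark.shuffle.compress=true \
  my-job.jar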

Solution:

I used to get this error on AWS EMR while running Spark jobs. Increasing the EBS volume size on the core nodes with the AWS CLI command below fixed the issue for me. Another solution is to increase the cluster size so that the shuffle files on each node are smaller (see the scaling example after the command).

aws emr create-cluster --release-label emr-5.9.0 \
  --use-default-roles \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=d2.xlarge \
    'InstanceGroupType=CORE,InstanceCount=2,InstanceType=d2.xlarge,EbsConfiguration={EbsOptimized=true,EbsBlockDeviceConfigs=[{VolumeSpecification={VolumeType=gp2,SizeInGB=100}},{VolumeSpecification={VolumeType=io1,SizeInGB=100,Iops=100},VolumesPerInstance=4}]}' \
  --auto-terminate
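
If the cluster is already running, you can also scale out the core instance group instead of recreating the cluster. The command below is only a sketch; the cluster ID and instance group ID are placeholders you would look up with aws emr describe-cluster.

# Sketch only: j-XXXXXXXXXXX and ig-XXXXXXXXXXX are placeholder IDs.
aws emr modify-instance-groups --cluster-id j-XXXXXXXXXXX \
  --instance-groups InstanceGroupId=ig-XXXXXXXXXXX,InstanceCount=4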

Reference:

https://docs.aws.amazon.com/cli/latest/reference/emr/create-cluster.html
