Hive Databases on AWS EMR

If you’re using AWS (Amazon Web Services) EMR (Elastic MapReduce) which is AWS distribution of Hadoop, it is a common practice to spin up a Hadoop cluster when needed and shut it down after finishing up using it.

Suppose you are using a MySQL meta store and create a database on Hive, we usually do the following:

CREATE DATABASE mydb;

This creates a folder at the location /user/hive/warehouse/mydb.db on HDFS, and that information is stored in meta store.

Later when the cluster is shutdown, you spin up another cluster, and try to access the same database, say, in a Hive insert statement like below, you get an error.

INSERT OVERWRITE TABLE mydb.mytable SELECT col1, col2 FROM another_table

Loading data to table mydb.mytable
Failed with exception Unable to alter table.
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask

Why?

This is because the old cluster is gone and the database is still referring to old NameNode.

Solution is to externalize the location of the database.

CREATE DATABASE IF NOT EXISTS mydb
LOCATION 's3://mybucket/mydb.db';

Advertisements