Hive Date Functions

Hive provides a SQL-like interface to data stored on HDFS. Some of the rare date functions are given below (Tested on version 2.1.0).

  • Get last date of month
$ hive -e 'select last_day("2017-05-01")'
OK
2017-05-31
Time taken: 2.15 seconds, Fetched: 1 row(s)
  • Get last day of month (Monday=1, Sunday=7)
$ hive -e 'select date_format(last_day("2017-05-01"), "u")'
OK
3
Time taken: 2.46 seconds, Fetched: 1 row(s)
  • Add days to a date
$ hive -e 'select date_add(last_day("2017-04-03"), 1)'
OK
2017-05-01
Time taken: 2.425 seconds, Fetched: 1 row(s)
  • Get first day of last month (Monday=1, Sunday=7)
$ hive -e 'select date_format(date_add(last_day("2017-04-03"), 1), "u")'
OK
1
Time taken: 2.301 seconds, Fetched: 1 row(s)

P.S. Find Hive version using command: hive –version

$ hive --version
Hive 2.1.0-amzn-0
Advertisements

Remove Special Characters from Data using Hive

I have data that has some special characters like carriage returns (\r), newlines (\n), and some non-printable control characters (like ^M). I need to process this data using Hive for some transformations and later load this data to a database like Oracle using Sqoop or Amazon Redshift using PSQL COPY for analysis.

The Sqoop or PSQL COPY fail to load data because of these invalid characters.
I was trying to find how to strip off these characters at Hive before loading to external database.
Hive’s regexp_extract function comes to rescue, but finding the right REGEX pattern was a challenge.
Finally I figured it out after searching on online forums. Good to know is that Hive supports Java REGEX patterns.
My final Hive query is given below.


SELECT regexp_extract(col1, '\\p{Print}*',0)
FROM table1;

HiveServer2 JDBC Client on AWS

Connecting to Hive database can be done using Hive CLI or beeline from a command prompt and programmatically using a JDBC client. The following section describes setting up a JDBC client for HiveServer2 on Eclipse.

Download the required Hive 0.13.1 JDBC jar files from Amazon site in the section titled ‘To download JDBC drivers’. Extract ZIP file and you can see all jar files in the extracted-to folder.

Using Eclipse IDE, create a Java project and configure the Build Path to add external JARs downloaded in the above step.

Use the following program to connect to database on Hive and run the program. It connects to Hive database on Hadoop NameNode and displays the results on the Eclipse console.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class JdbcClient {
	public static void main(String[] args) {

		try {
			Class.forName("com.amazon.hive.jdbc41.HS2Driver");
		} catch (ClassNotFoundException e) {
			e.printStackTrace();
		}

		Connection conn = null;
		ResultSet rs = null;
		try {
			conn = DriverManager.getConnection("jdbc:hive2://<namenode_ipaddress>:10000/default", "hadoop", "");
			Statement stmt = conn.createStatement();
			String addJarQuery = "ADD JAR /path/on/namenode/to/json-serde-1.3-jar-with-dependencies.jar";
			String queryStr = "select * from my_table limit 10";

			stmt.execute(addJarQuery);
			rs = stmt.executeQuery(queryStr);

			while (rs.next()) {
				System.out.println(rs.getString(1));
			}
		} catch (SQLException e) {
			e.printStackTrace();
		} finally {
			try {
				if (conn != null)
					conn.close();

			} catch (SQLException e) {
				e.printStackTrace();
			}
		}

		System.out.println("End of the program");
	}
}

P.S: In the above code, if ‘ADD JAR’ Hive command is not there, you get the following error.

java.lang.ClassNotFoundException Class org.openx.data.jsonserde.JsonSerDe

Drop Hive Partitions with Special Characters

I have a partition for a Hive table created wrongly. See the query below. Hive uses URL-encoded values for special characters.

$ hive -e 'show partitions mytable'
date=$%7Bhiveconf%3AINPUT_DATE}

How to drop it?

It will not work if you use the same value displayed above to drop it, even if Hive says OK.

hive> alter table mytable drop partition (date='$%7Bhiveconf%3AINPUT_DATE}');
OK
Time taken: 0.804 seconds

You know the actual partition when you created. If don’t remember check here to know what is the equivalent value for each encoded character value, and use the actual value to drop it.

hive> alter table mytable drop partition (date='${hiveconf:INPUT_DATE}');

 

How to stop Tez jobs

Apache Tez is a fast data processing engine that can be used as an alternative to slow and old MapReduce. Tez runs on YARN.

MapReduce applications (jobs) can be listed using command: mapred job -list

MapReduce jobs can be killed using: mapred job -kill <JobId>

$ mapred job -list
16/03/16 17:22:41 INFO client.RMProxy: Connecting to ResourceManager at ip-10-10-10-10.ec2.internal/10.10.10.10:8032
Total jobs:2
 JobId State StartTime UserName Queue Priority UsedContainers RsvdContainers UsedMem RsvdMem NeededMem AM info
job_1457636583888_30685 RUNNING 1458148947326 hadoop default NORMAL 1 0 3392M 0M 3392M http://ip-10-10-10-10.ec2.internal:20888/proxy/application_1457636583888_30685/
job_1457636583888_30686 PREP 1458148961271 hadoop default NORMAL 1 0 2880M 0M 2880M http://ip-10-10-10-10.ec2.internal:20888/proxy/application_1457636583888_30686/

 

Tez jobs can be listed using: yarn application -list

Tez jobs can be killed using: yarn application -kill <Application-Id>

$ yarn application -list
16/03/16 17:24:25 INFO client.RMProxy: Connecting to ResourceManager at ip-10-10-10-10.ec2.internal/10.10.10.10:8032
Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):2
 Application-Id Application-Name Application-Type User Queue State Final-State Progress Tracking-URL
application_1457636583888_30638 HIVE-64481d5d-eb4f-4a13-a3d8-ab2584547368 TEZ hadoop default RUNNING UNDEFINED 99.85% http://ip-10-10-10-10.ec2.internal:52648/ui/
application_1457636583888_30690 oozie:launcher:T=hive:W=some_job:A=some_name:ID=0005123-160310193446393-oozie-oozi-W MAPREDUCE hadoop default RUNNING UNDEFINED 5% http://ip-10-10-10-10.ec2.internal:32777

 

 

Hive Databases on AWS EMR

If you’re using AWS (Amazon Web Services) EMR (Elastic MapReduce) which is AWS distribution of Hadoop, it is a common practice to spin up a Hadoop cluster when needed and shut it down after finishing up using it.

Suppose you are using a MySQL meta store and create a database on Hive, we usually do the following:

CREATE DATABASE mydb;

This creates a folder at the location /user/hive/warehouse/mydb.db on HDFS, and that information is stored in meta store.

Later when the cluster is shutdown, you spin up another cluster, and try to access the same database, say, in a Hive insert statement like below, you get an error.

INSERT OVERWRITE TABLE mydb.mytable SELECT col1, col2 FROM another_table

Loading data to table mydb.mytable
Failed with exception Unable to alter table.
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask

Why?

This is because the old cluster is gone and the database is still referring to old NameNode.

Solution is to externalize the location of the database.

CREATE DATABASE IF NOT EXISTS mydb
LOCATION 's3://mybucket/mydb.db';