HiveServer2 JDBC Client on AWS

Connecting to a Hive database can be done using the Hive CLI or beeline from a command prompt, or programmatically using a JDBC client. The following section describes setting up a JDBC client for HiveServer2 in Eclipse.
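
For a quick connectivity check from a command prompt, beeline can be used; a minimal sketch, assuming HiveServer2 listens on the default port 10000 on the NameNode:

$ beeline -u jdbc:hive2://<namenode_ipaddress>:10000/default -n hadoop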

Download the required Hive 0.13.1 JDBC JAR files from the Amazon documentation, in the section titled 'To download JDBC drivers'. Extract the ZIP file; all the JAR files are in the extracted folder.

Using the Eclipse IDE, create a Java project and configure the Build Path to add the external JARs downloaded in the above step.

Use the following program to connect to the Hive database and run a query. It connects to HiveServer2 on the Hadoop NameNode (the EMR master node) and displays the results in the Eclipse console.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class JdbcClient {
	public static void main(String[] args) {

		// Load the Amazon Hive JDBC (HiveServer2) driver class
		try {
			Class.forName("com.amazon.hive.jdbc41.HS2Driver");
		} catch (ClassNotFoundException e) {
			e.printStackTrace();
			return;
		}

		Connection conn = null;
		ResultSet rs = null;
		try {
			// Replace <namenode_ipaddress> with the HiveServer2 host; 10000 is the default port
			conn = DriverManager.getConnection("jdbc:hive2://<namenode_ipaddress>:10000/default", "hadoop", "");
			Statement stmt = conn.createStatement();
			// The SerDe JAR must exist at this path on the cluster node running HiveServer2
			String addJarQuery = "ADD JAR /path/on/namenode/to/json-serde-1.3-jar-with-dependencies.jar";
			String queryStr = "select * from my_table limit 10";

			stmt.execute(addJarQuery);
			rs = stmt.executeQuery(queryStr);

			while (rs.next()) {
				System.out.println(rs.getString(1));
			}
		} catch (SQLException e) {
			e.printStackTrace();
		} finally {
			try {
				if (conn != null)
					conn.close();

			} catch (SQLException e) {
				e.printStackTrace();
			}
		}

		System.out.println("End of the program");
	}
}
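
If you want to run the program outside Eclipse, a minimal sketch, assuming the driver JARs were extracted to a hive-jdbc-drivers directory and the class was compiled into bin:

$ java -cp "bin:hive-jdbc-drivers/*" JdbcClient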

P.S.: In the above code, if the 'ADD JAR' Hive command is omitted, you get the following error because the table uses the JSON SerDe.

java.lang.ClassNotFoundException Class org.openx.data.jsonserde.JsonSerDe

Drop Hive Partitions with Special Characters

I have a partition on a Hive table that was created wrongly. See the output below; Hive displays URL-encoded values for special characters.

$ hive -e 'show partitions mytable'
date=$%7Bhiveconf%3AINPUT_DATE}

How to drop it?

Using the same encoded value displayed above to drop it will not work, even though Hive says OK.

hive> alter table mytable drop partition (date='$%7Bhiveconf%3AINPUT_DATE}');
OK
Time taken: 0.804 seconds

You know the actual partition value you used when you created it. If you don't remember, look up the equivalent character for each URL-encoded value (here %7B is '{' and %3A is ':') and use the decoded value to drop it.

hive> alter table mytable drop partition (date='${hiveconf:INPUT_DATE}');
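
To confirm the bad partition is gone, list the partitions again:

$ hive -e 'show partitions mytable'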


How to stop Tez jobs

Apache Tez is a fast data-processing engine that can be used as an alternative to the older, slower MapReduce. Tez runs on YARN.

MapReduce applications (jobs) can be listed using the command: mapred job -list

MapReduce jobs can be killed using: mapred job -kill <JobId>

$ mapred job -list
16/03/16 17:22:41 INFO client.RMProxy: Connecting to ResourceManager at ip-10-10-10-10.ec2.internal/10.10.10.10:8032
Total jobs:2
 JobId State StartTime UserName Queue Priority UsedContainers RsvdContainers UsedMem RsvdMem NeededMem AM info
job_1457636583888_30685 RUNNING 1458148947326 hadoop default NORMAL 1 0 3392M 0M 3392M http://ip-10-10-10-10.ec2.internal:20888/proxy/application_1457636583888_30685/
job_1457636583888_30686 PREP 1458148961271 hadoop default NORMAL 1 0 2880M 0M 2880M http://ip-10-10-10-10.ec2.internal:20888/proxy/application_1457636583888_30686/
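
For example, to kill the first job listed above:

$ mapred job -kill job_1457636583888_30685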


Tez jobs can be listed using: yarn application -list

Tez jobs can be killed using: yarn application -kill <Application-Id>

$ yarn application -list
16/03/16 17:24:25 INFO client.RMProxy: Connecting to ResourceManager at ip-10-10-10-10.ec2.internal/10.10.10.10:8032
Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):2
 Application-Id Application-Name Application-Type User Queue State Final-State Progress Tracking-URL
application_1457636583888_30638 HIVE-64481d5d-eb4f-4a13-a3d8-ab2584547368 TEZ hadoop default RUNNING UNDEFINED 99.85% http://ip-10-10-10-10.ec2.internal:52648/ui/
application_1457636583888_30690 oozie:launcher:T=hive:W=some_job:A=some_name:ID=0005123-160310193446393-oozie-oozi-W MAPREDUCE hadoop default RUNNING UNDEFINED 5% http://ip-10-10-10-10.ec2.internal:32777
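
For example, to kill the Tez application listed above:

$ yarn application -kill application_1457636583888_30638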


What are Training and a Model in Machine Learning

Machine Learning is about predicting future behavior based on past data. The prediction is done by machines (computers), hence the name Machine Learning. These days it is also often referred to as Data Science.

What is a model?

A model is an algorithm designed to draw some conclusions based on past data.

Example

Based on historical payment data, the same behavior can be predicted for a new person.

A person who pays monthly rent, credit card, and mortgage payments on time => another person who pays rent and credit card payments on time can likely pay mortgage payments on time too.

What is training a model?

Training a model is designing (computing) an algorithm based on some training data (sample data used to train).


P.S.: There is a lot to learn in Machine Learning. These two terms always confused me, and it took me a long time to understand them.


Sqoop Export a Long Text Field

I have a text field on HDFS that can hold a very long value, more than 4000 characters in length.

I had to export this to Oracle using Sqoop. In the Oracle table, this field is defined with the data type VARCHAR2(4000 BYTE).

I get an error when Sqoop’ing.

Caused by: java.sql.SQLException: ORA-01461: can bind a LONG value only for insert into a LONG column

Why?

Oracle treats a value longer than the defined limit for that field as a LONG. Hence the error, though the message is not very informative.

Solution:

Use the CLOB data type for that field in Oracle. A CLOB can store values longer than 4000 characters.

Don't forget to add the --map-column-java option to the Sqoop export command. Since there is no CLOB type in Java/Hive, Sqoop has to be told to treat this value as a Java String. The full command is shown below.

sqoop export --connect jdbc:oracle:thin:@hostname:1521/sid \
   --username user --password passwd \ 
   --table Oracle_Table_Name \
   --columns col1,col2,col3 \
   --map-column-java col3=String \
   --export-dir 'location' \
   --input-fields-terminated-by '\001' --input-lines-terminated-by "\n" \
   --input-null-string '\\N' --input-null-non-string '\\N'

Hive UDF on AWS EMR

A Hive UDF is a User Defined Function that can be applied to any input field of a Hive table. It is generally used to implement custom logic that the built-in functions do not cover.

Here I am trying to strip the newline character (\n) from the values of a column.

High-level steps using the Eclipse IDE (Mars, version 4.5)

  • Write a Java class and add a method called evaluate()
package com.mycompany;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class StripNewline extends UDF {
    private Text result = new Text();

    // Strips newline characters from the column value
    public Text evaluate(Text s) {
        try {
            String rep = s.toString().replaceAll("\\n", "");
            result.set(rep);
        } catch (Exception e) {
            result.set("Exception");
        }
        return result;
    }
}
  • Add dependent external JARs
    • To compile the class, add the required JARs:
    • Download hadoop-core-1.1.2.jar from the Maven repository
    • Get hive-exec-0.13.1-amzn-1.jar from /home/hadoop/hive/lib on the EMR EC2 master node
    • Add these JARs in the Eclipse IDE using: Right click Project -> Configure Build Path -> Libraries -> Add External JARs (navigate to where you downloaded them)
  • Compile and export as a JAR
    • Right click project -> Export -> Java JAR -> choose a file name for the JAR
  • Copy the JAR to the EMR EC2 master node
    • Using the 'scp' command from a Linux terminal, or FTP it to the master node (a sketch is shown after these steps)
  • Add the JAR in the Hive shell and use the function in a Hive query
    • At the Hive CLI prompt:
hive> ADD JAR /home/hadoop/myjar.jar;
hive> create temporary function repl as 'com.mycompany.StripNewline';
hive> select repl(colName) from tableName;
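
For the copy step above, a minimal scp sketch (the key file and master node address are assumptions):

$ scp -i mykey.pem myjar.jar hadoop@<master-node-dns>:/home/hadoop/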

Possible Errors

While creating the temporary function at the Hive CLI prompt, you may get the following error.

hive> create temporary function repl as 'com.mycompany.StripNewline';
java.lang.UnsupportedClassVersionError: com/mycompany/StripNewline : Unsupported major.minor version 52.0

Why?

The class was compiled with Java 8 (class file major version 52), but the JVM running Hive on the cluster is older and cannot load it. To fix it, compile the code at a lower compliance level. In Eclipse: right click the Project -> Properties -> Java Compiler -> Compiler compliance level -> pick 1.6 from the dropdown.
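
If you compile outside Eclipse, a rough equivalent is below (the JAR locations are assumptions; javac will warn about a missing 1.6 bootclasspath, which is fine for this purpose):

$ javac -source 1.6 -target 1.6 \
    -cp hive-exec-0.13.1-amzn-1.jar:hadoop-core-1.1.2.jar \
    com/mycompany/StripNewline.java
$ jar cf myjar.jar com/mycompany/*.class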

P.S. There is already a Hive built-in function to replace a character or string:

regexp_replace(colName, '\n', '')
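
For example, applied to the same column and table used above:

hive> select regexp_replace(colName, '\n', '') from tableName;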

Issues with Sqoop Export to Oracle

Sqoop is a tool used to move bulk data between an RDBMS such as MySQL, Oracle, or Postgres and HDFS (or AWS S3).

A couple of issues I faced with Sqoop export are summarized below.

Timestamp Format Issue

The HDFS/S3 records have the date and time in the format '2015-03-03T08:28:47.484Z'. This value goes to an Oracle column transaction_datetime with data type TIMESTAMP(9).

Got an error from Sqoop.

Caused by: java.lang.RuntimeException: Can't parse input data: '2015-03-03T08:28:47.484Z'
Caused by: java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]

I could not find a direct solution in Sqoop, so what I did was insert the data into the Hive table with the date and time converted to the format Sqoop and Oracle expect.

INSERT OVERWRITE TABLE hiveTable
SELECT
 id,
 from_unixtime(unix_timestamp(regexp_replace(transaction_datetime,'[TZ]',' ')), 'yyyy-MM-dd HH:mm:ss'),
 .
 .
FROM sourceHiveTable;

Now the timestamp stored in the HDFS files is '2015-03-03 08:28:47', which works for the export to Oracle.

Null Columns Issue

Hive stores NULL values on HDFS as '\N'. These would be exported to Oracle as the literal string '\N', which does not look good in Oracle.

Adding the null-handling options to the Sqoop command helps it recognize NULLs correctly. The full command is shown below.

sqoop export --connect jdbc:oracle:thin:@dbhost:1521/ORCL --username user --password password \
 --table Table_In_Oracle \
 --columns col1,col2 \
 --export-dir 'src-directory' \
 --input-fields-terminated-by ',' --input-lines-terminated-by "\n" \
 --input-null-string '\\N' --input-null-non-string '\\N'

After running that command, I now see a value of '(null)' for that column in Oracle.

That solves the problem partially, but I'm hoping to see an empty value ('') in Oracle. Trying to find a solution…

Kill a Running Sqoop Job

Sometimes I start a Sqoop export job but want to kill it midway for some reason.

I put the Sqoop export command in a shell script and run it using the nohup command.

nohup sqoop_export.sh > /tmp/job.out 2>&1 &

To kill it, I get the process ID using 'ps -ef | grep sqoop_export' and kill it with 'kill <processid>'.

The job gets killed, but here is the problem: if you try to truncate the table in Oracle later, you get an error like the one below. I tried to find which Oracle session was holding the lock, but could not identify the process that locked the table, and I had to drop the table to quickly solve the problem.

ERROR at line 1:
ORA-00054: resource busy and acquire with NOWAIT specified or timeout expired

A better way to kill the underlying Hadoop job is with the 'hadoop job' command.

$ hadoop job -list
Total jobs:1
JobId     State     StartTime     UserName       Queue   Priority UsedContainers RsvdContainers UsedMem RsvdMem NeededMem   AM info
job_1452010418661_0082   RUNNING 1452198342153       userid     default     NORMAL               9               0   15360M       0M     15360M ...

$ hadoop job -kill job_1452010418661_0082
Killed job job_1452010418661_0082