Remove Special Characters from Data using Hive

I have data that has some special characters like carriage returns (\r), newlines (\n), and some non-printable control characters (like ^M). I need to process this data using Hive for some transformations and later load this data to a database like Oracle using Sqoop or Amazon Redshift using PSQL COPY for analysis.

The Sqoop or PSQL COPY fail to load data because of these invalid characters.
I was trying to find how to strip off these characters at Hive before loading to external database.
Hive’s regexp_extract function comes to rescue, but finding the right REGEX pattern was a challenge.
Finally I figured it out after searching on online forums. Good to know is that Hive supports Java REGEX patterns.
My final Hive query is given below.


SELECT regexp_extract(col1, '\\p{Print}*',0)
FROM table1;

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

w

Connecting to %s