Upgrading EMR 3 to 4 and Hive from 0.13 to 1.0.0

We have been using EMR 3.9.x and Hive 0.13 for a while and have been pretty satisfied with them, even with some known bugs. Hive 0.13 is fine, but EMR 3 comes with an old version of Hadoop that has an annoying concurrent-read bug: when multiple threads run hadoop fs -get on the same HDFS file, it throws an error. We can live with that by disabling the Hive fetch task and running everything as MapReduce.

Recently, however, we hit an issue: when the schema in the metastore is updated (even with a backward-compatible change such as adding a column at the end), Hive queries fail with an exception:

Caused by: java.sql.SQLException: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask

The thing is, teams on newer versions of Hive do not have this problem. This is very annoying since we use a shared metastore and schema changes are not fully controlled by our team. So we decided to upgrade our Hive/EMR version. This way we also get all the security updates that come with the new EMR.

EMR 4/5 differences from 3.x

Path change

The first thing I hit is that the Hive paths have all changed.

In EMR 3.x, the Hive configuration and libraries are under /home/hadoop/hive/xxx.

In EMR 4/5, the configuration now lives in /etc/hive/conf, which is where hive-site.xml should go. The library jars are under /usr/lib/hive/lib, which is needed when you implement custom Hive behavior such as authentication.
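
As a rough sketch of what that means in practice, the function below copies a custom hive-site.xml and one extra jar into the EMR 4/5 locations. The function name, the SUDO switch and the file names in the usage line are made up for illustration; only the two target directories come from the paths above.

```shell
# Target directories on EMR 4/5, as described above. They can be overridden
# through the environment, e.g. for a dry run against a scratch directory.
HIVE_CONF_DIR="${HIVE_CONF_DIR:-/etc/hive/conf}"
HIVE_LIB_DIR="${HIVE_LIB_DIR:-/usr/lib/hive/lib}"
# Assumption: the target directories are root-owned, so copies go through sudo.
# Set SUDO to an empty string when writing to directories you own.
SUDO="${SUDO-sudo}"

# install_hive_overrides SITE_XML JAR (hypothetical helper)
# Copies SITE_XML to $HIVE_CONF_DIR/hive-site.xml and JAR into $HIVE_LIB_DIR.
install_hive_overrides() {
  site_xml="$1"
  jar="$2"
  [ -f "$site_xml" ] || { echo "missing file: $site_xml" >&2; return 1; }
  [ -f "$jar" ] || { echo "missing file: $jar" >&2; return 1; }
  $SUDO cp "$site_xml" "$HIVE_CONF_DIR/hive-site.xml" || return 1
  $SUDO cp "$jar" "$HIVE_LIB_DIR/" || return 1
}
```

On a cluster node this would be called as something like install_hive_overrides ./hive-site.xml ./custom-auth.jar, where both file names are placeholders.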

hive-server2 service

We used to use a pre-provided hive-init script to restart the hive-server2 service in EMR 3.x. In EMR 4/5, however, service management is handled by upstart rather than the traditional SysVinit scripts, so we need to use upstart's commands to control jobs, for example:

sudo reload hive-server2

To list all the managed services, run initctl list.
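
A small sketch for poking at these jobs: the helper below (its name is made up) filters initctl list output down to the jobs you care about, and the commented lines show the other standard upstart verbs. Note that upstart's reload only sends SIGHUP to the job's process; if that does not seem to pick up a change, a stop followed by a start is the heavier alternative.

```shell
# jobs_matching SUBSTRING (hypothetical helper)
# Reads `initctl list` output on stdin and prints "job state" for every job
# whose name contains SUBSTRING.
jobs_matching() {
  awk -v pat="$1" 'index($1, pat) { sub(/,$/, "", $2); print $1, $2 }'
}

# On a cluster node (not executed here):
#   initctl list | jobs_matching hive
#   sudo status hive-server2
#   sudo stop hive-server2 && sudo start hive-server2
```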

I am still not sure why Amazon made this change, since most Linux distributions now use the newer systemd as their init system. Even Ubuntu, where upstart originated, switched to systemd as of the 16.04 LTS release. Here is a good article comparing them; I do not agree with all of its points, but systemd is really the trend.

Here are three Chinese articles on: 1. sysvinit  2. upstart  3. systemd
And another one explaining PID 1 and systemd.

Bootstrap action

Bootstrap actions are another major difference between 3 and 4+. We used to run a lot of shell in EMR bootstrap actions. EMR 4+ deprecated many of the predefined ones, including the hive-site.xml installation. This matters because previously hive-site.xml was put in place first and hive/hive-server2 was started afterwards. Now, if I do the copy as a script runner action, the hive-site.xml gets overwritten by the system default for whatever reason. To work around this life-cycle issue, I have to defer the copy to a step that runs after bootstrap and then reload hive-server2 to apply the new settings. More detail in the start shell.
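
The deferred copy could look roughly like the sketch below. This is not the actual start shell, just an illustration under assumptions: the S3 location and the function name are hypothetical, and the aws CLI is assumed to be available on the master node.

```shell
# Hypothetical S3 location of the custom config; replace with your own.
HIVE_SITE_S3="${HIVE_SITE_S3:-s3://my-bucket/conf/hive-site.xml}"
HIVE_CONF_DIR="${HIVE_CONF_DIR:-/etc/hive/conf}"

# apply_hive_site (hypothetical helper), meant to run as a step after bootstrap:
# fetch the custom hive-site.xml, put it in place, then reload hive-server2.
apply_hive_site() {
  tmp_file="$(mktemp)" || return 1
  if aws s3 cp "$HIVE_SITE_S3" "$tmp_file" &&
     sudo install -m 644 "$tmp_file" "$HIVE_CONF_DIR/hive-site.xml"; then
    rm -f "$tmp_file"
    # Only reload once the new file is actually in place.
    sudo reload hive-server2
  else
    rm -f "$tmp_file"
    return 1
  fi
}
```

The script submitted as the step would define this function and call apply_hive_site as its last line. Doing the reload only after a successful copy avoids bouncing the service when the download fails.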
