We have been using EMR 3.9.x and Hive 0.13 for a while and have been pretty satisfied with them, even with some known bugs. Hive 0.13 is fine, but EMR 3 comes with an old version of Hadoop that has an annoying concurrent-read bug: when multiple threads run hadoop fs -get on the same HDFS file, it throws an error. We can live with that by disabling the Hive fetch task and doing everything in MapReduce.
Recently, however, we ran into an issue: when the schema in the metastore is updated (even with a backward-compatible change such as adding a column at the end), Hive queries fail with this exception:
Caused by: java.sql.SQLException: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
The thing is, teams on newer versions of Hive do not have this problem. This is very annoying, since we use a shared metastore and schema changes are not entirely under our team's control. So we decided to upgrade our Hive/EMR version. This way we also get all the security updates that come with the new EMR.
EMR 4/5 differences from 3.x
The first thing I hit is that the Hive paths have all changed.
In EMR 3.x, the hive configurations and libraries are under
In EMR 4/5, the conf is now in /etc/hive/conf, which is where hive-site.xml should go. The jars live under /usr/lib/hive/lib, which is needed when custom Hive behavior, such as authentication, is implemented.
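As a rough illustration of the lib path above, here is a minimal sketch of a helper that drops a custom jar into Hive's lib directory. The function name, the SUDO and HIVE_LIB_DIR variables, and the jar name in the usage note are hypothetical; only the /usr/lib/hive/lib path comes from the text.

```shell
# Hypothetical helper: copy a custom jar into Hive's lib directory on EMR 4/5.
HIVE_LIB_DIR="${HIVE_LIB_DIR:-/usr/lib/hive/lib}"   # lib path mentioned above
SUDO="${SUDO-sudo}"                                  # set SUDO="" if already root

install_hive_jar() {
  local jar="$1"
  # Refuse to continue if the jar is missing, so a typo fails loudly.
  if [ ! -f "$jar" ]; then
    echo "jar not found: $jar" >&2
    return 1
  fi
  # SUDO is intentionally unquoted so an empty value disappears entirely.
  $SUDO cp "$jar" "$HIVE_LIB_DIR/"
}
```

Usage would look something like install_hive_jar ./custom-auth.jar (a made-up jar name), followed by a hive-server2 restart so the new class is picked up.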
We used to use a pre-provided hive-init script to restart the hive-server2 service in EMR 3.x. In EMR 4/5, however, service management is handled by upstart rather than traditional SysVinit scripts, so we need to use upstart's commands to invoke jobs, like:
sudo reload hive-server2
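The other upstart shortcuts (start, stop, restart, status) follow the same shape as the reload command. A small wrapper, sketched here with a hypothetical function name and a hypothetical DRY_RUN switch, keeps the job name in one place:

```shell
# Hypothetical wrapper around the upstart shortcuts for the hive-server2 job.
DRY_RUN="${DRY_RUN:-0}"   # 1 = print the command instead of running it

hive_server2() {
  local action="$1"
  # Only allow the standard upstart job commands.
  case "$action" in
    start|stop|restart|reload|status) ;;
    *) echo "unsupported action: $action" >&2; return 2 ;;
  esac
  if [ "$DRY_RUN" = "1" ]; then
    echo "sudo $action hive-server2"
  else
    sudo "$action" hive-server2
  fi
}
```

For example, hive_server2 status checks the job, and hive_server2 restart does a full stop and start if a reload does not seem to pick up a change.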
To get all the services managed:
I'm still not sure why Amazon made this change, since most Linux distributions now use the newer systemd as their init system. Even Ubuntu, where upstart was initially used, switched to systemd as of the 16.04 LTS release. Here is a good article comparing them; I don't agree with all of its points, but systemd is clearly the trend.
Bootstrap actions are another major difference between 3 and 4+. We used to run a lot of shell scripts in EMR bootstrap actions. EMR 4+ deprecated many of them, including the hive-site.xml installation. This matters because previously hive-site.xml was put in place first and then hive/hive-server2 was started. Now, when I do it as a script-runner action, for whatever reason the hive-site.xml gets overwritten by the system default. To work around this life-cycle issue, I have to defer the copy to a step that runs after bootstrap and then reload hive-server2 to apply the new settings. More detail is in the start shell.
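The deferred copy-then-reload step can be sketched roughly as follows. This is not the actual start shell, only an illustration under stated assumptions: the S3 location, the function names, the staging path, and the DRY_RUN switch are all hypothetical, while /etc/hive/conf and the reload command come from the text above.

```shell
# Sketch of a post-bootstrap EMR step: install a custom hive-site.xml,
# then reload hive-server2 so it applies the new settings.
HIVE_CONF_DIR="${HIVE_CONF_DIR:-/etc/hive/conf}"   # EMR 4/5 conf location
DRY_RUN="${DRY_RUN:-0}"                            # 1 = print commands only

# Run a command, or just print it when DRY_RUN=1.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

apply_hive_site() {
  local src="$1"                  # hypothetical, e.g. s3://my-bucket/hive-site.xml
  local tmp="/tmp/hive-site.xml"  # staging copy; the path is an arbitrary choice
  # Chain with && so a failed download never triggers the copy or the reload.
  run aws s3 cp "$src" "$tmp" &&
    run sudo cp "$tmp" "$HIVE_CONF_DIR/hive-site.xml" &&
    run sudo reload hive-server2
}
```

Submitted as a step rather than a bootstrap action, the call would be something like apply_hive_site s3://my-bucket/hive-site.xml, so it runs only after the system default file is already in place. Setting DRY_RUN=1 first prints the three commands without touching the cluster.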