One issue we typically have is our EMR cluster stops consuming hive queries due to the overload of the metastore loading/refreshing. This is partially caused by the usage of the
shared-metastore which hosts many teams’ schema/tables inside our organization. When this happens in prod, we have to ask help from RIM to terminate our persistent cluster and create a new one because we do not have the prod pem file. This becomes a pain for our team(preparing many emergency release stuff and getting into bridge line then waiting for all types of approval). Also for our client, they lose the time we process all the stuff we mentioned above(usually hours).
To solve this problem we created nagios alerts to ping our cluster every 10 minutes and have OPS watch for us. This is quite helpful since we know the state of the cluster all the time. However when hive server is down, we still have to go through the emergency release process. Even though we created the Jenkins pipeline to create/terminate/add-step, RIM does not allow us or OPS to use Jenkins pipeline to do the recovery.
We have different ways to resolve it ourself:
- have a dedicated server(ec2) to monitoring and take actions
- have the monitor code deploy in launcher box and do monitoring/recovering
- have the code in Lambda
Method 1 is a traditional way which requires a lot of setup with PUPPET/provision to get everything automated, which does not seem to worth.
Method 2 has less work because we typically have a launcher box in each env. And we could put our existing js code into a nodejs server managed by PM2. The challenge is (1, the launcher box is not a standard app server instance which is not reliable. (2. the nodejs hive connector does not currently have good support on ssl and custom authenticated that we are using in Hive server2.
More over, there is one common weak point for method 1/2, which is we could end up with another monitoring service to make sure it is up and running doing its job.
All the above analysis brings us to the Method 3 where we use Serverless monitoring with lambda. This gives us
(1, Lower operational and development costs
(2, smaller cost to scale
(3, Works with agile development and allows developers to focus on code and to deliver faster
(4, Fits with microservices, which can be implemented as functions.
It is not silver bullet of course. One drawback is with Lambda we could not reuse our nodejs script which is written in ES 6 and aws lambda’s node environment is 4.x. We only tested and run our script in Node 6.x env. With this, we have to re-write our cluster recovery code in java, which fortunately is not difficult thanks to the nice design of aws-java-sdk.
The implementation is quite straightforward.
On high level, we simulate our app path and start the connection from our elb to our cluster.
The basic flow is:
The code is : https://github.com/vcfvct/lambda-emr-monitoring
For hive password, The original thought was to pass in by Sprint boot args. Since we are running in lambda, our run arg will be plain text in the console, which is not ideal. We could also choose to use lambda’s KMS key to encrypt and then decrypt in app. Looks like it is not allowed in Org-level. After discuss with infrastructure team, our hive password will be store in credstash(dynamo).