regex dash - inside brackets []

We have a requirement to remove illegal characters (slash, etc.) from file names, so I put a regex in our code:

[^a-zA-Z0-9.-_]

I assumed the above regex would match everything that is not a number/letter/dot/dash/underscore.

Turns out I was wrong! The - is a special case inside [...]: it is used for ranges, so it should be at the beginning, at the end, or escaped. Otherwise it matches all the characters between . and _ in the ASCII character set. So in my case, the .-_ part matches characters 46 (.) through 95 (_) in the ASCII table.

The correct version puts the dash at the end: [^a-zA-Z0-9._-]. Or just escape it: [^a-zA-Z0-9.\-_].
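
A quick Java sketch of the difference (the file name here is made up for illustration):

public class DashInBracket {
    public static void main(String[] args) {
        String name = "my:file?2017.txt"; // ':' and '?' are the illegal characters to strip
        // Buggy: ".-_" is the ASCII range 46('.')..95('_'), which contains ':' and '?',
        // so the negated class does NOT match them and they survive.
        System.out.println(name.replaceAll("[^a-zA-Z0-9.-_]", "")); // my:file?2017.txt
        // Fixed: with '-' placed last it is a literal dash.
        System.out.println(name.replaceAll("[^a-zA-Z0-9._-]", "")); // myfile2017.txt
    }
}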

print all data in paginated table/grid

Direct tabular data display

Recently our project had a page that needs to show tabular data of 30-6000 rows with 3-4 columns. At first I thought this was a pretty reasonable amount of data to show in one page, so I just threw the data into an ng-repeat table with my own implementation of filtering/sorting, which is pretty straightforward in Angular. Every time the user selects a new category/type, we fetch data from the backend and replace the data in the vm/$scope. This implementation also made it easy to fulfill another requirement, exporting/printing the page content: for export I just send the DOM content to the server and return it as a downloadable; for print, even easier, just call window.print() and that's it.

Performance issue with IE

Everything worked fine until our QA hit IE, which is super slow when the data in the list is replaced from the backend. Some profiling in IE11 showed that the appendChild and removeChild calls take forever when IE clears the old rows from the DOM and puts the new elements in. More slowness comes from style calculation, which IE does for every column/row. Overall, IE takes 20s to render a page with 5000 rows while FF/Safari/Chrome need only 1-2 seconds. This forced us to abandon the straightforward approach for a more IE-friendly one: pagination with angular ui-grid. But that brings another problem, printing, since the data is now paginated and the DOM only has 20 rows.

Server side render and client side print

What I eventually did is send the model data back to the server, do server-side rendering, and send the result back to the browser, where an iFrame is created on the fly for printing. The pro of doing this is a lot of flexibility on content/layout with whatever manipulation/styling we need. The cons are that we add more stuff to the stack, plus one more round trip compared to direct printing.

server side

So on the server side, when we get the REST call for print, we use a Thymeleaf template to generate the HTML. I compared different Java server-side rendering engines like Velocity/FreeMarker/Rythm, and Thymeleaf looks to have the best Spring integration and the most active development/releases.

@Configuration
public class ThymeleafConfig
{
    @Autowired
    private Environment env;

    @Bean
    @Description("Thymeleaf template rendering HTML ")
    public ClassLoaderTemplateResolver exportTemplateResolver() {
        ClassLoaderTemplateResolver exportTemplateResolver = new ClassLoaderTemplateResolver();
        exportTemplateResolver.setPrefix("thymeleaf/");
        exportTemplateResolver.setSuffix(".html");
        exportTemplateResolver.setTemplateMode("HTML5");
        exportTemplateResolver.setCharacterEncoding(CharEncoding.UTF_8);
        exportTemplateResolver.setOrder(1);
        //for local development, we do not want template being cached so that we could do hot reload.
        if ("local".equals(env.getProperty("APP_ENV")))
        {
            exportTemplateResolver.setCacheable(false);
        }
        return exportTemplateResolver;
    }

    @Bean
    public SpringTemplateEngine templateEngine() {
        final SpringTemplateEngine engine = new SpringTemplateEngine();
        final Set<ITemplateResolver> templateResolvers = new HashSet<>();
        templateResolvers.add(exportTemplateResolver());
        engine.setTemplateResolvers(templateResolvers);
        return engine;
    }
}

With the engine configured, we can use it like:

            Context context = new Context();
            context.setVariable("firms", firms);
            context.setVariable("period", period);
            context.setVariable("rptName", rptName);
            context.setVariable("hasFirmId", hasFirmId);
            if (hasFirmId)
            {
                context.setVariable("firmIdType", FirmIdType.getFirmIdType(maybeFirmId).get());
            }

            return templateEngine.process("sroPrint", context);
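
For completeness, a minimal sketch of the REST endpoint wrapping this might look like the following (the mapping path and request shape are illustrative, not our actual code; the Thymeleaf package is org.thymeleaf.spring5.* on newer versions):

import java.util.Map;

import org.springframework.http.MediaType;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;
import org.thymeleaf.context.Context;
import org.thymeleaf.spring4.SpringTemplateEngine;

@RestController
public class PrintController {

    private final SpringTemplateEngine templateEngine;

    public PrintController(SpringTemplateEngine templateEngine) {
        this.templateEngine = templateEngine;
    }

    // Render the template server side and hand the HTML string back to the browser,
    // which writes it into an iframe and prints it.
    @PostMapping(value = "/rest/print/sro", produces = MediaType.TEXT_HTML_VALUE)
    public String printSro(@RequestBody Map<String, Object> model) {
        Context context = new Context();
        model.forEach(context::setVariable);
        return templateEngine.process("sroPrint", context);
    }
}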

The template named sroPrint has some basic Thymeleaf directives:

<html xmlns:th="http://www.thymeleaf.org">
<head>
<style>
    table thead tr th, table tbody tr td {
      border: 1px solid black;
      text-align: center;
    }
  </style>

</head>
<body>
<div>
  <h4 th:text="${rptName}">report name</h4>
  <div style="margin: 10px 0;"><b>Period:</b> <span th:text="${period}"></span></div>
  <table style="width: 100%; ">
    <thead>
    <tr>
      <th th:if="${hasFirmId}" th:text="${firmIdType}"></th>
      <th>crd #</th>
      <th>Firm Name</th>
    </tr>
    </thead>
    <tbody>
    <tr th:each="firm : ${firms}">
      <td th:if="${hasFirmId}" th:text="${firm.firmId}"></td>
      <td th:text="${firm.crdId}">CRD</td>
      <td th:text="${firm.firmName}">firm name</td>
    </tr>
    </tbody>
  </table>
</div>
</body>
</html>

client side

Now on the client side we need to consume the HTML string from the server. The flow is: create an iFrame, write the HTML into it, call the browser print on that iFrame, then remove the element from the DOM. The implementation below lives inside the success callback of the $http call that fetches the DOM string. It is in pure JS without jQuery; with jQuery it might be a bit more concise.


var printIFrame = document.createElement('iframe');
document.body.appendChild(printIFrame);
printIFrame.style.position = 'absolute';
printIFrame.style.top = '-9999px';
printIFrame.style.left = '-9999px';
var frameWindow = printIFrame.contentWindow || printIFrame.contentDocument || printIFrame;
var wdoc = frameWindow.document || frameWindow.contentDocument || frameWindow;
wdoc.write(res.data);
// tell browser write finished
wdoc.close();
$scope.$emit('UNLOAD');
// Fix for IE : Allow it to render the iframe
frameWindow.focus();
try {
    // Fix for IE11 - printing the whole page instead of the iframe content
    if (!frameWindow.document.execCommand('print', false, null)) {
        // document.execCommand returns false if it failed -http://stackoverflow.com/a/21336448/937891
        frameWindow.print();
    }
    // focus body as it is losing focus in iPad and content not getting printed
    document.body.focus();
}
catch (e) {
    frameWindow.print();
}
frameWindow.close();
setTimeout(function() {
    printIFrame.parentElement.removeChild(printIFrame);
}, 0);

PDF/XLS Export

For XLS/PDF export, it is similar to another post I wrote before. The only difference is that there the DOM string was passed from the client; here we generate the DOM string on the server side.
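
As a sketch, the export endpoint can be shaped like this (htmlToPdf is a hypothetical converter, e.g. Flying Saucer or a similar library; the rendering call is the templateEngine.process(...) shown above):

import org.springframework.http.HttpHeaders;
import org.springframework.http.MediaType;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class ExportController {

    @GetMapping("/rest/export/pdf/{reportId}")
    public ResponseEntity<byte[]> exportPdf(@PathVariable String reportId) {
        String html = renderHtml(reportId); // templateEngine.process(...) as shown above
        byte[] pdf = htmlToPdf(html);       // hypothetical HTML-to-PDF conversion
        return ResponseEntity.ok()
                .header(HttpHeaders.CONTENT_DISPOSITION, "attachment; filename=" + reportId + ".pdf")
                .contentType(MediaType.APPLICATION_PDF)
                .body(pdf);
    }

    private String renderHtml(String reportId) {
        return "<html><body>stub</body></html>"; // placeholder for the Thymeleaf rendering
    }

    private byte[] htmlToPdf(String html) {
        return html.getBytes(); // placeholder; a real converter library goes here
    }
}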

Understanding optional=true in maven dependency

Today a colleague from another team was trying to mimic our way of doing custom authentication on HiveServer2. He asked me why he could not get HiveConf.get(key) working. It basically gets the keys we define in hive-site.xml. It is convenient because if we put key/value pairs there, we do not have to worry about path issues in the cluster; just do HiveConf c = new HiveConf() and call get(key). (Side note: this seems to be officially discouraged now, since they have a bunch of enum values to restrict what you can define there.) After looking at the source code, get() is actually a method in the parent Configuration class.

import org.apache.hadoop.conf.Configuration;
public class HiveConf extends Configuration
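
The usage in question is roughly this (the property key is made up for illustration):

import org.apache.hadoop.hive.conf.HiveConf;

public class HiveConfDemo {
    public static void main(String[] args) {
        HiveConf conf = new HiveConf(); // picks up hive-site.xml from the classpath
        // get(String) is inherited from org.apache.hadoop.conf.Configuration
        String value = conf.get("my.custom.auth.key"); // hypothetical key defined in hive-site.xml
        System.out.println(value);
    }
}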

In my pom, I explicitly included the dependency:

       <dependency>
         <groupId>org.apache.hadoop</groupId>
         <artifactId>hadoop-core</artifactId>
         <version>${hadoop.core.version}</version>
      </dependency>

So I can look at it from the IDE directly. However, there was no such dependency in his pom, and I could not find the artifact in his dependency tree either. This was really confusing to me.

It turns out that org.apache.hive -> hive-service depends on some hive-shims artifacts, which have the hadoop-core dependency specified as optional=true.

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-core</artifactId>
      <version>${hadoop-20.version}</version>
      <optional>true</optional>
    </dependency>

So what is optional true? How come his jar can compile even without the hadoop-core dependency specified? The pictures below are from this POST.

Meaning of <optional>

In short, if project D depends on project C, and project C optionally depends on project A, then project D does NOT depend on project A.

(diagram: project D depends on project C, which optionally depends on projects A and B)

Since project C has two classes that use some classes from project A and project B, project C cannot compile without dependencies on A and B. But those two classes are only optional features, which may not be used at all in project D, which depends on project C. So to keep the final war/ejb package from containing unnecessary dependencies, use <optional> to indicate the dependency is optional; by default it will not be inherited by others.

What happens if project D really uses OptionalFeatureOne in project C? Then project A needs to be explicitly declared in the dependencies section of project D's pom.

(diagram: project D explicitly declares the dependency on project A in order to use optional feature one)

If optional feature one is used in project D, then project D's pom needs to declare a dependency on project A to pass compilation. Also, the final war package of project D doesn't contain any class from project B, since feature two is not used.

Our case

In our scenario, hive-service depends on hive-exec, which depends on hive-shims, which has an optional dependency on hadoop-core. So as long as Configuration.get() is not used, his project still compiles even though HiveConf extends the Configuration class. Now that the get() method is to be used, he needs to explicitly declare the dependency on hadoop-core.

Fighting with the browser popup blocker

Background

Recently in our project, we needed to refactor some old Struts actions into REST-based pages. This way we avoid multiple-page navigation for our users, so that everything can be done in a single page.

One example is file download. Previously in the Struts-based app, if a page had 12 files, the user had to click the download link on the main page; if the file was available, the user was taken to the download page where the real download link was, then downloaded it. If it was not available, the user was taken to a request page for confirmation and then, once confirmed, to the download page to wait. So to download all the files, the user had to constantly navigate between different pages with a lot of clicks, which is kind of crazy. In the coming single-page application, everything (request/confirm/download) is on the same page, which is much better.

Issue

However, we hit one issue. When the user clicks the download link, following the flow above, we first make an ajax call to the server to check: if the file is not available, a modal shows up to confirm the request; otherwise we get the download id and open a new tab to download the stream. The problem comes at this point: the browser (Chrome/FF; Safari does not) blocks the download tab from opening. We tried both form submit and window open. What is really bad is that in Chrome the block notification is barely noticeable, a tiny icon on the upper-left that the user can hardly see.

check status

        this.requestDetail = function (requestObj, modalService) {
            that.checkDetailStatus(requestObj).then(
                function success(res) {
                    var status = res.data.status;
                    switch (status) {
                        case 'AVAIL_NOT_REQ':
                            that.createNewRequest(requestObj, modalService);
                            break;
                        case 'NO_DATA':
                            $.bootstrapGrowl('No data available!', {type: 'info'});
                            break;
                        case 'EXISTING_RPT':
                            that.downloadFile(res.data.requestId);
                            break;
                        case 'PENDING':
                            //add user to notify list then redirect
                            that.mapNotifyUser(res.data.requestId).then(
                                function success(res) {
                                    var DETAIL_RUN_INTERVAL = 3;
                                    var minute = DETAIL_RUN_INTERVAL - res.data.minute % DETAIL_RUN_INTERVAL;
                                    $.bootstrapGrowl('Your detail data file will be ready in ' + minute + ' minutes.', {type: 'info'});
                                });
                            break;
                        case 'ERROR':
                            $.bootstrapGrowl('Error Getting Detail data! Contact Admin or Wait for 24 hour to re-request.', {type: 'danger'});
                            break;
                        default:
                            $.bootstrapGrowl('Error Getting Detail data, Contact ADMIN', {type: 'danger'});
                    }
                },
                function error(err) {
                    console.log(err);
                    $.bootstrapGrowl('Network error or Server error!', {type: 'danger'});
                }
            );
        };

with form


        this.downloadFile = function (requestId) {
            //create a form which calls the download REST service on the fly
            var formElement = angular.element("
<form>");
            formElement.attr("action", "/scrc/rest/download/detail/requestId/" + requestId);
            formElement.attr("method", "get");
            formElement.attr("target", "_blank");
            // we need to attach iframe to the body before form could be attached to iframe(below) in ie8
            angular.element('body').append(formElement);
            //call the service
            formElement.submit();
        };

With window

        this.downloadFile = function (requestId) {
            $window.open('/scrc/rest/download/detail/requestId/' + requestId);
        };

Cause

Turns out the issue is: a browser will only open a tab/popup without the popup-blocker warning if the command to open the tab/popup comes from a trusted event. That means the user has to actively click somewhere to open a popup.

In this case, the user performs a click, so we do have a trusted event. However, we lose that trusted context by performing the Ajax request: our success handler no longer has the event.

Possible Solutions

  1. Open the popup on click and manipulate it later when the callback fires:

      var popup = $window.open('', '_blank');
      popup.document.write('loading ...');
      // ... later, in the $http success callback:
      function inCallback(res) {
          // if the report exists, point the popup at the download URL:
          popup.location.href = '/scrc/rest/download/detail/requestId/' + res.data.requestId;
          // or, when there is nothing to download, just close it:
          // popup.close();
      }
    

    This will work but is not elegant: it opens a tab and closes it instantly, which still creates a flash in the browser that the user can notice.

  2. Require the user to click some button again to trigger the popup. This works because we can update the link once the check comes back; the user clicks again and we initiate the download, so the popup is triggered directly by the user. But it is still not quite user friendly.

  3. Notify the user to unblock our site.
    This is eventually what we did. We detect on the client side whether the popup was blocked; if so, we ask the user to unblock our site in the browser settings. The reason we chose this is that the unblock/trust action is really a one-time thing: the browser remembers the choice and will not bother the user again.

            this.downloadFile = function (requestId) {
                var downloadWindow = $window.open('/scrc/rest/download/detail/requestId/' + requestId);
                if(!downloadWindow || downloadWindow.closed || typeof downloadWindow.closed=='undefined')
                {
                    $.bootstrapGrowl('Download Blocked!<br/> Please allow popups from our site in your browser settings!', {type: 'danger', delay: 8000});
                }
            };
    

Serverless EMR Cluster monitoring with Lambda

Background

One issue we typically have is that our EMR cluster stops consuming Hive queries because the metastore is overloaded with loading/refreshing. This is partially caused by the use of a shared metastore, which hosts many teams' schemas/tables inside our organization. When this happens in prod, we have to ask RIM for help terminating our persistent cluster and creating a new one, because we do not have the prod pem file. This is a pain for our team (preparing lots of emergency-release material, getting on the bridge line, then waiting for all types of approval), and our clients lose the hours it takes us to work through all of the above.

To solve this problem we created Nagios alerts that ping our cluster every 10 minutes, with OPS watching for us. This is quite helpful since we know the state of the cluster at all times. However, when the hive server is down, we still have to go through the emergency release process. Even though we created a Jenkins pipeline to create/terminate/add-step, RIM does not allow us or OPS to use the Jenkins pipeline to do the recovery.

Proposals

There are different ways we could resolve this ourselves:

  1. have a dedicated server (EC2) to monitor and take action
  2. have the monitoring code deployed on the launcher box and do monitoring/recovery
  3. have the code in Lambda

Dedicated EC2

Method 1 is the traditional way, requiring a lot of setup with Puppet/provisioning to get everything automated, which does not seem worth it.

Launcher box

Method 2 is less work because we typically have a launcher box in each env, and we could put our existing JS code into a Node.js server managed by PM2. The challenges are (1) the launcher box is not a standard app-server instance, so it is not reliable, and (2) the Node.js hive connector does not currently have good support for SSL and the custom authentication we use in HiveServer2.

Moreover, methods 1 and 2 share a common weak point: we could end up needing yet another monitoring service to make sure the monitor itself is up and doing its job.

Lambda

All the above analysis brings us to method 3: serverless monitoring with Lambda. This gives us:

  1. lower operational and development costs
  2. smaller cost to scale
  3. a fit with agile development, letting developers focus on code and deliver faster
  4. a fit with microservices, which can be implemented as functions

It is not a silver bullet, of course. One drawback is that with Lambda we could not reuse our Node.js script, which is written in ES6: AWS Lambda's Node environment is 4.x, and we only tested and ran our script in a Node 6.x env. Because of this, we had to rewrite our cluster-recovery code in Java, which fortunately is not difficult thanks to the nice design of the aws-java-sdk.

Implementation

The implementation is quite straightforward. 

Components

At a high level, we simulate our app path and start the connection from our ELB through to our cluster.

(diagram: lambda-overview)

Flow

The basic flow is:

(diagram: lambda-flow)

The code is : https://github.com/vcfvct/lambda-emr-monitoring
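
For flavor, here is a minimal sketch of what such a monitor/recovery handler can look like with the aws-java-sdk (illustrative only, not the actual repo code; the ELB endpoint, credentials, and cluster name are hypothetical, and cluster re-creation is omitted):

import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.ClusterState;
import com.amazonaws.services.elasticmapreduce.model.ClusterSummary;
import com.amazonaws.services.elasticmapreduce.model.ListClustersRequest;
import com.amazonaws.services.elasticmapreduce.model.TerminateJobFlowsRequest;
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;

import java.sql.Connection;
import java.sql.DriverManager;

public class EmrMonitorHandler implements RequestHandler<Object, String> {

    private final AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

    @Override
    public String handleRequest(Object input, Context context) {
        if (hiveIsHealthy()) {
            return "OK";
        }
        // Hive is wedged: terminate the stuck cluster; creating the replacement
        // cluster is a separate step, omitted here for brevity.
        for (ClusterSummary c : emr.listClusters(
                new ListClustersRequest().withClusterStates(ClusterState.WAITING, ClusterState.RUNNING))
                .getClusters()) {
            if ("our-persistent-cluster".equals(c.getName())) { // hypothetical cluster name
                emr.terminateJobFlows(new TerminateJobFlowsRequest().withJobFlowIds(c.getId()));
            }
        }
        return "RECOVERED";
    }

    // Simulate the app path: open a Hive JDBC connection through the ELB and run a trivial query.
    private boolean hiveIsHealthy() {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://our-elb.example.com:10000/default", "user", "password")) { // hypothetical endpoint
            return conn.createStatement().execute("show databases");
        } catch (Exception e) {
            return false;
        }
    }
}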

Credentials

For the hive password, the original thought was to pass it in via Spring Boot args. But since we run in Lambda, the run args would be plain text in the console, which is not ideal. We could also use Lambda's KMS key to encrypt it and then decrypt it in the app, but it looks like that is not allowed at the org level. After discussion with the infrastructure team, our hive password is stored in credstash (DynamoDB).

large file from hive to rdbms(oracle)

Recently we had a requirement to dump a sizable file (4+ GB) from S3 to Oracle. The file itself is Hive-compatible, so instead of downloading the file and generating SQL for it, we decided to transfer the content over Hive JDBC and persist it via JPA/Hibernate.

Hive

On the Hive side, one important thing is to make sure the fetch size is set on the JDBC result set:


hiveJdbcTemplate.query(sqlToExecute, rs -> {
    rs.setFetchSize(5000);
    while (rs.next()) {
        // ... do your handling
    }
    return null;
});

RDBMS

On the relational database side:

  1. Make sure indexes are turned off; otherwise each insertion will also trigger a b-tree index insert.
  2. Make sure to leverage the Hibernate batch size via hibernate.jdbc.batch_size (see the sketch after this list). I set it to 50 since my table has over 200 columns. For example, if you save() 100 records and hibernate.jdbc.batch_size is set to 50, then during flushing, instead of issuing the following SQL 100 times:

    insert into TableA (id , fields) values (1, 'val1');
    insert into TableA (id , fields) values (2, 'val2');
    insert into TableA (id , fields) values (3, 'val3');
    .........................
    insert into TableA (id , fields) values (100, 'val100');

    Hibernate will group them into batches of 50 and only issue 2 SQL statements to the DB, like this:

    insert into TableA (id , fields) values (1, 'val1') , (2, 'val2') ,(3, 'val3') ,(4, 'val4') ,......,(50, 'val50')
    insert into TableA (id , fields) values (51, 'val51') , (52, 'val52') ,(53, 'val53') ,(54, 'val54'),...... ,(100, 'val100')  

    Please note that Hibernate transparently disables insert batching at the JDBC level if the primary key of the table being inserted into is GenerationType.IDENTITY.

  3. Make sure to flush()/clear() every so many records so that memory is not eaten up by the millions of objects built on the fly.
    flush() makes sure the queries are executed and the objects are saved (synced) to the DB.
    clear() clears the persistence context so all managed entities are detached; entities that have not been flushed will not be persisted.
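
Here is the batch-size sketch referenced in item 2: one way to set the property when building the EntityManagerFactory (assuming a Spring JPA setup; the surrounding wiring is omitted):

import java.util.Properties;

public class HibernateBatchProps {
    // Pass these to the EntityManagerFactory, e.g. via
    // LocalContainerEntityManagerFactoryBean.setJpaProperties(...) in Spring.
    public static Properties jpaProperties() {
        Properties props = new Properties();
        props.put("hibernate.jdbc.batch_size", "50");  // group 50 inserts per JDBC round trip
        props.put("hibernate.order_inserts", "true");  // order inserts so batching actually kicks in
        props.put("hibernate.order_updates", "true");
        return props;
    }
}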

My main code is something like:


    public int doImport(int limit)
    {
        String sql = "SELECT * FROM erd.ERD_PRDCT_FIXED_INCM_MNCPL_HS_prc_txt";
        if (limit >= 0)
        {
            sql = sql + " LIMIT " + limit;
        }
        HiveBeanPropertyRowMapper<SrcErdFixedIncmMuniEntity> mapper = new HiveBeanPropertyRowMapper<>(SrcErdFixedIncmMuniEntity.class);
        int batchSize = 5000;
        int[] inc = {0};
        Instant start = Instant.now();
        List<SrcErdFixedIncmMuniEntity> listToPersist = new ArrayList<>(batchSize);
        hiveJdbcTemplate.query(sql, (rs) -> {
            rs.setFetchSize(batchSize);
            while (rs.next())
            {
                listToPersist.add(mapper.mapRow(rs, -1));
                inc[0]++;
                if (inc[0] % batchSize == 0)
                {
                    persistAndClear(inc, listToPersist);
                }
            }
            //left overs(last n items)
            if(!listToPersist.isEmpty())
            {
                persistAndClear(inc, listToPersist);
            }
            return null;
        });
        Instant end = Instant.now();
        System.out.println("Data Intake took: " + Duration.between(start, end));
        return inc[0];
    }

    private void persistAndClear(int[] inc, List<SrcErdFixedIncmMuniEntity> listToPersist)
    {
        listToPersist.stream().forEach(em::persist);
        em.flush();
        em.clear();
        LOGGER.info("Saved record milestone: " + inc[0]);
        listToPersist.clear();
    }

Result

Not bad: ~3.5 million records loaded in about an hour.

upgrade emr3 to 4 and hive from 0.13 to 1.0.0

We had been using EMR 3.9.x and Hive 0.13 for a while and were pretty satisfied with them, even with some known bugs. Hive 0.13 is fine, but EMR 3 comes with an old version of Hadoop that has an annoying concurrency bug on reads: when multiple threads do hadoop fs -get on the same file on HDFS, it throws errors. We could live with that by disabling the hive fetch task and doing everything in map-reduce.

However, recently we encountered an issue where, whenever the schema in the metastore is updated (even with a backward-compatible change like adding a column at the end), we get an exception on hive queries:

Caused by: java.sql.SQLException: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask

The thing is, teams with a newer version of Hive do not have this problem. This is very annoying since we use a shared metastore and schema changes are not totally controlled by our team. So we decided to upgrade our Hive/EMR version. This way we also get all the security updates in the new EMR.

EMR 4/5 differences from 3.x

Path change

The first thing I hit is that the Hive paths have all changed.

In EMR 3.x, the hive configurations and libraries are under /home/hadoop/hive/xxx.

In EMR 4/5, the conf is now under /etc/hive/conf, where hive-site.xml should go, and the jars are under /usr/lib/hive/lib, which is needed when implementing custom hive behavior such as authentication.

hive-server2 service

We used to use a pre-provided hive-init script to restart the hive-server2 service in EMR 3.x. However, in EMR 4/5, service management is handled by upstart rather than the traditional SysVInit scripts, so we need to use upstart's commands to invoke jobs, like:

sudo reload hive-server2

To list all the services managed by upstart: initctl list.

I am still not sure why Amazon made this change, since most Linux distributions now use the newer systemd as the init system; even Ubuntu, where upstart was originally introduced, switched to systemd in LTS 16.04. Here is a good article comparing them; I do not agree with all of its points, but systemd is really the trend.

Here are 3 Chinese articles on: 1. sysvinit  2. upstart  3. systemd

Bootstrap action

Bootstrap actions are another major difference between 3 and 4+. We used to have a lot of shell execution in EMR bootstrap actions, and EMR 4+ deprecated many of them, including the hive-site.xml installation. This matters because previously hive-site.xml was preloaded before hive/hive-server2 started. Now I do it as a script-runner action, and for whatever reason the hive-site.xml gets overwritten by the system default. So to overcome this life-cycle issue, I have to defer the copy to a step after bootstrap and then reload hive-server2 to apply the new settings. More detail in the start shell.