regex dash – inside bracket []

We had a requirement to remove illegal characters such as slashes from file names, so I put a regex in our code:

[^a-zA-Z0-9.-_]

I assumed this regex would match everything that is not a letter, digit, dot, dash, or underscore.

Turns out I was wrong! Inside [...] the - is a special character that denotes a range. To match a literal dash it must come first, come last, or be escaped. Otherwise it matches every character between its two neighbours in the ASCII table. In my case, the .-_ part matches all characters from 46 (.) to 95 (_), which includes /, : and \, while the dash itself (45) is not covered.

The correct version puts the dash last: [^a-zA-Z0-9._-]. Or just escape it: [^a-zA-Z0-9.\-_]
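To make the difference concrete, here is a minimal Java sketch that runs both character classes over the same file name. The class and method names are made up for illustration. With the buggy class, / and : survive because they fall inside the accidental range, while the literal dash is stripped; the fixed class does the opposite.

```java
import java.util.regex.Pattern;

final class FileNameSanitizer {
    // Buggy: ".-_" is parsed as the range from '.' (46) to '_' (95).
    static final Pattern BUGGY = Pattern.compile("[^a-zA-Z0-9.-_]");
    // Fixed: the dash comes last, so it is a literal dash.
    static final Pattern FIXED = Pattern.compile("[^a-zA-Z0-9._-]");

    /** Removes every character matched by the given "illegal character" pattern. */
    static String sanitize(Pattern illegal, String name) {
        return illegal.matcher(name).replaceAll("");
    }

    public static void main(String[] args) {
        String name = "a/b:c-d.txt";
        System.out.println(sanitize(BUGGY, name)); // a/b:cd.txt
        System.out.println(sanitize(FIXED, name)); // abc-d.txt
    }
}
```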

print all data in paginated table/grid

direct Tabular data display

Recently our project needed a page that shows tabular data of 30 to 6000 rows with 3 or 4 columns. At first I thought this was a reasonable amount of data to show on one page, so I just threw the data into an ng-repeat table with my own implementation of filtering and sorting, which is pretty straightforward in Angular. Every time the user selects a new category/type, we fetch the data from the backend and replace the data in vm/$scope. This implementation also made it easy to fulfill another requirement: exporting and printing the page content. For export, I just need to send the DOM content to the server and return it as a download. Printing is even easier: just call window.print(), and that's it.

Performance issue with IE

Everything worked fine until our QA hit IE, which is extremely slow when the data in the list is replaced from the backend. Profiling in IE11 showed that the appendChild and removeChild calls take forever when it clears the rows in the DOM and inserts the new elements. Further slowness comes from style calculation, which IE does for every column and row. Overall, IE takes 20s to render a page with 5000 rows, while Firefox, Safari and Chrome need only 1-2 seconds. This forced us to abandon the straightforward approach in favor of a more IE-friendly one: pagination with Angular ui-grid. But that brings another problem, printing, since the data is now paginated and the DOM only holds 20 rows.

Server side render and client side print

What I eventually did was send the model data back to the server, render it server side, and send the result back to the browser, where an iframe is created on the fly for printing. The advantage is a lot of flexibility over content and layout, since we can manipulate and style the output however we like. The drawback is that we added more to the stack, plus one more round trip compared to direct printing.

server side

On the server side, when we get the REST call for print, a Thymeleaf template generates the HTML. I compared different Java server-side rendering engines such as Velocity, FreeMarker and Rythm; Thymeleaf appears to have the best Spring integration and the most active development and release cycle.

@Configuration
public class ThymeleafConfig {
    @Autowired
    private Environment env;

    @Bean
    @Description("Thymeleaf template resolver for rendering HTML")
    public ClassLoaderTemplateResolver exportTemplateResolver() {
        ClassLoaderTemplateResolver exportTemplateResolver = new ClassLoaderTemplateResolver();
        // set prefix, suffix and template mode here as your project requires
        // for local development, we do not want templates cached so that we can hot reload
        if ("local".equals(env.getProperty("APP_ENV"))) {
            exportTemplateResolver.setCacheable(false);
        }
        return exportTemplateResolver;
    }

    @Bean
    public SpringTemplateEngine templateEngine() {
        final SpringTemplateEngine engine = new SpringTemplateEngine();
        final Set<ITemplateResolver> templateResolvers = new HashSet<>();
        templateResolvers.add(exportTemplateResolver());
        engine.setTemplateResolvers(templateResolvers);
        return engine;
    }
}

With the engine configured, we can use it like this:

            Context context = new Context();
            context.setVariable("firms", firms);
            context.setVariable("period", period);
            context.setVariable("rptName", rptName);
            context.setVariable("hasFirmId", hasFirmId);
            if (hasFirmId)
                context.setVariable("firmIdType", FirmIdType.getFirmIdType(maybeFirmId).get());

            return templateEngine.process("sroPrint", context);

The template named sroPrint uses some basic Thymeleaf directives:

<html xmlns:th="http://www.thymeleaf.org">
<head>
  <style>
    table thead tr th, table tbody tr td {
      border: 1px solid black;
      text-align: center;
    }
  </style>
</head>
<body>
  <h4 th:text="${rptName}">report name</h4>
  <div style="margin: 10px 0;"><b>Period:</b> <span th:text="${period}"></span></div>
  <table style="width: 100%; ">
    <thead>
    <tr>
      <th th:if="${hasFirmId}" th:text="${firmIdType}"></th>
      <th>crd #</th>
      <th>Firm Name</th>
    </tr>
    </thead>
    <tbody>
    <tr th:each="firm : ${firms}">
      <td th:if="${hasFirmId}" th:text="${firm.firmId}"></td>
      <td th:text="${firm.crdId}">CRD</td>
      <td th:text="${firm.firmName}">firm name</td>
    </tr>
    </tbody>
  </table>
</body>
</html>

client side

Now the client side needs to consume the HTML string returned by the server. The flow is: create an iframe, write the HTML into it, call the browser's print on that iframe, and remove the element from the DOM. The implementation below lives inside the success callback of the $http call that fetches the HTML string. It is plain JS without jQuery, with which it might be a bit more concise.

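var printIFrame = document.createElement('iframe');
document.body.appendChild(printIFrame);
// move the iframe off screen (the top/left property names are reconstructed)
printIFrame.style.position = 'absolute';
printIFrame.style.top = '-9999px';
printIFrame.style.left = '-9999px';
var frameWindow = printIFrame.contentWindow || printIFrame.contentDocument || printIFrame;
var wdoc = frameWindow.document || frameWindow.contentDocument || frameWindow;
wdoc.write(html); // html: the HTML string from the $http response (variable name assumed)
// tell browser write finished
wdoc.close();
// Fix for IE : Allow it to render the iframe
frameWindow.focus();
try {
    // Fix for IE11 - printing the whole page instead of the iframe content
    if (!frameWindow.document.execCommand('print', false, null)) {
        // document.execCommand returns false if it failed - fall back to window.print()
        frameWindow.print();
    }
    // focus body as it is losing focus in iPad and content not getting printed
    document.body.focus();
} catch (e) {
    frameWindow.print();
}
setTimeout(function() {
    document.body.removeChild(printIFrame);
}, 0);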

PDF/XLS Export

The xls/pdf export is similar to the one in the other POST I wrote before. The only difference is that there the DOM string was passed in from the client, whereas here we generate it on the server side.

Understand optional true in maven dependency

Today a colleague from another team was trying to mimic our custom authentication on HiveServer2. He asked me why he could not get HiveConf.get(key) to work. It basically reads a key we defined in hive-site.xml. This is convenient because if we put the key/value there, we do not have to worry about path issues in the cluster: just do HiveConf c = new HiveConf() and call get(key). (Side note: this approach does not seem to be officially recommended, since they now have a set of enum values that restrict what you can define there.) Looking at the source code, get() is actually a method in the parent Configuration class.

import org.apache.hadoop.conf.Configuration;
public class HiveConf extends Configuration

In my pom, I had explicitly included the hadoop-core dependency, so I could look at the class directly from the IDE. However, there was no such dependency in his pom, and I could not find the artifact in his dependency tree either. This was really confusing to me.

It turns out that org.apache.hive -> hive-service depends on some hive-shims artifacts, which declare their hadoop-core dependency as optional=true.

So what does optional=true mean? How can his jar compile without the hadoop-core dependency specified? The explanation below follows this POST.

Meaning of <optional>

In short, if project D depends on project C, and project C optionally depends on project A, then project D does NOT depend on project A.


Project C has two classes that use classes from project A and project B, so project C cannot compile without dependencies on A and B. But these two classes are only optional features, which may not be used at all in project D, which depends on project C. To keep unnecessary dependencies out of the final war/ejb package, use <optional> to mark the dependency as optional; by default it will then not be inherited by others.

What happens if project D really uses OptionalFeatureOne in project C? Then project A needs to be explicitly declared in the dependencies section of project D's pom.

If optional feature one is used in project D, then project D's pom needs to declare a dependency on project A to compile. Also, the final war package of project D does not contain any class from project B, since feature two is not used.
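The rule can be modelled in a few lines of Java. This is only a toy sketch of the "optional dependencies are not inherited" rule, not how Maven is implemented: it ignores scopes, exclusions and version mediation, and the class and method names are invented for illustration. The project names A, B, C and D are the ones used in the text.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class OptionalDependencyDemo {

    /** One <dependency> entry: the target artifact and its <optional> flag. */
    static final class Dep {
        final String artifact;
        final boolean optional;

        Dep(String artifact, boolean optional) {
            this.artifact = artifact;
            this.optional = optional;
        }
    }

    /** Artifacts visible to root: its direct deps, plus transitive deps that are not optional. */
    static Set<String> resolve(String root, Map<String, List<Dep>> poms) {
        Set<String> resolved = new LinkedHashSet<>();
        collect(root, poms, resolved, true);
        return resolved;
    }

    private static void collect(String artifact, Map<String, List<Dep>> poms,
                                Set<String> resolved, boolean direct) {
        for (Dep dep : poms.getOrDefault(artifact, Collections.<Dep>emptyList())) {
            if (dep.optional && !direct) {
                continue; // optional dependencies are not passed on to dependents
            }
            if (resolved.add(dep.artifact)) {
                collect(dep.artifact, poms, resolved, false);
            }
        }
    }

    /** The poms from the text: C optionally needs A and B; D needs C and may declare A itself. */
    static Map<String, List<Dep>> poms(boolean dDeclaresA) {
        Map<String, List<Dep>> poms = new HashMap<>();
        poms.put("C", Arrays.asList(new Dep("A", true), new Dep("B", true)));
        poms.put("D", dDeclaresA
                ? Arrays.asList(new Dep("C", false), new Dep("A", false))
                : Arrays.asList(new Dep("C", false)));
        return poms;
    }

    public static void main(String[] args) {
        System.out.println(resolve("D", poms(false))); // [C]
        System.out.println(resolve("D", poms(true)));  // [C, A]
    }
}
```

In both runs B never reaches project D, which matches the statement that the final package contains no class from project B.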

Our case

In our scenario, hive-service depends on hive-exec, which depends on hive-shims, which has an optional dependency on hadoop-core. So as long as Configuration.get() is not used, his project compiles even though HiveConf extends the Configuration class. If the get() method is to be used, he needs to explicitly declare the dependency on hadoop-core.

Serverless EMR Cluster monitoring with Lambda


One issue we typically have is that our EMR cluster stops accepting Hive queries because the metastore is overloaded with loading/refreshing. This is partially caused by our use of the shared metastore, which hosts the schemas and tables of many teams inside our organization. When this happens in prod, we have to ask RIM for help to terminate our persistent cluster and create a new one, because we do not have the prod pem file. This is a pain for our team (preparing emergency release material, getting on a bridge line, then waiting for all kinds of approval). Our clients, in turn, lose the time it takes us to go through all of that (usually hours).

To solve this problem we created Nagios alerts that ping our cluster every 10 minutes and have OPS watch them for us. This is quite helpful since we know the state of the cluster at all times. However, when the Hive server is down, we still have to go through the emergency release process. Even though we created a Jenkins pipeline to create/terminate/add-step, RIM does not allow us or OPS to use it for recovery.


There are several ways we could resolve it ourselves:

  1. have a dedicated server (EC2) do the monitoring and take action
  2. deploy the monitoring code on the launcher box and do the monitoring/recovery there
  3. run the code in Lambda

Dedicated EC2

Method 1 is the traditional way. It requires a lot of setup with Puppet/provisioning to get everything automated, which does not seem worth it.

Launcher box

Method 2 takes less work because we typically have a launcher box in each env, and we could put our existing JS code into a Node.js server managed by PM2. The challenges are (1) the launcher box is not a standard app server instance, so it is not reliable, and (2) the Node.js Hive connector does not currently have good support for the SSL and custom authentication that we use in HiveServer2.

Moreover, methods 1 and 2 share a weak point: we could end up needing yet another monitoring service to make sure the monitor itself is up and doing its job.


All the above analysis brings us to method 3, serverless monitoring with Lambda. This gives us:

(1) lower operational and development costs

(2) smaller cost to scale

(3) a fit with agile development, letting developers focus on code and deliver faster

(4) a fit with microservices, which can be implemented as functions

It is not a silver bullet, of course. One drawback is that with Lambda we could not reuse our Node.js script, which is written in ES6, because AWS Lambda's Node environment is 4.x and we have only tested and run our script in a Node 6.x env. So we had to rewrite our cluster recovery code in Java, which fortunately is not difficult thanks to the nice design of aws-java-sdk.


The implementation is quite straightforward.


At a high level, we simulate our app's path and open the connection from our ELB to our cluster.



The basic flow is:


The code is :


For the Hive password, the original thought was to pass it in via Spring Boot args. But since we are running in Lambda, our run args would appear as plain text in the console, which is not ideal. We could also use Lambda's KMS key to encrypt it and then decrypt it in the app, but that appears not to be allowed at the org level. After discussing with the infrastructure team, we decided to store our Hive password in credstash (DynamoDB).

large file from hive to rdbms(oracle)

Recently we had a requirement to dump a sizable file (4+ GB) from S3 into Oracle. The file itself is Hive-compatible, so instead of downloading the file and generating SQL for it, we decided to transfer the content using Hive JDBC and persist it via JPA/Hibernate.


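On the Hive side, one important thing is to make sure the fetch size (how many rows come back per round trip) is set on the JDBC result set.

hiveJdbcTemplate.query(sqlToExecute, rs -> {
            rs.setFetchSize(fetchSize); // fetchSize: a value you choose (name is illustrative)
            while (rs.next()) { // your handling goes inside this loop
            }
            return null;
        });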


On the relational database side:

  1. Make sure indexes are turned off. Otherwise each insertion also triggers an insert into the B-tree index.
  2. Make sure to leverage Hibernate batching via
    hibernate.jdbc.batch_size. I set it to 50 since my table has over 200 columns. For example, if you save() 100 records and hibernate.jdbc.batch_size is set to 50, then during flushing, instead of issuing the following SQL 100 times:

    insert into TableA (id , fields) values (1, 'val1');
    insert into TableA (id , fields) values (2, 'val2');
    insert into TableA (id , fields) values (3, 'val3');
    insert into TableA (id , fields) values (100, 'val100');

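    Hibernate will group them into batches of 50 and send only 2 batches to the DB. Conceptually it looks like this (over JDBC, each batch is really one prepared INSERT executed with 50 sets of parameters in a single round trip, not a literal multi-row statement):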

    insert into TableA (id , fields) values (1, 'val1') , (2, 'val2') ,(3, 'val3') ,(4, 'val4') ,......,(50, 'val50')
    insert into TableA (id , fields) values (51, 'val51') , (52, 'val52') ,(53, 'val53') ,(54, 'val54'),...... ,(100, 'val100')  

    Please note that Hibernate transparently disables insert batching at the JDBC level if the primary key of the target table uses GenerationType.IDENTITY.

  3. Make sure to flush()/clear() every so many records so that memory is not eaten up by the millions of objects built on the fly.
    flush makes sure the queries are executed and the objects are saved (synced) to the DB.
    clear clears the persistence context so all managed entities are detached. Entities that have not been flushed will not be persisted.
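The chunking pattern behind step 3 can be sketched in plain Java, independent of Hibernate. This is a minimal illustration with invented names; the sink callback stands in for "persist the chunk, then flush() and clear()". The part that is easy to get wrong is the final partial chunk, which must be handed over after the loop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class BatchingSketch {

    /**
     * Feeds items to sink in chunks of at most batchSize, so that only one
     * chunk is held in memory at a time. Returns the number of items seen.
     */
    static <T> int processInBatches(Iterable<T> source, int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<T> buffer = new ArrayList<>(batchSize);
        int total = 0;
        for (T item : source) {
            buffer.add(item);
            total++;
            if (buffer.size() == batchSize) {
                sink.accept(new ArrayList<>(buffer)); // stand-in for persist + flush() + clear()
                buffer.clear();
            }
        }
        if (!buffer.isEmpty()) {
            sink.accept(new ArrayList<>(buffer)); // leftovers: the last partial chunk
            buffer.clear();
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 12; i++) {
            rows.add(i);
        }
        // 12 rows with a batch size of 5 arrive as chunks of 5, 5 and 2
        processInBatches(rows, 5, chunk -> System.out.println("saving " + chunk.size() + " rows"));
    }
}
```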

My main code is something like:

    public int doImport(int limit) {
        String sql = "SELECT * FROM erd.ERD_PRDCT_FIXED_INCM_MNCPL_HS_prc_txt";
        if (limit >= 0) {
            sql = sql + " LIMIT " + limit;
        }
        HiveBeanPropertyRowMapper<SrcErdFixedIncmMuniEntity> mapper = new HiveBeanPropertyRowMapper<>(SrcErdFixedIncmMuniEntity.class);
        int batchSize = 5000;
        int[] inc = {0};
        Instant start = Instant.now();
        List<SrcErdFixedIncmMuniEntity> listToPersist = new ArrayList<>(batchSize);
        hiveJdbcTemplate.query(sql, (rs) -> {
            while (rs.next()) {
                listToPersist.add(mapper.mapRow(rs, -1));
                inc[0]++;
                if (inc[0] % batchSize == 0) {
                    persistAndClear(inc, listToPersist);
                }
            }
            //left overs(last n items)
            if (!listToPersist.isEmpty()) {
                persistAndClear(inc, listToPersist);
            }
            return null;
        });
        Instant end = Instant.now();
        System.out.println("Data Intake took: " + Duration.between(start, end));
        return inc[0];
    }

    private void persistAndClear(int[] inc, List<SrcErdFixedIncmMuniEntity> listToPersist) {
        // the persist/flush lines are reconstructed; em is the injected EntityManager
        listToPersist.forEach(em::persist);
        em.flush();
        em.clear();
        listToPersist.clear();
        log.info("Saved record milestone: " + inc[0]); // log: the class logger (name assumed)
    }


Not bad: ~3.5 million records were loaded in about an hour.

Spring nested @Transactional rollback only

Recently we got some odd "xxx Thread got an uncaught exception" Nagios alerts, and the corresponding exception in the log is:

Could not commit JPA transaction; nested exception is javax.persistence.RollbackException: Transaction marked as rollbackOnly org.springframework.transaction.TransactionSystemException: Could not commit JPA transaction; nested exception is javax.persistence.RollbackException: Transaction marked as rollbackOnly
    at org.springframework.orm.jpa.JpaTransactionManager.doCommit(
    at org.springframework.transaction.interceptor.TransactionAspectSupport.commitTransactionAfterReturning(
    at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(
    at org.springframework.aop.framework.Cglib2AopProxy$DynamicAdvisedInterceptor.intercept(
    at methodA()...

It turns out the reason is that a nested method is also marked @Transactional, and an exception thrown inside it caused Spring to mark the transaction as rollback-only in the thread-local TransactionStatus.

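@Transactional
public void outerMethod() {
  try {
    nestedService.nestedMethod(); // hypothetical name: also @Transactional, throws an exception
  } catch (RuntimeException e) { /* swallowed here, but the damage is already done */ }
  A a = new A();
  //HERE, transaction is already marked as rollback only!
}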

So the possible solutions are:

  • Remove @Transactional from the nested method if it does not really require transaction control. Then even if it throws an exception, the exception just bubbles up and does not affect the transaction.
  • If the nested method does need transaction control, give it the REQUIRES_NEW propagation policy. That way, even if it throws an exception and its own transaction is marked rollback-only, the caller is not affected.
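The mechanics can be illustrated with a toy model in plain Java. To be clear, this is not Spring code: TxStatus, required and requiresNew are invented stand-ins that only mimic how a participating method can flag a shared, thread-bound transaction, and how a separate transaction avoids that.

```java
final class RollbackOnlyToyModel {

    /** Stand-in for the per-thread transaction status (not Spring's class). */
    static final class TxStatus {
        boolean rollbackOnly;
    }

    static final ThreadLocal<TxStatus> CURRENT = new ThreadLocal<>();

    /** Mimics propagation REQUIRED: join the caller's transaction or start one. */
    static void required(Runnable work) {
        TxStatus existing = CURRENT.get();
        if (existing != null) {
            try {
                work.run();
            } catch (RuntimeException e) {
                existing.rollbackOnly = true; // a participant can only flag the shared transaction
                throw e;
            }
            return;
        }
        TxStatus tx = new TxStatus();
        CURRENT.set(tx);
        try {
            work.run(); // an exception escaping here would mean a plain rollback
        } finally {
            CURRENT.remove();
        }
        if (tx.rollbackOnly) {
            throw new IllegalStateException("Transaction marked as rollbackOnly");
        }
        // otherwise the commit would happen here
    }

    /** Mimics propagation REQUIRES_NEW: suspend the caller's transaction, run in a fresh one. */
    static void requiresNew(Runnable work) {
        TxStatus suspended = CURRENT.get();
        CURRENT.remove();
        try {
            required(work);
        } finally {
            if (suspended != null) {
                CURRENT.set(suspended);
            }
        }
    }

    /** Outer method catches the inner failure; returns what happens at commit time. */
    static String demo(boolean innerRequiresNew) {
        Runnable failing = () -> { throw new IllegalArgumentException("boom"); };
        try {
            required(() -> {
                try {
                    if (innerRequiresNew) requiresNew(failing); else required(failing);
                } catch (IllegalArgumentException swallowed) {
                    // the caller handles the error and carries on
                }
            });
            return "committed";
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo(false)); // Transaction marked as rollbackOnly
        System.out.println(demo(true));  // committed
    }
}
```

Even though the outer method swallows the exception in both runs, only the run where the inner call joins the outer transaction fails at commit time.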

One caveat:

By default, only unchecked exceptions (subclasses of java.lang.RuntimeException, plus java.lang.Error) trigger a rollback. If a checked exception is thrown, the transaction will be committed!

This can be customized very easily by adding the rollbackFor parameter to the @Transactional annotation:

@Transactional(rollbackFor = Exception.class)
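As a rough mental model, the decision "should this exception roll the transaction back?" can be written as a predicate. The sketch below uses invented names and only covers the default rule plus a single rollbackFor class; Spring's real rule matching is more elaborate (several rules, noRollbackFor, closest match wins).

```java
import java.io.IOException;

final class RollbackRuleSketch {

    /** Default rule: roll back on unchecked exceptions and errors only. */
    static boolean rollbackOnByDefault(Throwable ex) {
        return ex instanceof RuntimeException || ex instanceof Error;
    }

    /** Simplified equivalent of rollbackFor = X.class: instances of X also roll back. */
    static boolean rollbackOn(Throwable ex, Class<? extends Throwable> rollbackFor) {
        return rollbackFor.isInstance(ex) || rollbackOnByDefault(ex);
    }

    public static void main(String[] args) {
        Throwable checked = new IOException("disk full");
        System.out.println(rollbackOnByDefault(checked));          // false
        System.out.println(rollbackOn(checked, Exception.class));  // true
    }
}
```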

Some source code from spring transactional implementation:

1. the Transaction abstraction

public interface PlatformTransactionManager {
    TransactionStatus getTransaction(TransactionDefinition definition) throws TransactionException;
    void commit(TransactionStatus status) throws TransactionException;
    void rollback(TransactionStatus status) throws TransactionException;
}

2. The TransactionDefinition

public interface TransactionDefinition {
    // Propagations (constants omitted)
    // Isolations (constants omitted)
    // timeout
    int TIMEOUT_DEFAULT = -1;

    // behaviors
    int getPropagationBehavior();
    int getIsolationLevel();
    int getTimeout();
    boolean isReadOnly();
    String getName();
}

3. the TransactionStatus

public interface TransactionStatus extends SavepointManager, Flushable {
    boolean isNewTransaction();
    boolean hasSavepoint();
    void setRollbackOnly();
    boolean isRollbackOnly();
    void flush();
    boolean isCompleted();
}

A Chinese article about the source code.


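logback Spring Boot rolling configuration


Spring Boot supports a variety of logging tools; the most commonly used is Logback. We can configure logging, but because logging is initialized before the ApplicationContext is created, it cannot be configured through @PropertySources annotations on @Configuration classes. Use system properties or the external configuration file application.properties instead. The following properties can be set in the configuration file:

  • logging.config=: location of the logging configuration file, e.g. classpath:logback.xml (Logback's configuration file)
  • logging.file=: log file name, e.g. myapp.log writes the log to myapp.log in the current directory
  • logging.path=: log file directory, e.g. /var/log writes the log to /var/log/spring.log
  • logging.level.*=: log level
  • logging.pattern.console=: log format for console output; only effective with Logback
  • logging.pattern.file=: log format for file output; only effective with Logback
  • logging.pattern.level=: format of the log level; defaults to %5p. Only effective with Logback
  • logging.exception-conversion-word=%wEx: which conversion word to use when logging exceptions (base.xml defines three conversionRules)
  • logging.register-shutdown-hook=false # Register a shutdown hook for the logging system when it is initialized (I have not used this)

These properties are usually written in application.properties, so they are loaded into the Spring Environment. For convenient use elsewhere, some Spring Environment properties are also transferred to system properties. The mapping between these properties and system properties is: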

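Spring Environment                   System Property
logging.exception-conversion-word    LOG_EXCEPTION_CONVERSION_WORD
logging.file                         LOG_FILE
logging.path                         LOG_PATH
logging.pattern.console              CONSOLE_LOG_PATTERN
logging.pattern.file                 FILE_LOG_PATTERN
logging.pattern.level                LOG_LEVEL_PATTERN
PID                                  PID
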
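The logging.config property specifies the location of the logging configuration file. Taking Logback as an example:

  • If this property is not set, Logback itself looks for a configuration file on the classpath, in the order logback.groovy > logback-test.xml > logback.xml;
  • Spring Boot adds two more default configuration files, logback-spring.groovy > logback-spring.xml, which have lower priority than the three above. Using logback-spring.xml is recommended.
  • When no configuration file is specified, the files above are searched; when one is specified, that file is loaded. For example, logging.config=classpath:logback-abc.xml loads logback-abc.xml from the classpath.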








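The logging.file and logging.path properties specify where the log file is written. By default Spring Boot only logs to the console; logs are written to a file only when one of these two properties is set.

  • If neither is set, logs go only to the console, not to a file
  • logging.file specifies a file; the path may be relative or absolute.
  • logging.path specifies a directory; if set, logs are written to spring.log in that directory
  • If both are set, logging.file takes precedence

In the spring-boot jar, the Logback configuration file-appender.xml writes its output to ${LOG_FILE}, and base.xml in the same package contains this line: <property name="LOG_FILE" value="${LOG_FILE:-${LOG_PATH:-${LOG_TEMP:-${java.io.tmpdir:-/tmp}}}/spring.log}"/>. Since ${X:-Y} means "X if it is set, otherwise Y", a little analysis shows why logging.file takes precedence, and why setting only logging.path writes to spring.log in that directory.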
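The nested ${X:-Y} defaults in that base.xml line can be unrolled into ordinary code. The following is a small illustrative sketch, not Spring Boot or Logback code; the method name is invented, and null stands for "property not set".

```java
final class LogFileResolution {

    /**
     * Mirrors ${LOG_FILE:-${LOG_PATH:-${LOG_TEMP:-tmpDir}}/spring.log}:
     * an explicit file wins; otherwise spring.log goes into the first
     * directory that is set (LOG_PATH, then LOG_TEMP, then tmpDir).
     */
    static String resolve(String logFile, String logPath, String logTemp, String tmpDir) {
        if (logFile != null) {
            return logFile;
        }
        String dir = logPath != null ? logPath : (logTemp != null ? logTemp : tmpDir);
        return dir + "/spring.log";
    }

    public static void main(String[] args) {
        System.out.println(resolve("myapp.log", "/var/log", null, "/tmp")); // myapp.log
        System.out.println(resolve(null, "/var/log", null, "/tmp"));        // /var/log/spring.log
        System.out.println(resolve(null, null, null, "/tmp"));              // /tmp/spring.log
    }
}
```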



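logging.level.* specifies the log level per logger.


Note: levels set through this property take priority over the logging configuration file (such as logback.xml). When the two disagree, the level defined by this property wins.


  • logging.pattern.console specifies the log format for console output;
  • logging.pattern.file specifies the log format for file output;
  • logging.pattern.level specifies the format of the log level (DEBUG, INFO, ERROR, etc.); the default is %5p;

When these properties are not set, the default formats are defined in the DefaultLogbackConfiguration class in the spring-boot jar, and also in defaults.xml.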


2016-11-02 21:59:11.366 INFO 11969 --- [ main] o.apache.catalina.core.StandardService : Starting service Tomcat

In order: timestamp, log level, PID, ---, [thread name], logger name : log message



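  • console-appender.xml: defines the log format for console output
  • file-appender.xml: defines a file output format (each file limited to 10 MB)
  • defaults.xml: defines some log levels
  • base.xml: includes the three files above and sets the root level and its outputs

Our production logging does not need console output, and our log file size is generally not 10 MB, so we can use those files as a reference.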


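<?xml version="1.0" encoding="UTF-8"?>
<configuration>

    <!-- Defines log formats such as CONSOLE_LOG_PATTERN and FILE_LOG_PATTERN, plus some log levels -->
    <include resource="org/springframework/boot/logging/logback/defaults.xml"/>

    <!-- Console output; usually not used in production -->
    <appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
        <encoder charset="UTF-8">
            <pattern>${CONSOLE_LOG_PATTERN}</pattern>
        </encoder>
    </appender>

    <property name="LOG_FILE_NAME" value="myLog"/> <!-- Define a property, used below -->

    <!-- Main log file appender -->
    <appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <file>${catalina.base}/logs/${LOG_FILE_NAME}.log</file>  <!-- can be customized -->
        <encoder charset="UTF-8">
            <pattern>${FILE_LOG_PATTERN}</pattern> <!-- the output format can also be customized -->
        </encoder>
        <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
            <!-- assumed pattern: daily rollover -->
            <fileNamePattern>${catalina.base}/logs/${LOG_FILE_NAME}.%d{yyyy-MM-dd}.log</fileNamePattern>
        </rollingPolicy>
    </appender>

    <!-- Error log appender -->
    <appender name="ERROR_FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <file>${catalina.base}/logs/${LOG_FILE_NAME}_error.log</file> <!-- assumed file name -->
        <filter class="ch.qos.logback.classic.filter.ThresholdFilter">
            <level>ERROR</level>
        </filter>
        <encoder charset="UTF-8">
            <pattern>${FILE_LOG_PATTERN}</pattern>
        </encoder>
        <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
            <!-- assumed pattern: daily rollover -->
            <fileNamePattern>${catalina.base}/logs/${LOG_FILE_NAME}_error.%d{yyyy-MM-dd}.log</fileNamePattern>
        </rollingPolicy>
    </appender>

    <!-- Log levels; these can also be set in the application configuration -->
    <logger name="com.example.project" level="INFO"/>
    <logger name="org.springframework.web" level="DEBUG"/>

    <root level="ERROR">
        <appender-ref ref="CONSOLE"/> <!-- production does not need CONSOLE output -->
        <appender-ref ref="FILE"/>
        <appender-ref ref="ERROR_FILE"/>
    </root>
</configuration>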

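  • In the example above, logs are written to XXX.log, error logs are written separately to XXX_error.log, and log files are rolled over once a day.
  • In the example above, the application configuration properties for the log file name and location (logging.file and logging.path) have no effect, because the example does not use them. Other settings (such as log levels) still apply.
  • In the example above, ${catalina.base} is a system property that represents the application's directory. The file name (location) can be chosen freely; see also how the files in the spring-boot jar do it.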