large file from hive to rdbms (oracle)

Recently we had a requirement to dump a sizable file (4+ GB) from S3 into Oracle. The file itself is Hive-compatible, so instead of downloading the file and generating SQL for it, we decided to transfer the content using Hive JDBC and persist it via JPA/Hibernate.

Hive

On the Hive side, one important thing is to make sure the fetch size is set on the JDBC ResultSet.


hiveJdbcTemplate.query(sqlToExecute, rs -> {
            rs.setFetchSize(5000);
            while (rs.next()){
               // ... do your row handling here
            }
        });

RDBMS

On the relational database side:

  1. Make sure indexes are turned off; otherwise each insertion will also trigger a B-tree index update.
  2. Make sure to leverage the Hibernate batch size setting hibernate.jdbc.batch_size (a configuration sketch follows this list).
    I set it to 50 since my table has over 200 columns. For example, if you save() 100 records and your hibernate.jdbc.batch_size is set to 50, then during flushing, instead of issuing the following SQL 100 times:

    insert into TableA (id , fields) values (1, 'val1');
    insert into TableA (id , fields) values (2, 'val2');
    insert into TableA (id , fields) values (3, 'val3');
    .........................
    insert into TableA (id , fields) values (100, 'val100');

    Hibernate will group them into batches of 50 and send them to the DB in only 2 JDBC batches, conceptually like this:

    insert into TableA (id , fields) values (1, 'val1') , (2, 'val2') ,(3, 'val3') ,(4, 'val4') ,......,(50, 'val50')
    insert into TableA (id , fields) values (51, 'val51') , (52, 'val52') ,(53, 'val53') ,(54, 'val54'),...... ,(100, 'val100')  

    Please note that Hibernate will transparently disable insert batching at the JDBC level if the primary key of the table being inserted uses GenerationType.IDENTITY.

  3. Make sure to flush()/clear() after a certain number of entities so that memory is not eaten up by the millions of objects built on the fly.
    flush() makes sure the queued statements are executed and the objects are saved (synced) to the DB.
    clear() clears the persistence context so all managed entities are detached; entities that have not been flushed will not be persisted.
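
For reference, here is a minimal sketch of the batching-related Hibernate settings described above. The property keys are standard Hibernate ones, but how they reach the EntityManagerFactory (persistence.xml, Spring configuration, etc.) depends on your setup:

    import java.util.Properties;

    // Sketch of the batching-related Hibernate settings used for this import.
    // The keys are standard Hibernate settings; the wiring is up to your project.
    public class BatchInsertSettings
    {
        public static Properties jpaProperties()
        {
            Properties properties = new Properties();
            // flush inserts to the database in JDBC batches of 50
            properties.setProperty("hibernate.jdbc.batch_size", "50");
            // keep inserts of the same entity type grouped together so batching stays effective
            properties.setProperty("hibernate.order_inserts", "true");
            return properties;
        }
    }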

My main code is something like:


    public int doImport(int limit)
    {
        String sql = "SELECT * FROM erd.ERD_PRDCT_FIXED_INCM_MNCPL_HS_prc_txt";
        if (limit >= 0)
        {
            sql = sql + " LIMIT " + limit;
        }
        HiveBeanPropertyRowMapper<SrcErdFixedIncmMuniEntity> mapper = new HiveBeanPropertyRowMapper<>(SrcErdFixedIncmMuniEntity.class);
        int batchSize = 5000;
        int[] inc = {0};
        Instant start = Instant.now();
        List<SrcErdFixedIncmMuniEntity> listToPersist = new ArrayList<>(batchSize);
        hiveJdbcTemplate.query(sql, (rs) -> {
            rs.setFetchSize(batchSize);
            while (rs.next())
            {
                listToPersist.add(mapper.mapRow(rs, -1));
                inc[0]++;
                if (inc[0] % batchSize == 0)
                {
                    persistAndClear(inc, listToPersist);
                }
            }
            //left overs(last n items)
            if(!listToPersist.isEmpty())
            {
                persistAndClear(inc, listToPersist);
            }
            return null;
        });
        Instant end = Instant.now();
        System.out.println("Data Intake took: " + Duration.between(start, end));
        return inc[0];
    }

    private void persistAndClear(int[] inc, List<SrcErdFixedIncmMuniEntity> listToPersist)
    {
        listToPersist.stream().forEach(em::persist);
        em.flush();
        em.clear();
        LOGGER.info("Saved record milestone: " + inc[0]);
        listToPersist.clear();
    }

Result

Not bad: ~3.5 million records were loaded in about an hour.

serialize enum fields with gson

By default, Gson only serializes the ‘name’ of the enum, which might not be enough since we might also want to carry all the fields during serialization. To achieve this we need our own Gson adapter and make use of reflection.

public class EnumAdapterFactory implements TypeAdapterFactory
{

    @Override
    public <T> TypeAdapter<T> create(final Gson gson, final TypeToken<T> type)
    {
        Class<? super T> rawType = type.getRawType();
        if (rawType.isEnum())
        {
            return new EnumTypeAdapter<T>();
        }
        return null;
    }

    public class EnumTypeAdapter<T> extends TypeAdapter<T>
    {
        @Override
        public void write(JsonWriter out, T value) throws IOException
        {
            if (value == null || !value.getClass().isEnum())
            {
                out.nullValue();
                return;
            }

            try
            {
                out.beginObject();
                out.name("value");
                out.value(value.toString());
                Arrays.stream(Introspector.getBeanInfo(value.getClass()).getPropertyDescriptors())
                      .filter(pd -> pd.getReadMethod() != null && !"class".equals(pd.getName()) && !"declaringClass".equals(pd.getName()))
                      .forEach(pd -> {
                          try
                          {
                              out.name(pd.getName());
                              out.value(String.valueOf(pd.getReadMethod().invoke(value)));
                          } catch (IllegalAccessException | InvocationTargetException | IOException e)
                          {
                              e.printStackTrace();
                          }
                      });
                out.endObject();
            } catch (IntrospectionException e)
            {
                e.printStackTrace();
            }
        }

        public T read(JsonReader in) throws IOException
        {
            // Properly deserialize the input (if you use deserialization)
            return null;
        }
    }
}

usage:
Enum class:

public enum ReportTypes
{
    SP(1), CA(2), ADF(3), ORF(4), CTO(5), CDS(6), TSP(7);

    private int reportTypeId;
    ReportTypes(int reportTypeId)
    {
        this.reportTypeId = reportTypeId;
    }

    public int getReportTypeId()
    {
        return reportTypeId;
    }
}

Test Code:

    @Test
    public void testReportTypesGsonSerialization()
    {
        GsonBuilder builder = new GsonBuilder();
        builder.registerTypeAdapterFactory(new EnumAdapterFactory());
        Gson gson = builder.create();
        System.out.println(gson.toJson(ReportTypes.values()));
    }

Result:

[
  {
    "value": "SP",
    "reportTypeId": "1"
  },
  {
    "value": "CA",
    "reportTypeId": "2"
  },
  {
    "value": "ADF",
    "reportTypeId": "3"
  },
  {
    "value": "ORF",
    "reportTypeId": "4"
  },
  {
    "value": "CTO",
    "reportTypeId": "5"
  },
  {
    "value": "CDS",
    "reportTypeId": "6"
  },
  {
    "value": "TSP",
    "reportTypeId": "7"
  }
]

how DNS works: A Record, NS Record and CNAME

Ruan Yifeng (阮一峰) wrote a great blog post about DNS. I especially like its explanation of hierarchical queries and of A records, NS records and CNAME: simple and clear. So I am reposting that part below:

 

4. The hierarchy of domain names

How does a DNS server know the IP address of every domain name? The answer is hierarchical querying.

Look carefully at the earlier examples: every domain name has an extra dot at the end.

For example, the domain math.stackexchange.com is displayed as math.stackexchange.com.. This is not an oversight; every domain name actually ends with a root domain.

For instance, the real domain name of www.example.com is www.example.com.root, abbreviated as www.example.com.. Because the root domain .root is the same for every domain name, it is normally omitted.

The level below the root domain is the "top-level domain" (TLD), such as .com or .net. The level below that is the "second-level domain" (SLD), such as .example in www.example.com; this is the level users can register. The level below that is the host name, such as www in www.example.com, also called the "third-level domain"; this is a name the user assigns to a server within their own domain and can be chosen freely.

To summarize, the hierarchy of a domain name looks like this.


hostname.second-level-domain.top-level-domain.root

# i.e.

host.sld.tld.root

5. Root name servers

DNS servers perform this hierarchical query by following the levels of the domain name.

To be clear, every level of a domain has its own NS records, and the NS records point to the name servers for that level. Those servers know the various records of the next level down.

A "hierarchical query" starts at the root domain and looks up the NS records of each level in turn, until the final IP address is found. The process looks roughly like this.

  1. Ask the "root name servers" for the NS records and A records (IP addresses) of the "TLD name servers"
  2. Ask the "TLD name servers" for the NS records and A records (IP addresses) of the "second-level-domain name servers"
  3. Ask the "second-level-domain name servers" for the IP address of the "host name"

If you look carefully at the process above, you may have noticed that it never says how a DNS server knows the IP addresses of the "root name servers". The answer is that the NS records and IP addresses of the root name servers hardly ever change, so they are built into the DNS server.

Below is an example of the built-in root name server IP addresses.

The list above shows three NS records of the root domain (.root): A.ROOT-SERVERS.NET, B.ROOT-SERVERS.NET and C.ROOT-SERVERS.NET, together with their IP addresses (i.e. A records): 198.41.0.4, 192.228.79.201 and 192.33.4.12.

You can also see that the TTL of all the records is 3600000 seconds, i.e. 1000 hours. In other words, the list of root name servers is only looked up once every 1000 hours.

Currently there are thirteen groups of root name servers in the world, from A.ROOT-SERVERS.NET all the way to M.ROOT-SERVERS.NET.

6. An example of a hierarchical query

The +trace option of the dig command shows the whole hierarchical query process.


$ dig +trace math.stackexchange.com

The first section of the output lists all the NS records of the root domain ., i.e. all the root name servers.

Using the built-in root server IP addresses, the DNS server sends a query to each of them, asking for the NS records of com., the top-level domain of math.stackexchange.com. The first root server to reply is cached, and later requests go only to that server.

Then comes the second section.

The output above shows the 13 NS records of the .com domain, together with the IP address of each record.

The DNS server then queries these top-level domain name servers for the NS records of stackexchange.com, the second-level domain of math.stackexchange.com.

The output above shows that stackexchange.com has four NS records, again returned together with the IP address of each NS record.

The DNS server then asks those four NS servers for the host math.stackexchange.com.

The output above shows that math.stackexchange.com has 4 A records, i.e. the website can be reached at any of these four IP addresses. It also shows that the NS server that answered first was ns-463.awsdns-57.com, with IP address 205.251.193.207.

7. Querying NS records

The dig command can look up the NS records of each level of a domain on its own.


$ dig ns com
$ dig ns stackexchange.com

The +short option displays a simplified result.


$ dig +short ns com
$ dig +short ns stackexchange.com

8. DNS record types

The mapping between a domain name and an IP address is called a "record". Depending on the usage scenario, records come in different types; we have already seen A records and NS records.

The common DNS record types are as follows.

(1) A: address record, which returns the IP address the domain name points to.

(2) NS: name server record, which returns the address of the server holding the records for the next level of the domain. This record can only be set to a domain name, not an IP address.

(3) MX: mail exchange record, which returns the address of the server that receives e-mail for the domain.

(4) CNAME: canonical name record, which returns another domain name; that is, the domain being queried is an alias that redirects to another domain (see below).

(5) PTR: pointer record, used only for looking up a domain name from an IP address (see below).

Generally speaking, for a safe and reliable service there should be at least two NS records, and there can also be multiple A and MX records; this provides redundancy and prevents a single point of failure.

CNAME records are mainly used for internal redirection of domain names, giving the server configuration flexibility without users noticing. For example, the domain facebook.github.io is a CNAME record.


$ dig facebook.github.io

...

;; ANSWER SECTION:
facebook.github.io. 3370    IN  CNAME   github.map.fastly.net.
github.map.fastly.net.  600 IN  A   103.245.222.133

The output above shows that the CNAME record of facebook.github.io points to github.map.fastly.net. In other words, when a user queries facebook.github.io, what is actually returned is the IP address of github.map.fastly.net. The benefit is that when the server's IP address changes, only the github.map.fastly.net domain needs to be updated; the user-facing facebook.github.io does not have to change.

Because a CNAME record is simply a substitution, once a domain has a CNAME record set it cannot have any other records (such as A or MX records); this prevents conflicts. For example, if foo.com pointed to bar.com and the two domains each had their own MX records, problems would arise whenever they disagreed. Since the bare registered domain usually needs MX records, setting a CNAME record on it is generally not allowed.

PTR records are used to look up a domain name from an IP address. The -x option of the dig command queries PTR records.


$ dig -x 192.30.252.153

...

;; ANSWER SECTION:
153.252.30.192.in-addr.arpa. 3600 IN    PTR pages.github.com.

The output above shows that the domain name of the server 192.30.252.153 is pages.github.com.

One application of reverse lookups is fighting spam: verifying whether the IP address that sent an e-mail really belongs to the domain name it claims.

The dig command can also look up a specific record type.


$ dig a github.com
$ dig ns github.com
$ dig mx github.com

JPA SequenceGenerator with allocationSize 1 performance tuning

I wrote a blog post last year about fixing sequence numbers going wild by setting the allocationSize to 1.

Overall it solves the inconsistency problem if you are using a sequence with an 'INCREMENT BY' value of 1 in the database.

Issue

One problem that came up today is a performance issue with the above setting when persisting a lot of records (entities): for every entity, a 'select SEQ.nextval from DUAL' has to be performed in order to get an ID from the specified sequence. When persisting hundreds of thousands of entities, this becomes a problem.

First Try

Did some searching and tried setting my allocationSize to 500, and also increased my sequence's 'INCREMENT BY' value to 500 with

alter sequence SEQ_EQUITY_PROC_DAILY_ID increment by 500

After doing this, the saving process was much faster (about 10x). However, when I queried the database, I found another inconsistency: the sequence's next value was '2549522', but the IDs in the table were something like '1274761000'. The problem is the hi/lo style generator (MultipleHiLoPerTableGenerator), where the ID is allocationSize * sequenceValue (here 2549522 * 500 = 1274761000). This generator is perfectly fine if you have a new table whose sequence starts at 1, provided you can tolerate this kind of inconsistency between the ID value and the actual sequence value. It works like this: with the default allocation size of 50, Hibernate fetches sequence value 1 and uses IDs 1-50 for the current entities; the next round it uses 51-100 when the sequence value is 2. The drawback is that if some other JDBC connection or JPA instance uses a different setting, we will probably get ID collisions.

Solution

To solve this problem, we need to set a property in Hibernate:

properties.setProperty("hibernate.id.new_generator_mappings", Boolean.toString(true));

The ‘hibernate.id.new_generator_mappings’ property defaults to false, which selects the ‘SequenceHiLoGenerator‘ with the multiplying behavior described above. Once we set it to true, Hibernate uses the ‘SequenceStyleGenerator‘ instead, which is more JPA- and Oracle-friendly. It generates identifier values based on a sequence-style database structure; variations range from actually using a sequence to using a table to mimic one.
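
For completeness, this is roughly what the entity mapping looks like under these settings. The class and table names below are an illustrative sketch; the sequence name matches the one altered above, and allocationSize matches its INCREMENT BY value of 500:

    import javax.persistence.Entity;
    import javax.persistence.GeneratedValue;
    import javax.persistence.GenerationType;
    import javax.persistence.Id;
    import javax.persistence.SequenceGenerator;
    import javax.persistence.Table;

    // Illustrative sketch only: class and table names are made up, but the sequence
    // name is the one from the ALTER SEQUENCE above and allocationSize matches its
    // INCREMENT BY value.
    @Entity
    @Table(name = "EQUITY_PROC_DAILY")
    public class EquityProcDailyEntity
    {
        @Id
        @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "equityProcDailySeq")
        @SequenceGenerator(name = "equityProcDailySeq",
                           sequenceName = "SEQ_EQUITY_PROC_DAILY_ID",
                           allocationSize = 500)
        private Long id;

        public Long getId()
        {
            return id;
        }
    }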

3d css with perspective and transform rotate

The perspective CSS property gives an element a 3D-space by affecting the distance between the Z plane and the user.

The strength of the effect is determined by the value. The smaller the value, the closer you get to the Z plane and the more dramatic the visual effect; the greater the value, the more subtle the effect.

Important: Please note the perspective property doesn’t affect how the element itself is rendered; it simply enables a 3D space for its child elements. This is the main difference between the transform: perspective() function and the perspective property: the former gives the element itself depth, while the latter creates a 3D space shared by all of its transformed children.

/**
 * Syntax
 * perspective: none | <length>
 */

.parent {
    perspective: 1000px;
}

.child {
    transform: rotateY(50deg);
}

At the bottom are some other examples that create a cool 3D hover effect (transform: rotate(0)).

java generic multi-level wild card

I was having an interesting issue with generics, which I thought I was pretty familiar with.

Basically I need to pass a

Map<String, List<SubClass>>

to a

Map<String, List<SuperClass>>

I was thinking I could just use

Map<String, List<? extends SuperClass>>

in the method signature and I should be fine. But I got a compiler error complaining that the types are not compatible.

It turns out that the Map I pass in is actually a

HashMap<String, ArrayList<SubClass>>

where the value type ArrayList<SubClass> is not the same type as List<? extends SuperClass> (generic type parameters are invariant).

To fix this, I need to make the signature:

Map<String, ? extends List<? extends SuperClass>>
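
Here is a minimal, self-contained sketch of the situation (all class names are made up for illustration):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Minimal illustration of the multi-level wildcard issue described above.
    public class WildcardDemo
    {
        static class SuperClass {}
        static class SubClass extends SuperClass {}

        // This signature would NOT accept a Map<String, ArrayList<SubClass>>:
        // even though ArrayList<SubClass> is a subtype of List<? extends SuperClass>,
        // the Map's value type parameter itself is invariant.
        // static void process(Map<String, List<? extends SuperClass>> data) { }

        // This one works: the extra wildcard on the value type accepts any List subtype.
        static void process(Map<String, ? extends List<? extends SuperClass>> data)
        {
            data.values().forEach(list -> list.forEach(System.out::println));
        }

        public static void main(String[] args)
        {
            Map<String, ArrayList<SubClass>> byKey = new HashMap<>();
            byKey.put("key", new ArrayList<>());
            process(byKey); // compiles with the doubly-wildcarded signature
        }
    }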

 

hive jdbc with Spring Beanpropertyrowmapper

In our project we need to port some Hive table data to our local RDBMS (Oracle). For tables with a lot of columns (hundreds), it can be very tedious to write the Hive SQL and convert the ResultSet to the JPA entity object.

Spring's JdbcTemplate provides a handy class that does the camelCase-to-underscore conversion for us. To leverage that, we just need to make sure the Hive table has the same column names as the RDBMS table. Then we just need to call:

hiveJdbcTemplate.query(sql, new BeanPropertyRowMapper<>(YourOwnEntity.class));

However, you might find that the result size looks right but all the entity fields are null. That is because, beginning with Hive 0.13, a new Hive property called hive.resultset.use.unique.column.names was introduced. The default value of this property is TRUE.

If the default value is used, the Hive table name is prepended to all column names. This is a change in behavior from previous versions of Hive.

This change causes all the nulls described above: Spring converts the fields of the target entity class to underscore format (col_name1, col_name2), but when it compares them with the ResultSet, the ResultSetMetaData returned by the Hive JDBC driver contains headers like 'table.col_name1', 'table.col_name2', etc. As a result, nothing matches and the fields are all null.
To prevent the Hive table name from being prepended to the column names, use this property setting in the hive-site.xml file:

     <property>
       <name>hive.resultset.use.unique.column.names</name>
       <value>false</value>
     </property>
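
For reference, here is a rough sketch of how the Hive JdbcTemplate and BeanPropertyRowMapper fit together once the property above is set to false. The host, port, database and table names are placeholders, and YourOwnEntity is the placeholder entity from the snippet above:

    import java.util.List;

    import org.apache.hive.jdbc.HiveDriver;
    import org.springframework.jdbc.core.BeanPropertyRowMapper;
    import org.springframework.jdbc.core.JdbcTemplate;
    import org.springframework.jdbc.datasource.SimpleDriverDataSource;

    // Rough sketch only: host, port, schema and table are placeholders, and
    // YourOwnEntity stands in for your actual JPA entity. With
    // hive.resultset.use.unique.column.names=false the ResultSet headers are plain
    // column names (col_name1, ...), so BeanPropertyRowMapper can match them to the
    // entity's camelCase fields.
    public class HiveImportSketch
    {
        public List<YourOwnEntity> loadAll()
        {
            SimpleDriverDataSource dataSource = new SimpleDriverDataSource(
                    new HiveDriver(), "jdbc:hive2://your-hive-host:10000/default");
            JdbcTemplate hiveJdbcTemplate = new JdbcTemplate(dataSource);
            return hiveJdbcTemplate.query(
                    "SELECT * FROM your_schema.your_table",
                    new BeanPropertyRowMapper<>(YourOwnEntity.class));
        }
    }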