Generating a Large File via Hive JDBC

Recently in our project, we had a requirement to generate a comparatively large file via Hive JDBC for web app users to download; by large I mean millions of rows.

Our previous use cases were all small files of fewer than maybe 50,000 rows, so our approach was to grab all the rows into memory and dump them into the file at once. Now that we have far more rows, that approach can easily eat up 2-4 GB of RAM. One change we made was to use a PrintWriter wrapping a FileOutputStream, so we can append content and flush it to the actual file as we go.
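Here is a minimal sketch of that streaming approach. The class name, file path, two-column CSV layout, and flush interval are all illustrative assumptions, not our actual code:

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintWriter;
import java.sql.ResultSet;
import java.sql.SQLException;

class ResultSetFileDumper {
    // Stream each row straight to disk instead of accumulating everything in memory.
    static void dump(ResultSet rs, String path) throws IOException, SQLException {
        // The second constructor argument opens the file in append mode.
        try (PrintWriter writer = new PrintWriter(new FileOutputStream(path, true))) {
            long count = 0;
            while (rs.next()) {
                writer.println(rs.getString(1) + "," + rs.getString(2));
                if (++count % 10_000 == 0) {
                    writer.flush(); // push buffered content to the real file on the go
                }
            }
            writer.flush();
        }
    }
}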


Another important factor is the JDBC fetch size. By default, the Hive JDBC driver sets the fetchSize to 50, which is way too low if we need to grab millions of rows. Once we set it to 500, the processing time decreased from 16 min to 4.5 min. At 5000 it became 3 min 25 sec, but at 10,000 it went back up to 3 min 40 sec, so we figure 5000 is the sweet spot in our case.

hiveJdbcTemplate.query(sqlToExecute, (ResultSet rs) -> {
    rs.setFetchSize(5000);
    while (rs.next()) {
        // ... do your row handling here
    }
    return null; // satisfy the ResultSetExtractor signature
});

Note: as of Hive 1.3.0, the default fetchSize has been increased to 1000 based on this ticket. We are still on Hive 0.13.0, so the low default still applies to us; it also affects 0.14.0, 1.0.0, 1.2.0, and 1.2.1.

I also tried adding fetchSize=5000 as a parameter in the JDBC connection string, but it does not seem to be honored by the driver (jdbc:hive2://{0}:{1}/{2};ssl=true;fetchSize=5000).
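Since the URL parameter is ignored in our version, a plain-JDBC fallback is to set the fetch size programmatically on the Statement, the same idea as the ResultSet call above. A sketch, with the host, credentials, and query as placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

class HiveFetchSizeExample {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:hive2://host:10000/default;ssl=true"; // placeholder host/port/db
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement stmt = conn.createStatement()) {
            stmt.setFetchSize(5000); // set programmatically since the URL parameter is not honored
            try (ResultSet rs = stmt.executeQuery("SELECT * FROM some_table")) {
                while (rs.next()) {
                    // ... do your row handling here
                }
            }
        }
    }
}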

Here is the related source code.

