Large file from Hive to RDBMS (Oracle)

Recently we had a requirement to dump a sizable file (4+ GB) from S3 into Oracle. The file itself is Hive-compatible, so instead of downloading the file and generating SQL for it, we decided to transfer the content using Hive JDBC and persist it via JPA/Hibernate.

Hive

On the Hive side, one important thing is to make sure the fetch size is set on the JDBC ResultSet.


    hiveJdbcTemplate.query(sqlToExecute, rs -> {
        rs.setFetchSize(5000);
        while (rs.next()) {
            // ... do your handling
        }
    });

RDBMS

On the relational database side:

  1. make sure indexes are turned off during the load; otherwise each insertion will trigger a B-tree index update.
  2. make sure to leverage the Hibernate batch size setting
    hibernate.jdbc.batch_size. I set it to 50 since my table has over 200 columns (see the configuration sketch after this list). For example, if you save() 100 records and your hibernate.jdbc.batch_size is set to 50, then during flushing, instead of issuing the following SQL 100 times:

    insert into TableA (id , fields) values (1, 'val1');
    insert into TableA (id , fields) values (2, 'val2');
    insert into TableA (id , fields) values (3, 'val3');
    .........................
    insert into TableA (id , fields) values (100, 'val100');

    Hibernate will group them in batches of 50 and only issue 2 batched statements to the DB, conceptually like this:

    insert into TableA (id , fields) values (1, 'val1') , (2, 'val2') ,(3, 'val3') ,(4, 'val4') ,......,(50, 'val50')
    insert into TableA (id , fields) values (51, 'val51') , (52, 'val52') ,(53, 'val53') ,(54, 'val54'),...... ,(100, 'val100')  

    Please note that Hibernate transparently disables insert batching at the JDBC level if the primary key of the table being inserted into uses GenerationType.IDENTITY.

  3. make sure to flush()/clear() every so many records so that memory is not eaten up by the millions of objects built on the fly.
    flush() makes sure the queries are executed and the objects are saved (synced) to the DB.
    clear() clears the persistence context so all managed entities are detached; entities that have not been flushed will not be persisted.
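
For reference, a small sketch of the Hibernate properties involved (these are standard Hibernate settings; the values just mirror what is described above, and the ordering properties are optional extras):

    Properties properties = new Properties();
    // batch up to 50 inserts per JDBC round trip
    properties.setProperty("hibernate.jdbc.batch_size", "50");
    // group inserts/updates by entity type so they batch better (optional)
    properties.setProperty("hibernate.order_inserts", "true");
    properties.setProperty("hibernate.order_updates", "true");
    // note: with GenerationType.IDENTITY ids Hibernate silently disables insert batching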

My main code is something like:


    public int doImport(int limit)
    {
        String sql = "SELECT * FROM erd.ERD_PRDCT_FIXED_INCM_MNCPL_HS_prc_txt";
        if (limit >= 0)
        {
            sql = sql + " LIMIT " + limit;
        }
        HiveBeanPropertyRowMapper<SrcErdFixedIncmMuniEntity> mapper = new HiveBeanPropertyRowMapper<>(SrcErdFixedIncmMuniEntity.class);
        int batchSize = 5000;
        int[] inc = {0};
        Instant start = Instant.now();
        List<SrcErdFixedIncmMuniEntity> listToPersist = new ArrayList<>(batchSize);
        hiveJdbcTemplate.query(sql, (rs) -> {
            rs.setFetchSize(batchSize);
            while (rs.next())
            {
                listToPersist.add(mapper.mapRow(rs, -1));
                inc[0]++;
                if (inc[0] % batchSize == 0)
                {
                    persistAndClear(inc, listToPersist);
                }
            }
            //left overs(last n items)
            if(!listToPersist.isEmpty())
            {
                persistAndClear(inc, listToPersist);
            }
            return null;
        });
        Instant end = Instant.now();
        System.out.println("Data Intake took: " + Duration.between(start, end));
        return inc[0];
    }

    private void persistAndClear(int[] inc, List<SrcErdFixedIncmMuniEntity> listToPersist)
    {
        listToPersist.stream().forEach(em::persist);
        em.flush();
        em.clear();
        LOGGER.info("Saved record milestone: " + inc[0]);
        listToPersist.clear();
    }

Result

Not bad: ~3.5 million records get loaded in about an hour.

Optimize JPA one-to-many with fetch

N+1

We all know the famous N+1 problem in the ORM world: the generated SQL is not optimized, so the child collections are fetched with N separate SQL queries, which leads to serious performance problems when the object tree has many levels.

So we can fix the problem in different ways

Let’s say we have a Parent object with a Set of Child objects.
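
For context, a minimal sketch of such a mapping (names and column details are illustrative, not taken from a real schema):

    @Entity
    public class Parent
    {
        @Id
        @GeneratedValue
        private Long id;

        // LAZY one-to-many: without fetch tuning, accessing the children of N parents
        // triggers N extra queries -- the classic N+1 problem.
        @OneToMany(mappedBy = "parent", fetch = FetchType.LAZY)
        private Set<Child> children = new HashSet<>();
    }

    @Entity
    public class Child
    {
        @Id
        @GeneratedValue
        private Long id;

        @ManyToOne(fetch = FetchType.LAZY)
        @JoinColumn(name = "parent_id")
        private Parent parent;
    }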

1. Use JPA criteria fetch call

We can explicitly tell the criteria to do fetch instead of having separate query:

    public List<Parent> findAllParent()
    {
        CriteriaBuilder cb = entityManager.getCriteriaBuilder();
        CriteriaQuery<Parent> cq = cb.createQuery(Parent.class);
        Root<Parent> root = cq.from(Parent.class);
        // fetch the children in the same query instead of issuing N extra selects
        root.fetch(Parent_.Child, JoinType.INNER);
        cq.select(root).distinct(true);
        return entityManager.createQuery(cq).getResultList();
    }

Note that the distinct(true) is important here, otherwise you get a lot of duplicate rows (reference HERE).

2. Use @Fetch(SUBSELECT or JOIN)

JPA doesn’t provide any specification for mapping annotations to select a fetch strategy. In general, related entities can be fetched in any one of the ways given below:

  • SELECT => one query for root entities + one query for related mapped entity/collection of each root entity = (n+1) queries, this is the DEFAULT
  • SUBSELECT => one query for root entities + second query for related mapped entity/collection of all root entities retrieved in first query = 2 queries
  • JOIN => one query to fetch both root entities and all of their mapped entity/collection = 1 query

So SELECT and JOIN are two extremes and SUBSELECT falls in between. One can choose suitable strategy based on her/his domain model.

By default SELECT is used by both JPA/EclipseLink and Hibernate. This can be overridden by using

@Fetch(FetchMode.JOIN)
@Fetch(FetchMode.SUBSELECT)

3. Use @Fetch(SELECT) + @BatchSize

Hibernate also allows setting SELECT mode explicitly using @Fetch(FetchMode.SELECT), which can be tuned with a batch size, e.g. @BatchSize(size=10), as sketched below.
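
A sketch of where these annotations sit (reusing the illustrative Parent/Child mapping above; the two variants are alternatives for the same collection, not meant to coexist):

    // Variant A: SUBSELECT -- children of all loaded parents fetched with one extra query
    @OneToMany(mappedBy = "parent")
    @Fetch(FetchMode.SUBSELECT)
    private Set<Child> children;

    // Variant B: SELECT + batching -- children loaded for up to 10 parents per query
    @OneToMany(mappedBy = "parent")
    @Fetch(FetchMode.SELECT)
    @BatchSize(size = 10)
    private Set<Child> children;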

Corresponding annotations in EclipseLink are

@JoinFetch
@BatchFetch

JPA hibernate @GeneratedValue with sequence on Oracle

When using Oracle, the JPA-with-Hibernate behavior of @GeneratedValue with a sequence can be quite tricky. I am trying to make a summary here. The versions used are JPA 2.1 and Hibernate 4.3.11.Final.

Default generator setting

@GeneratedValue with strategy AUTO and default allocationSize(50)

This will use the ‘org.hibernate.id.SequenceGenerator’ which will get a SEQ+1 value.

@GeneratedValue with strategy SEQUENCE and default allocationSize(50)

This will use the ‘org.hibernate.id.SequenceHiLoGenerator’; because the maxLo value in this class is 49 (due to the default allocationSize of 50), it will return a SEQ*50 value.

@GeneratedValue with strategy AUTO and allocationSize=1

This behaves the same as with the default allocationSize, using ‘org.hibernate.id.SequenceGenerator’, which gives SEQ+1.

@GeneratedValue with strategy SEQUENCE and allocationSize=1

This would use the ‘org.hibernate.id.SequenceHiLoGenerator’, but since we set the allocationSize to 1, it does not use the HiLo optimizer and instead delegates the work to the ‘SequenceGenerator’, so we get the same SEQ+1 value.

 

hibernate.id.new_generator_mappings set to true

@GeneratedValue with strategy AUTO/SEQUENCE and default allocationSize(50)

This will use ‘org.hibernate.id.enhanced.SequenceStyleGenerator’, which subtracts the allocationSize from the current sequence value, so we get: SEQ-49.

@GeneratedValue with strategy AUTO/SEQUENCE and allocationSize=1

This would also use the ‘org.hibernate.id.enhanced.SequenceStyleGenerator’, but since we set the allocationSize to 1, it returns the regular SEQ+1 value.

 

Summary

As we can see from the inconsistent behaviors above, the best way IMO is to set the new ‘hibernate.id.new_generator_mappings’ to true to leverage the new ‘SequenceStyleGenerator’, where we have the flexibility to use a regular sequence but can also bump up the sequence value with a larger allocationSize if we need to persist a lot of entities at one time. The only thing we need to pay attention to is not to forget to add allocationSize=1 to the existing annotations.
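
Putting that together, a sketch of the combination (generator and sequence names here are illustrative):

    // Hibernate property -- switch to the enhanced SequenceStyleGenerator
    properties.setProperty("hibernate.id.new_generator_mappings", "true");

    // entity mapping -- plain one-at-a-time sequence usage
    @Id
    @SequenceGenerator(name = "MY_TABLE_ID_GEN", sequenceName = "SEQ_MY_TABLE_ID", allocationSize = 1)
    @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "MY_TABLE_ID_GEN")
    private Long id;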

If we set this attribute to true, then for the existing entities using the HiLo generator we could either 1. bump up the sequence and set allocationSize to 50, OR 2. use the @GenericGenerator annotation to keep using SequenceHiLoGenerator:

    @GenericGenerator(name = "RPT_QMRC_SCRT_PRDCT_6_ID_GEN", strategy = "org.hibernate.id.SequenceHiLoGenerator",
        parameters =  {@Parameter(name = "sequence", value = "SEQ_RPT_QMRC_SCRT_PRDCT_6_ID"), @Parameter(name = "max_lo", value = "50")})
    @Id
    @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "RPT_QMRC_SCRT_PRDCT_6_ID_GEN")
    @Column(name = "RPT_QMRC_SCRT_PRDCT_6_ID", nullable = false, insertable = true, updatable = true)
    public Long getRptQmrcScrtPrdct6Id()
    {
        return rptQmrcScrtPrdct6Id;
    }

JPA SequenceGenerator with allocationSize 1 performance tuning

I had a blog post last year about fixing the sequence number going wild by setting the allocationSize to 1.

Overall it solves the inconsistency problem if you are using a sequence with an ‘INCREMENT BY’ value of 1 in the database.

Issue

One problem that came up today is a performance issue with the above setting when trying to persist a lot of records (entities), because every entity needs to perform a ‘select SEQ.nextval from DUAL’ in order to get an ID from the specified sequence. So when persisting hundreds of thousands of entities, this becomes a problem.

First Try

After some searching, I tried setting my allocationSize to 500 and also increased my sequence’s ‘INCREMENT BY’ value to 500 with:

alter sequence SEQ_EQUITY_PROC_DAILY_ID increment by 500

By doing this, the saving process became much faster (about 10 times). However, when I queried the database, I found another inconsistency: my sequence’s next value is ‘2549522’ but the IDs in the DB table are something like ‘1274761000’. The problem comes from the HiLo-style generator (MultipleHiLoPerTableGenerator), where the ID will be allocationSize*sequenceValue. This generator is perfectly fine if you have a new table whose sequence starts at 1, given that you can tolerate this kind of inconsistency between the ID value and the actual sequence value. How it works is: by default we have allocation size 50, so Hibernate gets sequence value 1 and uses IDs 1-50 for the current entities; the next round it uses 51-100 when the sequence value is 2. The drawback is that if some other JDBC connection or JPA configuration uses a different setting, we will probably get ID collisions.

Solution

To solve this problem, we need to set a property in hibernate:

properties.setProperty("hibernate.id.new_generator_mappings", Boolean.toString(true));

This ‘hibernate.id.new_generator_mappings’ property is false by default, which uses the ‘SequenceHiLoGenerator‘ with the multiplying behavior described above. Once we set it to true, it will use the ‘SequenceStyleGenerator‘ instead, which is more JPA- and Oracle-friendly. It generates identifier values based on a sequence-style database structure; variations range from actually using a sequence to using a table to mimic a sequence.
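
So the final setup here, roughly, pairs the property above with an allocationSize that matches the sequence’s INCREMENT BY (the sequence name is the one altered earlier; the rest of the mapping is illustrative):

    // allocationSize must match the "INCREMENT BY 500" on SEQ_EQUITY_PROC_DAILY_ID
    @Id
    @SequenceGenerator(name = "EQUITY_PROC_DAILY_ID_GEN",
                       sequenceName = "SEQ_EQUITY_PROC_DAILY_ID",
                       allocationSize = 500)
    @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "EQUITY_PROC_DAILY_ID_GEN")
    private Long equityProcDailyId;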

PESSIMISTIC_READ and PESSIMISTIC_WRITE

Readers–writer lock

A database system is a highly concurrent environment, therefore many concurrency theory idioms apply to database access as well. Concurrent changes must be serialized to preserve data integrity, so most database systems use a two-phase locking strategy, even if it’s usually supplemented by a Multiversion concurrency control mechanism.

Because mutual exclusion locking would hinder scalability (by treating reads and writes equally), most database systems use a readers-writer locking synchronization scheme, so that:

  • A shared (read) lock blocks writers, allowing multiple readers to proceed
  • An exclusive (write) lock blocks both readers and writers, making all write operations be applied sequentially

Because the locking syntax is not part of the SQL Standard, each RDBMS has opted for a different syntax:

Database name        | Shared lock statement    | Exclusive lock statement
Oracle               | FOR UPDATE               | FOR UPDATE
MySQL                | LOCK IN SHARE MODE       | FOR UPDATE
Microsoft SQL Server | WITH (HOLDLOCK, ROWLOCK) | WITH (UPDLOCK, ROWLOCK)
PostgreSQL           | FOR SHARE                | FOR UPDATE
DB2                  | FOR READ ONLY WITH RS    | FOR UPDATE WITH RS

The Java Persistence abstraction layer hides the database-specific locking semantics, offering a common API that only requires two lock modes. The shared/read lock is acquired using the PESSIMISTIC_READ lock mode type, and the exclusive/write lock is requested using PESSIMISTIC_WRITE instead.
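
In plain JPA this looks roughly like the snippet below (the entity and id are illustrative):

    // shared (read) lock -- e.g. SELECT ... FOR SHARE on PostgreSQL
    Product product = entityManager.find(Product.class, 1L, LockModeType.PESSIMISTIC_READ);

    // exclusive (write) lock -- e.g. SELECT ... FOR UPDATE
    Product locked = entityManager.find(Product.class, 1L, LockModeType.PESSIMISTIC_WRITE);

    // or lock an entity that is already managed
    entityManager.lock(product, LockModeType.PESSIMISTIC_WRITE);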

PostgreSQL row-level lock modes

For the next test cases, we are going to use PostgreSQL, since it supports both exclusive and shared explicit locking.

All the following tests will use the same concurrency utility, emulating two users: Alice and Bob. Each test scenario will verify a specific read/write locking combination.
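
The concurrency utility itself is not reproduced here; a rough sketch of what it could look like (sessionFactory, Product and LOGGER are assumed to exist, and the real implementation may differ):

    // Hypothetical helper: Alice runs in the calling thread, Bob in a second thread,
    // so lock contention between the two transactions can be observed.
    private void testPessimisticLocking(BiConsumer<Session, Product> aliceAction,
                                        BiConsumer<Session, Product> bobAction) throws InterruptedException
    {
        CountDownLatch aliceHoldsLock = new CountDownLatch(1);
        CountDownLatch bobFinished = new CountDownLatch(1);

        Thread bob = new Thread(() -> {
            Session session = sessionFactory.openSession();
            Transaction tx = session.beginTransaction();
            try
            {
                aliceHoldsLock.await();                     // wait until Alice holds her lock
                Product product = session.get(Product.class, 1L);
                bobAction.accept(session, product);         // Bob attempts his lock/update
                tx.commit();
            }
            catch (Exception e)
            {
                tx.rollback();
                LOGGER.info("Bob's request failed: " + e.getMessage());
            }
            finally
            {
                session.close();
                bobFinished.countDown();
            }
        });
        bob.start();

        Session session = sessionFactory.openSession();
        Transaction tx = session.beginTransaction();
        try
        {
            Product product = session.get(Product.class, 1L);
            aliceAction.accept(session, product);           // Alice acquires her lock first
            aliceHoldsLock.countDown();
            bobFinished.await(5, TimeUnit.SECONDS);         // hold the lock while Bob tries
            tx.commit();
        }
        finally
        {
            session.close();
        }
        bob.join();
    }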

Case 1: PESSIMISTIC_READ doesn’t block PESSIMISTIC_READ lock requests

The first test will check how two concurrent PESSIMISTIC_READ lock requests interact:

@Test
public void testPessimisticReadDoesNotBlockPessimisticRead() throws InterruptedException {
    LOGGER.info("Test PESSIMISTIC_READ doesn't block PESSIMISTIC_READ");
    testPessimisticLocking(
        (session, product) -> {
            session.buildLockRequest(new LockOptions(LockMode.PESSIMISTIC_READ)).lock(product);
            LOGGER.info("PESSIMISTIC_READ acquired");
        },
        (session, product) -> {
            session.buildLockRequest(new LockOptions(LockMode.PESSIMISTIC_READ)).lock(product);
            LOGGER.info("PESSIMISTIC_READ acquired");
        }
    );
}

Running this test, there is no contention whatsoever: both Alice and Bob can acquire a shared lock without running into any conflict.

Case 2: PESSIMISTIC_READ blocks UPDATE implicit lock requests

The second scenario will demonstrate how the shared lock prevents a concurrent modification. Alice will acquire a shared lock and Bob will attempt to modify the locked entity:

@Test
public void testPessimisticReadBlocksUpdate() throws InterruptedException {
    LOGGER.info("Test PESSIMISTIC_READ blocks UPDATE");
    testPessimisticLocking(
        (session, product) -> {
            session.buildLockRequest(new LockOptions(LockMode.PESSIMISTIC_READ)).lock(product);
            LOGGER.info("PESSIMISTIC_READ acquired");
        },
        (session, product) -> {
            product.setDescription("USB Flash Memory Stick");
            session.flush();
            LOGGER.info("Implicit lock acquired");
        }
    );
}

While Bob can select the Product entity, the UPDATE is delayed until Alice’s transaction is committed (which is why the UPDATE took 427 ms to run in this test).

Case 3: PESSIMISTIC_READ blocks PESSIMISTIC_WRITE lock requests

The same behaviour is exhibited by a secondary PESSIMISTIC_WRITE lock request:

@Test
public void testPessimisticReadBlocksPessimisticWrite() throws InterruptedException {
    LOGGER.info("Test PESSIMISTIC_READ blocks PESSIMISTIC_WRITE");
    testPessimisticLocking(
        (session, product) -> {
            session.buildLockRequest(new LockOptions(LockMode.PESSIMISTIC_READ)).lock(product);
            LOGGER.info("PESSIMISTIC_READ acquired");
        },
        (session, product) -> {
            session.buildLockRequest(new LockOptions(LockMode.PESSIMISTIC_WRITE)).lock(product);
            LOGGER.info("PESSIMISTIC_WRITE acquired");
        }
    );
}

Bob’s exclusive lock request waits for Alice’s shared lock to be released.

Case 4: PESSIMISTIC_READ blocks PESSIMISTIC_WRITE lock requests, NO WAIT fails fast

Hibernate provides a PESSIMISTIC_NO_WAIT timeout directive, which translates to a database-specific NO WAIT lock acquisition policy.

The PostgreSQL NO WAIT directive is described as follows:

To prevent the operation from waiting for other transactions to commit, use the NOWAIT option. With NOWAIT, the statement reports an error, rather than waiting, if a selected row cannot be locked immediately. Note that NOWAIT applies only to the row-level lock(s) — the required ROW SHARE table-level lock is still taken in the ordinary way (see Chapter 13). You can use LOCK with the NOWAIT option first, if you need to acquire the table-level lock without waiting.

@Test
public void testPessimisticReadWithPessimisticWriteNoWait() throws InterruptedException {
    LOGGER.info("Test PESSIMISTIC_READ blocks PESSIMISTIC_WRITE, NO WAIT fails fast");
    testPessimisticLocking(
        (session, product) -> {
            session.buildLockRequest(new LockOptions(LockMode.PESSIMISTIC_READ)).lock(product);
            LOGGER.info("PESSIMISTIC_READ acquired");
        },
        (session, product) -> {
            session.buildLockRequest(new LockOptions(LockMode.PESSIMISTIC_WRITE))
                .setTimeOut(Session.LockRequest.PESSIMISTIC_NO_WAIT)
                .lock(product);
            LOGGER.info("PESSIMISTIC_WRITE acquired");
        }
    );
}

Since Alice already holds a shared lock on the database row associated with the Product entity, Bob’s exclusive lock request fails immediately.
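
For reference, the plain JPA equivalent relies on the standard lock timeout hint, where a value of 0 is treated as "no wait" (a sketch; the exact exception type may vary by provider):

    Map<String, Object> hints = new HashMap<>();
    hints.put("javax.persistence.lock.timeout", 0);   // 0 ms: fail immediately instead of blocking

    // throws a PessimisticLockException/LockTimeoutException if the row is already locked
    Product product = entityManager.find(Product.class, 1L, LockModeType.PESSIMISTIC_WRITE, hints);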

Case 5: PESSIMISTIC_WRITE blocks PESSIMISTIC_READ lock requests

The next test proves that an exclusive lock will always block a shared lock acquisition attempt:

@Test
public void testPessimisticWriteBlocksPessimisticRead() throws InterruptedException {
    LOGGER.info("Test PESSIMISTIC_WRITE blocks PESSIMISTIC_READ");
    testPessimisticLocking(
        (session, product) -> {
            session.buildLockRequest(new LockOptions(LockMode.PESSIMISTIC_WRITE)).lock(product);
            LOGGER.info("PESSIMISTIC_WRITE acquired");
        },
        (session, product) -> {
            session.buildLockRequest(new LockOptions(LockMode.PESSIMISTIC_READ)).lock(product);
            LOGGER.info("PESSIMISTIC_WRITE acquired");
        }
    );
}

Bob’s shared lock request waits for Alice’s transaction to end, so that all acquired locks are released.

Case 6: PESSIMISTIC_WRITE blocks PESSIMISTIC_WRITE lock requests

An exclusive lock blocks an exclusive lock as well:

@Test
public void testPessimisticWriteBlocksPessimisticWrite() throws InterruptedException {
    LOGGER.info("Test PESSIMISTIC_WRITE blocks PESSIMISTIC_WRITE");
    testPessimisticLocking(
            (session, product) -> {
                session.buildLockRequest(new LockOptions(LockMode.PESSIMISTIC_WRITE)).lock(product);
                LOGGER.info("PESSIMISTIC_WRITE acquired");
            },
            (session, product) -> {
                session.buildLockRequest(new LockOptions(LockMode.PESSIMISTIC_WRITE)).lock(product);
                LOGGER.info("PESSIMISTIC_WRITE acquired");
            }
    );
}

Bob’s exclusive lock request has to wait for Alice to release her lock.

Conclusion

Relational database systems use locks to preserve ACID guarantees, so it’s important to understand how shared and exclusive row-level locks interoperate. An explicit pessimistic lock is a very powerful database concurrency control mechanism, and you might even use it to fix an optimistic locking race condition.

FROM HERE

Fix Hibernate JPA sequence generating odd id with allocationSize

I configured my JPA entity class as usual, using the IntelliJ-generated stub from the DB schema, with the config below:


@Entity
@javax.persistence.Table(name = "IMPORT_RUN", schema = "SCRCOWNER", catalog = "")
public class ImportRunEntity
{
    private Long importRunId;

    @SequenceGenerator(name = "IMPORT_RUN_ID_GEN", sequenceName = "SEQ_IMPORT_RUN_ID")
    @Id
    @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "IMPORT_RUN_ID_GEN")
    @javax.persistence.Column(name = "IMPORT_RUN_ID", nullable = false, insertable = true, updatable = true, precision = -127)
    public Long getImportRunId()
    {
        return importRunId;
    }

    public void setImportRunId(Long importRunId)
    {
        this.importRunId = importRunId;
    }
.....
}

However, the ids that get inserted into the DB are way different (larger) than they should be. For example, the id should be 10012 but it becomes 5223326. This odd number looks like it comes from some other sequence.

To fix this, I have to add the

    @SequenceGenerator(name = "IMPORT_RUN_ID_GEN", sequenceName = "SEQ_IMPORT_RUN_ID", allocationSize=1)

to the @SequenceGenerator to make sure it uses its own sequence and increments by 1. By default the allocationSize is 50.

The Javadoc of the javax.persistence.SequenceGenerator class provided a partial answer, because I found out that it contains a property called allocationSize, which is specified as follows:

(Optional) The amount to increment by when allocating sequence numbers from the sequence.

After this modification, the value inserted into the id column of the Persons table was the same as the value fetched from the person_sequence sequence. I am not sure how common this problem is, but I have never encountered it when building web applications using Hibernate without JPA. This leads me to wonder whether this is a Hibernate-specific problem or whether other JPA 2 implementations behave the same way. However, in my opinion the default value of the allocationSize property should be one, because generating ids from a sequence one at a time is the most common use case.
And this is a related discussion
Some useful notes:

allocationSize is a range of primary key values reserved for Hibernate, and the ‘select ….nextval from dual’ will be done only after Hibernate has consumed this range of primary keys.

So you MUST declare the same value for both allocationSize (Hibernate) AND the sequence increment (DB).

When you explicitly set allocationSize=500, e.g. on Oracle:

create sequence SEQ_ACE_WORKERS_QUEUE_STATS_ID
       MINVALUE 1 
       MAXVALUE 999999999999999999999999999 
       START WITH 1
       INCREMENT BY 500 
       NOCACHE 
       NOCYCLE;

Otherwise, you’ll notice negative values or constraint errors raised from your DB because of primary key collisions.

When the app server is restarted, you’ll notice a ‘jump’ between the latest primary key allocated and the new sequence number selected at restart.

Final comment: the default value is 50. So if you don’t specify allocationSize on the Hibernate side, you MUST declare INCREMENT BY 50 on the DB side.

For clustering

If you’re using load-balancers, but the DB is shared between the instances, you cannot cache the sequences’ values.

You have to ask for the next sequence value each time you insert into the DB, and each insert has to be inside a transaction.
Hibernate (by default) has allocationSize = 50, so to fix the problem, you have to explicitly set it to 1:

 @SequenceGenerator(allocationSize=1 ...)

Hi/Lo generator

Using the JPA @SequenceGenerator along with the legacy Hibernate identifiers will give you the SequenceHiLoGenerator, which applies a HI/LO optimization algorithm anyway.

But for Sequences, the actual identifier value is applied during flush-time, so you won’t get the actual value until the session gets flushed (a manual flush or the commit-time flush).

For the IDENTITY generator, you get the identifier generated prior to flushing, but that disables JDBC batching, so it’s not a silver bullet either.

If you want full control, you need to resort to assigned identifiers, and UUID surrogate keys are perfect for this job.
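
A minimal sketch of an assigned UUID identifier (entity name and column details are illustrative):

    @Entity
    public class Document
    {
        // assigned in application code, so the id is known immediately
        // and JDBC insert batching stays enabled
        @Id
        @Column(name = "DOCUMENT_ID", length = 36)
        private String id = UUID.randomUUID().toString();
    }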

memcached usage

memcached shares similar logic with Cassandra when locating a record: the hash of the key is used to find which server the record could be on.

It could be used to cache frequently executed queries (objects) or server-expensive HTML content we have to generate (JSP etc.), for example as sketched below.
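
With a Java client such as spymemcached, caching an expensive query result looks roughly like this (just a sketch: the local server address, key, TTL and runExpensiveQuery() are all illustrative):

    // connect to a memcached node; the client hashes each key to pick the server
    MemcachedClient client = new MemcachedClient(new InetSocketAddress("localhost", 11211));

    String key = "product-list-top100";
    @SuppressWarnings("unchecked")
    List<Product> products = (List<Product>) client.get(key);   // cache lookup
    if (products == null)
    {
        products = runExpensiveQuery();          // cache miss: run the expensive query
        client.set(key, 3600, products);         // cache the result for one hour
    }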

More about the usage for Java applications is here: part1 for the concept and part2 for usage with Hibernate.