Optimize JPA one to many with fetch

 N+1

We all know the famous N+1 problem in the ORM world: the generated SQL is not optimized, so the child collections are loaded with N separate SQL queries, which leads to serious performance problems when the object tree has many levels.

We can fix the problem in a few different ways.

Let’s say we have a Parent entity with a Set of Child objects.

1. Use JPA criteria fetch call

We can explicitly tell the criteria query to fetch the children instead of issuing separate queries:

    public List<Parent> findAllParent()
    {
        CriteriaBuilder cb = entityManager.getCriteriaBuilder();
        CriteriaQuery<Parent> cq = cb.createQuery(Parent.class);
        Root<Parent> root = cq.from(Parent.class);
        // Fetch the children in the same query (use JoinType.LEFT if parents
        // without children should also be returned).
        root.fetch(Parent_.Child, JoinType.INNER);
        // The join produces one row per child, so de-duplicate the parents.
        cq.distinct(true);
        return entityManager.createQuery(cq).getResultList();
    }

Note that distinct(true) is important here, otherwise you get a lot of duplicate Parent rows; see the reference HERE.
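
For comparison, the same fetch can be expressed in JPQL with a fetch join. This is only a minimal sketch, assuming the Parent entity maps its Set of Child objects as a field named children:

    public List<Parent> findAllParentJpql()
    {
        // "join fetch" loads the children in the same SQL query; "distinct"
        // removes the duplicate Parent rows produced by the join, just like
        // distinct(true) in the criteria version above.
        return entityManager.createQuery(
                "select distinct p from Parent p join fetch p.children", Parent.class)
            .getResultList();
    }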

2. Use @Fetch(SUBSELECT or JOIN)

JPA doesn’t specify any mapping annotation for selecting a fetch strategy. In general, related entities can be fetched in any one of the ways given below:

  • SELECT => one query for root entities + one query for related mapped entity/collection of each root entity = (n+1) queries, this is the DEFAULT
  • SUBSELECT => one query for root entities + second query for related mapped entity/collection of all root entities retrieved in first query = 2 queries
  • JOIN => one query to fetch both root entities and all of their mapped entity/collection = 1 query

So SELECT and JOIN are two extremes and SUBSELECT falls in between. One can choose a suitable strategy based on the domain model.

By default SELECT is used by both EclipseLink and Hibernate. In Hibernate this can be overridden by using:

@Fetch(FetchMode.JOIN)
@Fetch(FetchMode.SUBSELECT)
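
For example, a rough sketch of a SUBSELECT mapping on the Parent/Child model above (the field and mappedBy names are assumptions, not taken from the original entities):

    import java.util.Set;
    import javax.persistence.*;
    import org.hibernate.annotations.Fetch;
    import org.hibernate.annotations.FetchMode;

    @Entity
    public class Parent
    {
        @Id
        private Long id;

        // SUBSELECT: the children of all parents returned by the first query
        // are loaded together in one extra query (2 queries in total).
        @OneToMany(mappedBy = "parent")
        @Fetch(FetchMode.SUBSELECT)
        private Set<Child> children;
    }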

3. Use @Fetch(SELECT) + @BatchSize

Hibernate also allows setting the SELECT mode explicitly using @Fetch(FetchMode.SELECT), which can be tuned with a batch size, e.g. @BatchSize(size=10).
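
A minimal sketch of that combination, again with assumed field names on the example Parent/Child model:

    import java.util.Set;
    import javax.persistence.*;
    import org.hibernate.annotations.BatchSize;
    import org.hibernate.annotations.Fetch;
    import org.hibernate.annotations.FetchMode;

    @Entity
    public class Parent
    {
        @Id
        private Long id;

        // Still SELECT fetching, but the children of up to 10 parents are
        // loaded per query instead of one query per parent.
        @OneToMany(mappedBy = "parent")
        @Fetch(FetchMode.SELECT)
        @BatchSize(size = 10)
        private Set<Child> children;
    }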

Corresponding annotations in EclipseLink are

@JoinFetch
@BatchFetch
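
For example, a sketch using EclipseLink's join-fetch annotation (based on the documented org.eclipse.persistence.annotations API; I have not run this against EclipseLink myself):

    import java.util.Set;
    import javax.persistence.*;
    import org.eclipse.persistence.annotations.JoinFetch;
    import org.eclipse.persistence.annotations.JoinFetchType;

    @Entity
    public class Parent
    {
        @Id
        private Long id;

        // OUTER join fetch so that parents without children are still returned.
        @OneToMany(mappedBy = "parent")
        @JoinFetch(JoinFetchType.OUTER)
        private Set<Child> children;
    }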

My Postgresql snippets

Install psql

To get psql, I simply installed Postgres.app for OS X, which bundles psql. We can start psql from its GUI, or add its bin directory to our PATH. The specific path is: /Applications/Postgres.app/Contents/Versions/latest/bin

You can also get psql by installing the PostgreSQL one-click installer.

Connect to DB with psql

psql -h DbUrl -p DbPort -U DbUser

Typically a password prompt comes up (if not, add -W to force it).

skip password

If we need to avoid entering the password interactively, we can set it in the PGPASSWORD environment variable so that psql picks it up:

export PGPASSWORD='password'

For all other environment variables, reference here.

run batch sql file

append -a -f YourScript.sql to the above command.

-a will print all nonempty input lines to standard output as they are read.

After the file is processed, psql terminates.

Or in the psql shell, \i path_to_sql_file

quit psql shell

use \q (or \quit)

meta sqls

All DBs: select datname from pg_database;

All Tables: select tablename from pg_tables where tableowner='YOUROWNER';

Describe table: select column_name, data_type, character_maximum_length from INFORMATION_SCHEMA.COLUMNS where table_name = 'YourTableName';
Or use \d schemaName.tableName in the psql shell.

install ora2pg on mac osx

Recently we needed to migrate our on-prem Oracle database to PostgreSQL on AWS RDS. ora2pg looks like a nice little tool to mitigate the problem, though it is not perfect.

I am running OS X 10.10.5 Yosemite. It already comes with perl v5.18.2 as well as cpan for installing packages.

download/install Oracle client

The official document is quite vague about what the Oracle client libraries actually are. It turns out we need to download 3 Instant Client packages from the Oracle website; for my use case we have Oracle 12c, so my version is 12.1.0.2. The packages you need are:

  1. Basic
  2. SQL*Plus
  3. SDK

Then we unzip them to the SAME directory, e.g. /Users/liha/develop/instantclient_12_1.

Now we need to set 2 env variables for installation:

export LD_LIBRARY_PATH=/Users/liha/develop/instantclient_12_1
export ORACLE_HOME=/Users/liha/develop/instantclient_12_1

install DBI/DBD

Now we need to install the DBI/DBD modules that are required during the ora2pg Makefile phase, especially DBD::Oracle. DBD::Pg is optional, needed only if you want to dump data/schema live rather than offline to a file.

# perl -MCPAN -e shell
…………
cpan> install DBI
…………
cpan> install DBD::Oracle
…………
cpan> install DBD::Pg
…………

Check module

Now we can check whether the dependencies are correctly installed by running a perl script.

#!/usr/bin/perl
# List all installed Perl modules so we can verify DBI/DBD::Oracle are present.
use ExtUtils::Installed;
my $inst = ExtUtils::Installed->new();
print join "\n", $inst->modules();

The result should look something like this:

➜  ora2pg perl test.pl
Bundle::NetSNMP
DBD::Oracle
DBI
Ora2Pg
Perl

install ora2Pg

Now we can install the real thing: download and extract the ora2Pg tar.bz2 and run its Makefile.PL, which generates the Makefile that we can make and install:

perl Makefile.PL
make && make install

After installation, the default conf file will be under etc/ora2pg/ora2pg.conf.

ora2Pg config

Now we can start configuring ora2Pg by changing the default conf file, or by creating a new conf file based on the default. If you have multiple DB schemas to migrate or multiple environments, it is better to create a separate conf for each use case.

The important fields are:

ORACLE_DSN      dbi:Oracle:host=hostname;sid=Oracle SID

ORACLE_USER     schema_owner name

ORACLE_PWD      password

You should set as below to connect and retrieve required schema data and objects

USER_GRANTS     1

I also turned debug on to view progress

DEBUG           1

Other settings used were

EXPORT_SCHEMA   1

SCHEMA          schema_owner_name

CREATE_SCHEMA   1

I also used the same schema name in PG

PG_SCHEMA       schema_owner_name

On some occasions I had problems with characters not being copied across correctly and had to use the settings below to override the defaults:

NLS_LANG       AMERICAN_AMERICA.UTF8

NLS_CHAR        LATIN1

Now you can run the dump:

ora2pg -c myOra2pg.conf -t TABLE -o tables.sql

There is also another article on this in Chinese.

oracle DBMS_STATS.GATHER_TABLE_STATS

The Oracle RDBMS allows you to collect many different kinds of statistics as an aid to improving performance. The DBMS_STATS package is concerned with optimizer statistics only. Given that Oracle turns on automatic collection of these statistics by default, the DBMS_STATS package is needed only for specialized cases.

For GATHER_TABLE_STATS

Most enterprise databases, Oracle included, use a cost-based optimizer to determine the appropriate query plan for a given SQL statement. This means that the optimizer uses information about the data to determine how to execute a query rather than relying on rules (this is what the older rule-based optimizer did).

For example, imagine a table for a simple bug-tracking application

CREATE TABLE issues (
  issue_id     number primary key,
  issue_text   clob,
  issue_status varchar2(10)
);

CREATE INDEX idx_issue_status
    ON issues( issue_status );

If I’m a large company, I might have 1 million rows in this table. Of those, 100 have an issue_status of ACTIVE, 10,000 have an issue_status of QUEUED, and 989,900 have a status of COMPLETE. If I want to run a query against the table to find my active issues

SELECT *
  FROM issues
 WHERE issue_status = 'ACTIVE'

the optimizer has a choice. It can either use the index on issue_status and then do a single-row lookup in the table for each row in the index that matches or it can do a table scan on the issues table. Which plan is more efficient will depend on the data that is in the table. If Oracle expects the query to return a small fraction of the data in the table, using the index would be more efficient. If Oracle expects the query to return a substantial fraction of the data in the table, a table scan would be more efficient.

DBMS_STATS.GATHER_TABLE_STATS is what gathers the statistics that allow Oracle to make this determination. It tells Oracle that there are roughly 1 million rows in the table, that there are 3 distinct values for the issue_status column, and that the data is unevenly distributed. So Oracle knows to use an index for the query to find all the active issues. But it also knows that when you turn around and try to look for all the completed issues

SELECT *
  FROM issues
 WHERE issue_status = 'COMPLETE'

that it will be more efficient to do a table scan.

Gathering statistics allows the query plans to change over time as the data volumes and data distributions change. When you first install the issue tracker, you’ll have very few COMPLETE issues and more ACTIVE and QUEUED issues. Over time, the number of COMPLETE issues rises much more quickly. As you get more rows in the table and the relative fraction of those rows in the various statuses changes, the query plans will change so that, in the ideal world, you always get the most efficient plan possible.
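
If you ever do need to gather statistics manually from Java, a minimal JDBC sketch could look like the following (the connection URL, schema and table names are placeholders; normally the automatic stats job makes this unnecessary):

    import java.sql.CallableStatement;
    import java.sql.Connection;
    import java.sql.DriverManager;

    public class GatherStatsExample
    {
        public static void main(String[] args) throws Exception
        {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:oracle:thin:@//dbhost:1521/ORCL", "scott", "tiger");
                 CallableStatement cs = conn.prepareCall(
                    "begin dbms_stats.gather_table_stats(ownname => ?, tabname => ?); end;"))
            {
                cs.setString(1, "MY_SCHEMA");
                cs.setString(2, "ISSUES");
                cs.execute(); // refresh optimizer statistics for the ISSUES table
            }
        }
    }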

JPA hibernate @GeneratedValue with sequence on Oracle

When using Oracle, the behavior of JPA's @GeneratedValue with a sequence under the Hibernate implementation can be quite tricky. I am trying to make a summary here. The versions used are JPA 2.1 and Hibernate 4.3.11.Final.

Default generator setting

@GeneratedValue with strategy AUTO and default allocationSize(50)

This will use the ‘org.hibernate.id.SequenceGenerator’ which will get a SEQ+1 value.

@GeneratedValue with strategy SEQUENCE and default allocationSize(50)

This will use the ‘org.hibernate.id.SequenceHiLoGenerator’; the maxLo value in this class is 49 (coming from the default allocationSize of 50), so it will return a SEQ*50 value.

@GeneratedValue with strategy AUTO and allocationSize=1

This has the same behavior as with the default allocationSize, using ‘org.hibernate.id.SequenceGenerator’, which gives SEQ+1.

@GeneratedValue with strategy SEQUENCE and allocationSize=1

This would use the ‘org.hibernate.id.SequenceHiLoGenerator’, but since we set the allocationSize to 1, it does not use the HiLo optimizer and instead delegates the work to the ‘SequenceGenerator’, so we get the same SEQ+1 value.

 

hibernate.id.new_generator_mappings set to true

@GeneratedValue with strategy AUTO/SEQUENCE and default allocationSize(50)

This will use ‘org.hibernate.id.enhanced.SequenceStyleGenerator’, which offsets the current sequence value by the allocationSize, so we get the value: SEQ-49.

@GeneratedValue with strategy AUTO/SEQUENCE and allocationSize=1

This also uses the ‘org.hibernate.id.enhanced.SequenceStyleGenerator’, but since we assigned allocationSize=1 it returns the regular SEQ+1 value.

 

Summary

Given the inconsistent behaviors above, the best way IMO is to set the new ‘hibernate.id.new_generator_mappings’ to true and leverage the new ‘SequenceStyleGenerator’, where we have the flexibility to use a regular sequence but can also bump up the sequence value with a larger allocationSize if we need to persist a lot of entities at one time. The only thing we need to pay attention to is not forgetting to add allocationSize=1 to the existing annotations.

If we set this attribute to true, then for the existing entities using the HiLo generator we can either 1. bump up the sequence and set allocationSize to 50, OR 2. use the @GenericGenerator annotation to keep using SequenceHiLoGenerator:

    @GenericGenerator(name = "RPT_QMRC_SCRT_PRDCT_6_ID_GEN", strategy = "org.hibernate.id.SequenceHiLoGenerator",
        parameters =  {@Parameter(name = "sequence", value = "SEQ_RPT_QMRC_SCRT_PRDCT_6_ID"), @Parameter(name = "max_lo", value = "50")})
    @Id
    @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "RPT_QMRC_SCRT_PRDCT_6_ID_GEN")
    @Column(name = "RPT_QMRC_SCRT_PRDCT_6_ID", nullable = false, insertable = true, updatable = true)
    public Long getRptQmrcScrtPrdct6Id()
    {
        return rptQmrcScrtPrdct6Id;
    }
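
For new entities that only need the plain SEQ+1 behavior under the new mapping, a minimal sketch (the entity, sequence and column names here are made up) would be:

    import javax.persistence.*;

    @Entity
    @Table(name = "PARENT")
    public class ParentEntity
    {
        private Long id;

        // With hibernate.id.new_generator_mappings=true this maps to
        // SequenceStyleGenerator; allocationSize=1 keeps the id in lock-step
        // with the database sequence (the regular SEQ+1 behavior).
        @Id
        @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "PARENT_ID_GEN")
        @SequenceGenerator(name = "PARENT_ID_GEN", sequenceName = "SEQ_PARENT_ID", allocationSize = 1)
        @Column(name = "PARENT_ID", nullable = false)
        public Long getId()
        {
            return id;
        }

        public void setId(Long id)
        {
            this.id = id;
        }
    }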

Oracle partition table

We have an Oracle table which used to be populated from another Oracle table by a stored procedure. Now that our upstream data source has migrated to Hive, we need to do it through JDBC and persist with JPA. And we hit a problem: ‘inserted partition key does not map to any partition’.

It turns out that before inserting the data, the stored procedure would create a new partition (month+year) for the table. After a lot of googling, it seems there is no good way of handling this in the Java world, so I guess we have to go the native SQL way to tackle it (see the sketch after the partition queries below).

Some notes on what I learned during the research.

Good oracle doc about table partitioning.

command for checking partition info


--view all partitioned tables
SELECT DISTINCT TABLE_NAME FROM ALL_TAB_PARTITIONS;
--user partitioned tables
SELECT DISTINCT TABLE_NAME FROM user_tab_partitions;
--look at all partitions inside a specific table
select * from ALL_TAB_PARTITIONS where table_name = 'YOUR_PAR_TABLE_NAME';
--check partition key
select * from ALL_PART_KEY_COLUMNS;
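
Coming back to the original problem, the sketch we ended up with is to issue the partition DDL as a native query right before the JPA inserts. The table name, partition name and bound below are hypothetical, and this assumes a classic range-partitioned table (on Oracle 11g+ interval partitioning would create the partitions automatically and make this step unnecessary):

    import javax.persistence.EntityManager;

    public class PartitionHelper
    {
        // Mimic what the old stored proc did: add a monthly partition before
        // inserting that month's data. Note that DDL commits implicitly in Oracle.
        public void addMonthlyPartition(EntityManager em, String partitionName, String upperBound)
        {
            // Bind parameters are not allowed in DDL, so the statement is built as a string.
            String ddl = "alter table MY_PART_TABLE add partition " + partitionName
                    + " values less than (to_date('" + upperBound + "', 'YYYY-MM-DD'))";
            em.createNativeQuery(ddl).executeUpdate();
        }
    }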

Hive JDBC Architecture

Part 1: What is the actual contract that Hive provides us with?

  • Hive’s contract to users is defined in the HiveInterface class

That is, Thrift is the communication channel that Hive uses to expose its main service: the translation of SQL commands into Hadoop/MapReduce jobs. The ultimate class invoked by the JDBC layer of Hive is the HiveServer (or client), both of which implement the HiveInterface:

Hive JDBC always seemed like a hairy beast to me. It's actually not all that bad: the URL is translated by the JDBC driver to either use a thrift connection, or else a “HiveServerHandler”, to dial up a HiveInterface.

The HiveServerHandler is what implements the HiveInterface in the embedded case. In fact, it's pretty easy to run Hive commands in pure Java without even using JDBC!

  • Just creating an instance of a HiveServerHandler gives you direct access to lower-level hive operations (although in an application you probably shouldn't be instantiating this lower-level object).

Nevertheless, if you’re an IDE junkie, you can then play around in your IDE with “lower level” Hive functionality by instantiating a “HiveServerHandler”.

  • This is a nice way to see the API calls that hive provides as a service to clients.
Playing with the HiveServerHandler implementation of HiveServer to see what underlying methods get called from the normal Hive interface in your IDE.

So, if you are curious about going deeper into the hive api and like to hack, calling the methods from this class manually is an interesting way to familiarize yourself with hive.

Part 2: Going deeper into the way the HiveInterface is invoked by tracing the path from JDBC-to-MapReduce. 

Now, let's take another route to understanding Hive's architecture: let's trace the path from JDBC to Hadoop.

  • After all, we all know how JDBC works — the input is SQL, and the output is a ResultSet which is created by a “driver” which makes a database connection.
  • The Hive “Driver” has a runtime dependency on /bin/hadoop (in much the same way that the MySQL driver depends on a running MySQL instance).
  • The Hive “Driver” allows you to create “HiveStatement” objects, which we know, are the backbone of any JDBC App.

So let's start tracing the HiveDriver class, which is the JDBC driver for Hive. If you don't know the basics of JDBC, I'd suggest you read up on it before proceeding:

1) When you connect to JDBC via a URL, you explicitly define the driver:

“org.apache.hive.jdbc.HiveDriver” for hive2.

Note that it used to be org.apache.hadoop.hive.jdbc.HiveDriver.
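
For orientation, here is a standard client-side usage sketch of that driver and URL before we dig into its internals (host, port, credentials and table name are placeholders):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcExample
    {
        public static void main(String[] args) throws Exception
        {
            Class.forName("org.apache.hive.jdbc.HiveDriver"); // registers the driver

            try (Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://hive-host:10000/default", "hive", "");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("select count(*) from my_table"))
            {
                while (rs.next())
                {
                    System.out.println(rs.getLong(1)); // row count computed by a MapReduce job
                }
            }
        }
    }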

2) This driver, in turn, registers itself when the class is first loaded:

public class HiveDriver implements Driver {
  static {
    try {
      java.sql.DriverManager.registerDriver(new HiveDriver());
    } catch (SQLException e) {
      // TODO Auto-generated catch block
      e.printStackTrace();
    }
  }
  // ...
}

3) Looking closer, the Driver also declares its URL_PREFIX, which is used in the acceptsURL implementation (I'm using hive2://, but for older Hive, just "jdbc:hive://" was used).

private static final String URL_PREFIX = "jdbc:hive2://";

public boolean acceptsURL(String url) throws SQLException {
  return Pattern.matches(URL_PREFIX + ".*", url);
}

4) Now, when we make a JDBC statement, the generic DriverManager calls acceptsURL on all registered drivers, and if one matches, it uses that driver at runtime to run the query. As we all know, at this point we normally create a JDBC Connection to issue a query. The Driver which is dynamically loaded above provides “public Connection connect(..)” as part of its interface, and returns a “HiveConnection” implementation:

public Connection connect(String url, Properties info) throws SQLException {
  return new HiveConnection(url, info);
}

5) The HiveConnection implementation now has to figure out how to provide a hive service. There are two scenarios: local and non-local. For non-local, i.e. a real hive server, a thrift communication channel is opened to talk to the hive server. For local, the HiveServerHandler is spun up:

if uri is empty:

     client = new HiveServer.HiveServerHandler();

else (uri is non-empty, and thus has a host/port):

     client = new HiveClient(new TBinaryProtocol(new TSocket(host, port)))

Looking quickly into the HiveServerHandler class, the header describes why it is used when the URI is empty:

/**
 * Handler which implements the Hive Interface. This class can be used
 * in lieu of the HiveClient class to get an embedded server.
 */
public static class HiveServerHandler extends HiveMetaStore.HMSHandler
    implements HiveInterface {

6) At this point, the hive connection is ready to go.  The individual implementations of Hive SQL statements can be seen in the HiveConnection class, for example (from HiveConnection.java):

public PreparedStatement prepareStatement(String sql, int resultSetType,
    int resultSetConcurrency) throws SQLException {
  return new HivePreparedStatement(client, sql);
}

The tricky thing to get here, which explains it all, is that the “client” above is a “HiveInterface” implementation, which can be an instance of either the HiveServer or the HiveServerHandler.

7) One last remaining question: where does Hive end and MapReduce begin? To understand that, we look directly into the HiveInterface implementations. They are both in the HiveServer class (remember from 5 above that the HiveServer class is either accessed via creation of a local handler, or via a thrift client service).

The HiveServer implementation ultimately uses the org.apache.hadoop.hive.ql.Driver class. This can be seen in the following stacktrace:

at org.apache.hadoop.hive.ql.Driver.run(Driver.java:945)
at org.apache.hadoop.hive.service.HiveServer$HiveServerHandler.execute(HiveServer.java:198)
at org.apache.hadoop.hive.jdbc.HiveStatement.executeQuery(HiveStatement.java:192)
at org.bigtop.bigpetstore.etl.HiveETL.main(HiveETL.java:105)

The application code creates a “HiveStatement”, which triggers an operation on the HiveServer, which feeds the command to the “org.apache.hadoop.hive.ql.Driver” implementation, which calls compile(), and then execute().  The execute() method has all the “magic” of chaining jobs together and running them:

// From org.apache.hadoop.hive.ql.Driver.java

// Loop while you either have tasks running, or tasks queued up
while (running.size() != 0 || runnable.peek() != null) {
  // Launch upto maxthreads tasks
  while (runnable.peek() != null && running.size() < maxthreads) {
    Task<? extends Serializable> tsk = runnable.remove();
    launchTask(tsk, queryId, noName, running, jobname, jobs, driverCxt);
  }
}

So I guess Hive isn't such a black box after all! 🙂

The driver's run implementation is what ultimately calls hadoop.

– Hive's JDBC front end uses the URI to decide whether or not to implement the hive service on its own, or through a “client”.

– The client terminology is kind of funny to me, because the client itself is actually a “HiveInterface”, which is implemented either via a thrift service (which talks to a running HiveServer) or via an embedded hive server.

– Both of these implementations are provided by the HiveServer class.

 

FROM HERE