large file from hive to rdbms(oracle)

Recently we have a requirement of dumping a sizable file(4+G) to oracle from s3. The file itself is hive-compatiable. so instead of downloading the file and generate sql for it, we decided to transfer the content using hive jdbc and persist in via jpa/hiberante.

Hive

On the hive side, one important thing is to make sure batchsize is set in jdbc resultset.


hiveJdbcTemplate.query(sqlToExecute, rs -> {
            rs.setFetchSize(5000);
            while (rs.next()){
               ....do you handling
            }
        });

RDBMS

on the relational database side

  1. make sure index is turned off. otherwise it each insertion will trigger the b-tree index to be inserted.
  2. make sure leverage the hibernate batch-size
    hibernate.jdbc.batch_size. I set it to 50 since my table has over 200 columns.For example , if you save() 100 records and your hibernate.jdbc.batch_size is set to 50. During flushing, instead of issue the following SQL 100 times :

    insert into TableA (id , fields) values (1, 'val1');
    insert into TableA (id , fields) values (2, 'val2');
    insert into TableA (id , fields) values (3, 'val3');
    .........................
    insert into TableA (id , fields) values (100, 'val100');

    Hiberate will group them in batches of 50 , and only issue 2 SQL to the DB, like this:

    insert into TableA (id , fields) values (1, 'val1') , (2, 'val2') ,(3, 'val3') ,(4, 'val4') ,......,(50, 'val50')
    insert into TableA (id , fields) values (51, 'val51') , (52, 'val52') ,(53, 'val53') ,(54, 'val54'),...... ,(100, 'val100')  

    Please note that Hibernate would disable insert batching at the JDBC level transparently if the primary key of the inserting table isGenerationType.Identity.

  3. make sure flush()/clear() for certain size so that memory is not eaten up by the millions of objects built on the fly.
    flush will make sure query be executed and object saved(synced) to DB.
    clear will clear the persistence context so all managed entities are detached. entities that have not been flushed will not be persisted.

My main code is something like:


    public int doImport(int limit)
    {
        String sql = "SELECT * FROM erd.ERD_PRDCT_FIXED_INCM_MNCPL_HS_prc_txt";
        if (limit >= 0)
        {
            sql = sql + " LIMIT " + limit;
        }
        HiveBeanPropertyRowMapper<SrcErdFixedIncmMuniEntity> mapper = new HiveBeanPropertyRowMapper<>(SrcErdFixedIncmMuniEntity.class);
        int batchSize = 5000;
        int[] inc = {0};
        Instant start = Instant.now();
        List<SrcErdFixedIncmMuniEntity> listToPersist = new ArrayList<>(batchSize);
        hiveJdbcTemplate.query(sql, (rs) -> {
            rs.setFetchSize(batchSize);
            while (rs.next())
            {
                listToPersist.add(mapper.mapRow(rs, -1));
                inc[0]++;
                if (inc[0] % batchSize == 0)
                {
                    persistAndClear(inc, listToPersist);
                }
            }
            //left overs(last n items)
            if(!listToPersist.isEmpty())
            {
                persistAndClear(inc, listToPersist);
            }
            return null;
        });
        Instant end = Instant.now();
        System.out.println("Data Intake took: " + Duration.between(start, end));
        return inc[0];
    }

    private void persistAndClear(int[] inc, List<SrcErdFixedIncmMuniEntity> listToPersist)
    {
        listToPersist.stream().forEach(em::persist);
        em.flush();
        em.clear();
        LOGGER.info("Saved record milestone: " + inc[0]);
        listToPersist.clear();
    }

Result

not bad, ~3.5 Millions records get loaded in about an hour.

install ora2pg on mac osx

Recently we need to migrate our on prem oracle database into postgresql on aws RDS. Looks like ora2pg is a nice little tool to mitigate the problem though not perfect.

I am running OS X 10.10.5 Yosemite. It already comes with perl v5.18.2 as well as cpan for installing packages.

download/install Oracle client

The official document specifies quite vaguely about what is the oracle client libraries.  Turns out we need to download 3 packages from oracle website, for my use case we have oracle 12c so my current version is 12.1.02. The packages you need are:

  1. Oracle Basic
  2. Sqlplus
  3. sdk

Then we unzip them to the SAME directory like: /Users/liha/develop/instantclient_12_1 .

Now we need to set 2 env variables for installation:

export LD_LIBRARY_PATH=/Users/liha/develop/instantclient_12_1
export ORACLE_HOME=/Users/liha/develop/instantclient_12_1

install DBI/DBD

Now we need to install DBI/DBD stuff that is required during the ora2pg Makefile phase, especially the DBD::Oracle. The DBD:Pg is optional, only if you need to dump data/schema live rather than offline to file.

#perl –MCPAN –e shell
…………
cpan> install DBI
…………
cpan> install DBD::Oracle
…………
cpan> install DBD::Pg
…………

Check module

Now we can check whether the dependencies are correctly installed by running a perl script.

 #!/usr/bin/perl
use ExtUtils::Installed;
my $inst = ExtUtils::Installed->new();
print join "\n",$inst->modules();

result should look like something:

➜  ora2pg perl test.pl                                                                                                                                                                              12:48:46
Bundle::NetSNMP
DBD::Oracle
DBI
Ora2Pg
Perl⏎

install ora2Pg

now can start to install the real thing by download and extract the ora2Pg tar.bz2 and run its makefile.pl which will eventually generate our makefile which we can make and install:

perl Makefile.PL
make && make install

After installation, the default conf file will be under etc/ora2pg/ora2pg.conf.

ora2Pg config

Now we can start config the ora2Pg by changing the default conf file or create a new conf file based on default conf. If you have multiple DB schemas to migrate or have multiple environments, it would be better to create different conf for each use case.

The important fields are:

ORACLE_DSN      dbi:Oracle:host=hostname;sid=Oracle SID

ORACLE_USER     schema_owner name

ORACLE_PWD      password

You should set as below to connect and retrieve required schema data and objects

USER_GRANTS     1

I also turned debug on to view progress

DEBUG           1

Other settings used were

EXPORT_SCHEMA   1

SCHEMA          schema_owner_name

CREATE_SCHEMA   1

I also used the same schema name in PG

PG_SCHEMA       schema_owner_name

I did have some problems with characters not being copied across as required and had to use settings below to override defaults, on some occasions

NLS_LANG       AMERICAN_AMERICA.UTF8

NLS_CHAR        LATIN1

Now you can run the dump:

ora2pg -c myOra2pg.conf -t TABLE -o tables.sql

另外一个中文文章

JPA hibernate @GeneratedValue with sequence on Oracle

When using Oracle, the JPA with hibernate implementation on the @GeneratedValue with sequence would be quite tricky. I am trying to make a summary here. The version used are JPA 2.1 and Hibernate 4.3.11.Final

Default generator setting

@GeneratedValue with strategy AUTO and default allocationSize(50)

This will use the ‘org.hibernate.id.SequenceGenerator’ which will get a SEQ+1 value.

@GeneratedValue with strategy SEQUNCE and default allocationSize(50)

This will use the ‘org.hibernate.id.SequenceHiLoGenerator’ because the maxLo value in this class is 49 which is due to the default allocationSize 50,  which will return a SEQ*50 value.

@GeneratedValue with strategy AUTO and allocationSize=1

This would have the same behavior with the default allocationSize using ‘org.hibernate.id.SequenceGenerator’, which gives SEQ+1

@GeneratedValue with strategy SEQUNCE and default allocationSize=1

This would use the ‘org.hibernate.id.SequenceHiLoGenerator’, but since we set the allocationSize to 1, it would not use the HiLo optimizer but delegate the work to the ‘SequenceGenerator’, so we will get the same SEQ+1 value.

 

hibernate.id.new_generator_mappings set to true

@GeneratedValue with strategy AUTO/SEQUNCE and default allocationSize(50)

This will use ‘org.hibernate.id.enhanced.SequenceStyleGenerator’ which will subtract the current sequence with allocationSize. so we get the value: SEQ-49

@GeneratedValue with strategy AUTO/SEQUNCE and default allocationSize=1

This would also use the ‘org.hibernate.id.enhanced.SequenceStyleGenerator’ but since we assigned the allocationSize to 1, it would return the regular SEQ+1 value.

 

Summary

As we can see the inconsistency behaviors above, the best way IMO is to set the new ‘hibernate.id.new_generator_mappings’ to true to leverage the new ‘SequenceStyleGenerator’ where we have the flexibility to use regular sequence but also could bump up sequence value with larger allocationSize if we need to persist a lot of entities at one time. The only thing we need to pay attention is to not forget to have allocationSize=1 to the existing annotation.

If we set this attribute to TRUE, For the existing entities using HiLo generator, we could either 1. bump up the sequence, add allocationSize to 50 OR 2. use @GenericGenerator annotation to still use SequenceHiLoGenerator.

    @GenericGenerator(name = "RPT_QMRC_SCRT_PRDCT_6_ID_GEN", strategy = "org.hibernate.id.SequenceHiLoGenerator",
        parameters =  {@Parameter(name = "sequence", value = "SEQ_RPT_QMRC_SCRT_PRDCT_6_ID"), @Parameter(name = "max_lo", value = "50")})
    @Id
    @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "RPT_QMRC_SCRT_PRDCT_6_ID_GEN")
    @Column(name = "RPT_QMRC_SCRT_PRDCT_6_ID", nullable = false, insertable = true, updatable = true)
    public Long getRptQmrcScrtPrdct6Id()
    {
        return rptQmrcScrtPrdct6Id;
    }

Oracle partition table

We have an oracle table which used to use store proc to populate from other oracle table. Now since our upstream data source migrated to Hive, we need to do it thru jdbc and persist with JPA. And we hit a problem with: ‘inserted partition key does not map to any partition’.

Turns out that in the store proc, before we insert the data, the store proc would create a new partition(month+year) for the table. Googled a lot, it seems there is no good way of handling it in the java world, i guess we have to go to the native sql way to tackle this.

Some notes on what learnt during the research.

Good oracle doc about table partitioning.

command for checking partition info


--view all partitioned tables
SELECT DISTINCT TABLE_NAME FROM ALL_TAB_PARTITIONS;
--user partitioned tables
SELECT DISTINCT TABLE_NAME FROM user_tab_partitions;
--look at all partitions inside a specific table
select * from ALL_TAB_PARTITIONS where table_name = 'YOUR_PAR_TABLE _NAME';
--check partition key
select * from ALL_PART_KEY_COLUMNS;

Understanding Oracle Indexes

B-tree Index

Standard Oracle index is B-tree based.

A key is a column or expression on which you can build an index. Follow these guidelines for choosing keys to index:

  • Consider indexing keys that are used frequently in WHERE clauses.
  • Consider indexing keys that are used frequently to join tables in SQL statements. For more information on optimizing joins, see the section “Using Hash Clusters”.
  • Index keys that have high selectivity. The selectivity of an index is the percentage of rows in a table having the same value for the indexed key. An index’s selectivity is optimal if few rows have the same value.

    Note:

    Oracle automatically creates indexes, or uses existing indexes, on the keys and expressions of unique and primary keys that you define with integrity constraints.


    Indexing low selectivity columns can be helpful if the data distribution is skewed so that one or two values occur much less often than other values.

  • Do not use standard B-tree indexes on keys or expressions with few distinct values. Such keys or expressions usually have poor selectivity and therefore do not optimize performance unless the frequently selected key values appear less frequently than the other key values. You can use bitmap indexes effectively in such cases, unless a high concurrency OLTP application is involved where the index is modified frequently.
  • Do not index columns that are modified frequently. UPDATE statements that modify indexed columns and INSERT and DELETE statements that modify indexed tables take longer than if there were no index. Such SQL statements must modify data in indexes as well as data in tables. They also generate additional undo and redo.
  • Do not index keys that appear only in WHERE clauses with functions or operators. A WHERE clause that uses a function, other than MIN or MAX, or an operator with an indexed key does not make available the access path that uses the index except with function-based indexes.
  • Consider indexing foreign keys of referential integrity constraints in cases in which a large number of concurrent INSERT, UPDATE, and DELETE statements access the parent and child tables. Such an index allows UPDATEs and DELETEs on the parent table without share locking the child table.
  • When choosing to index a key, consider whether the performance gain for queries is worth the performance loss for INSERTs, UPDATEs, and DELETEs and the use of the space required to store the index. You might want to experiment by comparing the processing times of the SQL statements with and without indexes. You can measure processing time with the SQL trace facility.

Bitmap Indexes

1. 案例

  有张表名为table的表,由三列组成,分别是姓名、性别和婚姻状况,其中性别只有男和女两项,婚姻状况由已婚、未婚、离婚这三项,该表共有100w个记录。现在有这样的查询:     select * from table where Gender=‘男’ and Marital=“未婚”;

姓名(Name)

性别(Gender)

婚姻状况(Marital)

张三

已婚

李四

已婚

王五

未婚

赵六

离婚

孙七

未婚

1)不使用索引

  不使用索引时,数据库只能一行行扫描所有记录,然后判断该记录是否满足查询条件。

2)B树索引

  对于性别,可取值的范围只有’男’,’女’,并且男和女可能各站该表的50%的数据,这时添加B树索引还是需要取出一半的数据, 因此完全没有必要。相反,如果某个字段的取值范围很广,几乎没有重复,比如身份证号,此时使用B树索引较为合适。事实上,当取出的行数据占用表中大部分的数据时,即使添加了B树索引,数据库如oracle、mysql也不会使用B树索引,很有可能还是一行行全部扫描。

2. 位图索引出马

如果用户查询的列的基数非常的小, 即只有的几个固定值,如性别、婚姻状况、行政区等等。要为这些基数值比较小的列建索引,就需要建立位图索引。

对于性别这个列,位图索引形成两个向量,男向量为10100…,向量的每一位表示该行是否是男,如果是则位1,否为0,同理,女向量位01011。

RowId

1

2

3

4

5

1

0

1

0

0

0

1

0

1

1

  对于婚姻状况这一列,位图索引生成三个向量,已婚为11000…,未婚为00100…,离婚为00010…。

RowId

1

2

3

4

5

已婚

1

1

0

0

0

未婚

0

0

1

0

1

离婚

0

0

0

1

0

   当我们使用查询语句“select * from table where Gender=‘男’ and Marital=“未婚”;”的时候 首先取出男向量10100…,然后取出未婚向量00100…,将两个向量做and操作,这时生成新向量00100…,可以发现第三位为1,表示该表的第三行数据就是我们需要查询的结果。

RowId

1

2

3

4

5

1

0

1

0

0

and

未婚

0

0

1

0

1

结果

0

0

1

0

0

3.位图索引的适用条件

  上面讲了,位图索引适合只有几个固定值的列,如性别、婚姻状况、行政区等等,而身份证号这种类型不适合用位图索引。

  此外,位图索引适合静态数据,而不适合索引频繁更新的列。举个例子,有这样一个字段busy,记录各个机器的繁忙与否,当机器忙碌时,busy为1,当机器不忙碌时,busy为0。

  这个时候有人会说使用位图索引,因为busy只有两个值。好,我们使用位图索引索引busy字段!假设用户A使用update更新某个机器的busy值,比如update table set table.busy=1 where rowid=100;,但还没有commit,而用户B也使用update更新另一个机器的busy值,update table set table.busy=1 where rowid=12; 这个时候用户B怎么也更新不了,需要等待用户A commit。

  原因:用户A更新了某个机器的busy值为1,会导致所有busy为1的机器的位图向量发生改变,因此数据库会将busy=1的所有行锁定,只有commit之后才解锁。

Cluster Index

Oracle will create an index to police an unique constraint where no pre-existing index is suitable. Without the index, Oracle would need to serialize operations (such as a table lock) whenever someone tries to insert or delete a row (or update the PK).

Contrarily to MS-SQL Server, this index is not clustered on heap tables (default table organization), i.e. this index won’t change the underlying table structure and natural order. The rows won’t be reordered when Oracle creates the index. The index will be a B-tree index and will exist as a separate entity where each entry points to a row in the main table.

Oracle doesn’t have clustered index as MS SQL, however indexed-organized tables share some properties with cluster-indexed tables. The PK is an integral part of such tables and has to be specified during creation.

(Oracle also has table clusters, but they are a completely different concept).

Reference:

1. Understanding Indexes and Clusters

2. Indexes and Index-Organized Tables  

FORWARD SLASH IN SQL*PLUS

Note to the visitor: Thank you for stopping by. I see a lot of people trying to find out about forward slash in SQL*Plus. I want to clarify: This post only applies to Oracle SQL. If you are querying about a “/” usage in any other database, you are in the wrong place.

Forward Slash (“/”) in Oracle SQL

This post is about using forward slash (“/”) in Oracle SQL and PL/SQL. This gets a lot of developers (new and experienced alike) as it’s not always needed and sometimes it shouldn’t be used.

Have you ever tried typing “/” (without quotes), as soon as you logged into SQL*Plus? Try it. You will see something like below.

SQL> /
SP2-0103: Nothing in SQL buffer to run.
SQL>

The above snippet explains it all pretty well. Forward slash (“/” ) is a shortcut for RUNcommand in SQL*Plus (like “go” in SqlServer or mysql). Since there is nothing to run yet, SQL*Plus returned an error message (prefixed with SP2-).

The message in the above example also shows there is a SQL buffer in SQL*Plus. Any SQL sent to DB ends up in this (one statement) buffer. Also, if there is a statement in the buffer, “/” would have simply re-executed it.

When you type a line and press ENTER, SQL*Plus checks if it’s one of it’s own commands, a SQL statement or just bad text. If it’s a SQL (identified by keywords like SELECT etc), it starts storing it in the buffer until it sees a “/” or a semi-colon, at which time it executes it (actually sends it to the server to execute it).

Gotcha with Forward Slash (/)

SQL Statements typically end in Semi-colon. Semi-colon is the Statement Terminator in ANSI standard. But, in Oracle SQL, you can use “/” to run a SQL, instead of semicolon, like shown. (This is how SPANISH record got inserted in my attached example at the bottom of this post).

SQL> SELECT dummy FROM dual
 2 /
Dummy
-----
X

But, here is the gotcha: If the statement itself contained a semi-colon, then “/” would have re-executed it!!!!

SQL> select dummy from dual;
Dummy
-----
X
SQL> /
DUMMY
----------
X

Oops. Imagine this was an INSERT statement instead!!! (And this is what happened in my example script attached. If you notice the “/” before commit, this is what caused FRENCH to be inserted twice!!).

So be careful with your usage of “/”. For plain SQL, it’s a matter of choice. Typically, this is used in DDL statements and semi-colons in DML statements. Be consistent to avoid surprises.

Remember “/” is just a RUN (or push) command in the tool, not part of Oracle SQL itself. If you run my attached example script in SQLTools, you will get an error on “/”.

Always, run your scripts in SQL*Plus to make sure it will run fine in production, as this is a tool of choice for several DBAs.

Forward Slash (“/”) in PL/SQL

PL/SQL on the other hand is a group of SQL statements with embedded semicolons. In this case, a semicolon alone cannot be used to send to DB. It needs a push with “/”! So, in this case, it acts as a Statement Terminator as well.

SQL> list
 1 DECLARE
 2    x NUMBER;
 3 BEGIN
 4    SELECT 10 as num INTO x FROM dual;
 5    Dbms_Output.put_line('x = ' || x);
 6
 7 END;
 8
 9
 10
 11
 12
 13*
SQL>

In the above example, Notice lines  7 – 13 are blank. This is because I pressed ENTER several times. Since SQL*Plus knows this a PL/SQL (declare, begin…end), it waits for me to signal the real end, which is….. a “/”. Until I typed “/” on line 13, it didn’t come out of the PL/SQL editing mode. “/” also pushed the PL/SQL block to server to execute it.

Even though this has a bunch of lines, this whole text is a single PL/SQL statement. Type list at the SQL*Plus Prompt. You will see the “single” statement listed in full (minus the “/”).

SQL*Plus commands

SQL*Plus commands like SET, SHOW, SPOOL etc are typically one liners and should not be ended with semicolon (and don’t need “/” either).

SQL> show serveroutput;
serveroutput OFF
SQL>

These commands will be executed locally in the tool and won’t be stored in SQL buffer. So, a “/” after a SQL*Plus command will not re-execute it. In fact, it will re-execute the previous SQL Statement in the buffer!!

(Note: A “/” must be the first non-whitespace character (space, tab etc) on a line by itself. It cannot be at the end of a line, like semi-colon can be).

I’ve also posted a more detailed description of SQL*Plus here.

Notes

  1. There is a subtle difference: RUN actually lists before it runs; / doesn’t!
  2. This is a local message from SQL*Plus. If this was from Oracle database server, you would see a “ORA” prefix.

Example Script

-- Example data for SQL*Plus related blogs
DROP TABLE greetings;
CREATE TABLE greetings (id NUMBER, lang VARCHAR2(10), msg VARCHAR2(30));
INSERT INTO greetings VALUES (1, 'ENGLISH', 'Hello World');
INSERT INTO greetings VALUES (2, 'FRENCH', 'Bonjour le monde');
/
INSERT INTO greetings VALUES (3, 'GERMAN', 'Hallo Welt');
list
/
INSERT INTO greetings VALUES (4, 'SPANISH', 'Hola Mundo')
/
COMMIT;
-- Here are some extra notes about this script:
-- Just Hello world in some languages. Translations are approx. from Google.
-- Even though we are submitting this script to SQL*Plus entirely,
--   it executes one SQL Statement at a time. List command shows this.

-- Forward Slash after FRENCH, reexecutes previous command. so you will see
-- 2 FRENCH records inserted.
-- Forward Slash after GERMAN was, by mistake, thought to run list, but instead it
-- it ran the previous SQL statement, thus we have 2 GERMANs!!
-- Forward Slash after SPANISH inserts only once - because it's missing semi-colon (;)
-- and / substitutes for it!!
-- Commit is required in Oracle SQL. By default, Autocommit is not turned on in SQL*Plus
-- To see/set default use SHOW/SET AUTOCOMMIT

 

FROM HERE

create table / sequence if not exist.

it’s quite uncommon to check first and then create the table. If you want to have a running code according to your method, this will be:

BEGIN
  DECLARE
    nCount NUMBER;
    v_sql LONG;
  BEGIN -- try to execute sql for UFI_RQST table
    SELECT count(*) into nCount FROM dba_tables where table_name = 'UFI_RQST';
    IF(nCount <= 0)
    THEN
      v_sql:='create table UFI_RQST (UFI_RQST_ID NUMBER(5) PRIMARY KEY, RQST CLOB, RSPNS VARCHAR2(15), CRTD_DT DATE DEFAULT (sysdate))';
      EXECUTE IMMEDIATE v_sql;
      EXECUTE IMMEDIATE 'GRANT SELECT ON UFI_RQST_SEQ TO SNAPADMIN';
      EXECUTE IMMEDIATE 'GRANT SELECT, INSERT, UPDATE, DELETE ON UFI_RQST TO SNAPADMIN';

    END IF;
  END;

  DECLARE v_dummy NUMBER;
  BEGIN -- try to find sequence in data dictionary
    SELECT
      1
    INTO v_dummy
    FROM user_sequences
    WHERE sequence_name = 'UFI_RQST_SEQ';

    -- if sequence found, do nothing
    EXCEPTION WHEN no_data_found THEN -- sequence not found, create it
    EXECUTE IMMEDIATE 'CREATE SEQUENCE UFI_RQST_SEQ
                       START WITH     1
                       INCREMENT BY   1
                       NOCACHE
                       NOCYCLE';
  END;
END;

But I’d rather go catch on the Exception, saves you some unnecessary lines of code:

declare
v_sql LONG;
begin

v_sql:='create table EMPLOYEE
  (
  ID NUMBER(3),
  NAME VARCHAR2(30) NOT NULL
  )';
execute immediate v_sql;

EXCEPTION
    WHEN OTHERS THEN
      IF SQLCODE = -955 THEN
        NULL; -- suppresses ORA-00955 exception
      ELSE
         RAISE;
      END IF;
END; 
/