JS tilde IIFE

// Without a superfluous operator, we need to surround the anonymous ‘scoping’ function in
// parentheses to force it to be parsed as an expression instead of a *declaration*,
// which allows us to invoke it immediately.
;(function(){
   // ...
})()

// By inserting a superfluous operator, we can omit those parentheses,
// as the operator forces the parser to view the anonymous function as
// an expression *within* the statement, instead of as the
// statement itself, which saves us a character overall, as well as some ugliness:
;+function(){
   // ...
}()

// But, in all of the above examples, if one is depending on ASI, and
// doesn't needlessly scatter semicolons all over their code out of ignorance,
// a prepended semicolon is necessary to prevent snafus like the following:
var foo = 4
+function(){
   // ...
}()
// ... in which case, the variable `foo` would be set to a crazy
// addition / concatenation involving the (probably non-existent) *return value*
// of our anonymous ‘scoping’ function. Hence, our friend the bitflip:
var foo = 4
~function(){
   // ...
}()
// ... he solves all of our problems, by disnecessitating the prepended semicolon
// *and* the wrapping parentheses.

JVM JIT and running mode

JIT

In the early days of Java, all code was executed in an interpreted manner, i.e. the bytecode was interpreted and executed line by line, which is slow, especially for code that is executed frequently.

So the JIT compiler was later introduced: when some code is executed frequently, it becomes ‘hot spot code’ and is compiled to machine code. Typically there are 2 types of hot spot code:

  1. A function that is called very frequently
  2. The body of a loop that executes many times.

The JVM maintains a counter of how many times a function is executed. If this count exceeds a predefined threshold, the JIT compiles the code into machine code that can be executed directly by the processor (unlike the normal case, in which javac compiles the source into bytecode and the interpreter then interprets this bytecode line by line, converting it into machine instructions as it executes).

The next time this function is called, the same compiled code is executed again, unlike normal interpretation in which the code is interpreted again line by line.
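
As a toy illustration (the class and method names here are made up), the following program calls one small method often enough for it to become hot spot code; running it with the standard -XX:+PrintCompilation flag prints a line when the JIT compiles the method:

public class HotSpotDemo {

    // A small method that is invoked often enough to become 'hot spot code'.
    static long square(int x) {
        return (long) x * x;
    }

    public static void main(String[] args) {
        long sum = 0;
        // After enough invocations the JIT compiles square() to machine code.
        // Run with: java -XX:+PrintCompilation HotSpotDemo to watch it happen.
        for (int i = 0; i < 1_000_000; i++) {
            sum += square(i);
        }
        System.out.println(sum);
    }
}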

Server vs client mode

When we check the Java version, the output has 3 lines; let's look at what the 3rd line means.


---> java -version
java version "1.8.0_66"
Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)

The HotSpot JVM has 2 modes: client and server. We can start with java -server xxx or java -client xxx. The server mode does heavier compilation and optimization, hence has a longer startup time but better long-run performance.

int/comp/mixed

Use java -Xint / -Xcomp / -Xmixed to choose; mixed is the default.

java -Xint -version 

will show that the JVM is in interpreted mode, so code is interpreted and executed line by line, which is quite slow, especially in loops. On the other hand, -Xcomp compiles all code to machine code before executing it.

A Chinese blog post on JVM types and modes

GC-friendly Java coding

1. Give a size when initializing collections

When initializing a Map/List etc., if we know the expected size, pass it into the constructor. This way the JVM does not have to repeatedly allocate new backing arrays and copy elements as the collection grows.
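
A minimal sketch of the idea (the sizes here are hypothetical):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PresizedCollections {
    public static void main(String[] args) {
        int expectedSize = 10_000; // assume we know this up front

        // The backing array is allocated once instead of being grown and copied repeatedly.
        List<Integer> ids = new ArrayList<>(expectedSize);

        // HashMap resizes once size exceeds capacity * loadFactor (0.75 by default),
        // so oversize the initial capacity accordingly.
        Map<Integer, String> names = new HashMap<>((int) (expectedSize / 0.75f) + 1);

        for (int i = 0; i < expectedSize; i++) {
            ids.add(i);
            names.put(i, "user-" + i);
        }
        System.out.println(ids.size() + " / " + names.size());
    }
}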

2. Use immutable objects

Immutability has a lot of benefits. One that is typically overlooked is its effect on GC.

We cannot modify the fields/references of an immutable object:

public class ObjectPair {

    private final Object first;
    private final Object second;

    public ObjectPair(Object first, Object second) {
        this.first = first;
        this.second = second;
    }

    public Object getFirst() {
        return first;
    }

    public Object getSecond() {
        return second;
    }
}

This means an immutable object cannot reference objects that are created after it. So when a GC runs in the young generation, it can skip immutable objects in the old generation, since they cannot hold references to objects in the current young generation; that means fewer memory pages scanned and shorter GC cycles.

3. Prefer streams to big blobs

When dealing with a large file, reading it entirely into memory creates a large object on the heap and can easily result in an OOM. Exposing the file as a stream and processing it incrementally is more efficient, and there are plenty of APIs for processing streams.
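
For example, a minimal sketch using java.nio.file.Files.lines (the file path and the ERROR filter are hypothetical):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class StreamLargeFile {
    public static void main(String[] args) throws IOException {
        Path log = Paths.get("/tmp/big.log"); // hypothetical large file

        // Files.lines reads the file lazily, so only a small window of lines is on
        // the heap at any time, instead of one giant String/byte[] blob.
        try (Stream<String> lines = Files.lines(log)) {
            long errorCount = lines.filter(line -> line.contains("ERROR")).count();
            System.out.println("error lines: " + errorCount);
        }
    }
}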

4. String concatenation in loops

Typically the Java compiler does a pretty good job of optimizing String concatenation (with ‘+’) by using a StringBuilder under the hood.

However, when it comes to a for loop, it is a different story: the temporary strings in the loop cause a new StringBuilder to be created on every iteration. A better approach is to use a StringBuilder directly when concatenating in a loop. More detail can be found in this SO answer.
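
A small sketch of the contrast (toy data, made-up names):

public class LoopConcat {
    public static void main(String[] args) {
        String[] words = {"gc", "friendly", "java", "coding"};

        // Anti-pattern: each iteration compiles to a fresh StringBuilder plus a
        // throwaway intermediate String, all of which immediately become garbage.
        String slow = "";
        for (String w : words) {
            slow = slow + w + " ";
        }

        // Better: one StringBuilder reused across the whole loop.
        StringBuilder sb = new StringBuilder();
        for (String w : words) {
            sb.append(w).append(' ');
        }
        String fast = sb.toString();

        System.out.println(slow.equals(fast)); // true
    }
}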

Hoisting of var, let, const, function, function*, class

I have been playing with ES6 for a while and I noticed that while variables declared with var are hoisted as expected…

console.log(typeof name); // undefined
var name = "John";

…variables declared with let or const seem to have some problems with hoisting:

console.log(typeof name); // ReferenceError
let name = "John";

and

console.log(typeof name); // ReferenceError
const name = "John";

These variables cannot be accessed before they are declared. However, it's a bit more complicated than that.

Are variables declared with let or const not hoisted? What is really going on here?

All declarations (var, let, const, function, function*, class) are hoisted in JavaScript. This means that if a name is declared in a scope, in that scope the identifier will always reference that particular variable:

x = "global";
(function() {
    x; // not "global"

    var/let/… x;
}());
{
    x; // not "global"

    let/const/… x;
}

This is true both for function and block scopes.

The difference between var/function/function* declarations and let/const/class declarations is the initialisation.
The former are initialised with undefined or the (generator) function right when the binding is created at the top of the scope. The lexically declared variables however stay uninitialised. This means that a ReferenceError exception is thrown when you try to access them. They only get initialised when the let/const/class statement is evaluated; everything above that is called the temporal dead zone.

x = y = "global";
(function() {
    x; // undefined
    y; // Reference error: y is not defined

    var x = "local";
    let y = "local";
}());

Notice that a let y; statement initialises the variable with undefined like let y = undefined; would have.

Is there any difference between let and const in this matter?

No, they work the same as far as hoisting is concerned. The only difference between them is that a constant must be and can only be assigned in the initialiser part of the declaration (const one = 1;, both const one; and later reassignments like one = 2 are invalid).

Hive Hadoop mapper size

Below is an article from MapR

A Hive table contains files in HDFS; if one table or one partition has too many small files, HiveQL performance may be impacted.
Sometimes it may take a lot of time to prepare a MapReduce job before submitting it, since Hive needs to get the metadata from each file.
This article explains how to control the number of files of a Hive table after inserting data on MapR-FS; simply put, it explains how many files will be generated for the “target” table by the HiveQL below:

INSERT OVERWRITE TABLE target SELECT * FROM source;

The HiveQL above has 2 major steps:

1. MapReduce job (in this example, map-only) to read the data from the “source” table.

The number of mappers determines the number of intermediate files, and the number of mappers is determined by the 3 factors below:

a. hive.input.format

Different input formats may start different numbers of mappers in this step.
The default value in Hive 0.13 is org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.
It combines all files together and then tries to split, which improves performance when the table has too many small files.
An older default value is org.apache.hadoop.hive.ql.io.HiveInputFormat, which splits each file separately. E.g., if you have 10 small files and each file has only 1 row, Hive may spawn 10 mappers to read the whole table.
This article uses the default CombineHiveInputFormat as the example.

b. File split size

mapred.max.split.size and mapred.min.split.size control the “target” file split size.
(In recent Hadoop 2.x these 2 parameters are deprecated; the new ones are mapreduce.input.fileinputformat.split.maxsize and mapreduce.input.fileinputformat.split.minsize.)
For example, if a Hive table has one 1GB file and the target split size is set to 100MB, 10 mappers MAY be spawned in this step. The reason for “MAY” is factor c below.

c. MapR-FS chunk size

Files in MapR-FS are split into chunks (similar to Hadoop blocks) that are normally 256 MB by default. Any multiple of 65,536 bytes is a valid chunk size.
The actual split size is max(target split size, chunk size).
Take the above 1GB file with a 100MB “target” split size for example: if the chunk size is 200MB, the actual split size is 200MB and 5 mappers are spawned; if the chunk size is 50MB, the actual split size is 100MB and 10 mappers are spawned.
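
A small sketch of that arithmetic (the class and method names are my own, and it ignores Hadoop's split-slop details; it just applies the max(target split size, chunk size) rule and rounds up, treating "1GB" as 1000MB to match the round numbers above):

public class SplitMath {

    // Estimated mapper count under the rule: actual split size = max(target split size, chunk size).
    static long estimateMappers(long fileSizeBytes, long targetSplitBytes, long chunkBytes) {
        long actualSplit = Math.max(targetSplitBytes, chunkBytes);
        return (fileSizeBytes + actualSplit - 1) / actualSplit; // ceiling division
    }

    public static void main(String[] args) {
        long MB = 1024L * 1024L;
        long oneGB = 1000 * MB; // round-number "1GB" as in the example above
        System.out.println(estimateMappers(oneGB, 100 * MB, 200 * MB));      // 5 mappers
        System.out.println(estimateMappers(oneGB, 100 * MB, 50 * MB));       // 10 mappers
        System.out.println(estimateMappers(644 * MB, 100 * MB, 64 * 1024L)); // 7 mappers (cf. "source3" below)
    }
}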

Lab time:

Imagine we have prepared 3 Hive tables of the same size (644MB) and only 1 file for each table.
The only difference is the chunk size of the 3 tables.
source  — chunk size=8GB.
source2 — chunk size=256MB (the MapR-FS default).
source3 — chunk size=64KB (the minimum).

# hadoop mfs -ls /user/hive/warehouse/|grep -i source
drwxr-xr-x Z U   - root root          1 2014-12-04 11:22 8589934592 /user/hive/warehouse/source
drwxr-xr-x Z U   - root root          1 2014-12-04 11:31  268435456 /user/hive/warehouse/source2
drwxr-xr-x Z U   - root root          1 2014-12-04 12:24      65536 /user/hive/warehouse/source3

Then the question is: how many mappers will be spawned for the INSERT below, after setting the target split size to 100MB?

set mapred.max.split.size=104857600;
set mapred.min.split.size=104857600;
INSERT OVERWRITE TABLE target SELECT * FROM <source table>;

Results:
1.  Table “source”
The whole 644MB table is in 1 chunk (8GB each), so only 1 mapper.
2. Table “source2”
The whole 644MB table is in 3 chunks (256MB each), so 3 mappers.
3. Table “source3”
The whole 644MB table is in more than 10000 chunks (64KB each); the target split size (100MB) is larger than the chunk size, so the actual split size is 100MB and 7 mappers are spawned.

Thinking along the same lines, if the target split size is 10GB, only 1 mapper will be spawned in the 1st step for all 3 tables above; if the target split size is 1MB, the mapper counts are: source(1), source2(3), source3(645).
After figuring out the 1st step, let’s move to the 2nd step.

2. Small file merge MapReduce job

After the 1st MapReduce job finishes, Hive will decide whether it needs to start another MapReduce job to merge the intermediate files. If small file merge is disabled, the number of files in the target table equals the number of mappers from the 1st MapReduce job. The 4 parameters below determine if and how Hive merges small files.

  • hive.merge.mapfiles — Merge small files at the end of a map-only job.
  • hive.merge.mapredfiles — Merge small files at the end of a map-reduce job.
  • hive.merge.size.per.task — Size of merged files at the end of the job.
  • hive.merge.smallfiles.avgsize — When the average output file size of a job is less than this number, Hive will start an additional map-reduce job to merge the output files into bigger files. This is only done for map-only jobs if hive.merge.mapfiles is true, and for map-reduce jobs if hive.merge.mapredfiles is true.

By default hive.merge.smallfiles.avgsize=16000000 and hive.merge.size.per.task=256000000, so if the average file size is about 17MB, the merge job will not be triggered. If we really want only 1 file to be generated in the end, we need to increase hive.merge.smallfiles.avgsize to a value large enough to trigger the merge, and also increase hive.merge.size.per.task to get the desired number of files in the end.

Quiz time:

In Hive 0.13 on MapR-FS, with otherwise default configurations, how many files will be generated in the end for each of the HiveQL statements below? Can you guess the final file sizes?
Reminder: the chunk size for “source3” is 64KB.
1.

set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.smallfiles.avgsize=104857600;
set hive.merge.size.per.task=209715200;
set mapred.max.split.size=68157440;
set mapred.min.split.size=68157440;
INSERT OVERWRITE TABLE target SELECT * FROM source3;

2.

set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.smallfiles.avgsize=283115520;
set hive.merge.size.per.task=209715200;
set mapred.max.split.size=68157440;
set mapred.min.split.size=68157440;
INSERT OVERWRITE TABLE target SELECT * FROM source3;

Answers:
1. The target split size is 65MB and the chunk size is only 64KB, so the 1st job will spawn 10 mappers and each mapper will generate one ~65MB intermediate file.
The merge job will be triggered because the average file size from the previous job is less than 100MB (hive.merge.smallfiles.avgsize).
For each merge task, to reach the 200MB hive.merge.size.per.task, 4 x 65MB files will be merged into one 260MB file.
So in the end, 3 files will be generated for the target table: 644MB = 260MB + 260MB + 124MB.

[root@n2a warehouse]# ls -altr target
total 659725
-rwxr-xr-x  1 root root 130296036 Dec  5 17:26 000002_0
-rwxr-xr-x  1 root root 272629772 Dec  5 17:26 000001_0
drwxr-xr-x  2 root root         3 Dec  5 17:26 .
-rwxr-xr-x  1 root root 272629780 Dec  5 17:26 000000_0
drwxr-xr-x 38 mapr mapr        37 Dec  5 17:26 ..

2. The target split size is 65MB and the chunk size is only 64KB, so the 1st job will spawn 10 mappers and each mapper will generate one ~65MB intermediate file.
The merge job will be triggered because the average file size from the previous job is less than 270MB (hive.merge.smallfiles.avgsize).
For each merge task, to reach the 200MB hive.merge.size.per.task, 4 x 65MB files *should* be merged into one 260MB file. However, the average file size would then still be less than 270MB (hive.merge.smallfiles.avgsize), so the outputs would still be considered “small files”.
In this case, 5 x 65MB files are merged into one 325MB file.
So in the end, 2 files will be generated for the target table: 644MB = 325MB + 319MB.

[root@n1a warehouse]# ls -altr target
total 659724
-rwxr-xr-x  1 root root 334768396 Dec  8 10:46 000001_0
drwxr-xr-x  2 root root         2 Dec  8 10:46 .
-rwxr-xr-x  1 root root 340787192 Dec  8 10:46 000000_0
drwxr-xr-x 38 mapr mapr        37 Dec  8 10:46 ..

Key takeaways:

1.  MapR-FS chunk size and target split size determine the number of mappers and the number of intermediate files.
2. Small file merge job controls the final number of files for target table.
3. Too many small files for one table may introduce job performance overhead.

Oracle DBMS_STATS.GATHER_TABLE_STATS

The Oracle RDBMS allows you to collect statistics of many different kinds as an aid to improving performance. The DBMS_STATS package is concerned with optimizer statistics only. Given that Oracle enables automatic statistics collection of this kind by default, the DBMS_STATS package is intended only for specialized cases.

For GATHER_TABLE_STATS

Most enterprise databases, Oracle included, use a cost-based optimizer to determine the appropriate query plan for a given SQL statement. This means that the optimizer uses information about the data to determine how to execute a query rather than relying on rules (this is what the older rule-based optimizer did).

For example, imagine a table for a simple bug-tracking application

CREATE TABLE issues (
  issue_id     number primary key,
  issue_text   clob,
  issue_status varchar2(10)
);

CREATE INDEX idx_issue_status
    ON issues( issue_status );

If I’m a large company, I might have 1 million rows in this table. Of those, 100 have an issue_status of ACTIVE, 10,000 have an issue_status of QUEUED, and 989,900 have a status of COMPLETE. If I want to run a query against the table to find my active issues

SELECT *
  FROM issues
 WHERE issue_status = 'ACTIVE'

the optimizer has a choice. It can either use the index on issue_status and then do a single-row lookup in the table for each row in the index that matches or it can do a table scan on the issues table. Which plan is more efficient will depend on the data that is in the table. If Oracle expects the query to return a small fraction of the data in the table, using the index would be more efficient. If Oracle expects the query to return a substantial fraction of the data in the table, a table scan would be more efficient.

DBMS_STATS.GATHER_TABLE_STATS is what gathers the statistics that allow Oracle to make this determination. It tells Oracle that there are roughly 1 million rows in the table, that there are 3 distinct values for the issue_status column, and that the data is unevenly distributed. So Oracle knows to use an index for the query to find all the active issues. But it also knows that when you turn around and try to look for all the closed issues

SELECT *
  FROM issues
 WHERE issue_status = 'COMPLETE'

that it will be more efficient to do a table scan.

Gathering statistics allows the query plans to change over time as the data volumes and data distributions change. When you first install the issue tracker, you'll have very few COMPLETE issues and more ACTIVE and QUEUED issues. Over time, the number of COMPLETE issues rises much more quickly. As you get more rows in the table and the relative fraction of rows in the various statuses changes, the query plans will change so that, in the ideal world, you always get the most efficient plan possible.
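
For reference, a minimal sketch of invoking that statistics gathering from Java via JDBC; the connection details and the APP_SCHEMA owner are hypothetical, and GATHER_TABLE_STATS accepts many optional parameters that are omitted here:

import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;

public class GatherIssueStats {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details for the bug-tracking schema.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//db-host:1521/ORCLPDB1", "app_user", "secret");
             CallableStatement cs = conn.prepareCall(
                     // The first two parameters are the owner and the table name.
                     "{ call DBMS_STATS.GATHER_TABLE_STATS(?, ?) }")) {
            cs.setString(1, "APP_SCHEMA"); // hypothetical schema owner
            cs.setString(2, "ISSUES");
            cs.execute();
        }
    }
}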

DNS principles and A Record / NS Record / CNAME

A very good blog post about DNS by Ruan Yifeng (阮一峰). I especially like its explanation of hierarchical queries and of A records, NS records, and CNAME, which is simple and clear, so that part is reproduced below:

4. The hierarchy of domain names

How does a DNS server know the IP address of every domain name? The answer is hierarchical queries.

Look carefully at the earlier example: every domain name has an extra dot at the end.

For example, the domain math.stackexchange.com is shown as math.stackexchange.com.. This is not an oversight; every domain name actually ends with a root domain.

For example, the real domain name of http://www.example.com is http://www.example.com.root, abbreviated as http://www.example.com.. Because the root domain .root is the same for every domain name, it is normally omitted.

The level below the root domain is called the "top-level domain" (TLD), e.g. .com and .net; the level below that is called the "second-level domain" (SLD), e.g. .example in http://www.example.com, which is the level users can register; the next level down is the host name, e.g. www in http://www.example.com, also called the "third-level domain". This is the name a user assigns to a server within their own domain, and it can be chosen freely.

To summarize, the hierarchy of a domain name looks like this.


hostname.second-level-domain.top-level-domain.root-domain

# i.e.

host.sld.tld.root

5. Root name servers

A DNS server performs its lookup level by level, following the domain name hierarchy.

To be clear, every level of the hierarchy has its own NS records, and those NS records point to the name servers for that level. Those servers know the records of the next level down.

A "hierarchical query" starts from the root domain and looks up the NS records of each level in turn, until the final IP address is found. The process is roughly as follows.

  1. Ask the "root name servers" for the NS records and A records (IP addresses) of the "TLD name servers"
  2. Ask the "TLD name servers" for the NS records and A records (IP addresses) of the "second-level-domain name servers"
  3. Ask the "second-level-domain name servers" for the IP address of the "host name"

If you look at the process above carefully, you may notice that it never explains how the DNS server knows the IP addresses of the "root name servers". The answer is that the NS records and IP addresses of the root name servers essentially never change, so they are built into the DNS server.

Below is an example of the built-in root name server IP addresses.

That list shows three NS records for the root domain (.root): A.ROOT-SERVERS.NET, B.ROOT-SERVERS.NET, and C.ROOT-SERVERS.NET, together with their IP addresses (i.e. A records): 198.41.0.4, 192.228.79.201, and 192.33.4.12.

You can also see that the TTL of every record is 3600000 seconds, i.e. 1000 hours. In other words, the list of root name servers is only looked up once every 1000 hours.

Currently there are thirteen groups of root name servers in the world, from A.ROOT-SERVERS.NET through M.ROOT-SERVERS.NET.

6. A hierarchical query in practice

The +trace option of the dig command shows the entire hierarchical DNS query process.


$ dig +trace math.stackexchange.com

The first section of the output lists all NS records of the root domain ., i.e. all the root name servers.

Using the built-in root server IP addresses, the DNS server sends a query to all of those addresses, asking for the NS records of com., the top-level domain of math.stackexchange.com. The root server that replies first is cached, and later requests go only to that server.

Next comes the second section of the output.

It shows the 13 NS records of the .com domain, together with the IP address of each record.

The DNS server then queries these TLD name servers for the NS records of stackexchange.com, the second-level domain of math.stackexchange.com.

The result shows that stackexchange.com has four NS records, together with the IP address of each NS record.

The DNS server then queries those four NS servers for the host name math.stackexchange.com.

The result shows that math.stackexchange.com has 4 A records, i.e. the site can be reached at any of these four IP addresses. It also shows that the NS server that responded first was ns-463.awsdns-57.com, with IP address 205.251.193.207.

7. Querying NS records

The dig command can also look up the NS records of a single level of the hierarchy.


$ dig ns com
$ dig ns stackexchange.com

The +short option shows a simplified result.


$ dig +short ns com
$ dig +short ns stackexchange.com

8. DNS record types

The mapping between a domain name and an IP address is called a "record". Depending on the use case, records come in different types; we have already seen A records and NS records.

The common DNS record types are as follows.

(1) A: address record, returns the IP address the domain name points to.

(2) NS: name server record, returns the address of the server that holds the records for the next level of the domain name. This record can only be set to a domain name, not an IP address.

(3) MX: mail exchange record, returns the address of the server that receives email for the domain.

(4) CNAME: canonical name record, returns another domain name; that is, the queried domain is an alias that redirects to another domain (see below).

(5) PTR: pointer record, used only for looking up a domain name from an IP address (see below).

Generally, for reliability there should be at least two NS records, and there can also be multiple A and MX records; this provides redundancy and avoids a single point of failure.

CNAME records are mainly used for internal redirection of domain names; they give flexibility to the server configuration and are invisible to users. For example, the domain facebook.github.io is a CNAME record.


$ dig facebook.github.io

...

;; ANSWER SECTION:
facebook.github.io. 3370    IN  CNAME   github.map.fastly.net.
github.map.fastly.net.  600 IN  A   103.245.222.133

The result shows that the CNAME record of facebook.github.io points to github.map.fastly.net. In other words, when a user queries facebook.github.io, what is actually returned is the IP address of github.map.fastly.net. The benefit is that when the server's IP address changes, only the github.map.fastly.net domain needs to be updated; the user-facing facebook.github.io domain does not.

Because a CNAME record is a wholesale substitution, once a domain has a CNAME record it cannot have any other records (such as A or MX records); this restriction exists to prevent conflicts. For example, if foo.com points to bar.com and both domains have their own MX records, problems arise when the two disagree. Since the bare (apex) domain usually needs MX records, users are generally not allowed to set a CNAME record on it.

PTR records are used to look up a domain name from an IP address (reverse lookup). The -x option of dig queries PTR records.


$ dig -x 192.30.252.153

...

;; ANSWER SECTION:
153.252.30.192.in-addr.arpa. 3600 IN    PTR pages.github.com.

The result shows that the domain name of the server at 192.30.252.153 is pages.github.com.

One application of reverse lookup is spam prevention: verifying whether the IP address that sent an email really belongs to the domain it claims.

The dig command can query a specific record type.


$ dig a github.com
$ dig ns github.com
$ dig mx github.com