Apache POI vs. JXL

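First, a note on formats: POI reads and writes both the legacy .xls format and the Excel 2007 .xlsx format, while JXL handles only .xls.
Second, my personal impression is that JXL is the simpler of the two, so I recommend it for simple tasks. POI is considerably more powerful, but correspondingly more cumbersome to work with.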

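Since Microsoft published the Office file formats, many open source organizations have released libraries for reading and writing Excel files. In the Java world, Apache is arguably the leader of the open source community. The POI project, which began under Apache's Jakarta Project, provides very good support for Office files (although its Word support subproject seems to have stalled recently, and the website is openly recruiting contributors ^^ ; if you know the Word file format well, send them an email!). JXL, the Java Excel API, is an open source project that lets Java developers read the contents of Excel files, create new Excel files, and update existing ones. Because the API is pure Java, it can process Excel spreadsheets on non-Windows operating systems as well. And since it is written in Java, a web application can call it from JSP or Servlets to access Excel data.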

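Jakarta's POI project and the Java Excel API are roughly neck and neck in the open source world, but each has strengths and weaknesses. POI has minor bugs in some details, and writing images with it is more cumbersome than with JXL (it is possible, just less convenient); in other respects it is solid. JXL supports images (though only PNG), but its formula support is weak: it offers only simple formula reading. Which library to choose therefore depends entirely on your application. If your software is closely tied to finance, use POI; if you do not need formulas and are likely to export images, JXL is a reasonable choice.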

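As for the differences between the two, the main point I want to cover is JVM memory consumption.
Test data: 3,000 rows of 60 columns each, with a JVM heap size of 64 MB.
With POI: an out-of-memory error at around row 2,800.
With JXL: all 3,000 rows were written, with 21 MB of heap still free.
Clearly the gap in memory consumption is considerable.
Perhaps JXL does a better job of reclaiming and reusing resources.
I have not studied the relative speed of the two. I expect it only matters for large data sets: with small data the difference is minor and hard to notice. With large data, however, POI consumes far more JVM memory than JXL. In terms of features, on the other hand, JXL is the weaker of the two. So if the functionality you need is complex, consider POI; if you only need to generate large volumes of data, consider JXL. CSV is also a good option, although CSV is not a real Excel format.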
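
To make the CSV option concrete, here is a minimal sketch that uses only the Java standard library. CsvExport and its methods are hypothetical names made up for this note; they are not part of POI or JXL. Rows are written one at a time, so memory use does not grow with the number of rows.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of POI or JXL.
class CsvExport {

    // Quote a field only when it contains a comma, a quote, or a line break.
    // Embedded quotes are doubled, following the usual CSV convention.
    static String escape(String field) {
        if (field == null) {
            return "";
        }
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuotes) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Build one CSV line, terminated by CRLF.
    static String formatRow(List<String> row) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < row.size(); i++) {
            if (i > 0) {
                line.append(',');
            }
            line.append(escape(row.get(i)));
        }
        return line.append("\r\n").toString();
    }

    // Write rows one by one; nothing is kept in memory after it is written.
    static void writeRows(Writer out, Iterable<List<String>> rows) throws IOException {
        for (List<String> row : rows) {
            out.write(formatRow(row));
        }
    }

    public static void main(String[] args) throws IOException {
        StringWriter out = new StringWriter();
        writeRows(out, Arrays.asList(
                Arrays.asList("id", "name", "note"),
                Arrays.asList("1", "Smith, John", "said \"hi\"")));
        System.out.print(out);
    }
}
```

In a real export the Writer would wrap a file or the servlet response stream instead of a StringWriter.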

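Basic Excel operations
jxl: the most basic Excel API.
poi: also a basic API. It is slower than jxl when reading a 2 MB file, but it has the advantage of preserving the macros already present in an Excel file (it cannot be used to write new macros).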

Tomcat Clustering Analysis

How Clustering Works

Three parts:

  1. Load balancer
  2. Tomcat servers
  3. Session replication

Scalability

Scalability and clustering are not the same thing. Rather, clustering is a method of achieving scalability. Scalability is the ability of a server to process many concurrent requests efficiently, with the stated goal that the time it takes to process an ever-increasing number of simultaneous requests should stay as close as possible to the time it took to process the initial request.

Load Balancing

Load balancing is a group of technologies aimed at distributing request load across a group of servers. Load balancing is a key component of a clustering solution, as it provides several services required to achieve the other goals of clustering.

To enable scalability, a load balancing implementation attempts to route requests to the server with the least amount of current load, for faster processing. To enable high-availability, which we will define next, a load balancing implementation must keep track of the status of its various servers, so that requests are never dropped.
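
The two duties described above (prefer the least-loaded server, and never send a request to a server that is down) can be sketched in a few lines of Java. This is only an illustration under simple assumptions: Backend and LeastLoadBalancer are hypothetical names, the load figure is a plain request counter, and the health flag is assumed to be maintained by some health check that is not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical model of one server behind the balancer.
class Backend {
    final String name;
    int activeRequests;      // current load as seen by the balancer
    boolean healthy = true;  // assumed to be updated by a health check (not shown)

    Backend(String name, int activeRequests) {
        this.name = name;
        this.activeRequests = activeRequests;
    }
}

class LeastLoadBalancer {
    private final List<Backend> backends = new ArrayList<>();

    void add(Backend backend) {
        backends.add(backend);
    }

    // Pick the healthy backend with the fewest in-flight requests.
    // Returns an empty Optional when every backend is down.
    Optional<Backend> choose() {
        Backend best = null;
        for (Backend b : backends) {
            if (!b.healthy) {
                continue; // skip servers that failed the health check
            }
            if (best == null || b.activeRequests < best.activeRequests) {
                best = b;
            }
        }
        return Optional.ofNullable(best);
    }
}
```

Skipping unhealthy servers is what keeps requests from being dropped; choosing the smallest counter is what spreads the load.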

Many load balancing solutions also take advantage of the fact that a server is now fronting the actual request processing software to provide an additional layer of security, ignoring and dropping malicious traffic before it can even reach the application servers.

Finally, the load balancing implementation makes the whole clustering structure functional by encapsulating the cluster within a virtual container, with one point of access. This means that the client attempting to access the web application served by the cluster never needs to know whether or not a cluster is being used.

High Availability

High availability is a group of interrelated technologies and strategies with the aim of increasing the amount of time that the network is available to process requests. The most common of these techniques are failover, state replication, and load balancing.

Apache HTTPD and mod_jk/mod_proxy

The most popular server/software setup for Tomcat clustering is to front a cluster of Tomcat servers with an Apache Web Server running either the mod_jk or mod_proxy connector module. These modules, which are also often used simply to provide basic interoperability between Apache Web Server and Tomcat, each include built-in load balancing capabilities.

At one time, it was common practice to favor mod_jk over mod_proxy; this was because mod_jk was developed as part of the JK project, a Tomcat subproject aimed at improving connectivity between Tomcat and various web servers, and had support for AJP, an efficient protocol developed specifically for meta-data-rich communication between Apache Web Server and other types of servers.

The speed of AJP made this protocol preferable, and was a big vote on the side of mod_jk. However, when mod_proxy was refactored in Apache Web Server 2.2, it was vastly improved, and included new sub-modules offering support for AJP and load balancing features. Thus, the key differentiators between the two modules are now the maturity of their load balancing features and the ease with which they can be configured.

As far as ease of configuration is concerned, mod_proxy is the clear winner. The module was developed alongside the Apache Web Server, and its configuration is very straightforward, only requiring a set of changes within Apache Web Server’s main configuration file, httpd.conf.

By comparison, mod_jk must be configured within httpd.conf, and then directed to an additional file called workers.properties, which defines all the available Tomcat servers as “workers”, as well as a number of “virtual workers”, processes that are responsible for the actual work of load balancing. This is often confusing, and can be a real source of frustration. On the other hand, mod_jk, being the more mature project, offers a much finer-grained level of control over the load balancing.

In terms of sophistication, mod_jk wins hands down, and this makes it our recommended choice if you want real control over your load balancing. Although mod_proxy and mod_jk both include a web GUI, mod_jk’s is much richer, offering a full page of information about each node, as well as a GUI tool for configuring hot load balancing properties, meaning that servers can be taken online and offline for updates one by one without interruption of service.

The load balancing algorithms used by mod_jk are also more robust than mod_proxy’s, distributing load based on the number of HTTP sessions per server and each server’s “lbfactor”, a user-defined value used to incorporate the absolute performance potential of different servers into the equation.
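
The idea of weighting session counts by lbfactor can be illustrated with a small Java sketch. It is not mod_jk's actual algorithm, only one plausible reading of the description above: send the next new session to the worker whose sessions-per-lbfactor ratio is lowest. LbWorker and WeightedSessionBalancer are hypothetical names.

```java
import java.util.List;

// Hypothetical worker record; lbfactor mirrors the mod_jk setting of the same name.
class LbWorker {
    final String name;
    final int lbfactor;  // relative capacity, assumed to be greater than zero
    int sessions;        // HTTP sessions currently hosted by this worker

    LbWorker(String name, int lbfactor, int sessions) {
        this.name = name;
        this.lbfactor = lbfactor;
        this.sessions = sessions;
    }
}

class WeightedSessionBalancer {

    // Choose the worker with the lowest sessions / lbfactor ratio.
    // Ratios are compared by cross-multiplication to avoid floating point.
    // Returns null for an empty list.
    static LbWorker choose(List<LbWorker> workers) {
        LbWorker best = null;
        for (LbWorker w : workers) {
            if (best == null
                    || (long) w.sessions * best.lbfactor < (long) best.sessions * w.lbfactor) {
                best = w;
            }
        }
        return best;
    }
}
```

With an lbfactor of 3, a worker can hold three times as many sessions as a worker with an lbfactor of 1 before the two are considered equally loaded.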

Session Persistence

The final piece of the clustering puzzle is session persistence – making sure that the information from an individual user’s session is always available to them, even if the server currently hosting their session goes down, so that application state is maintained. There are a number of ways that session persistence can be factored into a cluster.

First of all, factoring the need to run in a clustered environment into the initial spec of an application can influence design decisions to a certain point. Non-complex state information that does not pose a security risk, such as the user’s current tab, can be preserved on the client side via hidden fields, cookies, and URI-rewriting. These methods can be used effectively for a variety of data types, but are unsuitable for complex or security-sensitive operations.

Secondly, the majority of modern load balancers, including mod_proxy and mod_jk, support a feature called “session stickiness”, which means that the load balancer remembers which cluster worker is storing the session information for each client, and proxies all subsequent requests from the same client to the same worker. This ensures that state is maintained as long as all servers are working properly. If a server goes down for any reason, however, the load balancer will begin directing requests to the remaining active servers, and the state data stored on the failed server will be lost.
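
A rough sketch of sticky routing in Java follows. It assumes the common Tomcat convention in which each server's jvmRoute is appended to the session ID after a dot (for example "A1B2C3.worker2"); StickyRouter and its methods are hypothetical names, and a real balancer does considerably more.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical router for illustration only.
class StickyRouter {
    // Worker name -> whether the worker is currently up.
    private final Map<String, Boolean> workers = new LinkedHashMap<>();

    void setWorker(String name, boolean up) {
        workers.put(name, up);
    }

    // Extract the route suffix from "ID.route"; null when there is none.
    static String routeOf(String sessionId) {
        if (sessionId == null) {
            return null;
        }
        int dot = sessionId.lastIndexOf('.');
        if (dot < 0 || dot == sessionId.length() - 1) {
            return null;
        }
        return sessionId.substring(dot + 1);
    }

    // Send the request to the worker named in the session ID if it is up.
    // Otherwise fall back to the first worker that is up (null if none).
    String route(String sessionId) {
        String sticky = routeOf(sessionId);
        if (sticky != null && Boolean.TRUE.equals(workers.get(sticky))) {
            return sticky;
        }
        for (Map.Entry<String, Boolean> e : workers.entrySet()) {
            if (e.getValue()) {
                return e.getKey();
            }
        }
        return null;
    }
}
```

The fallback branch is exactly where state is lost without replication: the request reaches a live worker, but that worker has never seen the session.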

Thus, a method of replicating the server-side session data must be provided to ensure that the cluster will truly never lose a transaction. There are several methods of doing this, which can be combined to create the best performing solution.

The simplest method of replicating session data within a cluster is to copy the data to at least one other worker. This “buddy system” method, in combination with some kind of health check or heartbeat function, allows the load balancer to detect when a server goes offline, and begin passing requests to its appropriate buddy worker.
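
The buddy system can be modeled in a few lines of Java. This is a toy, in-memory sketch with hypothetical names (BuddyNode, BuddyFailover); the alive flag stands in for whatever heartbeat or health check the load balancer uses, and real replication would go over the network.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical cluster node that copies every session write to one buddy.
class BuddyNode {
    final String name;
    BuddyNode buddy;       // the single peer that receives copies
    boolean alive = true;  // stands in for a heartbeat result
    final Map<String, String> sessions = new HashMap<>();

    BuddyNode(String name) {
        this.name = name;
    }

    // Store session state locally, then copy it to the buddy if it is up.
    void put(String sessionId, String state) {
        sessions.put(sessionId, state);
        if (buddy != null && buddy.alive) {
            buddy.sessions.put(sessionId, state);
        }
    }
}

class BuddyFailover {

    // Read from the primary while it is alive, otherwise from its buddy.
    // Returns null when neither node can serve the session.
    static String read(BuddyNode primary, String sessionId) {
        BuddyNode target = primary.alive ? primary : primary.buddy;
        if (target == null || !target.alive) {
            return null;
        }
        return target.sessions.get(sessionId);
    }
}
```

If the primary fails after a write, the buddy still holds the copy, so the client keeps its state.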

Ideally, the client should perceive the service as uninterrupted. However, this method can introduce overhead under high loads – the load balancer must preserve increasing amounts of session-routing information, while the Tomcat workers take on database-like load in addition to their dynamic content processing load, which can create a bottleneck.

The load balancer bottleneck can be eliminated by using a multicast replication model, where each node of the cluster replicates its session data to every other node. In large environments, the replication traffic this generates can mean that the overall cluster has to be split into several smaller clusters, each using the DeltaManager component. Small cluster setups without significant load should not experience these problems.

Other methods of achieving session persistence are to store the session information in a shared file system or JDBC-compliant database, or to use a cloud-based object cache system such as Terracotta. All of these methods carry an additional performance cost, as they require the additional step of writing and retrieving information to and from a database. However, as the overall goal of clustering is to improve availability, performance, and failover protection, this performance hit must be balanced against the other factors.


Apache Lucene – Index File Formats

Summary of File Extensions

The following table summarizes the names and extensions of the files in Lucene:

Name                    Extension                   Brief Description
Segments File           segments.gen, segments_N    Stores information about segments
Lock File               write.lock                  The write lock prevents multiple IndexWriters from writing to the same file
Compound File           .cfs                        An optional “virtual” file consisting of all the other index files, for systems that frequently run out of file handles
Fields                  .fnm                        Stores information about the fields
Field Index             .fdx                        Contains pointers to field data
Field Data              .fdt                        The stored fields for documents
Term Infos              .tis                        Part of the term dictionary, stores term info
Term Info Index         .tii                        The index into the Term Infos file
Frequencies             .frq                        Contains the list of docs which contain each term, along with frequency
Positions               .prx                        Stores position information about where a term occurs in the index
Norms                   .nrm                        Encodes length and boost factors for docs and fields
Term Vector Index       .tvx                        Stores offset into the document data file
Term Vector Documents   .tvd                        Contains information about each document that has term vectors
Term Vector Fields      .tvf                        The field-level info about term vectors
Deleted Documents       .del                        Info about which documents are deleted


Why is Apache used in front of Tomcat

  • Performance – If you have a lot of static content, serving it with Apache will improve your performance. If most of your content is dynamic, using Tomcat or GlassFish alone will be just as fast (probably faster).
  • Scalability – You can load balance multiple instances of your application behind Apache. This will allow you to handle more volume, and increase stability in the event one of your instances goes down.
  • Security – Apache, Tomcat, and GlassFish all support SSL, but if you decide to use Apache, that is most likely where you should configure it. If you want additional protection against attacks (DoS, XSS, SQL injection, etc.) you can install the mod_security web application firewall.
  • Additional Features – Apache has a bunch of nice modules available for URL rewriting, interfacing with other programming languages, authentication, and a ton of other stuff.

    A different view:

    • The AJP connector does not and will not support advanced I/O, meaning no Comet, WebSockets, etc.
    • If you are not using AJP, I have noticed a fairly large proxy overhead when using mod_proxy for Apache. So if you are looking for low latency, putting Apache in front would not be a good choice.
    • Apache has a rather big footprint compared to Nginx, Lighttpd, etc.

    Putting Apache in front does NOT: