In Java, implementing database computation via SQL is a well-developed practice. However, structured data is not only stored in databases, but also in text, Excel, and XML files. How, then, can we compute appropriately on structured data from non-database files? This article offers three solutions for your reference: implement via the Java API, convert to database computation, and adopt a common data computation layer.
Implement via the Java API. This is the most straightforward method. The Java API lets programmers control every computational step meticulously, monitor the intermediate result of each step intuitively, and debug conveniently. Needless to say, zero learning cost is an additional advantage of the Java API.
Thanks to its well-developed APIs for reading and writing data in TXT, Excel, and XML files, Java has enough technical strength to offer full support for such computation, in particular for simple computational goals.
However, this method requires a great deal of work and is quite inconvenient.
For example, since the common data algorithms are not implemented in Java's standard library, programmers have to spend great time and effort implementing aggregation, filtering, grouping, sorting, and other common actions manually.
For another example, in data storage and detail-data retrieval through the Java API, programmers have to assemble every record and 2D table with List, Map, and other objects, and then compute in nested multi-level loops. Moreover, such computation usually involves set operations and relational computations on massive data, as well as computations between objects and object properties. It takes great effort to implement the underlying logic, and even greater workload to handle complex ordered computation.
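To make this plumbing concrete, here is a minimal sketch (not from the article; the Row record and sample data are assumptions) of the kind of hand-written grouping and aggregation described above, done with a Map and a loop:

```java
import java.util.*;

// A hedged sketch of the manual plumbing the text describes: grouping rows
// (as if read from a text file) by a key and summing, by hand.
public class ManualGroupSum {
    record Row(String region, double amount) {}

    // Group rows by region and sum the amounts: logic that SQL gives for free.
    static Map<String, Double> groupSum(List<Row> rows) {
        Map<String, Double> totals = new TreeMap<>();
        for (Row r : rows) {
            totals.merge(r.region(), r.amount(), Double::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
            new Row("East", 100.0), new Row("West", 80.0),
            new Row("East", 50.0), new Row("West", 20.0));
        System.out.println(groupSum(rows)); // prints {East=150.0, West=100.0}
    }
}
```

Even this single grouping step needs an explicit map and loop; filtering, sorting, and multi-key grouping each add more of the same boilerplate.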
In order to reduce the programming workload, programmers always prefer leveraging existing algorithms to implementing all the specifics by themselves. In view of this, the second method below may serve better:
Convert to database computation. This is the most conservative method. Concretely speaking, it imports the non-database data into a database via common ETL tools like DataStage, DTS, Informatica, and Kettle. The advantages of this practice include high computational efficiency, stable running, and less workload for Java programmers. It fits scenarios with great data volume, high performance demands, and medium computational complexity. These advantages are especially evident for mixed computation over both database and non-database files.
The main drawbacks of this method are the great workload in the early ETL stage and the difficulty of maintenance. First, since the non-database data cannot be used directly without field splitting, merging, and validation, programmers have to write a great many Perl/JS scripts to clean and reorganize the data. Second, the data is usually updatable, so the scripts must handle incremental-update issues. Data from various sources can hardly be made compatible with a single normal form, so it is often unusable before a second- or even third-level ETL process. Third, scheduling is also a problem when there are many tables: which table must be uploaded first? Which one second? At what interval? In fact, the huge workload of ETL always exceeds expectations, and the project risk is quite tough to evade. Plus, the real-time performance of ETL is poor, owing to the regular transit through the database.
In some operating environments there may be no database service at all, for the sake of security or performance. Or, if most data is saved in TXT/XML/Excel files and no database is involved, then ETL loses its reason to exist. What can we do then? Let's try the third method:
Adopt a common data computation layer, typified by esProc and R. The data computation layer sits between the data persistence layer and the application layer. It is responsible for computing the data from the persistence layer uniformly and returning the result to the application layer. In Java, the data computation layer is mainly used to reduce the coupling between the application layer and the data persistence layer, and to alleviate the computational pressure on both.
The common data computation layer offers direct support for various data sources: not only databases, but also non-database sources. Taking advantage of this, programmers can access various data sources directly, free from concerns such as real-time data issues. In addition, programmers can conveniently implement interactive computation between various data sources, for example between DB2 and Oracle, or MySQL and Excel. In the past, such access was by no means easy to implement.
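For contrast, here is a hedged sketch of what mixing a file source with a database source looks like in plain Java, without any computation layer. The JDBC URL, table, and file names are illustrative assumptions, not part of the original article:

```java
import java.nio.file.*;
import java.sql.*;
import java.util.*;

// Sketch: joining client names from a CSV file with order totals from a
// database, done manually. All source names here are made up.
public class MixedSourceJoin {
    // Pure join step: attach client names (from the file) to order totals
    // (from the database) by client id.
    static Map<String, Double> joinNames(Map<String, String> clients,
                                         Map<String, Double> totalsById) {
        Map<String, Double> result = new TreeMap<>();
        for (var e : totalsById.entrySet()) {
            result.put(clients.getOrDefault(e.getKey(), "(unknown)"),
                       e.getValue());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // 1. Client names from a CSV file: lines of "id,name".
        Map<String, String> clients = new HashMap<>();
        for (String line : Files.readAllLines(Path.of("clients.csv"))) {
            String[] f = line.split(",");
            clients.put(f[0], f[1]);
        }
        // 2. Order totals from a database.
        Map<String, Double> totalsById = new HashMap<>();
        try (Connection c = DriverManager.getConnection("jdbc:mysql://localhost/erp");
             Statement s = c.createStatement();
             ResultSet rs = s.executeQuery(
                 "SELECT client_id, SUM(amount) FROM orders GROUP BY client_id")) {
            while (rs.next()) {
                totalsById.put(rs.getString(1), rs.getDouble(2));
            }
        }
        System.out.println(joinNames(clients, totalsById));
    }
}
```

Every source needs its own parsing and loading code, and the join itself is a manual map lookup; a computation layer absorbs exactly this kind of glue.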
The versatile data computation layers are usually more professional with structured data; for example, they support generics, explicit sets, and ordered arrays. So complex computational goals, which are tough jobs for ETL/SQL and other conventional tools, can be achieved easily with this layer.
The drawback of this method mainly lies in performance. The common data computation layer performs full in-memory computation, so the size of memory determines the upper limit of the data volume it can handle. However, both esProc and R support Hadoop directly, so their users can handle big data in a distributed environment.
The main differences between esProc and R are that esProc supports direct JDBC output and integrates with Java code conveniently. In addition, the esProc IDE is much easier to use, with support for true debugging, scripts in a grid, and cell names for directly referencing computed results. R provides none of these, nor support for JDBC, so integration is a bit more complex for R users. However, R supports correlation analyses and other model analyses, so R programmers do not have to implement all the specifics to generate a result. R also supports TXT/Excel/XML files and many more non-database data sources; by comparison, esProc supports only two of them. Last but not least, the low-end edition of R is fully open source.
The above compares the three methods; you can choose the right one based on the characteristics of your project.
August 28, 2013
August 21, 2013
Innovative Tool-Defined Solution for Data Preparation of Reports
According to research, most complex report development work can be simplified by performing the data source computation in advance. For example: find the clients who bought every product in a given list, and then present the details of those clients.
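As a hedged illustration of what "computing in advance" means here (client and product names are made-up sample data, not from the article), the "bought all products in the list" condition is a simple set test once the data is prepared:

```java
import java.util.*;

// Sketch: filter clients whose purchased-product set covers a required list,
// done in the data-preparation stage rather than in the report.
public class AllProductsFilter {
    static List<String> clientsBuyingAll(Map<String, Set<String>> purchases,
                                         Set<String> required) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : purchases.entrySet()) {
            if (e.getValue().containsAll(required)) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> purchases = Map.of(
            "Alice", Set.of("A", "B", "C"),
            "Bob", Set.of("A", "C"));
        System.out.println(clientsBuyingAll(purchases, Set.of("A", "B")));
        // prints [Alice]
    }
}
```

Stated as a set-containment test the condition is one line; expressed inside a report tool or a single SQL statement it becomes much more contorted.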
In developing such reports, it is the "computation" part, not the "presentation" part, that brings the major difficulties. At which stage is the computation most cost-effective? Should the computation be done in the data retrieval script, or in the post-retrieval report presentation?
Report developers are usually more willing to compute straightforwardly in the report after retrieving data with SQL or a wizard. On the one hand, this is because most report tools are capable of some simple step-by-step computations by themselves, while SQL requires incorporating all the logic into one statement that cannot be decomposed into several examinable components; on the other hand, most report developers are more familiar with report functions than with SQL/SP, and SQL/SP scripts are more difficult to understand.
However, the report alone cannot give a satisfactory result. Many report developers find their computational goal hard to achieve in the report. They are ultimately hard-pressed to learn SQL/SP, or to request assistance from the database administrator. Why?
The root cause is that a report is mainly designed to present, not to compute. Computation is a non-core feature of a report, designed to solve only the commonest and easiest problems. Achieving a truly complex computational goal still depends on professional computing scripts such as SQL. So only computing in the data source in advance can simplify and streamline the development of such reports.
Stuck in a dilemma? On the one hand, the report can only provide limited data computing capability; on the other hand, SQL/SP is hard to comprehend, and its computational procedure is neither intuitive nor step-by-step. This is quite a headache for most report developers.
esProc can solve this dilemma. It is a professional development tool for report data sources, offering the expected computational capability in a user-friendly grid style. In addition, it enables step-by-step computation, presenting the result of each step more clearly than a report does. Compared with SQL, esProc is easier for report developers to learn and understand. They can use it to solve complex computations more easily and independently, including the computation in the case above.
esProc scripts:
Like SQL, esProc supports external parameters. The report can reference the esProc script directly through the JDBC interface.
In addition, esProc is built with a complete debugging function, and is also capable of retrieving and operating on data from multiple databases, text files, and Excel sheets to implement cross-database computation. esProc is a good assistant for reporting tools and an expert in report data source computation.
August 13, 2013
Developers Find Innovative Way for Data Process in Java
In Java development, the typical data computation problems are characterized by:
- A long computation procedure requiring a great deal of debugging
- Data may come from a database, or from Excel/TXT files
- Data may come from multiple databases, instead of just one
- Some computational goals are complex, such as relative-position computations and set-related computations
Just suppose a sales department needs statistics on the top 3 salesmen by monthly sales in every month from January to the previous month, based on the order data.
Java alone has difficulty handling such computations. Although it is powerful enough and quite convenient for debugging, Java has not directly implemented the common computational algorithms. So Java programmers still have to spend great time and effort implementing details like aggregation, filtering, grouping, sorting, and ranking. In respect of data storage and access, programmers have to use List and other objects to assemble every 2D table and every piece of data, and then arrange nested multi-level loops. In addition, such computation involves set and relational operations on massive data, or relative positions between objects and object properties. The underlying logic for these computations demands great effort, not to mention Excel or text data, data from sets, and complex computational goals.
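To show the scale of the hand-coding involved, here is a hedged sketch of the "top 3 salesmen per month" task in modern Java (the Order record and sample figures are assumptions); even with the streams API, the grouping-and-ranking logic must all be spelled out:

```java
import java.util.*;
import java.util.stream.*;

// Sketch: top 3 salesmen by monthly sales, per month, from flat order rows.
public class Top3PerMonth {
    record Order(int month, String salesman, double amount) {}

    static Map<Integer, List<String>> top3(List<Order> orders) {
        // 1. Sum sales per (month, salesman).
        Map<Integer, Map<String, Double>> byMonth = orders.stream()
            .collect(Collectors.groupingBy(Order::month,
                Collectors.groupingBy(Order::salesman,
                    Collectors.summingDouble(Order::amount))));
        // 2. Within each month, rank salesmen by total and keep the top 3.
        return byMonth.entrySet().stream()
            .collect(Collectors.toMap(Map.Entry::getKey,
                e -> e.getValue().entrySet().stream()
                    .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                    .limit(3)
                    .map(Map.Entry::getKey)
                    .collect(Collectors.toList())));
    }

    public static void main(String[] args) {
        List<Order> orders = List.of(
            new Order(1, "Ann", 300), new Order(1, "Ben", 500),
            new Order(1, "Cid", 200), new Order(1, "Dee", 100),
            new Order(2, "Ann", 400), new Order(2, "Ben", 100));
        System.out.println(top3(orders));
        // month 1 -> [Ben, Ann, Cid]; month 2 -> [Ann, Ben]
    }
}
```

And this assumes the orders are already loaded into memory; reading them from a database, text file, or Excel sheet adds yet more code.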
How can we improve the data computation capability of Java? How can we solve this problem easily?
SQL is an option. SQL implements lots of data computation algorithms and alleviates the workload to some extent. But it is far from solving the problem, due to the weak points below:
First, SQL takes a long query as its basic computation unit. Programmers are only allowed to view the final result, not the details of the run. It is awkward to prepare stored procedures and a great many staging tables just for debugging. Using special scripting for debugging? Low cost-efficiency indeed! A lengthy SQL statement brings an exponential increase in the difficulty of reading and writing, the possibility of error, and the maintenance cost.
Second, to address Excel, text, or heterogeneous data computation with SQL, programmers have to establish a data mart or global view with ETL or a Linked Server, at great cost. In addition, SQL does not support step-by-step computation for decomposing a complex computational goal, and its incomplete support for sets still makes some complex problems tough for programmers to solve.
So we can conclude that SQL has limited impact on improving computational efficiency for Java.
In this case, esProc is highly recommended: a database computation development tool ideal for simplifying complex computations, tailored for cross-database computation and explicit sets, with convenient debugging and direct JDBC support for easy integration with Java applications. For the above example, the esProc scripts are as shown below:
esProc boasts a grid style and agile syntax specially designed for massive amounts of structured data. In addition, esProc can directly retrieve and operate on data from multiple databases, text files, and Excel sheets. With support for external parameters, native support for cross-database computation, and code reuse, esProc greatly boosts the data computing efficiency of Java.
August 7, 2013
A New Way to Consolidate Various Data Sources for Reporting Tool
In report development, we may need to present data from multiple databases in one report, such as data from an MSSQL database for CRM and an Oracle database for ERP. If the reporting tool, like iReport, only supports a single data source, then we need to consolidate the multiple data sources into one.
Crystal, BIRT, and other so-called multi-data-source reporting tools can only join two result sets roughly, and are also very inconvenient for complex multi-data-source computations. For example: compute the yearly growth rate of order value for each client in the ERP, group by the client data from the CRM, and then present the result in a report.
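As a rough illustration of the computation involved (field names and sample figures below are assumptions, not from the article), the growth-rate step itself is simple once the yearly totals have been retrieved from the two databases:

```java
import java.util.*;

// Sketch: yearly growth rate of order value per client, computed in Java
// after retrieving yearly totals (client -> year -> total) from the sources.
public class GrowthRate {
    static Map<String, Double> growth(Map<String, Map<Integer, Double>> totals,
                                      int year) {
        Map<String, Double> result = new TreeMap<>();
        for (var e : totals.entrySet()) {
            Double prev = e.getValue().get(year - 1);
            Double cur = e.getValue().get(year);
            // Skip clients missing either year's total.
            if (prev != null && cur != null && prev != 0) {
                result.put(e.getKey(), (cur - prev) / prev);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Map<Integer, Double>> totals = Map.of(
            "Acme", Map.of(2012, 100.0, 2013, 150.0),
            "Best", Map.of(2012, 200.0, 2013, 180.0));
        System.out.println(growth(totals, 2013)); // {Acme=0.5, Best=-0.1}
    }
}
```

The hard part is not this arithmetic but retrieving and aligning the rows from two different databases, which is exactly where single-source reporting tools fall short.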
It is believed that most companies adopt a commercial strategy of providing only single-data-source editions of reporting tools, even if they are already capable of providing full support for multiple data sources.
How can we handle such a situation?
The commonest practice is to utilize an ETL or data warehouse tool to consolidate the data from various platforms into a single database. This surely requires preparing ETL scripts for regular updates according to specific rules, building a global view or organizing the data as a single data source with stored procedures, and the contribution of a DBA. All in all, achieving the goal this way comes at the great cost of additional human resources, a great deal of time, and modifications to the database.
If cost is not a primary concern, then the higher editions of database administration tools could be a better choice for implementation, such as Server Link or Linked DB. In doing so, we will have to purchase a database server for separate use, recruit additional staff, bear maintenance expenditures, and keep safety considerations in mind. In fact, such a practice only automates a few ETL functions based on the same core; the inconveniences of handling complex computations across data sources remain unsolved.
We need a tool that is not only lightweight, convenient, and easy to use, but also powerful enough to handle such a situation.
esProc is such a tool: specially built for database computation, expert at simplifying complex computations, and perfect for debugging. With support for cross-database computation and a JDBC interface, esProc can easily integrate with reporting tools.
For the above case, the whole procedure of data retrieval, computation, and merging can be accomplished in a few lines of esProc script, clearly and concisely, as shown below:
Then the result can be queried directly through the JDBC interface by the Java report, just as easily as connecting to a common database.