September 27, 2013

A creative computing script designed to enhance computing performance


Database plays an irreplaceable role in the modern economy and is widely used in the business computing areas like Enterprise Resources Planning (ERP), Customer Relation Management (CRM), Supply Chain Management (SCM), and the Decision Support System (DSS).
Computation on the structured data in the database mainly relies on SQL (Structured Query Language). SQL is the powerful, simple-to-use, and widely-applied database computing script. However, it has some native drawbacks: non-stepwise computation, incomplete set-lization, and no object reference available. Although almost all vendors have introduced and launched some non-compatible solution, such as various stored procedure like PL-SQL®, T-SQL®. These improved alternatives cannot remedy the native SQL drawbacks.
esProc solves these drawbacks completely with more powerful computational capability, much lower technical requirement, and broader scope of application. It is a more capable database computing scripts.

I. Step-by-step Computation
Case Description
A multinational retail enterprise needs to collect statistics on the newly opened retail store, including: How many new retail stores will open in this year? Of which how many companies have the sales over 1 million dollars? Among these companies with over-1-million sales, how many companies are abased overseas?
This question is progressive. The three questions are mutually related, the next question can be regarded as the further exploring on the current question, fit for step-by-step computation.
The original data is from the database of stores table with the main fields: storeCode, storeName, openedTime, profit, and nation. Let's check the SQL solution first.

 SQL Solution
To solve such problem with SQL, you will need to write 3 SQL statements as given below.
l  SELECT COUNT(*) FROM stores WHERE to_char (openedTime, 'yyyy')  =  to_char (sysdate,'yyyy');
l  SELECT COUNT(*) FROM stores WHERE to_char (openedTime, 'yyyy')  =  to_char (sysdate,'yyyy') and profit>1000000;
l  SELECT COUNT(*) FROM stores WHERE to_char (openedTime, 'yyyy')  =  to_char (sysdate,'yyyy') and profit>1000000 and nation<>’local’;
SQL1:Get the result of question 1.
SQL2:Solve the problem 2.Because the step-by-step computation is impossible (that is, the results of previous computation cannot be utilized), you can only solve and take it as an individual problem.
SQL3: Solve the problem 3,and you are not allowed to compute in steps either.

 esProc Solution









A1 cell: Get the records requested in problem 1.
A2 cell: Step-by-step computation. Operate on the basis of cell A1, and get the record meeting the conditions of problem 2.
A3 cell: Proceed with the step-by-step computation, and get the records requested in the problem 3.
B1, B2, and B3 cell: It is still the step-by-step computation. Count the corresponding records.

Comparison
For the SQL, there are 3 associations for you to compute in steps, and explore progressively. However, because step-by-step computation is hard to implement with SQL, this problem has to be divided into 3 individual problems.
esProc is to compute in steps following the natural habit of thinking: Decompose the general objective into several simple objective; Solve every small objective step by step; and ultimately complete the final objective.
In case that you proceed with the computation on the basis of the original 3 problems, for example, seek "proportion of problem 3 taken in the problem 2", or "on" problem 3, group by country". As for esProc users, they can simply write ”=A3/A2”, and ”A3.group(nation)”. In each step, there is a brief and clear expression of highly readable, without any requirements on a strong technical background. By comparison, SQL requires redesigning the statement. The redesigned statement will undoubtedly become more and more complex and longer. Such job can only be left to those who have the advanced technical ability in SQL.
esProc can decompose the complex problem into simple computation procedure based on the descriptions from the business perceptive. This is just the advantage of the step-by-step computation. By comparison, SQL does not allow for computation by step or problem decomposition, and thus it is against the scientific methodology, and not fit for the complex computation.
II.               Complete Set-lization
Case Description
A certain advertisement agency needs to compute the clients whose annual sales values are among the top 10.
The data are from the sales table, which records the annual sales value of each client with the fields like customer, time, and amount.

 SQL solution
SELECT customer
FROM (
    SELECT customer
    FROM (
         SELECT customer,RANK() OVER(PARTITION BY time ORDER BY amount DESC) rankorder 
         FROM  sales ) 
    WHERE rankorder<=10) 
GROUP BY customer
HAVING COUNT(*)=(SELECT COUNT(DISTINCT time) FROM sales)
Such Problem requires ranking the sets of a set, that is, group by “time” and then rank by “customer” in the group. Since the popular SQL-92 syntax is still hard to represent this, the SQL-2003 standard, which is gradually supported by several vendors, will be used to solve this problem barely.

Just a tip to compute the customer intersections in the last step, the count of years equals to the count of clients.

 esProc Solution









A1: Group the original dataset by year so that A1 will become a set of sets.
B1: Get the serial number of records whose sales values are among the top 10 of each group. The rank() is used to rank in every group, and pselect() can be used to retrieve the serial number on conditions. ~ is used to represent every member in the set. B1 is the “set of set”.
A2: Retrieve the record from A1 according to the serial number stored in B2, and get the customer field of the record.
A3: Compute the intersection of sets.

Comparison
The SQL set-lization is incomplete and can only be used to represent the simple result set. Developers cannot use SQL to represent the concept of “set of set”. Only the queries of 3-level-nested-loops are available to barely perform the similar computations. In addition, SQL cannot be used to perform the intersection operation easily that developers with advanced techniques can only resort to the unreadable statements to perform the similar operations, such as “count of years equal to the count of clients”. It equals to compute the intersection of client sets.
The set is the base of massive data. esProc can achieve set-lization completely, represent the set, member, and other related generic or object reference conveniently, and perform the set operations easily, such as intersection, complement, and union.

When analyzing the set-related data, esProc can greatly reduce the computation complexity. By taking the advantage of set, esProc can solve many problems agilely and easily that are hard to solve with SQL.

III Ordered Set
Case Description
Suppose that a telecommunication equipment manufacturer needs to compute the monthly link relative ratio of sales value (i.e. the increase percent of sales value of each month compared with that of the previous month). The sales data is stored in the sales table with the main fields including salesMonth, and salesAmount.


 SQL solution
select salesAmount, salesMonth,
        (case when
prev_price !=0 then ((salesAmount)/prev_price)-1
else 0
end) compValue
from (select salesMonth, salesAmount,
lag(salesAmount,1,0) over(order by salesMonth) prev_price
from sales) t

The popular SQL-92 has not introduced the concept of serial number, which adds many difficulties to the computation. Considering this, the designer of SQL-2003 has partly remedied this drawback. For example, the window function lag() is used to retrieve the next record in this example.
In addition, in the above statement, the “case when” statement is used to avoid the error of division by zero on the first record.

 esProc Solution
sales.derive(salesAmount / salesAmount [-1]-1: compValue)

The derive() is an esProc function to insert the newly computed column to the existing data. The new column is compValue by name, and the algorithm is “(Sales value of this month/Sales value of previous month)-1”. The “[n]” is used to indicate the relative position, and so [-1] is to represent the data of the previous month.

On the other hand, for the data of the first record, the additional procedure for division by zero is not required in esProc.

Comparison
From the above example, even if using SQL-2003, the solution to such problem is lengthy and complex, while the esProc solution is simple and clear owing to its support for the ordered set.
Moreover, SQL-2003 only provides the extremely limited computation capability. For example, esProc user can simply use the ”{startPosition,endPosition}” to represent the seeking of a range, and simply use ”(-1)” to represent the seeking of the last record. Regarding the similar functionality, it will be much harder for SQL user to implement.
In the practical data analysis, a great many of complex computations are related to the order of data. SQL users are unable to handle such type of computations as easily as esProc users because SQL lacks of the concept of Being Ordered.

IV Object Reference
An insurance enterprise has the below analysis demands: to pick out the annual outstanding employees (Employee of the Year) whose Department Manager has been awarded with the President Honor. The data are distributed in two tables: department table (main fields are deptName, and manager), and employee table (main fields are empName, empHonor, and empDept).

empHonor has three types of values: null value; ”president's award”, PA for short; and ”employee of the year”, EOY for short. There are 2 groups of correspondence relations: empDept and deptName, and Manager and empName.

 SQL solution
SELECT A.* 
FROM employee A,department B,employee C 
WHERE A.empDept=B.deptName AND B.manager=C.empName AND A.empHonor=EOY AND C.empHornor=PA

SQL users can use the nested query or associated query to solve such kind of problems. In this case, we choose the association query that is both concise and clear. The association statement behind the “where” has established the one-to-many relation between deptName and empDept, and the one-to-one relation between manager and empName.

 esProc Solution
employee.select(empHonor:"EOY",empDept.manager.empHornor:"PA")
esProc solution is intuitive: select the employee of “EOY” whose Department Manager has be awarded with “PA”.

Comparison
The SQL statement to solve such kind of question is lengthy and not intuitive. In fact, the complete association query language is “inner join…on…” style. This statement is simplified in the above example. Otherwise it will be much hard to understand.

esProc users can use ”.” for object reference. Such style is intuitive and easy to understand. The complex and lengthy association statement for multiple tables can thus be converted to the simple object access, which is unachievable for SQL. When there are more and more tables, the complexity of SQL association query will rise in geometric series. By comparison, the esProc user can always access the data intuitively and easily by taking the advantage of object reference.

Regarding the multi-table associations of complex computation, esProc can handle it more intuitively and conveniently than SQL.

From the comparison of the above four examples, we can see that esProc is not only characterized with step-by-step computation, complete set-lization, sorted sets, and object reference. The analysis style is intuitive, the syntax style is agile, and the function is powerful. esProc is a tool especially designed for mass data computation, and a more powerful database computing script.

September 17, 2013

Data Source Preparation Tool Especially for Report Developers

Many report developers may have the experience in presenting the KPIs in a report for those outstanding sales man whose sales has been rising over 10% for consecutive 3 months. The procedure of finding the outstanding salesmen is actually preparing the data source.

Preparing the data source is the key and the tough part of developing a report.

There are multiple ways to prepare the data source. SQL or SP can be used to handle the normal data computation of a single database; R language for the complex data computation; ETL or data warehouse for cross-database computation, by arranging all to a same database and then compute. For the structural data from non-database files or sheets, the senior programing languages can be used to generate result sets, for example, retrieving the data from Text file with Java class. However, these methods all have some drawbacks as discussed below.
        
Let’s start the discussion with SQL/SP. First, SQL/SP alone can only work on a single database because various cumbersome workarounds are unavoidable for multi-database computation. What’s worse, second, SQL statement is hard to debug. That situation is even worse for the long SQL statement since a more complex computational goal will inevitably give rise to more steps and a longer statement. It is the real nightmare for preparing data source. Third, the inability of SQL statement to run step by step has great impact on the maintenance and re-use. The SQL statement can only run as a whole, and all computational logics must be crammed into a single statement. It is impossible to split one SQL statement into several examinable computational procedures for users to check out the result at each step intuitively. Forth, SQL lacks the explicit sets and the direct support for the ordered computation which are common in the complex computation. So, SQL/SP usually costs several-fold more time and effort than other tools do in the related computation.
        
R language is quite good at handling the complex data computing, isn’t it a better choice? No. The truth is that R language has not incorporated with a perfect IDE. It is very inconvenient for users to compose and edit the computational scripts, not to mention its poor debugging. The report developers are not the professional coder, so their productivity will suffer if working in such IDE. More importantly, R does not provide JDBC or any output interface for the direct use by reporting tools. In order to use the R in reports, users have to implement a report interface program additionally to process the data and receive the parameters. Too much trouble.

For ETL or data warehouse, first, it usually incurs a great expense on human resources, equipment, maintenance, and training. Second, report developers will have to grasp ETL scripts like PHP, Perl, VBScript, and JavaScrip, and design the massive update algorithms. Considering these troubles, 99% report developers will surely get a headache.

The real trouble for Java and other senior languages is that users will have to implement all the details by themselves: open the Excel file, build a record, generate a List, retrieve with loops, seek the maximum value, group, compute the average, filter the data, sort, and then seek the top N – the greatest flexibility seems to be obtained at the cost of the greatest workload.
        
In view of all these discussions above, that would have been good news for report developers if there is a data source computation tool specially built for the report, with all advantages of the above methods, and free from their disadvantages.

esProc is such a tool. On the one hand, it is as capable as SQL or SP regarding its professional database computational capability; on the other hand, it offers the convenient debugging and enables the step-by-step computation. Compared with R, esProc also supports the ordered computations and the explicit sets for solving the complex computation problem, while still offering a more user-friendly IDE interface and JDBC output interface to ensure the usability. esProc is as capable as ETL/data warehouse on performing the cross-database computation, but more cost-effectively owing to its low TCO and efficient deployment and usage. esProc allows for the direct data retrieval from Excel and Txt file as Java does, and more superior to Java in the respect of handling the massive structural data directly.

In conclusion, esProc is the ideal tool specially designed to prepare the data source of reports.

September 10, 2013

Data Source Computation in Advance to Simplify Report Development


According to research, most complex report development work can be simplified by performing the data source computation in advance. For example, find out the clients who bought all products in the given list, and then present the details of these clients.

In developing such reports, it is the “computation” part and not the “presentation” part that brings about major difficulties. In which stage will the computation be most cost-effective? Shall the computation be set in the data retrieval scripting or the post-retrieval report presentation?

The report developers as usual are more willing to compute in the report straightforwardly after retrieving data with SQL or Wizard. On the one hand, it is because most report tools are capable of some step-by-step simple computations by themselves, while SQL only allows for incorporating all logics in one statement and is impossible to be decomposed into several examinable components; on the other hand, most report developers are more familiar with the report functions than that of SQL/SP, and the SQL/SP scripts are more difficult to understand.
        
However, the report alone cannot give the satisfactory result. Many report developers find the computational goal is hard to achieve in the report. They will ultimately be hard-pressed to learn the SQL/SP, or request the assistance from the database administer. Why?

The root cause is that the report is mainly developed to present but not to compute. The computation is a non-core feature of a report designed to solve the commonest and easiest problem. Achieving the truly complex computational goal will still depend on the professional scripts for computing like SQL. So, only computing the data source in advance can simplify and streamline the developing procedure of such reports.
        
Stuck in a dilemma? On the one hand, the report can only provide the limited data computing capability; on the other hand, SQL/SP is hard to comprehend and the computational procedure is neither intuitive, nor step-by-step. This is such a headache for most report developers.

esProc can solve the dilemma. It is a professional development tool for report data source, offering the expected computational capability and the user-friendly grid style. In addition, it enables the step-by-step computation to present the result at each step more clearly than report. Compared with SQL, esProc is easier for report developers to learn and understand. They can use it to solve the complex computation more easily and independently, including the computation of the above case.

esProc scripts: 








Like SQL, esProc supports the external parameters. The report can reference the esProc directly through the JDBC interface.
        
In addition, esProc is built with the perfect debugging function, and is also capable of retrieve and operating on the data from multiple databases, text files, and Excel sheets to implement the cross-database computation. esProc is the good assistant to reporting tools and the expert in report data source computation.


September 3, 2013

Powerful and Easy-to-use Data Computation Tool


esProc is the development tool for database computation. esProc IDE is simple and easy-to-use, providing an higher development efficiency than SQL does. It is especially ideal for reaching a complex computational goal, or acting as a data source computation tool of reports, or a data computational layer of applications.

Characteristics: Tailored for Database Computation

1. Its basic data type supports the structured data

The TSeq is the commonest basic data type introduced in esProc  As the result set of esProc, TSeq is the set of structural records, which is same as that of SQL; However, TSeq is more superior than the result set of SQL in many respects owing to its support for the access via sequence number, generic data, and explicit set. The TSeq can be used for common computations on structural data, and more suitable for simplifying the complex database computation.

2. Its syntax structure is especially tailored for the database computation.

The syntax of esProc is agile and efficient, especially designed for the database computation. For example:

Basic filtering: A1.select(conditions)
Dichotomic filtering: A1.select@b(conditions)

Dichotomic filtering and filtering out the complementary set: A1.select@bx(conditions)

Getting the amount field for the last record: A1.m(-1).(amount)

The intersection of 2 result sets: A1^B1

The cyclical function of esProc can simplify and streamline the complex SQL, for example:

Add the computational column LRR, and compute the yearly link relative ratio of amount: A1.derive(amount/amount[-1]-1: LRR)

Compute the moving average of recent 3 days, and assign it to column ma:A1.run (ma= ~{-1,1}.( amount).avg())

3. Fascinating functions for database
In esProc IDE, users can compose SQL statement directly, and take advantages of the database structure browser, SQL wizard, and other facilitating functions. The figure is shown below:





4. Retrieve and modify the structural data directly

The structural data is mainly stored in the database, and partly from the Excel and Txt.  For these three types of data, esProc supports the direct retrieval and modification.

For example, firstly, retrieve the order information from the database and client details from Txt file. Secondly, by computing, find out the clients who have bought the products of all kinds. Lastly, write the result into the Excel file.

Characteristics: Easy-to-use Development Environment

1. Cellset of grid style

The script of esProc is stored in the grid-like cellset, which saves the effort of formatting. The scripts are presented in a clear and readable way by nature. For example, scripts will auto-aligned to the ruler. When composing the judgment statement, loop statement, and other long statements, users can be benefited by indentation in indicating the work scope of computation straightforwardly. The figure is shown below:













2. Step-by-step computation

The step-by-step refers to decomposing the computation goal into several simple steps. This is the most effective method and the most important metric on the ability to solve the complex computation. Because SQL lacks the ability to support the step-by-step computation, it is quite difficult to achieve the complex computation goal. With the grid style, esProc can achieve the step-by-step computation and result reuse easily.











As shown above, the data filtering for B5 cell has completed easily. Click B5 to view the result of this cell on the right directly. In other cells, B6 for example, the result of B5 can be referenced directly through the cell name. No need to define the variables specially. B6 has completed the grouping and summarizing on B5, and B7 can continue to work on B6. In this way, the interactive computation can be carried on constantly. Each step only requires completing one simple computation to get the solution to the complex computational goal.

3. Debug functions
esProc provides the perfect debugging functions, with support for the breakpoint, single step, and run to cursor functions, as shown in the below figure:










Many long SQL/SP statements can only be composed and comprehended as a whole. The internal working details cannot be monitored. So, their debugging functions are not practical. For example, for the grouping and the summarizing, esProc can implement them in 2 steps, while SQL cannot.

4. Instant computation mode

esProc supports the instant computation mode of auto-computing after each step of scripting. The result will auto-appear in the result column on the right side. With the instant computation mode, users can script and monitor the result at the same time. The algorithm for the next step can thus be decided on the basis of the characteristics of result. The computational procedure will be more focused, and scripting will be more smooth and natural. Neither R language, SQL, or other computational utilities have such ability.

5. JDBC output

Since esProc supports JDBC interface, other tools (reporting tools for example) or Java language can retrieve the result from esProc through JDBC interface. Judging from the code reuse and maintenance perspectives, esProc can be taken as the data computational layer of low coupling in the application. Regarding the performance, esProc can off-load the complex data computing from database, so that the pressure on database server is relieved greatly.

6. Big data computation

esProc has implemented the hadoop interface, is capable of being called by mapreduce, allows for the easy retrieval and writing to the HDFS, and supports the big data computing directly. esProc is especially fit for the complex data computation, which makes it superior than other big data computation tools.

7. Plotting arbitrarily
esProc supports the graphic plotting functions and the graphic parameter editor. Not only the common statistical charts can be generated directly, but also the underlying chart element control functions are made available to esProc users. They are thus enabled to plot the personalized charts of any kinds arbitrarily.