January 31, 2013

Who Unchained the Django of the Business Computing Field?

Computing in business activities involves enterprise reporting (Reporting), business data integration and cleansing (Data Integration and ETL), OLAP (Online Analytical Processing), ERP (Enterprise Resource Planning), CRM (Customer Relationship Management), SCM (Supply Chain Management), and DSS (Decision Support System). Unlike scientific computing, business computing has its own characteristics:

Focus on structured data in the business environment

Computing implemented from a business perspective by business specialists

Capable of solving complex and diverse business problems

Able to respond rapidly to dynamic business demands

Both R and esProc are capable of meeting the above business computing demands. They have much in common: both are optimized for structured data, and both provide complete interactive computation for handling business problems. However, they differ in their respective features. R is characterized by its open-source nature and massive function libraries; esProc features an intuitive style and a syntax that is easy to learn and use. Judging from these features, R is more tailored for technical experts, while esProc is more business-specialist-oriented.

The discussion below examines who unchained the Django of the business computing field.

Environment for use
Both R and esProc are download-and-install desktop software that do not require deploying dedicated servers. Both support all mainstream business databases as well as direct retrieval from Excel and TXT files.

The esProc IDE only supports Windows, but through its JDBC interface esProc also supports Linux, Mac OS X, BSD, Unix, and other operating systems, for example when acting as the data source for a report or a Java application. As for R, both its IDE and its back-end interface support all of the above platforms.

esProc provides a graphical configuration interface for database connections, while R implements connections entirely in code. From this point of view, esProc is the better choice for business personnel with a relatively weak technical background. However, with the help of third-party software, R can also offer graphical configuration, so the difference is not great.
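
For instance, a minimal sketch of connecting R to a business database purely in code might look like the following, using the RODBC package (the DSN name, user, and password are hypothetical placeholders):

library(RODBC)                                                     # ODBC access from R
conn <- odbcConnect("SalesDSN", uid = "analyst", pwd = "secret")   # hypothetical DSN and credentials
result <- sqlQuery(conn, "select * from Objects")                  # run a query over the connection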

Structured data computing
As the main format of business data, structured data is usually stored and managed centrally in a database, with some of it scattered in Excel and text files. Both R and esProc provide good support for basic structured data computing, for example data retrieval, sorting, conditional queries, getting unique values, and fuzzy search. Consider the two examples below:

Retrieve data
R: result <- sqlQuery(conn, "select * from Objects")
esProc: $select * from Objects

Sort
R: A1[order(A1$CustomerID, -A1$salesValue),]
esProc: =A1.sort(CustomerID, salesValue:-1)

The main difference between them in structured data computing lies in syntax style. esProc syntax is easier to understand, so business personnel can grasp it rapidly; for example, to group and summarize by department:

=A1.group(DepName; DepName, ~.count(), ~.sum(Salary))

By comparison, R has a richer academic flavor, as shown in the equivalent computation:


A6 <- aggregate(A1$Salary, list(A1$DepName), sum)
A6$count <- tapply(A1$Salary, A1$DepName, length)


This point is especially prominent in associative (join) queries:
esProc: =join@1(A1:CustomerID:Orders, B1:CustomerID:Customers)
R: merge(A1,B1,by.x="CustomerID",by.y="CustomerID",all.x=TRUE)

Interactive computing
Interactive computing is a procedure in which analysts monitor the computed result of the previous step and decide the next computing action on that basis. In this procedure, the results of previous steps can be reused conveniently. Interactive computing is a core capability for solving complex problems: it decomposes one complex and obscure computational goal into several simple and clear steps, and by solving these simpler problems one by one, the complex goal is ultimately achieved.

This advantage is particularly impressive in business computing, where obscure and complex problems are common, for example: "What is the reason for the rising customer complaints in recent days?" or "Analyze the characteristics of product sales this year." They contrast with simple and clear questions such as "How many complaints did customers make this month?" or "Which 10 states have the highest sales volume?". The latter kind allows users to design a clear algorithm, get a concise answer, and respond all at once, so we call it "all-at-once". The former kind requires users to make an assumption first, and then verify the assumption to set the direction of the next computing step, so we call it "step-by-step".

Both R and esProc have outstanding interactive computing capability. To illustrate with an example, suppose a large volume of order data is to be retrieved, filtered on conditions, and summarized by group.

R in Interactive Computing
Retrieve data and filter by time interval.

In R, a computed result can only be viewed in the panel on the right after it has been given a name; among these results, fTimeData is the computed result of this step. To view the variable contents, click the variable name, as shown in the figure below:

Or enter a command in the console to inspect fTimeData, as shown in the figure below:

Go ahead and group fTimeData by name and month; the grouped data is named gNameMonth. The field names produced by grouping do not make much sense in business terms, so in this step the fields are also renamed, as shown in the figure below:

The boxes and lines in the chart above clearly illustrate the reference relations between steps. Lastly, click the variable name gNameMonth on the right to view the computed result:
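
Put together, the R steps above might look roughly like the following sketch; the table name orders, the connection conn, and the columns orderDate, name, and amount are hypothetical, and orderDate is assumed to be retrieved as a Date:

orders <- sqlQuery(conn, "select * from orders")                    # retrieve the order data
fTimeData <- subset(orders, orderDate >= as.Date("2012-07-01") &
                            orderDate <= as.Date("2012-12-31"))     # filter by time interval
gNameMonth <- aggregate(fTimeData$amount,
                        list(fTimeData$name, months(fTimeData$orderDate)),
                        sum)                                        # group by name and month
names(gNameMonth) <- c("name", "month", "salesValue")               # rename fields for readability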

esProc in Interactive Computing
Retrieve data and filter it by time interval.


The computational procedure of esProc is written and presented in a grid-like cellset, which means the result can be viewed directly by simply clicking cell A3:
Proceed with grouping and summarizing on A3. The computational procedure is written in A4, as shown in the figure below:


The boxes and lines in the chart above illustrate how cell A4 references the computed result of A3. esProc allows direct reference to the result of a previous computation by cell name; for a lengthy complex computation or a frequently used intermediate result, esProc users can also define a name that makes sense in business.

Regarding interactive computation, esProc is easier to use and better fits the business specialist. esProc computation has a grid-style, two-dimensional flavor: the code layout is clear and aligns naturally, references between cells are more intuitive, and computed results are easier to view. In addition, esProc provides an instant computation mode in which each line of code runs automatically as soon as it is entered, and the computed result is displayed automatically. Unlike R, whose debugging requires manual coding, esProc has real debugging functions; for example, users can set breakpoints directly on cells. For the same task, esProc scripts are more concise and readable, and easier for business specialists to learn and grasp.

Syntax features
The leading role in business computing is played by the business specialist. The syntax of esProc is tailored for the business specialist, while R is tailored for technical experts such as mathematicians and statisticians.

In terms of structured data, esProc takes the record as its unit; one record can be, for example, a piece of product information or employee information, which is a quite common format for business data. R takes the column as its unit (known as a vector). Its working principle is similar to holding all data of one field of a database table, and thus it runs much faster than record-style data when computing. In esProc, multiple records form a table (known as a TSeq), while in R multiple rows form a table (known as a data frame). Both can take part in computation as tables for sorting, grouping, and summarizing. However, esProc is more intuitive and easier to understand than R. For example, from a business table, filter out the records whose previous record has a value below 5000.

esProc: business.select(value[-1] < 5000)
R: subset(business, c(0, value[-length(value)]) < 5000)
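
A small runnable version of the R line above, with a hypothetical business frame, also shows how R builds its table from column vectors; the leading 0 stands in for the missing previous value of the first record:

business <- data.frame(item  = c("A", "B", "C", "D"),
                       value = c(6000, 4500, 7000, 3000))
subset(business, c(0, value[-length(value)]) < 5000)   # keeps rows whose previous value is below 5000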

esProc syntax is featured by its agility: with flexible syntax, esProc users only need to grasp the usage of a few functions to express a great number of computations. R is featured by its abundance of functions: a large number of functions correspond to the various computations, and by memorizing a great many function names and parameters, R users can write code rapidly. For example, R requires users to use different functions such as apply, tapply, sapply, lapply, and by to perform grouping in different scenarios, while esProc users can use the single group function to express the various grouping computations uniformly, as illustrated below.
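
Here is a hedged illustration of that variety, using a hypothetical employee frame; in esProc, by contrast, all of these would be expressed with the single group function:

emp <- data.frame(DepName = c("Sales", "Sales", "R&D"),
                  Salary  = c(5000, 6000, 7000))
tapply(emp$Salary, emp$DepName, sum)               # totals per department, as a named vector
by(emp, emp$DepName, function(d) mean(d$Salary))   # means per department, via sub-frames
aggregate(emp$Salary, list(emp$DepName), length)   # counts per department, as a data frame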

For another example, to compute the moving average over the most recent 3 days, esProc users can just use the avg function, while R users have to use filter to compute the average. From a business perspective this is hard to understand, and only a very skilled technical expert can get it right, as shown below:

R: filter(data$value / 3, rep(1, 3), sides = 1)
esProc: data.(~{-1,1}.(value).avg())
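
A runnable sketch of the R call above, with hypothetical sample data: stats::filter sums each value with the two preceding ones (sides = 1) after the values are divided by 3, which yields the 3-period moving average.

data <- data.frame(value = c(10, 20, 30, 40, 50))
stats::filter(data$value / 3, rep(1, 3), sides = 1)   # NA NA 20 30 40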

esProc is especially optimized for solving typical business problems, such as link relative ratio comparison, year-over-year comparison, ranking, growth rate, and cumulative value. R, designed primarily for the technical expert, pays less attention to the characteristics of business computing. Therefore, esProc is more intuitive and understandable than R. For example, compute the link relative ratio of sales, that is, each period's growth over the previous period:

R: c(0, (result$value[-1] - result$value[-length(result$value)]) / result$value[-length(result$value)])

esProc: result.((value-value[-1])/value[-1])

To sum up, both R and esProc are excellent business computing software. R handles more diverse deployment environments, and its abundant library functions and extensive third-party support are ideal for technical experts. esProc is a better fit for business specialists, considering its business data support, interactive computing capability, and syntax style.

Related Articles:

Beijing Spirit Leads Enterprises to Continuous Progress

Business Intelligence Suppliers: Are You Ready for 2013?
2012 End of the World: Is This Prediction Based on Correct Analysis?

January 23, 2013

What's an Excellent Professional Reporting Software - Part I

A report is a tool that uses tables and statistical charts to present data. It is a basic need for developing business, a basic means of managing an enterprise, and a basic advantage for enhancing competitiveness. Reporting software is one of the most common tools for business personnel. So what kind of professional reporting software is ideal for business personnel who lack technical experience, so that they can build reports by themselves?

For example: Bill is the sales director of a pharmaceutical products company. He urgently needs to prepare a sales report for specific products in order to impress clients in an important bidding event. The report needs to show the monthly sales of 3 kinds of products, the growth compared with the previous month, and the comparison with the same period of the previous year. However, he could not find an existing report at hand, and the short lead time did not allow for requesting assistance from IT personnel. Therefore, the report had to be handled by himself and his sales force. Similar situations include:

A retail business is newly equipped with a supply chain management system. Besides the reports shipped with the system, some temporary reports may be required according to volatile market changes. The cost would obviously be quite high if a dedicated technical team were built to handle them, so this job should be left to the business personnel.

A bank improves a business workflow and needs to prepare the corresponding report. For data safety reasons, the business department needs to prepare the report all by itself.

A medical products company needs to prepare a report about clinical practice. Preparing this report involves many concepts that may easily confuse average technical personnel, so having the business personnel prepare the report reduces errors.

As a professional reporting tool, it should first of all support rapid report building:

Take esCalc for example. It has a grid-style report design interface that is easy to learn and understand and capable of building reports rapidly.

The whole interface is shown below:


The data zone is similar to an Excel spreadsheet, whose cells are identified by a letter and a number, such as A1. In a cell, you can reference other cells by cell name for calculation, as shown in the figure below:


The toolbar can set style elements for many cells at once, such as font, size, alignment, borders, and merge & split, and offers a format painter to copy and paste styles, as shown in the figure below:

On the right there is a cell property panel for setting cell properties. For example, enter a formula once to make every other row change color, as shown in the figure below:


Right-click on a cell to show the context menu and perform grouping, querying, locating, and sorting operations, as shown in the figure below:


The grid-style design interface is simple and easy to use, saving the effort of cumbersome data alignment; referencing data by cell name avoids complicated variable definitions; and with the multi-level grid, data hierarchy and summarization can be implemented more easily. In the grid, reporting becomes more intuitive, and the result is presented consistently and without distortion through design, preview, print, and export.
 
Related News from Raqsoft:

Raqsoft Organizes Training to Better Serve Customers
Instant Computing of esProc Brings Flexibility to Analysts
What Makes Self-service Statistical Computing Tools So Important?

January 15, 2013

Seven Drawbacks of Traditional OLAP



In 1993, E.F. Codd, the acknowledged founder of the relational database, introduced the term Online Analytical Processing (OLAP). OLAP is intended for non-data-processing professionals such as business experts, and is expected to be intuitive, rapid, and flexible:

  Intuitive user interface: clears the technical roadblocks so that business personnel can operate freely
  Rapid analysis procedure: accommodates the fast-changing business environment so users can seize opportunities and make decisions
  Flexible computing ability: able to handle many complex business computations

When traditional OLAP products were first introduced, they enjoyed double-digit growth rates in the global market for years, before clients started to complain that projects failed easily, showed no obvious effect, and took too long. Since 2004 the growth of OLAP has dropped dramatically, owing to the seven major drawbacks of traditional OLAP tools (what are the remedies for OLAP's drawbacks?):


1. Pre-modeling as a must
Regarding business data, traditional OLAP tools do not allow immediate analysis without pre-modeling. Lacking a good OLAP engine, these tools cannot convert the data into a form on which business personnel can operate directly.

Assume that you are asked to analyze the profit of a telecommunications enterprise. First, you need to draft a star or snowflake model involving the date, region, client gender, client occupation, credit rating, and other dimensions. Then, adapt the model to the actual database through repeated modifications: for example, remove the "customer occupation" dimension if it is not covered in the business data; add the overlooked "hierarchy of consumption" dimension to the analytical model; and use an ETL tool to synchronize the "sales network" data from another database into this one. The finalized, stable model is then filled with data at regularly scheduled times for use in analysis. Even so, you may find that the data does not follow the conventions or that historical data is missing, and then you have to establish an additional data warehouse or data mart. If key data required by the model cannot be found at all, you have to go back and modify the business system instead.

Modeling is a time-consuming procedure with a great many steps, so users have to pay an expensive cost up front.

2. Great dependence on IT
Although business personnel are the intended users of OLAP, they still have to work with IT pros, because traditional OLAP tools require a complex modeling procedure and a great deal of code, scripts, and SQL.

The jobs below cannot be completed without the assistance of technicians: propose a common analytical model; map the dimensions of the model to the names of fields, tables, and views in the business database; use an ETL tool or Perl scripts to migrate and integrate all required data into one database if they live in different databases; establish reasonable scheduling rules and fill the model with data by writing SQL statements or stored procedures; deploy the OLAP application to a server and write a .NET/Java application for users. Not to mention setting up a data warehouse or data mart, or modifying the business system to supply the missing key data, which business personnel are simply unable to handle without IT involvement.

Traditional OLAP tools depend heavily on the involvement of IT pros and consume great human resources. What is even worse, IT pros without business expertise cannot fully understand the analysis goal, so the model they build may not be trustworthy enough. OLAP projects built through months or years of effort often deviate from the requirements of the actual business, which gives rise to many project failures.

3. Poor computation capability
Data computing refers to the procedure of processing and transforming data through a series of specific steps toward a concrete goal, and it is a basic feature of OLAP. Traditional OLAP tools have insufficient computational capability and offer only a few computational methods, such as drilling, slicing, rotation, and simple column computations. This is because their architectures are old, lack innovation, and find it hard to strike a balance between user-friendliness and flexibility. Computational goals like the following, for example, are hard to implement with traditional OLAP tools (a sketch of the first one follows the list):

How to find the workshops whose defective rate has dropped for 3 months in a row?
How to count the students whose score on every course is above B?

In the first month after a new product enters the market, how many days does it take to reach 1/3 of that month's total sales?
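
As a rough illustration of why such goals call for real computation, the first question might be answered outside OLAP with a short R script; the quality frame and its columns are hypothetical, and "dropping for 3 months in a row" is interpreted here as 3 consecutive month-over-month declines:

quality <- data.frame(workshop   = rep(c("W1", "W2"), each = 4),
                      month      = rep(1:4, times = 2),
                      defectRate = c(0.05, 0.04, 0.03, 0.02,    # W1 falls three months running
                                     0.05, 0.06, 0.04, 0.05))   # W2 does not
dropped3 <- sapply(split(quality, quality$workshop), function(d) {
    d <- d[order(d$month), ]
    declines <- diff(d$defectRate) < 0          # TRUE where the rate fell versus the prior month
    runs <- rle(declines)
    any(runs$values & runs$lengths >= 3)        # any run of 3 or more consecutive declines
})
names(dropped3)[dropped3]                       # returns "W1"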

This lack of computational capability greatly impedes the flexibility of OLAP tools. Analysts are confined to a narrow area, unable to analyze freely, and even have to resort to third-party tools to perform such computations. In these common business computations, OLAP is often abandoned, which is an awkward situation.

4. Short of Interactive analysis ability
Data analysis is the most important feature of OLAP. Unlike data computing, data analysis is an interactive procedure requiring a sound step-by-step computational mechanism. Faced with an obscure goal, users need to observe the current data, make a reasonable assumption, and then verify or falsify the assumption in order to ultimately achieve a rather complicated business analysis goal. Unfortunately, the model of traditional OLAP tools is too old to provide such a free style of interactive analysis.

For example, why did sales improve greatly?
The possible assumptions include: orders flooded in, the sales force was beefed up in many respects, or some large orders were placed. Of these, the procedure for checking the large orders is given below (see the sketch after the steps):

1. List the order data, then filter it down to the most recent 6 months.
2. Compute the average sales amount per order.
3. Multiply the average by 300% as the criterion for a "large order".
4. Filter the data from step 1 with the criterion from step 3 to get the large orders.
5. View the result; lower the criterion if too few orders meet it, or raise it if too many orders remain after filtering.
6. Count the large orders in each month.
7. Compute the month-to-month increment of large orders.
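
As a hedged sketch, the steps above could be carried out in an ordinary scripting tool such as R; the orders data here is synthetic and the column names are hypothetical:

set.seed(1)
orders <- data.frame(
    orderDate = sample(seq(as.Date("2012-07-01"), as.Date("2012-12-31"), by = "day"),
                       200, replace = TRUE),
    amount    = round(rexp(200, 1/500)))                        # synthetic order amounts
recent    <- subset(orders, orderDate >= as.Date("2012-07-01")) # step 1: the recent 6 months
avgAmount <- mean(recent$amount)                                # step 2: average order amount
threshold <- avgAmount * 3                                      # step 3: the "large order" criterion
large     <- subset(recent, amount > threshold)                 # step 4: keep the large orders
# step 5: inspect `large` and adjust the threshold up or down if needed
perMonth  <- table(format(large$orderDate, "%Y-%m"))            # step 6: large orders per month
diff(as.vector(perMonth))                                       # step 7: month-to-month increment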

Considering the above result, users may still need to keep investigating the cause of the emerging large orders through further computations until a clear and reliable basis for decision-making is found; the cause may be, for instance, that the vocational training for salespersons has started to bear fruit.

Traditional OLAP tools suffer from limited interactive analysis ability. They cannot provide the flexible step-by-step computational capability described above, cannot solve complex business computing problems, and cannot provide a true basis for decision-making. So quite a few clients simply treat OLAP tools as an expensive and apparently single-purpose reporting tool for presentation; they try to get more leverage out of them, but find only limited uses for OLAP for the time being.

5. Slow in reacting
Traditional OLAP tools require pre-modeling and cooperation between people from various departments, so they are usually slow in reacting to business analysis demands.

Assume that a toy manufacturer needs the following information before Christmas: of the top 100 cities with the largest populations, which cities need the sales effort strengthened immediately? You then find that there is no city population data in the existing model. With traditional OLAP tools, the procedure goes as follows:

Business personnel download population data from the census bureau or Wikipedia, and then use Excel or similar tools to list the top 100 cities by population. This step takes no more than an hour. But for the next step they have to ask IT pros for help, because these data cannot be imported into OLAP for direct use.

First, coordination: the business personnel raise a request to their superior, who requests support from the R&D department and assigns a project leader to investigate the request, build the project schedule, and set the budget. The R&D department approves it and sets up a development team. Only if cost is not a priority can the development team be on standby for the business personnel at all times.

Then, implementation: perform a detailed requirement analysis and confirm whether the model needs modification, reschedule the data-loading task, and export data from Excel or the database for other models to use. There are also the steps of design, implementation, deployment, and verification, as well as communication with database administrators, web application administrators, and programmers. When it comes to final acceptance, further rework is often still required, because the IT pros may not be in the picture and may be on a different page from the business personnel. Such rework is unavoidable, and by the time it is done, Christmas may unfortunately already have passed.

Traditional OLAP is slow in reacting, requires a great workload, and takes too long to reach the goal. Facing fast-changing challenges, enterprises will miss commercial opportunities and find themselves in a disadvantaged position in intense competition.

6. Abstract model
Traditional OLAP tools convert two-dimensional data from databases and Excel into multi-dimensional cubes. To use the OLAP tools freely, business personnel must first correctly understand slicing, rotating, drilling, and other such concepts. The abstraction of the model hinders business personnel from analyzing freely.

For example, to business personnel, the employee data, regional data, and product data are simply 3 lists, or 3 sheets in Excel. Even the most ordinary business personnel know how to group, sort, and filter such data. However, things change once the data is converted to multi-dimensional form: business personnel have to say goodbye to data organized two-dimensionally, as flat as the computer screen, and imagine the data as a cube. Having never seen such data in their work or life before, they find it rather tough to operate on these abstract data.

The abstract model requires analysts to think three-dimensionally, which makes it difficult to understand. Users find it hard to translate business language into abstract multi-dimensional operations, so the goal of online analysis is hard to truly achieve.

7. Great potential risk
Traditional OLAP tools carry huge potential risks because of their lack of computational power and low interactive analysis ability, and because implementation relies on cooperation with IT pros, so the procedure and the cycle are rather long.

Analysts are often forced to abandon OLAP because its poor computational ability makes it fail on large volumes of data and makes it very difficult to provide valuable references for decision-makers. Lacking interactive analysis ability, it cannot solve complex business analysis problems or find the true cause and solution; these drawbacks give rise to faulty analysis conclusions and bring great direct loss to the enterprise. In business-oriented analysis, too much work relies on the IT pros; no wonder the procedure and cycle are lengthy and huge manpower and material resources are consumed, while users are still unable to react effectively and efficiently to the constantly changing business environment. Quite often the "star" business is overtaken and the "cash cow" business is captured by competitors. The analysis conclusion often deviates from the original goal of the analysis, and wrong decisions are easily made.

These potential risks can easily incur the failure of OLAP project, and bring about unrecoverable loss to the enterprise.

All in all, traditional OLAP tools do not implement Online Analytical Processing in its true sense. They are just "OLAP in its narrowest sense, a subset of OLAP": neither intuitive, nor rapid, nor flexible.