July 30, 2014

Loop Code for esProc

With esProc, the code for loop is mainly implemented with for statement. The for statement will repeat the code block with for as the main cell. There are different formats of the for statements, as listed below:

1.The for loop 
Unconditional loop. The values of the main cell are, in turn, the count for the current loop, which often needs to be exited by break command.
Unconditional loop or endless loop is the simplest structure, for example:


In the above example, A2 executes unconditional loop. According to the loop number in A2, B2 retrieves one employee record every time. The rest of codes save the employee information for Texas in B1. After the information for first 10 Texas employees were retrieved, the loop will jump out with break statement in C6. Alternatively, break C command can also be used to replace break statement to jump out from the loop body with cell C as the main cell.
After execution, B1 contains the following results:
 
Note that when using the unconditional loop, we need to make sure that the break statement can be executed properly, otherwise theprogram will never end at the time of execution.

2.for A
Loop the member of the sequence A, the main cell value will in turn be the current member of A.

In esProc, the most commonly used loop is to loop on every member in a sequence. For example:
 
In this example, all Texas employees are selected in A2. The loop is executed in A3 for each Texas employee to compute their age, and then store the oldest age in B1. When the loop is complete, we can get the maximum age for all Texas employees from B1.
For relatively simple loop statement, sometimes we can use the loop function for sequence to achieve the same result. For example:
 

3.for n
Loop n times;with the main cell value as the current loop count in turn.

In addition to loop on all members in a sequence, loop for specific times is also very common.
For example, suppose we have a piece of paper which is 1000mm * 800mm in size and 0.1mm in thickness. If we fold it ten times, what will be the final length, width and thickness?
 
In the above example, the folding is looped ten times. After each folding the length will be reduced by half, and the thickness will be doubled. If after the folding the width of the paper is greater than the length, the length and width will be switched. After execution, A1, B1 and C1 contains the length, width and thickness of the paper as following:
   
The for n statement can also be seen as a simple form of for to (1, n) . This means "loop from 1 to n".

4.for a, b, s
Loop from a to b, at the step of s. This means "to loop on each member of array to(a, b).step (s)". If s is absent the default value will be 1.
Sometimes the loop does not start from 1, and the step value is not 1. We can then use for a, b, s loop. For example:
 
For any n digits number, if the total ofthe n-th power of each digit equals its own, the number can be called daffodils number. The following example is used to identify all three digits daffodils number:
 
In the loop codesin A2, the loop is excecuted on all 3 digits numbers. After the execution of program, we can see the result in A1:
 

5.for x
Loop when x is true. The value in main cell is the computed value for x.

By specifying the loop condition, we can control when to end the loop. For example, the issue in section 1 can also be solved with for statementwith specific loop conditions:


In the main cell of for loop, the cell value is either true or false, which is the result we got after computing the loop condition every time. At this point we can not use the value from main cell as the count of loop. Therefore to retrieve employee records in B2, we use #C to get the current number of loops. In #C, C is the main cell of loop. After the code is executed, the result is the same as in section 1.

6.for cs,n:x
Loop cursor, and retrieve n records each time, or retrieve records until x has changed. Return table sequence. Close after the reading.
When retrieve the data with cursor, we can use for statement to loop the data in cursor.


7.Nested loop
In the body of the loop, we can also use loop statement. This is what we call nested loops, or multi-layer loops. For example:
 
In the example, A1 stores the prime number sequencewe found; A2 loops the integer from 2 to 1000. The prime number sequence is looped in B2. If the number in A2 is divisible by certain prime number, it will be replaced by the next number in A2 with the command of next A2. If the number in A2 is not divisible by all integer number in current prime number list, we then find a new prime number. It will be added to the sequence in A1. Among them, the next C command in D2 specifies to skip the rest part of the loop body and enter the next round of loop. After execution, the sequence for prime numbers within 1000 are as following in A1:


When using the next statement, we don't have to specify the main cell C, to skip the deepest level of the loop where the next statement is located.
As with the following one hundred chicken puzzle: each rooster is worth of 5 dollars, and each hen 3 dollars. Three chicks is worth of one dollar. If we can buy 100 checken with 100 dollars, how many roosters, hens and chicks do we buy actually?We can resolve the issue in the following way:
 
In the codes, A2 loops on the total number of rooster. We need to judge in B2: if the total price of the rooster is more than 100, there must be too many roosters. We can use break command in C2 to end the loop. In B3 we continue to loop with the possible number of hens. If the current total price of chicken is more than 100 in D4,we know that there are too many hens. We can use next A2 to add the number of roosters and try again. increasing the total number of cocks and try again. In D5, the current total price of chicken is still less than 100. We can continue to increase the total number of hens and try. The next statement here can skip from the deepest level of loop, which has B3 as the main cell.

July 29, 2014

Debugging Function Comparison RLanguage v.s. esProc

As is well known, in the development process of program, the time consumed to remove and correct the error is usually greater than time spent in coding. Therefore, a friendly debug environment can save a lot of time. In this respect, VB.NET and SQL are two extremes that the former provides almost a perfect Debug environment, while the latter nearly provides no error debugging tool.
R language and esProc as two development tools for computation and analysis are both capable to debug to some degree. We will study on their differences in this respect.

Let’s kick off by making ourselves familiar with the debugging environments of both R (take R Studio for example) and esProc from their respective interfaces:

R Studio Debugging Environment:
 
esProc's Debugging Environment

 

Let's compare the basic functions. 
Break point: For R, the break point is set by inserting browser () into the codes. Users have to remove these statements manually once debugged, which seems to be  back to the cherished old times of using BASIC to code when Windows was not invented, impressed us with a strong feeling of reminiscence. In those days, removing the stop break point statement is even an important job before releasing codes. By comparison, the break point style of esProc is similar to that of VB.NET and other alike modern programming languages. By clicking the button or pressing shortcut keys, the break point can be set to the cell in which the mouse cursor is located. This is nothing special.

Debug command: with the same style of break point, debug command of R is input from the console, including to resume running,n to go run the next statement, and Q to exit the debug mode. In addition, there are also functions like trace/set Break point/debug/undebug/stop.It is important to note that it would be best not to have any variable named after c, n, and Q in the codes. Otherwise, accidental conflicts will occur.

Regarding the procedure control, esProc is no different than VB.net and like programming languages for just requires click(s) on button or shortcut keys to implement, not requiring users to memorize any command, as we all know. 

Variable watch: The variable watch window of R language is on the right, in which all current variables will be listed. On clicking it, a new window will prompt to display the value of this variable. Alternatively, R language users can also enter the fix(variable name) at the command line window as shown below. In the right bottom corner of esProc user interface, there is a similar variable list. Seldom do esProc users use this list because esProc does not require users to specially define the variable name. The name of cell is taken as the variable name by default, and thus users can simply click the cell to review the variable value. 

One thing to notice is that R is friendly to display the variables of Frame format. However, it is comparatively not so friendly to support the irregularly-structured variables that we can say it is unreadable at all, as the below typical List for example:
 

esProc does a much better job in this respect. For the same data, in esProc, it is represented by drilling through the hyperlinks:
 

Then, let’s compare some more advanced functions, and start from checking the Immediate Running first.

As for esProc, a cell will be calculated immediately and automatically once codes are entered into this cell. Therefore, the developers can view the execution result immediately and adjust the code for re-run on conditions. This style can speed up the development speed and lower the probability of errors, allowing the green hand to become familiar with it quickly. R Studio provides the similar means that more resembles the "immediate window" of VB, that is,user's type in codes at a command line window and run immediately. If it is run correctly, then copy the codes to the formal code section. Judging on the whole, R is less convenient than esProc in this respect.

Finally, let us discuss the function to debug the functions separately.

R users can use the debug(Function Name) to debug the functions separately and directly so as to modularize in development and implement the large-scale test. esProc users,on the contrary, are not allowed to debug the function separately, which is a pity more or less. However, the debug function of R language has not implemented the true “separate” test. Its working principle is actually to add a browser () command prior to the function to be debugged, still requiring running all codes before entering the function to debug. 

From another perspective, such computational analysis software is rarely used for the large-scale development and test. There is not much significance and value for its ability to debug function separately. 

Through the above comparison, we can see that both R and esProc provide some debugging functions. In which,esProc is performing better in terms of convenience and usability.

Set Operators in esProc

Since sets are commonly used in esProc, the latter provides comprehensive set operations.

1. Binary Operation on Sets

The most basic set in esProc is sequence. Let’s look at some basic binary operations between two sequences A and B.

  • A|B

Concatenate: Concatenate the two sequence straightforwardly, of which the member of B is added behind the members of A. If A or B or both A and B are the single-value member instead of sequence, then it will be handled as a single member sequence.

  • A&B

Union: Join the two sets and remove the members of B that already exist in A. If one of A and B or both vales are the single-value member instead of sequence, then you can take it as single member sequence to process.

  • A^B

Intersection: Intersection set of A and B. Get a sequence composed of members that not only exist in A but also exist in B.

  • A\B

Complement: Members in A but not in B. If B is not a sequence, then treat it as the single member sequence.

  • k*A

A|A|…|A, copy A for k times, positions of k and A are interchangeable in the expression.



These basic operations - concatenate, complement, intersection and union - can also be used to deal with more complex set operations. Such as, to seek the "inverse intersection" of sets A and B, i.e., the set composed of all that is not the common members of A and B.
 

The "inverse intersection" of two sets is the complement of their union and intersection. With this kind of transformation, more complex binary operations will be completed.

2. Alignment Arithmetic Operation on Sets

Two sequences which have the same length will perform the alignment computation on members, and return a new sequence.

  • A++B

[A(1)+B(1),A(2)+B(2),…]

  • A--B

[A(1)-B(1),A(2)-B(2),…]

  • A**B

[A(1)*B(1),A(2)*B(2),…]

  • A//B

[A(1)/B(1),A(2)/B(2),…]

  • A%%B

[A(1)%B(1),A(2)%B(2),…], the % here is the Mod computation.

  • A\\B

[A(1)\B(1),A(2)\B(2),…], the slash here means integer division


3. Comparison of Sets

In esProc, the function cmp(A,B) can be used to compare the sequence A and B.

  • cmp(A,B)

Compare the member values of two sequences at the same location one by one, and return the 1 or -1 when encountering the first unequal member, if A is identical to B, then return 0. Specifically, cmp(A) or cmp(A,0) will compare A with the sequence with the same length and members are all 0, i.e. cmp(A,[0,0,…,0]).


Comparison of two sequences can be briefly expressed as A==B, A>B.

Note that a sequence in esProc is an ordered set, so order plays an important role in comparing the size of two sequences A and B. A.eq(B) is used to see if the two sequences have common members.



Because the member order in A1 and A2 is different, result in A3 is false, showing the two sequences are not equal. While result in A4 is true, showing the two sequences have the same members.

July 28, 2014

Implement Basic Functions of SQL through R Language & esProc

As we know, SQL users have to finish composing all codes and then run them all at one time, resulting in inconvenient interactive computing. However, the simple and easy-to-understand query syntax of SQL is always welcomed by programmers. R language and esProc as the computing and analysis tool are surely required to offer the similar query syntax. In this essay, we will discuss how they implement the basic functions of SQL through R language and esProc through some examples.

The example data is from 2 tables of the classical Northwind database:
Orders table with the main fields: OrderID, EmployeeID, OrderDate, Freight, CustomerID

Customer table with the main fields: CustomerID, CompanyName

Retrieve data of the entire table
SQL: select * from Orders

R: A1<-sqlQuery(conn,'select * from Orders')

esProc: $select * from Orders

Comments: The word count of esProc and SQL codes only differs by one word. The R language has a strong flavor of programming.

Where: Search for the order which has a freightage higher than 100 and is placed before the year of 1998
SQL: select * from Orders where Freight > 100 and OrderDate < '1998-01-01'

R : subset(A1,Freight > 100 & OrderDate < as.POSIXlt('1998-01-01'))

esProc: =A1.select(Freight > 100 && OrderDate < date("1998-01-01"))

Comments: Both SQL and R solutions are close to SQL to some extent. R is of the typical function style, and esProc is of the typical object style.

Order: Sort by employees correctly, and then sort by freightage in reverse order
SQL: select * from Orders order by EmployeeID, Freight desc

R: A1[order(A1$ EmployeeID,-A1$Freight),]

esProc: =A1.sort(EmployeeID,Freight:-1)

Comments: R solution is to retrieve 2 vectors from A1 at first, pass them to the function order to group them together for sorting, and then export the serial numbers. At last, data will be rearranged according to the serial number. The computing process of R language is rather winding, not as straightforward as SQL. It is because that R is good at handling vector, the access to structured data will take column as the basic unit, and the parameter usually takes up a whole column. By comparison, SQL takes the record (column) as the basic unit, with parameter as the column name.

esProc solution resembles that of SQL because esProc takes the record (row) as the basic unit .

Group & Sum: Summarize by employee, sum up the freightage, and count the orders:
SQL: select EmployeeID, count(OrderID), sum(Freight) from Orders group by EmployeeID

R :
A4<-aggregate (A1$Freight,list(A1$EmployeeID),sum)  
A4$count<-tapply(A1$Freight,A1$EmployeeID,length)

esProc : =A1.group(EmployeeID;EmployeeID,~.count(),~.sum(Freight))

Comments: In this case, it is obvious that R and SQL differ greatly. The algorithms available in R may be clearer for mathematicians, and means more learning efforts for users accustomed to SQL.

Join:  Perform left join on Orders table and Customers table by CustomerID.
SQL :select * from Orders leftjoin Customers on Orders.CustomerID=Customers.CustomerID

R :merge(A1,B1,by.x="CustomerID",by.y="CustomerID",all.x=TRUE)

esProc: =join@1(A1:CustomerID:Orders, B1:CustomerID:Customers)

Comments: The join of SQL equals to join of esProc or the merge of R. Similarly, the left join of SQL equals to join@1 of esProc, or merge(...all.x=TRUE) of R. Obviously, esProc is more alike SQL in the respects of both the syntax conventions and the literal meanings.

Distinct: Remove the duplicate CustomerID
SQL: selectdistinct CustomerID fromOrders

R: unique(B2$CustomerID)

esProc: =B2.id(CustomerID)
Comments: The keywords of the two solutions respectively differ to that of SQL. However, their usages are basically the same to that of SQL. In which, R is the typical function style, and esProc is the typical object style.

Like: Search for the record with Island in ShipName
SQL : select * from Orders where ShipName like '%Island%'

R: subset(A1,grepl("Island",ShipName,ignore.case = TRUE))

esProc : =A1.select(like@c(ShipName ,"*Island*"))

Comments: R supports several means to match, including the regular expressions, and is more powerful than esProc in this respect. The usages of esProc are more close to that of SQL, and fit for those who are familiar with SQL.

As can be seen from the above comparison, esProc has a coding style more close to that of SQL since esProc supports the data structure of “Record” by the infrastructure, making it more suitable for users who are familiar with SQL. In addition, compared with R language, esProc provides the representation style of "object + functions" that is much easier for those who are familiar with database to accept.
R is more resourceful in details and third party support, ideal for mathematicians.

Grouping Function Comparison R Language vs. esProc

Grouping is to allocate the samples into several groups according to a specific flag. There is a difference between groups and the relative commonness shared by group members. The grouping plays an important role in statistical analysis. For example, the type grouping is used to differentiate the types of economy, society, sciences, and other phenomena. The structural grouping is used to study the internal structure, and the analysis group is used to analyze the coexistence relation between data.

As the data analysis language, both R language and esProc provide the rich functions of grouping. Let’s use some examples to have an idea of their difference. In these cases, we will use the Orders table from Northwind database as the sample data.

Basic grouping: group by a certain column, for example, view data by employee.
   R:orderByEmp<-split(result,result$EmployeeID)
 
   esProc:A2=A1.group(EmployeeID)

Regarding the basic functions, both R and esProc implement it well. In addition, users can expand the basic grouping functions, such as, group by multi-columns, group and summarize concurrently, first group and then summarize, and continuous grouping at one level after another, and the inter-row computation on data to be grouped. 

The basic grouping, in effect, can be characterized as follows: the original member will always be assigned to a certain group, and no duplicate member is allowed. This is the completely-divided grouping that is championed by the relation algebra (i.e. SQL). In some cases, the conditions are even more complicated. For example, the Marketing department sent a list of advertisements & regions (AdCountry by name). These regions are the location where the advertisement campaigns are intensively launched. Currently, we need to analyze the order conditions in these regions. Such conditions are characterized as follows:  
The advertisement list has definitely less countries than those in the Orders table because it is the advertisement for partial countries.
The advertisement list may comprise more countries than those in the Orders table because it is quite normal for some countries having no orders.

Such type of grouping can be referred to as "incompletely divided grouping". This is not supported by SQL theory and hard to implement. Let's check whether R language and esProc can over-perform SQL in this respect:

R solution:
   AdCountry<-c("USA","Finland","Canada","NotInOrders")
   AdCountryFac<-factor(result$ShipCountry,levels=AdCountry)
   groupByAd<-split(result,AdCountryFac)

esProc solution:
   A3=["USA","Finland","Canada","NotInOrders"]
   A4=A1.align(A3,ShipCountry)

Comments:Both R and esProc can solve this problem well. The data is grouped into 4 groups. There are some data in the first 3 groups, and the last group is empty, as expected. 

Let us then check the grouping on simple conditions: classify the freightage into 3 categories of 0-30, 30-100, and 100. 

R solution:
   Frekind<-cut(result$Freight,breaks=c(0,30,100,Inf))
   orderByFei<-split(result,Frekind)

esProc solution:
   A5=["?<=30", "?<=100&& ?>=30" ,”?>100”]
   A6=A1.enum(A5,Freight)

Comments: Both solutions solve the problem perfectly. However, you may have noticed that the representation of esProc is much more flexible. For example, esProc users can carry out the Boolean judgment on character string or data of other types. They can also compute the Boolean expression on 2 fields concurrently. By comparison, it is much more complicated for R language users to implement the similar functions. Since R language users can only perform grouping on one field of numeric type to group it into a non-overlapped range/category. The limitations are really not just a few.

Then, let us check a case of much more complicated example of grouping on conditions. For example, the freightage belongs to these 3 categories: the 5$-15$ is the freightage range that is most easily to be accepted by users, the low freightage range is for those below the 50$, and the high freightage range is for those higher than 50. In this case, there is some overlapped area of category 1 and category 2. Then, the record whose freightage below 10$ must exist in both these 2 groups.

R solution:
   subset(result,Freight>=5 &  Freight<=15)->g5to15
   subset(result,Freight<=50 )->g0to50
   subset(result,Freight>50 )->g50toinf
   gFeight<-list(g5to15=g5to15,g0to50=g0to50,g50toinf=g50toinf)

esProc solution:
   A5=["?<=5 &&?>=15", "?<=50" ,”?>50”]
   A6=A1.enum@r(A5,Freight)

Comments: R does not provide the function / feature to implement the grouping on complicated conditions. In fact, such grouping is made out awkwardly. So, it by no means resembles the "grouping action". esProc solution is the same to that for the previous example. In which, @r is used to indicate that the duplicates are allowed in the groups. Such syntax style is flexible, and you can implement lots of functions or features on the basis of the limited number of functions, not having to name any new functions. 

As can be seen from several examples above, R language can be used to implement the advanced grouping function. R language is more powerful than SQL in this respect. However, it is still behind esProc regarding the flexibility and usability.