September 21, 2014

An Illustration of Processing Big Text Files with esProc Cursor

esProc can process big text files conveniently through its cursor data object. The following example illustrates this.

Let's assume there is a text file, sales.txt, with ten million sales records. Its main fields include SellerID, OrderDate and Amount. Now we want to compute each salesperson's total amount of big orders over the past four years, where big orders are those whose amount is above 2000.


esProc code:

Code interpretation:

A1: Reading all ten million records into memory at once would cause a memory overflow, so the job is done in batches.

A2: Read the data in a loop, 100,000 rows at a time.

B3: Filter each batch of data, selecting the records whose amount is above 2000 and whose order date falls within the past four years (from 2011 on).

B4: Group and summarize the filtered data to get each salesperson's sales in this batch.

B5: Append the result computed for this batch to a holding variable (B1), then move on to computing the next batch.

B6: After all the batches are processed, B1 holds each salesperson's sales from every batch. Finally, group and summarize these figures to get each salesperson's total sales amount.
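
For readers who want to try the same batch-and-aggregate approach outside esProc, here is a rough Python (pandas) sketch of the logic described above; the tab separator, header row and column layout of sales.txt are assumptions:

import pandas as pd

# Read sales.txt in batches of 100,000 rows, filter big orders from 2011 on,
# aggregate each batch by seller, and accumulate the partial sums (the role of B1).
totals = {}
for chunk in pd.read_csv("sales.txt", sep="\t", parse_dates=["OrderDate"],
                         chunksize=100000):
    big = chunk[(chunk["Amount"] > 2000) & (chunk["OrderDate"].dt.year >= 2011)]
    partial = big.groupby("SellerID")["Amount"].sum()
    for seller, amount in partial.items():
        totals[seller] = totals.get(seller, 0) + amount

print(totals)  # each salesperson's total amount of big orders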

Analysis:

In cell A1, the esProc cursor is created with the cursor function; the cell name serves as the cursor's variable name. Creating the cursor does not read any data into memory. Data is read only when a fetch operation (or an equivalent one) is executed; for example, the code for A1,100000 in cell A2 reads data from the cursor in a loop, 100,000 rows at a time. As a result, the amount of data held in memory is always kept at a relatively low level and no overflow occurs.

select and groups are functions designed for computing on structured data. Once the data has been read into memory through an esProc cursor, it can be processed and analyzed with the functions of a dedicated structured-data computation library, which is more convenient than writing low-level code by hand.

Equipped with functions and syntax for semi-structured data processing, such as functions for splitting and merging data, loop and traversal statements, and branch statements, the esProc cursor can handle complex data cleansing and arrangement tasks and turn raw input into structured data that is easy to compute on.

Splitting and analyzing
For instance, the format of a weblog is too complex to be computed and analyzed directly. A typical weblog text needs to be transformed into a two-dimensional table in a standard format before it can be used in structured data computation or stored in a database.

A record in the original weblog:
10.10.10.145 - - [01/May/2013:03:24:56 -0400] "GET /product/p0040001/review.jsp?page=5 HTTP/1.1" 200 8100 "http://www.xxx.com/xxxx.html""Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36" 0 tankLee in 9fc0792eb272b78916c3872e9ad –

The records in a two-dimensional table are: 

The following case omits the process of file access and the final merging of multiple batches of data (refer to the previous example), and lists the code for splitting and analyzing directly.
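
As a rough illustration of the splitting step (not the original esProc code), the sample record above can be broken into fields with a regular expression, as in the following Python sketch; the group names are hypothetical:

import re

# Pattern for the extended combined log format shown above; group names are illustrative.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d+) (?P<size>\d+) '
    r'"(?P<referer>[^"]*)"\s*"(?P<agent>[^"]*)"')

line = ('10.10.10.145 - - [01/May/2013:03:24:56 -0400] '
        '"GET /product/p0040001/review.jsp?page=5 HTTP/1.1" 200 8100 '
        '"http://www.xxx.com/xxxx.html""Mozilla/5.0 (Windows NT 6.1)"')

m = LOG_PATTERN.match(line)
if m:
    record = m.groupdict()  # one row of the two-dimensional table
    print(record["ip"], record["time"], record["url"], record["status"])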

Data cleansing

Let's see a typical example of data cleansing. Since the employee table read in from a file is not in a standard format, it needs to be reorganized into standard structured data in batches. The data of the current batch is stored temporarily in cell D3. The rules for reorganizing are:

1. The record is invalid if UserID or firstName is null or a blank string.

2. UserID can only contain digits; the record is invalid if letters appear in it.

3. For repeated UserID, only the last entered record is kept.

4. Remove any spaces before and after the data values.

5. Capitalize the first letter of firstName.

6. The full name is combined in the form firstName+"."+lastName; but if lastName is null or a blank string, fullName equals firstName.

The following table also omits the process of file access and the merging of multiple batches of data, and only lists the code for data cleansing:
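
For reference, here is a plain Python sketch of these six rules applied to one batch of records (the original is written in esProc; the dictionary layout and helper name are assumptions, the field names follow the rules above):

def clean_batch(rows):
    # rows: list of dicts with keys UserID, firstName and lastName
    cleaned = {}
    for r in rows:
        user_id = (r.get("UserID") or "").strip()     # rule 4: trim surrounding spaces
        first = (r.get("firstName") or "").strip()
        last = (r.get("lastName") or "").strip()
        if not user_id or not first:                  # rule 1: required fields
            continue
        if not user_id.isdigit():                     # rule 2: digits only in UserID
            continue
        first = first[:1].upper() + first[1:]         # rule 5: capitalize the first letter
        full = first + "." + last if last else first  # rule 6: build the full name
        cleaned[user_id] = {"UserID": user_id, "firstName": first,
                            "lastName": last, "fullName": full}  # rule 3: keep the last entry
    return list(cleaned.values())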

September 18, 2014

Program Eight Queens Problem with esProc

There are quite a lot of chess problems, among which the Eight Queens Problem is the most famous. Place eight Queens on an 8×8 chessboard so that none of them can attack another. How many layouts achieve this?

Since a Queen's striking range covers its row, its column and its diagonals, only one Queen may appear in any of these lines; otherwise the layout is ineligible.

Since there is only one Queen in any row, a sequence of length 8 can be employed: its members record, row by row, the column number of the Queen placed in that row, with zero meaning the row has no Queen yet. When a new Queen is placed on the board, its column number must not already appear in the sequence, which guarantees that no Queen occupies the same column.

At the same time, we also need to make sure the new Queen shares no diagonal with an existing one. If a Queen is located in row m, column k, then in row m+n at most two squares lie on the same diagonals, namely those where the horizontal distance equals the vertical distance: (m+n, k+n) and (m+n, k-n).

So, by examining the status of the Queen in every row in this way, we can count the number of eligible layouts.

esProc can do all the work with loop computation, as shown in the following form:

i is employed during the computation to record the serial number of the current row in which a Queen is being placed. The code in the 2nd line of the above form shows that with each pass of the loop the Queen moves to the next column, so every square of the current row is traversed. In the 3rd line, when the Queen reaches the 9th column, traversal of the current row is complete; the record of the current row is reset to zero, i becomes i-1, and the program returns to continue the traversal of the previous row. Note that once the first row has been fully traversed in this way, the whole search is complete, i is reset to zero, and the loop ends. The 4th line handles the Queen in the first row, which can be placed without any judgment before locating the second Queen. The 6th line judges whether an already-placed Queen occupies the same column, and the 7th line judges whether one occupies the same diagonal; if both answers are no, the Queen of the next row can be located. When all eight Queens have been placed, the 9th line records their current positions.
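
For comparison, the same column and diagonal checks can be written as a short recursive search in Python (this is only an illustrative sketch, not the esProc form above):

def solve(queens=()):
    # queens[r] is the column (1..8) of the queen already placed in row r
    row = len(queens)
    if row == 8:
        return 1                     # a complete, eligible layout
    count = 0
    for col in range(1, 9):
        # accept col only if no earlier queen shares the column or a diagonal
        if all(col != c and abs(col - c) != row - r
               for r, c in enumerate(queens)):
            count += solve(queens + (col,))
    return count

print(solve())   # 92 eligible layouts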

The computed result in A10 is:

Check detailed result in C1:

September 17, 2014

How to Compute Moving Average in R Language

A moving average is used to smooth out a time series. Computing a moving average is a typical case of ordered data computing. The basic method is to take a subset of N consecutive members of the time series, compute the average of that subset, and then shift the subset forward one member at a time. The following example shows how to compute a moving average in R.

Case description:

The data frame sales has two fields: salesDate and Amount, the sales amount of that date. Requirement: compute the three-day moving average, that is, the average of the previous day, the current day and the next day, shifting forward along the dates. A part of the source data is as follows:
Code:
Computed result:
Code interpretation:
The filter function can be used in R to compute a moving average, which produces concise code. This method is quite convenient.

Despite its convenience, the filter function is difficult for beginners to understand. For example, sales$Amount/3 by itself means dividing the current value of the Amount field by three, but when it is used in filter(sales$Amount/3, rep(1,3)) it amounts to adding three consecutive values together and then dividing the sum by three. rep(1,3) evaluates to a vector of three 1s, which here specifies the window over which data is taken. In addition, because neither the name nor the parameters of filter contain the words "average" or "moving", even many R developers don't know it can be used to compute a moving average.

In fact, filter is a general-purpose linear filter whose uses go well beyond computing moving averages. Its complete signature is filter(x, filter, method = c("convolution", "recursive"), sides = 2, circular = FALSE, init).
Any modification of the requirement makes the code still harder to understand. For example, the moving average of the current day and the previous two days cannot be written as filter(sales$Amount/3, rep(0,2)); it has to be filter(sales$Amount/3, rep(1,3), sides = 1).
Summary:
R can compute a moving average, but its code is rather hard to understand.

Third-party solutions
We can also use Python, esProc or Perl to handle this case. Like R, all of these languages can perform data statistics and analysis and can compute a moving average. The following briefly introduces the Python and esProc solutions.

Python(pandas)
Pandas is a third-party Python library that is powerful in processing structured data; its basic data type imitates R's data frame. At the time of writing, the latest version is 0.14. Its code for handling this case is as follows:
         pandas.stats.moments.rolling_mean(sales["Amount"], 3)

The name of the rolling_mean function is clear; even a developer without pandas experience can understand it easily. The function's usage is simple too: its first parameter is the sequence being computed and its second parameter is N, the number of days over which the moving average is taken.
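
For a self-contained illustration, here is a small made-up data set. Note that rolling_mean was the API of pandas 0.14; current pandas versions express the same computation with Series.rolling, where center=True matches the previous/current/next-day window of this example:

import pandas as pd

# Illustrative data; dates and amounts are made up.
sales = pd.DataFrame({
    "salesDate": pd.date_range("2014-09-01", periods=6, freq="D"),
    "Amount": [100, 120, 90, 150, 130, 110],
})

# Three-day centered moving average (previous day, current day, next day).
sales["movingAvg"] = sales["Amount"].rolling(window=3, center=True).mean()
print(sales)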

esProc
esProc is good at expressing business logic freely with an agile syntax. Its expressions for relative positions can solve ordered-data computation problems easily. The code is as follows:
         sales.(Amount{-1,1}.avg())

{-1,1} in the code represents a relative interval, that is, the three days consisting of the previous day, the current day and the next day. As you can see, the moving average can be expressed clearly and flexibly with a relative interval. If we are required, for example, to compute the moving average of the current day and the previous two days, we just need to change the interval to {-2,0} in esProc.

A relative interval is a set. esProc can also express a single element by its relative position. For example, it can compute sales growth conveniently with (Amount - Amount[-1]). In contrast, the corresponding code in R and Python is difficult to understand.

September 16, 2014

Computing the Online Time for Users with esProc (IV)

In the last article we mentioned that IT engineers from the Web Company used esProc to write a single-machine multi-threaded program that could handle a large data volume and complex requirements, leveraging the full power of one multi-core, multi-CPU machine. Now these engineers have found a new issue: with the number of users of the online application growing explosively, colleagues from the Operation Department complained that the online time computation program was still running too slowly.

The IT engineers leveraged esProc's multi-machine parallel computing capability to split the task among multiple machines, and the performance problem was resolved. Single-machine parallel processing was shifted to multi-machine parallel processing at a relatively low cost in hardware and software upgrades.

To improve performance, the Web Company increased the number of servers from one to three. Accordingly, the following steps were needed to shift from single-machine parallel to multi-machine parallel processing:


The first step: modify the esProc program that preprocesses the weekly log files. Divide the user ID by 3 and split the weekly log file into three files according to the remainder, so that each server processes one of them. This reduces the file sizes and shortens file transfer time. The three files are then uploaded to the three servers, and parallel programs perform the computation. The actual program is as follows:


Note in the last screenshot that A6 uses the @g option of the export function to write the week's log data into three binary files. During subsequent parallel processing, the log content can then be retrieved in blocks for different users. The @g option ensures that segmented data retrieval is aligned to group borders, removing the possibility of assigning data of the same user to two blocks.
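
For readers without esProc at hand, the splitting step can be pictured with a rough Python sketch like the one below; the log file name, tab separator and position of the user ID field are all assumptions:

# Route each record of the weekly log into one of three files by userID % 3,
# so that all records of a user end up in the same file.
outputs = [open("weeklog_part%d.txt" % i, "w") for i in range(3)]
with open("weeklog.txt") as src:
    for line in src:
        user_id = int(line.split("\t")[0])   # assume userID is the first field
        outputs[user_id % 3].write(line)
for out in outputs:
    out.close()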

The second step: the single-machine multi-threaded subroutine remains essentially unchanged; let's look at it again.

The subroutine parameters are shown below. They are used to pass the week's log file name, the block number and the total number of blocks when the subroutine is called by the main program. Here the week's log file name already refers to the one of the three segmented files assigned to this machine.


The subroutine is as follows:


The above screenshot illustrates that:
1. Since we previously used export@g to output the file in groups by user ID, the @z option of the cursor in A2, which fetches one specific block (the block number) out of total (the total number of blocks) from the file, retrieves complete groups for each userID. Data for one user will not be split into two blocks.

2. The code line in the red box returns the result to the main program as a cursor. Since multi-machine parallel processing is used here, this cursor is a remote cursor (see esProc's documentation for a detailed introduction to remote cursors).

The third step: write the main program for parallel computing, which calls the parallel subroutine. As illustrated below, the main program runs parallel tasks on three machines, which effectively improves the computation performance.

The server list in the program could also be written into a configuration file, so that adding or removing servers later would be easy.

Note: for specific measurements of esProc's performance gain with parallel computing, please refer to the related esProc test reports.

Notes on the above screen capture:

1. The callx parameter specifies the three servers listed from A1 to A3, which handle the three log files B1 to B3.

2. The syntax of callx's input parameters specifies the three servers through A5 and six parallel computing tasks for each server through A6.

3. The server list, the number of servers, and the number of tasks for each server can be adjusted according to the actual situation to make full use of the servers' performance potential.

The fourth step: deploy the esProc servers and upload the related program and data files. Refer to the esProc instructions for the specific steps and methods.

After the transformation to multi-machine parallel computing, the Operation Department found a significant improvement in the computation speed of users' online time. The cost of this transformation is much lower than that of upgrading the application databases; in particular, on the hardware side only two additional PC servers were needed.

With this, the Web Company finished implementing its esProc-based user behavior analysis and computation platform. Its main advantages are:

1. The platform can easily be adjusted to more complex algorithms in the future, which shortens response time and saves engineers' labor costs.

2. It is easy to scale out for even larger data volumes in the future, with shorter project time and lower upgrade costs.

September 15, 2014

Computing the Online Time for Users with esProc (III)

In the last article we mentioned that IT engineers from the Web Company used esProc to write a program that could handle a large data volume and complex requirements. Not only could it meet the demands of the online time computation, it was also relatively easy to extend with new conditions.

However, these engineers found that the single-threaded program did not take full advantage of the server's computing power. Practice has proved that esProc's multi-threading capability can exploit the server's dual-core, quad-core or even more numerous CPUs. The change from single-threaded to multi-threaded requires very little work.

The Operation Department provided the following requirements for computation of users online time:

1. Login should be considered the starting point of online time, and sessions that span overnight should be taken into consideration.

2. If the time interval between any two operations is less than 3 seconds, then this interval should not be added to online time.


3. If after login, the time interval between any two operations is longer than 600 seconds, then the user should be considered as logged out.


4. If there is only login, without logout, then the last operation time should be treated as time for logout.


5. For a user who completed a post operation, his/her current online time is tripled in the computation.
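
To make the rules concrete, here is one possible reading of them as a plain Python sketch (this is not the Web Company's esProc program; in particular, counting the interval that ends with a post operation three times is only one interpretation of rule 5):

def online_seconds(ops):
    # ops: list of (datetime, action) pairs for one user, sorted by time;
    # action is e.g. "login", "view", "post". Using datetime arithmetic
    # naturally handles sessions that cross midnight (rule 1).
    total = 0.0
    prev = None                        # time of the previous operation while logged in
    for t, action in ops:
        if prev is None:
            if action == "login":      # rule 1: only a login starts the online time
                prev = t
            continue
        gap = (t - prev).total_seconds()
        if gap > 600:                  # rule 3: a silence of over 600 seconds is a logout
            prev = t if action == "login" else None
            continue
        if gap >= 3:                   # rule 2: gaps under 3 seconds are not added
            total += gap * (3 if action == "post" else 1)   # rule 5, one reading
        prev = t                       # rule 4: the last operation closes the session
    return total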


To shift from single-threaded computing to parallel computing, the following steps need to be done:


The first step: adjust the log file preprocessor with the @g option of the export function, so that the week's log file is written out as a segmented binary file. In subsequent parallel processing, the log file can then be retrieved in blocks for different users. The @g option ensures that segmented data retrieval is aligned to group borders, removing the possibility of assigning data of the same user to two blocks. The actual procedure is as follows:


The second step: rewrite the online time computing program into a parallel subroutine. The part in the following red box is what needs to be modified for parallel processing. Because different parallel tasks compute for different users, very few changes are required for parallel computing. The only change required is to replace reading whole files with reading different blocks of the binary file.

First we need to add parameters to the subroutine, which pass the week's log file name, the block number and the total number of blocks when the subroutine is called by the main program.


Then modify the program as follows:

The above screenshot illustrates that:
1. Since we previously used export@g to output the file in groups by user ID, the @z option of cursor, which fetches one specific block (the block number) out of total (the total number of blocks) from the file, as shown in the red box, retrieves complete groups for each userID. Data for one user will not be split into two blocks.

2. A16 returns the result to the main program as a cursor.

The third step: write the main program for parallel computing, which calls the parallel subroutine. Because the server's CPUs have eight cores in total, the IT engineers decided to use six threads for parallel computing. This takes full advantage of the multi-core CPUs to improve performance.
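
The dispatching pattern itself is not unique to esProc. As a rough, language-neutral illustration (the worker body is a placeholder, not the actual subroutine), handing six block numbers to a pool of workers looks like this in Python:

from multiprocessing import Pool

TOTAL_BLOCKS = 6   # matches the six parallel tasks chosen above

def process_block(block_no):
    # Placeholder: read block block_no of TOTAL_BLOCKS from the segmented
    # binary file and compute online time for the users in that block.
    return {}

if __name__ == "__main__":
    with Pool(TOTAL_BLOCKS) as pool:
        partials = pool.map(process_block, range(1, TOTAL_BLOCKS + 1))
    # merge the per-block, per-user results here
    print(len(partials), "blocks processed")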

Note: for specific measurements regarding esProc's performance gain with parallel computing, please refer to related test reports for esProc.

Once this requirement was met, the IT engineers of the Web Company faced a new problem: the number of users of the online application grew explosively, and colleagues from the Operation Department complained that the online time computation program was still running too slowly. The single-machine, multi-threaded approach could no longer improve the computing speed significantly. Can these IT engineers effectively solve the performance issue using esProc's multi-machine parallel computing capability? Is it too costly to transform to a multi-machine parallel mode? See "Computing the Online Time for Users with esProc (IV)".