August 28, 2014

Methods of Grouping and Summarizing in R Language

The operation of grouping and summarizing consists of grouping two-dimensional structured data by one or more fields and then summarizing fields of each group. The following introduces methods of grouping and summarizing in R language through an example. In order to make the example more typical, we'll use two grouping fields and two summarizing operations.

Case description:
Please group the data frame orders by CLIENT and SELLERID, and then sum field AMOUNT and find its maximum value in each group.


Note: orders contains records of sales orders. Its source can be a database or a file, such as orders<-read.table("sales.txt",sep="\t",header=TRUE). The first rows of data are as follows:

Method 1: aggregate function
Code:
    result1<-aggregate(orders$AMOUNT, orders[,c("SELLERID","CLIENT")],sum)   # sum of AMOUNT per group
    result2<-aggregate(orders$AMOUNT, orders[,c("SELLERID","CLIENT")],max)   # max of AMOUNT per group
    result<-cbind(result1,result2$x)                                         # combine the two results

Part of the computed result:

Code interpretation:
1. The name aggregate implies that it is a function specializing in grouping and summarizing. Both its input parameters and its computed result are data frames, and its usage is relatively simple.
2. The aggregate function cannot perform multiple summarizing operations on grouped data, so two lines of code are required to compute sum and max respectively, and their results are then combined with cbind. Obviously, the code is not satisfactory in performance or usability.
3. The aggregate function has a strange requirement on the order of the grouping fields: they must be written in reversed order. So, to group by CLIENT first and then SELLERID, the code has to be written as orders[,c("SELLERID","CLIENT")]. Code written in the natural order would be incorrect.
4. Not only is the code written in an unnatural way, but the computed result is also odd, putting field SELLERID before CLIENT. In practice, the code should be improved so that the result conforms to the business logic, as in the sketch below.
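For example, a minimal sketch of such an improvement (the summary column names are illustrative assumptions):

    result<-cbind(result1,result2$x)
    names(result)<-c("SELLERID","CLIENT","SUM_AMOUNT","MAX_AMOUNT")     # readable column names
    result<-result[,c("CLIENT","SELLERID","SUM_AMOUNT","MAX_AMOUNT")]   # restore the natural order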

Summary:
The aggregate function manages to perform the task after a fashion, but it is weak in performance and usability because the way of coding, the computed result, and the business logic are inconsistent with one another.

Method 2: split function + lapply function
Code:
    sp<-split(orders,orders[,c("SELLERID","CLIENT")],drop=TRUE)
    result1<-lapply(sp,FUN=function(x) sum(x$AMOUNT))   # sum AMOUNT within each group
    result2<-lapply(sp,FUN=function(x) max(x$AMOUNT))   # max AMOUNT within each group
    result<-cbind(result1,result2)                      # combine the two summaries

Part of the computed result:

Code interpretation:
1. The role of the split function is to group the data frame by the specified fields; no further computation is involved. The lapply function performs the same operation on each group. Working together, split and lapply fulfill the task.
2. Because the grouped data can be reused, this method performs better than the one using the aggregate function.
3. As the lapply function doesn't support multiple statistical operations either, two lines of code are still required to compute sum and max respectively, with cbind combining the results. What's more, this method requires an extra split call, so instead of enhancing usability, it reduces it.
4. The grouping order is still unnatural, and the code has to be written reversely as orders[,c("SELLERID","CLIENT")].
5. The computed result needs a lot of postprocessing, which brings great inconvenience. The first column of the computed result is, in fact, "SELLERID.CLIENT". It needs to be split into two columns whose order should then be exchanged, as in the sketch below.
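A hedged sketch of that postprocessing (it assumes the group keys are the row names produced above and that neither field value contains a dot; the new column names are illustrative):

    keys<-do.call(rbind,strsplit(rownames(result),".",fixed=TRUE))   # split the "SELLERID.CLIENT" keys
    result<-data.frame(CLIENT=keys[,2],SELLERID=keys[,1],            # exchange the column order
                       SUM_AMOUNT=unlist(result[,1]),
                       MAX_AMOUNT=unlist(result[,2]))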

Summary:
This method improves performance somewhat, but its usability is obviously poor, with the way of coding, the business logic, and the computed result inconsistent with one another.

lapply belongs to the apply family of functions. Similar functions include sapply and tapply, whose usages differ in parameters. For example:
    sp<-split(orders,orders[,c("SELLERID","CLIENT")],drop=TRUE)
    result1<-sapply(sp,simplify=FALSE,FUN=function(x) sum(x$AMOUNT))
    result2<-sapply(sp,simplify=FALSE,FUN=function(x) max(x$AMOUNT))
    result<-cbind(result1,result2)

tapply is intended for grouped summarizing and, by rights, should be the most suitable function for this task. In fact it isn't: it applies only when a single field is grouped. When two fields are grouped together, the result is a two-dimensional matrix, which users have to process further in complicated ways, as in the sketch below. For example, the computed result of the line of code tapply(orders$AMOUNT, orders[,c("SELLERID","CLIENT")],function(x) sum(x)) is as follows:
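To turn that matrix back into a grouped table, further processing along these lines is needed (a hedged base-R sketch; the NA cells are seller/client combinations that never occur):

    m<-tapply(orders$AMOUNT, orders[,c("SELLERID","CLIENT")],sum)
    flat<-as.data.frame(as.table(m))   # long format: SELLERID, CLIENT, Freq
    flat<-flat[!is.na(flat$Freq),]     # drop the absent combinations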

Third-party library functions
There are various disadvantages in using R's built-in functions to group and summarize. In response, we may consider third-party library functions, such as reshape, stack, etc. But the stability and computational efficiency of these library functions are generally not as good as those of the built-in functions, and documentation on their use is scarce, so it is difficult for them to fulfill the task. We won't go into examples of their use here.

Third-party languages
Python, esProc and Perl can also be employed to fulfill this task. All of them can perform grouping and summarizing as well as structured data computing, as R can. We'll briefly introduce the esProc and Python solutions.

esProc
esProc can fulfill this task with the groups function alone. Its syntax is concise, easy to understand, and in line with the natural way of thinking. The code is as follows:
    result=orders.groups(CLIENT,SELLERID;sum(Amount),max(Amount))

The computed result, syntax and business logic are highly consistent with each other in esProc. Some of the computed results are as follows:

Python (pandas)
If Python's built-in functions were used for this task, the code would be rather complicated. Here pandas, a third-party library, comes to help: it first groups with the groupby function, then summarizes with the agg function. Its code, simpler than R's but not as concise as esProc's, is written like this:
    result=orders.groupby(['CLIENT','SELLERID']).agg({'AMOUNT':[sum,max]})
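For completeness, a hedged, self-contained version of that snippet (the file name and separator are taken from the case description above):

    import pandas as pd
    orders = pd.read_csv("sales.txt", sep="\t")   # read the tab-separated sales file
    result = orders.groupby(['CLIENT','SELLERID']).agg({'AMOUNT':[sum,max]})
    print(result)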

Pandas' computed result and syntax are highly consistent with the business logic. Part of the computed result is as follows:

August 27, 2014

Code Examples of Common In-Memory Grouping with esProc

It is convenient to realize some common kinds of in-memory grouping with esProc, such as equal grouping, alignment grouping and enumeration grouping. They are illustrated with the following examples.

Equal grouping

The grouping basis of equal grouping consists of certain fields (or columns computed from fields) within the data set. Each group is a subset of the original data set.
Case description: Group sales orders by year.

Data description: Data of sales orders are shown below: 
The above data set (table sequence) can be accessed from a database or a file. For example:
A1=file("E:/sales.txt").import@t()

esProc code:
         A2=A1.group(year(OrderDate))

Computed results:
Code interpretation:

1. In this example, the grouping basis comes from OrderDate. The date of each sales order is converted into the year through year(OrderDate), and data of the same year are grouped together.

2. There may be multiple grouping fields. For example, to group the data by both year and seller, the code is:

A1.group(year(OrderDate),SellerId)

3. Often, the grouped data are used in aggregation operations, such as computing each year's sales amount from A2. The code is:
A2.new(year(OrderDate):y,~.sum(Amount):a)

Computed results are:

Or, combine grouping and summarizing into one step with groups function:
A1.groups(year(OrderDate):y; sum(Amount):a)

Of course, sometimes we have to execute grouping and summarizing separately in order to reuse the grouped data and improve computational efficiency. For example, we may filter one of the groups of A2 and perform an association computation on another. Another situation: if, after summarizing, the data of a certain group look unusual and deserve further study, that group can be used directly for follow-up computation, with no need to filter it out again. A hedged sketch of such reuse follows.
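For instance, a minimal sketch of reusing the groups in A2 (the select call over the group sequence and the threshold are illustrative assumptions):

    A3=A2.select(~.sum(Amount)>10000)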

4. By default, esProc's group function groups data with a hash algorithm. But for ordered data, comparing adjacent rows gives higher performance; this can be requested with the @o option of the group function. For example:
A1.group@o(year(OrderDate),SellerId)

Alignment grouping


The criteria used for equal grouping come from within a data set. But sometimes they originate from outside it, such as fields of other data sets, user-created arrays, parameter lists and so on. This is where alignment grouping comes in.
   
Different from equal grouping, alignment grouping may produce empty subsets, that is, groups with no corresponding members in the data. It may also produce an incomplete grouping, meaning that some data are not included in any group. Neither happens with equal grouping.

Case description: A table of the top 10 sellers has been worked out according to performance; please group the sales orders following the order of that table.
The data set before grouping:

The sales orders are the same as those in the above example. Data are stored in A1.

Table of the top 10 sellers is stored in B1 as follows: 

Sellers table may come from a temporary table, or be generated by a piece of code. The generating process is not the focus of this example.
esProc code:
    A1.align@a(B1:empID,SellerId)

Computed results:

Code interpretation:
1. In this example, the grouping basis (the sellers table) comes from outside the data set being grouped. After grouping, each group contains only the data of one seller, and the groups are sorted according to the sellers table.

2. Because the sellers appearing in the sales orders outnumber those in the sellers table, some orders won't appear in any group. If we want one more group to hold these orders, we can use function option @n as follows:
A1.align@a@n(B1:empID,SellerId)

The extra group is appended at the end, as follows:

3. Sometimes the grouping basis is not contained in the data set being grouped at all, as with a "newly-employed sellers table". In this case it's normal to produce empty groups. For example, if the first record of the sellers table is modified to empID=100, the computed results will be:

Enumeration grouping 


The grouping basis for enumeration grouping is more flexible still: it can be any Boolean expressions. The records satisfying the same expression get into the same group.

Similar to alignment grouping, enumeration grouping is also a kind of incomplete grouping, possibly producing empty subsets or leaving some records outside every group. In addition, with this grouping method, some records may appear in more than one group.

Case description: Divide sales orders into four groups: A. order amount is less than 1,000; B. order amount is less than 2,000; C. order amount is less than 3,000; D. order amount is less than 10,000. Note that the data cannot be grouped repeatedly; that is, if an order is already in group A, it must not be put into group B, C or D.

The data set before grouping:
The sales orders are the same as those in the above example. Data are stored in A1.

esProc code:
         A2=["?<=1000","?<=2000","?<=3000","?<=10000"]
         A3=A1.enum(A2,Amount)

Computed results:
Code interpretation:

1. In this example, the grouping basis is a set of flexible expressions. Each record is compared with the expressions, and records satisfying the same expression are put into the same group. The groups are sorted according to the order of the grouping basis as well.

2. By default, enumeration grouping does not put the same record into more than one group. As shown in the above example, after group A's data are selected, only the remaining data are compared with expression B. With function option @r, all data are compared with expression B, so duplicate members may appear across groups. For example, with A3=A1.enum@r(A2,Amount), the computed results are:

3. Similar to alignment grouping, if an expression of enumeration grouping matches none of the data to be grouped, an empty group appears. Besides, if some data match no expression at all, function option @n can be used to put them into a surplus group, as in the one-line sketch below.
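Following the pattern of the align@a@n example above, a hedged one-liner (assuming @n combines with enum in the same way):

    A3=A1.enum@n(A2,Amount)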

August 26, 2014

Method of Filtering Data of Big Text Files with R Language

It's common to use R to filter data in files. Sometimes we need to filter data of big files which cannot be loaded into memory all at once. The solution is to import and filter in batches, then merge the results. Here is an example illustrating the method of filtering data of big files with R.

There is a 1GB file, sales.txt, containing a great number of sales order records. Please filter out the records in which the value of field AMOUNT is between 2,000 and 3,000. The column separator used in the file is "\t". The first rows of data are as follows:

R's solution:
con <- file("E:\\sales.txt", "r")
readLines(con,n=1)
result=read.table(con,nrows=100000,sep="\t")
result<-result[result$V4>=2000 & result$V4<=3000,]
while(length(databatch<-read.table(con,header=FALSE,nrows=100000,sep="\t"))!=0) {
  result<-rbind(result,databatch)
}
close(con)

These are some of the computed results:

Code interpretation:
The 1st row: Open the file handle.

The 2nd row: Skip the first row, i.e. the column names.

The 3rd~4th rows: Import the first batch of 100,000 rows of data, then perform filtering and save the result in result.

The 5th~8th rows: Import by loop, 100,000 rows of data per batch. Filter each batch, append the filtering result to the result variable, and then process the next batch.

The 9th row: Close the file handle.

Points for attention:

For small files, a single line of code is enough to import the data, and the first row can be set as the column names of the data frame. But this won't work with big files, whose data must be imported in batches; from the second batch on, the first row cannot be treated as the column names. The column names then default to V1, V2, V3 and so forth.

In order to import data from big files in batches, R has to use a while statement, and the column names are inconvenient to use. Both facts make for fairly complicated code.

Alternative methods:
Python, esProc and Perl can also do this task. Like R, all of them can filter file data and perform structured data computing. The following briefly introduces the methods used by esProc and Python.

esProc can automatically process data in batches, so it's not necessary for programmers to control the operation manually with loop statements, and its code is quite simple:
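A minimal sketch of what that code can look like (hedged: it assumes esProc's cursor-style select and fetch functions and the field name AMOUNT from the file header):

    A1=file("E:/sales.txt").cursor@t()
    A2=A1.select(AMOUNT>=2000 && AMOUNT<=3000)
    A3=A2.fetch()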


Cursor is an esProc data type used for structured data computing. It has almost the same usage as the data frame but is better at processing big files and performing complex computations. What's more, a cursor can set the first row of the file as the column names with the @t option.

Python's code structure is similar to that of R; both require manual loop control. But Python has no structured data type like the data frame or cursor, so its code works at a lower level:
result = []
myfile = open("E:\\sales.txt",'r')
BUFSIZE = 10240000
myfile.readline()                      # skip the header row
lines = myfile.readlines(BUFSIZE)      # read a batch of lines
while lines:
    for line in lines:
        record = line.split('\t')
        AMOUNT = float(record[3])
        if AMOUNT >= 2000 and AMOUNT <= 3000:
            result.append(record)      # keep the records in range
    lines = myfile.readlines(BUFSIZE)  # read the next batch
myfile.close()


Though Python can perform the above operation with third-party packages like pandas, which has a structured data object similar to the data frame, pandas lacks sufficient ability in processing big files, so it is hard to simplify the code much further.

August 25, 2014

A Method of Grouping and Summarizing Data of Big Text Files in R Language

It is common to use R to group and summarize data in files. Sometimes the files are comparatively big: the computed result is small, but the source data are too big to be loaded into memory at once. The only solution is to import and compute in batches, then merge the results. The following example illustrates the way R groups and summarizes data from big text files.

Here is a 1GB file, sales.txt, containing a great number of sales order records. We want to group the data by field CLIENT and sum field AMOUNT. "\t" is used in the file as the column separator. The first rows of data are as follows:
R's solution:
con <- file("E: \\sales.txt", "r")
result=read.table(con,nrows=100000,sep="\t",header=TRUE)
result<-aggregate(result[,4],list(result[,2]),sum)
while(nrow(databatch<-read.table(con,header=FALSE,nrows=100000,sep="\t",col.names=c("
ORDERID","Group.1","SELLERID","x","ORDERDATE")))!=0) {
databatch<-databatch[,c(2,4)]
  result<-rbind(result,databatch)
  result<-aggregate(result[,2],list(result[,1]),sum)
}
   
close(con)

Part of the computed result:

Code interpretation:
The 1st line: Open the file handle.
The 2nd~3rd lines: Import the first batch of 100,000 rows of data, group and summarize them, and save the result in result.
The 4th~8th lines: Import data by loop, 100,000 rows per batch, into the variable databatch. Then take the second and fourth fields, i.e. "CLIENT" and "AMOUNT", merge databatch into result, and execute the grouping and summarizing again.
It can be seen that, at any given moment, only databatch, which holds 100,000 rows of data, and result, the summarizing result, occupy memory. Usually the latter is small and will not cause a memory overflow.
The 9th line: Close the file handle.

Matters needing attention:
Data frame. Because R's data frame cannot directly compute over big files, a loop statement is needed to do the job here: import a batch of data and merge it into the data frame result, group and summarize result, then import the next batch. You can see that this looping code is a little complicated.

Column names. As the first row of the file holds the column names, header=TRUE can be used for the first batch to set them directly. But the subsequent batches have no header row and must be imported with header=FALSE, which by default yields the column names V1, V2 and so forth. After grouping and summarizing, however, the default column names become Group.1 and x, so col.names is needed to rename the incoming columns, keeping the structure consistent before and after grouping and summarizing and setting the stage for the subsequent merging. This column-name handling deserves attention because it is easy to get wrong.

Alternative solutions:
Python, esProc and Perl can also perform this operation. Like R, they can group and summarize data from big text files and carry out the subsequent structured data computing. We'll briefly introduce the coding methods used by esProc and Python.
esProc can process data in batches automatically, which requires no manual loop control from programmers and produces quite simple code:
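A minimal sketch of what that code can look like (hedged: it assumes the cursor-style groups function and the field names CLIENT and AMOUNT from the file header):

    A1=file("E:/sales.txt").cursor@t()
    A2=A1.groups(CLIENT;sum(AMOUNT))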

Cursor is a data type used for structured data computing in esProc. Its usage is similar to that of the data frame, but it is better at processing big files and performing complicated computations. What's more, the @t option in the code indicates that the first line of the file holds the column names, so the column names can be used directly in subsequent computation.

Python's code structure, which also requires manual loop control, is similar to that of R. But Python itself has no structured data type like the data frame or cursor, so its code works at a lower level:
from itertools import groupby
from operator import itemgetter

result = []
myfile = open("E:\\sales.txt",'r')
BUFSIZE = 10240000
myfile.readline()                                     # skip the header row
lines = myfile.readlines(BUFSIZE)                     # read a batch of lines
while lines:
    for line in lines:
        record = line.split('\t')
        result.append([record[1],float(record[3])])   # keep CLIENT and AMOUNT
    result = sorted(result,key=lambda x:x[0])         # sort before grouping
    batch = []
    for key, items in groupby(result, itemgetter(0)): # group with groupby
        value = 0
        for subitem in items:
            value += subitem[1]
        batch.append([key,value])                     # one summarized row per group
    result = batch                                    # carry the summary into the next batch
    lines = myfile.readlines(BUFSIZE)
myfile.close()

Besides the above two-dimensional-array approach, Python can execute the operation with third-party packages. For example, pandas has a structured data object similar to the data frame and simplifies the code in a similar way to R. But it lacks sufficient ability to compute over big files, so a loop statement is still needed when programming.

August 21, 2014

Basic Data Types in Data Processing Programming Languages

Programming languages focus on different basic data types, subject to their different design goals. Languages such as Java and C# are designed to develop common applications; their basic data types are character strings, numbers, booleans and other atomic types, arrays, and common objects. SQL, PowerBuilder, R, esProc, and other such languages are designed to process data, so their basic data type is the structured 2-dimensional data table object. Take this SQL statement, for example: SELECT T1.id,T1.name,T1.value FROM T1 LEFT JOIN T2 ON T1.id=T2.id. Here T1, T2, and the computed result all use such a data type. With multiple fields forming one record and multiple records forming 2-dimensional data, the combination of such data and its field names is the structured 2-dimensional data table object.

Why not use atomic data types and common objects as the basic data type for data processing languages? If you represent T1 and T2 from the above SQL statement with arrays or ArrayList objects, you will find that the complexity increases several times over and the length of the code grows sharply, by dozens of times, as the sketch below suggests.
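To make that concrete, here is a hedged Python sketch of the same left join written against plain lists of dicts (T1 and T2 are assumed sample tables; a SQL engine does all of this implicitly):

    # sample data standing in for tables T1 and T2
    T1 = [{"id": 1, "name": "a", "value": 10}, {"id": 2, "name": "b", "value": 20}]
    T2 = [{"id": 1}, {"id": 3}]

    # LEFT JOIN T2 ON T1.id=T2.id, selecting T1.id, T1.name, T1.value
    result = []
    for r1 in T1:
        matches = [r2 for r2 in T2 if r2["id"] == r1["id"]]
        if matches:
            for r2 in matches:                  # one output row per match
                result.append((r1["id"], r1["name"], r1["value"]))
        else:                                   # unmatched T1 rows are kept as well
            result.append((r1["id"], r1["name"], r1["value"]))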

The basic data type of data processing languages is the structured 2-dimensional data table object. This is not a coincidence; there are subtle reasons behind it.

Correspond to actual business. In the real world, most business data are structured. For example, a payroll list has the employee number, employee name, department, date, pre-tax salary, and post-tax salary; a retail record has the order time, outlet number, checkout counter number, cashier number, product name, and unit price; a website log comprises the browse time, URL, visitor IP, browser version, and other properties. These properties are equivalent to fields, and each record has the same structure. Though such data may be stored in text rather than a database, they are still structured data in nature, so it is only natural to represent them with a 2-dimensional data table. The structured 2-dimensional data table object represents business data intuitively and faithfully, whether for storage, computing, exchange, or sharing, and such data are the easiest for users to understand.

Easy for massive processing. Business data are mostly data of the same structure, as in the payroll table, retail records, and website logs mentioned above. In processing such data we occasionally handle a single value of a single record, but in most cases we take records as the unit and process all the data: computing the after-tax wage from the pre-tax wage, computing the amount from the unit price and quantity of a commodity, counting the daily on-line duration for each IP. This processing mode is exactly massive data processing. To implement it, we can traverse every member of an array in loops by row and column number, as in Java. Alternatively, we can operate on the data directly through business field names, as in SQL and esProc. The latter solution is simpler and easier to use, with no loop statements to write; programmers can operate on data intuitively from the business perspective, and the corresponding code becomes more concise and readable.

Compatible with the Relational Algebra. Relational algebra is the underlying theory developed for data processing and query. With its basic operations along with the join, aggregation, and division operations, the associations and laws of operations among business data can be expressed in full detail. Theoretically, any data processing or data query problem, of any degree of difficulty, can be expressed and solved in relational algebra. Because relational algebra is concise and complete, databases are largely designed on this theory, and E.F. Codd is therefore called the father of the relational database. The structured 2-dimensional data table object is just the data type recommended by E.F. Codd; it can express the various operations of relational algebra and thus solve data processing problems easily. In fact, the database result set is the earliest structured 2-dimensional data table object.
As can be seen, all kinds of programming languages adopt the structured 2-dimensional data table object as the basic data type because it corresponds to real business data, makes massive computation easy to implement, and is compatible with relational algebra theory. With the 2-dimensional data table object, code can be simple and easy to understand, and development efficiency is improved. Let me explain with a few more examples below:

Result set of SQL (resultSet): Group by the book type to compute the average price of the books whose average price is greater than 15 yuan.
    select avg(price),type from books group by type having avg(price)>15

Table sequence of esProc (TSeq): Group by department to find the top 10 best sellers for each department.
    products.group(department).(~.top(quantity;10))

Data window of PowerBuilder (datawindow): Sort the order by price
    Order.SetSort('value d')
    Order.Sort()

R language data frame (data.frame): Left-join the orders table and customer table by customerID.
    merge(A1,B1,by.x="CustomerID",by.y="CustomerID",all.x=TRUE)

SQL, esProc, and R code comparison: Group the order data by department, and summarize the order data and sales amount of each department.
SQL:
    Select count(*),sum(sales) from orders group by Dept

esProc:
    orders.groups(Dept; count(~), sum(sales))

R language:
    result<-aggregate(orders$sales,list(orders$Dept),sum)
    result$count<-tapply(orders$sales,orders$Dept,length)
        
Let's take a closer look at the result set, table sequence, data window, and data frame. Although they are all structured 2-dimensional data table objects with basically the same functions, there are slight differences between them.
The SQL result set is backed by abundant reference materials, widely applied, universal, and simple to use; it is the mainstream data type among data processing languages. However, SQL does not implement relational algebra in full, making some computations, such as set division, a bit inconvenient.

DataWindow usually retrieves data through SQL and returns the final result to the database. It mainly serves to break the barrier between data and UI controls, so that programmers can quickly design and deliver highly interactive database applications. Another major function of DataWindow is to render and edit data. It can only be used for form computation, and its data processing capability is relatively poor.

The data frame is capable of structured computation to some extent. As can be seen from the above example, its syntax is obscure, and implementing the same functions with it is relatively complex. This is because R's major functions are scientific and statistical computing, focused on the array and matrix data types; the data frame was introduced later, as an additional data type, to implement structured data computing. In this respect, the data frame is not as dedicated as the other three objects.

TSeq is quite dedicated to data processing. Having incorporated the common strong points of SQL result sets, TSeq implements relational algebra fully and completely. TSeq is ordered and generic, which makes it especially fit for order-related complex computations in data processing, for example: yearly link relative ratio, year-on-year comparison, ranking, relative position computing, and interval computing. Being generic also makes it easier to establish relations between data and to access data with multi-level associations through objects. Compared with SQL, though, TSeq cannot directly process big data because it is a pure in-memory object.

As can be seen, the structured 2-dimensional data table object is directly related to how dedicated a data processing language is: the more powerful the former, the more dedicated the latter, and vice versa. If a programming language lacks the structured 2-dimensional data table object, it can hardly be regarded as dedicated to data processing. To judge whether a programming language can be used to develop data analysis and processing applications efficiently, the key is whether it offers a dedicated 2-dimensional data table object and the appropriate class library.

Perl is often used to process character strings and is capable of processing data to some extent. However, since its code is lengthy and complex, it is not a dedicated data processing language. For example, to complete the simplest grouping-and-summarizing algorithm, the Perl code looks like this:
    %groups=();                                  # hash: group name => rows of the group
    foreach (@carts){
        $name = $_->[1];
        if(!defined $groups{$name}){             # first record of this group
            $groups{$name}=[$_];
        }
        else{
            push(@{$groups{$name}},$_);
        }
    }
    my @result=();
    foreach (keys(%groups)){                     # summarize each group
        $value=0;
        while($row=pop @{$groups{$_}}){
            $value += $row->[2];
        }
        push @result,[$_,$value];
    }

Python is a bit simpler to write, but still an order of magnitude less efficient to develop in than SQL, esProc, and R. The sample code is shown below:
    from itertools import groupby
    from operator import itemgetter

    # data: rows of [key, value1, value2], already sorted by the grouping key
    result=[]
    for key, items in groupby(data, itemgetter(0)):
        value1=0
        value2=0
        for subitem in items:                    # accumulate both measures
            value1+=subitem[1]
            value2+=subitem[2]
        result.append([key,value1,value2])
    print(result)

Perl and Python are not dedicated data processing tools. Most importantly, they lack the structured 2-dimensional data table object.

The TSeq of esProc is not only a structured 2-dimensional data table object but is also characterized by being ordered and generic and by supporting step-by-step computation, making it more dedicated than similar languages. For example, consider a relatively complex computational goal: find the shares that have been rising for 5 consecutive days. The esProc solution code is shown below: