September 1, 2014

Methods of Realizing Fixed Grouping and Summarizing in R Language

In a grouping and summarizing operation, if the names and number of the groups are given in advance, we call the operation fixed grouping and summarizing. The grouping criterion for this type of operation comes from outside the data set, such as a client list, a parameter list, or a list of conditions. This type of operation raises several questions: what to do when the grouping criterion goes beyond the boundary of the data set, whether an extra group is needed, and whether the groups overlap. Solving these problems requires some effort. In the following, we'll look at several methods for realizing fixed grouping and summarizing in R language.

Case 1: Grouping criterion is within the data set
Data frame sales contains records of sales orders, in which column CLIENT contains the client names and column AMOUNT contains the order amounts. Please group sales by the "potential clients list" and then sum column AMOUNT in each group. The potential clients list is [ARO,BON,CHO], which happens to be a subset of column CLIENT.


Note: The source of sales may be a database or a file, such as sales<-read.table("sales.txt",sep="\t", header=TRUE). The first rows of data are as follows:

Code:
         byFac<-factor(sales$CLIENT,levels=c("ARO","BON","CHO"))
         result<-aggregate(sales$AMOUNT,list(byFac),sum)
Computed result:

Code interpretation:
1. The factor function generates a grouping criterion (called a factor in R language), and the aggregate function groups and summarizes according to it. The structure of the code is quite clear.

2. Note that the grouping criterion is a factor, not a plain vector or array, so it cannot be written as byFac<-c("ARO","BON","CHO"); nor can the factor be passed to aggregate directly, as it must first be wrapped in a list. These points are not easy for beginners to grasp and require special attention.

3. If column CLIENT itself is used as the grouping criterion (that is, unfixed grouping), a single line of code is enough: result<-aggregate(sales$AMOUNT,list(sales$CLIENT),sum).

Summary:
The aggregate function can easily fulfill this task.
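For readers who work in Python, the same fixed grouping can be sketched with pandas, whose Categorical type plays the role of R's factor. The sample rows below are made up for illustration:

```python
import pandas as pd

# made-up sample rows standing in for the sales data frame
sales = pd.DataFrame({
    "CLIENT": ["ARO", "BON", "CHO", "ARO", "WELLI"],
    "AMOUNT": [100.0, 200.0, 300.0, 50.0, 999.0],
})
# like factor(levels=...): fix the group names in advance
by = pd.Categorical(sales["CLIENT"], categories=["ARO", "BON", "CHO"])
# rows outside the list (WELLI) get a missing category and are dropped
result = sales["AMOUNT"].groupby(by, observed=False).sum()
print(result)
```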

Case 2: Grouping criterion is beyond the boundary of the data set
The previous case, where the grouping criterion lies entirely within the column data, is the exception. In fact, since the grouping criterion comes from outside the data set (for example, from external parameters), some of its members may not appear in the column data at all. This case shows how to handle such a situation.

Suppose the "potential clients list" is [ARO,BON,CHO,ZTOZ]. Please divide the data into four groups according to this list and sum column AMOUNT in each group. Note that client ZTOZ is not in column CLIENT.

Code similar to that of case 1:
         byFac<-factor(sales$CLIENT,levels=c("ARO","BON","CHO","ZTOZ"))
         result<-aggregate(sales$AMOUNT,list(byFac),sum)
The computed result of the above code:

It can be seen that the computed result contains only three groups of data instead of the four required. Apparently, the above code fails to fulfill the task and needs improvement.

The improved code:
         byFac<-factor(sales$CLIENT,levels=c("ARO","BON","CHO","ZTOZ"))
         tapply(sales$AMOUNT, list(byFac),function(x) sum(x))
Computed result is:

Code interpretation:
1.  The improved code matches the business logic: all four required groups are displayed in the result.
2.  tapply is used to group and summarize. It is more universal than aggregate, but it confuses beginners more often, as it is less intuitive than aggregate.
3.  The sum for ZTOZ being NA means that ZTOZ does not appear in column CLIENT at all. If the sum for ZTOZ were zero, ZTOZ would appear in column CLIENT with a total order amount of zero.
4.  In this case, exactly four groups are required in the result, so extra clients shouldn't appear in it. These extra clients can be collected into an "extra group", but computing its sum cannot be done by a simple modification of the current code. Instead, a different approach is needed:
    filtered<-sales[!is.element(sales$CLIENT,levels(byFac)),]
    redundant<-sum(filtered$AMOUNT)
It is not a complicated piece of code, but the coding style is apparently different from the previous one.
Summary:
tapply can fulfill this task easily.
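A pandas sketch of the same case, again on made-up rows: sum(min_count=1) makes the empty ZTOZ group come out as NaN (the counterpart of tapply's NA), and the extra group is computed separately, just as in the R code:

```python
import pandas as pd

sales = pd.DataFrame({
    "CLIENT": ["ARO", "BON", "CHO", "ARO", "WELLI"],
    "AMOUNT": [100.0, 200.0, 300.0, 50.0, 999.0],
})
wanted = ["ARO", "BON", "CHO", "ZTOZ"]
by = pd.Categorical(sales["CLIENT"], categories=wanted)
# min_count=1 turns the sum of an empty group into NaN instead of 0
result = sales["AMOUNT"].groupby(by, observed=False).sum(min_count=1)
# the "extra group": clients outside the wanted list
redundant = sales.loc[~sales["CLIENT"].isin(wanted), "AMOUNT"].sum()
print(result, redundant)
```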

Case 3: No overlap among the grouping conditions
Setting conditions as the grouping criteria is another kind of fixed grouping. For example, divide the order amounts into four intervals at 1000, 2000 and 4000; each interval is a group of orders. Then sum the order amounts of each group.
Code:
        byFac<-cut(sales$AMOUNT,breaks=c(0,1000,2000,4000,Inf))
        result<-tapply(sales$AMOUNT, list(byFac),function(x) sum(x))
Computed result:

Code interpretation:
The cut function divides the amounts into four intervals, then the tapply function groups the data frame by these intervals and sums each group.
Summary:
The cooperation of cut and tapply can easily perform the simplest conditional grouping.
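The same interval grouping can be sketched in pandas, where pd.cut is the direct counterpart of R's cut (sample amounts invented for illustration):

```python
import pandas as pd

amounts = pd.Series([500.0, 1500.0, 2500.0, 5000.0, 800.0])
# four right-closed intervals: (0,1000], (1000,2000], (2000,4000], (4000,inf]
bins = pd.cut(amounts, bins=[0, 1000, 2000, 4000, float("inf")])
result = amounts.groupby(bins, observed=False).sum()
print(result)
```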

Case 4: Re-computing overlapping data in the grouping criteria
The simplest conditional grouping does not involve overlapping conditions. But in reality, overlapping conditions are common. For example, group the order amounts according to the following conditions:
         1000 ~ 4000: standard orders (r14)
         <2000: non-important orders (r2)
         >3000: important orders (r3)

The group of standard orders overlaps with the other two groups. In this circumstance, the question arises of whether the overlapping data should be counted repeatedly. Let's first deal with the situation where repeated counting is required.
Code:
        r14<-subset(sales,AMOUNT>=1000 &  AMOUNT<=4000 )
        r2<-subset(sales,AMOUNT<2000)
        r3<-subset(sales,AMOUNT>3000 )
        grouped<-list(r14=r14,r2=r2,r3=r3)
        result<-lapply(grouped,FUN=function(x) sum(x$AMOUNT))
Computed result:

Note: r2 and r3 contain some of the data of r14.
Code interpretation:
1. The above code completes the task but is too complicated. If there were more conditions, the code would become lengthy.
2. A new function, lapply, is used in the code. So far we have used many functions, including factor, aggregate, list, tapply, cut, subset and lapply, just to perform fixed grouping. We have to switch between different functions and coding styles depending merely on whether the conditions overlap. To be honest, it is difficult to master all these methods.
3. The computed result of the above code is a list, while some results in the previous cases are data frames and others are arrays. This inconsistency causes trouble in practice.
Summary:
The method can complete the task, but the code is complicated and it takes effort to learn so many functions.
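In pandas the overlapping conditions can be expressed as boolean masks; each mask simply selects its rows independently, so overlapping rows are counted in every group they satisfy. A sketch on invented amounts:

```python
import pandas as pd

sales = pd.DataFrame({"AMOUNT": [500.0, 1500.0, 2500.0, 3500.0, 5000.0]})
conds = {
    "r14": (sales["AMOUNT"] >= 1000) & (sales["AMOUNT"] <= 4000),
    "r2":  sales["AMOUNT"] < 2000,
    "r3":  sales["AMOUNT"] > 3000,
}
# overlapping rows are counted in every group whose condition they satisfy
result = {name: sales.loc[mask, "AMOUNT"].sum() for name, mask in conds.items()}
print(result)
```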

Case 5: Overlapping conditions and non-repeating results
In the previous case, we handled the situation where the overlapping data should be counted repeatedly. But sometimes we need a non-repeating result, that is, data already placed in one group shouldn't appear again in another. In this case, r2 shouldn't contain data of r14, and r3 shouldn't contain data of r2 or r14.
Code:
        r14<-subset(sales,AMOUNT>=1000 & AMOUNT<=4000)
        r2<-subset(sales,AMOUNT<2000 & !(AMOUNT>=1000 & AMOUNT<=4000))
        r3<-subset(sales,AMOUNT>3000 & !(AMOUNT>=1000 & AMOUNT<=4000) & !(AMOUNT<2000))
        grouped<-list(r14=r14,r2=r2,r3=r3)
        result<-lapply(grouped,FUN=function(x) sum(x$AMOUNT))
Computed result:

Note that when the overlapping data are not counted repeatedly, the values of r2 and r3 are smaller than the results computed previously.
Code interpretation:
More logical judgments are used in the above code, which complicates it further. It is easy to imagine how huge the amount of code would become with many groups and complicated conditions.
Summary:
This method can complete the task, but the code is complicated.
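The non-repeating variant reads more easily when the exclusion is accumulated in a loop instead of being spelled out condition by condition. A pandas sketch, on invented amounts:

```python
import pandas as pd

sales = pd.DataFrame({"AMOUNT": [500.0, 1500.0, 2500.0, 3500.0, 5000.0]})
conds = [
    ("r14", (sales["AMOUNT"] >= 1000) & (sales["AMOUNT"] <= 4000)),
    ("r2",  sales["AMOUNT"] < 2000),
    ("r3",  sales["AMOUNT"] > 3000),
]
taken = pd.Series(False, index=sales.index)
result = {}
for name, mask in conds:
    mask = mask & ~taken          # drop rows already placed in an earlier group
    result[name] = sales.loc[mask, "AMOUNT"].sum()
    taken |= mask
print(result)
```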

The third-party solution
We can also fulfil these tasks using Python, esProc or Perl. All of them can perform fixed grouping and summarizing as well as other structured-data computations. The following briefly introduces the esProc solutions.
Case 1:
        byFac=["ARO","BON","CHO"]
        grouped=sales.align@a(byFac, CLIENT)
        grouped.new(byFac(#), ~.sum(AMOUNT))
Computed result:

Case 2:
The code is omitted because it is completely the same as that in case 1.
Computed result:

If we want to summate the extra group, a little modification will be enough:
        byFac=["ARO","BON","CHO","ZTOZ"]
        grouped=sales.align@a@n(byFac,CLIENT)
        grouped.new((byFac|"redundant")(#),  ~.sum(AMOUNT))

The modifications are the added client ZTOZ, the @n option and the extra group name "redundant". @n indicates that an extra group is appended to the result set. This way of coding is easier to master than that of R language.
Computed result:
Case 3:
For simple conditional grouping, esProc only needs to change the align function to the enum function; the rest remains unchanged.
        byFac=["?<=1000" ,"?>1000 && ?<=2000","?>2000 && ?<=4000","?>4000"]
        grouped=sales.enum(byFac,AMOUNT)
        grouped.new(byFac(#),~.sum(AMOUNT))
Computed result:

Case 4:
The @r option is added to the previous code when the overlapping data should be counted repeatedly.
        byFac=["?>=1000 && ?<=4000","?<2000" ,"?>3000"]
        grouped=sales.enum@r(byFac,AMOUNT)
        grouped.new(byFac(#),~.sum(AMOUNT))
Computed result:

Case 5:
Remove the @r option when the overlapping data should not be counted repeatedly. This is the same as simple conditional grouping.
        byFac=["?>=1000 && ?<=4000","?<2000" ,"?>3000"]
        grouped=sales.enum(byFac,AMOUNT)
        grouped.new(byFac(#),~.sum(AMOUNT))
Computed result:

It can be seen that esProc needs only two functions, align and enum, to realize all types of fixed grouping and summarizing. Its code is consistent and simple.

August 31, 2014

A Handy Method of Accessing Data in Remote http Server in Java

In Java projects, it is sometimes necessary to access data in a remote http server. The data can be in xml or json format. The following compares two access methods through an example.

Here is a servlet which provides employee information queries in json format. The servlet accesses the employee table in the database, which stores the following data:
EID   NAME       SURNAME        GENDER  STATE        BIRTHDAY        HIREDATE         DEPT         SALARY
1       Rebecca   Moore      F       California 1974-11-20       2005-03-11       R&D          7000
2       Ashley      Wilson      F       New York 1980-07-19       2008-03-16       Finance    11000
3       Rachel      Johnson   F       New Mexico     1970-12-17       2010-12-01       Sales         9000
4       Emily         Smith        F       Texas        1985-03-07       2006-08-15       HR    7000
5       Ashley      Smith        F       Texas        1975-05-13       2004-07-30       R&D          16000
6       Matthew Johnson   M     California 1984-07-07       2005-07-07       Sales         11000
7       Alexis        Smith        F       Illinois       1972-08-16       2002-08-16       Sales         9000
8       Megan     Wilson      F       California 1979-04-19       1984-04-19       Marketing        11000
9       Victoria    Davis        F       Texas        1983-12-07       2009-12-07       HR    3000
The servlet's doGet function receives an employee id string in json format, queries the corresponding employee information in the database, generates an employee information list in json format and returns it. The following code omits the process of accessing the database and generating the list:

protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws ServletException, IOException {
         String inputString=(String) req.getParameter("input");
         //the input value of inputString is "[{EID:8},{EID:32},{EID:44}]"
         if (inputString==null) inputString="";
         String outputString ="";

         {...}//the code that generates outputString by querying the database with inputString is omitted
         //the generated outputString looks like
         //"[{EID:8,NAME:"Megan",SURNAME:"Wilson",GENDER:"F",STATE:\...";
         resp.setContentType("text/json"); // set the content type before writing the body
         resp.getOutputStream().println(outputString);
}

Java will access this http servlet, get the employee information in which EID is 1 and 2, and then sort the records of information by EID in descending order. Detailed steps are as follows:
1. Import an open source project httpclient to access servlet and get the result.
2. Import an open source project json-lib to parse the returned strings.
3. Use comparison method to sort by EID in descending order.
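Steps 2 and 3 (parsing the returned string and sorting by EID) are the interesting part; they can be sketched in Python with an inline sample payload standing in for the servlet response (hypothetical records, no live server needed):

```python
import json

# hypothetical payload in place of the servlet's real response
my_json = '[{"EID": 1, "NAME": "Rebecca"}, {"EID": 2, "NAME": "Ashley"}]'
records = json.loads(my_json)
# sort by EID in descending order, the job done by the nested loops in the Java code
records.sort(key=lambda r: r["EID"], reverse=True)
print([r["EID"] for r in records])
```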

The sample code is as follows:
public static void myHTTP() throws Exception {
    // define the http url
    URL url = new URL("http://localhost:6080/myweb/servlet/testServlet?input=[{EID:1},{EID:2}]");
    URI uri = new URI(url.getProtocol(), url.getUserInfo(), url.getHost(), url.getPort(), url.getPath(), url.getQuery(), null);
    // send the http request and receive the returned result
    CloseableHttpClient client = HttpClients.createDefault();
    HttpGet get = new HttpGet(uri);
    CloseableHttpResponse response = client.execute(get);
    String myJson = EntityUtils.toString(response.getEntity());
    // parse the returned data into a json array
    JSONArray jsonArr = JSONArray.fromObject(myJson);
    // sort the json data by EID in descending order
    JSONObject jObject = null;
    for (int i = 0; i < jsonArr.size(); i++) {
        long l = Long.parseLong(jsonArr.getJSONObject(i).get("EID").toString());
        for (int j = i + 1; j < jsonArr.size(); j++) {
            long nl = Long.parseLong(jsonArr.getJSONObject(j).get("EID").toString());
            if (l < nl) {
                jObject = jsonArr.getJSONObject(j);
                jsonArr.set(j, jsonArr.getJSONObject(i));
                jsonArr.set(i, jObject);
                l = nl; // keep l in step with the element now at position i
            }
        }
    }
    System.out.println(jsonArr.toString());
}

The open source project json-lib needs to be imported. The jars necessary for its function are:
         json-lib-2.4-jdk15.jar
         ezmorph-1.0.6.jar
         commons-lang.jar
         commons-beanutils.jar
         commons-logging.jar
         commons-collections.jar


Import the open source project httpclient. The jars necessary for its function are:
            commons-codec-1.6.jar
            commons-logging-1.1.3.jar
            fluent-hc-4.3.5.jar
            httpclient-4.3.5.jar
            httpclient-cache-4.3.5.jar
            httpcore-4.3.2.jar
            httpmime-4.3.5.jar

As can be seen from this example, Java needs to import two open source projects to complete the job, and their jar dependencies may even overlap. In addition, the myHTTP function's logic for accessing and sorting http data is not general enough: whenever sorting in ascending order or by more than one field is required, the program has to be modified. To make myHTTP as general and flexible as SQL in accessing and processing data, dynamic parsing of expressions would have to be implemented, which would produce rather complicated code.

This method of accessing and processing http data in Java can be replaced by the cooperation of Java and esProc. The advantage of this cooperation is that only one project needs to be imported to achieve dynamic accessing and sorting with simple code. esProc can get and compute data from a remote http server conveniently. To achieve dynamic processing, the expression for sorting can be passed to esProc as a parameter. Please see the figure below:

The value of the parameter sortBy is EID:-1. The program for esProc to access http data contains only six lines of code as follows:


A1: Define the input parameter to be sent to the servlet, that is, the employee id list in json format.

A2: Define an httpfile object, whose URL is http://localhost:6080/myweb/servlet/testServlet?input=[{EID:1},{EID:2}].

A3: Import the result returned by the httpfile object in A2.

A4: Parse each employee's json-format information one by one, and generate a sequence.

A5: Sort the data. esProc first computes the macro ${sortBy} from the parameter sortBy, and then executes the resulting statement A4.sort(EID:-1), which sorts by EID in descending order.

A6: Return the result of A5 to the Java code that called this piece of esProc program.

If the sort fields or directions change, the program needn't be modified; we just change the parameter sortBy. For example, to sort by EID in ascending order and by NAME in descending order, change the value of sortBy to EID:1,NAME:-1. The sorting statement finally executed is A4.sort(EID:1,NAME:-1).
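The idea behind the sortBy parameter, a small expression like EID:1,NAME:-1 interpreted at run time, can be sketched in Python; the spec format is borrowed from the text, while the helper names are invented for illustration:

```python
def parse_sort_spec(spec):
    # "EID:1,NAME:-1" -> [("EID", True), ("NAME", False)] as (field, ascending)
    keys = []
    for part in spec.split(","):
        field, direction = part.split(":")
        keys.append((field.strip(), int(direction) > 0))
    return keys

def multi_sort(records, spec):
    # apply the keys right to left; Python's stable sort keeps earlier keys dominant
    for field, ascending in reversed(parse_sort_spec(spec)):
        records.sort(key=lambda r: r[field], reverse=not ascending)
    return records

employees = [{"EID": 1, "NAME": "B"}, {"EID": 1, "NAME": "A"}, {"EID": 2, "NAME": "C"}]
print(multi_sort(employees, "EID:1,NAME:-1"))
```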

This piece of esProc program can be called conveniently from Java using the jdbc driver provided by esProc. Save the above esProc program as the file test.dfx; then Java calls it with the following code:

// create an esProc jdbc connection
Class.forName("com.esproc.jdbc.InternalDriver");
con = DriverManager.getConnection("jdbc:esproc:local://");
// call the esProc program (the stored procedure); test is the name of the dfx file
com.esproc.jdbc.InternalCStatement st;
st = (com.esproc.jdbc.InternalCStatement)con.prepareCall("call test(?)");
// set the parameter: esProc's input parameter, i.e. the dynamic expression for sorting
st.setObject(1,"EID:1,NAME:-1");
// execute the esProc stored procedure
ResultSet set = st.executeQuery();
while(set.next()) System.out.println("EID="+set.getInt("EID"));

As the esProc code in this example is relatively simple, it can also be embedded directly in Java, making a separate esProc script file (like the above-mentioned test.dfx) unnecessary. In that case the code is written this way:
st=(com.esproc.jdbc.InternalCStatement)con.createStatement();
ResultSet set=st.executeQuery("=httpfile(\"http://localhost:6080/myweb/servlet/testServlet?input=[{EID:1},{EID:2}]\").read().import@j().sort(EID:1,NAME:-1)");

The above Java code directly calls a single esProc statement, which reads data from the http server, sorts them by the specified fields and returns the result to Java as a ResultSet.

August 28, 2014

Methods of Grouping and Summarizing in R Language

The operation of grouping and summarizing consists of grouping two-dimensional structured data by one or more fields and then summarizing fields of each group. The following introduces methods of grouping and summarizing in R language through an example. To make the example more typical, we'll group by two fields and perform two summarizing operations.

Case description:
Please group data frame orders by CLIENT and SELLERID, and then, in each group, sum field AMOUNT and find its maximum value.


Note: orders contains records of sales orders. Its source can be a database or a file, such as orders<-read.table("sales.txt",sep="\t", header=TRUE). The first rows of data are as follows:

Method 1: aggregate function
Code:
    result1<-aggregate(orders$AMOUNT, orders[,c("SELLERID","CLIENT")],sum)
    result2<-aggregate(orders$AMOUNT, orders[,c("SELLERID","CLIENT")],max)
    result<-cbind(result1,result2$x)

Part of the computed result:

Code interpretation:
1. The name aggregate implies that it is a function specializing in grouping and summarizing. Both its input parameters and computed result are data frames, and its usage is relatively simple.
2. The aggregate function cannot perform multiple summarizing operations in one pass, so two lines of code are required, one for sum and one for max, after which the results are combined with cbind. Obviously, this is unsatisfactory in both performance and usability.
3. The aggregate function has a strange requirement on the order of the grouping fields: they must be listed in reverse. So, to group by CLIENT first and then SELLERID, the code must be written as orders[,c("SELLERID","CLIENT")]. Code written in the natural order would be incorrect.
4. Not only is the code written in an unnatural way, but the computed result is also odd, putting field SELLERID before CLIENT. In practice, the code would need further work to make the result conform to the business logic.

Summary:
aggregate function manages to perform the task after a fashion. But it is not good in performance and usability because the way of coding, computed result and business logic are inconsistent with each other.

Method 2: split and lapply functions
Code:
    sp<-split(orders,orders[,c("SELLERID","CLIENT")],drop=TRUE)
    result1<-lapply(sp,FUN=function(x) sum(x$AMOUNT))
    result2<-lapply(sp,FUN=function(x) max(x$AMOUNT))
    result<-cbind(result1,result2)

Part of the computed result:

Code interpretation:
1.  The role of the split function is to group the data frame by the specified fields; no further computation is involved. The lapply function then performs the same operation on each group. Working together, split and lapply fulfill the task.
2.  Because the grouped data can be reused, this method performs better than the one using aggregate.
3.  As lapply doesn't support multiple statistical operations either, two lines of code are again required for sum and max, with cbind combining the results. Moreover, this method requires an extra split call, so instead of improving usability, it reduces it.
4.  The grouping order is still unnatural and the code still has to be written in reverse as orders[,c("SELLERID","CLIENT")].
5.  The computed result needs considerable post-processing, which is quite inconvenient: the first column of the result is actually "SELLERID.CLIENT", which has to be split into two columns and their order exchanged.

Summary:
This method improves performance somewhat, but its usability is obviously poor, with inconsistencies between the way of coding, the business logic and the computed result.

lapply belongs to the apply family of functions. Similar functions include sapply and tapply, whose usages differ in their parameters. For example:
    sp<-split(orders,orders[,c("SELLERID","CLIENT")],drop=TRUE)
    result1<-sapply(sp,simplify=FALSE,FUN=function(x) sum(x$AMOUNT))
    result2<-sapply(sp,simplify=FALSE,FUN=function(x) max(x$AMOUNT))
    result<-cbind(result1,result2)

tapply is designed for grouped summarizing and, by rights, should be the most suitable function for this task. But it isn't, in fact: it only works smoothly when a single field is grouped. When it is used to group by two fields together, the result is a two-dimensional matrix, which requires further complicated processing from the user. For example, the computed result of tapply(orders$AMOUNT, orders[,c("SELLERID","CLIENT")],function(x) sum(x)) is as follows:

Third-party library functions
There are various disadvantages to using R's built-in functions for grouping and summarizing. We might instead consider third-party libraries, such as reshape, stack, etc. However, the stability and computational efficiency of these libraries are generally not as good as those of the built-in functions, and documentation for them is scarce, so it is difficult for them to fulfil the task. We won't go into examples of their use here.

Third-party languages
Python, esProc and Perl could also be employed for this task. All of them can perform grouping and summarizing as well as structured data computing, as R language can. We'll briefly introduce the solutions in esProc and Python.

esProc
esProc can fulfil this task simply by using the groups function. Its syntax is concise, easy to understand and in line with the natural way of thinking. The code is as follows:
    result=orders.groups(CLIENT,SELLERID;sum(Amount),max(Amount))

The computed result, syntax and business logic are highly consistent with each other in esProc. Some of the computed results are as follows:

Python (pandas)
If Python's built-in functions were used for this task, the code would be rather complicated. Here pandas, a third-party library, comes to help: it first groups with the groupby function, then summarizes with the agg function. Its code, simpler than R's but not as concise as esProc's, is written like this:
    result=orders.groupby(['CLIENT','SELLERID']).agg({'AMOUNT':[sum,max]})

Pandas' computed result and syntax are highly consistent with the business logic. Part of the computed result is as follows:
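The pandas one-liner can be made concrete with a few invented rows (not the real orders data):

```python
import pandas as pd

orders = pd.DataFrame({
    "CLIENT":   ["ARO", "ARO", "BON", "BON"],
    "SELLERID": [1, 1, 2, 3],
    "AMOUNT":   [100.0, 200.0, 300.0, 400.0],
})
# group by two fields and apply two aggregations in one pass
result = orders.groupby(["CLIENT", "SELLERID"]).agg({"AMOUNT": ["sum", "max"]})
print(result)
```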

August 27, 2014

Code Examples of Common In-Memory Grouping with esProc

It is convenient to realize common kinds of in-memory grouping with esProc, such as equal grouping, alignment grouping and enumeration grouping. They are illustrated with the following examples.

Equal grouping

The grouping basis of equal grouping is certain fields (or computed columns derived from fields) within the data set. Each group is a subset of the original data set.
Case description: Group sales orders by the year.

Data description: Data of sales orders are shown below: 
The above data set (table sequence) can be accessed from a database or a file. For example:
A1=file("E:/sales.txt").import@t()

esProc code:
         A2=A1.group(year(OrderDate))

Computed results:
Code interpretation:

1. In this example, grouping basis comes from OrderDate. The date of sales order will be converted into the year through year(OrderDate), and data of the same year will be grouped together.

2. There may be multiple fields for grouping. For example, regroup data of different years and sellers according to year and sellers. The code is:

A1.group(year(OrderDate),SellerId)

3. Often, the grouped data are used to perform aggregation operations, such as, compute each year's sales amount according to A2. The code is:
A2.new(year(OrderDate):y,~.sum(Amount):a)

Computed results are:

Or, combine grouping and summarizing into one step with groups function:
A1.groups(year(OrderDate):y; sum(Amount):a)

Of course, sometimes we have to execute grouping and summarizing separately, in order to reuse the intermediate result and improve efficiency. For example, we may filter one group of A2 and perform an association computation on another. Or, if one group's summary looks unusual and worth further study, that group can be used directly for further computations, with no need to filter the data again.

4. By default, esProc's group function groups data using a hash algorithm. But for ordered data, comparing adjacent rows gives higher performance; this is enabled with the @o option of the group function. For example:
A1.group@o(year(OrderDate),SellerId)
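The adjacent-row idea behind group@o has a well-known counterpart in Python: itertools.groupby, which likewise assumes the data are already ordered by the key. A sketch on invented (year, amount) rows:

```python
from itertools import groupby

# hypothetical rows, already sorted by year as @o requires
rows = [(2013, 100.0), (2013, 200.0), (2014, 50.0), (2014, 70.0)]
totals = {year: sum(amount for _, amount in grp)
          for year, grp in groupby(rows, key=lambda r: r[0])}
print(totals)
```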

Alignment grouping


The criteria used for equal grouping come from within the data set. But sometimes they originate from outside it, such as fields of other data sets, arrays created by users, or parameter lists. Hence alignment grouping.

Unlike equal grouping, alignment grouping may produce empty subsets, groups to which no data correspond. It may also be incomplete, meaning that some data won't be included in any group. Neither happens with equal grouping.

Case description: A table of the top 10 sellers has been worked out according to performance; please group the sales orders according to the order of that table.
The data set before grouping:

The sales orders are the same as those in above example. Data are stored in A1.

Table of the top 10 sellers is stored in B1 as follows: 

Sellers table may come from a temporary table, or be generated by a piece of code. The generating process is not the focus of this example.
esProc code:
    A1.align@a(B1:empID,SellerId)

Computed results:

Code interpretation:
1. In this example, the grouping basis (the sellers table) comes from outside the data set to be grouped. After grouping, each group contains the data of only one seller, and groups are ordered according to the sellers table.

2. Because the sellers in the sales orders outnumber those in the sellers table, some orders won't appear in any group. If we want to create one more group to store these orders, we can use the function option @n as follows:
A1.align@a@n(B1:empID,SellerId)

The extra group is put last, as follows:

3. Sometimes much of the grouping basis is absent from the data set to be grouped, as with a "newly-employed sellers table"; then it is normal to produce empty groups. For example, if the first record of the sellers table is modified to empID=100, the computed results will be:
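The behavior of align@a@n can be sketched in plain Python: groups are created from the external seller list (so empty groups are possible), and rows whose seller is absent from the list fall into a surplus group. The data below are invented:

```python
# external grouping basis: seller ids, in the order the groups should appear
sellers = [4, 7, 2]
# hypothetical orders as (seller_id, amount) pairs
orders = [(4, 100.0), (9, 50.0), (7, 80.0), (4, 20.0)]

groups = {s: [] for s in sellers}   # seller 2 will stay empty
extra = []                          # the @n surplus group
for sid, amount in orders:
    (groups[sid] if sid in groups else extra).append((sid, amount))
print(groups, extra)
```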

Enumeration grouping 


The grouping basis for enumeration grouping can be even more flexible: it can be any Boolean expressions, and the records satisfying an expression form a group.

Similar to alignment grouping, enumeration grouping is also a form of incomplete grouping, possibly producing empty subsets or leaving some records outside every group. In addition, with this method some records may appear in more than one group.

Case description: Divide the sales orders into four groups: A, order amount not above 1000; B, not above 2000; C, not above 3000; D, not above 10,000. Note that the data cannot be grouped repeatedly; that is, if an order is already in group A, it must not be put into group B, C or D.

The data set before grouping:
The sales orders are the same as those in above example. Data are stored in A1.

esProc code:
         A2=["?<=1000","?<=2000","?<=3000","?<=10000"]
         A3=A1.enum(A2,Amount)

Computed results:
Case interpretation:

1. In this example, the grouping basis is a set of flexible expressions. Each record is compared with the expressions, and records satisfying the same expression are put into the same group. Groups are ordered according to the grouping basis as well.

2. By default, enumeration grouping will not put the same record into more than one group: as the above example shows, after group A's data are selected, only the remaining data are compared with expression B, and so on. With the function option @r, all data are compared with every expression, so a record may appear in several groups. For example, with A3=A1.enum@r(A2,Amount) the computed results are:

3. As with alignment grouping, if an expression matches none of the data to be grouped, an empty group appears. Besides, if some data satisfy no expression, the function option @n can be used to put them into a surplus group.
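The difference between the default behavior and @r comes down to whether a record stops at its first matching condition. A Python sketch with invented amounts:

```python
# the four enumeration conditions from the example
preds = [lambda a: a <= 1000, lambda a: a <= 2000,
         lambda a: a <= 3000, lambda a: a <= 10000]
amounts = [500.0, 1500.0, 2500.0, 9000.0, 800.0]

groups = [[] for _ in preds]
for a in amounts:
    for group, pred in zip(groups, preds):
        if pred(a):
            group.append(a)
            break    # stop at the first match; removing this break mimics @r
print(groups)
```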

August 26, 2014

Method of Filtering Data of Big Text Files with R Language

It's common to use R language to filter data in files. Sometimes we need to filter big files that cannot be imported into memory all at once. The solution is to import and filter in batches and then merge the results. Here is an example illustrating the method with R language.

There is a 1G file sales.txt containing a great number of sales order records. Please filter out the records in which the value of field AMOUNT is between 2,000 and 3,000. The column separator used in the file is "\t". The first rows of data are as follows:

R's solution:
con <- file("E:\\sales.txt", "r")
readLines(con,n=1)
result=read.table(con,nrows=100000,sep="\t")
result<-result[result$V4>=2000 & result$V4<=3000,]
while(length(databatch<-read.table(con,header=FALSE,nrows=100000,sep="\t"))!=0) {
  databatch<-databatch[databatch$V4>=2000 & databatch$V4<=3000,]
  result<-rbind(result,databatch)
}
close(con)

These are some of the computed results:

Code interpretation:
The 1st row: Open the file handle.

The 2nd row: Skip the first row, i.e. the column names.

The 3rd~4th rows: Import the first batch of 100,000 rows of data, then perform filtering and save the result in result.

The 5th~8th rows: Import by loop, 100,000 rows per batch; filter each batch, append the filtered rows to the result variable, then process the next batch.

The 9th row: Close the file handle.

Points for attention:

For small files, a single line of code is enough to import the data, with the first row set as the column names of the data frame. But this won't work with big files, whose data must be imported in batches; from the second batch on, the first row cannot be treated as column names. The column names therefore default to V1, V2, V3 and so forth.

To import data from big files in batches, R language has to control the operation with a while statement, and using the column names is inconvenient. This makes the code fairly complicated.

Alternative methods:
Python, esProc and Perl can also do this task. Like R language, all of them can filter file data and compute structured data. The following briefly introduces the methods used in esProc and Python.

esProc can process data in batches automatically, without the programmer controlling the loop manually, so its code is quite simple:


Cursor is an esProc data type used for structured data computing, which has almost the same usage as data frame but is better at processing big files and performing complex computation. What's more, cursor can set the first row of the file as the column name using @t option.

Python's code structure is similar to R's: both require manual loop control. But Python has no structured data type like the data frame or the cursor, so its code works at a lower level:
result = []
myfile = open("E:\\sales.txt",'r')
BUFSIZE = 10240000
myfile.readline()
lines = myfile.readlines(BUFSIZE)
while lines:
    for line in lines:
        record=line.split('\t')
        AMOUNT=float(record[3])
        if(AMOUNT>=2000 and AMOUNT<=3000):
            result.append(record)
    lines = myfile.readlines(BUFSIZE)
myfile.close()


Though Python could perform the above operation with third-party packages like pandas, which has a structured data object similar to the data frame, pandas lacks sufficient ability in processing big files, so it is hard to simplify the code much further.