December 15, 2014

esProc Helps with Computation in MongoDB – Sorting in Local Language

MongoDB uses unicode, instead of the coding for a certain local language, to sort data in this language (i.e. Chinese). Together with esProc, MongoDB can realize sorting in local language conveniently (i.e. sort Chinese according to Chinese phonetic alphabet). The following will teach you the method in detail by taking Chinese as an example.

person - a collection in MongoDB - stores names and genders as follows:
> db.person.find()
{ "_id" : ObjectId("544e4e070f03ad39eb2bf498"), "name" : "宋江", "gender" : ""}
{ "_id" : ObjectId("544e4e070f03ad39eb2bf499"), "name" : "李逵", "gender" : ""}
{ "_id" : ObjectId("544e4e070f03ad39eb2bf49a"), "name" : "吴用", "gender" : ""}
{ "_id" : ObjectId("544e4e070f03ad39eb2bf49b"), "name" : "晁盖", "gender" : ""}
{ "_id" : ObjectId("544e4e070f03ad39eb2bf49c"), "name" : "公孙胜", "gender" : "" }
{ "_id" : ObjectId("544e4e070f03ad39eb2bf49d"), "name" : "鲁智深", "gender" : "" }
{ "_id" : ObjectId("544e4e070f03ad39eb2bf49e"), "name" : "武松", "gender" : ""}
{ "_id" : ObjectId("544e4e070f03ad39eb2bf49f"), "name" : "阮小二", "gender" : "" }
{ "_id" : ObjectId("544e4e070f03ad39eb2bf4a0"), "name" : "杨志", "gender" : ""}
{ "_id" : ObjectId("544e4e070f03ad39eb2bf4a1"), "name" : "孙二娘", "gender" : "" }
{ "_id" : ObjectId("544e4e070f03ad39eb2bf4a2"), "name" : "扈三娘", "gender" : "" }
{ "_id" : ObjectId("544e4e080f03ad39eb2bf4a3"), "name" : "燕青", "gender" : ""}
Sort the data using MongoDB’s sort function rather than the Chinese phonetic alphabet:
> db.person.find({},{"name":1,"gender":1,"_id":0}).sort({"name":1})
{ "name" : "公孙胜", "gender" : "" }
{ "name" : "吴用", "gender" : "" }
{ "name" : "孙二娘", "gender" : "" }
{ "name" : "宋江", "gender" : "" }
{ "name" : "扈三娘", "gender" : "" }
{ "name" : "晁盖", "gender" : "" }
{ "name" : "李逵", "gender" : "" }
{ "name" : "杨志", "gender" : "" }
{ "name" : "武松", "gender" : "" }
{ "name" : "燕青", "gender" : "" }
{ "name" : "阮小二", "gender" : "" }
{ "name" : "鲁智深", "gender" : "" }

The esProc script helping with MongoDB computation is as follows:

A1Connect to the MongoDB database. The IP and port number is localhost:27017, the database name is test and both the user name and the password are test. If any other parameters are needed, write them in line with the format mongo://ip:port/db?arg=value&…

A2Fetch data from the MongoDB database using find function to create a cursor. The collection is person. The filtering criterion is null and the specified keys are name and gender. It can be seen that this find function is similar to the find function of MongoDB. By fetching and processing data in batches, the esProc cursor can avoid the memory overflow caused by big data importing.

A3Since the data here are small, fetch function will fetch them all at once.

A4Sort the data by name in ascending order, using sort function. Chinese is used in the data sorting. For the other localized languages esProc supports, please see below.

The result of operation is:

One thing to note is that esProc doesn't provide the java driver of MongoDB. To access MongoDB with esProc, the latter (a driver of 2.12.2 version or above is required, i.e. mongo-java-driver-2.12.2.jar) should be put into the [esProc installation directory]\common\jdbc beforehand.

The script for computation in MongoDB with the assistance of esProc is easy to integrate with Java program. By adding another line of code – A5, which is result A4, the result in the form of resultset can be output to Java program. For detailed code, please refer to esProc Tutorial. In the same way, to access MongoDB by calling esProc code with Java program also requires putting the java driver of MongoDB into the classpath of Java program.

The java driver of MongoDB can be downloaded from the following URL:

October 21, 2014

esProc Helps Process Structured Text in Java - Sorting

It is simple for Java to sort data in text files of small size. But when big files are involved, you need to import data segmentally, write out the sorting result of each segment in temporary files and at last, merge these temporary files. The programming will be rather complicated. Even if the file is small enough to be loaded to the memory, you will have to parse the data type of the text file. The job is not difficult but the code is long.

However, these problems can be avoided by using esProc to help with programming in Java. Let’s look at in detail how this will happen. Now you are required to sort the employee data in the text file employee.txt by STATE in ascending order and by BIRTHDAY in descending order. It is assumed that the data size of the file is huge and exceeds the memory capacity.

employee.txt is of the following format:
EID   NAME       SURNAME        GENDER  STATE        BIRTHDAY        HIREDATE         DEPT         SALARY
1       Rebecca   Moore      F       California 1974-11-20       2005-03-11       R&D          7000
2       Ashley      Wilson      F       New York 1980-07-19       2008-03-16       Finance    11000
3       Rachel      Johnson   F       New Mexico     1970-12-17       2010-12-01       Sales         9000
4       Emily         Smith        F       Texas        1985-03-07       2006-08-15       HR    7000
5       Ashley      Smith        F       Texas        1975-05-13       2004-07-30       R&D          16000
6       Matthew Johnson   M     California 1984-07-07       2005-07-07       Sales         11000
7       Alexis        Smith        F       Illinois       1972-08-16       2002-08-16       Sales         9000
8       Megan     Wilson      F       California 1979-04-19       1984-04-19       Marketing        11000
9       Victoria    Davis        F       Texas        1983-12-07       2009-12-07       HR    3000
10     Ryan         Johnson   M     Pennsylvania    1976-03-12       2006-03-12       R&D          13000
11     Jacob        Moore      M     Texas        1974-12-16       2004-12-16       Sales         12000
12     Jessica     Davis        F       New York 1980-09-11       2008-09-11       Sales         7000
13     Daniel       Davis        M     Florida      1982-05-14       2010-05-14       Finance    10000

Implementation approach: call the esProc script with Java, import and compute the data, and then return the result in the form of ResultSet to Java. To perform sorting by STATE in ascending order and by BIRTHDAY in descending order, esProc will use an input parameter "sortBy" as the sorting expression, as shown in the following figure:

The value of "sortBy" is STATE,-BIRTHDAY. The field with a minus before it represents the opposite number, which is valid for string data, numerical data and date data.

The code in esProc is as follows:

A1Define a file cursor object, with the first row being the title and tab being the field separator by default.

A2Perform sorting according to the expression, using macro to realize parsing the expression dynamically. The "sortBy" in this process is an input parameter. In executing, esProc will first compute the expression enclosed by ${…}, then replace ${…} with the computed result acting as the macro string value and interpret and execute the code. The final code to be executed in this example is =A1.sortx(STATE,-BIRTHDAY;1000000).

A3Return the result cursor to external program. While Java receives the returned result and traverses the data with ResultSet, esProc will automatically fetch the data corresponding to the cursor. If the sorted data are to be written into other files, the code in A3 should be modified into =file("D:/employee_result.txt").export@t(A2).

If the sorting fields and order are changed, you just modify sortBy– the parameter. For example, if the data are required to be sorted by NAME in ascending order and by STATE and BIRTHDAY in descending order, the value of sortBy will be written as NAME,-STATE,-BIRTHDAY.

sortx function performs sorting by importing data segmentally according to buffer rows, write the results of sorting each segment into temporary files, redistribute the memory usage and then merge these temporary files. Here the parameter 1000000 refers to buffer rows. The principle of assigning value to it is to make the best of the memory to reduce the number of temporary files as far as possible. The number of temporary files is related to the size of both the physical memory and the records, and should be evaluated during programming. Generally, the recommended number is between magnitudes of several hundred thousand to a magnitude of one million.

The code of calling this piece of code (which is saved as test.dfx) in Java with esProc JDBC is as follows:
          //create a connection usingesProc JDBC
con= DriverManager.getConnection("jdbc:esproc:local://");
//call the program in esProc (the stored procedure); test is the name of file dfx
st =(com.esproc.jdbc.InternalCStatement)con.prepareCall("call test(?)");
//set the parameters
//execute the esProc stored procedure
//get the result set, which is the eligible set of employees
ResultSet set = st.getResultSet();

If the script is simple, the code can be written directly into the program in Java that calls the esProc JDBC. It won’t be necessary to write a special esProc script file (test.dfx):
st=(com. esproc.jdbc.InternalCStatement)con.createStatement();
ResultSet set=st.executeQuery("=file(\"D:/employee.txt\").cursor@t().sortx(NAME,-STATE,BIRTHDAY;1000000)");

This piece of code in Java calls a line of code in esProc script directly, that is, get the data from the text file, compute themand return the result set to set– the object of ResultSet.

If the data in employee.txt can be loaded to the memory altogether, sort function can thus be used to perform all-in-memory sorting, in which esProc won't generate the temporary files. The computational speed will be much faster. The code is as follows:

Parameter sortBy can be written as STATE,-BIRTHDAY, or STATE:1,BIRTHDAY:-1. And it is no need to modify the calling program of Java. 

October 20, 2014

esProc Simplifies SQL-style Computation – Relational Computation on Groups

During developing database applications, we are often faced with complicated SQL-style computations like relational computation on multilayered groups. As SQL requires summarizing during data grouping and it doesn't support object-style relational access, it is difficult to deal with these computations. We have to resort to an advanced technique of window functions nested with multilayered subqueries. esProc, however, can handle them more easily by supporting real grouping and visual object-style relational access.

In practice, there are a lot of cases in which relational computation on groups is needed, such as the one presented on the website Based on this practical example, we have designed a more common one for illustrating in detail how esProc works to realize the relational computation on groups.

Computing target: Query branches of a DVD store where there are less than four categories of DVD copies.

Data StructureBranch Table stores information of the branch stores. DVD Table stores titles and categories of DVDs, in which DVDs, like Transformers IV, are virtual data items instead of physical disks. DVDCopy Table stores physical DVD copies branch stores have. Note: DVDCopy Table is related to Branch Table through BranchID field, and to DVD Table through DVDID field. The following is part of the data:

Branch Table
DVD Table 

DVDCopy Table 
1.Computed results should be certain records in Branch Table.

2.That the Status field of a record in DVDCopy shows Miss means the DVD is missing, and that a record’s LastDateReturned field is empty means the DVD has been rented out but not returned. Obviously the DVDs that are missing and haven't been returned are outside of our computing objects and should be filtered away.

3.We should consider the situation that there may be certain branch stores whose information isn't included in DVDCopy Table, though it is rarely seen.

1.Select valid, existing DVD copies the branch stores have from DVDCopy Table.

2.Group DVDCopy Table by BID. Each group will contain all DVD copies a branch store has.

3.Select the DVDs corresponding to the DVD copies each branch store has, and compute the number of categories to which these DVDs belong.

4.Select branch stores where the number of categories of existing DVDs is less than four. These branch stores are eligible.

5.Select branch stores which DVDCopy Table hasn't. They are also eligible.

6. Combine the two kinds of eligible branch stores.

Code written in esProc:

A1-A3Query data from three tables in the database. The three tables are made variables which are named respectively as Branch, DVD and DVDCopy. Computed results are as follows:

A4 Switch the DVDID field and BID field in DVDCopy Table to corresponding records in DVD Table and Branch Table respectively. Note: This step is the basis of object-style relational access, which requires the use of switch function. After computing, DVDCopy becomes as follows:

Fields in blue have corresponding records. Click one and you can see the details, as shown in the following figure:

Now we can perform object-style relational access only with the operator “.”. For instance, DVDCopy.(DVDID). (Category) represents the category of each DVD copy, and DVDCopy.(BID) gets the detailed information (complete record) about the branch store corresponding to each DVD copy.!="Miss" && LastDateReturend!=null). This line of code aims to filter away the DVD copies that are missing and that haven’t been returned. After the code is executed, values of A5 are shown as follows: is to group data in A5 by BID, with each row representing all DVD copies a branch store has. The result is as follows:

Click the data in blue and you’ll see members of each group:

It can be seen that group function only groups data, but doesn't summarize the data at the same time. In this point, it is different from the function for grouping in SQL. Sometimes, we need to further process the grouped data, rather than simply summarizing them. To do this, esProc's group function is more convenient to use, as shown in the code in A7: ~.BID:BonList, ~.(DVDID).id(Category).count():CatCount )
The above line of code computes the number of categories of DVD copies to which each branch store corresponds. new function can generate a new object A7 based on the data in A6. A7 has two columns: BonList and CatCount. BonList originates directly from column BID of the grouped data in A6, and CatCount originates from column DVDID of the grouped data. There are three steps to compute CatCount: ~.(DVDID) finds the DVD records corresponding to all DVD copies each branch store has; id(Category) removes repeated records of Category from these DVD records; and count() computes the number of categories. The result is as follows:
That is, branch store B002 has three categories of DVD copies, B003 also has three categories and B001 has four categories.<4).This line of code makes query to select branch stores whose CatCount is less than 4. The result is as follows:
The above branch stores that are in short supply are computed according to DVDCopy Table. But maybe some branch stores with serious supply shortage are not in the DVDCopy, such as the cases that all the DVD copies in the branch store has been rented out, or that the branch store hasn't any DVD copies. So these branch stores should also be counted. The code for this step is as follows:
A9=A8.(BonList) | (Branch \ A7.(BonList))

In the above code, operator "|" represents the union operation of two data sets (which can be replaced by union function); operator "\" represents the complement operation (which can be replaced by diff function). A8.(BonList), Branch and A7.(BonList) represent respectively branch stores with supply shortage in DVDCopy Table, all branch stores and branch stores which are included in DVDCopy Table. Their respective values are:

A9 computes the final result of this example. Its values are:

Or we can realize the computation indirectly. For instance, first compute "branch stores that are not in short supply", and then compute the complement of the result and Branch Table. The final result should be the same as that of A9.

Please note variables like A8 or Branch cannot be used to represent a data set in SQL because it doesn't support explicit set. Thus the simple code in the above have to be replaced by lines of complicated SQL statements.

Besides, esProc can be called by a Java program. The method is similar to that with which a Java program calls an ordinary database. The JDBC provided by esProc can be used to return a computed result of the form of ResultSet to Java main program. For more details, please refer to the related documents 

October 19, 2014

esProc Helps Process Structured Texts in Java – Alignment Join

The join statements of the database can be used conveniently to perform the operation of alignment join. But sometimes the data is stored in the text files, and to compute it in Java alone we need to write a large number of loop statements. This makes the code cumbersome. Using esProc to help with programming in Java can solve the problem easily and quickly. Let’s look at how this works through an example.

The text file emp.txt contains employee information, except that in which EId is 1. Another text file sOrder.txt contains information of sales orders in which field SellerId corresponds to field EId inemp and from which the information whose SellerId is 2 is excluded. Part of the original data is listed below:



It is required to join the three fields: Name, Dept and Gender, in emp to sOrder in alignment and output the computed result to a new file. The expected result is as follows:

Code written in esProc:

In cells A1 and A2 respectively, data is imported from the two text files and stored in two variables: emp and sOrder. Here import function uses tab as the column separator by default. Option @t represents the first row will be imported as the field names. Because only some of the fields in emp.txt are needed, the code in A1 uses the names of these desired fields as parameters. After execution, values of emp and sOrder are as follows:

In the code in A3: =join@1(sOrder:s,SellerId;emp:e,EId), join function performs the operation of alignment join and changes the names of the two tables to s and e respectively. Option @1 represents the left join which is in line with the requirement of the example: join emp to sOrder in alignment. The computed result is as follows:

Click the numbers in blue and we can see the detailed information, as shown below:

esProc can also be used to realize the right join which only requires exchanging positions of data in alignment. For example, to align sOrder according to emp, we just need to exchange their positions in the code, that is, =join@1(emp:e,EId;sOrder:s,SellerId). The computed result is as follows:

It is also easy to realize the full join using option @f. The code is join@f(sOrder:s,SellerId;emp:e,EId). The computed result is as follows:

There are altogether four operations of alignment join: left join, right join, full join and inner join. By default, join function is used to execute the inner join, the code is =join(sOrder:s,SellerId;emp:e,EId). The computed result is as follows:

Let‘s get back to the example. The code in A4:, s.Client, s.SellerId, s.Amount, s.OrderDate,e.Name, e.Dept, e.Gender), is for getting the desired fields from table eand creating a new structured two-dimensional table. The computed result is as follows:
Now the alignment is done and data needs to be exported to a new file. The code for this is =file("E: \\result.txt").export@t(A4). In export function, tab is by default the column separator and option @t represents the field names are exported to the first row. Open result.txt and we can see information as follows:

         //create a connection using esProcjdbc
         con= DriverManager.getConnection("jdbc:esproc:local://");
         //call esProc script; the name of the script file is test
         st =(com.esproc.jdbc.InternalCStatement)con.prepareCall("call test()");
         // execute esProc stored procedure

By executing the above Java code, emp will be joined to sOrder in alignment and the result will be output to file result.txt.

If the requirement is changed to this: query data in sOrder according to dynamic periods of time, execute the same operation of alignment join and return the result directly to Java. To complete the task esProc needs to define two parameters: begin and end, to represent starting time and ending time respectively. The esProc code is as follows:

A2Filter sOrder again using select function according to the starting and ending time passed from Java, that is, @begin and @end.
A5Output the computed result in A4 to JDBC interface.
And Java code should be modified too to pass parameters to esProc code and get thefinal result. The modified code is as follows:
         con= DriverManager.getConnection("jdbc:esproc:local://");
         st =(com.esproc.jdbc.InternalCStatement)con.prepareCall("call test(?,?)");
         ResultSet set = st.getResultSet();

October 15, 2014

esProc Simplifies SQL-style Computation –Records Corresponding to Max Value

In developing database applications, usually it is the records corresponding to the max/min value that we need to retrieve, instead of the value itself. For example, the occasion in which each employee gets his/her biggest pay raise; the three lowest scores ever got in golf; the five days in each month when each product gets its highest sales amount; and so on. As the max function of SQL can only retrieve the max value, instead of the records to which the max value corresponds, it is quite complicated to handle the computation in SQL with the aid of some advanced techniques, such as window functions or the nested sub-queries or keep/top/row number. If multi-layered grouping or relations are involved, the computation will become even more complicated.

With the top function in esProc, the records corresponding to the max value can be retrieved and the computation becomes much easier. The following is such an example.

The database table golf contains the scores of members in a golf club. Please select the best three scores each member has ever got. Part of the data is as follows:

The code written in esProc:

A1Retrieve data from the database. If the data come from a structured text file, the following equivalent code can be used: =file("\\golf").import@t(). Click the cell and we can check the retrieving result:, i.e., group the result of A1. The result is as follows:

As shown in the above figure, the data have been separated into multiple groups by User_ID and each row is a group. Click the blue hyperlink and members of the group will be shown as follows:

A3=A2.(;3)).The code is to compute the records of each group of data whose field Score is in the top three. Here "~" represents each group of data. represents that top function will work on every group of data in turn. The top function can retrieve the N biggest/smallest records from a data set. For example, top(Score;3) represents sorting by Score in ascending order and fetching the first three records (i.e. min values); top(-Score;3) represents sorting in descending order and fetching the first three records (i.e. max values). The result of this step is as follows:

A4=A3.union(), which means union data of every group. The result is as follows:
In the above, the computation is performed step by step. But the steps can be integrated into one for the convenience of maintenance and debugging:db.query("select * from golf").group(User_ID). (;3)).union().

The method with which a Java program calls esProc is similar to that with which it calls an ordinary database. The JDBC provided by esProc can be used to return a computed result of the form of ResultSet to Java main program. For more details, please refer to the related documents