March 13, 2015

esProc Improves Text Processing – String Matching with Big Files

There are many occasions during text processing which require performing string matching with big files. Coding with command line grep\cat is simple yet inefficient. Though higher efficiency can be achieved with high-level languages, coding will be rather difficult.

Yet this operation, as well as multithreaded parallel computing, can be handled more easily in esProc, with more concise code and much better performance. The following examples will show esProc method in detail.


file1.txt has a great many strings. Find out the rows ending with “.txt” and export them to result.txt. Some of the original data are as follows:

esProc code for doing this task:

A1: Open the file in the form of cursors. Instead of importing all the data into the memory at a time, cursor function opens the file in the form cursors (stream) without memory footprint. The function uses default parameters to import all the fields with tab being the column separator and to automatically name them _1, _2, _3…_n respectively. There is only one field, _1, in this example.

A2=A1.select(like@c(_1,"*.txt"))

This line of code selects rows ending with “.txt” from cursor A1. select function executes the query and like function performs string matching. _1 represents the first field. The use of @c option in like function means the matching is case insensitive.

One point worth noting is that the result of A2 is still a cursor without memory footprint. Only with the use of functions like export/fetch/groups will esProc allocate suitable memory buffers and convert the cursor computing to memory computing.

A3=file("e:\\result.txt").export(A2). This line of code exports the final result to a file. Some of the data are as follows:

The matching rule in the example above is relatively simple. If the rule is complex, a regular expression will be needed. For example, find out rows starting with “c:\windows” and not ending with “.txt”.

regex function is used to perform string matching with the regular expression. Just modify A2’s code to A1.regex@c("^c:\\\\windows.*(?<!\\\\(.txt)$)") , in which @c option means case insensitive.

Though the regular expression can be used to realize the string matching with complex rule, its performance is not satisfactory. For example, to find out rows ending with “.txt” from a file of 2.13G size in the same test environment, it takes 206 seconds with a regular expression, while it takes only 119 seconds with an ordinary expression (the select statement).

In fact, many tasks of string matching with complex rule can also be realized with the ordinary expression. Moreover, the syntax is more visual and cost of learning is lower. For example, emp.txt holds a large number of user records, each of which has multiple fields, separated by tab and with the first row being the column names. Suppose you are to find out rows with the rule that “Eid field is lesser than 100, the first letter of Name filed is a and Birthday field is greater than 1984-01-01”. You can do it in esProc as follows:

The @t option used with cursor function means that the first row will be imported as column names for the use of accessing data at a later time.

The three query conditions can be represented by EId>100, like@c(Name,"a*") and Birthday>=date("1984-01-01") respectively. The logic relation between the conditions is “AND”, which can be represented by &&.

The above algorithm is sequential computation. The performance can be further improved if parallel computing is used. The method is this: Import the file using multithreads, each of which will access some of the data of the file with a cursor, and perform set operations at the same time; finally, merge the result of each cursor together.

Test the processing of a file of 2.13G size under the same hardware environment. It takes an average of 119 seconds with the sequential computation, whereas it takes only an average of 56 seconds with the parallel computing, which speeds the performance almost doubly. The algorithm used in the example is not so complex, so the bottleneck is the hard driver’s ability to import data. With the increase of the complexity of the computation, the performance will be improved more greatly.

esProc code for parallel computing:

A1=4. A1 is the number of segments, which means the file will be divided into 4 segments. The number is equal to the number of parallel tasks in operation, which generally should not exceed the number of CPU cores. Otherwise the tasks will be queued for processing and the performance won’t be really increased. The maximum number of the parallel tasks can be configured in the environment option.

A2=A1.(file("e:\\file1.txt").cursor@z(;, ~:A1))
This line of code will generate four cursors according to the specified number of segments. A1.(express) means computing the expression with members of A1 respectively. “~” can be used in the parentheses to represent the current member. Generally A1 is a set, like ["file1", " file2"] or [2,3]. If members of the set are consecutive numbers starting with 1, like [1,2,3,4], the code can be written in a simple form as 4.( express), as with the code in this example.

In the expression, file("e:\\file1.txt").cursor@z(;, ~:A1), surrounded in the parentheses, cursor function uses @z option to segment the file and fetch each part with a cursor. ~:A1 means that the file is roughly divided into four segments (A1=4) and the ~th segment is fetched. “~” represents the current member in A1 and each cursor corresponds to the first, the second, the third and the fourth segment respectively.

Besides, though exact division will result in incomplete lines, esProc can import complete lines automatically by skipping the beginning half line of a segment and completing the ending half line of the segment. This is why the file should be divided “roughly”.

A3=A2.(~.select(like@c(_1,"*.txt"))). This line of code queries data of each cursor (i.e. ~) in A2 and selects the eligible rows. The computed results are still four cursors.
A4=A3.conj@xm(). This line of code merges the four cursors in A3 in parallel.
A5=file("e:\\result.txt”).export(A4). This line of code exports the final result to a file.

An esProc script not only can work independently in an Integration Development Environment (IDE), it also can be called by a Java program through JDBC interface. The calling method is the same as the method of calling an ordinary database. A one-step esProc script can be embedded in the Java program directly without script file. Actually the above steps can be combined into one single step:
file("e:\\result.txt").export(4.(file("e:\\file1.txt").cursor@z(;, ~:4)).(~.select(like@c(_1, "*.txt"))).conj@xm())

It is also allowed to run this kind of one-step script in operating system’s command line. Please refer to related documents for further information. 

March 11, 2015

esProc Improves Text Processing – Set Operations on Big Files

It is common to perform set operations on big files in text processing. For example, find different rows between two files. The code for handling the operations with command line grep or cat command is simple but inefficient. While the operational efficiency is high when high-level languages are used to handle the operations, the code is difficult to write.

esProc supports performing set operations on big files and multithreaded parallel computing. Its code is concise and its performance is remarkable. The following examples will show the esProc method in detail.


file.1txt and file2.txt hold a large number of strings respectively. Find their common rows (that is, compute the intersection). Some of the data are shown as follows:

Both files are big
When both files are too big to be loaded into the memory, esProc’s way of cursor merge can be used to realize the intersection operation. The code is as follows:

A1, B1Open the files as cursors. cursor function doesn’t import all the data into the memory, it opens a file in the form of a cursor (stream) without the memory footprint. The function uses default parameters to import all the fields and to name the new columns automatically as _1, _2, _3…_n, with tab being the column separator. There is only one field - _1 - in this example.

A2=[A1.sortx(_1),B1.sortx(_1)].merge@xi(_1)

The above code uses the merge operation to find the common rows of the two files, that is, to compute the intersection. A merge operation requires ordered data, so sortx function is used to sort the cursors first. The corresponding code is A1.sortx(_1) and A2.sortx(_1), in which _1 is the default field name in the files. merge function is used to merge multiple groups of data (two groups as with this example). Without parameter options, it merges the sets in the memory; the use of parameter option @x means to merge the cursors and @i means that the merge result is the intersection.

Notice that the result of A2 is still a cursor without the memory footprint. Only when the computation involves functions like export, fetch, groups and etc. will esProc engine allocate suitable buffers and automatically convert the cursor computing into in-memory computing. 

A3=file("E:\\result.txt").export(A2)

This line of code writes cursor A2 to a file. export function can be used to write both the in-memory data and cursors to files. Here default parameters are used, that is, no column name, using tab as the column separator, overwriting instead of appending files and writing as text files instead of binary files. Open result.txt and you can see some of the data as shown below:

Actually the above three-step esProc code can be combined into a single line of code: 
A1=file("e:\\result.txt").export([file("E:\\file1.txt").cursor().sortx(_1),file("E:\\file2.txt").cursor().sortx(_1)].merge@xi(_1))   

In addition to the intersection operation performed in this example, there are also operations of union, concatenation and difference. Just to modify the option of merge function in A2 to realize them. For example, union file1.txt and file2.txt and remove the duplicate members to get their union. Function option @u can be used to compute the union with the following code: [A1.sortx(_1),B1.sortx(_1)].merge@xu(_1). Their union is as follows:

By not removing the duplicate members, concatenation will be computed. The code is [A1.sortx(_1),B1.sortx(_1)].merge@x(_1), which shows the real meaning of merge operation. The result is as follows:

Use function option @d to compute the difference, the data which are included in file1.txt but not included in file2.txt. The code is [A1.sortx(_1),B1.sortx(_1)].merge@xd(_1). Result is as follows:

Note: The commutative law doesn’t apply to difference operation. So the code for getting the data which are included in file2.txt but not included in file1.txt is [B1.sortx(_1),A1.sortx(_1)].merge@xd(_1). The result is as follows:

One file is big, the other is small
When there is only one big file that runs out of the memory, the smaller one can be loaded into the memory and then use a hash table to perform the set operations. Thus the efficiency can be increased significantly. esProc code is as follows:

A1Open the files as cursors.

A2=file("e:\\file2.txt").import(). This line of code imports the smaller file2.txt into the memory entirely. Similar to cursor function, import function’s default field names are _1 and _2 as well. Click B1 in esProc’s Integrated Development Environment (IDE) to see the computed result of the current cell:

B2>B1.primary(_1).index()

The above code defines a primary key for B1 and creates a hash index. The query speed can be increased greatly in this way. B1 and B2 can be combined into a single line of code : B1=file("E:\\ file2is.txt").import().primary(_1).index()

A3=A1.select(B1.find(~._1))

This line of code selects the common data of cursor A1 and cursor B1, which is computing the intersection. select function is used to execute the query statement and eligible data will be selected. In the function, ~ represents the current record.

The query criterion is B1.find(~._1), meaning to find in B1 the _1 field of the current record of A1.

Notice that the result is still a cursor when the code is executed.

A4=file("E:\\result.txt").export(A3). This line of code writes the final result into a file.

When computing difference, modify the code in A3 to A1.select(!B1.find(~._1)), which selects from A1 the rows which are not included in B1.

To compute the union of file1 and file2, first compute the difference of file1 and file2 and then union it with file2.

conj function in the expression [A3,B1.cursor()].conj@x() in A4 can concatenate multiple sets (which is concatenation operation) – [[1,2,3],[2,3],[]]=[1,2,3,2,3], for example. Function option @x is used to concatenate the data in multiple cursors. Since B1 is not a cursor, cursor function should be used to convert it to a cursor, i.e. B1.cursor(). This is faster than importing data from the file.

Compute a big file and a small one in parallel
The sequential computation is used in the above case, but parallel computation can further enhance the performance. The method is to import files using multithreads. Each thread accesses a part of the file with a cursor, and meanwhile performs a set operation and finally combines the result of each cursor together.

Test a big file of 2.77G size and a small one of 39.93M size under the same hardware environment. It takes an average of 85 seconds to accomplish the sequential computation while it takes an average of 47 seconds to accomplish it with parallel computing. The speed has nearly been doubled. As the set operations are not that complicated in themselves, the computation hits a bottleneck in the hard drive’s ability to import data. The performance will be improved more greatly while the operation is getting more complex.
esProc code for parallel computing is as follows:

B1Import the small file into the memory, define a primary and create an index.
A2=4. A2 represents the number of segments into which a file is going to be divided. Here the number is 4. Generally the number of segments should not exceed the number of CPU cores; otherwise the tasks will be queued for processing and the speed won’t be really increased. The maximum parallel number can be configured in the environment option.

A3=A2.(file("e:\\file1.txt").cursor@z(;, ~:A2))

The above code generates four cursors according to the number of segments. A2.(express) means computing the expression with members of A2 in order. In the parentheses, “~” can be used to represent the current member. Generally A2 is a set, like ["file1", " file2"] or [2,3]. If members of the set are a series of consecutive numbers beginning from 1, like [1,2,3,4], A2 can be abbreviated to 4.( express), as with the code in this example.

In the expression - file("e:\\file1.txt").cursor@z(;, ~:A2) - surrounded by the parentheses, cursor function uses @z option to divide the file into multiple segments and to import one of them using a cursor. ~:A2 means that the file will be roughly divided into 4 segments (A2=4), with “~” representing the current member in A2. Thus the cursors correspond to the first, the second, the third and the fourth segment of file respectively.

The reason for dividing a file “roughly” is that half rows will appear with the exact division. esProc can make sure of importing whole rows automatically by skipping the head incomplete row and making up the tail incomplete row, this is tedious to realize in Java.

A4=A3.(~.select(B1.find(~._1))). This line of code computes intersection on each cursor in A3 and the result is a set of cursors.

A5=A4.conj@xm(). This line of code combines the multiple cursors in A4 in parallel.

A6=file("e:\\result.txt”).export(A5). The code writes the final result into a file. The intersection operation has been accomplished at this point. You can refer to the previous examples to compute difference and union.

esProc can be used alone or be integrated into a Java program. The following is to integrate the esProc script into the Java program through JDBC. The Java code is as follows:
  // create a connection using esProc jdbc
  Class.forName("com.esproc.jdbc.InternalDriver");
  con= DriverManager.getConnection("jdbc:esproc:local://");
  // call esProc script, whose name is test.dfx
  st =(com.esproc.jdbc.InternalCStatement)con.prepareCall("call test()");
  st.execute();//execute esProc stored procedure, or input a parameter and output the computed result

For the simple script as with the first example, it can be embedded in the Java code and executed directly, without writing the script file (text.dfx). The corresponding Java code is as follows:
ResultSet set=st.executeQuery("file(\"e:\\result.txt\").export([file(\"E:\\file1.txt\").cursor().sortx(_1),file(\"E:\\file2.txt\").cursor().sortx(_1)].merge@xi(_1))");

March 9, 2015

esProc Improves Text Processing – Conditional Query on Big Files

During text processing, you often have the tasks of querying data from a big file on one or more conditions. Command line grep\cat command can be used to handle some simple situations with simple command yet low efficiency. Or high-level languages can be used to get a much higher efficiency with complicated code. If the query conditions are complex or dynamic, you need to create an additional SQL-like low-level class library, which increases the complexity of the computation.

esProc supports performing conditional query on big files and multithreaded parallel computing, and its code for handling this kind of problem is both concise and efficient. The following example will teach you the esProc method of doing the job.

A text file - employee.txt – holds the employee data. Import the data, select the female employees born after January 1, 1981 inclusive and export the query result to result.txt.

The format of employee.txt is as follows:
EID   NAME       SURNAME        GENDER  STATE        BIRTHDAY        HIREDATE         DEPT         SALARY
1       Rebecca   Moore      F       California 1974-11-20       2005-03-11       R&D          7000
2       Ashley      Wilson      F       New York 1980-07-19       2008-03-16       Finance    11000
3       Rachel      Johnson   F       New Mexico     1970-12-17       2010-12-01       Sales         9000
4       Emily         Smith        F       Texas        1985-03-07       2006-08-15       HR    7000
5       Ashley      Smith        F       Texas        1975-05-13       2004-07-30       R&D          16000

esProc code for accomplishing the task: 

A1Open the file as a cursor. cursor function won’t import all the data into the memory, it will open the file in the form of a cursor (stream) without the memory footprint. The function uses a default parameter which makes tab as the column separator to import all the fields. @t option means that the file’s first line will be the column names and thus specific column names can be used in the expression later. Without the option, columns will be named _1, _2, _3…_n automatically.

A2=A1.select(${where})

Filter the data according to the condition. Here a macro is used to dynamically parse the expression. "where" is the dynamic input parameter, it needs to be pre-defined. The following is the interface on which a parameter is defined:

The esProc program will first compute the expression surrounded by ${…}, then assign the computed result as the value to the macro string and replace ${…} with it; after that, the program will interpret and execute the code. For example, if where gets assigned as BIRTHDAY>=date(1981,1,1) && GENDER=="F" according to the given condition in the example, the expression in A2 will be =A1.select(BIRTHDAY>=date(1981,1,1) && GENDER=="F"). The parameter can be entered into esProc’s Integration Development Environment (IDE), or can be passed from the Java code or the command line.

A3=file("D:/result.txt").export@t(A2). This line of code exports the computed result to a file. If the size of computing result is always small, use the code =A2.fetch() in A3 to fetch the results into the memory for direct observation, or use result A2.fetch() to return the results to the Java application. 
The final result of this example is as follows:

This example shows the method of realizing a dynamic query, that is, there is no need to change the code when the query condition changes, just modify the value of the parameter “where”. For example, if the condition becomes “query female employees born after January 1, 1981 inclusive, or employees whose FULLNAME is RebeccaMoore”, the value of “where” can be written as BIRTHDAY>=date(1981,1,1) && GENDER=="F" || NAME+SURNAME=="RebeccaMoore". After the code is executed, the result set of A2 will be as follows:

The above algorithm is a sequential computation. But the use of parallel computation can further improve the performance. The method is: Import the file using multithreads, each of which accesses a part of the file with a cursor; meanwhile query the data according to the condition and finally merge the result of each cursor together.

The esProc code for parallel computing is as follows:

A1=4. A1 is the number of segments, which means the file will be divided into 4 segments. The number is equal to the number of parallel tasks in operation, which generally should not exceed the number of CPU cores. Otherwise the tasks will be queued for processing and the efficiency won’t be really increased. The maximum number of parallel tasks can be configured in the environment option.
A2=A1.(file("d:/employee.txt").cursor@z(;, ~:A1))
This line of code will generate four cursors according to the specified number of segments. A1.(express) means computing the expression with each member of A1 in order. “~” can be used in the parentheses to represent the current member. Generally A1 is a set, like ["file1", " file2"] or [2,3]. If members of the set are consecutive numbers starting with 1, like [1,2,3,4], the code can be written in a simple form as 4.( express), as with the code in this example.

file("d:/employee.txt ").cursor@z(;, ~:A1) surrounded in the parentheses is an expression, in which cursor function uses @z option to segment the file and fetch each part with a cursor. ~:A1 means that the file will be roughly divided into four segments (A1=4) and the ~th segment will be fetched. “~” represents the current member in A1 and each cursor corresponds to the first, the second, the third and the fourth segment respectively.

Besides, though exact division will result in incomplete lines, esProc can import complete lines automatically by skipping the beginning half line of a segment and completing the ending half line of the segment. This is why the file should be divided “roughly”.

A3=A2.(~.select(${where})). This line of code will query data of each cursor (i.e. ~) in A2 and select the eligible rows. The computed results are still four cursors.

A4=A3.conj@xm(). This line of code will merge the four cursors in A3 in parallel.

A5=file("d:/result.txt”).export(A4). This line of code will export the final result to a file. 

March 3, 2015

esProc Helps Process Heterogeneous Data Sources in Java – Excel

Generally, Java programmers use poi or other open source packages to read and compute Excel data. These open source packages support low-level programming, which increases the overall learning cost and complicates the operation. But with the help of esProc, Java can avoid these problems.

An example will make the point clearly understood. Read the information of sales orders from the Excel file orders.xls and select sales orders that placed on and after January 1st, 2010 and in which SELLERID is equal to 18. The content of orders.xls is shown as follows:
ORDERID
CLIENT
SELLERID
AMOUNT
ORDERDATE
1
UJRNP
17
392
2008/11/2 15:28
2
SJCH
6
4802
2008/11/9 15:28
3
UJRNP
16
13500
2008/11/5 15:28
4
PWQ
9
26100
2008/11/8 15:28
5
PWQ
11
4410
2008/11/12 15:28
6
HANAR
18
6174
2008/11/7 15:28
7
EGU
2
17800
2008/11/6 15:28
8
VILJX
7
2156
2008/11/9 15:28
9
JAYB
14
17400
2008/11/12 15:28
10
JAXE
19
19200
2008/11/12 15:28
11
SJCH
7
13700
2008/11/10 15:28

Implementation approach: Call esProc script using Java program, read and compute the data in the Excel file, and then return the result in the form of ResultSet to Java program. Since esProc supports analyzing and evaluating expressions dynamically, it will enable Java to process data as flexibly as SQL does.


First, programmers can take the criteria “sales orders that placed on and after January 1st, 2010 and in which SELLERID is equal to 18” as the parameter where and pass it to esProc program. This is shown as follows:

where is a string, its value is ORDERDATE>=date(2010,1,1) && SELLERID==18.       

The code for esProc program is:

A1: Define a file object and import the Excel data into it. esProc’s IDE can display the imported data visually, as shown in the right part of the above figure. Importxls function can access xlsx files too and can automatically identify the version of Excel according to the filename extension.

A2: Perform the filtering according to the criteria, using macro to realize parsing the expression dynamically. The “where” in this process is an input parameter. In executing, esProc will first compute the expression surrounded by ${…}, take the computed result as macro string value, replace ${…} with it and then interpret and execute the code. The final code to be executed in this example is=A1.select(ORDERDATE>=date(2010,1,1) && SELLERID==18).

A3Return the eligible result set to Java. If the result needs to be written to another Excel file, just modify the code in cell A3 to =file("D:/file/orders_result.xls").exportxls@t(A2). If the criteria are changed, you just need to modify “where”– the parameter. For example, it is required to select sales orders that placed on and after January 1st, 2010 and in which SELLERID is equal to 18, or Client is equal to PWQ. The value of “where” can be written as CLIENT=="PWQ"||ORDERDATE>=date(2010,1,1) && SELLERID==18. After the code is executed, the result set in A2 is as follows:

Call this piece of code in Java with esProc JDBC and get the result. Detailed code is as follows (save the above program in esProc as test.dfx):
          // create a connection using esProc jdbc
Class.forName("com.esproc.jdbc.InternalDriver");
con= DriverManager.getConnection("jdbc:esproc:local://");
// call the program in esProc (the stored procedure); test is the name of file dfx
com.esproc.jdbc.InternalCStatementst =(com.esproc.jdbc.InternalCStatement)con.prepareCall("call test(?)");
    // set parameters
st.setObject(1,"ORDERDATE>=date(2010,1,1) && SELLERID==18 || CLIENT==\"PWQ\"");    // execute the esProc stored procedure
ResultSet set =st.executeQuery();

If the script is simple, the code can be written directly into the program in Java that calls the esProc JDBC. It won’t be necessary to write a special script file (test.dfx):
esultSet set = st.executeQuery(
"=file(\"D:/file/orders.xls\").importxls@t().select(ORDERDATE>=date(2010,1,1) && SELLERID==18 || CLIENT==\"PWQ\")");

This piece of code in Java calls a line of code in esProc script directly, that is, get the data from the Excel file and filter it according to the criteria. 

March 1, 2015

esProc Helps Process Heterogeneous Data Sources in Java - HDFS

It is not difficult for Java to access HDFS through API provided by Hadoop. But to realize computations, like grouping, filtering and sorting, on files in HDFS in Java is troublesome. esProc is a good helper in Java’s dealing with these computations. It can execute the access to HDFS too. With the help of esProc, Java will increase its ability in performing structured and semi-structured data computing, like the above-mentioned computations. Let’s look at how it works through an example.

The text file employee.gz in HDFS contains the employee data. You are required to import the data and select the female employees who were born on and after January 1st, 1981. The text file has been zipped with gzip in HDFS and cannot be loaded to the memory entirely.

The data in employee.gz is as follows:
EID   NAME       SURNAME        GENDER  STATE        BIRTHDAY        HIREDATE         DEPT         SALARY
1       Rebecca   Moore      F       California 1974-11-20       2005-03-11       R&D          7000
2       Ashley      Wilson      F       New York 1980-07-19       2008-03-16       Finance    11000
3       Rachel      Johnson   F       New Mexico     1970-12-17       2010-12-01       Sales         9000
4       Emily         Smith        F       Texas        1985-03-07       2006-08-15       HR    7000
5       Ashley      Smith        F       Texas        1975-05-13       2004-07-30       R&D          16000
6       Matthew Johnson   M     California 1984-07-07       2005-07-07       Sales         11000
7       Alexis        Smith        F       Illinois       1972-08-16       2002-08-16       Sales         9000
8       Megan     Wilson      F       California 1979-04-19       1984-04-19       Marketing        11000
9       Victoria    Davis        F       Texas        1983-12-07       2009-12-07       HR    3000
10     Ryan         Johnson   M     Pennsylvania    1976-03-12       2006-03-12       R&D          13000
11     Jacob        Moore      M     Texas        1974-12-16       2004-12-16       Sales         12000
12     Jessica     Davis        F       New York 1980-09-11       2008-09-11       Sales         7000
13     Daniel       Davis        M     Florida      1982-05-14       2010-05-14       Finance    10000
Implementation approach: Call the esProc script with Java program, import and compute the data, then return the result to Java program in the form of ResultSet.

First, you should develop and debug program in esProc’s Integration Development Environment (IDE). The preparatory work is to copy the core packages and the configuration packages of Hadoop to “esProc’s installation directory\esProc\lib”, such as commons-configuration-1.6.jarcommons-lang-2.4.jarhadoop-core-1.0.4.jarHadoop1.0.4.


Because esProc supports analyzing and evaluating expressions dynamically, it will enable Java to filter the data in HDFS file as flexibly as SQL does. For example, to query the data of female employees who were born on and after January 1st, 1981, esProc will use an input parameter “where” as the condition, as shown in the figure below:

“where” is a string, its value is BIRTHDAY>=date(1981,1,1) && GENDER=="F".
The code in esProc is as follows:

A1: Define a HDFS file object cursor with the first row being the title and tab being the default field separator. The zipping mode is determined by the filename extension. Here gzip is used. esProc also supports other zipping modes. UTF-8 is a charset, which is a JVM charset by default.

A2: Filter the cursor according to the condition. Here macro is used to realize analyzing the expression dynamically, in which “where” is the input parameter. esProc will first compute the expression surrounded by ${…}, take the computed result as the macro string value and replace ${…} with it, then interpret and execute the code. The final code executed in this example is =A1.select(BIRTHDAY>=date(1981,1,1) && GENDER=="F").

A3: Return the cursor. If the filtering condition is changed, you only need to change the parameter “where” without modifying the code. For example, you are required to query the data of the female employees who were born on January 1st, 1981, or of the employees in which NAME+SURNAME is ”RebeccaMoore”. The code for the value of “where” can be written as BIRTHDAY>=date(1981,1,1) && GENDER=="F" || NAME+SURNAME=="RebeccaMoore".

The code for calling this block of code in Java with esProc JDBC is as follows (save the esProc program as test.dfx and put the Hadoop jars needed by HDFS in Java’s classpath):
          // create a connection using esProc jdbc
Class.forName("com.esproc.jdbc.InternalDriver");
con= DriverManager.getConnection("jdbc:esproc:local://");
// call the program in esProc (the stored procedure); test is the file name of dfx
st =(com.esproc.jdbc.InternalCStatement)con.prepareCall("call test(?)");
//set the parameters
st.setObject(1," BIRTHDAY>=date(1981,1,1) && GENDER==\"F\" ||NAME+SURNAME==\"RebeccaMoore\"");// the parameters are the dynamic filtering conditions
// execute esProc stored procedure
st.execute();
// get the result set, which is the eligible set of employees
ResultSet set = st.getResultSet();