Tag Archives: hadoop

Using R for Scalable Data Analytics

1 Apr

At the recent Strata conference in San Jose, several members of the Microsoft Data Science team presented the tutorial Using R for Scalable Data Analytics: Single Machines to Spark Clusters. The materials are all available online, including the presentation slides and hands-on R scripts. You can follow along with the materials at home, using the Data Science Virtual Machine for Linux, which provides all the necessary components like Spark and Microsoft R Server. (If you don’t already have an Azure account, you can get $200 credit with the Azure free trial.)

The tutorial covers many different techniques for training predictive models at scale, and deploying the trained models as predictive engines within production environments. Among the technologies you’ll use are Microsoft R Server running on Spark, the SparkR package, the sparklyr package and H20 (via the rsparkling package). It also touches on some non-Spark methods, like the bigmemory and ff packages for R (and various other packages that make use of them), and using the foreach package for coarse-grained parallel computations. You’ll also learn how to create prediction engines from these trained models using the mrsdeploy package.


The tutorial also includes scripts for comparing the performance of these various techniques, both for training the predictive model:


and for generating predictions from the trained model:


(The above tests used 4 worker nodes and 1 edge node, all with with 16 cores and 112Gb of RAM.)

You can find the tutorial details, including slides and scripts, at the link below.

Strata + Hadoop World 2017, San Jose: Using R for scalable data analytics: From single machines to Hadoop Spark clusters


Source: http://blog.revolutionanalytics.com/big-data/


Several Methods for Structured Big Data Computation

24 Apr

All data can only have the existence value by getting involved in the computation to create value. The big data makes no exception. The computational capability on structural big data determines the range of practical applications of big data. In this article, I’d like to introduce several commonest computation methods: API, Script, SQL, and SQL-like languages.

API: The “API” here refers to a self-contained API access method without using JDBC or ODBC. Let’s take MapReduce as an example. MapReduce is designed to handle the parallel computation cost-effectively from the very bottom layer. So, MapReduce offers superior scale-out, hot-swap, and cost efficiency. MapReduce is one of the Hadoop components with open-source code and abundant resources.

Sample code:

public void reduce(Text key, Iterator<Text> value,

OutputCollector<Text, Text> output, Reporter arg3)

throws IOException {

double avgX=0;

double avgY=0;

double sumX=0;

double sumY=0;

int count=0;

String [] strValue = null;



strValue = value.next().toString().split(“\t”);

sumX = sumX + Integer.parseInt(strValue[1]);

sumY = sumY + Integer.parseInt(strValue[1]);



avgX = sumX/count;

avgY = sumY/count;


tValue.set(avgX + “\t” + avgY);

output.collect(tKey, tValue);


Since the universal programming language adopted is unsuitable for the specialized data computing, MapReduce is less capable than SQL and other specialized computation languages in computing. Plus, it is inefficient in developing. No wonder that the programmers generally complain it is “painful”. In addition, the rigid framework of MapReduce results in the relatively poorer performance.

There are several products using API, and MapReduce is the most typical one among them.

Script: The “Script” here refers to the specialized script for computing. Take esProc as an example. esProc is designed to improve the computational capability of Hadoop. So, in addition to the inexpensive scale-out, it also offers the high performance, great computational capability, and convenient computation between heterogeneous data sources, especially ideal for achieving the complex computational goal. In addition, it is the grid-style script characterized with the high development efficiency and complete debug functions.

Sample code:

  A B
1 =file(“hdfs://”).size() //file size
2 =10 //number of tasks
3 =to(A2) //1 ~ 10, 10 tasks
4 =A3.(~*int(A1/A2)) //parameter list for start pos
5 =A3.((~-1)*int(A1/A2)+1) //parameter list for end pos
6 =callx(“groupSub.dfx”,A5,A4;[“”, “”]) //sub-program calling, 10 tasks to 2 parallel nodes
7 =A6.merge(empID) //mergingtaskresult
8 =A7.group@i(empID;~.sum(totalAmount):orderAmount, 



//summarizing is completed

Java users can invoke the result from esProc via JDBC, but they are only allowed to invoke the result in the form of stored procedure instead of any SQL statement. Plus, esProc is not open source. These are two disadvantages of esProc.

The Script is widespread used in MongoDB, Redis, and many other big data solutions, but they are not specialized enough in computing. For another example, the multi-table joining operation for MongoDB is not only inefficient, but also involves the coding of one order of magnitude more complex than that of SQL or esProc.

SQL: The “SQL” here refers to the complete and whole SQL/SP, i.e. ANSI 2000 and its superset. Take Greenplum as an example, the major advantages of Greenplum SQL are the powerful computing, highly efficient developing, and great performance. Other advantages include the widespread use of its language, low learning cost, simple maintenance, and migration possibility -not to mention its trump-card of offering support for stored procedure to handle the complex computation. By this way, business value can be exploited from the big data conveniently.

Sample code:

CREATE OR REPLACE function view.merge_emp()

returns voidas$$


truncate view.updated_record;

insert into view.updated_recordselect y.* from view.emp_edw x right outer join        emp_src y       on x.empid=y.empid where x.empid is not null;

update view.emp_edwset deptno=y.deptno,sal=y.salfrom view.updated_record y       where          view.emp_edw.empid=y.empid;

insert into emp_edwselect y.* from emp_edw x right outer join emp_src y on    x.empid=y.empid where  x.empid is null;


$$ language ‘plpgsql’;

The other databases with the similar structure to MPP include Teradata, Vertical, Oracle, and IBM. Their syntax characteristics are mostly alike. The disadvantages are similar. Theacquisition cost and the ongoing maintenance expenses are extremely high. Charging its users by data scale, the so-called inexpensive Greenplum is actually not a bargain at all – it is way more like making big money under cover of big data. Other disadvantages include awkward debugging, incompatible syntax, lengthy down-time if expansion, and awkward multi-data-source computation.

SQL-like language: It refers to the output interfaces like JDBC/ODBC and only limited to those scripting languages that are the subset of standard SQL. Take Hive QL as an example. The greatest advantage of Hive QL is its ability to scale out cost-effectively while still a convenient tool for users to develop. The SQL syntax feature is kept in Hive QL, so that the learning cost is low, development efficient, and maintenance simple. In addition, Hive is a component of Hadoop. The open-source is another advantage.

Sample code:


SELECT name, salary, deductions[“Federal Taxes”] as ded,

salary * (1 – deductions[“Federal Taxes”]) as salary_minus_fed_taxes

FROM employees

) e

WHERE round(e.salary_minus_fed_taxes) > 70000;

The weak point of Hive QL is its non-support for stored procedure. Due to this, it is difficult for HiveQLto undertake the complex computation, and thus difficult to provide the truly valuable result. The slightly more complex computation will rely on MapReduce. Needless to say, the development efficiency is low. The poor performance and the threshold time can be regarded as a bane, especially in task allocation, multi-table joining, inter-row computation, multi-level query, and ordered grouping, as well as implementing other algorithm alike. So, it is quite difficult for HiveQL to implement the real-time Hadoop application for big data.

There are also some other products with SQL-like languages – MongoDB as an example – they are still worse than Hive yet.

The big data computation methods currently available are no more than these 4 types of API, Script, SQL, and SQL-like languages. Wish they would develop steadfastly and there would be more and more cost-effective, powerful, practical and usable products for data computing.

About esProc: http://www.raqsoft.com/product-esproc

Source: http://datathinker.wordpress.com/2014/04/24/several-methods-for-structured-big-data-computation/

%d bloggers like this: