
You Can’t Hack What You Can’t See

1 Apr
A different approach to networking leaves potential intruders in the dark.
Traditional networks consist of layers that increase cyber vulnerabilities. A new approach features a single non-Internet protocol layer that does not stand out to hackers.

A new way of configuring networks eliminates security vulnerabilities that date back to the Internet’s origins. Instead of building multilayered protocols that act like flashing lights to alert hackers to their presence, network managers apply a single layer that is virtually invisible to cybermarauders. The result is a nearly hack-proof network that could bolster security for users fed up with phishing scams and countless other problems.

The digital world of the future has arrived, and citizens expect anytime-anywhere, secure access to services and information. Today’s work force also expects modern, innovative digital tools to perform efficiently and effectively. But companies are neither ready for the coming tsunami of data, nor are they properly armored to defend against cyber attacks.

The amount of data created in the past two years alone has eclipsed the amount of data consumed since the beginning of recorded history. Incredibly, this amount is expected to double every few years. There are more than 7 billion people on the planet and nearly 7 billion devices connected to the Internet. In another few years, given the adoption of the Internet of Things (IoT), there could be 20 billion or more devices connected to the Internet.

And these are conservative estimates. Everyone, everywhere will be connected in some fashion, and many people will have their identities on several different devices. Recently, IoT devices have been hacked and used in distributed denial-of-service (DDoS) attacks against corporations. Coupled with the advent of bring your own device (BYOD) policies, this creates a recipe for widespread disaster.

Internet protocol (IP) networks are, by their nature, vulnerable to hacking. Most if not all of these networks were put together by stacking protocols to solve different problems in the network. This starts with 802.1X at the lowest layer, the IEEE standard for port-based access to local area networks (LANs) or wide area networks (WANs). Stacked on top of that is usually the Spanning Tree Protocol, designed to eliminate loops on redundant paths in a network. These loops are deadly to a network.

Other layers are added to generate functionality (see The Rise of the IP Network and Its Vulnerabilities). The result is a network constructed on stacks of protocols, and those stacks are replicated throughout every node in the network. Each node passes traffic to the next node before the traffic reaches its destination, which could be 50 nodes away.

This M.O. is the legacy of IP networks. They are complex, have a steep learning curve, take a long time to deploy, are difficult to troubleshoot, lack resilience and are expensive. But there is an alternative.

A better way to build a network is based on a single protocol—an IEEE standard labeled 802.1aq, more commonly known as Shortest Path Bridging (SPB), which was designed to replace the Spanning Tree Protocol. SPB’s real value is its hyperflexibility when building, deploying and managing Ethernet networks. Existing networks do not have to be ripped out to accommodate this new protocol. SPB can be added as an overlay, providing all its inherent benefits in a cost-effective manner.

Some very interesting and powerful effects are associated with SPB. Because it uses what is known as a media-access-control-in-media-access-control (MAC-in-MAC) scheme to communicate, it naturally shields any IP addresses in the network from being sniffed or seen by hackers outside of the network. If the IP address cannot be seen, a hacker has no idea that the network is actually there. Combined with hypersegmentation, which can carve the network into as many as 16 million distinct virtual network services, this makes it almost impossible to hack anything in a meaningful way. Each network segment knows only which devices belong to it, and there is no way to cross over from one segment to another. For example, a hacker who gained access to an HVAC segment could not also access a credit card segment.

As virtual LANs (VLANs) allow for the design of a single network, SPB enables a distributed, interconnected, high-performance enterprise networking infrastructure. Based on a proven routing protocol, SPB combines decades of experience with intermediate system to intermediate system (IS-IS) and Ethernet to deliver more power and scalability than any of its predecessors. Using the IEEE’s next-generation VLAN, called an individual service identifier (I-SID), SPB supports 16 million unique services, compared with the VLAN limit of 4,094. Once SPB is provisioned at the edge, the network core automatically interconnects like I-SID endpoints to create an attached service that leverages all links and equal-cost connections using an enhanced shortest path algorithm.
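The “enhanced shortest path algorithm” mentioned above spreads traffic across all equal-cost shortest paths between endpoints. As a rough illustration only (not the actual 802.1aq computation, which runs over IS-IS link-state data with deterministic tie-breaking), a breadth-first search can enumerate the equal-cost shortest paths between two nodes of an unweighted topology:

```python
from collections import deque

def equal_cost_shortest_paths(adj, src, dst):
    """Enumerate all shortest paths from src to dst in an unweighted graph.

    Toy sketch: records every equal-cost parent during BFS, then walks the
    parent links back to reconstruct each shortest path.
    """
    dist = {src: 0}
    parents = {src: []}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:                 # first time v is reached
                dist[v] = dist[u] + 1
                parents[v] = [u]
                q.append(v)
            elif dist[v] == dist[u] + 1:      # another equal-cost route to v
                parents[v].append(u)
    if dst not in dist:
        return []
    paths = []
    def walk(node, tail):
        if node == src:
            paths.append([src] + tail[::-1])
            return
        for p in parents[node]:
            walk(p, tail + [node])
    walk(dst, [])
    return paths
```

On a diamond topology (A connects to B and C, both of which connect to D), this yields the two equal-cost paths A-B-D and A-C-D, which a fabric could use simultaneously.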

Making Ethernet networks easier to use, SPB preserves the plug-and-play nature that established Ethernet as the de facto protocol at Layer 2, just as IP dominates at Layer 3. And, because improving Ethernet enhances IP management, SPB enables more dynamic deployments that are easier to maintain than attempts that tap other technologies.

Implementing SPB obviates the need for the hop-by-hop implementation of legacy systems. If a user needs to communicate with a device at the network edge—perhaps in another state or country—that other device now is only one hop away from any other device in the network. Also, because an SPB system uses IS-IS routing with a MAC-in-MAC scheme, everything can be added instantly at the edge of the network.

This accomplishes two major points. First, adding devices at the edge allows almost anyone to add to the network, rather than turning to highly trained technicians alone. In most cases, a device can be scanned to the network via a bar code before its installation, and a profile authorizing that device to the network also can be set up in advance. Then, once the device has been installed, the network instantly recognizes it and allows it to communicate with other network devices. This implementation is tailor-made for IoT and BYOD environments.

Second, if a device is disconnected or unplugged from the network, its profile evaporates, and it cannot reconnect to the network without an administrator reauthorizing it. This way, the network cannot be compromised by unplugging a device and plugging in another for evil purposes.
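The provisioning lifecycle described in the last two paragraphs can be sketched as a toy registry (the class and method names below are hypothetical, not any vendor’s API): profiles are authorized in advance, the device attaches on connection, and the profile evaporates on disconnection until an administrator reauthorizes it.

```python
class EdgeRegistry:
    """Toy model of edge provisioning in the style described above."""

    def __init__(self):
        self.authorized = {}    # device_id -> pre-provisioned profile
        self.connected = set()

    def preauthorize(self, device_id, segment):
        """Set up a profile in advance (e.g. from a bar-code scan)."""
        self.authorized[device_id] = {"segment": segment}

    def connect(self, device_id):
        """Admit a device only if a profile was authorized for it."""
        if device_id not in self.authorized:
            return False                     # unknown device: rejected
        self.connected.add(device_id)
        return True

    def disconnect(self, device_id):
        """On unplug, the profile evaporates until reauthorized."""
        self.connected.discard(device_id)
        self.authorized.pop(device_id, None)
```

Reconnecting a previously unplugged device then fails until `preauthorize` is called again, which is the property that blocks the swap-a-device attack.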

SPB has emerged as an unhackable network. Over the past three years, U.S. multinational technology company Avaya has used it for quarterly hackathons, and no one has been able to penetrate the network in those 12 attempts. In this regard, it truly is a stealth network implementation. But it also is a network designed to thrive at the edge, where today’s most relevant data is being created and consumed, capable of scaling as data grows while protecting itself from harm. As billions of devices are added to the Internet, experts may want to rethink the underlying protocol and take a long, hard look at switching to SPB.


Using R for Scalable Data Analytics

1 Apr

At the recent Strata conference in San Jose, several members of the Microsoft Data Science team presented the tutorial Using R for Scalable Data Analytics: Single Machines to Spark Clusters. The materials are all available online, including the presentation slides and hands-on R scripts. You can follow along with the materials at home, using the Data Science Virtual Machine for Linux, which provides all the necessary components like Spark and Microsoft R Server. (If you don’t already have an Azure account, you can get $200 credit with the Azure free trial.)

The tutorial covers many different techniques for training predictive models at scale, and deploying the trained models as predictive engines within production environments. Among the technologies you’ll use are Microsoft R Server running on Spark, the SparkR package, the sparklyr package and H2O (via the rsparkling package). It also touches on some non-Spark methods, like the bigmemory and ff packages for R (and various other packages that make use of them), and using the foreach package for coarse-grained parallel computations. You’ll also learn how to create prediction engines from these trained models using the mrsdeploy package.


The tutorial also includes scripts for comparing the performance of these various techniques, both for training the predictive model and for generating predictions from the trained model. (The tests used 4 worker nodes and 1 edge node, all with 16 cores and 112 GB of RAM.)

You can find the tutorial details, including slides and scripts, at the link below.

Strata + Hadoop World 2017, San Jose: Using R for scalable data analytics: From single machines to Hadoop Spark clusters



Streaming Big Data: Storm, Spark and Samza

1 Apr

There are a number of distributed computation systems that can process Big Data in real time or near-real time. This article will start with a short description of three Apache frameworks, and attempt to provide a quick, high-level overview of some of their similarities and differences.

Apache Storm

In Storm, you design a graph of real-time computation called a topology, and feed it to the cluster where the master node will distribute the code among worker nodes to execute it. In a topology, data is passed around between spouts that emit data streams as immutable sets of key-value pairs called tuples, and bolts that transform those streams (count, filter etc.). Bolts themselves can optionally emit data to other bolts down the processing pipeline.
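The spout-and-bolt pipeline can be mimicked with plain generators (a toy simulation only; real Storm topologies are wired up with `TopologyBuilder` on the JVM, and the names here are illustrative):

```python
def spout(lines):
    """Emit a stream of key-value tuples, like a Storm spout."""
    for line in lines:
        for word in line.split():
            yield ("word", word)

def filter_bolt(stream, min_len=4):
    """A bolt that filters the stream, passing only long-enough words."""
    for key, word in stream:
        if len(word) >= min_len:
            yield (key, word)

def count_bolt(stream):
    """A terminal bolt that counts occurrences of each word."""
    counts = {}
    for _, word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Wire the pipeline: spout -> filter bolt -> count bolt.
counts = count_bolt(filter_bolt(spout(
    ["storm is a stream processor", "spark is a batch processor"])))
```

The short words (“is”, “a”) are dropped by the filter bolt, and the count bolt tallies what remains, mirroring the count/filter transformations described above.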


Apache Spark

Spark Streaming (an extension of the core Spark API) doesn’t process streams one record at a time like Storm. Instead, it slices them into small batches over time intervals before processing them. The Spark abstraction for a continuous stream of data is called a DStream (for Discretized Stream). A DStream is represented as a sequence of RDDs (Resilient Distributed Datasets), one per micro-batch. RDDs are distributed collections that can be operated on in parallel by arbitrary functions and by transformations over a sliding window of data (windowed computations).
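The discretization idea is easy to simulate in a few lines (a toy sketch, not the PySpark API): slice a timestamped stream into fixed-interval micro-batches, then run a sliding-window computation over the batches.

```python
def discretize(events, batch_interval):
    """Slice (timestamp, value) events into micro-batches by interval.

    Toy simulation of how Spark Streaming turns a live stream into a
    DStream; intervals containing no events are simply omitted here.
    """
    batches = {}
    for ts, value in events:
        batches.setdefault(int(ts // batch_interval), []).append(value)
    return [batches[k] for k in sorted(batches)]

def windowed_counts(batches, window_len):
    """Sliding-window event count over the micro-batches."""
    out = []
    for i in range(len(batches)):
        window = batches[max(0, i - window_len + 1): i + 1]
        out.append(sum(len(b) for b in window))
    return out
```

With a 1-second interval, five events at t = 0.1, 0.5, 1.2, 2.7, 2.9 become three batches of sizes 2, 1, 2, and a two-batch window yields counts 2, 3, 3.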


Apache Samza

Samza’s approach to streaming is to process messages as they are received, one at a time. Samza’s stream primitive is not a tuple or a DStream, but a message. Streams are divided into partitions, and each partition is an ordered sequence of read-only messages, with each message having a unique ID (offset). The system also supports batching, i.e. consuming several messages from the same stream partition in sequence. Samza’s execution and streaming modules are both pluggable, although Samza typically relies on Hadoop’s YARN (Yet Another Resource Negotiator) and Apache Kafka.
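The partition-and-offset primitive can be modeled in a few lines (an illustrative sketch of the log abstraction Samza consumes, not Samza’s own API):

```python
class Partition:
    """Toy model of a Kafka/Samza-style partition: an append-only,
    ordered sequence of read-only messages addressed by offset."""

    def __init__(self):
        self.log = []

    def append(self, message):
        """Append a message and return its unique offset."""
        self.log.append(message)
        return len(self.log) - 1

    def read(self, offset, batch_size=1):
        """Consume one message, or a batch of consecutive messages,
        starting at the given offset."""
        return self.log[offset: offset + batch_size]
```

A consumer that remembers the last offset it processed can resume exactly where it left off, which is the basis of Samza’s delivery model.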


Common Ground

All three real-time computation systems are open-source, low-latency, distributed, scalable and fault-tolerant. They all allow you to run your stream processing code through parallel tasks distributed across a cluster of computing machines with fail-over capabilities. They also provide simple APIs to abstract the complexity of the underlying implementations.

The three frameworks use different vocabularies for similar concepts:


Comparison Matrix

A few of the differences are summarized in the table below:


There are three general categories of delivery patterns:

  1. At-most-once: messages may be lost. This is usually the least desirable outcome.
  2. At-least-once: messages may be redelivered (no loss, but duplicates). This is good enough for many use cases.
  3. Exactly-once: each message is delivered once and only once (no loss, no duplicates). This is a desirable feature although difficult to guarantee in all cases.
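A minimal sketch of why exactly-once is hard, and of the usual workaround: run at-least-once delivery (duplicates possible) and deduplicate on a message ID at the consumer, much as Trident does with transaction IDs. The function names and the (id, payload) message shape below are illustrative, not any framework’s API.

```python
def deliver_at_least_once(messages, flaky_acks):
    """Simulate at-least-once delivery: when an acknowledgment is lost,
    the broker redelivers, so the consumer may see duplicates."""
    delivered = []
    for i, msg in enumerate(messages):
        delivered.append(msg)
        if not flaky_acks[i]:        # ack lost: message is sent again
            delivered.append(msg)
    return delivered

def consume_exactly_once(delivered):
    """Recover exactly-once *processing* by deduplicating on message ID."""
    seen, out = set(), []
    for msg_id, payload in delivered:
        if msg_id not in seen:
            seen.add(msg_id)
            out.append(payload)
    return out
```

No message is lost and none is processed twice, even though the wire-level stream contained a duplicate.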

Another aspect is state management. There are different strategies to store state. Spark Streaming writes data into the distributed file system (e.g. HDFS). Samza uses an embedded key-value store. With Storm, you’ll have to either roll your own state management at your application layer, or use a higher-level abstraction called Trident.

Use Cases

All three frameworks are particularly well-suited to efficiently process continuous, massive amounts of real-time data. So which one to use? There are no hard rules, at most a few general guidelines.

If you want a high-speed event processing system that allows for incremental computations, Storm would be fine for that. If you further need to run distributed computations on demand, while the client is waiting synchronously for the results, you’ll have Distributed RPC (DRPC) out-of-the-box. Last but not least, because Storm uses Apache Thrift, you can write topologies in any programming language. If you need state persistence and/or exactly-once delivery though, you should look at the higher-level Trident API, which also offers micro-batching.

A few companies using Storm: Twitter, Yahoo!, Spotify, The Weather Channel...

Speaking of micro-batching, if you must have stateful computations and exactly-once delivery, and don’t mind a higher latency, you could consider Spark Streaming…especially if you also plan for graph operations, machine learning or SQL access. The Apache Spark stack lets you combine several libraries with streaming (Spark SQL, MLlib, GraphX) and provides a convenient unifying programming model. In particular, streaming algorithms (e.g. streaming k-means) allow Spark to facilitate decisions in real time.


A few companies using Spark: Amazon, Yahoo!, NASA JPL, eBay Inc., Baidu…

If you have a large amount of state to work with (e.g. many gigabytes per partition), Samza co-locates storage and processing on the same machines, allowing you to work efficiently with state that won’t fit in memory. The framework also offers flexibility with its pluggable API: its default execution, messaging and storage engines can each be replaced with your choice of alternatives. Moreover, if you have a number of data processing stages from different teams with different codebases, Samza’s fine-grained jobs would be particularly well-suited, since they can be added and removed with minimal ripple effects.

A few companies using Samza: LinkedIn, Intuit, Metamarkets, Quantiply, Fortscale…


We only scratched the surface of The Three Apaches. We didn’t cover a number of other features and more subtle differences between these frameworks. Also, it’s important to keep in mind the limits of the above comparisons, as these systems are constantly evolving.

The IoT: It’s a question of scope

1 Apr

There is a part of the rich history of software development that can serve as a guiding light and support the creation of the software that will run the Internet of Things (IoT). It’s all a question of scope.

Figure 1 is a six-layer architecture, showing what I consider to be key functional and technology groupings that will define software structure in a smart connected product.

Figure 1

The physical product is on the left. “Connectivity” in the third box allows the software in the physical product to connect to back-end application software on the right. Compared to a technical architecture, this is an oversimplification. But it will help me explain why I believe the concept of “scope” is so important for everyone in the software development team.

Scope is a big deal
The “scope” I want to focus on is a well-established term used to explain name binding in computer languages. There are other uses, even within computer science, but for now, please just exclude them from your thinking, as I am going to do.

The concept of scope can be truly simple. Take the name of some item in a software system. Now decide where within the total system this name is a valid way to refer to the item. That’s the scope of this particular name.

(Related: What newcomers to IoT plan for its future)

I don’t have evidence, but I imagine that the concept arose naturally in the earliest days of software, with programs written in machine code. The easiest way to handle variables is to give them each a specific memory location. These are global variables; any part of the software that knows the address can access and use these variables.

But wait! It’s 1950 and we’ve used all 1KB of memory! One way forward is to recognize that some variables are used only by localized parts of the software. So we can squeeze more into our 1KB by sharing memory locations. By the time we get to section two of the software, section one has no more use for some of its variables, so section two can reuse those addresses. These are local variables, and as machine code gave way to assembler languages and high-level languages, addresses gave way to names, and the concept of scope was needed.

But scope turned out to be much more useful than just a way to share precious memory. With well-chosen rules on scope, computer languages used names to define not only variables, but whole data structures, functions, and connections to peripherals as well. You name it, and, well yes, you could give it a name. This created new ways of thinking about software structure. Different parts of a system could be separated from other parts and developed independently.
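The progression from shared memory to structured independence shows up directly in modern language scope rules. A short Python illustration of global, local, and enclosing scope:

```python
counter_total = 0        # global scope: any function in this module can name it

def make_counter():
    count = 0            # local scope: exists only inside make_counter
    def bump():
        nonlocal count   # bind the name to the enclosing function's variable
        count += 1
        return count
    return bump

# Two closures each carry their own private 'count'; the identical names
# never collide, just as independently developed subsystems can reuse
# names without interfering with each other.
a, b = make_counter(), make_counter()
```

Calling `a()` twice and `b()` once yields 1, 2 and 1: each counter’s state is invisible outside its own scope, which is exactly the separation the paragraph above describes.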

A new software challenge
There’s a new challenge for IoT software, and this challenge applies to all the software across the six boxes in Figure 1. This includes the embedded software in the smart connected device and the enterprise applications that monitor and control the device, as well as the software handling access control and product-specific functions.

The challenge is the new environment for this software. These software types and the development teams behind them are very comfortable operating in essentially “closed” environments. For example, the embedded software used to be just a control system; its universe was the real-time world of sensors and actuators together with its memory space and operating system. Complicated, but there was a boundary.

Now, it’s connected to a network, and it has to send and receive messages, some of which may cause it to update itself. Still complicated, and it has no control over the timing, sequence or content of the messages it receives. Timing and sequence shouldn’t be a problem; that’s like handling unpredictable screen clicks or button presses from a control panel. But content? That’s different.

Connectivity creates broadly similar questions about the environment for the software across all the six layers. Imagine implementing a software-feature upgrade capability. Whether it’s try-before-you-buy or a confirmed order, the sales-order processing system is the one that holds the official view of what the customer has ordered. So a safe transaction-oriented application like SOP is now exposed to challenging real-world questions. For example, how many times, and at what frequency, should it retry after a device fails to acknowledge an upgrade command within the specified time?
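The retry question at the end of that paragraph is a policy decision, and one common shape for it is a bounded retry with exponential backoff. A sketch (the `send` callable and every name here are hypothetical; delays are computed rather than slept, so the policy is easy to test):

```python
def retry_upgrade(send, max_attempts=3, base_delay=1.0):
    """Bounded retry policy for an upgrade command.

    'send' is a hypothetical callable that issues the command and returns
    True when the device acknowledges within the specified time. Returns
    (succeeded, backoff_delays_used).
    """
    delays = []
    for attempt in range(max_attempts):
        if send():
            return True, delays
        delays.append(base_delay * (2 ** attempt))   # 1s, 2s, 4s, ...
    return False, delays
```

A sales-order-processing system could tune `max_attempts` and `base_delay` per device class, and escalate to a human workflow once the attempts are exhausted.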

An extensible notion
The notion of scope can be extended to help development teams handle this challenge. It doesn’t deliver the solutions, but it will help team members think about and define structure for possible solution architectures.

For example, Figure 2 looks at software in a factory, where the local scope of sensor readings and actuator actions in a work-cell automation system are in contrast to the much broader scope of quality and production metrics, which can drive re-planning of production, adjustment of machinery, or discussions with suppliers about material quality.

Figure 2

Figure 3 puts this example from production in the context of the preceding engineering development work, and the in-service life of this product after it leaves the factory.

Figure 3

Figure 4 adds three examples of new IoT capabilities that will need new software: one in service (predictive maintenance), and two in the development phase (calibration of manufacturing models to realities in the factory, and engineering access to in-service performance data).

Figure 4

Each box is the first step to describing and later defining the scope of the data items, messages, and sub-systems involved in the application. Just like the 1950s machine code programmers, one answer is “make everything global”—or, in today’s terms, “put everything in a database in the cloud.” And as in 1950, that approach will probably be a bit heavy on resources, and therefore fail to scale.

Dare I say data dictionary?
A bit old school, but there are some important extensions to ensure a data dictionary articulates not only the basic semantics of a data item, but also its reliability, availability, and likely update frequency. IoT data may not all be in a database; a lot of it starts out there in the real world, so attributes like time and cost of updates may be relevant. For the development team, stories, scrums and sprints come first. But after a few cycles, the data dictionary can be the single reference that ensures everyone can discuss the required scope for every artifact in the system-of-systems.

Software development teams for every type of software involved in an IoT solution (for example, embedded, enterprise, desktop, web and cloud) will have an approach (and possibly different approaches) to naming, documenting, and handling design questions: Who creates, reads, updates or deletes this artifact? What formats do we use to move data inside one subsystem, or between subsystems? Which subsystem is responsible for orchestrating a response to a change in a data value? Given a data dictionary, and a discussion about the importance of scope, these teams should be able to discuss everything that happens at their interfaces.

Different programming languages have different ways of defining scope. I believe it’s worth reviewing a few of these, and maybe exploring some boundaries by looking at more esoteric languages. This will remind you of all the wonderful possibilities and unexpected pitfalls of using, communicating, and sharing data and other information technology artifacts. The rules the language designers have created may well inspire you to develop guidelines and maybe specific rules for your IoT system. You’ll be saving your IoT system development team a lot of time.

