Unlearn to Unleash Your Data Lake

16 Sep

The Data Science Process is about exploring, experimenting, and testing new data sources and analytic tools quickly.

The Challenge of Unlearning
For the first two decades of my career, I worked to perfect the art of data warehousing. I was fortunate to be at Metaphor Computers in the 1980s, where we refined the art of dimensional modeling and star schemas. I spent many years honing my star schema and dimensional modeling skills with data warehouse luminaries like Ralph Kimball, Margy Ross, Warren Thornthwaite, and Bob Becker. It became ingrained in every customer conversation; I'd build the star schema and the conformed dimensions in my head as the client explained their data analysis requirements.

Then Yahoo happened to me, and soon everything that I held as absolute truth was turned upside down. I was thrown into a brave new world of analytics based upon petabytes of semi-structured and unstructured data, hundreds of millions of customers with 70 to 80 dimensions and hundreds of metrics, and the need to make campaign decisions in fractions of a second. There was no way that my batch "slice and dice" business intelligence and highly structured data warehouse approach was going to work in this new world of real-time, predictive and prescriptive analytics.

I struggled to unlearn ingrained data warehousing concepts in order to embrace this new real-time, predictive and prescriptive world. And this is one of the biggest challenges facing IT leaders today – how to unlearn what they've held as gospel and embrace what is new and different. And nowhere do I see that challenge more evident than when I'm discussing Data Science and the Data Lake.

Embracing The “Art of Failure” and The Data Science Process
Nowadays, Chief Information Officers (CIOs) are being asked to lead the digital transformation from a batch world that uses data and analytics to monitor the business to a real-time world that exploits internal and external, structured and unstructured data, to predict what is likely to happen and prescribe recommendations. To power this transition, CIOs must embrace a new approach for deriving customer, product, and operational insights – the Data Science Process (see Figure 2).

Figure 2:  Data Science Engagement Process

The Data Science Process is about exploring, experimenting, and testing new data sources and analytic tools quickly, failing fast but learning faster. The Data Science Process requires business leaders to get comfortable with "good enough" and to accept enough failures along the way before they can trust the analytic results. Prediction is not a perfect world of 100% accuracy. As Yogi Berra famously stated:

“It’s tough to make predictions, especially about the future.”

This highly iterative, fail-fast-but-learn-faster process is the heart of digital transformation – to uncover new customer, product, and operational insights that can optimize key business and operational processes, mitigate regulatory and compliance risks, uncover new revenue streams and create a more compelling, more prescriptive customer engagement. And the platform that is enabling digital transformation is the Data Lake.

The Power of the Data Lake
The data lake exploits the economics of big data: by coupling commodity, low-cost servers and storage with open source tools and technologies, it is 50x to 100x cheaper to store, manage, and analyze data than traditional, proprietary data warehousing technologies. However, it's not just cost that makes the data lake a more compelling platform than the data warehouse. The data lake also provides a new way to power the business, based upon new data and analytics capabilities, agility, speed, and flexibility (see Table 1).

| Data Warehouse | Data Lake |
| --- | --- |
| Data structured in heavily-engineered dimensional schemas | Data stored as-is (structured, semi-structured, and unstructured formats) |
| Heavily-engineered, pre-processed data ingestion | Rapid as-is data ingestion |
| Generates retrospective reports from historical, operational data sources | Generates predictions and prescriptions from a wide variety of internal and external data sources |
| 100% accurate results of past events and performance | "Good enough" predictions of future events and performance |
| Schema-on-load to support historical reporting on what the business did | Schema-on-query to support rapid data exploration and hypothesis testing |
| Extremely difficult to ingest and explore new data sources (measured in weeks or months) | Easy and fast to ingest and explore new data sources (measured in hours or days) |
| Monolithic design and implementation (waterfall) | Natively parallel, scale-out design and implementation (scrum) |
| Expensive and proprietary | Cheap and open source |
| Widespread data proliferation (data warehouses and data marts) | Single managed source of organizational data |
| Rigid; hard to change | Agile; relatively easy to change |

Table 1:  Data Warehouse versus Data Lake
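The schema-on-query row of Table 1 is worth making concrete. Here is a minimal Python sketch (the event records and field names are invented for illustration) of how a data lake approach ingests semi-structured data as-is and imposes structure only at query time:

```python
import json

# Semi-structured event records, ingested as-is -- no upfront schema.
# Fields vary from record to record, and some may be missing entirely.
raw_events = [
    '{"user": "a1", "action": "click", "ts": 1}',
    '{"user": "b2", "action": "purchase", "amount": 19.99, "ts": 2}',
    '{"user": "a1", "ts": 3}',
]

def query_purchases(lines):
    """Apply a schema at query time: keep only purchase events with an amount."""
    total = 0.0
    for line in lines:
        rec = json.loads(line)
        if rec.get("action") == "purchase" and "amount" in rec:
            total += rec["amount"]
    return total

print(query_purchases(raw_events))  # sums the purchase amounts found in the raw feed
```

The point of the sketch is that a new field or a new event type can land in the lake today and be queried tomorrow; no ingestion pipeline or dimensional schema has to change first.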

The data lake supports the unique requirements of the data science team to:

  • Rapidly explore and vet new structured and unstructured data sources
  • Experiment with new analytics algorithms and techniques
  • Quantify cause and effect
  • Measure goodness of fit
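The last two steps of that cycle can be sketched in a few lines of plain Python (the spend and conversion numbers below are invented for illustration): fit a simple linear model to quantify a cause-and-effect relationship, then measure goodness of fit with R².

```python
# Hypothetical campaign data: ad spend vs. conversions (illustrative numbers only).
spend = [1.0, 2.0, 3.0, 4.0, 5.0]
conversions = [2.1, 3.9, 6.2, 8.1, 9.8]

def fit_line(x, y):
    """Ordinary least squares for y ~ a*x + b (quantifying cause and effect)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    return slope, my - slope * mx

def r_squared(x, y, a, b):
    """Goodness of fit: the fraction of variance the model explains."""
    my = sum(y) / len(y)
    ss_res = sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

a, b = fit_line(spend, conversions)
print(round(r_squared(spend, conversions, a, b), 3))
```

In practice the data science team would reach for richer models and libraries, but the cycle is the same: propose a relationship, fit it, score it, and either trust it or throw it away and try again.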

The data science team needs to be able to perform this cycle in hours or days, not weeks or months. The data warehouse cannot support these data science requirements. The data warehouse cannot support rapid exploration of internal and external, structured and unstructured data sources. The data warehouse cannot leverage the growing field of deep learning/machine learning/artificial intelligence tools to quantify cause-and-effect. Thinking that the data lake is "cold storage for our data warehouse" – as one data warehouse expert told me – misses the bigger opportunity. That's yesterday's "triangle offense" thinking. The world has changed, and just as the game of basketball is being changed by the "economics of the 3-point shot," business models are being changed by the "economics of big data."

But a data lake is more than just a technology stack. To truly exploit the economic potential of the organization’s data, the data lake must come with data management services covering data accuracy, quality, security, completeness and governance. See “Data Lake Plumbers: Operationalizing the Data Lake” for more details (see Figure 3).

Figure 3:  Components of a Data Lake

If the data lake is only going to be used as another data repository, then go ahead and toss your data into your unmanageable gaggle of data warehouses and data marts.

BUT if you are looking to exploit the unique characteristics of data and analytics – assets that never deplete, never wear out, and can be used across an infinite number of use cases at zero marginal cost – then the data lake is your "collaborative value creation" platform. The data lake becomes the platform that supports the capture, refinement, protection, and re-use of your data and analytic assets across the organization.

But one must be ready to unlearn what one has held as gospel truth with respect to data and analytics; to be ready to throw away what one has mastered in order to embrace new concepts, technologies, and approaches. It's challenging, but the economics of big data are too compelling to ignore. In the end, the transition will be enlightening and rewarding. I know, because I have made that journey.

Source: http://cloudcomputing.sys-con.com/node/4157284


5 Reasons Why We Need Microchips Under Our Skin

31 Jan


On our journey towards shrinking (and flattening) all the things we use on a daily basis, such as cell phones or screens, we will eventually hit the milestone where we are finally free of all that stuff and require only one tiny thing called "the microchip," which will substitute for many of them. The first of these is identification, naturally. So, I won't have to carry my ID card with me anymore, you say? My ID card is inside ME? AND my credit card? I'm not sure that motivates me enough to accept the concept. I don't trust the people who will control the data. And more importantly, I don't want to be controlled; I want to be free. What else do you have?

So what else is there? For me it’s obvious, so without further ado, here are the reasons:

  1. Safety. Imagine having your child kidnapped. A truly horrible scenario for every sane human being. But what if you never had to worry about it? If it ever happened, you could easily alert the authorities, who would then contact the manufacturer and quickly locate your child. For this to be achievable, we need a central system of the highest security imaginable that stores all the data from all the implants. Not only that, but we also need technology so sophisticated and so hard to get for the "usual criminal" that it would be nearly impossible for an outsider to scan and find the implant in your body. Another scenario involves Alzheimer's disease – in you or your loved one(s). If you ever got lost, finding you would not be mission impossible.
  2. Health. This one is easy. Start with basic features such as providing a doctor with all your medical records. In my imagination, hospitals don't need to keep databases containing this information; it can all be stored inside the central system. The system decides which data it will show to the hospitals that are connected to its servers – hospitals that want to be in the program, obviously. And no, I wouldn't charge hospitals a fee.
  3. Human enhancement. Or call them biological-limitation reasons. How many are there? The list could be endless or short, depending on your perception of limitations and your knowledge of the topic. I can only begin to imagine all the possible and impossible scenarios where a microchip implant would work for me, thus making me a superwoman. IBM has already made a computer chip featuring components that serve as 256 neurons and 262,144 synapses. The goal is to make a processor that can work like a human brain. It sounds impossible; however, many believe it will quickly become reality.
  4. Convenience. This one is very handy – imagine never having a wallet with you; instead, you just pass a scanner on your way out of the store and your bill is paid. Or you enter a club the same way. Or you never have to search for your keys. You will be able to pre-program your microchip implant to work for you.
  5. Advertising. I have to mention this one, as I come from the advertising industry. Imagine having scanners on billboards or citylight posters. These scanners would recognize which target audience is standing in front of them and show an advert that fits their demographics, for example. Naturally, you would pay a fee to the Central Implant System (it's what I like to call it) and thus get access to basic information about implant bearers. And then you can go wild: want to show your advert only to male Caucasians in their twenties who love fast cars? Or to pregnant women only? Finally, all those billions of dollars spent on advertising would be spent efficiently.

First things first

According to Wikipedia, a microchip implant is "an identifying integrated circuit device or RFID transponder encased in silicate glass and implanted in the body," while RFID stands for "radio-frequency identification, which is the wireless non-contact use of radio-frequency electromagnetic fields to transfer data, for the purposes of automatically identifying and tracking tags attached to objects."

In 2004, the VeriChip Corporation received approval from the FDA to market its microchips in the U.S. Three years later, in 2007, it was revealed that nearly identical implants had caused cancer in hundreds of laboratory animals, which, naturally, had a disastrous impact on the company's stock price and on the production of microchip implants. However, the link between foreign-body tumorigenesis in lab animals and implantation in humans has been publicly refuted as misleading.

I can only hope that right now, as I write, someone somewhere is testing microchip implants that are not only safe to have inside your body but also untraceable to criminals. For instance, if someone kidnaps you, the last thing you need is a scanner being used to find the implant and remove it. This means we need some kind of Federal Reserve System for human microchip implants; an impenetrable fortress. Only with that level of security and systematization will we be able to start and further develop the microchip-implant world.

We obviously need more sophisticated microchips if we want to achieve all these marvelous things. I am sure that nanotechnology will be the saviour here. I believe in nanotechnology.  I also believe in artificial intelligence.  AND I believe in humans – which brings me to the Singularity.

The future of artificial intelligence (AI)

Well, all my reasons align with the concept of the singularity – us and technology, combined. I believe it is obvious now that we are headed towards the singularity. Take Google Glass, for example: it's a big step in that direction. Raymond Kurzweil is so convinced it's going to happen during his lifetime that he does everything he possibly can to prolong his life. If you need more convincing – even Google hired him to "work on new projects involving machine learning and language processing." Kurzweil predicts the singularity will occur around 2045.

My conclusion is simple – the sooner we start using microchip implants for humans, the faster we’ll come to the singularity era, merely because we’ll have the solid ground for further development of the concept.

What is your opinion on this topic? Are you excited about the future? Or afraid? What are the negative possible outcomes of microchip implants in humans and the human-technology juncture?


Source: http://jmbg.biz/2014/01/28/5-reasons-why-we-need-microchips-under-our-skin/
