Big Data – Big Help or Big Risk?

By Andy Thurai (Twitter: @AndyThurai)

[Original shorter version of this article appeared on PW]

As promised in my last blog, “Big Data, API, and IoT …..Newer technologies protected by older security,” here is a deep dive on Big Data security and how to secure Big Data effectively.

It is an unfortunate fact that Hadoop, like other open source projects, has not focused much on security. “Project Rhino,” an Apache Hadoop security initiative spearheaded by Intel, aims to correct the inherent deficits that previously made Hadoop an untenable solution for security-conscious enterprises.

To use Big Data effectively, you need to secure it properly. However, if you try to force-fit everything into an older security model with older security tools, you will undoubtedly end up compromising more than you think.

There is a fundamental architectural difference with Big Data that is dissonant with previously accepted security models. Characteristics of the new paradigm include globally distributed data collection; the collection and storage of both structured and unstructured data (such as images, audio, video, and sensor data); massive data storage; high-speed, massively parallel processing; and near-real-time analysis, with decisions made on massive amounts of data.

When you move this unsecured data, your goldmine, into the cloud, where your security controls are not as good as the ones in your enterprise, the problem is magnified. The associated problem is that if you make it highly secure using existing security models and tools, it will become a proverbial “brick.” In other words, your security will interfere with performance. The amount of data stored, analyzed, and moved using Big Data is massive, and slowing it down will only hinder the systems that depend on it.

To secure Big Data effectively, you must mitigate the following security risks, which prior security models do not address.

Issue #1:  Are the keys to the kingdom with you?

In a hosted environment, the provider holds the keys to your secure data. If a government agency legally demands access, the provider is obligated to hand over your data, sometimes without your prior knowledge. While such disclosure may be necessary, the onus should be on you to control when, what, and how much access you give to others, and to keep track of the information released to facilitate internal auditing.

Keep the keys to the kingdom with you. One possibility is to provide your own key management controls so the keys are managed by you (and only you). A better solution is to deploy gateway encryption proxies. They not only give you control of the keys but also ensure that data flowing in and out of your system is encrypted with your choice of encryption algorithm, strength, type, and keys. This also gives you the flexibility to either manage the keys in the cloud (when computing is done in the cloud) or manage the keys in the enterprise and send only encrypted data to the cloud (when the cloud is mainly used for storage and distribution).

Issue #2: Encrypting slows things down

If you encrypt all of the data, performance could slow down significantly. To avoid that, some Big Data, BI, and analytics programs choose to encrypt only the sensitive portions of the data. It is imperative to use a Big Data ecosystem that is intelligent enough to encrypt data selectively.
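
To make selective encryption concrete, here is a minimal sketch in Python. The field names in `SENSITIVE_FIELDS` and the `encrypt_selected` helper are my own illustrative assumptions, not part of any Hadoop product, and the cipher itself is deliberately left to the caller (for example, an AES routine from a vetted crypto library).

```python
from typing import Callable, Dict, Set

# Hypothetical policy: field names your data classification marks as sensitive.
SENSITIVE_FIELDS: Set[str] = {"ssn", "card_number", "dob"}


def encrypt_selected(
    record: Dict[str, str],
    encrypt: Callable[[bytes], bytes],
    sensitive: Set[str] = SENSITIVE_FIELDS,
) -> Dict[str, object]:
    """Return a copy of `record` with only the sensitive fields encrypted.

    `encrypt` is supplied by the caller; this sketch does not implement
    a cipher of its own.
    """
    protected: Dict[str, object] = {}
    for name, value in record.items():
        if name in sensitive:
            protected[name] = encrypt(value.encode("utf-8"))
        else:
            protected[name] = value  # left in the clear for fast analytics
    return protected
```

Only the fields named in the policy pay the encryption cost; everything else passes through untouched, which is the whole point of encrypting selectively.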

A separate and more desirable option is to run faster encryption/decryption. Solutions such as Intel Hadoop security Gateway use Intel chip-based encryption acceleration (the Intel AES-NI instruction set as well as the SSE 4.2 instruction set), which can be many times faster than software-only encryption. It is not only faster but also more secure, because the data never leaves the processor for an on-board or off-board crypto processor.
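
Before counting on hardware acceleration, it is worth checking that the CPU actually advertises it. The sketch below assumes a Linux x86 host, where `/proc/cpuinfo` lists AES-NI as the `aes` flag and SSE 4.2 as `sse4_2`; the helper names are mine.

```python
def cpu_flags(cpuinfo_text: str) -> set:
    """Collect the CPU feature flags listed in /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            flags.update(value.split())
    return flags


def has_crypto_acceleration(cpuinfo_text: str) -> bool:
    """True if both the AES-NI ("aes") and SSE 4.2 ("sse4_2") flags are present."""
    return {"aes", "sse4_2"} <= cpu_flags(cpuinfo_text)
```

On such a host you would call it as `has_crypto_acceleration(open("/proc/cpuinfo").read())`. Other platforms expose this information differently, so treat a `False` result there as "unknown" rather than "absent."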

Issue #3: Identifiable, sensitive data is a big risk

Sensitive data can be classified into two groups: data that poses risk and data that must meet compliance requirements. (You could argue that intellectual property (IP) and business confidential information belong in a separate category, but for the purposes of this discussion they are treated as risky data.) While some data may fall into both groups, these two groups drive the necessary level of protection. Sensitive data might include PCI-related information such as PAN data; PII such as bank account numbers, passport information, and dates of birth; or PHI such as medical records. It may also include confidential business data such as intellectual property, sales forecasts, or pricing information. Even if you have a set of corporate policies that help you define, classify, and identify this information, you still need an effective mechanism to safeguard it.

Safeguarding your data might include one of the following:

  1. Completely redact this information so the original can never be recovered. This is the most effective method and one that could be used for old archives, but the trade-off is that the original data cannot be retrieved if it is needed later. Another option is to partially mask the data and leave only harmless residual information, such as the last four digits of a social security number. Take care that enough information is redacted so that the original cannot be reconstructed mathematically or logically.
  2. Anonymize or tokenize the sensitive data using a proxy tokenization solution. The advantage is that you can create a completely random token that looks like the original data and fits its format, so it won’t break the backend systems. The sensitive data can be stored in a secure vault and only the associated tokens distributed. An out-of-band mechanism, with a prerequisite security handshake, can be used to get the original data back if needed.
  3. Encrypt the sensitive data using mechanisms such as Format Preserving Encryption (FPE) so the encrypted output fits the format of the original data. When selecting a solution, make sure it has strong key management and strong encryption capabilities.
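
The first two options can be sketched in a few lines of Python. This is an illustration under stated assumptions: `mask_ssn` and `TokenVault` are hypothetical names, the vault is an in-memory dictionary standing in for a real secure store, and the out-of-band handshake before detokenization is omitted. FPE is not shown, since it should come from a vetted implementation rather than a blog snippet.

```python
import secrets
import string


def mask_ssn(ssn: str) -> str:
    """Mask every digit except the last four, keeping separators in place."""
    total_digits = sum(ch.isdigit() for ch in ssn)
    seen = 0
    out = []
    for ch in ssn:
        if ch.isdigit():
            seen += 1
            out.append(ch if seen > total_digits - 4 else "X")
        else:
            out.append(ch)
    return "".join(out)


class TokenVault:
    """In-memory stand-in for a secure vault (hypothetical, not production storage)."""

    def __init__(self) -> None:
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        """Return a random token with the same length and digit/letter layout."""
        if value in self._value_to_token:
            return self._value_to_token[value]
        if not any(ch in string.digits + string.ascii_letters for ch in value):
            raise ValueError("value has no digits or letters to randomize")
        while True:
            token = "".join(self._random_like(ch) for ch in value)
            if token != value and token not in self._token_to_value:
                break
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str) -> str:
        """Look up the original; a real vault would authenticate the caller first."""
        return self._token_to_value[token]

    @staticmethod
    def _random_like(ch: str) -> str:
        if ch in string.digits:
            return secrets.choice(string.digits)
        if ch in string.ascii_letters:
            return secrets.choice(string.ascii_letters)
        return ch  # keep separators such as "-" so the format is preserved
```

Because the token is random rather than derived from the value, nothing about the original can be computed from it; the only way back is through the vault.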

The most sensible way to do this would be to have a touchless security/tokenization gateway that can do all of the above based on the context of the content. It can also be used to strip or redact Java exception traces exposed by the sensitive APIs discussed in Issue #5 (below). With a touchless gateway, there is no need to touch the Hadoop clusters to implement any of the above functionality, yet you keep complete control to enforce corporate policy and address the attendant security and compliance issues.
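
As a rough illustration of the trace-stripping idea, a gateway could filter response bodies with something like the Python sketch below. The regular expressions are my own heuristics for what a Java stack trace looks like, not an exhaustive parser, and the placeholder text is arbitrary.

```python
import re

# Heuristic patterns for the parts of a Java stack trace.
_FRAME = re.compile(r"^\s*at\s+[\w.$<>]+\([^)]*\)\s*$")
_CAUSED_BY = re.compile(r"^\s*Caused by:\s")
_MORE = re.compile(r"^\s*\.\.\.\s+\d+\s+more\s*$")
# First line of a trace, e.g. "java.io.FileNotFoundException: /some/path"
_HEAD = re.compile(
    r"^\s*(?:Exception in thread \"[^\"]*\"\s+)?[\w.$]+(?:Exception|Error)\b"
)
_PATTERNS = (_FRAME, _CAUSED_BY, _MORE, _HEAD)


def redact_stack_traces(
    body: str, placeholder: str = "[internal error details removed]"
) -> str:
    """Replace each run of stack-trace lines in `body` with one placeholder line."""
    cleaned = []
    redacting = False
    for line in body.splitlines():
        if any(p.search(line) for p in _PATTERNS):
            if not redacting:
                cleaned.append(placeholder)
            redacting = True
        else:
            cleaned.append(line)
            redacting = False
    return "\n".join(cleaned)
```

The caller still learns that an error occurred, but class names, file paths, and line numbers from the “soft inside” never leave the gateway.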

Issue #4: Data and access control properties together

One of the major issues with distributing data is that people often access information from different geographies without consistent enforcement. This is particularly true with Big Data, where data is distributed for parallel processing and for redundancy. If you let applications and services access the raw data without an appropriate level of enforcement, you are depending on the proverbial “honor system,” assuming that only the good guys are accessing the information. Instead, you should enforce data access controls based on classification levels as close to the data as possible. You need to distribute the data together with its associated properties and classification levels, and enforce them where the data is. One way to do this is to expose the data through an API that controls exposure locally based on the data attributes, so enforcement is consistent every time.
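
Here is a minimal Python sketch of that idea: the classification level travels with each field, and the API filters on it before anything is returned. The `Level` names and the `visible_fields` helper are illustrative assumptions, not a standard scheme.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Dict


class Level(IntEnum):
    """Hypothetical classification levels, lowest to highest."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


@dataclass(frozen=True)
class Field:
    value: str
    level: Level  # the classification is stored alongside the data itself


def visible_fields(record: Dict[str, Field], clearance: Level) -> Dict[str, str]:
    """Return only the fields whose classification the caller is cleared for."""
    return {
        name: field.value
        for name, field in record.items()
        if field.level <= clearance
    }
```

Because the check lives next to the data, every replica in every geography applies the same rule, instead of each consuming application reimplementing it.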

Issue #5: Protect the exposure APIs

Many of the Big Data components (such as HDFS, HBase, and HCatalog), not surprisingly, communicate via APIs. The idea behind exposing these components is to manipulate them via REST calls. Anyone who has used these API calls knows that they are exposed with very limited security. When you allow such powerful APIs to be exposed with little or no protection, the results can be disastrous. For example, you can create directories and files, as well as delete them, using these API calls. Some of these RESTful API calls use weak or inconsistent protection and yet expose too much information, such as a full Java stack trace that readily reveals the “soft inside.” One way to compensate for this deficiency would be to write plugins that inject API/interface-based protection into all of the Hadoop components. Aside from this, there is the issue of name node exposure, which is important enough to be treated separately in Issue #6. The consistency of API security and configuration is also questionable, as all of the Hadoop APIs are configured, managed, and audited independently of one another.

The most effective way to protect your Big Data goldmine would be to introduce a touchless API security gateway in front of the Hadoop clusters. The clusters can be made to trust calls ONLY from the secure gateway. By choosing a hardened Big Data security gateway, you can enforce all of the above using very rich authentication and authorization schemes and make it work with an existing enterprise identity system, not just the basic SPNEGO-based schemes offered by Hadoop clusters. More importantly, this allows you to integrate security by extending your existing security model, incorporating your existing identity scheme, and providing a secure version of these newer technologies.
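
To give a feel for what such a gateway might check, here is a small Python sketch of a per-role allow list for WebHDFS operations. The operation names (`OPEN`, `LISTSTATUS`, `GETFILESTATUS`, `CREATE`, `MKDIRS`, `DELETE`) and the `/webhdfs/v1/` path prefix come from the WebHDFS REST API; the roles, the `ROLE_POLICY` table, and the `is_allowed` function are assumptions of mine for illustration.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical role-to-operation policy; op names are WebHDFS operations.
ROLE_POLICY = {
    "analyst": {"OPEN", "LISTSTATUS", "GETFILESTATUS"},
    "etl": {"OPEN", "LISTSTATUS", "GETFILESTATUS", "CREATE", "MKDIRS"},
    "admin": {"OPEN", "LISTSTATUS", "GETFILESTATUS", "CREATE", "MKDIRS", "DELETE"},
}


def is_allowed(role: str, url: str) -> bool:
    """Decide whether a caller with `role` may have this request forwarded."""
    parts = urlsplit(url)
    if not parts.path.startswith("/webhdfs/v1/"):
        return False  # only proxy the API surface we know about
    ops = parse_qs(parts.query).get("op", [])
    if len(ops) != 1:
        return False  # missing or ambiguous operation: deny by default
    return ops[0].upper() in ROLE_POLICY.get(role, set())
```

The design choice worth noting is deny-by-default: an unknown role, an unknown path, or a request without exactly one `op` parameter is rejected rather than passed through to the cluster.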

Issue #6: Name node protection

This issue is important enough for me to call it out separately. From an architectural perspective, if no proper resource protection is enforced, the NameNode can become a single point of failure, making the entire Hadoop cluster useless. It is as easy as someone launching a DoS attack against WebHDFS by producing excessive activity that brings WebHDFS down.
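
One common form of resource protection against this kind of flood is rate limiting in front of the exposed API. The token bucket below is a generic Python sketch of that technique, not something built into Hadoop; the class name and parameters are my own.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow short bursts up to `capacity`, then at most `rate` requests per second."""

    def __init__(
        self,
        rate: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity
        self._last = clock()

    def allow(self) -> bool:
        """Return True if the request may proceed, consuming one token."""
        now = self._clock()
        # Refill in proportion to the time elapsed, never beyond capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False
```

In practice you would keep one bucket per caller (per API key or source address, for example) and reject or queue requests whenever `allow()` returns `False`, so that a single noisy client cannot starve the NameNode.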


Issue #7: Identify, Authenticate, Authorize and control the data access

You need to have an effective identity management and access control system in place to make this happen. You also need to identify the user base and control access to the data consistently, based on access control policies, without relying on additional identity silos. Ideally, authentication and authorization for Hadoop should leverage existing identity management investments. The enforcement should also take time-based restrictions into account (for example, certain users can access certain data only during specific periods).

Issue #8: Monitor, Log and analyze the usage patterns

Once you have implemented effective classification-based data access controls, you also need to monitor and log the usage patterns. You need to analyze the usage patterns constantly to make sure that there is no unusual activity. It is crucial to catch unusual activity and access patterns early enough to prevent dumps of data from making it out of your repository to a hacker.
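
As a very simple illustration of what analyzing usage patterns can mean, the Python sketch below compares each user's data volume today against that user's own history and flags large deviations. The three-standard-deviation threshold and the function name are assumptions for illustration; real monitoring would look at far more signals than volume.

```python
from statistics import mean, pstdev
from typing import Dict, List


def unusual_users(
    history: Dict[str, List[int]],
    today: Dict[str, int],
    threshold: float = 3.0,
) -> List[str]:
    """Flag users whose volume today exceeds their own baseline by more
    than `threshold` standard deviations, plus users with no baseline."""
    flagged = []
    for user, volume in today.items():
        past = history.get(user, [])
        if len(past) < 2:
            flagged.append(user)  # no baseline yet: surface for review
            continue
        baseline = mean(past)
        spread = pstdev(past) or 1.0  # avoid dividing by zero on flat history
        if (volume - baseline) / spread > threshold:
            flagged.append(user)
    return sorted(flagged)
```

A user who normally reads a few dozen records and suddenly reads thousands shows up immediately, which is the kind of early signal you want before a bulk dump leaves the repository.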


As more and more organizations rush to implement and utilize the power of Big Data, care should be taken to secure it. Extending existing security models to fit Big Data may not solve the problem; in fact, it might introduce additional performance issues, as discussed above. A solid security framework needs to be thought out before organizations can adopt enterprise-grade Big Data.
