Decoded: Data lakes for security

The capabilities that data lakes support make them powerful security tools in today's complex and fast-moving threat landscape

By Tim Ferguson

Wed 15 Nov 2017 @ 11:29

I keep hearing about data lakes. Tell me how they work.

Big data and analytics are being used by numerous industries to extract the maximum value from large datasets as businesses increasingly focus on efficiency and gaining a competitive edge.

A recent development in this space has been the use of data lakes. Data lakes pool together raw data held by organisations and store it in its original format. Unlike hierarchical data warehouses, which use files and folders, data lakes use a flat architecture. This not only makes data lakes cheaper to run than traditional data warehouses, but also enables analytics tools to work across data that may not have been associated previously, allowing businesses to generate new insights.

In addition, as data lakes are often based on the open-source Hadoop framework, they are extremely scalable, even extending into the cloud if needed. This means they can accommodate a rapid influx of data, for example from a company merger or an increase in data generated within a corporate network, both easily and cost-effectively.

OK, so what are the applications for security?

The attributes of data lakes mean they have the potential to be powerful tools in addressing the increasingly sophisticated and fast-moving threat landscape.

Organisations should expect their networks to be compromised as cyber criminals find innovative ways, such as spear phishing or whaling, to get through perimeter security into corporate networks. In addition, as the attack surface widens—thanks to cloud computing, mobile devices and the Internet of Things—the number of routes into networks increases.

The key is to deal with these threats quickly once they have entered the network. In order to drive down the mean time to detect and mean time to respond, it is essential to effectively monitor data as it travels across a network. This means any potential issues can be quickly identified for investigation before they inflict serious damage.

Data lakes allow vast amounts of security data to be pooled together so it can be interrogated by a range of security technologies such as monitoring and forensics tools, log management systems, and security information and event management (SIEM) systems.

As mentioned previously, data lakes are also incredibly scalable, so they can quickly deal with a spike in incoming security data as a result of a serious cyberattack, and not fail when most needed.

By having all security data in a single location, different security tools can access the relevant data in one place, making storage requirements less complex, and increasing the speed and efficiency of security processes.

It's not just about the volume of data, though, is it?

Correct. Data lakes can also help deal with the variety and complexity of data produced across organisations that could indicate a compromise or suggest suspicious patterns of behaviour.

The vast array of information and formats that data lakes can deal with means they're able to provide organisations with a single repository for data produced by a plethora of security tools.

As part of this, data lakes avoid duplication of security data, which is often stored in multiple copies across an organisation as different security products collect and store data in their own formats. For example, network monitoring tools process and store their own copies of data, while network anomaly detection, user scoring and correlation engines also all need copies of the data to function. Data lakes avoid this duplication by collecting data once, and making it available to all the tools and products that need it.

Another key capability when dealing with different types of data is that data lakes allow analytics tools to work across data that may have resided in separate silos in the past. In a security context, this means patterns of activity within different sets of data can be more easily connected to flag malicious activity that may have been missed before. This is particularly useful when you consider how sophisticated cyberthreats are becoming.
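To make the idea of connecting activity across datasets concrete, here is a minimal Python sketch. It assumes events from two hypothetical sources ("auth" and "netflow") already sit in one pool with shared field names. The field names, action labels and ten-minute window are illustrative assumptions, not taken from any real product.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical pooled events from two tools that previously kept separate stores.
SAMPLE_EVENTS = [
    {"timestamp": datetime(2017, 11, 15, 11, 0), "source": "auth",
     "user": "alice", "action": "login_failed"},
    {"timestamp": datetime(2017, 11, 15, 11, 5), "source": "netflow",
     "user": "alice", "action": "large_upload"},
    {"timestamp": datetime(2017, 11, 15, 11, 0), "source": "auth",
     "user": "bob", "action": "login_failed"},
    {"timestamp": datetime(2017, 11, 15, 13, 0), "source": "netflow",
     "user": "bob", "action": "large_upload"},
    {"timestamp": datetime(2017, 11, 15, 11, 2), "source": "netflow",
     "user": "carol", "action": "large_upload"},
]


def correlate(events, window=timedelta(minutes=10)):
    """Return users with a failed login followed by a large upload within `window`."""
    by_user = defaultdict(list)
    for event in events:
        by_user[event["user"]].append(event)

    flagged = set()
    for user, user_events in by_user.items():
        # Pull the two event types out of what used to be separate silos.
        failures = [e["timestamp"] for e in user_events
                    if e["source"] == "auth" and e["action"] == "login_failed"]
        uploads = [e["timestamp"] for e in user_events
                   if e["source"] == "netflow" and e["action"] == "large_upload"]
        # Flag the user if any upload follows any failure closely enough.
        if any(timedelta(0) <= up - fail <= window
               for fail in failures for up in uploads):
            flagged.add(user)
    return sorted(flagged)


if __name__ == "__main__":
    print(correlate(SAMPLE_EVENTS))
```

Neither source would raise this pattern on its own. Only the combined view links the failed login to the upload that follows it.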

How do security systems make heads or tails of all this data?

Although data lakes represent a great way to gather, store and process security data, the sheer range of data types, formats and standards requires technology to make sense of it all before user and entity behaviour analytics (UEBA) and SIEM systems can get to work on it.

Data classification tools can translate hundreds of different data formats by adding the context needed to produce consistent information that can be understood by any analyst who comes into contact with it. These tools essentially act as a Babel fish for security data, meaning analysts can tap into the vast majority of data held in a security data lake.
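As a rough illustration of this kind of translation, the Python sketch below maps two made-up log formats onto one common record layout. The formats, the field names and the `normalise` function are all assumptions for illustration. Real classification tools handle hundreds of formats and add far richer context.

```python
import json
from datetime import datetime, timezone

# Hypothetical common schema: every normalised record carries exactly these keys.
COMMON_FIELDS = ("timestamp", "source", "user", "src_ip", "action")


def from_firewall_csv(line):
    """Parse an assumed firewall format: epoch_seconds,src_ip,user,action."""
    epoch, src_ip, user, action = line.strip().split(",")
    timestamp = datetime.fromtimestamp(int(epoch), tz=timezone.utc).isoformat()
    return {"timestamp": timestamp, "source": "firewall", "user": user,
            "src_ip": src_ip, "action": action}


def from_auth_json(line):
    """Parse an assumed authentication log: JSON with its own key names."""
    raw = json.loads(line)
    return {"timestamp": raw["time"], "source": "auth", "user": raw["account"],
            "src_ip": raw["client"], "action": raw["event"]}


# One parser per known format; new formats are added by registering a parser.
PARSERS = {"firewall": from_firewall_csv, "auth": from_auth_json}


def normalise(kind, line):
    """Translate one raw log line of the given kind into the common schema."""
    if kind not in PARSERS:
        raise ValueError(f"no parser registered for format: {kind}")
    record = PARSERS[kind](line)
    if set(record) != set(COMMON_FIELDS):
        raise ValueError("parser returned a record outside the common schema")
    return record


if __name__ == "__main__":
    print(normalise("firewall", "1510745340,10.0.0.5,alice,deny"))
    print(normalise("auth", '{"time": "2017-11-15T11:29:00+00:00", '
                            '"account": "bob", "client": "10.0.0.9", '
                            '"event": "login_failed"}'))
```

Because each record ends up with the same keys, an analyst or downstream tool can query firewall and authentication events in the same way.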

The ability to translate such a vast range of data and make it usable is invaluable as security teams work night and day to keep pace with the cyberthreats impacting their organisations.