The advantages of data lakes when it comes to gathering and analysing large quantities of disparate data should now be clear. As I detailed in my previous blog, this new way of thinking about data promises to revolutionise how businesses approach analytics, discovery and learning, offering them a more flexible, programmable way to dig deeper into the digital assets they possess.
But with so much valuable data held in a single solution, protecting that information must be a top priority. This is particularly true for industries such as healthcare, for which data lakes offer huge opportunities - but also bring challenges of their own.
The healthcare opportunities
In the healthcare sector, the rise of data lakes can be particularly transformative. This is an area that is greatly dependent on large volumes of data - often coming in a wide range of unstructured formats that, in the past, have been difficult to reconcile. But with data lakes, this is no longer the case.
For example, when it comes to population health, data lakes can greatly expand the analytics capabilities of healthcare organisations, allowing them to examine the records of hundreds of thousands of people with no more time and effort than it previously took to look at just 100 patients. This means key patterns can be spotted much more quickly, and the far larger samples make the results of studies more reliable.
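To make that concrete, here is a minimal sketch of what such a population-level query might look like in PySpark, one of the most common analytics engines run on data lakes. The storage path, schema and field names (patient_id, diagnosis_code, region) are hypothetical placeholders - the point is that the same few lines run unchanged whether the input holds a hundred records or hundreds of thousands.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a Spark session - on a real cluster this would be configured
# to read directly from the lake's storage layer (e.g. HDFS or S3).
spark = SparkSession.builder.appName("population-health").getOrCreate()

# Hypothetical path and schema: semi-structured patient records
# landed in the lake as JSON, one record per encounter.
records = spark.read.json("s3://example-health-lake/raw/encounters/")

# Count distinct patients per diagnosis code and region - the same
# query scales from 100 patients to hundreds of thousands unchanged.
summary = (
    records
    .groupBy("region", "diagnosis_code")
    .agg(F.countDistinct("patient_id").alias("patients"))
    .orderBy(F.desc("patients"))
)

summary.show(20)
```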
But achieving this will be dependent on companies choosing the right technology solutions to ensure they can effectively analyse the data and gain insight - and on ensuring that data stored in such solutions is fully secure.
What solutions should be used?
At the heart of most data lake deployments sits the open-source Apache Hadoop ecosystem. But one of the key decisions to be made is whether this is managed on-premises or run in the cloud. Cloud services such as AWS and Microsoft Azure now provide many ready-made services and components, so companies can get up and running very quickly. That naturally makes them an attractive option.
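As an illustration of how little setup the cloud route involves, the sketch below launches a small managed Hadoop/Spark cluster using AWS EMR through the boto3 SDK. The names, bucket paths and instance counts are illustrative placeholders, and the default EMR roles are assumed to exist; Azure's HDInsight offers a broadly similar experience.

```python
import boto3

# A minimal managed Hadoop/Spark cluster on AWS EMR. Names, roles
# and bucket paths below are illustrative placeholders.
emr = boto3.client("emr", region_name="eu-west-1")

response = emr.run_job_flow(
    Name="example-data-lake-cluster",
    ReleaseLabel="emr-6.15.0",              # bundles Hadoop, Spark, etc.
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}],
    LogUri="s3://example-logs/emr/",
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        # Terminate automatically once submitted steps finish, so the
        # cluster only exists (and costs money) while it is working.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",      # assumes the default EMR roles
    ServiceRole="EMR_DefaultRole",
)

print("Cluster starting:", response["JobFlowId"])
```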
But while there's an obvious convenience value to these tools, should we be using them for highly sensitive data lakes such as those in the healthcare sector? Personally, I'm in favour of the cloud: in addition to the high level of performance available, it provides a range of capabilities that are hard to replicate on-premises.
The most important of these is the ability to create temporary, scalable compute clusters, turning the data lake into a fully scalable, manageable asset whose users can provision extra resources whenever they are needed. This is very hard to achieve with traditional on-premises solutions, which usually restrict companies to a fixed number of nodes and clusters.
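For the scaling half of that, EMR again serves as an illustration: its managed scaling feature lets a cluster grow and shrink with the workload rather than sitting at a fixed node count. The cluster ID and capacity limits below are invented for the example.

```python
import boto3

emr = boto3.client("emr", region_name="eu-west-1")

# Attach a managed scaling policy to a running cluster, letting EMR
# add and remove nodes between the limits as the workload demands.
# The cluster ID and capacity limits are illustrative.
emr.put_managed_scaling_policy(
    ClusterId="j-EXAMPLE123",
    ManagedScalingPolicy={
        "ComputeLimits": {
            "UnitType": "Instances",
            "MinimumCapacityUnits": 2,    # quiet periods: shrink to 2 nodes
            "MaximumCapacityUnits": 20,   # heavy analyses: grow to 20
        }
    },
)
```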
This freedom and flexibility will be hugely valuable to data scientists. While the tools on offer in the cloud are largely the same as those on-premises, the resources behind them are far greater, which lets companies make their data lakes much more manageable and cost-effective.
The security question
When it comes to security, today's cloud services offer very good levels of protection - as good as, and often better than, what companies could put in place on their own - so businesses concerned about entrusting highly confidential data outside their controlled environment needn't worry.
Of course, in some circumstances, such as for certain types of government and healthcare data, regulations will restrict the use of cloud services. In Finland, for example, certain data must be stored and processed within the country's borders - something that may not always be guaranteed with every cloud provider. In such cases, on-premises may be the only option.
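Where cloud use is permitted but residency rules apply, the practical control is to pin every resource to an approved region at creation time. A hypothetical sketch, again with boto3:

```python
import boto3

# Pin all clients created from this session to one approved region.
# Whether an in-country region exists at all depends on the provider,
# so this only helps where the residency rules can actually be met.
session = boto3.Session(region_name="eu-north-1")  # illustrative choice

emr = session.client("emr")
s3 = session.client("s3")
```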
But whichever option businesses choose, a few requirements for security are common to both. These include centralised administration of the data lake, so that policies and security measures are consistent across all clusters; strong authentication and perimeter defences to control who has access; and comprehensive auditing processes to keep track of who is viewing the information.
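The sketch below illustrates two of those requirements on AWS: enforcing encryption at rest on the lake's storage bucket, and turning on CloudTrail so that access to the data is logged for audit. The bucket and trail names are placeholders; an on-premises Hadoop stack would typically reach for tools such as Apache Ranger and Kerberos to cover the same ground.

```python
import boto3

s3 = boto3.client("s3")
cloudtrail = boto3.client("cloudtrail")

BUCKET = "example-health-lake"  # placeholder bucket name

# Enforce encryption at rest for everything written to the lake.
s3.put_bucket_encryption(
    Bucket=BUCKET,
    ServerSideEncryptionConfiguration={
        "Rules": [
            {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "aws:kms"}}
        ]
    },
)

# Block any form of public access to the bucket outright.
s3.put_public_access_block(
    Bucket=BUCKET,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)

# Audit trail: log management events (and, via event selectors,
# data-level S3 reads and writes) to a separate log bucket.
cloudtrail.create_trail(Name="lake-audit-trail", S3BucketName="example-audit-logs")
cloudtrail.start_logging(Name="lake-audit-trail")
```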
Doing this will ensure that operations dealing with the most sensitive data, such as healthcare studies, are fully protected, regardless of whether organisations stick with on-premises tools or take advantage of the benefits offered by the cloud.
Missed my previous blog about the data lake revolution? You can read it here.