Over the past few posts, I've covered the importance of enabling the Azure Data Lake Gen 2 firewall and the changes required to ensure your SQL Data Warehouse and Databricks services maintain proper access. That was a critical first step in securing our big data. However, we now need to give attention to locking down the filesystems and the folders within to only those who are authorized.

Microsoft has taken two different cloud technologies (Storage Accounts and Azure Data Lake Gen 1) and combined them into a single resource: a Storage Account with Hierarchical Namespace enabled. This new Storage Account supports three main access control mechanisms that work in tandem with each other: Role-Based Access Controls, Shared Key or Shared Access Signature authentication, and POSIX-style ACLs.

Let's review each method to see what is appropriate for our situation.

Role-Based Access Controls

  • Coarse-grained access control
  • Smallest granularity is the container level
  • Permissions applied at the RBAC level are automatically inherited by all filesystems and folders
  • Access can be granted to the following Active Directory object types: Users, Groups, Service Principals, and Managed Identities.

Shared Key or Shared Access Signature Authentication

  • No identity is associated with the caller
  • Effectively gains super-user access to the data

POSIX-style ACLs

  • Most granular access control available
  • Access can be applied down to individual files and directories
  • The Storage Blob Data Owner RBAC role acts as a super-user and can control ACLs
  • Access can be granted to the following Active Directory object types: Users, Groups, Service Principals, and Managed Identities.
  • Default ACLs can be set on a directory so that new files and folders created within it inherit those permissions.
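To make the model concrete, ADLS Gen 2 expresses an ACL as a comma-separated list of entries of the form [default:]scope:[object-id]:permissions. The helper below is a minimal sketch (not an Azure SDK call) that builds a single entry; the object ID in the usage example is a placeholder.

```python
def acl_entry(perms, scope="user", object_id="", default=False):
    """Build one POSIX-style ACL entry string for ADLS Gen 2.

    perms     -- three characters, e.g. 'rwx' or 'r-x'
    scope     -- 'user', 'group', or 'other'
    object_id -- AAD object ID for a named user/group; empty for the owner
    default   -- if True, the entry applies to newly created children
    """
    entry = f"{scope}:{object_id}:{perms}"
    return f"default:{entry}" if default else entry

# The owning user's entry, and a named group granted read + traverse
# on all future children (object ID is hypothetical):
print(acl_entry("rwx"))                                    # user::rwx
print(acl_entry("r-x", "group", "<group-object-id>", default=True))
```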

As you can see, POSIX-style ACLs provide the most granular controls for our Data Lake; however, there are some gotchas you should be aware of that will impact your implementation, and possibly some decisions you will want to make up front.

Be Proactive

The first area I'd like to draw your attention to is the fact that Azure Data Lake is an enterprise-level solution, and as such you can expect a massive number of folders and files to land there (quite possibly billions of files). Now consider this: adding a new ACL to a folder will not automatically propagate to its children, because ACLs are stored at the individual file and folder level. This means that if you want to update your ACLs, you will need to apply the change to every single folder and file affected by it (billions, with a b).

Based on these facts, my advice is to think carefully about the permissions you will need ahead of time and apply them to higher-level folders. One possibility is to create Active Directory security groups that mirror the datasets you are trying to protect. For example, say you have a filesystem with a top-level folder containing numerous directories representing the source systems you are integrating, one example being Salesforce:

  • data/salesforce

What you could do is add the following security groups:

  • ADLS-SALESFORCE-READER
  • ADLS-SALESFORCE-CONTRIBUTOR

If an ACL is applied at the data/salesforce folder and also set as the default ACL, all folders and files created underneath it will inherit the appropriate permissions. It's much more efficient to modify a user or group's membership in a security group than to go update a billion files out on your lake.
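As a sketch of what that looks like in practice, the snippet below composes the access and default ACL entries for the two groups above (the object IDs are hypothetical placeholders; in a real deployment you would resolve them from Azure AD and then apply the resulting string to data/salesforce):

```python
def group_acl_entries(group_oid, perms):
    """Return the access entry plus the matching default entry for a group."""
    return [f"group:{group_oid}:{perms}", f"default:group:{group_oid}:{perms}"]

# Hypothetical object IDs for ADLS-SALESFORCE-READER / ADLS-SALESFORCE-CONTRIBUTOR:
reader = group_acl_entries("reader-oid", "r-x")
contributor = group_acl_entries("contributor-oid", "rwx")

# Full ACL string to apply once at data/salesforce; every new child inherits it:
acl = ",".join(reader + contributor)
print(acl)
```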

Don't Forget Execute

POSIX-style permissions include 'rwx', or Read-Write-Execute. When applying an ACL to a folder that sits under another folder, be sure to add the execute permission to every parent folder.

Let me illustrate: say you want to grant read permission on the following folder:

  • data/salesforce/2019/12

You need to apply 'r-x' to the folder listed above; however, you will also need to grant the execute permission '--x' on the following folders in order for callers to traverse down to it:

  • data
  • data/salesforce
  • data/salesforce/2019
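The list of parents that need '--x' is mechanical to derive, so a small helper (a sketch of the logic, not part of any Azure SDK) can generate it for any target path:

```python
def parents_needing_execute(path):
    """Return every ancestor folder of `path`, shallowest first.

    Each of these needs at least the execute ('--x') permission so a
    caller can traverse down to the target folder.
    """
    parts = path.strip("/").split("/")
    return ["/".join(parts[:i]) for i in range(1, len(parts))]

print(parents_needing_execute("data/salesforce/2019/12"))
# ['data', 'data/salesforce', 'data/salesforce/2019']
```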

Posix Update Methods

According to Microsoft's documentation found here, there are two main ways to update the ACLs on Azure Data Lake Gen 2: the Azure Storage Explorer or the REST API. Missing from these options are the other means we might be used to, such as a CLI or an SDK, so if you are looking to create or update permissions using Infrastructure as Code, be prepared to write some code of your own.
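If you do go the REST route, the call is a PATCH against the account's DFS endpoint with action=setAccessControl and the ACL carried in the x-ms-acl header. The sketch below only assembles the request pieces; authentication and the actual HTTP call are omitted, and the account name, filesystem name, and API version shown are assumptions for illustration:

```python
def build_set_acl_request(account, filesystem, path, acl):
    """Assemble the URL and headers for a Path Update (setAccessControl) call."""
    url = (f"https://{account}.dfs.core.windows.net/"
           f"{filesystem}/{path}?action=setAccessControl")
    headers = {
        "x-ms-acl": acl,               # the full ACL string to set
        "x-ms-version": "2019-07-07",  # assumed service API version
    }
    return "PATCH", url, headers

# Hypothetical account/filesystem names:
method, url, headers = build_set_acl_request(
    "mylakeaccount", "data", "salesforce",
    "user::rwx,group::r-x,other::---")
print(method, url)
```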

Organization

With the sheer volume of data being persisted in the Data Lake, you can quickly lose track of what's being added. Another benefit of securing your lake is that it forces you to think a bit more carefully about the structure of your data, which can help prevent a data swamp.

Image by Gerd Altmann from Pixabay