February 7, 2020
Before your big-brained data scientists can wring value out of your reams of data, that data has to be accessible and, on some basic level, coherently arranged. And while data architecture is certainly something any data scientist should at least be familiar with, few would count it among the tasks that drew them to the profession.
To harness all that brainpower, and to keep those data scientists from running elsewhere, you need to keep the data wrangling to a minimum. Enter the data lake, the catch-all buzz phrase we love to bandy about when talking about data that isn’t necessarily ready for prime time but will someday come in handy.
This is not to say that information stored in a data lake cannot be vital to success. But with the advent of limitless storage, the question of whether or not to retain data is less about it being vital now and more about making sure the data are accessible, properly secured, ready for analysis, and consistently defined.
The primary challenge facing any organization looking to stand up a data lake is where and how to store it. All of the major cloud providers offer the basic building blocks of a data lake.
These baseline features should be considered the minimum requirement when keeping more data for longer, especially data that may contain information sensitive to your business or customers.
But since these features have become commonplace, cloud practitioners want their data automatically categorized and immediately consumable by a wide variety of products. At Levvel, we frequently use AWS Glue to generate schemas for paths in S3 buckets. The result is not only a handy way to stitch together multiple AWS products but also a Hive metastore that can be referenced by external tools. Google and Azure have similar offerings in “Cloud Composer” and “Data Catalog,” respectively.
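As a rough sketch of what that looks like in practice (the bucket, database, and IAM role below are placeholders, not anything prescribed), creating and running a Glue crawler over an S3 prefix takes only a couple of boto3 calls:

```python
import boto3

# A minimal sketch: the bucket, path, database, and IAM role below are
# placeholders -- substitute your own resources.
glue = boto3.client("glue")

# Create a crawler that walks an S3 prefix and writes the inferred
# schemas into the Glue Data Catalog (a Hive-compatible metastore).
glue.create_crawler(
    Name="clickstream-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="data_lake",
    Targets={"S3Targets": [{"Path": "s3://example-data-lake/clickstream/"}]},
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "LOG",
    },
)

# Kick off the crawl; the resulting tables land in the Glue Data Catalog.
glue.start_crawler(Name="clickstream-crawler")
```

Once the crawl finishes, the inferred tables appear in the catalog and can be referenced from Athena, EMR, Redshift Spectrum, or any tool that understands a Hive metastore.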
Data lakes are organized like a file system. Paths determine the data set and may be used to partition data, as well. When coupled with a Hive metastore and a Hadoop cluster, you can execute traditional interactive queries or batch jobs on your data lake. If better performance is needed, a data warehouse such as AWS Redshift can quickly ingest data directly from S3.
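To make the file-system analogy concrete, here is a sketch of a Hive-style partitioned layout and an interactive Athena query against it; the bucket, table, and column names are hypothetical and carried over from the crawler sketch above:

```python
import boto3

# Hive-style partitioning encoded directly in the S3 path, for example:
#   s3://example-data-lake/clickstream/year=2020/month=02/day=07/part-0000.parquet
# Tools that read the Hive metastore treat year/month/day as partition
# columns and skip any prefixes a query does not need.

athena = boto3.client("athena")

# Run an interactive query against the cataloged table.
athena.start_query_execution(
    QueryString="""
        SELECT page, COUNT(*) AS views
        FROM clickstream
        WHERE year = '2020' AND month = '02'
        GROUP BY page
        ORDER BY views DESC
        LIMIT 20
    """,
    QueryExecutionContext={"Database": "data_lake"},
    ResultConfiguration={"OutputLocation": "s3://example-data-lake/athena-results/"},
)
```

Because the partition columns are encoded in the path, the query engine only reads the prefixes it needs, which keeps interactive queries over a large lake reasonably fast.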
Whatever process is generating this data is probably not thinking too hard about making life easy for data architects or data scientists. Whether the data is coming from IoT devices or application logs, it makes sense to send it through a buffer to collect a reasonable block of data, perform some transformations or enhancements, and then write to its final destination in the data lake. AWS Kinesis Firehose is a common tool for the job, able to buffer 1 to 128 megabytes before executing a serverless function to augment the data. Choosing your preferred compression format is as simple as checking a box. Learn more about serverless architectures with our webinar.
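To illustrate that buffering step, here is a minimal sketch of the transformation Lambda that Firehose invokes on each buffered batch; the timestamp enrichment is just a stand-in for whatever cleanup or augmentation your records actually need:

```python
import base64
import json
from datetime import datetime, timezone

def lambda_handler(event, context):
    """Firehose transformation handler: decode each buffered record,
    enrich it, and hand it back for delivery to the data lake."""
    output = []
    for record in event["records"]:
        # Firehose base64-encodes each record it buffers.
        payload = json.loads(base64.b64decode(record["data"]))

        # Example enrichment -- stamp each record with a processing time.
        payload["processed_at"] = datetime.now(timezone.utc).isoformat()

        output.append({
            "recordId": record["recordId"],
            "result": "Ok",  # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(
                (json.dumps(payload) + "\n").encode("utf-8")
            ).decode("utf-8"),
        })
    return {"records": output}
```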
Unsure if your business can benefit from a data lake? See our checklist to help you make that determination.
Authored By
Ben Hunter
Senior Cloud Consultant
Ben is a data scientist and AWS Certified Solutions Architect and Developer. As an analyst and data scientist, he has worked in the retail, banking, and automotive industries in consulting and practitioner capacities. In his work as a cloud consultant, he has advised Fortune 50 banks, written a Python library for multiple-account management, and created big-data and machine-learning pipelines for nationally recognized media brands. He holds an M.S. in Economics and lives in New York City.