February 7, 2020
Before your big-brained data scientists can wring value out of your reams of data, that data has to be accessible and, on some basic level, coherently arranged. And while data architecture is certainly something any data scientist should at least be familiar with, few would consider it among the tasks that drew them to the profession.
To harness all that brainpower, and keep them from running elsewhere, you need to keep the data wrangling to a minimum. Enter the data lake, the catch-all buzz phrase we love to bandy about when talking about data that isn’t necessarily ready for prime time but will someday come in handy.
This is not to say that information stored in a data lake cannot be vital to success. But with the advent of limitless storage, the question of whether or not to retain data is less about it being vital now and more about making sure the data are accessible, properly secured, ready for analysis, and consistently defined.
The primary challenge facing any organization looking to stand up a data lake is where and how to store it. All major cloud providers offer the basics of a data lake, and those basics should be considered the minimum requirement when retaining more data for longer, particularly data that may contain information sensitive to your business or customers.
But since these features have become commonplace, cloud practitioners now expect their data to be automatically categorized and immediately consumable by a wide variety of products. At Levvel, we frequently use AWS Glue to generate schemas for paths in S3 buckets. The result is not only a handy way to stitch together multiple AWS products but also a Hive metastore that can be referenced by external tools. Google and Azure have similar offerings in Cloud Data Catalog and Azure Data Catalog, respectively.
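As a sketch of what this looks like in practice, the snippet below builds the request for a Glue crawler that infers a schema for objects under an S3 path and registers it in the Glue Data Catalog. The bucket, database, crawler, and IAM role names are all hypothetical placeholders; running the final step requires real AWS credentials with Glue permissions.

```python
def crawler_config(name, role_arn, database, s3_path):
    """Build the create_crawler request for a Glue crawler that
    infers a schema for objects under the given S3 path."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }

def create_and_run_crawler(config):
    """Create the crawler and start a run. Requires AWS credentials
    with Glue permissions; boto3 is imported here so the rest of the
    module works without it."""
    import boto3
    glue = boto3.client("glue")
    glue.create_crawler(**config)
    glue.start_crawler(Name=config["Name"])

# Hypothetical names for illustration only.
config = crawler_config(
    name="clickstream-crawler",
    role_arn="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    database="data_lake",
    s3_path="s3://example-data-lake/clickstream/",
)
# create_and_run_crawler(config)  # uncomment with real credentials
```

Once the crawler has run, the resulting table is queryable from Athena, EMR, or any external tool that can talk to a Hive metastore.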
Data lakes are organized like a file system. Paths determine the data set and may be used to partition data, as well. When coupled with a Hive metastore and a Hadoop cluster, you can execute traditional interactive queries or batch jobs on your data lake. If better performance is needed, a data warehouse such as AWS Redshift can quickly ingest data directly from S3.
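The path-based partitioning mentioned above usually follows the Hive-style `key=value` directory convention, which tools like Athena or a Hive metastore recognize and use to prune partitions at query time. A minimal sketch, with a hypothetical `clickstream` table name:

```python
from datetime import date

def partitioned_key(table: str, day: date, filename: str) -> str:
    """Build a Hive-style partitioned object key. Each key=value
    directory level becomes a partition column, so queries that
    filter on year/month/day only scan the matching prefixes."""
    return (
        f"{table}/year={day.year}/month={day.month:02d}/"
        f"day={day.day:02d}/{filename}"
    )

key = partitioned_key("clickstream", date(2020, 2, 7), "events-0001.json.gz")
# key == "clickstream/year=2020/month=02/day=07/events-0001.json.gz"
```

A query such as `SELECT count(*) FROM clickstream WHERE year = 2020 AND month = 2` can then skip every object outside the `year=2020/month=02/` prefix, which is where most of the cost and latency savings come from.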
Whatever process is generating this data is probably not optimized to make life easy for data architects or data scientists. Whether the data comes from IoT devices or application logs, it makes sense to send it through a buffer that collects a reasonable block of data, performs any transformations or enrichments, and then writes to its final destination in the data lake. AWS Kinesis Data Firehose is a common tool for the job, able to buffer 1 to 128 megabytes before invoking a serverless function to augment the data. Choosing your preferred compression format is as simple as checking a box. Learn more about serverless architectures with our webinar.
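The serverless augmentation step above is typically a Lambda function wired into the Firehose delivery stream. Firehose hands the function a batch of base64-encoded records and expects each one back with a `result` of `Ok` (keep) or `Dropped`. The sketch below decodes each record, adds a hypothetical `source` field, and re-encodes it; the enrichment itself is a placeholder for whatever your pipeline needs.

```python
import base64
import json

def handler(event, context):
    """Kinesis Data Firehose transformation Lambda: decode each
    buffered record, enrich it, and re-encode it for delivery."""
    out = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        payload["source"] = "iot-fleet-a"  # hypothetical enrichment
        data = base64.b64encode((json.dumps(payload) + "\n").encode())
        out.append({
            "recordId": record["recordId"],  # must echo the incoming id
            "result": "Ok",                  # "Dropped" would discard it
            "data": data.decode(),
        })
    return {"records": out}
```

The trailing newline keeps delivered records line-delimited, which plays nicely with downstream tools that expect newline-separated JSON in S3.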
Unsure if your business can benefit from a data lake? See our checklist to help you make that determination.
Senior Cloud Consultant
Ben is a data scientist and an AWS Certified Solutions Architect and Developer. As an analyst and data scientist, he has worked in the retail, banking, and automotive industries in consulting and practitioner capacities. In his work as a cloud consultant, he has advised Fortune 50 banks, written a Python library for multiple-account management, and created big-data and machine-learning pipelines for nationally recognized media brands. He holds an M.S. in Economics and lives in New York City.