What is a Data Lake? A Primer on Big Data Storage

Blog

February 7, 2020

Introduction to Data Lakes

Before your big-brained data scientists can wring value out of your reams of data, that data has to be accessible and, on some basic level, coherently arranged. And while data architecture is certainly something any data scientist should be familiar with, few would count it among the tasks that drew them to the profession.

To harness all that brainpower, and to keep those scientists from taking it elsewhere, you need to keep the data wrangling to a minimum. Enter the data lake, the catch-all buzz phrase we love to bandy about when talking about data that isn’t necessarily ready for prime time but will someday come in handy.

This is not to say that information stored in a data lake cannot be vital to success. But with the advent of limitless storage, the question of whether or not to retain data is less about it being vital now and more about making sure the data are accessible, properly secured, ready for analysis, and consistently defined.

Durable Storage and Data Definition

The primary challenge facing any organization looking to stand up a data lake is where and how to store it. All major cloud providers offer the basics of a data lake:

  • Limitless object storage
  • High durability and availability
  • Flexible, fine-grained access control

These three features should be considered the baseline requirements for retaining more data for longer, especially when that data may contain information sensitive to your business or your customers.

Now that these features have become commonplace, cloud practitioners also want their data automatically categorized and immediately consumable by a wide variety of products. At Levvel, we frequently use AWS Glue to generate schemas for paths in S3 buckets. The result is not only a handy way to stitch together multiple AWS products but also a Hive metastore that can be referenced by external tools. Google and Azure have comparable offerings: Cloud Composer covers the orchestration side of what Glue does, while each cloud’s Data Catalog service covers the metadata side.
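As a sketch of the Glue approach above, a crawler pointed at an S3 prefix can populate the Glue Data Catalog on a schedule. The bucket, IAM role, and database names below are placeholders, not real resources:

```python
# Placeholder names -- substitute your own bucket, role, and database.
CRAWLER_CONFIG = {
    "Name": "data-lake-crawler",
    "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",
    "DatabaseName": "data_lake",
    # Each S3 path the crawler scans becomes one or more catalog tables.
    "Targets": {"S3Targets": [{"Path": "s3://my-data-lake/raw/events/"}]},
    # Run daily so newly landed data is cataloged automatically.
    "Schedule": "cron(0 6 * * ? *)",
}

def create_schema_crawler(config=CRAWLER_CONFIG):
    """Register the crawler; on its next run it infers schemas for the
    S3 paths and writes the resulting tables to the Glue Data Catalog."""
    import boto3  # local import so the sketch reads without the SDK installed
    boto3.client("glue").create_crawler(**config)
```

Once the crawler has run, the inferred tables are queryable from Athena, EMR, or any tool that can read a Hive metastore.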

Data Processing

Data lakes are organized like a file system. Paths determine the data set and may be used to partition data, as well. When coupled with a Hive metastore and a Hadoop cluster, you can execute traditional interactive queries or batch jobs on your data lake. If better performance is needed, a data warehouse such as AWS Redshift can quickly ingest data directly from S3.
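To make the path-as-partition idea concrete, here is a minimal sketch (bucket and table names are hypothetical) of building a Hive-style partitioned S3 key, where each `key=value` path segment becomes a partition column that query engines can prune on:

```python
from datetime import date

def partition_path(bucket: str, table: str, d: date, filename: str) -> str:
    """Build a Hive-style partitioned S3 key: year/month/day path segments
    become partition columns that engines like Hive or Athena can prune on."""
    return (
        f"s3://{bucket}/{table}/"
        f"year={d.year}/month={d.month:02d}/day={d.day:02d}/{filename}"
    )

# e.g. partition_path("my-data-lake", "events", date(2020, 2, 7), "part-0000.parquet")
# -> "s3://my-data-lake/events/year=2020/month=02/day=07/part-0000.parquet"
```

A query filtering on `year` and `month` then reads only the matching prefixes instead of scanning the whole data set.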

Serverless Options

Whatever process is generating this data is probably not thinking too hard about making life easy for data architects or data scientists. Whether the data comes from IoT devices or application logs, it makes sense to send it through a buffer that collects a reasonable block of data, performs some transformations or enhancements, and then writes to its final destination in the data lake. Amazon Kinesis Data Firehose is a common tool for the job, able to buffer 1 to 128 megabytes of data before invoking a serverless function to augment it. Choosing your preferred compression format is as simple as checking a box. Learn more about serverless architectures with our webinar.
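The transformation step above is typically a Lambda function that Firehose invokes on each buffered batch. A minimal sketch, assuming the standard Firehose transformation event shape (base64-encoded records in, records with a `result` status out) and a hypothetical enrichment that stamps each record with a processing timestamp:

```python
import base64
import json
from datetime import datetime, timezone

def handler(event, context):
    """Firehose invokes this with a batch of base64-encoded records; each
    must come back with the same recordId and a result of Ok, Dropped, or
    ProcessingFailed."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        # Hypothetical enrichment: tag each record with its processing time.
        payload["processed_at"] = datetime.now(timezone.utc).isoformat()
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(json.dumps(payload).encode()).decode(),
        })
    return {"records": output}
```

Firehose then applies your chosen compression and writes the transformed batch to its S3 destination.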

Unsure if your business can benefit from a data lake? See our checklist to help you make that determination.

Authored By

Ben Hunter

Senior Cloud Consultant


