February 7, 2020
Before your big-brained data scientists wring value out of your reams of data, it has to be accessible and, on some basic level, coherently arranged. And while data architecture is certainly something any data scientist should at least be familiar with, few would consider it among the tasks that drew them to the profession.
To harness all that brainpower, and keep your data scientists from running elsewhere, you need to keep the data wrangling to a minimum. Enter the data lake, the catch-all buzz phrase we love to bandy about when talking about data that isn’t necessarily ready for prime time but will someday come in handy.
This is not to say that information stored in a data lake cannot be vital to success. But with the advent of limitless storage, the question of whether or not to retain data is less about it being vital now and more about making sure the data are accessible, properly secured, ready for analysis, and consistently defined.
The primary challenge facing any organization looking to stand up a data lake is where and how to store the data. All major cloud providers offer the basics of a data lake:
These three features should be considered the base requirement when you keep more data, for longer, that may contain information sensitive to your business or customers.
But since these features have become commonplace, cloud practitioners want their data automatically categorized and immediately consumable by a wide variety of products. At Levvel, we frequently use AWS Glue to generate schemas for paths in S3 buckets. The result is not only a handy way to stitch together multiple AWS products but also a Hive metastore that can be referenced by external tools. Google and Azure have similar offerings in “Cloud Composer” and “Data Catalog,” respectively.
Data lakes are organized like a file system. Paths determine the data set and may be used to partition data, as well. When coupled with a Hive metastore and a Hadoop cluster, you can execute traditional interactive queries or batch jobs on your data lake. If better performance is needed, a data warehouse such as AWS Redshift can quickly ingest data directly from S3.
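To make the path-as-partition idea concrete, here is a minimal sketch in Python. It assumes the Hive-style `name=value` path convention, which is one common way to lay out partitions, not something this article prescribes. The function names, the `clickstream` data set, and the file name are hypothetical.

```python
from datetime import datetime, timezone


def partitioned_key(dataset: str, event_time: datetime, filename: str) -> str:
    """Build an object key whose path encodes the data set and date partitions."""
    t = event_time.astimezone(timezone.utc)
    # The first path segment names the data set; the name=value segments
    # that follow are the partitions (an assumed Hive-style layout).
    return f"{dataset}/year={t:%Y}/month={t:%m}/day={t:%d}/{filename}"


def parse_partitions(key: str) -> dict:
    """Recover partition values from the name=value segments of a key."""
    # Skip the first segment (data set) and the last one (file name).
    segments = key.split("/")[1:-1]
    return dict(segment.split("=", 1) for segment in segments if "=" in segment)


key = partitioned_key(
    "clickstream",  # hypothetical data set name
    datetime(2020, 2, 7, 15, 30, tzinfo=timezone.utc),
    "part-0000.json.gz",  # hypothetical file name
)
print(key)
print(parse_partitions(key))
```

With a layout like this, a query that filters on year, month, or day only has to read the objects under the matching paths.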
Whatever process generates this data is probably not thinking too hard about making life easy for data architects or data scientists. Whether the data comes from IoT devices or application logs, it makes sense to send it through a buffer that collects a reasonable block of data, perform some transformations or enhancements, then write the result to its final destination in the data lake. AWS Kinesis Firehose is a common tool for the job, able to buffer 1 to 128 megabytes before executing a serverless function to augment the data. Choosing your preferred compression format is as simple as checking a box. Learn more about serverless architectures with our webinar.
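The serverless augmentation step might look like the following minimal sketch of a Lambda function in Python. It assumes each incoming record is a JSON object. The `processed_at` field is a hypothetical enhancement, and the handler name is just the conventional default. The event shape (base64-encoded `data`, a `recordId`, and a `result` of `Ok` or `ProcessingFailed`) follows the Firehose data-transformation contract.

```python
import base64
import json
from datetime import datetime, timezone


def lambda_handler(event, context):
    """Augment each buffered record and hand it back to Firehose."""
    output = []
    for record in event["records"]:
        try:
            # Firehose delivers each payload base64-encoded.
            payload = json.loads(base64.b64decode(record["data"]))
            # Hypothetical enhancement: stamp the record with a processing time.
            payload["processed_at"] = datetime.now(timezone.utc).isoformat()
            # Newline-delimit records so downstream readers can split them.
            line = json.dumps(payload) + "\n"
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(line.encode("utf-8")).decode("utf-8"),
            })
        except (ValueError, TypeError):
            # Not valid base64 or JSON, or not a JSON object: flag the record
            # as failed and return the original data untouched.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

Returning every `recordId` with an explicit result lets Firehose tell the records that were transformed apart from the ones that failed.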
Unsure if your business can benefit from a data lake? See our checklist to help you make that determination.
Senior Cloud Consultant
Ben is a data scientist and AWS Certified Solutions Architect and Developer. As an analyst and data scientist, he has worked in the retail, banking, and automotive industries in consulting and practitioner capacities. In his work as a cloud consultant, he has advised Fortune 50 banks, written a Python library for multiple-account management, and created big-data and machine-learning pipelines for nationally recognized media brands. He holds an M.S. in Economics and lives in New York City.