February 9, 2017
Working with clients gives us a great opportunity to apply new technologies to solve real problems. Combine opportunity with a love for the bleeding edge, and it makes for some engaging dialogs on how to make a business more effective. Today’s article is the first part in a series discussing how to use machine learning.
To get started in machine learning you need to know:
Data science used to come with a steep learning curve and required expensive expertise to build systems and models that make predictions. Teams taking this approach would end up supporting a convolutional neural network (CNN), a recurrent neural network (RNN), or other machine learning algorithms on their own. Building, supporting, and re-validating model accuracy with an in-house solution is an expensive venture.
With the emergence of competition-winning open source algorithms and frameworks, you no longer need a PhD in statistics to be effective. Popular frameworks like TensorFlow give DevOps engineers a scalable foundation to quickly start building an intelligent application stack. A team can use all FOSS tools or Amazon Machine Learning to start figuring out how to make better predictions. Hypothetically, an organization might want to increase campaign conversions, hit better click-through rates, build a recommendation engine, or use historical data to forecast events such as fraud. These are very different end goals, but at the core, the machine learning tools will mostly be the same. This leaves your team time to focus on understanding how to create a quality, unbiased dataset to improve your product, experience, and platform.
Determining how predictive your data is and what features can be built to help improve prediction accuracy is a journey, not a few sprints.
DevOps adoption has made software teams more effective by helping streamline deployments. Teams looking to increase their data science productivity should leverage these CI/CD tools to build predictive artifacts from a repository commit webhook, just like software teams. Doing so enables an organization to focus on defining the predictive dataset, benchmark which algorithms and datasets are effective, and deploy the best-performing models across any environment.
DevOps data science tooling is an emerging ecosystem, and we can help your teams build a data science artifact pipeline that is set up to continually adjust, test, and refine predictions. A data science pipeline gives an organization time to focus on refining intelligent features to hit better predictive success rates. By building an effective data science pipeline, your data science team can focus on testing new ideas, not deploying intelligent infrastructure.
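As a rough illustration of that build step, here is a minimal sketch of what a pipeline job might do once a commit webhook fires: train each candidate, benchmark it on held-out data, and package the winner as a versioned artifact. Everything here is an assumption made for illustration: the two toy models, the `build_artifact` function, the commit id `abc123`, and the sample rows are hypothetical, and a real pipeline would train real models and push the artifact to a store.

```python
import hashlib
import json


def majority_model(rows):
    """Baseline candidate: always predict the most common training label."""
    labels = [label for _, label in rows]
    return {"kind": "majority", "guess": max(set(labels), key=labels.count)}


def threshold_model(rows):
    """Candidate: predict 1 when the single feature exceeds the best cutoff."""
    best_cut, best_hits = None, -1
    for cut, _ in rows:
        hits = sum((x > cut) == bool(y) for x, y in rows)
        if hits > best_hits:
            best_cut, best_hits = cut, hits
    return {"kind": "threshold", "cut": best_cut}


def predict(model, x):
    if model["kind"] == "majority":
        return model["guess"]
    return int(x > model["cut"])


def accuracy(model, rows):
    return sum(predict(model, x) == y for x, y in rows) / len(rows)


def build_artifact(commit_sha, train, holdout):
    """Train every candidate, benchmark on holdout data, package the winner."""
    candidates = {"majority": majority_model, "threshold": threshold_model}
    models = {name: trainer(train) for name, trainer in candidates.items()}
    scores = {name: accuracy(model, holdout) for name, model in models.items()}
    winner = max(scores, key=scores.get)
    payload = json.dumps(models[winner], sort_keys=True)
    return {
        "commit": commit_sha,  # ties the artifact back to the triggering commit
        "winner": winner,
        "scores": scores,
        "model": payload,
        "checksum": hashlib.sha256(payload.encode()).hexdigest(),
    }


# Hypothetical (feature, label) rows standing in for a real dataset.
train = [(0.1, 0), (0.4, 0), (0.5, 0), (0.7, 1), (0.8, 1), (0.9, 1), (0.3, 0)]
holdout = [(0.2, 0), (0.6, 1), (0.95, 1), (0.45, 0)]

artifact = build_artifact("abc123", train, holdout)
print(artifact["winner"], artifact["scores"])
```

Recording the commit id and a checksum alongside the model is one way to make the same artifact reproducible and deployable across environments; the exact metadata a team needs will vary.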
A recent study found a general trend that large companies using machine learning are targeting higher sales growth and, more importantly, understand their data better than before. Like most things, an organization will get out of machine learning what it puts in. Building and supporting a predictive model is like always trying to find a better unicorn. The more you start to see how data can be used to predict an event, the more you will want to build a better mousetrap. Tools like eXtreme gradient boosting (XGB), TensorFlow, and MXNet are great starting points for teams looking to dive into the FOSS ecosystem.
Building that first predictor model is no harder than building a compiled binary, but initial predictive accuracy will likely be pretty low. To increase predictive success, an organization can leverage feature engineering or use deep learning to find hidden relationships in the data. Feature engineering is a science and an art I will reserve for an individual article, but for now, think of it as building a component signal for all or part of a prediction. Feature engineering can also hurt a model through overfitting and bias, which is why it is important to choose tools that support model evaluation so biased features can be pruned. The gist of all this is: if you can identify signals more accurately, you can make better predictions.
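To make the idea of an engineered signal concrete, here is a small sketch that compares a raw column against a derived one and uses held-out data to decide which to keep. The fraud-style rows, the `amount_vs_average` feature, and the 0.1 gap used for pruning are all hypothetical choices for illustration, and a dataset this small only demonstrates the mechanics.

```python
def best_cutoff(values, labels):
    """Pick the training cutoff that classifies the most rows correctly."""
    def hits(cut):
        return sum((v > cut) == bool(y) for v, y in zip(values, labels))
    return max(sorted(set(values)), key=hits)


def accuracy(values, labels, cut):
    """Share of rows where 'value above cutoff' matches the label."""
    return sum((v > cut) == bool(y) for v, y in zip(values, labels)) / len(labels)


# Hypothetical rows: (amount, account_average, is_fraud)
train = [(900, 100, 1), (120, 100, 0), (5000, 6000, 0), (4000, 500, 1),
         (60, 50, 0), (700, 90, 1), (3000, 2800, 0), (45, 40, 0)]
holdout = [(800, 120, 1), (6000, 5500, 0), (200, 30, 1), (150, 140, 0)]

features = {
    # Raw column used as-is.
    "amount": lambda row: row[0],
    # Engineered signal: how unusual is this amount for this account?
    "amount_vs_average": lambda row: row[0] / row[1],
}

report = {}
for name, extract in features.items():
    train_x = [extract(r) for r in train]
    train_y = [r[2] for r in train]
    cut = best_cutoff(train_x, train_y)
    report[name] = {
        "train": accuracy(train_x, train_y, cut),
        "holdout": accuracy([extract(r) for r in holdout],
                            [r[2] for r in holdout], cut),
    }

# Prune features whose training score does not carry over to unseen data.
kept = [name for name, s in report.items() if s["train"] - s["holdout"] <= 0.1]
print(report)
print("kept:", kept)
```

In this toy data the raw amount looks useful on the training rows but drops to coin-flip accuracy on the holdout rows, while the ratio holds up, so only the engineered feature survives the pruning step.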
If your organization has enough data, you can leverage deep learning to help find hidden relationships. If not, you can use algorithms like XGB, which uses gradient boosting to reduce error and can rank features by importance. I found XGB to be a great starting point for machine learning. Under the hood, it is a highly tunable algorithm that supports running in parallel. Turn a few dials and XGB can quickly train large models for making predictions.
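The sketch below is a toy, standard-library-only illustration of the gradient boosting idea, not XGB itself: each round fits a one-split tree (a stump) to the current residuals, and features are ranked by the squared-error reduction their splits contribute. XGB adds regularization, second-order gradient information, deeper trees, and parallelism on top of this. The function names, the `rounds` and `learning_rate` values, and the data (column 0 drives the target, column 1 is noise) are assumptions made for the example.

```python
def stump_value(stump, row):
    """Return the stump's prediction for one row."""
    if row[stump["feature"]] <= stump["threshold"]:
        return stump["left"]
    return stump["right"]


def fit_stump(X, residuals):
    """Find the one-split tree that best fits the residuals (squared error)."""
    base_error = sum(r * r for r in residuals)  # error if we changed nothing
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[feature] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[feature] > threshold]
            if not left or not right:
                continue
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            error = (sum((r - left_mean) ** 2 for r in left)
                     + sum((r - right_mean) ** 2 for r in right))
            if best is None or error < best["error"]:
                best = {"feature": feature, "threshold": threshold,
                        "left": left_mean, "right": right_mean, "error": error}
    best["gain"] = base_error - best["error"]
    return best


def boost(X, y, rounds=20, learning_rate=0.3):
    """Fit stumps one after another, each on the previous rounds' residuals."""
    mean = sum(y) / len(y)
    preds = [mean] * len(y)
    stumps = []
    importance = [0.0] * len(X[0])
    for _ in range(rounds):
        # For squared-error loss the negative gradient is just the residual.
        residuals = [target - p for target, p in zip(y, preds)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        importance[stump["feature"]] += stump["gain"]
        preds = [p + learning_rate * stump_value(stump, row)
                 for p, row in zip(preds, X)]
    return {"mean": mean, "stumps": stumps, "rate": learning_rate,
            "importance": importance}


def predict(model, row):
    return model["mean"] + sum(model["rate"] * stump_value(s, row)
                               for s in model["stumps"])


# Made-up data: column 0 determines the target, column 1 is unrelated noise.
X = [[i, (i * 7) % 5] for i in range(10)]
y = [1.0 if i >= 5 else 0.0 for i in range(10)]

model = boost(X, y)
ranking = sorted(range(len(model["importance"])),
                 key=lambda f: model["importance"][f], reverse=True)
print("features ranked by importance:", ranking)
```

The learning rate and number of rounds are two of the "dials": a smaller rate needs more rounds to reach the same fit but tends to be less prone to overfitting on real data.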
Choosing the right tools, frameworks, and pipelines will enable your organization to start small and scale into larger problems requiring more data and more processing power (even running on GPUs). These tools, combined with cloud-managed services, are making this discussion easier month by month.
The more data you can collect, the more you can use to make predictions. Customizable, competition-winning algorithms like XGB are great starting places, but at the core, all machine learning models need quality predictive data to train and learn. Predictive data comes in many forms, and if your organization is concerned about having limited data, you can build tunable features to help carve out better success rates while accounting for overfitting and bias.
If your organization needs help with machine learning, building model pipelines, distributed model caching, building a remote data science store, or feature engineering, please reach out to us at Levvel, and we’d be happy to get you started.
Until next time,