There’s an exciting side to AI. Years of intense, multi-million-dollar research leading to a billion-dollar algorithmic breakthrough that keeps self-driving cars from crashing, or detects lung cancer better than ever, is the glamorous part of intelligent systems.
But 95% of the work in machine learning isn’t glamorous at all.
It’s hard work.
As Kenny Daniel, co-founder of Algorithmia, said, “TensorFlow is open-source, but scaling it is not.”
It’s long days experimenting with ideas. It’s crunching picture file sizes down as far as they will go without losing the features that a neural net can key in on. It’s waiting for systems to train over hours or days before you know if you got a good answer. And then doing it again. And again.
When you’re starting out, it’s easy to imagine that machine learning is effortless. Just grab a few open source tools, read some papers on arXiv, hire a few data scientists fresh out of school and you’re off and running. But the truth is that machine learning in production, at scale, is hard and getting harder every day.
To do the fun parts of AI, you need to do the hard work of managing all the pieces in the pipeline. For that you need tools and infrastructure. Infrastructure isn’t glamorous but it’s absolutely essential. You don’t build a city on quicksand. You need bedrock and a strong foundation.
Most of the breakthroughs in AI have come out of mega-tech companies like Google with its in-house research lab DeepMind. They’ve got incredibly fine-tuned, hyper-advanced infrastructure, and researchers can throw virtually any problem at the engineers manning it. If a key software piece a researcher needs doesn’t exist, those engineers can just build it in house.
But not everyone has an army of in-house programmers and a web-scale infrastructure they can modify on the fly. As AI trickles down into enterprises, companies struggle to cobble together a coherent stack of tools to keep their AI/ML pipelines running like a well-oiled machine. An explosion of open source projects, data lakes that have grown into data oceans, and a confusing array of complex infrastructure pieces only make it tougher.
And that’s what my work with the AI Infrastructure Foundation is all about: gathering together all the people, organizations, projects and companies that are building out the roads and plumbing of the AI-driven future.
To democratize AI and put it in the hands of companies that aren’t tech unicorns, we need the right tools to make it easy.
The biggest problem in that stack is data management. Data is the new oil and the new gold. And it’s super easy for data to get out of control.
We don’t have big data anymore, we have very, very big data that keeps growing by the nanosecond.
A data science team doing facial recognition on outdoor cameras may start with a dataset of a hundred terabytes but it won’t stay that way for long. Today’s smart devices are packed with sensors and telemetry systems that send info back in a constant stream. Soon that hundred terabytes has grown into petabytes and beyond.
Data science teams can’t manage that data on their own. They need to work hand in hand with IT Ops and software engineering from start to finish. Too many organizations started their data science efforts as a separate process from regular engineering tasks, and it doesn’t work.
As data grows out of control and machine learning models and tools proliferate, it’s essential to treat machine learning as another variation on traditional software engineering. That’s where Kubernetes and containers enter the ML workflow, along with DevOps practices, open source frameworks like Kubeflow, and tools like Pachyderm that act as “Git for data.”
When your code, your files, and your models are all changing at the same time, keeping track of what changed and the relationships between them all gets harder and harder. Pachyderm sprang from its founders’ years of experience with machine learning, including detecting money laundering schemes and other types of fraud.
As data scientists at the peer-to-peer housing pioneer built their models on top of incredibly complex and fragile multi-stage pipelines, the slightest change in the data brought the whole system crashing down. Pipelines often took hours to run, and when they failed, engineers burned more hours trying to trace the problem, only to find that a data path had moved or a directory had been renamed. Pachyderm solved that problem by keeping a perfect history of what data changed, where, and when.
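The core idea behind a “Git for data” tool can be sketched in a few lines: every commit snapshots the dataset’s state, is identified by a content hash, and remains recoverable forever, so a moved path or renamed directory is just another commit rather than a silent breakage. This is an illustrative toy, not Pachyderm’s actual implementation; the `DataRepo` class and all names in it are hypothetical.

```python
import hashlib
import json

class DataRepo:
    """Toy content-addressed data repository: each commit snapshots the
    full state of the dataset and is identified by a hash of its content."""

    def __init__(self):
        self.commits = []  # ordered list of (commit_id, message, snapshot)

    def commit(self, files, message=""):
        # files: dict mapping path -> content (strings here for simplicity)
        snapshot = dict(files)
        payload = json.dumps(snapshot, sort_keys=True).encode()
        commit_id = hashlib.sha256(payload).hexdigest()[:12]
        self.commits.append((commit_id, message, snapshot))
        return commit_id

    def checkout(self, commit_id):
        # Recover the exact dataset state at any past commit.
        for cid, _msg, snapshot in self.commits:
            if cid == commit_id:
                return dict(snapshot)
        raise KeyError(commit_id)

    def log(self):
        return [(cid, msg) for cid, msg, _snap in self.commits]

repo = DataRepo()
c1 = repo.commit({"train/a.csv": "1,2,3"}, "initial data")
# Renaming a file is just another commit; the old state stays recoverable.
c2 = repo.commit({"train/images_a.csv": "1,2,3"}, "rename a.csv")
assert repo.checkout(c1) == {"train/a.csv": "1,2,3"}
```

A pipeline stage that records which commit ID it read from can then reproduce, or debug, any past run exactly, which is the “perfect history” described above.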
Every data science team will struggle with this in the coming months and years. Every data science team needs to find a way to get better control over their ever-changing landscape before it breaks their own complex architectures or causes the kinds of errors they can’t easily chase down. Version control and knowing where data came from and where it’s going, aka “data provenance,” is critical for ethics too, which brings us full circle to where I started.
When our loan program starts denying loans to people named “Joe” and “Sally,” we need to track down the problem and roll back to an earlier snapshot of that model, or fire up a new training session, or quickly put in a rule to stop denying all the Joes and Sallys of the world a chance to buy that shiny new hybrid sedan to drive their new baby Jimmy around in.
More and more solutions like Pachyderm will pop up in the next few years. We’re already seeing tools like TensorFlow grow into TensorFlow Extended, born of Google’s own struggles with managing ever more complex machine learning pipelines.
All of these tools are evolving towards one thing: a canonical stack.
Without a strong foundation we can’t build a strong house.
The AI Infrastructure Foundation helps us get our hands around this evolving stack and drive it to deliver the tools we need to build the houses of tomorrow. It will make sure we’re all working together instead of at cross purposes. Too many organizations try to develop everything in house, and not-invented-here syndrome leads to lots of overlap. Better to adopt the mantra of the Kubeflow team: don’t try to invent everything yourself, but embrace a collection of the best tools and bring them together.
Lots of teams built their own AI studio because they had no choice, but these home-grown systems aren’t going to cut it over the long haul. Over the next few years, we’ll see an industry-standard machine learning stack solidify. Many different organizations will contribute a piece of the puzzle. Who those key players are is still very much in flux, but it will happen faster than anyone expects as organizations zero in on the best open source software, in house, in the cloud, and eventually in the fog too.
Be ready to adapt. As Bruce Lee said, “Be like water. You put water in a cup, it becomes the cup. You put water in a bottle, it becomes the bottle.”
Make sure your organization doesn’t hold on to those home grown systems while the rest of the world blows by them. Change with the circumstances and the times.
Be flexible and willing to adopt new tools as they mature.