Snorkel.ai: Unlocking Subject Matter Experts to make Software 2.0 [Alex Ratner]
Source: https://www.thecloudcast.net/2021/06/automated-data-labeling-for-ai-apps.html
See also: https://softwareengineeringdaily.com/2020/04/09/snorkel-training-dataset-management-with-braden-hancock/
Software 2.0 is Andrej Karpathy's idea that instead of coding business logic by hand, the applications of the future will be trained on data. In other words, machine learning. But ML is limited by the quality of the data available, and there is a lot of unstructured, unlabeled data out there that is still being labeled manually today. Scale.AI is a well-known startup that has done very well offering a scalable manual labeling workforce, but that approach is still bottlenecked by the number of subject matter experts available to label critically important data, such as cancer diagnoses and drug trafficking rings. To get labels from subject matter experts, you typically have to put them through a very tedious labeling process to build up a structured dataset upfront, before any useful machine learning can be done.
I did some very minor ML work about 5 years ago and found Christopher Ré's work on DeepDive at Stanford. It takes a revolutionary approach: instead of labeling examples one by one, it makes it easy to write the labeling functions themselves. This turns labeling into an iterative, REPL-like experience where subject matter experts can suggest a function, see its impact right away, and continue refining it, assisted by AI. That line of work, continued in the Stanford Snorkel project, is now commercialized in a startup called Snorkel.AI, so I was very excited to find a clear explanation of Snorkel Flow from its CEO, Alex Ratner.
Here it is!
Transcript
[00:01:15] Alex Ratner: Snorkel Flow is a platform that's meant to take this process of building machine learning models and AI applications, and again, all starting with building the data that they rely on, that fuels them, and make it, in a nutshell, look more like an iterative software development process than this kind of 80-90% upfront hand-labeling exercise.
[00:01:34] And so Snorkel Flow supports that entire iterative loop of actually labeling data. That can be by hand in the platform, but also, most centrally, programmatically, by letting users write what we call labeling functions. The basic idea is that rather than, say, asking your legal associate at a bank or your doctor friends to sit down and label a hundred thousand contracts or a hundred thousand electronic health records, you have them write
[00:02:00] heuristics, or bits of their expertise: look for this keyword, or look for this pattern, or look for this, et cetera. It's like a bridge from old expert-knowledge-type input to modern machine learning models, using one to power the other. So Snorkel Flow is an IDE, basically, and has a no-code UI component as well, but it lets people, either via code or by pushing buttons, for even non-developer subject matter experts, say,
[00:02:24] programmatically label their data by writing these labeling functions, and then uses a bunch of modeling techniques, a lot of which was actually the work that the co-founding team and I did in our thesis work, around how you take a bunch of programmatic data and clean it up and turn it into a final
[00:02:41] set of clean training data for machine learning models. And then, actually, in Snorkel Flow, you can basically push-button train best-in-class open source models. You can then analyze where they're succeeding or failing and use that to go back and iterate on your data.
[00:02:54] And there's a Python SDK throughout the whole thing, so many of our customers will mix and match. They'll start, create the training data set, and then train the model on some other system, et cetera. But what Snorkel Flow aims to support is this basic iterative development process where, rather than just spending months to label a training set once and then being stuck with it, and having to throw it out and start all over again any time anything in the world changes, your upstream input data changes, your downstream objectives
[00:03:18] change, you make it, again, more like an iterative process where you push some buttons or write some code that labels the data. You compile a model, or train it, but you can think of it like compiling, and then you go back and debug by iterating on your data. Everything in Snorkel Flow centers around looking at your data and iterating on how it's labeled to improve models.
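To make the labeling-function idea concrete, here is a minimal sketch in plain Python. It does not use Snorkel Flow or the Snorkel SDK: the label scheme, the function names, and the sample documents are all hypothetical, and the simple majority vote is only a stand-in for the modeling techniques Alex mentions for cleaning up programmatic labels.

```python
import re
from collections import Counter

# Hypothetical label scheme for a toy contract classifier.
ABSTAIN, NOT_LEASE, LEASE = -1, 0, 1

def lf_keyword_lease(text):
    """Keyword heuristic: tenants and landlords suggest a lease."""
    lowered = text.lower()
    return LEASE if "tenant" in lowered or "landlord" in lowered else ABSTAIN

def lf_pattern_employment(text):
    """Pattern heuristic: employment vocabulary suggests it is not a lease."""
    if re.search(r"\bemploy(ee|er|ment)\b", text, re.IGNORECASE):
        return NOT_LEASE
    return ABSTAIN

def lf_keyword_rent(text):
    """Keyword heuristic: 'monthly rent' suggests a lease."""
    return LEASE if "monthly rent" in text.lower() else ABSTAIN

LABELING_FUNCTIONS = [lf_keyword_lease, lf_pattern_employment, lf_keyword_rent]

def majority_vote(votes):
    """Combine votes, ignoring abstentions; ties and empty votes abstain."""
    counted = Counter(v for v in votes if v != ABSTAIN)
    if not counted:
        return ABSTAIN
    ranked = counted.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]

def label_documents(docs, lfs=LABELING_FUNCTIONS):
    """Apply every labeling function to every document and combine the votes."""
    return [majority_vote([lf(doc) for lf in lfs]) for doc in docs]

def coverage(docs, lf):
    """Fraction of documents on which a labeling function does not abstain."""
    return sum(lf(doc) != ABSTAIN for doc in docs) / len(docs)

# Made-up sample documents, for illustration only.
DOCS = [
    "The tenant shall pay monthly rent to the landlord.",
    "The employee agrees to the terms of employment.",
    "This agreement is governed by the laws of Ohio.",
]

if __name__ == "__main__":
    print(label_documents(DOCS))
    for lf in LABELING_FUNCTIONS:
        print(lf.__name__, coverage(DOCS, lf))
```

The iterative loop described above corresponds to reading the output, noticing which documents are still unlabeled or mislabeled, then editing or adding a labeling function and rerunning, rather than relabeling examples by hand.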
[00:03:38] Brian Gracely: I'm curious. You mentioned in there that there's a Python SDK, which, for anybody who works in data science or data modeling, right, Python is your lingua franca, sort of the language you use, or R, a couple of them; that's how you do your programming. But I'm curious, in today's world, do data scientists consider themselves programmers? Or is there still, "Hey, look, I work on the numbers, I'm good at building models and the numbers, but I don't think of myself as a programmer"?
[00:04:08] Like, how do you bridge those two worlds together, or do you not really have to bridge them together? How much does the data scientist have to go, "I have to focus on numbers and models" versus "I have to focus on programming something to do stuff"? What does their world look like?
[00:04:21] Alex Ratner: It's a great question. I think I have been, or currently am, part of four or five different data science institutes or something, and I still don't even know. I mean, data science is such a broad umbrella term. There are so many different varietals of us, and types.
[00:04:35] And so I do think there's a very broad spectrum, from the data scientist or ML engineer who just loves writing code to the one that, to your point, really just wants to push some buttons and get back to the numbers and the modeling and the outcome. And we definitely try to support the range through a layered approach.
[00:04:50] And we have the Python SDK, but on top of that, we have a no-code UI that allows you to write these labeling functions without writing code. So, for example, if you're trying to train a contract classifier in Snorkel Flow, you can write labeling functions based on clicking on keywords or pressing buttons, with kind of templates for types of patterns or signals you want to look for.
[00:05:11] So we try to support, basically, if you want to move fast and you're a non-developer, or you're just not looking to spend time there, you can just do it in a push-button way. But then if you want to go and customize or inject custom logic or really get creative, you can always fall back to the Python SDK.
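One way to picture this layering is a small set of templates that turn a keyword or pattern into a labeling function, with hand-written functions as the fallback for custom logic. The sketch below is plain Python under that assumption; `keyword_template`, `regex_template`, the label values, and the example rules are hypothetical names invented here, not Snorkel Flow's UI or SDK.

```python
import re
from typing import Callable, Iterable

# Hypothetical label scheme for a toy contract classifier.
ABSTAIN, OTHER, NDA = -1, 0, 1

LabelingFunction = Callable[[str], int]

def keyword_template(keywords: Iterable[str], label: int) -> LabelingFunction:
    """Template: vote `label` if any keyword appears (case-insensitive)."""
    lowered = [k.lower() for k in keywords]

    def lf(text: str) -> int:
        haystack = text.lower()
        return label if any(k in haystack for k in lowered) else ABSTAIN

    return lf

def regex_template(pattern: str, label: int) -> LabelingFunction:
    """Template: vote `label` if the regular expression matches anywhere."""
    compiled = re.compile(pattern, re.IGNORECASE)

    def lf(text: str) -> int:
        return label if compiled.search(text) else ABSTAIN

    return lf

# "Push-button" style: a point-and-click UI could emit rules like these.
lf_nda_keywords = keyword_template(["non-disclosure", "confidential"], NDA)
lf_invoice_amount = regex_template(r"invoice.*\$\d+", OTHER)

# "Fall back to code" style: custom logic a template cannot express.
def lf_mutual_confidentiality(text: str) -> int:
    """Vote NDA only when both parties and confidentiality are mentioned."""
    lowered = text.lower()
    if "both parties" in lowered and "confidential" in lowered:
        return NDA
    return ABSTAIN

if __name__ == "__main__":
    sample = "Both parties agree to keep all shared materials confidential."
    for lf in (lf_nda_keywords, lf_invoice_amount, lf_mutual_confidentiality):
        print(lf(sample))
```

Because the templates and the hand-written function share one signature (text in, label or abstain out), they can be mixed freely in the same list of labeling functions.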
[00:05:27] And so, I mean, I think a lot of what we're trying to accomplish from the very beginning, right, is to raise the abstraction level at which you're interfacing with and programming your machine learning model or your AI application. And the first step is the hardest, right?
[00:05:39] If you think of the way that hand-labeled training data is, it's like the machine code, or really, actually, I think of it as like the ones and zeros, literally, for binary classification cases. Yeah, a lot of the effort behind the Snorkel project and the company is just, or was just, getting from that layer to the layer of...