dbt is taking the data engineering world by storm. This is the best 10 minutes I've heard on it, from Fishtown Analytics cofounder Drew Banin.
Fun fact: Fishtown is a town in Philadelphia!
Audio source: https://www.dataengineeringpodcast.com/dbt-data-analytics-episode-81/ (14 mins in)
Fivetran CEO geeking out on dbt: https://medium.com/hashmapinc/the-origins-and-future-of-fivetran-with-george-fraser-1af8e7cb90aa (17 mins in)
Transcript
Tobias Macey: And also, if we don't have this versioned repository of queries and reports, and the ability to collaborate on it fairly easily, then you end up leaving everyone to write their own SQL, usually ad hoc, and they might have their own assumptions as to what an order is, or what a customer is, or how to structure the query to join across different tables. And so everybody's going to have slightly different views of the data, or slightly different outputs. So definitely, having one location that everybody can look to, and one interface for everybody to collaborate on, makes it much easier and more scalable to have more people working on building and interacting with these analytics reports and these analytics pipelines.
Drew Banin: Absolutely, I think it's a great point. What we find is that in the process of building up these data models, what you're actually doing is generating knowledge about your organization. You're saying: here's exactly what an order is, or here's exactly how we calculate MRR. And to that end, dbt ships with auto-generated documentation about your project. You can run `dbt docs generate` to generate this single-page app that has all of your models, with all the tests on the columns and the descriptions that you can populate for these different models. And so if you do have some consumer of the data that isn't using dbt, they have a great place they can go to see all the models that exist, all the columns, and your prose about all of it. In that way, it's sort of a catalog of all the data that your organization commands, and instructions for its use.

Tobias Macey: Yeah, and I think that that is definitely very powerful, because having the documentation be generated as part of the code, as opposed to something that someone does after the fact or alongside the work that they're doing, means that it's much more likely to stay fresh and actually be updated periodically, rather than somebody putting in the time and effort to write some prose once, when they first build the overall reporting pipeline, and then it quickly grows stale and useless over time as new modifications are made.

Drew Banin: Yeah, that's absolutely right.

Tobias Macey: And another interesting capability that dbt has is the idea of packaging, and being able to package up these different subsets or reports or transformations so that they're reusable across different teams and across different code bases. So can you talk a bit about how those packages are set up and implemented, and also maybe talk a bit about who the primary drivers are for the different packages that are currently available?

Drew Banin: Sure.
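As a concrete aside on the documentation workflow described above: in a dbt project, those column descriptions and tests live in YAML files alongside the models, and `dbt docs generate` reads them to build the single-page catalog site. A minimal sketch (the model and column names here are hypothetical):

```yaml
# models/schema.yml — hypothetical example of documented, tested models
version: 2

models:
  - name: orders
    description: "One row per order; this is where we pin down exactly what an 'order' is."
    columns:
      - name: order_id
        description: "Primary key for orders."
        tests:
          - unique
          - not_null
      - name: status
        description: "Current order status."
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'returned']
```

Running `dbt docs generate` and then `dbt docs serve` builds and serves the catalog locally.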
Drew Banin: So when dbt runs, it will look in a couple of different places for what we call resources. An example of a resource is a model, or a test of a model, or things like documentation snippets, etc. One of the places it looks is your models directory, which holds the models that you've created, but the other place it looks is a folder called dbt_modules, which is sort of Node-inspired. And so what you can do is just drop whole dbt projects into that dbt_modules folder, and they get picked up as though they're a natural part of your project; all of these resources become available in the compilation context that dbt provides. There are basically two types of packages that are produced. One is dataset-specific packages, and the other is sort of macro or utility packages. An example of a dataset package is something like Snowplow. We're huge fans of the Snowplow event tracking system at Fishtown Analytics. The big idea is that you can track events from all your different clients, and they flow directly into a big events table in your warehouse. This events table is like an immutable event log: it has the full history of, you know, every interaction that you cared to track, in a single table, which is phenomenal. It's a great resource. But the problem is, it's difficult to plug a BI tool right into that, either because it's too much data, or because the things you really care about are hard to analyze, like how many people had two different events in a single session. So what we frequently find ourselves doing is rolling up these events into sessions, using some code that was actually originally produced by the Snowplow team, called their web data model. And so what we can do is make a package of these transformations that go from raw events, to page views, to sessions, all the way up to users, and encode these things as dbt models.
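For reference, including a dataset package like the Snowplow one is typically just an entry in the project's packages.yml, installed with `dbt deps` (the version pin below is hypothetical):

```yaml
# packages.yml — ask dbt to pull in the Snowplow web data model package
packages:
  - package: fishtown-analytics/snowplow
    version: [">=0.10.0", "<0.11.0"]  # hypothetical version range
```

`dbt deps` downloads the package into the dbt_modules folder, after which its models run as part of `dbt run` alongside your own.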
And if you include this package in your dbt project, when you type `dbt run`, these models will automatically run. You can also reference them from your own models; so if you want to do marketing attribution on top of the sessionization provided by the Snowplow package, you can absolutely do that. The other broad type of package that we make is maybe more focused on macros. The Jinja template engine supports something called macros, which are kind of like functions that return text, basically. In most cases text, anyway; we've actually been hacking them so that they return other things, which is pretty wild how we do it. And so if you find yourself writing the same type of code over and over again, what you can do is make a macro that accepts arguments and spits back out, usually, the SQL that you need to encode that particular piece of logic. So here's a really good example that shows the full force of the dbt compilation engine. We wrote a... actually, let me grab it: somebody contributed a pivot macro, which you can point at a table and a specific column, and say, pivot out the values in this column using this aggregate function. So you say, look at the Orders table, look at the... you know, here's a better example: look at the products table, look at the product's color, and then pivot that out to, like, color red, color blue, color green, with a one if that's true, or a zero if it's not. This is probably something that a lot of people have written manually many times over. But with macros, which are an ability to encapsulate logic, plus packages, which are a distribution mechanism, we can write that thing once, and many, many people can benefit from it. So this is one example of a macro that was contributed by a member of the dbt community. But really, this dbt-utils package that contains the pivot macro has dozens of macros, many of which were contributed by dbt users.
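A sketch of what using that pivot macro looks like in a model, assuming the dbt-utils package is installed and a `products` model with a `color` column exists (both hypothetical here):

```sql
-- models/product_colors_pivoted.sql (hypothetical model)
-- Produces one column per distinct color, with 1/0 flags per product.
select
    product_id,
    {{ dbt_utils.pivot(
        'color',
        dbt_utils.get_column_values(ref('products'), 'color')
    ) }}
from {{ ref('products') }}
group by product_id
```

At compile time, `get_column_values` queries the distinct colors and `pivot` expands them into the repetitive case/sum SQL you would otherwise write by hand.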
And the really cool thing is, a lot of these people aren't engineers by trade; they're analysts. And so for a lot of them, it's their first time contributing to an open source repository. And that's a pretty cool experience, to be the benefactor of the code that they wrote.