Sources and staging

In this section, we are going to create our source and staging models.

Sources allow us to create a dependency between our source database object and our staging models which will help us when we look at data lineage later. Also, if your source changes database or schema, you only have to update it in your f1_sources.yml file rather than updating all of the models it might be used in.

Staging models are the base of our project, where we bring all the individual components we're going to use to build our more complex and useful models into the project.

Since we want to focus on dbt and Python in this workshop, check out our sources and staging docs if you want to learn more (or take our dbt Fundamentals course which covers all of our core functionality).

Create sources

We're going to be using each of our 8 Formula 1 tables from our formula1 database under the raw schema for our transformations and we want to create those tables as sources in our project.

Create a new file called f1_sources.yml with the following file path: models/staging/formula1/f1_sources.yml.
Then, paste the following code into the file before saving it:

version: 2

sources:
  - name: formula1
    description: formula 1 datasets with normalized tables 
    database: formula1 
    schema: raw
    tables:
      - name: circuits
        description: One record per circuit, which is the specific race course. 
        columns:
          - name: circuitid
            tests:
            - unique
            - not_null
      - name: constructors 
        description: One record per constructor. Constructors are the teams that build their formula 1 cars. 
        columns:
          - name: constructorid
            tests:
            - unique
            - not_null
      - name: drivers
        description: One record per driver. This table gives details about the driver. 
        columns:
          - name: driverid
            tests:
            - unique
            - not_null
      - name: lap_times
        description: One row per lap in each race. Lap times started being recorded in this dataset in 1984 and joined through driver_id.
      - name: pit_stops 
        description: One row per pit stop. Pit stops do not have their own id column, the combination of the race_id and driver_id identify the pit stop.
        columns:
          - name: stop
            tests:
              - accepted_values:
                  values: [1,2,3,4,5,6,7,8]
                  quote: false            
      - name: races 
        description: One race per row. Importantly this table contains the race year to understand trends. 
        columns:
          - name: raceid
            tests:
            - unique
            - not_null        
      - name: results
        columns:
          - name: resultid
            tests:
            - unique
            - not_null   
        description: One row per result. The main table that we join out for grid and position variables.
      - name: status
        description: One status per row. The status contextualizes whether the race was finished or what issues arose e.g. collisions, engine, etc. 
        columns:
          - name: statusid
            tests:
            - unique
            - not_null

Create staging models

The next step is to set up the staging models for each of the 8 source tables. Given the one-to-one relationship between staging models and their corresponding source tables, we'll build 8 staging models here. We know it’s a lot and in the future, we will seek to update the workshop to make this step less repetitive and more efficient. This step is also a good representation of the real world of data, where you have multiple hierarchical tables that you will need to join together!

Let's go in alphabetical order to easily keep track of all our staging models! Create a new file called stg_f1_circuits.sql with this file path models/staging/formula1/stg_f1_circuits.sql. Then, paste the following code into the file before saving it:

with

source  as (

    select * from {{ source('formula1','circuits') }}

),

renamed as (
    select
        circuitid as circuit_id,
        circuitref as circuit_ref,
        name as circuit_name,
        location,
        country,
        lat as latitude,
        lng as longitude,
        alt as altitude
        -- omit the url
    from source
)
select * from renamed

All we're doing here is pulling the source data into the model using the source function, renaming some columns, and omitting the column url with a commented note since we don’t need it for our analysis.

Create stg_f1_constructors.sql with this file path models/staging/formula1/stg_f1_constructors.sql. Paste the following code into it before saving the file:

with

source  as (

    select * from {{ source('formula1','constructors') }}

),

renamed as (
    select
        constructorid as constructor_id,
        constructorref as constructor_ref,
        name as constructor_name,
        nationality as constructor_nationality
        -- omit the url
    from source
)

select * from renamed

We have 6 other stages models to create. We can do this by creating new files, then copy and paste the code into our staging folder.

Create stg_f1_drivers.sql with this file path models/staging/formula1/stg_f1_drivers.sql:

with

source  as (

    select * from {{ source('formula1','drivers') }}

),

renamed as (
    select
        driverid as driver_id,
        driverref as driver_ref,
        number as driver_number,
        code as driver_code,
        forename,
        surname,
        dob as date_of_birth,
        nationality as driver_nationality
        -- omit the url
    from source
)

select * from renamed

Create stg_f1_lap_times.sql with this file path models/staging/formula1/stg_f1_lap_times.sql:

with

source  as (

    select * from {{ source('formula1','lap_times') }}

),

renamed as (
    select
        raceid as race_id,
        driverid as driver_id,
        lap,
        position,
        time as lap_time_formatted,
        milliseconds as lap_time_milliseconds
    from source
)

select * from renamed

Create stg_f1_pit_stops.sql with this file path models/staging/formula1/stg_f1_pit_stops.sql:

with

source  as (

    select * from {{ source('formula1','pit_stops') }}

),

renamed as (
    select
        raceid as race_id,
        driverid as driver_id,
        stop as stop_number,
        lap,
        time as lap_time_formatted,
        duration as pit_stop_duration_seconds,
        milliseconds as pit_stop_milliseconds
    from source
)

select * from renamed
order by pit_stop_duration_seconds desc

Create stg_f1_races.sql with this file path models/staging/formula1/stg_f1_races.sql:

with

source  as (

    select * from {{ source('formula1','races') }}

),

renamed as (
    select
        raceid as race_id,
        year as race_year,
        round as race_round,
        circuitid as circuit_id,
        name as circuit_name,
        date as race_date,
        to_time(time) as race_time,
        -- omit the url
        fp1_date as free_practice_1_date,
        fp1_time as free_practice_1_time,
        fp2_date as free_practice_2_date,
        fp2_time as free_practice_2_time,
        fp3_date as free_practice_3_date,
        fp3_time as free_practice_3_time,
        quali_date as qualifying_date,
        quali_time as qualifying_time,
        sprint_date,
        sprint_time
    from source
)

select * from renamed

Create stg_f1_results.sql with this file path models/staging/formula1/stg_f1_results.sql:

with

source  as (

    select * from {{ source('formula1','results') }}

),

renamed as (
    select
        resultid as result_id,
        raceid as race_id,
        driverid as driver_id,
        constructorid as constructor_id,
        number as driver_number,
        grid,
        position::int as position,
        positiontext as position_text,
        positionorder as position_order,
        points,
        laps,
        time as results_time_formatted,
        milliseconds as results_milliseconds,
        fastestlap as fastest_lap,
        rank as results_rank,
        fastestlaptime as fastest_lap_time_formatted,
        fastestlapspeed::decimal(6,3) as fastest_lap_speed,
        statusid as status_id
    from source
)

select * from renamed

Last one! Create stg_f1_status.sql with this file path: models/staging/formula1/stg_f1_status.sql:
```
with

source  as (

    select * from {{ source('formula1','status') }}

),

renamed as (
    select
        statusid as status_id,
        status
    from source
)

select * from renamed
```
After the source and all the staging models are complete for each of the 8 tables, your staging folder should look like this:
Staging folder
It’s a good time to delete our example folder since these two models are extraneous to our formula1 pipeline and my_first_model fails a not_null test that we won’t spend time investigating. dbt Cloud will warn us that this folder will be permanently deleted, and we are okay with that so select Delete.
Delete example folder
Now that the staging models are built and saved, it's time to create the models in our development schema in Snowflake. To do this we're going to enter into the command line dbt build to run all of the models in our project, which includes the 8 new staging models and the existing example models.
Your run should complete successfully and you should see green checkmarks next to all of your models in the run results. We built our 8 staging models as views and ran 13 source tests that we configured in the f1_sources.yml file with not that much code, pretty cool!
Successful dbt build in Snowflake
Let's take a quick look in Snowflake, refresh database objects, open our development schema, and confirm that the new models are there. If you can see them, then we're good to go!
Confirm models
Before we move onto the next section, be sure to commit your new models to your Git branch. Click Commit and push and give your commit a message like profile, sources, and staging setup before moving on.

Create sources​

Create staging models​

Create sources

Create staging models