Data Warehouse Practice and Lessons Learned

Recently, I have been working on the final staging of a data warehouse with my skilled colleagues. Here are some lessons I learned from them and from this project. I am writing them down as a future reminder for my own data warehouse development.

Administration

  • In PostgreSQL, ownership of a schema or table belongs to the role that created it. Even an admin account (unless it is a superuser) cannot grant privileges on those schemas and tables at first; only the owner can grant privileges to others. It is therefore better to create schemas and tables with the admin account in the first place. Otherwise, we have to transfer ownership of the tables or schemas to the admin account before we can manage them.

  • When granting privileges on tables in a schema to other roles, we also have to grant USAGE on the schema itself. Only once a role has USAGE on the schema do the table-level grants in that schema actually take effect.
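The two points above might look like this in SQL (the schema, table, and role names here are invented for illustration):

```sql
-- Transfer ownership so the admin role can manage grants later:
ALTER SCHEMA analytics OWNER TO admin_role;
ALTER TABLE analytics.events OWNER TO admin_role;

-- Table grants only take effect once the role also has USAGE on the schema:
GRANT USAGE ON SCHEMA analytics TO analyst_role;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics TO analyst_role;
```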

Design

  • Surrogate Key and Natural Key: A natural key carries business meaning, which is sometimes useful. However, because it follows no consistent standard, it can confuse data warehouse users. To avoid that drawback, we can introduce a surrogate key: an automatically generated column, such as an auto-increment field. An example demonstrates the benefit. Suppose we have a user table whose primary key is username (usernames are usually unique in an application) and a post table that uses username as a foreign key. If we use the natural key username this way, then changing a username that is referenced by posts causes problems in the post table. If instead the user table has a surrogate key, an auto-increment id column, and the post table references that id as its foreign key, nothing breaks when the username changes. We could also solve the natural-key problem with a trigger, but a surrogate key definitely makes our life easier.

  • (2020-01-30 Schema Convention): In a data warehouse, a good way to organize schemas and tables is to take advantage of data lineage. For example, we typically have multiple levels of tables: bronze, silver, gold, etc. Depending on business requirements, we can even have platinum tables (with more aggregation and manipulation involved). Bronze tables store raw responses from data APIs. We then extract and transform that raw data into silver tables, shaped so that people can query it conveniently. If some metrics are valuable and frequently used, we pre-compute them in gold tables. This is also a basic example of how data lineage is used.
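The bronze/silver/gold convention above might look like this in SQL (every schema, table, and column name here is invented for illustration):

```sql
-- bronze: raw API responses, stored as-is
CREATE TABLE bronze.api_orders_raw (
    ingested_at timestamptz,
    payload     jsonb
);

-- silver: extracted and typed, ready for ad-hoc queries
CREATE TABLE silver.orders AS
SELECT
    (payload ->> 'order_id')::bigint AS order_id,
    (payload ->> 'amount')::numeric  AS amount,
    (payload ->> 'ordered_at')::date AS ordered_at
FROM bronze.api_orders_raw;

-- gold: a frequently used metric, pre-aggregated
CREATE TABLE gold.daily_revenue AS
SELECT ordered_at, sum(amount) AS revenue
FROM silver.orders
GROUP BY ordered_at;
```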
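The surrogate-key scenario above can be sketched with the standard-library sqlite3 module (the table and column names are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# user table with a surrogate key: an auto-generated id column.
conn.execute("""
    CREATE TABLE user (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        username TEXT UNIQUE NOT NULL
    )
""")
# post references the stable surrogate key, not the mutable username.
conn.execute("""
    CREATE TABLE post (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        user_id INTEGER NOT NULL REFERENCES user(id),
        body TEXT
    )
""")

conn.execute("INSERT INTO user (username) VALUES ('alice')")
conn.execute("INSERT INTO post (user_id, body) VALUES (1, 'hello')")

# Renaming the user does not touch the foreign key at all.
conn.execute("UPDATE user SET username = 'alice_smith' WHERE id = 1")

row = conn.execute("""
    SELECT u.username FROM post p JOIN user u ON u.id = p.user_id
""").fetchone()
print(row[0])  # the post still resolves to the renamed user
```

With a natural key, the same UPDATE would either be rejected by the foreign key or require cascading the change into every referencing row.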

Data Model and Migration

  • When using Django to migrate and refactor the data warehouse, I noticed it is better to make one small change at a time so that Django auto-generates the migration files correctly. For example, when I changed a model name, table name, and column names all in one migration, Django turned what should have been simple rename operations into drop/add operations, which can break data integrity. We may lose data with such migrations, which is terrible for a data warehouse. So my suggestion is to keep each change small, for example splitting that large change into two migrations: one for the model and table name, the other for the column names.

  • Django's base data model implicitly includes an id column as an auto field, so we don't need to declare an explicit auto-increment field in a custom model. If we do declare one, we have to set it as the primary key. Although it is an auto-increment field, it can also accept explicit values in an upsert query, which overrides its auto-increment behavior for those rows.

  • (2020-01-30 Tests Involving Models): When using factories in tests, we have to make sure the corresponding related factories are instantiated before running the tests. When instantiating the factories, we should be careful about the order of declaration so that we don't violate foreign key constraints.
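The rename-vs-drop/add risk above is not Django-specific; the underlying data loss can be sketched with the standard-library sqlite3 module (no Django required, and the table names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_tmp (name TEXT)")
conn.execute("INSERT INTO customer_tmp VALUES ('alice'), ('bob')")

# A true rename migration keeps the data ...
conn.execute("ALTER TABLE customer_tmp RENAME TO customer")
kept = conn.execute("SELECT count(*) FROM customer").fetchone()[0]

# ... while a drop/add pair (what an auto-generated migration can
# silently produce when too many things change at once) loses it.
conn.execute("DROP TABLE customer")
conn.execute("CREATE TABLE customer (name TEXT)")
lost = conn.execute("SELECT count(*) FROM customer").fetchone()[0]

print(kept, lost)  # 2 rows survive the rename, 0 survive drop/add
```

Inspecting the generated migration file before running it shows which of the two shapes Django chose.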
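The point about an auto-increment primary key accepting explicit values can also be sketched with sqlite3 (in Django this column would be the implicit id auto field):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE item (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT)"
)

# Normal insert: the id is generated automatically.
conn.execute("INSERT INTO item (name) VALUES ('auto')")

# Upsert-style insert with an explicit id: the supplied value wins,
# overriding the auto-increment behavior for this row.
conn.execute("INSERT INTO item (id, name) VALUES (100, 'explicit')")

# The sequence then continues from the largest id seen so far.
conn.execute("INSERT INTO item (name) VALUES ('auto-again')")

rows = conn.execute("SELECT id, name FROM item ORDER BY id").fetchall()
print(rows)  # [(1, 'auto'), (100, 'explicit'), (101, 'auto-again')]
```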
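The ordering constraint behind the factory note above is the usual foreign-key rule: parent rows must exist before child rows can reference them. A minimal sqlite3 sketch of that rule (factory_boy itself is not used here, and the table names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE author (id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE book (id INTEGER PRIMARY KEY, "
    "author_id INTEGER REFERENCES author(id))"
)

# Wrong order: the child row references an author that does not exist yet.
try:
    conn.execute("INSERT INTO book (id, author_id) VALUES (1, 1)")
    violated = False
except sqlite3.IntegrityError:
    violated = True

# Right order: instantiate the parent first, then the child.
conn.execute("INSERT INTO author (id) VALUES (1)")
conn.execute("INSERT INTO book (id, author_id) VALUES (1, 1)")
print(violated)  # True: inserting the child first raised IntegrityError
```

Declaring factories so that parent factories run before the ones that depend on them avoids exactly this failure.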

To be continued …
