Creating a mock data pipeline with Snowflake and AWS – part 2

Find code here: https://github.com/clevecque/cdk-snowflake-demo

This is the second part of a two-part article on data partitioning. In this second part we’ll refer heavily to the data pipeline built in the first part, so don’t hesitate to read it through first for better understanding.

CREATE TABLE linktree_clicks (
    click_id INTEGER,
    timestamp TIMESTAMP,
    account_uuid VARCHAR
) PARTITION BY RANGE(timestamp);

The above query lets us partition the clicks received on our profiles by timestamp, so we can more easily scan and retrieve only the data we need. We can even rely on partition pruning, which lets Postgres discard partitions that are not relevant to a query for faster performance. But at 270M events a day, we would need close to a new partition every day.
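In Postgres, declarative partitioning also means each partition has to be created (and eventually dropped) explicitly. A minimal sketch, assuming daily partitions on the table above; the partition names and dates are illustrative only:

-- Two illustrative daily partitions of linktree_clicks
CREATE TABLE linktree_clicks_2022_06_01 PARTITION OF linktree_clicks
    FOR VALUES FROM ('2022-06-01') TO ('2022-06-02');

CREATE TABLE linktree_clicks_2022_06_02 PARTITION OF linktree_clicks
    FOR VALUES FROM ('2022-06-02') TO ('2022-06-03');

-- Partition pruning is on by default in recent Postgres versions,
-- but it can be set explicitly:
SET enable_partition_pruning = on;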

Snowflake has a great feature called micro-partitions: tables are partitioned under the hood into small chunks of 50MB – 500MB (of uncompressed data) without requiring anything from the data engineers. These micro-partitions are dynamic rather than static, so they are regularly reorganised as data arrives, and their value ranges can overlap, which keeps all the data quickly accessible. No need to define them or maintain them. Almost too good to be true – we’ll explore this further later.
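If you’re curious about what Snowflake has built under the hood, you can ask it. A hedged example, assuming the linktree_clicks table from part 1 has been loaded into Snowflake: SYSTEM$CLUSTERING_INFORMATION returns JSON describing the micro-partitions for a given set of columns, including how many there are and how much their value ranges overlap.

-- Inspect micro-partition metadata for the timestamp column
SELECT SYSTEM$CLUSTERING_INFORMATION('linktree_clicks', '(timestamp)');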

Clustering is another useful technique on relational databases: it physically orders the data based on an index, without having to split a table into smaller tables.

CREATE TABLE linktree_clicks (
    click_id INTEGER,
    timestamp TIMESTAMP,
    account_uuid VARCHAR
);

CREATE INDEX linktree_clicks_ts_idx ON linktree_clicks (timestamp);

CLUSTER linktree_clicks USING linktree_clicks_ts_idx;

However, CLUSTER is a one-time operation: very useful on static data, but it has to be rerun regularly as new data is added, which can become a bit of an endless loop.
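A small sketch of that loop, assuming the index created above:

-- Re-cluster after new inserts; without an index argument, CLUSTER reuses
-- the index from the previous CLUSTER ... USING call on this table.
CLUSTER linktree_clicks;

-- Or re-cluster every previously clustered table in the database:
CLUSTER;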

On Snowflake’s side, whenever data is inserted into a table, clustering metadata is automatically maintained for the micro-partitions described above. Thanks to this metadata, querying is much faster because only the data needed for the query is scanned. This is what Snowflake calls query pruning: data that is not relevant to the query is automatically pruned (ignored) based on this metadata. Efficient querying, reduced costs, everyone is happy.
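For very large tables, Snowflake also lets you declare an explicit clustering key, so its automatic clustering keeps the micro-partitions organised around the columns you query most. A hedged sketch; the clustering expression here is illustrative:

-- Keep micro-partitions organised around the click date
ALTER TABLE linktree_clicks CLUSTER BY (TO_DATE(timestamp));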

Vocab checkpoint:

  • Partitions split a table into smaller tables that are easier to consume.
  • Pruning, of partitions or of data, discards anything not needed by a query (see the example right after this list).
  • Clustering creates, well, clusters of data based on an index within the same table, also to improve query performance.
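To see pruning in action on the partitioned Postgres table, a hedged example: the plan should only scan the partition(s) covering the requested day.

EXPLAIN
SELECT count(*)
FROM linktree_clicks
WHERE timestamp >= '2022-06-01' AND timestamp < '2022-06-02';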

If it’s all automatic under the hood, why do I need to know about it?

When we first started to use Snowflake at Linktree, it was so fast and easy to use we never really lingered on how it worked behind the scenes. Until this happened.

Feel free to try it out with our mock repository. You’ll need to insert initial data and then ongoing timed data. 
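If you just want a feel for the shape of that data, here is a purely illustrative manual seed; the actual repository drives this through the pipeline from part 1, so the values below are made up.

-- Illustrative only: a handful of click events with increasing timestamps,
-- mimicking the ongoing timed data the pipeline produces.
INSERT INTO linktree_clicks (click_id, timestamp, account_uuid) VALUES
    (1, '2022-06-01 00:00:01', 'account-a'),
    (2, '2022-06-01 00:05:12', 'account-b'),
    (3, '2022-06-02 09:13:45', 'account-a');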

Conclusion

We ran through some vocabulary around data efficiency and optimisation in databases. We’ve seen how to generate mock data, how to create a small data pipeline for close-to-real-time applications, and some Snowflake use cases. We have also seen how costly it can be to assume that ease of use means no errors.

This issue was a massive learning curve for our team. As easy to use as data warehouses and cloud computing can be, they’re easy precisely because we no longer have to address these issues ourselves. It’s still essential to understand how they work underneath to make correct and great use of these tools.

On our side, it also taught us a fair bit about data quality and about using Snowflake as the Transform step of our ELT. It’s extremely performant and efficient, and today we’re still using a similar approach that performs really well. But it’s worth considering better handling of the Transform step, either upstream or during the load process for instance.

Documentation:

  • https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions.html
  • https://www.postgresql.org/docs/12/sql-cluster.html
  • https://www.cybertec-postgresql.com/en/cluster-improving-postgresql-performance/
  • https://severalnines.com/database-blog/guide-partitioning-data-postgresql
  • https://bobbyhadz.com/
  • Snowflake support team 🧡

Credits

Words By Clem Levecque
