Moritz Körber

Add data quality checks to your duckdb pipeline with Soda Core

duckdb is a great and fast choice for pipelines running on a single instance and is an even better match if your pipeline oozes with SQL. To ensure that what you are doing with duckdb is actually...

Data Engineering Practice

Yesterday, I finished the last exercise of Daniel Beach’s amazing data engineering exercises. The exercises look simple at first glance, but in each exercise I stumbled upon something of which I th...

Simple and secure deployment with GitHub Actions OpenID Connect (OIDC)

Continuous delivery (CD) workflows implemented with GitHub Actions help deploy software, create and update cloud infrastructure, or make use of various services of cloud providers like Amazon Web Servic...

Version metadata are like charging cables: Versioning your package with setuptools_scm

If you have ever forgotten to pack a charging cable or something else for a trip, you probably have noticed that we humans are prone to errors in simple and repetitive routine tasks 1. Because mach...

How to plot a grouped stacked bar chart in plotly

plotly makes it easy to create an interactive stacked or grouped bar chart in Python by assigning the desired type to the layout attribute barmode. Unfortunately, barmode only takes either stack or...

How to create a button to exchange the data in a plotly plot

I recently wanted to create a button for a plotly (the Python library) plot that exchanges the underlying data. I got stuck in the middle and since I couldn’t find much on Google, I thought it migh...

Tune your preprocessing steps and algorithm selection like hyperparameters

Using a pipeline to preprocess your data offers some substantive advantages. A pipeline guarantees that no information from the test set is used in preprocessing or training the model. Pipelines ar...

How to apply preprocessing steps in a pipeline only to specific features

The situation: You have a pipeline to standardize and automate preprocessing. Your data set contains features of at least two different data types that require different preprocessing steps. For ex...

A plot says more than 1000 tables: Visualizing missing data with missingno

Real-world data sets are very rarely free of missing values. Their causes are manifold: A participant could just have overlooked a survey item, but it could also have been a controversial ques...

Prediction of the quality of physical exercise on the basis of accelerometer data

Background: Going to the gym, whether to boost your health, to lose weight, or simply because it is fun, is certainly a worthwhile activity. FiveThirtyEight recently reported that according to the l...
