Posts

  • Dealing With Bad Records in Kafka

    A single bad record (a.k.a. poison pill) on a Kafka topic can ruin your day. KafkaConsumer does not deal with these records gracefully. Here I cover strategies for addressing this issue.
  • Automating SQL Server Databases with Docker

    Installing and managing SQL Server on a development workstation has never been an attractive prospect. It is time-consuming, installs _tons_ of dependencies that are not removed even if you uninstall SQL Server, and can drain resources even when it's not in use. Managing the database is often a drag on contemporary development practices. Automating database deployments in a Docker container can be a big boost to development efficiency.
  • CS0579 Duplicate Attribute Error with .NET Core

    I recently started working with .NET Core and almost immediately came across a CS0579 Duplicate Attribute Error. This is what I discovered and how I fixed the issue.
  • Kappa Architecture

    Part of the Intro to Data Streaming series.

    The 'Intro to Data Streaming' series continues with an overview of the Kappa Architecture, a proposed enhancement to the Lambda Architecture. While agreeing with the basic formula, it proposes eliminating some of the technical overhead and complexity.
  • Remembering Peter Lawler

    One year ago today a great man died. To mark the day I'm re-posting what I wrote at the time in memoriam Peter Augustine Lawler.
  • Running Hadoop on WSL

    I was curious if it was possible to get Hadoop running under Windows Subsystem for Linux. It is.
  • Lambda Architecture

    Part of the Intro to Data Streaming series.

    Through the first parts in this series we have covered problems with batch ETL processes and conceptually designed a real-time data processing system. In this post the series shifts to looking at reference architectures that have been successfully used to implement real-time data streaming solutions. The first of these is known as the Lambda Architecture.
  • Improving the Real-Time App

    Part of the Intro to Data Streaming series.

    In the last post we considered an application architecture that would start to achieve real-time ETL requirements, but there were issues remaining with the design. In this post, we will refine the design to further improve upon batch processing and better understand the data streaming pattern.

  • A Simple Real-Time App

    Part of the Intro to Data Streaming series.

    To illustrate the principles of data streaming, it’s helpful to start by envisioning what a simple application with real-time ETL capabilities would look like. But first, a review.

  • Data Streaming Learning Resources

    Part of the Intro to Data Streaming series.

    During a presentation at the Nashville BI user group, I was asked to provide more material on getting started with data streaming. I’ll cover some of these resources in other ‘Intro to Data Streaming’ posts, but I promised the group a blog post on the subject, and here it is. There is a ton of material out there, but this will get you started.

  • The Problem with Batch ETL - Part 2

    Part of the Intro to Data Streaming series.

    For many years, application architecture consisted of some reliable constants upon which ETL, and Business Intelligence in general, relied. Early on, having windows of 6-12 hours in which few changes to source systems were made and transactional application servers were idle was not uncommon. That has largely changed across the board except for the smallest of regional or local companies. That data would be stored in a relational database system exposed by the ubiquitous SQL + ODBC combination was also a given. Core applications were complicated monoliths with relatively few satellite applications that had data relevant for analytics. We have, at ever-increasing velocity, seen these and other architecture stalwarts begin to disappear.
  • The Problem with Batch ETL - Part 1

    Part of the Intro to Data Streaming series.

    Chances are, batch ETL is the majority, or perhaps the exclusive, solution for data engineering underlying Business Intelligence in your enterprise. There are good reasons for this. Batch ETL has a legion of engineers trained in its patterns. It is politically non-controversial. There are many established tools backed by major, executive- and compliance-approved corporations. While far from simple, it does eliminate some complexities, for example by isolating loads from other processes and partially removing contention with other workloads. It is also the approach that best caters to the highest-performance write operations of relational database management systems by loading large quantities of data at one time, rather than in separate transactions. Unfortunately, because it has been the default choice for so long, most enterprises have become complacent about its limitations. It is time to take a hard look at this venerable practice.
  • Multi-Threading with Runspaces

    Part of the Concurrency in PowerShell series.

    There are multiple methods to achieve concurrency in PowerShell. This post covers Multi-Threading with Runspaces.
  • Background Jobs

    Part of the Concurrency in PowerShell series.

    There are multiple methods to achieve concurrency in PowerShell. This post covers Background Jobs in preparation for covering a method using true multi-threading in a future post.