Lecture 018 - Kafka

eShopOnContainers Reference Application

Conceptual scaling example of a video-streaming application

Service Netflix Has: Mastering Chaos - A Netflix Guide to Microservices:

Robots Are Distributed Systems Too

ROS Computational Architecture

Communication Methods

Direct Communication

RPC: Remote Procedure Calls. (invoke a function on another node)

MPI: Message-Passing Interface. (fine-grained programming (sends/receives/puts/gets) at the transport layer)

Actors: "Processes" or "Agents" that pass messages. (higher-level message passing, focus on events and nodes)

Indirect Communication

Time and Space (de)coupling

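Direct communication: the sender and the receivers must exist at the same time and know of each other (coupled in both time and space).

Indirect communication: the sender and the receivers communicate through an intermediary, so they need not exist at the same time (time decoupling) or know of each other (space decoupling).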

Publish-Subscribe

Publish-Subscribe: A model in which we provide a framework to glue requestors (producers) to workers (consumers), with much looser coupling.

Topic-based: Events are classified into predefined topics. Subscriptions can include any number of these topics. (can only be discrete, not adjustable)

Content-based: Events are structured as a set of attributes. Subscriptions can define a range over any of these attributes. (can be continuous, adjustable)
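The difference between the two subscription models can be sketched as two kinds of matching predicates. This is only an illustration: the `Event` class, the `SubscriptionModels` helper, and the attribute name `price` are hypothetical and do not come from any particular pub-sub framework.

```java
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical event type: a topic label plus numeric attributes.
final class Event {
    final String topic;
    final Map<String, Double> attributes;

    Event(String topic, Map<String, Double> attributes) {
        this.topic = topic;
        this.attributes = attributes;
    }
}

class SubscriptionModels {
    // Topic-based: a subscription is a set of predefined topic names.
    static Predicate<Event> topicBased(Set<String> topics) {
        return e -> topics.contains(e.topic);
    }

    // Content-based: a subscription is a range [low, high] over one attribute.
    static Predicate<Event> contentBased(String attribute, double low, double high) {
        return e -> {
            Double v = e.attributes.get(attribute);
            return v != null && v >= low && v <= high;
        };
    }

    public static void main(String[] args) {
        Event e = new Event("stocks", Map.of("price", 42.0));
        System.out.println(topicBased(Set.of("stocks", "weather")).test(e)); // true
        System.out.println(contentBased("price", 50.0, 100.0).test(e));      // false
    }
}
```

A topic-based subscriber can only pick from the fixed topic names, while a content-based subscriber can move the range bounds freely.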

Implementations:

A centralized server can track all subscribers, but it is:

  - Not reliable: single point of failure
  - Not available: it becomes a bottleneck
  - Not scalable

Broker Network

Broker Network: a type of overlay network

Overlay Network: a network on top of traditional Internet

Variants of event routing for content-based routing: choose one based on the subscription model and on the performance, fault-tolerance, availability, and consistency requirements.

Flooding

Flooding: send everything to everybody

Flooding is easy to implement but creates a lot of unnecessary network traffic.

Filtering

Filtering: the broker network is in charge of filtering events by topic

  1. a subscription is sent to the closest broker
  2. brokers share information about subscriptions
  3. each broker then knows which neighboring brokers to forward published events to

Required Data for Each Node

Filtering requires a stable broker network

From the perspective of a node in the broker network, the routing logic can be written as the following pseudocode:

func onReceivePublish(Event e, FromNode x) {
  matchlist := match(e, subscriptions);
  send notify(e) to matchlist;
  fwdlist := match(e, routing);
  send publish(e) to fwdlist - x;
}

func onReceiveSubscribe(Subscription s, FromNode x) {
  if x is client {
    add x to subscriptions;
  } else {
    add(x, s) to routing;
  }
  send subscribe(s) to neighbors - x;
}
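The pseudocode above can be turned into a small runnable sketch. It rests on several simplifying assumptions that are not in the notes: events and subscriptions are plain topic strings, `match` is string equality, brokers call each other directly instead of sending network messages, and the broker topology is acyclic (otherwise subscription forwarding would loop). The class name `FilteringBroker` and its fields are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class FilteringBroker {
    final String name;
    final List<FilteringBroker> neighbors = new ArrayList<>();
    // topic -> local clients subscribed at this broker
    final Map<String, Set<String>> subscriptions = new HashMap<>();
    // topic -> neighboring brokers that lead to interested subscribers
    final Map<String, Set<FilteringBroker>> routing = new HashMap<>();
    // bookkeeping for the demonstration only
    final List<String> notified = new ArrayList<>();
    int publishesReceived = 0;

    FilteringBroker(String name) { this.name = name; }

    static void link(FilteringBroker a, FilteringBroker b) {
        a.neighbors.add(b);
        b.neighbors.add(a);
    }

    // from == null means the subscription comes from a local client.
    void onReceiveSubscribe(String topic, String client, FilteringBroker from) {
        if (from == null) {
            subscriptions.computeIfAbsent(topic, t -> new HashSet<>()).add(client);
        } else {
            routing.computeIfAbsent(topic, t -> new HashSet<>()).add(from);
        }
        for (FilteringBroker n : neighbors) {          // neighbors - x
            if (n != from) n.onReceiveSubscribe(topic, client, this);
        }
    }

    // from == null means the event comes from a local publisher.
    void onReceivePublish(String topic, String payload, FilteringBroker from) {
        publishesReceived++;
        for (String client : subscriptions.getOrDefault(topic, Set.of())) {
            notified.add(client + " <- " + payload);   // notify(e)
        }
        for (FilteringBroker n : routing.getOrDefault(topic, Set.of())) {
            if (n != from) n.onReceivePublish(topic, payload, this); // fwdlist - x
        }
    }

    public static void main(String[] args) {
        FilteringBroker a = new FilteringBroker("A");
        FilteringBroker b = new FilteringBroker("B");
        FilteringBroker c = new FilteringBroker("C");
        link(a, b);
        link(b, c);
        c.onReceiveSubscribe("sports", "alice", null);
        a.onReceivePublish("sports", "goal", null);    // travels A -> B -> C
        a.onReceivePublish("news", "headline", null);  // filtered out at A
        System.out.println(c.notified);
        System.out.println("B received " + b.publishesReceived + " publish message(s)");
    }
}
```

In the demonstration the `news` event never leaves broker A because no routing entry exists for it, which is the traffic saving that filtering offers over flooding.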

Advertisement

Advertisement: publishers advertise their topics, and subscribers respond if they are interested

  1. publishers advertise the topics they are about to publish
  2. advertisements are propagated through the network
  3. subscribers contact publishers if they are interested

A publisher may be contacted too frequently and suffer a heavy server load.

Rendezvous

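Rendezvous: divide the subscription-handling work evenly among broker nodes by topic. Each broker acts as a rendezvous node for a subset of the event space. In the pseudocode below, SN(s) returns the rendezvous nodes responsible for subscription s, and EN(e) returns the rendezvous nodes responsible for matching event e.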

func onReceivePublish(Event e, FromNode x, AtNode i) {
  rvlist := EN(e);
  if i in rvlist {
    matchlist := match(e, subscriptions);
    send notify(e) to matchlist;
  }
  send publish(e) to rvlist - i;
}

func onReceiveSubscribe(Subscription s, FromNode x, AtNode i) {
  rvlist := SN(s);
  if i in rvlist {
    add s to subscriptions;
  } else {
    send subscribe(s) to rvlist - i;
  }
}
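One simple way to realize the SN and EN functions for topic-based events is to hash the topic onto a broker index. The sketch below assumes a fixed number of brokers and exactly one rendezvous node per topic; the class `RendezvousMapping` and the use of `String.hashCode()` are illustrative choices, not something the notes prescribe.

```java
class RendezvousMapping {
    // Hypothetical mapping: one rendezvous node per topic, chosen by hashing.
    static int rendezvousNode(String topic, int brokerCount) {
        return Math.floorMod(topic.hashCode(), brokerCount);
    }

    // SN(s): the node that stores subscriptions for the given topic.
    static int SN(String subscriptionTopic, int brokerCount) {
        return rendezvousNode(subscriptionTopic, brokerCount);
    }

    // EN(e): the node that matches events published on the given topic.
    static int EN(String eventTopic, int brokerCount) {
        return rendezvousNode(eventTopic, brokerCount);
    }

    public static void main(String[] args) {
        int brokers = 4;
        for (String topic : new String[] {"sports", "news", "weather"}) {
            System.out.println(topic + " -> broker " + SN(topic, brokers));
        }
    }
}
```

The property that matters is that an event and every subscription it matches are sent to the same node, so matching can happen locally there.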

Examples of Publish-Subscribe Ecosystems

Messaging platform:

Separate service:

Standards:

Kafka

In an ordinary messaging system, messages are kept in a total order to ensure correctness. In Kafka, however, the queue (a topic) can be split into multiple partitions, each of which is ordered on its own, so the topic as a whole is only partially ordered. Kafka lets the application specify how items are distributed across the partitions (the distribution strategy).

If you need total order, either:

  - set NumberOfPartitions = 1, or
  - enforce total ordering in your consumer application (e.g. a Storm topology).
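A common distribution strategy is to hash the record key, so that all records with the same key land in the same partition and keep their relative order. The sketch below illustrates the idea only: `KeyPartitioner` is a hypothetical class, and `String.hashCode()` stands in for the hash function (Kafka's own default partitioner hashes the serialized key with murmur2 instead).

```java
import java.util.ArrayList;
import java.util.List;

class KeyPartitioner {
    // Assumption: hashCode() is used for illustration, not Kafka's murmur2.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 3;
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }

        // Records arrive in this order; the key is the team name.
        String[][] records = {
            {"teamA", "goal 1"}, {"teamB", "foul"}, {"teamA", "goal 2"}
        };
        for (String[] r : records) {
            partitions.get(partitionFor(r[0], numPartitions)).add(r[0] + ": " + r[1]);
        }

        // Within one partition the arrival order is preserved, so
        // "teamA: goal 1" always precedes "teamA: goal 2".
        for (int i = 0; i < numPartitions; i++) {
            System.out.println("partition " + i + " = " + partitions.get(i));
        }
    }
}
```

Records with different keys may end up in different partitions, and no order is defined between them.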

Example Topic: Football Game

Replicas and Brokers

Propagating writes across replicas

Records: key (optional), value, timestamp

Inspecting the current state of a topic: ISR means "in-sync replica" (replicas that are in sync with the leader). Broker 0 is leader for partition 1. Broker 1 is leader for partitions 0 and 2. All replicas are in-sync with their respective partitions

Producer

Producer clients are available for the JVM (Java, Scala), C/C++, Python, Ruby, etc., but the Kafka project officially provides only the JVM implementation.

// Legacy producer API (kafka.javaapi.producer)
Properties props = new Properties();
props.put("metadata.broker.list", "...");
// We can set the producer to be [async] or [sync]
// [async]: producer.send() queues the message and returns without blocking
// [sync]: producer.send() will block
props.put("producer.type", "async");
ProducerConfig config = new ProducerConfig(props);

Producer<K, V> p = new Producer<>(config);
KeyedMessage<K, V> msg = ...;
p.send(msg);

Ways for Producer to Pick Partition:

Java's producer API

class kafka.javaapi.producer.Producer<K, V> {
  public Producer(ProducerConfig config);

  /**
  * Sends the data to a single topic, partitioned by key, using either the
  * synchronous or asynchronous producer.
  */
  public void send(KeyedMessage<K, V> message);

  /**
  * Use this API to send data to multiple topics
  */
  public void send(List<KeyedMessage<K, V>> messages);

  /**
  * Close API to close producer pool connections to all Kafka brokers.
  */
  public void close();
}

Example to Construct Producer

// instantiate a Properties object
Properties kafkaProps = new Properties();
kafkaProps.put("bootstrap.servers", "broker1:9092,broker2:9092");

// we plan on using strings for our message key and value
// so use the built-in StringSerializer.
kafkaProps.put("key.serializer",
    "org.apache.kafka.common.serialization.StringSerializer");
kafkaProps.put("value.serializer",
    "org.apache.kafka.common.serialization.StringSerializer");

// create a new producer by setting the appropriate key and value types
// and passing the Properties object.
KafkaProducer<String, String> producer = new KafkaProducer<>(kafkaProps);

Ways to Send Message

Example for message sending: fire and forget

// Create ProducerRecord objects
// - name of the topic (always string)
// - key (must match our serializer and producer objects)
// - value (must match our serializer and producer objects)
ProducerRecord<String, String> record =
new ProducerRecord<>("CustomerCountry", "Precision Products", "France");

try {
  // send() method returns a Java Future object with RecordMetadata
  // but since we implement fire-and-forget, we don't care about it
  producer.send(record);
} catch (Exception e) { // errors before sending
  // a SerializationException when it fails to serialize the message
  // a BufferExhaustedException or TimeoutException if the buffer is full
  // or an InterruptException if the sending thread was interrupted.
  e.printStackTrace();
}

Example for message sending: synchronous

try {
  // can use to retrieve the offset the message was written to.
  RecordMetadata meta = producer.send(record).get();
} catch (Exception e) { // errors before sending, or errors returned by the broker
  // thrown if the record is not sent successfully to Kafka.
  e.printStackTrace();
}

Example for message sending: asynchronous

// To use callbacks, you need a class that implements the
// org.apache.kafka.clients.producer.Callback interface, which has a single
// function called onCompletion().
private class DemoProducerCallback implements Callback {
  @Override
  public void onCompletion(RecordMetadata recordMetadata, Exception e) {
    if (e != null) {
      // If Kafka returned an error, onCompletion() will have a nonnull exception.
      e.printStackTrace();
    }
  }
}

ProducerRecord<String, String> record =
new ProducerRecord<>("CustomerCountry", "Biomedical Materials", "USA");

// And we pass a Callback object along when sending the record.
producer.send(record, new DemoProducerCallback());

Message Acking Configuration by Producer:

Batching: the producer accumulates Records and sends them to the partition Leaders in batches once the number of Records at hand reaches a threshold.
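The batching idea can be sketched with a small buffer that flushes once a record-count threshold is reached. This is a simplified model under stated assumptions: `BatchingBuffer` is a hypothetical class, "sending" just appends the batch to a list, and there is a single buffer rather than one per partition leader. It does not reproduce the Kafka producer's internal accumulator.

```java
import java.util.ArrayList;
import java.util.List;

class BatchingBuffer {
    private final int batchSize;
    private final List<String> pending = new ArrayList<>();
    // Stand-in for the network: every flushed batch is recorded here.
    final List<List<String>> sentBatches = new ArrayList<>();

    BatchingBuffer(int batchSize) {
        this.batchSize = batchSize;
    }

    // Buffer a record; send the whole batch once the threshold is reached.
    void add(String record) {
        pending.add(record);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    // Send whatever is buffered, even if the batch is not full.
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        sentBatches.add(new ArrayList<>(pending));
        pending.clear();
    }

    public static void main(String[] args) {
        BatchingBuffer buffer = new BatchingBuffer(3);
        for (int i = 1; i <= 7; i++) {
            buffer.add("record-" + i);
        }
        buffer.flush(); // push out the final partial batch
        System.out.println(buffer.sentBatches.size() + " batches sent"); // 3 batches sent
    }
}
```

Larger batches mean fewer requests at the cost of extra latency for the records waiting in the buffer.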

Consumer

Rebalance: dynamically divide partitions evenly across the consumers of a group

The assignment of partitions to consumers is dynamic at run-time: it is recomputed whenever a consumer joins or leaves.
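What a rebalance computes can be sketched with a simplified round-robin assignment. The class `RoundRobinAssignment` and the consumer names are hypothetical, and this is not the implementation of any of Kafka's built-in assignors; it only shows how the same partitions are redistributed when group membership changes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RoundRobinAssignment {
    // Deal partitions 0..numPartitions-1 to the consumers in turn.
    static Map<String, List<Integer>> assign(List<String> consumers, int numPartitions) {
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        for (String c : consumers) {
            result.put(c, new ArrayList<>());
        }
        if (consumers.isEmpty()) {
            return result; // nobody to assign partitions to
        }
        for (int p = 0; p < numPartitions; p++) {
            result.get(consumers.get(p % consumers.size())).add(p);
        }
        return result;
    }

    public static void main(String[] args) {
        // Two consumers share five partitions.
        System.out.println(assign(List.of("c1", "c2"), 5));
        // prints {c1=[0, 2, 4], c2=[1, 3]}

        // A third consumer joins, which triggers a rebalance.
        System.out.println(assign(List.of("c1", "c2", "c3"), 5));
        // prints {c1=[0, 3], c2=[1, 4], c3=[2]}
    }
}
```

Note that partitions move between consumers after the rebalance, which is one reason rebalancing shows up so often in operational issues.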

Most Ops issues are due to 1) rebalancing and 2) consumer lag. So DevOps must understand what goes on.

Kafka Usage

Microservice + Kafka

Other PubSub Frameworks

Kafka: distributed topic-based buffer queue

Typical Data:

Big Data Ecosystem Datasheet

Incomplete-but-useful list of big-data related projects packed into a JSON dataset.

External references: Main page, Raw JSON data of projects, Original page on my blog

Related projects: Hadoop Ecosystem Table by Javi Roman, Awesome Big Data by Onur Akpolat, Awesome Awesomeness by Alexander Bayandin, Awesome Hadoop by Youngwoo Kim, Queues.io by Łukasz Strzałkowski

How to contribute

Projects

Add a new JSON file to projects-data directory. Here is an example:

{
  "name": "Apache Hadoop",
  "description": "framework for distributed processing. Integrates MapReduce (parallel processing), YARN (job scheduling) and HDFS (distributed file system)",
  "abstract": "framework for distributed processing. Integrates MapReduce (parallel processing), YARN (job scheduling) and HDFS (distributed file system)",
  "category": "Frameworks",
  "tags": ["framework", "yahoo", "apache"],
  "links": [{"text": "Apache Hadoop", "url": "http://hadoop.apache.org/"}]
}

Papers

Add a new JSON file to papers-data directory. Here is an example:

{
  "title": "The Google File System",
  "year": "2003",
  "authors": "",
  "abstract": "",
  "tags": ["google"],
  "links": [{"text": "PDF Paper", "url": "http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf"}]
}

Data

Projects are grouped into the following categories: Frameworks, Distributed Programming, Distributed Filesystem, Key-Map Data Model, Document Data Model, Key-value Data Model, Graph Data Model, NewSQL Databases, Columnar Databases, Time-Series Databases, SQL-like processing, Integrated Development Environments, Data Ingestion, Message-oriented middleware, Service Programming, Scheduling, Machine Learning, Benchmarking, Security, System Deployment, Container Manager, Applications, Search engine and framework, MySQL forks and evolutions, PostgreSQL forks and evolutions, Memcached forks and evolutions, Embedded Databases, Business Intelligence, Data Analysis, Data Warehouse, Data Visualization, Internet of Things.

Papers are grouped by year: 2016, 2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 1999, 1997.


This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
