RPC: Remote Procedure Calls. (invoke a function on another node)
MPI: Message-Passing Interface. (fine-grained programming (sends/receives/puts/gets) at the transport layer)
Actors: "Processes" or "Agents" that pass messages. (higher-level message passing, focus on events and nodes)
Indirect Communication
Time and Space (de)coupling
direct communication: sender and receiver exist at the same time and know of each other.
indirect communication:
Time uncoupling: a sender can send a message even if the receiver is not yet available.
Space uncoupling: a sender can send a message without knowing who will receive it, or whether anyone will receive it at all.
Publish-Subscribe
Publish-Subscribe: a model that provides a framework to glue requestors (producers) to workers (consumers) with much looser coupling.
Publish (producer): send "messages" on "topics" regardless of whether someone is listening
Subscribe (consumer): receive messages if anyone is sending them regardless of who
Topic-based: events are classified into predefined topics; subscriptions can include any number of these topics. (matching is discrete, not adjustable)
Content-based: events are structured as multiple attributes; subscriptions can define a range over any of these attributes. (matching can be continuous and adjustable; see the sketch below)
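To make the contrast concrete, here is a minimal Java sketch (all names hypothetical, assuming Java 16+ records): a topic-based subscription matches a discrete topic string exactly, while a content-based subscription is a predicate over event attributes and can express ranges.

import java.util.Map;
import java.util.function.Predicate;

// Hypothetical event: a topic name plus arbitrary numeric attributes.
record Event(String topic, Map<String, Double> attributes) {}

class Subscriptions {
    // Topic-based: discrete, exact match on a predefined topic.
    static Predicate<Event> onTopic(String topic) {
        return e -> e.topic().equals(topic);
    }
    // Content-based: continuous, adjustable range over any attribute.
    static Predicate<Event> onRange(String attr, double lo, double hi) {
        return e -> {
            Double v = e.attributes().get(attr);
            return v != null && lo <= v && v <= hi;
        };
    }
}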
Implementations:
Channel: events are published to a channel that processes can subscribe to.
Type: used by object-oriented languages; subscribe to events of a particular object type.
A centralized server can track all subscribers, but it is
- Not Reliable: single point of failure
- Not Available: bottleneck
- Not Scalable
Broker Network
Broker Network: a type of overlay network
Overlay Network: a network built on top of the traditional Internet
Variants of event routing for content-based routing: choose based on the subscription model, performance, fault tolerance, availability, and consistency requirements.
Flooding
Filtering
Advertisement
Rendezvous
Flooding
Flooding: send everything to everybody
the publisher broadcasts messages regardless of content or topic
each subscriber filters (matches) by itself
Flooding is easy to implement but creates a lot of unnecessary network traffic.
Filtering
Filtering: the broker network is in charge of filtering
a subscription is sent to the closest broker
brokers share information about subscriptions
each broker knows which neighboring brokers to forward published events to
Required Data for Each Node
list of neighbors in the broker network
subscription list
routing table
Filtering requires a stable broker network
From the perspective of a node in the broker network, the handlers can be written as:
func onReceivePublish(Event e, FromNode x) {
    // notify local subscribers whose subscriptions match e
    matchlist := match(e, subscriptions);
    send notify(e) to matchlist;
    // forward e to the neighboring brokers that have matching
    // subscriptions, except the node it came from
    fwdlist := match(e, routing);
    send publish(e) to fwdlist - x;
}
func onReceiveSubscribe(Subscription s, FromNode x) {
    if x is client {
        add x to subscriptions;    // local subscriber
    } else {
        add (x, s) to routing;     // remember which neighbor wants s
    }
    // propagate the subscription through the broker network
    send subscribe(s) to neighbors - x;
}
Advertisement
Advertisement: publishers advertise their topics, and subscribers respond if they are interested
publishers advertise the topics they are about to publish
advertisements are propagated through the network
subscribers contact the publishers they are interested in
A publisher may be contacted too frequently and suffer a huge load.
Rendezvous
Rendezvous: divide the subscription-handling work evenly between broker nodes by topic
rendezvous nodes: broker nodes that are responsible for a given subset of the event (topic) space.
SN(s):
  input: a subscription s
  output: the list of rendezvous nodes responsible for that subscription.
EN(e):
  input: an event e that was just published
  output: the list of rendezvous nodes responsible for matching the event against subscriptions.
func onReceivePublish(Event e, FromNode x, AtNode i) {
    rvlist := EN(e);    // rendezvous nodes responsible for this event
    if i in rvlist {
        matchlist := match(e, subscriptions);
        send notify(e) to matchlist;
    }
    send publish(e) to rvlist - i;
}
func onReceiveSubscribe(Subscription s, FromNode x, AtNode i) {
    rvlist := SN(s);    // rendezvous nodes responsible for this subscription
    if i in rvlist {
        add s to subscriptions;    // this node owns s's slice of the topic space
    } else {
        send subscribe(s) to rvlist - i;
    }
}
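One common way to realize SN and EN is to hash the topic onto the broker set, so that subscriptions and events for the same topic rendezvous at the same node. A minimal Java sketch, assuming a fixed broker list and hash-mod placement (a real deployment would use consistent hashing to survive broker churn):

import java.util.List;

class RendezvousMap {
    private final List<String> brokers;   // e.g., ["broker0", "broker1", "broker2"]
    RendezvousMap(List<String> brokers) { this.brokers = brokers; }

    // Shared by SN(s) and EN(e): the same topic always maps to the
    // same rendezvous node, so publishers and subscribers meet there.
    String nodeFor(String topic) {
        return brokers.get(Math.floorMod(topic.hashCode(), brokers.size()));
    }
}

Because both functions share this mapping, an event is routed only to the node responsible for its slice of the topic space instead of being flooded.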
Examples of Publish-Subscribe Ecosystems
Messaging platform:
Java Messaging Service (JMS)
ZeroMQ
Redis
Kafka
Separate service:
Google Cloud Pub/Sub
Standards:
OMG Data Distribution Service (DDS)
Atom - web feeds (RSS), clients poll for updates
Kafka
In an ordinary messaging system, messages must be totally ordered to ensure correctness. In Kafka, however, a queue can be split into multiple partial orders (partitions). Kafka lets the application specify how items are distributed across the queue's partitions (the distribution strategy).
If you need total order, either
- set NumberOfPartitions = 1
- enforce total ordering in your consumer application (e.g., a Storm topology).
Example Topic: Football Game
Replicas and Brokers
Propagating writes across replicas
Records: key (optional), value, timestamp
Record: each message sent by a producer
PartitionKey: a field value the user specifies, used to assign records to partitions
You can find a specific record given its Topic, PartitionNumber, and Offset (see the seek sketch after this list)
Topic: a grouping of partitions handling the same type of data
Offset: queue position within each partition. For each partition, Kafka tracks an offset per ConsumerGroup, not per Consumer (since at most one consumer in each ConsumerGroup reads a given partition).
NumberOfPartitions: the number of partitions for a Topic, configurable by the user; data within a partition is ordered, but not across partitions
PartitionNumber: uniquely identifies a partition within a topic
ReplicationFactor: how many copies of each partition to keep. Follower replicas are never read from and never written to by clients. Kafka tolerates (ReplicationFactor - 1) failed brokers before losing data. (LinkedIn uses ReplicationFactor = 2)
Replica: a replicated Partition. Each Replica's ReplicaNumber matches the BrokerNumber of the broker hosting it. For each Partition, Kafka elects a Leader hosting the authoritative replica that the other Replicas sync to.
Data Storage: uses a persistent, immutable, append-only log on disk
Delete: Kafka deletes old Records based on age, max size, or key.
ConsumerGroup: Consumers in the same ConsumerGroup do not share a partition (each always reads different records than the other Consumers in its group). Therefore the number of Consumers in a ConsumerGroup is at most NumberOfPartitions. To let several applications read the same records, create one ConsumerGroup per application.
Producer API: API to produce streams of records
Consumer API: API to consume streams of records
Broker: a Kafka server running in a Kafka Cluster. Can host multiple Replicas, but only one Replica of any given Partition.
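To illustrate the (Topic, PartitionNumber, Offset) coordinates above, a sketch using the Java consumer's assign/seek API (broker address, topic name, and coordinates are made up):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

Properties props = new Properties();
props.put("bootstrap.servers", "broker1:9092");
props.put("key.deserializer",
    "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer",
    "org.apache.kafka.common.serialization.StringDeserializer");

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
TopicPartition tp = new TopicPartition("CustomerCountry", 2); // Topic + PartitionNumber
consumer.assign(Collections.singletonList(tp)); // manual assignment, no consumer group
consumer.seek(tp, 42L);                         // Offset: the third coordinate
ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));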
Inspecting the current state of a topic: ISR means "in-sync replicas" (replicas that are in sync with the leader). In the example, broker 0 is the leader for partition 1, broker 1 is the leader for partitions 0 and 2, and all replicas are in sync with their respective leaders.
Producer
Producers are available for the JVM (Java, Scala), C/C++, Python, Ruby, etc., but the Kafka project officially provides only the JVM implementation.
Properties props = new Properties();
props.put("metadata.broker.list", "...");
// We can set the producer to be [async] or [sync]
// [async]: producer.send() produces a Future and will not block
// [sync]: producer.send() will block
props.put("producer.type", "async");
ProducerConfig config = new ProducerConfig(props);
Producer<K, V> p = new Producer<>(config);
KeyedMessage<K, V> msg = ...;
p.send(msg);
Ways for Producer to Pick Partition:
round-robin
priority
based on key or record
using Kafka's default: if a key exists, hash the key; if not, round-robin. (A custom strategy can be plugged in; see the partitioner sketch below.)
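All of these strategies can be expressed through the producer's pluggable Partitioner interface. A hypothetical priority-flavored sketch (the class name and the "urgent" rule are invented):

import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

public class PriorityPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if ("urgent".equals(key)) return 0;        // priority: dedicated partition
        if (key == null) return numPartitions - 1; // keyless records: fixed fallback in this toy sketch
        return Math.floorMod(key.hashCode(), numPartitions); // based on key
    }
    @Override public void close() {}
    @Override public void configure(Map<String, ?> configs) {}
}

It would be registered via props.put("partitioner.class", PriorityPartitioner.class.getName()).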
Java's producer API
class kafka.javaapi.producer.Producer<K, V> {
public Producer(ProducerConfig config);
/**
* Sends the data to a single topic, partitioned by key, using either the
* synchronous or asynchronous producer.
*/
public void send(KeyedMessage<K, V> message);
/**
* Use this API to send data to multiple topics
*/
public void send(List<KeyedMessage<K, V>> messages);
/**
* Close API to close producer pool connections to all Kafka brokers.
*/
public void close();
}
Example: Constructing a Producer
// instantiate a Properties object
private Properties kafkaProps = new Properties();
kafkaProps.put("bootstrap.servers", "broker1:9092,broker2:9092");
// we plan on using strings for our message key and value
// so use the built-in StringSerializer.
kafkaProps.put("key.serializer",
"org.apache.kafka.common.serialization.StringSerializer");
kafkaProps.put("value.serializer",
"org.apache.kafka.common.serialization.StringSerializer");
// create a new producer by setting the appropriate key and value types
// and passing the Properties object.
KafkaProducer<String, String> producer = new KafkaProducer<>(kafkaProps);
Ways to Send Message
Fire-and-forget: just send, don't care if arrived
Synchronous send: wait for reply
Asynchronous send: don't wait for the reply, but the reply triggers a callback
Example for message sending: fire and forget
// Create ProducerRecord objects
// - name of the topic (always string)
// - key (must match our serializer and producer objects)
// - value (must match our serializer and producer objects)
ProducerRecord<String, String> record =
new ProducerRecord<>("CustomerCountry", "Precision Products", "France");
try {
// send() method returns a Java Future object with RecordMetadata
// but since we implement fire-and-forget, we don't care about it
producer.send(record);
} catch (Exception e) { // errors before sending
// a SerializationException when it fails to serialize the message
// a BufferExhaustedException or TimeoutException if the buffer is full
// or an InterruptException if the sending thread was interrupted.
e.printStackTrace();
}
Example for message sending: synchronous
try {
    // send().get() blocks until the broker replies; the returned
    // RecordMetadata can be used to retrieve the offset the message
    // was written to.
    RecordMetadata meta = producer.send(record).get();
} catch (Exception e) {
    // thrown if the record was not sent successfully to Kafka
    e.printStackTrace();
}
Example for message sending: asynchronous
// To use callbacks, you need a class that implements the
// org.apache.kafka.clients.producer.Callback interface, which has a single
// function called onCompletion().
private class DemoProducerCallback implements Callback {
@Override
public void onCompletion(RecordMetadata recordMetadata, Exception e) {
if (e != null) {
// If Kafka returned an error, onCompletion() will have a nonnull exception.
e.printStackTrace();
}
}
}
ProducerRecord<String, String> record =
new ProducerRecord<>("CustomerCountry", "Biomedical Materials", "USA");
// And we pass a Callback object along when sending the record.
producer.send(record, new DemoProducerCallback());
Message acking configuration by the producer:
committed: a message is considered committed when "any required" in-sync replicas (ISR) for that partition have applied it to their data log.
"any required": defined by request.required.acks (a consistency level that trades latency against data safety; see the sketch after this list)
0: producer never waits for an ack from the broker.
1: producer gets an ack after the leader replica has received the data. (Data is lost if the leader dies before replicating to another server.)
-1: producer gets an ack after all ISR have received the data.
request.timeout.ms: the amount of time the broker will wait trying to meet the ack requirement. (A message may be committed even when the broker sends a timeout error to the client.)
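request.required.acks is the old Scala-producer name; in the modern Java producer the same trade-off is the acks setting. A sketch:

import java.util.Properties;

Properties props = new Properties();
// trade latency against data safety:
props.put("acks", "all");  // like -1: wait until all in-sync replicas have the data
// props.put("acks", "1"); // leader only: faster, but data is lost if the leader dies
// props.put("acks", "0"); // never wait for an ack: lowest latency, weakest guarantee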
Batching: the producer sends multiple Records to multiple Leaders at once, once the number of Records at hand reaches a threshold. (see the sketch below)
improves throughput
data is lost if the client dies before pending messages are sent
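In the modern Java producer the batching threshold is governed by batch.size and linger.ms; a sketch (values illustrative):

import java.util.Properties;

Properties props = new Properties();
props.put("batch.size", "16384"); // max bytes collected per partition batch before sending
props.put("linger.ms", "5");      // wait up to 5 ms to fill a batch: throughput vs. latency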
Consumer
Consumers pull from Kafka:
so consumers control their own pace of consumption
the system can be designed for the average load rather than the peak
Consumers are responsible for tracking their read positions (offsets)
High-level consumer API: takes care of this for you, stores offsets in ZooKeeper
Simple consumer API: nothing provided, it’s totally up to you
Why track offsets? (a consumer sketch follows this list)
consumers can replay old messages
consumers can decide to read only a specific subset of partitions for a given topic
enables batch ingestion tools that, for example, write from Kafka to Hadoop HDFS every hour
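A minimal high-level consumer sketch in Java (broker address, group id, and topic name are made up); with a group.id set, Kafka tracks this group's offset for each partition, so replaying is a matter of resetting that offset:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

Properties props = new Properties();
props.put("bootstrap.servers", "broker1:9092");
props.put("group.id", "country-counter"); // consumers sharing this id split the partitions
props.put("key.deserializer",
    "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer",
    "org.apache.kafka.common.serialization.StringDeserializer");

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Collections.singletonList("CustomerCountry"));
while (true) {
    // pull model: the consumer fetches at its own pace
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> r : records)
        System.out.printf("partition=%d offset=%d value=%s%n",
                          r.partition(), r.offset(), r.value());
}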
Rebalance: dynamically divide partitions evenly across consumers
Adding more processes/threads will cause Kafka to rebalance, possibly changing the assignment of a partition to a thread (which is dangerous for stateful consumers). A rebalance is triggered:
When a consumer joins or leaves a consumer group
When a broker joins or leaves
When a topic joins or leaves (via filter createMessageStreamsByFilter())
The assignment of brokers to consumers is dynamic at run-time.
broker can be added or deleted
consumer can be added or deleted
Rebalancing is a normal and expected lifecycle event in Kafka.
Most ops issues are due to (1) rebalancing and (2) consumer lag, so DevOps must understand what goes on; see the rebalance-listener sketch below.
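Consumers that keep per-partition state can observe rebalances through the ConsumerRebalanceListener callbacks. A sketch, reusing the consumer from the sketch above (what to do in each callback is application-specific):

import java.util.Collection;
import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.common.TopicPartition;

consumer.subscribe(Collections.singletonList("CustomerCountry"),
    new ConsumerRebalanceListener() {
        @Override
        public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
            consumer.commitSync(); // flush offsets/state before losing these partitions
        }
        @Override
        public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
            // (re)initialize any per-partition state here
        }
    });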
Kafka Usage
Microservice + Kafka
Other PubSub Frameworks
Kafka: a distributed, topic-based buffer queue
High Availability
High Throughput: 200 billion Records generated via Kafka
High Scalability: Tens of thousands of data producers, thousands of consumers
High Durability, Fault Tolerance (message still received, even if queue is offline)
Low Latency: 7 million writes/s, 35 million reads/s
Support large data backlogs (offline ingestion)
Realtime: create new feeds online
Typical Data:
Metrics: operational telemetry data.
Tracking: everything a LinkedIn user does.
Queuing: between LinkedIn apps, e.g., for sending emails.
Big Data Ecosystem Datasheet
An incomplete-but-useful list of big-data-related projects packed into a JSON dataset.
AMPcrowd - A RESTful web service that runs microtasks across multiple crowds.
AMPLab G-OLA - a novel mini-batch execution model that generalizes OLA to support general OLAP queries with arbitrarily nested aggregates using efficient delta maintenance techniques.
DataTorrent StrAM - real-time engine is designed to enable distributed, asynchronous, real time in-memory big-data computations in as unblocked a way as possible, with minimal overhead and impact on performance.
DistributedR - scalable high-performance platform for the R language.
Drools - a Business Rules Management System (BRMS) solution.
eBay Oink - REST based interface for PIG execution.
Esper - a highly scalable, memory-efficient, in-memory computing, SQL-standard, minimal latency, real-time streaming-capable Big Data processing engine for historical data.
Facebook Corona - Hadoop enhancement which removes single point of failure.
HParser - data parsing transformation environment optimized for Hadoop.
IBM Streams - advanced analytic platform that allows user-developed applications to quickly ingest, analyze and correlate information as it arrives from thousands of real-time sources.
JAQL - declarative programming language for working with structured, semi-structured and unstructured data.
Kite - is a set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem.
Kryo - Java serialization and cloning: fast, efficient, automatic.
LinkedIn Cubert - a fast and efficient batch computation engine for complex analysis and reporting of massive datasets on Hadoop.
Metamarkers Druid - framework for real-time analysis of large datasets.
Microsoft Azure Stream Analytics - an event processing engine that helps uncover real-time insights from devices, sensors, infrastructure, applications and data.
Microsoft Orleans (Project Orleans) - a framework that provides a straightforward approach to building distributed high-scale computing applications.
Microsoft Trill - a high-performance in-memory incremental analytics engine.
Netflix Aegisthus - a bulk data pipeline out of Cassandra; implements a reader for the SSTable format and provides a map/reduce program to create a compacted snapshot of the data contained in a column family.
Netflix PigPen - map-reduce for Clojure which compiles to Apache Pig.
Netflix STAASH - language-agnostic as well as storage-agnostic web interface for storing data into persistent storage systems.
Netflix Surus - a collection of tools for analysis in Pig and Hive.
Netflix Zeno - Netflix's In-Memory Data Propagation Framework.
Nextflow - Dataflow oriented toolkit for parallel and distributed computational pipelines.
Nokia Disco - MapReduce framework developed by Nokia.
Oryx - is a realization of the lambda architecture built on Apache Spark and Apache Kafka, but with specialization for real-time large scale machine learning.
Pachyderm - lets you store and analyze your data using containers.
Parsely Streamparse - lets you run Python code against real-time streams of data; it also integrates Python smoothly with Apache Storm.
Tigon - a distributed framework built on Apache Hadoop and Apache HBase for real-time, high-throughput, low-latency data processing and analytics applications.
Netflix S3mper - library that provides an additional layer of consistency checking on top of Amazon's S3 index through use of a consistent, secondary index.
Palantir AtlasDB - a massively scalable datastore and transactional layer that can be placed on top of any key-value store to give it ACID properties.
Sqrrl - NoSQL databases on top of Apache Accumulo.
Stratio Cassandra - Cassandra's index functionality extended to provide near-real-time search like ElasticSearch or Solr, including full-text search capabilities and multivariable, geospatial and bitemporal search.
Tokutek - Tokutek claims to improve MongoDB performance 20x.
Key-value Data Model
Aerospike - flash-optimized, in-memory NoSQL database. Open source, with server code in C (not Java or Erlang) precisely tuned to avoid context switching and memory copies.
Amazon DynamoDB - distributed key/value store, implementation of Dynamo paper.
Couchbase ForestDB - Fast Key-Value Storage Engine Based on Hierarchical B+-Tree Trie.
Edis - a protocol-compatible server replacement for Redis.
ElephantDB - Distributed database specialized in exporting data from Hadoop.
QDB - A fast, high availability, fully Redis compatible store.
RAMCloud - storage system that provides large-scale low-latency storage by keeping all data in DRAM all the time and aggregating the main memories of thousands of servers.
RebornDB - distributed database fully compatible with the Redis protocol.
Doradus - Doradus is a REST service that extends a Cassandra NoSQL database with a graph-based data model, advanced indexing and search features, and a REST API.
Facebook TAO - the distributed data store that is widely used at Facebook to store and serve the social graph.
Faunus - Hadoop-based graph analytics engine for analyzing graphs represented across a multi-machine compute cluster.
GraphLab PowerGraph - a core C++ GraphLab API and a collection of high-performance machine learning and data mining toolkits built on top of the GraphLab API.
GraphX - resilient Distributed Graph System on Spark.
Microsoft Graph Engine - a distributed, in-memory, large graph processing engine, underpinned by a strongly-typed RAM store and a general computation engine.
Datomic - distributed database designed to enable scalable, flexible and intelligent applications.
FoundationDB - distributed database, inspired by F1.
Google F1 - distributed SQL database built on Spanner.
Google Spanner - globally distributed semi-relational database.
H-Store - is an experimental main-memory, parallel database management system that is optimized for on-line transaction processing (OLTP) applications.
eBay Kylin - Distributed Analytics Engine from eBay Inc. that provides SQL interface and multi-dimensional analysis (OLAP) on Hadoop supporting extremely large datasets.
Amazon Kinesis - real-time processing of streaming data at massive scale.
Amazon Snowball - a petabyte-scale data transport solution that uses secure appliances to transfer large amounts of data into and out of AWS.
AMPLab SampleClean - scalable techniques for data cleaning and statistical inference on dirty data.
Apache BookKeeper - a distributed logging service called BookKeeper and a distributed publish/subscribe system built on top of BookKeeper called Hedwig.
Apache Flume - service to manage large amount of log data.
Apache Samza - stream processing framework, based on Kafka and YARN.
Apache Sqoop - tool to transfer data between Hadoop and a structured datastore.
Apache UIMA - Unstructured Information Management applications are software systems that analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user.
Google Photon - geographically distributed system for joining multiple continuously flowing streams of data in real-time with high scalability and low latency.
Heka - open source stream processing software system.
HIHO - framework for connecting disparate data sources with Hadoop.
LinkedIn Camus - Kafka-to-HDFS pipeline; a MapReduce job that does distributed data loads out of Kafka.
LinkedIn Databus - stream of change capture events for a database.
LinkedIn Gobblin - a framework for solving the Big Data ingestion problem.
LinkedIn Kamikaze - utility package for compressing sorted integer arrays.
LinkedIn Lumos - a bridge from OLTP to OLAP for use on Hadoop.
Netflix Ribbon - an Inter-Process Communication (remote procedure call) library with built-in software load balancers; the primary usage model involves REST calls with various serialization-scheme support.
Netflix Suro - data pipeline service for collecting, aggregating, and dispatching large volume of application events including log data based on Chukwa.
Pinterest Secor - a service implementing Kafka log persistence.
Record Breaker - Automatic structure for your text-formatted data.
Sawmill - extensive log processing and reporting features.
Facebook Iris - a totally ordered queue of messaging updates with separate pointers into the queue indicating the last update sent to your Messenger app and the traditional storage tier.
Serf - decentralized solution for service discovery and orchestration.
Spotify Luigi - a Python package for building complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more.
Spring XD - distributed and extensible system for data ingestion, real time analytics, batch processing, and data export.
Amazon Machine Learning - visualization tools and wizards that guide you through the process of creating machine learning (ML) models without having to learn complex ML algorithms and technology.
AMPLab Splash - a general framework for parallelizing stochastic learning algorithms on multi-node clusters.
AMPLab Velox - a data management system for facilitating the next steps in real-world, large-scale analytics pipelines.
Apache Mahout - machine learning library for Hadoop.
Facebook DeepText - a deep learning-based text understanding engine that can understand with near-human accuracy the textual content of several thousands posts per second, spanning more than 20 languages.
Facebook FBLearner Flow - provides innovative functionality, like automatic generation of UI experiences from pipeline definitions and automatic parallelization of Python code using futures.
fbcunn - Deep Learning CUDA Extensions from Facebook AI Research.
Google DistBelief - software framework that can utilize computing clusters with thousands of machines to train large models.
Google Sibyl - System for Large Scale Machine Learning at Google.
Google TensorFlow - an Open Source Software Library for Machine Intelligence.
H2O - statistical, machine learning and math runtime for Hadoop.
KeystoneML - Simplifying robust end-to-end machine learning on Apache Spark.
LinkedIn FeatureFu - contains a collection of library/tools for advanced feature engineering to derive features on top of other features, or convert a light weighted model into a feature.
Microsoft Azure Machine Learning - is built on the machine learning capabilities already available in several Microsoft products including Xbox and Bing and using predefined templates and workflows.
MLbase - distributed machine learning libraries for the BDAS stack.
MLPNeuralNet - Fast multilayer perceptron neural network library for iOS and Mac OS X.
Neon - a highly configurable deep learning framework.
nupic - Numenta Platform for Intelligent Computing: a brain-inspired machine intelligence platform, and biologically accurate neural network based on cortical learning algorithms.
OpenAI Gym - a toolkit for developing and comparing reinforcement learning algorithms.
PredictionIO - machine learning server built on Hadoop, Mahout and Cascading.
scikit-learn - scikit-learn: machine learning in Python.
Seldon - an open source predictive analytics platform based upon Spark, Kafka and Hadoop.
Spark MLlib - a Spark implementation of some common machine learning (ML) functionality.
Sparkling Water - combines H2O's machine learning capabilities with the power of the Spark platform.
Brooklyn - library that simplifies application deployment and management.
Buildoop - Similar to Apache BigTop based on Groovy language.
Cloudera Director - a comprehensive data management platform with the flexibility and power to evolve with your business.
Cloudera HUE - web application for interacting with Hadoop.
CloudPhysics - collect operational metadata from your virtualized infrastructure, then correlate and analyze it to expose operational hazards and waste that pose a threat to your datacenter performance, efficiency and uptime.
Ganglia Monitoring System - scalable distributed monitoring system for high-performance computing systems such as clusters and Grids.
Genie - provides RESTful APIs to run Hadoop, Hive and Pig jobs, and to manage multiple Hadoop resources and perform job submissions across them.
Google Borg - job scheduling and monitoring system.
Google Omega - job scheduling and monitoring system.
Hannibal - a tool to help monitor and maintain HBase clusters that are configured for manual splitting.
Hortonworks HOYA - application that can deploy HBase cluster on YARN.
Jumbune - an open-source product built for analyzing Hadoop clusters and MapReduce jobs.
Marathon - Mesos framework for long-running services.
Minotaur - scripts/recipes/configs to spin up VPC-based infrastructure in AWS from scratch and deploy labs to it.
Myriad - a Mesos framework designed for scaling YARN clusters on Mesos; Myriad can expand or shrink one or more YARN clusters in response to events, per configured rules and policies.
Netflix SimianArmy - a suite of tools for keeping your cloud operating in top form.
Netflix Eureka - AWS Service registry for resilient mid-tier load balancing and failover.
Netflix Hystrix - a latency and fault tolerance library designed to isolate points of access to remote systems, services and 3rd party libraries, stop cascading failure and enable resilience in complex distributed systems where failure is inevitable.
Scaling Data - traces data center problems to root cause, predicts capacity issues, identifies emerging failures and highlights latent threats.
Stratio Manager - install, manage and monitor all the technology stack related to the Stratio Platform.
Tumblr Collins - Infrastructure management for engineers.
Amazon Aurora - a MySQL-compatible, relational database engine that combines the speed and availability of high-end commercial databases with the simplicity and cost-effectiveness of open source databases.
Stratio Explorer - an Interactive Web interpreter to Apache Crossdata, Stratio Ingestion, Stratio Decision,Markdown, Apache Spark, Apache Spark-SQL and command Shell.
Tamr - standalone tool to catalog all of your enterprise metadata.
Zaloni Bedrock - fully integrated Hadoop data management platform.
Zaloni Mica - self-service data discovery, curation, and governance.
Zillabyte - an API for distributed data computation that scales with your data.
Data Warehouse
Google Mesa - highly scalable analytic data warehousing system.
IBM BigInsights - data processing, warehousing and analytics.
IBM dashDB - Data Warehousing and Analysis Needs, all in the Cloud.
Microsoft Azure SQL Data Warehouse - businesses access to an elastic petabyte-scale, data warehouse-as-a-service offering that can scale according to their needs.
Microsoft Cosmos - Microsoft's internal BigData analysis platform.
Data Visualization
Arbor - graph visualization library using web workers and jQuery.
CartoDB - open-source or freemium hosting for geospatial databases with powerful front-end editing capabilities and a robust API.
Chart.js - open source HTML5 Charts visualizations.
Chartist.js - another open source HTML5 Charts visualization.
Crossfilter - JavaScript library for exploring large multivariate datasets in the browser. Works well with dc.js and d3.js.
Cubism - JavaScript library for time series visualization.
Cytoscape - JavaScript library for visualizing complex networks.
D3 - JavaScript library for manipulating documents.
DC.js - Dimensional charting built to work natively with crossfilter rendered using d3.js. Excellent for connecting charts/additional metadata to hover events in D3.
Grafana - open source, feature rich metrics dashboard and graph editor for Graphite, InfluxDB & OpenTSDB.
Graphistry - runs on GPUs, turning static designs into interactive tools using client/cloud GPU infrastructure and GPU-accelerated languages like Superconductor.
Plot.ly - easy-to-use web service that allows for rapid creation of complex charts, from heatmaps to histograms; upload data to create and style charts with Plotly's online spreadsheet, or fork others' plots.
Recline - simple but powerful library for building data applications in pure Javascript and HTML.
Redash - open-source platform to query and visualize data.
Sigma.js - JavaScript library dedicated to graph drawing.
Square Cubism.js - a D3 plugin for visualizing time series; use Cubism to construct better realtime dashboards, pulling data from Graphite, Cube and other sources.