Overview

Source: https://www.oreilly.com/content/drawing-a-map-of-distributed-data-systems/

Textbooks

File stores

  • Ambry - Data packaging and management system. Python 2 only. (Source Code)
  • ROOT - A modular scientific software framework. It provides all the functionalities needed to deal with big data processing, statistical analysis, visualisation and storage. It is mainly written in C++ but integrated with other languages such as Python and R.

Parquet

SQLite

  • SQLite - Best file-based OLTP database.
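
As a minimal sketch of the file-based OLTP use case, the snippet below uses Python's standard-library `sqlite3` module to run small transactions against a single database file. The `accounts` table, the file name, and the `transfer` and `balances` helpers are illustrative assumptions, not taken from any project listed here.

```python
import os
import sqlite3
import tempfile


def transfer(conn, src, dst, amount):
    """Move `amount` between two rows atomically (hypothetical helper)."""
    # `with conn` commits on success and rolls back if an exception is raised.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, src),
        )
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, dst),
        )


def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))


# The whole database lives in one ordinary file (path chosen for illustration).
db_path = os.path.join(tempfile.mkdtemp(), "example.db")
conn = sqlite3.connect(db_path)
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
with conn:
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )

transfer(conn, "alice", "bob", 30)

# This transfer would overdraw bob, so the CHECK constraint rejects it
# and the transaction is rolled back, leaving both rows unchanged.
try:
    transfer(conn, "bob", "alice", 500)
except sqlite3.IntegrityError:
    pass

print(balances(conn))
```

The second transfer fails as a whole rather than half-applying, which is the property that makes a single-file store usable for transactional workloads.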

HDF5

  • PyTables - Pretty good option. pandas builds on it through the HDFStore class.
  • h5py - The h5py package is a Pythonic interface to the HDF5 binary data format. More low-level than PyTables.

Benchmarks

Row-oriented

MySQL

MariaDB

  • Fork of MySQL by original MySQL developers.
  • Can access CSV files directly using the CONNECT engine.

PostgreSQL

Column-oriented

MariaDB ColumnStore

  • Probably the best columnstore in the MySQL family.

MonetDB

Infobright

InfiniDB

  • Column-oriented fork of MySQL.
  • The company went bankrupt; its code became the basis of MariaDB ColumnStore.

CitusDB

  • Citus horizontally scales PostgreSQL across commodity servers using sharding and replication. Its query engine parallelizes incoming SQL queries across these servers to enable real-time responses on large datasets.
  • Citus extends the underlying database rather than forking it, which gives developers and enterprises the power and familiarity of a traditional relational database. As an extension, Citus supports new PostgreSQL releases, allowing users to benefit from new features while maintaining compatibility with existing PostgreSQL tools.
  • Citus cstore_fdw allows creating column-oriented tables in PostgreSQL (GitHub).
  • GitHub, Docs
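
The idea of sharding a table and parallelizing a query across workers can be illustrated with a toy sketch in plain Python. This is not Citus's API or implementation: `NUM_SHARDS`, `shard_for`, `distribute`, `partial_sum`, and `distributed_sum` are hypothetical names, the "workers" are threads in one process, and the CRC32 hash is chosen only because it is stable across runs.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 4  # hypothetical number of workers


def shard_for(key):
    """Pick a shard with a stable hash so a key always lands in the same place."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SHARDS


def distribute(rows):
    """Split rows into shards by hashing the distribution column `customer`."""
    shards = [[] for _ in range(NUM_SHARDS)]
    for row in rows:
        shards[shard_for(row["customer"])].append(row)
    return shards


def partial_sum(shard):
    """Per-shard part of: SELECT customer, sum(amount) GROUP BY customer."""
    totals = {}
    for row in shard:
        totals[row["customer"]] = totals.get(row["customer"], 0) + row["amount"]
    return totals


def distributed_sum(shards):
    """Run the aggregate on every shard in parallel, then merge the results."""
    with ThreadPoolExecutor(max_workers=NUM_SHARDS) as pool:
        partials = list(pool.map(partial_sum, shards))
    merged = {}
    for partial in partials:
        # Keys are disjoint across shards because rows are sharded by customer.
        merged.update(partial)
    return merged


rows = [
    {"customer": "a", "amount": 10},
    {"customer": "b", "amount": 5},
    {"customer": "a", "amount": 7},
    {"customer": "c", "amount": 1},
]
shards = distribute(rows)
result = distributed_sum(shards)
print(result)
```

Because the grouping column is also the distribution column, each group is computed entirely on one shard and the merge step needs no further arithmetic.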

GreenPlum

  • Pivotal Greenplum is an advanced, fully featured, open source data warehouse. It provides powerful and rapid analytics on petabyte scale data volumes. Uniquely geared toward big data analytics, Greenplum is powered by the world’s most advanced cost-based query optimizer delivering high analytical query performance on large data volumes. Based on PostgreSQL.
  • Part of the Open Source Pivotal Big Data Suite
  • GitHub

MemSQL

  • "The Fastest In-Memory Database."
  • MemSQL is a distributed In-Memory Database that lets you process transactions and run analytics in real-time, using SQL.

LucidDB

  • LucidDB is the first and only open-source RDBMS purpose-built entirely for data warehousing and business intelligence. It is based on architectural cornerstones such as column-store, bitmap indexing, hash join/aggregation, and page-level multiversioning. Most database systems (both proprietary and open-source) start life with a focus on transaction processing capabilities, then get analytical capabilities bolted on as an afterthought (if at all). By contrast, every component of LucidDB was designed with the requirements of flexible, high-performance data integration and sophisticated query processing in mind. Moreover, comprehensiveness within the focused scope of its architecture means simplicity for the user: no DBA required.
  • Superseded by Apache Calcite.
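
Bitmap indexing, one of the techniques named above, can be sketched in a few lines. This is a toy illustration and not LucidDB's implementation: Python integers stand in for bitmaps, and `build_bitmap_index`, `row_ids`, and the sample columns are hypothetical.

```python
from collections import defaultdict


def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when values[i] matches."""
    index = defaultdict(int)
    for row_id, value in enumerate(values):
        index[value] |= 1 << row_id
    return dict(index)


def row_ids(bitmap):
    """Decode a bitmap back into the sorted list of row ids it represents."""
    ids = []
    row_id = 0
    while bitmap:
        if bitmap & 1:
            ids.append(row_id)
        bitmap >>= 1
        row_id += 1
    return ids


# Two low-cardinality columns of the same six-row table.
region = ["eu", "us", "eu", "asia", "us", "eu"]
status = ["open", "open", "closed", "open", "closed", "open"]

region_idx = build_bitmap_index(region)
status_idx = build_bitmap_index(status)

# WHERE region = 'eu' AND status = 'open' becomes a single bitwise AND.
matches = row_ids(region_idx["eu"] & status_idx["open"])
print(matches)
```

Predicates combined with OR map to a bitwise OR in the same way, which is why bitmap indexes suit analytical filters over columns with few distinct values.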

Apache Calcite

  • Apache Calcite is a dynamic data management framework.
  • It contains many of the pieces that comprise a typical database management system, but omits some key functions: storage of data, algorithms to process data, and a repository for storing metadata.
  • Java.

Druid

  • Column-oriented distributed data store, ideal for powering interactive applications.
  • Java, GitHub

Array

Arrays in database systems, the next frontier?

SciQL

  • An add-on for MonetDB adding support for storing and querying arrays.

SciDB

  • Difficult to compile / install if not enterprise.
  • Data versioning.

RasDaMan

  • rasdaman ("raster data manager") allows storing and querying massive multi-dimensional arrays, such as sensor, image, simulation, and statistics data appearing in domains like earth, space, and life science. This worldwide leading array analytics engine distinguishes itself by its flexibility, performance, and scalability. Rasdaman can process arrays residing in file system directories as well as in databases.

Apache Kylin

Hadoop

Distributions

MapR

Cloudera

Hortonworks

  • By the creators of Hadoop.

Software

  • Hive
  • Impala - Cloudera delivers the modern platform for data management analytics.
  • Drill

Graph

  • Dgraph - Native GraphQL database with a graph backend.

NoSQL

  • Spark - Fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing. Create pipelines where you load things into memory only once. Databricks offers a hosted solution integrated with AWS.
  • Cassandra - Distributed database well suited to write-heavy workloads. Originally developed at Facebook and now an Apache project; DataStax is its main commercial backer.
  • Scylla - Cassandra re-written in C++.

Benchmarks