- Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
- Ambry - Data packaging and management system. Python 2 only. (Source Code)
- ROOT - A modular scientific software framework. It provides all the functionalities needed to deal with big data processing, statistical analysis, visualisation and storage. It is mainly written in C++ but integrated with other languages such as Python and R.
- SQLite - Best file-based OLTP database.
- PyTables - Pretty good option. Has Pandas bindings through the HDFStore class.
- h5py - The h5py package is a Pythonic interface to the HDF5 binary data format. More low-level than PyTables.
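
As a minimal sketch of the file-based OLTP use case noted for SQLite above, the following uses Python's standard-library `sqlite3` module. The table name `samples` and its columns are hypothetical, and an in-memory database stands in for a file path so the example is self-contained.

```python
import sqlite3

# ":memory:" keeps the example self-contained; pass a file path
# (for example "data.db", a hypothetical name) to persist to a single file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (id INTEGER PRIMARY KEY, name TEXT, value REAL)")

# Using the connection as a context manager wraps the statements in a
# transaction: commit on success, rollback if an exception is raised.
with conn:
    conn.executemany(
        "INSERT INTO samples (name, value) VALUES (?, ?)",
        [("a", 1.5), ("b", 2.5), ("c", 4.0)],
    )

# A failed transaction leaves no partial writes behind.
try:
    with conn:
        conn.execute("INSERT INTO samples (name, value) VALUES (?, ?)", ("d", 8.0))
        raise RuntimeError("simulated failure")  # forces a rollback
except RuntimeError:
    pass

row_count = conn.execute("SELECT COUNT(*) FROM samples").fetchone()[0]
total = conn.execute("SELECT SUM(value) FROM samples").fetchone()[0]
```

Only the three rows from the committed transaction remain; the fourth insert is rolled back together with the simulated failure.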
- Fork of MySQL by original MySQL developers.
- Can access CSV files directly using the CONNECT engine.
- Foreign data wrappers (FDW)
- Use multicorn to write FDW in Python.
- Use MADlib for scalable in-database analytics.
- Probably the best columnstore in the MySQL family.
- MonetDB is an open source column-oriented database management system.
- SciQL provides support for arrays.
- Allows embedding python code.
- Supports SAM/BAM files.
- Requires recompiling samtools with
- Some work on Data Vaults
- Adds support for external scientific formats.
- Access data just-in-time.
- Data Vaults: a Database Welcome to Scientific File Repositories
- Column-oriented fork of MySQL.
- Infobright Community Edition 4.0.7 was released in 2012 and is based on MySQL 5.1 (very old).
- Enterprise edition has many more features.
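
Several of the engines listed above are column-oriented. The toy sketch below contrasts a row layout with a column layout in plain Python; it illustrates the general idea only and does not reflect how MonetDB, Infobright, or any other engine is implemented. The table contents and column names are hypothetical.

```python
from array import array

# Hypothetical toy table: (id, region, amount).
rows = [
    (1, "eu", 10.0),
    (2, "us", 20.0),
    (3, "eu", 5.0),
    (4, "us", 7.5),
]

# Row-oriented layout: each record is stored together, so an aggregate over
# one column still has to walk through every full record.
row_total = sum(record[2] for record in rows)

# Column-oriented layout: each column is stored contiguously, so the same
# aggregate reads only the "amount" column.
columns = {
    "id": array("q", (r[0] for r in rows)),
    "region": [r[1] for r in rows],
    "amount": array("d", (r[2] for r in rows)),
}
col_total = sum(columns["amount"])

# Low-cardinality columns compress well, for example with dictionary
# encoding: store each distinct value once and keep a small code per row.
dictionary = sorted(set(columns["region"]))
codes = array("b", (dictionary.index(v) for v in columns["region"]))
```

Both layouts give the same total. The column layout touches less data for analytical queries, and keeping values of one type together is what makes encodings such as the dictionary above effective.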
- Citus horizontally scales PostgreSQL across commodity servers using sharding and replication. Its query engine parallelizes incoming SQL queries across these servers to enable real-time responses on large datasets.
- Citus extends the underlying database rather than forking it, which gives developers and enterprises the power and familiarity of a traditional relational database. As an extension, Citus supports new PostgreSQL releases, allowing users to benefit from new features while maintaining compatibility with existing PostgreSQL tools.
- Citus cstore_fdw allows creating column-oriented tables in PostgreSQL (GitHub).
- GitHub, Docs
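
The sharding idea described for Citus can be sketched in a few lines. This is a toy model under stated assumptions, not Citus's implementation or API: in-memory SQLite databases stand in for worker nodes, the names `events`, `shard_for`, `insert_event`, and `total_amount` are hypothetical, and the fan-out runs sequentially where a real system would query the shards in parallel.

```python
import sqlite3
import zlib

NUM_SHARDS = 3  # hypothetical worker count

# One in-memory SQLite database per shard stands in for a worker node.
shards = [sqlite3.connect(":memory:") for _ in range(NUM_SHARDS)]
for db in shards:
    db.execute("CREATE TABLE events (user_id TEXT, amount REAL)")


def shard_for(key: str) -> int:
    """Map a distribution key to a shard index with a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SHARDS


def insert_event(user_id: str, amount: float) -> None:
    """Route a row to the shard that owns its distribution key."""
    db = shards[shard_for(user_id)]
    with db:
        db.execute("INSERT INTO events VALUES (?, ?)", (user_id, amount))


def total_amount() -> float:
    """Send the aggregate to every shard, then combine the partial sums."""
    partials = [
        db.execute("SELECT COALESCE(SUM(amount), 0) FROM events").fetchone()[0]
        for db in shards
    ]
    return sum(partials)


for user, amount in [("alice", 3.0), ("bob", 4.5), ("carol", 1.5), ("alice", 1.0)]:
    insert_event(user, amount)

grand_total = total_amount()
```

Because the hash is stable, all rows for one `user_id` land on the same shard, so per-user queries can be answered by a single worker while global aggregates are computed as partial results and merged.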
- Pivotal Greenplum is an advanced, fully featured, open source data warehouse. It provides powerful and rapid analytics on petabyte scale data volumes. Uniquely geared toward big data analytics, Greenplum is powered by the world’s most advanced cost-based query optimizer delivering high analytical query performance on large data volumes. Based on PostgreSQL.
- Part of the Open Source Pivotal Big Data Suite
- "The Fastest In-Memory Database."
- MemSQL is a distributed In-Memory Database that lets you process transactions and run analytics in real-time, using SQL.
- LucidDB is the first and only open-source RDBMS purpose-built entirely for data warehousing and business intelligence. It is based on architectural cornerstones such as column-store, bitmap indexing, hash join/aggregation, and page-level multiversioning. Most database systems (both proprietary and open-source) start life with a focus on transaction processing capabilities, then get analytical capabilities bolted on as an afterthought (if at all). By contrast, every component of LucidDB was designed with the requirements of flexible, high-performance data integration and sophisticated query processing in mind. Moreover, comprehensiveness within the focused scope of its architecture means simplicity for the user: no DBA required.
- Superseded by Apache Calcite.
- Apache Calcite is a dynamic data management framework.
- It contains many of the pieces that comprise a typical database management system, but omits some key functions: storage of data, algorithms to process data, and a repository for storing metadata.
- Column oriented distributed data store ideal for powering interactive applications
- Java, GitHub
- Arrays in database systems, the next frontier?
- An add-on for MonetDB adding support for storing and querying arrays.
- Difficult to compile / install if not enterprise.
- Data versioning.
- rasdaman ("raster data manager") allows storing and querying massive multi-dimensional arrays, such as sensor, image, simulation, and statistics data appearing in domains like earth, space, and life science. This worldwide leading array analytics engine distinguishes itself by its flexibility, performance, and scalability. Rasdaman can process arrays residing in file system directories as well as in databases.
- By the creators of Hadoop.
- Hive -
- Impala - Massively parallel SQL query engine for data stored in Hadoop, developed by Cloudera.
- Drill -
- Dgraph - Native GraphQL database with a graph backend.
- Spark - Fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing. Create pipelines where you load things into memory only once. Databricks offers a hosted solution integrated with AWS.
- Cassandra - Distributed database well suited to write-heavy workloads. Originally developed at Facebook; commercially supported by DataStax.
- Scylla - Cassandra re-written in C++.
- Citus Data cstore_fdw (PostgreSQL Column Store) vs. MonetDB TPC-H Shootout
- MonetDB is faster than column-oriented PostgreSQL (using cstore_fdw).
- Quick comparison of MyISAM, Infobright, and MonetDB
- MonetDB wins in performance.
- Infobright wins in table size.
- Marc Fiume Thesis (page 113)
- Infobright has much smaller tables.
- Data Management and Data Processing Support on Array-Based Scientific Data
- HDF5 vs MonetDB and others.