Introduction
Cassandra is an open-source distributed database designed to store and manage large amounts of data across commodity servers. It aims to:
- Provide scalability and high availability without compromising performance
- Scale linearly and tolerate faults on commodity hardware or cloud infrastructure, which makes it a good platform for mission-critical data
- Replicate data across multiple datacenters to keep latency low for users
- Offer the convenience of column indexes with the performance of log-structured updates, strong support for denormalization and materialized views, and powerful built-in caching
History
- Combines ideas from Google's Bigtable (the data model) and Amazon's Dynamo (the distributed design)
Architecture
- Overview
  - Peer-to-peer distributed system
  - All nodes are the same; there is no master node
  - Data is partitioned among all nodes in the cluster
  - Configurable data replication ensures fault tolerance
  - Read/write-anywhere design: any node can accept a request
- Communication
  - Nodes exchange cluster state with one another through the gossip protocol, which runs every second
  - Each node uses a commit log to capture write activity
    - This assures data durability
  - Data is also written to an in-memory structure (memtable) and then flushed to disk as an SSTable once the memtable is full
- Advantages
  - Scalability
    - Because the system is peer-to-peer, it is easy to add nodes and grow a cluster from terabytes to petabytes
  - Fault tolerance
    - Replication
  - Post-relational database
    - Because it is column-oriented, it has a dynamic schema design that allows for much more flexible data storage
  - Data compression
    - Uses Google's Snappy data compression algorithm
    - Compresses data at the column-family level
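The partitioning, replication, and write path described above can be sketched as a toy model in Python. This is only an illustration under stated assumptions, not Cassandra's actual implementation: the names `Node`, `Ring`, `token`, and `memtable_limit` are hypothetical, MD5 stands in for the partitioner's hash function, the tiny flush threshold is arbitrary, and the `Ring` object plays the role of whichever node happens to coordinate a request.

```python
import bisect
import hashlib


def token(text):
    """Map a string to a position on the ring (MD5 is an assumption here)."""
    return int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16)


class Node:
    """One peer: commit log, in-memory memtable, and flushed SSTables."""

    def __init__(self, name, memtable_limit=2):
        self.name = name
        self.memtable_limit = memtable_limit  # hypothetical flush threshold
        self.commit_log = []  # every write is recorded here first
        self.memtable = {}    # recent writes, held in memory
        self.sstables = []    # immutable sorted snapshots, oldest first

    def write(self, key, value):
        self.commit_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Persist the memtable as a sorted, read-only table and start afresh.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest flushed data wins
            for stored_key, value in table:
                if stored_key == key:
                    return value
        return None


class Ring:
    """Places keys on a token ring and replicates each key to several nodes."""

    def __init__(self, nodes, replication_factor=2):
        self.replication_factor = replication_factor
        self.ring = sorted(((token(n.name), n) for n in nodes),
                           key=lambda pair: pair[0])
        self.tokens = [t for t, _ in self.ring]

    def replicas(self, key):
        # First node at or after the key's token, then the next ones clockwise.
        start = bisect.bisect_left(self.tokens, token(key)) % len(self.ring)
        count = min(self.replication_factor, len(self.ring))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]

    def write(self, key, value):
        for node in self.replicas(key):
            node.write(key, value)

    def read(self, key):
        return self.replicas(key)[0].read(key)
```

For example, `ring = Ring([Node("a"), Node("b"), Node("c")])` followed by `ring.write("user:1", "alice")` stores the value on two of the three nodes, and `ring.read("user:1")` returns it from the first replica. The sketch leaves out gossip, consistency levels, commit-log replay, and compaction.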
Current status
Facebook originally used Cassandra for inbox search but has since moved that workload to HBase.