Big Data

Tue 25 March 2014 Written by Evi

RDBMS

RDBMS stands for Relational Database Management System. It is a system that stores data in the form of tables. The relationships among the data are also stored in the form of tables.
Take, for example, the following two tables, which describe wineries and the regions they belong to:

 

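Winery ID | Winery Name   | Address     | Region ID
----------+---------------+-------------+----------
1         | Barkan        | Some street | 3
2         | Golan Heights | Some street | 1
3         | Castel        | Some street | 1
4         | Dalton        | Some street | 2
5         | Yeter         | Some street | 1

Region ID | Region Name  | State
----------+--------------+--------
1         | Some region  | State A
2         | Other region | State B
3         | New region   | State C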

 

This simple example illustrates the structure of an RDBMS, which provides Atomicity, Consistency, Isolation and Durability (ACID). These properties are highly important for maintaining the integrity of data that is modified and read by many users at the same time. To maintain data integrity, the RDBMS model “works hard”: it uses primary keys, foreign keys and indexes to tie together the pieces of a single data record. In the example above, Region ID in the winery table is a foreign key that points to a row in the region table.
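To make the role of keys and joins concrete, here is a minimal sketch of the two example tables using Python's built-in sqlite3 module. The table names, column names and index name are my own assumptions chosen to mirror the tables above; SQLite simply stands in for any RDBMS.

```python
import sqlite3


def build_db():
    """Create an in-memory database holding the two example tables."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE region (
            region_id   INTEGER PRIMARY KEY,
            region_name TEXT NOT NULL,
            state       TEXT NOT NULL
        );
        CREATE TABLE winery (
            winery_id   INTEGER PRIMARY KEY,
            winery_name TEXT NOT NULL,
            address     TEXT NOT NULL,
            region_id   INTEGER NOT NULL REFERENCES region(region_id)
        );
        -- Hypothetical index name; it speeds up lookups by region.
        CREATE INDEX idx_winery_region ON winery(region_id);
    """)
    conn.executemany(
        "INSERT INTO region VALUES (?, ?, ?)",
        [(1, "Some region", "State A"),
         (2, "Other region", "State B"),
         (3, "New region", "State C")],
    )
    conn.executemany(
        "INSERT INTO winery VALUES (?, ?, ?, ?)",
        [(1, "Barkan", "Some street", 3),
         (2, "Golan Heights", "Some street", 1),
         (3, "Castel", "Some street", 1),
         (4, "Dalton", "Some street", 2),
         (5, "Yeter", "Some street", 1)],
    )
    conn.commit()
    return conn


def wineries_with_regions(conn):
    """Join the two tables to rebuild one complete record per winery."""
    return conn.execute("""
        SELECT w.winery_name, r.region_name, r.state
        FROM winery AS w
        JOIN region AS r ON r.region_id = w.region_id
        ORDER BY w.winery_id
    """).fetchall()


conn = build_db()
for row in wineries_with_regions(conn):
    print(row)
```

Note that a complete winery record only exists after the join. With foreign keys switched on, an insert that points at a region that does not exist is rejected with sqlite3.IntegrityError, which is exactly the kind of integrity work the RDBMS does on every write.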

With the massive growth of the World Wide Web and the evolution of web applications like Facebook, Twitter, Flickr and eBay, software architects realized that the RDBMS model is not robust enough. The model’s efforts to keep data intact come at the price of index clutter, difficult upgrades and overall slowness.

This evolution has led software architects to search for a new model: the big data model.

What is Big Data?

The big data or big table concept addresses the enormous overhead that web applications create. Facebook, for example, generates 50 terabytes of content just from its inbox feature. One of the ways to mitigate this huge overhead is to use a NoSQL architecture. Unlike RDBMS, big data systems often provide weak consistency guarantees, but they can support massive scale and excellent performance. Big data systems may not require fixed table schemas and usually avoid table dependencies. Let us examine what a typical big data structure looks like:

User Profile

Key | Attributes
----+-----------------------
1   | Name: Evi Rachmilewitz
    | Gender: Male
    | Smoking: Yes
    | Status: Married
    | Height: 1.86 meters
    | Work place: emyoli
----+-----------------------
2   | Name: John Mitchel
    | Gender: Male
    | Smoking: No
    | Status: Single
    | Height: 1.80 meters
    | Work place: Ford Motors

 

As we can see, big data offers a key-value database that is item-oriented, meaning all the data relating to an item is stored in that item. As a result, data is commonly duplicated between items in a table, something that goes against the core idea of RDBMS. In return, data retrieval is straightforward, because the need to join a data record from multiple tables is eliminated.
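The item-oriented idea can be sketched with a plain Python dictionary standing in for the key-value store. This is only an illustration, not the API of any particular product; the function name get_profile and the extra "languages" attribute are hypothetical.

```python
# Each key maps to one self-contained item; there are no other tables.
user_profiles = {
    1: {
        "name": "Evi Rachmilewitz",
        "gender": "Male",
        "smoking": "Yes",
        "status": "Married",
        "height_m": 1.86,
        "work_place": "emyoli",
    },
    2: {
        "name": "John Mitchel",
        "gender": "Male",
        "smoking": "No",
        "status": "Single",
        "height_m": 1.80,
        "work_place": "Ford Motors",
    },
}


def get_profile(store, key):
    """Fetch a whole item with a single key lookup; no join is involved."""
    return store[key]


# No fixed schema: one item can carry an attribute the others lack.
# "languages" is a made-up attribute used only for this illustration.
user_profiles[2]["languages"] = ["English"]

profile = get_profile(user_profiles, 1)
print(profile["name"], "works at", profile["work_place"])
```

The price is visible in the data itself: "gender: Male" is stored once per item rather than once in a shared lookup table, so changing such a value means touching every item that contains it.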

Another issue is scaling. When a query needs a join that depends on shared tables, replicating data records across machines is hard, and that blocks easy scaling.
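Self-contained items, by contrast, are easy to spread across machines. One common approach, which is my own illustration rather than something a specific platform prescribes, is to pick a machine by hashing the item's key. In the sketch below each "machine" is just a dictionary, and shard_for and ShardedStore are hypothetical names.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a key to a shard number using a hash that is stable across runs."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Toy key-value store; every shard is a dict standing in for a machine."""

    def __init__(self, num_shards):
        self.shards = [{} for _ in range(num_shards)]

    def put(self, key, item):
        # The whole item lands on exactly one shard.
        self.shards[shard_for(key, len(self.shards))][key] = item

    def get(self, key):
        # Reading an item touches one shard only; no cross-shard join.
        return self.shards[shard_for(key, len(self.shards))][key]


store = ShardedStore(num_shards=4)
store.put(1, {"name": "Evi Rachmilewitz", "work_place": "emyoli"})
store.put(2, {"name": "John Mitchel", "work_place": "Ford Motors"})
print(store.get(2)["name"])
```

Because no item depends on rows held elsewhere, adding capacity is a matter of adding shards. Real systems also need a strategy for moving items when the number of shards changes, which this sketch leaves out.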

The fact is that when big players like eBay build massive data-driven applications, relational databases become untenable.

When should you consider moving your architecture to big data?

Whenever you realize that your web application is about to create massive amounts of data to serve millions of users, you should look for a big data architecture. Product companies like Google use the big data architecture (or Bigtable, as Google calls it) right from the start. Google designed the Bigtable architecture so that it could provide fast access to petabytes of data distributed across thousands of machines. Bigtable provides the data storage mechanism for Google Apps and other Google products.

If you are an entrepreneur who is about to launch a production version of your mobile or web application, and you have no clue whether it will become as popular as Instagram, for example, you should think about RDBMS as your core architecture and perhaps make some initial assessments of the cost of moving to big data.

Why should you start with RDBMS?

  1. RDBMS has been the de facto model for years. As a result, there are numerous experts who can help you architect a sustainable model and who charge a fair price.
  2. RDBMS can definitely handle millions of records. Hence you should not worry if your app grows from hundreds of users to thousands or even several million. Your RDBMS will probably be able to handle that. That said, if you do reach a tipping point of a million or more users who are actively engaging with your app, it is time to think about migrating to big data.

What are the major big data platforms?

There are several big data solutions and platforms out there. I’ll discuss four of them: Apache Hadoop, MongoDB, SimpleDB and Bigtable.

 

Apache Hadoop

 

Apache Hadoop is an open-source framework that allows for the processing of large data sets across clusters of computers. It is designed to scale up or down across numerous machines, each offering computation power and storage. Companies using Apache Hadoop include Adobe, AOL, Alibaba, LinkedIn and many others. A related Apache project is Cassandra, a scalable database with no single point of failure. One of Cassandra’s strengths is its ability to replicate across multiple datacenters, thus providing lower latency for users in various regions. Cassandra is used by Netflix, Twitter, Digg and other companies that deploy web-based applications serving millions of users.

MongoDB

 

MongoDB is a document-oriented database that removes the core RDBMS requirement of joins. This removal allows for fast reads and writes, easier manageability, agile development with a schema-less database and easier scalability options. MongoDB has a straightforward query language, supports indexes and offers various deployment models, ranging from large deployments that include a replica set and multiple configuration servers to small deployments with one replica and no automatic failover. MongoDB is used by SAP, MTV and Craigslist.

 

SimpleDB

 

SimpleDB is a big data solution by Amazon. It offers a flexible non-relational data store that is not bound by the strict requirements of RDBMS. Like the previous providers, Amazon’s SimpleDB manages multiple distributed replicas of data to enable easy access and durability. SimpleDB works in close conjunction with S3 and EC2, collectively providing the ability to store, process and query data sets in the cloud.

Glue is a popular browser add-on service that uses Amazon’s SimpleDB. Glue’s architects admit that the lack of join operations poses technological challenges for them. On the other hand, SimpleDB is designed to scale to massive amounts of data using a cloud approach. For Glue this ability is a core requirement, hence the decision to go with SimpleDB.

 

Bigtable

 

According to Google, Bigtable is designed to reliably scale to petabytes of data across thousands of machines. Bigtable is used by more than sixty Google products and projects, including Google Analytics and Google Earth. The Bigtable clusters used by these products span a wide range of configurations, from a handful to thousands of servers, and store up to hundreds of terabytes of data. Similar to the above implementations, Bigtable does not support a full relational data model; instead, it provides a simple data model that supports dynamic control over data layout and format.