Conventional vs NoSQL
Conventional RDBs require a lot of maintenance and carry significant operating overhead, but the biggest problem is the static schema. Specifying one is incredibly time-consuming and error-prone, and it changes frequently anyway. It is a very 2000s, pre-agile way of operating. Today people want to collect data and decide later which fields to keep, which fields are interesting to look at, and what the format will be. They want to explore and add data in an agile way and then build business rules and intelligence quickly. RDBs make that too hard.
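To make "collect first, decide later" concrete, here is a minimal sketch in plain Python. It does not use any particular database; the records, field names and helper functions are all made up for illustration. The point is only that records with different shapes can sit side by side, and you pick the interesting fields after the fact.

```python
from collections import Counter

# Hypothetical events from different sources; no shared schema is declared up front.
events = [
    {"source": "web", "user": "alice", "page": "/pricing"},
    {"source": "web", "user": "bob", "page": "/docs", "referrer": "news-site"},
    {"source": "sensor", "device_id": 17, "temperature_c": 21.5},
]


def field_usage(records):
    """Count how often each field name appears across all records."""
    counts = Counter()
    for record in records:
        counts.update(record.keys())
    return counts


def select(records, field):
    """Return the values of `field` from the records that happen to carry it."""
    return [record[field] for record in records if field in record]


usage = field_usage(events)      # which fields exist, and how common they are
pages = select(events, "page")   # decide later that "page" is worth a look
```

A document store such as MongoDB accepts records shaped like these without a table definition, which is what makes the explore-then-decide workflow cheap.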
Problems with RDBs
1) Too hard to add new data sources
2) Complex Schema Specifications take too long
3) Changing Schemas is hard and potentially impactful on applications
4) RDBs scale to a limited size, which in today’s world is simply too small for much of the data we want to look at
5) RDBs and the tables in them each require lots of maintenance and care and feeding to keep healthy and fast
Why NoSQL is a good solution for a lot of data
It is easy to stream data to a Cassandra, HBase or MongoDB database. Frequently it can be a matter of configuring a tool that then feeds the data into the database. Using tools like open source WSO2 BAM and others makes it trivial to add data.
Once you’ve got the data in, you can use numerous tools to visualize it and to find patterns of interest, metrics, or other sequences that look like candidates to automate or to make your system look smarter. This is an iterative process. It is not easy to anticipate in advance all the ways you might use the data or what will turn out to be interesting later. Feedback from users, customers or partners may lead to new insights or give the data new value as time goes on.
Once you’ve found an interesting statistic, event or combination of events, you will want to automate some behavior around it. With an RDB you may already have a program you can modify to add the new functionality, but changing it is a big deal and risky. You may write another program instead. You may decide later to remove the association or to expand it, which requires still more programmatic changes. With NoSQL you can drive specific automations quickly and easily from new ideas. For instance, you might discover that sales of or interest in your product correlate with certain news stories, or that people who look at a particular article tend to go on to buy a certain product. You can correlate any stream of data, add an event and set up a business process to implement the new idea quickly. Using tools like BAM and Business Process Servers with an event-driven architecture, you can practically implement such things the same day you think of them.
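The article-then-purchase example above can be sketched in a few lines of Python. This is only an illustration of the idea, not how BAM or a CEP engine expresses rules. The event fields (`ts`, `type`, `user`, `article`, `product`), the one-hour window and the threshold are all assumptions invented for the sketch.

```python
from collections import defaultdict


def correlate(events, window_seconds=3600):
    """Count (article, product) pairs where a user bought within the window after viewing."""
    last_view = {}                  # user -> (article, timestamp of the view)
    pair_counts = defaultdict(int)
    for event in sorted(events, key=lambda e: e["ts"]):
        if event["type"] == "view":
            last_view[event["user"]] = (event["article"], event["ts"])
        elif event["type"] == "purchase" and event["user"] in last_view:
            article, seen_at = last_view[event["user"]]
            if event["ts"] - seen_at <= window_seconds:
                pair_counts[(article, event["product"])] += 1
    return dict(pair_counts)


def rules_to_trigger(pair_counts, min_count=2):
    """Pairs seen often enough to be worth wiring into a business process."""
    return sorted(pair for pair, n in pair_counts.items() if n >= min_count)


# Hypothetical event stream; timestamps are seconds.
events = [
    {"ts": 0, "type": "view", "user": "u1", "article": "storm-warning"},
    {"ts": 600, "type": "purchase", "user": "u1", "product": "umbrella"},
    {"ts": 700, "type": "view", "user": "u2", "article": "storm-warning"},
    {"ts": 900, "type": "purchase", "user": "u2", "product": "umbrella"},
    {"ts": 1000, "type": "view", "user": "u3", "article": "sports"},
    {"ts": 9000, "type": "purchase", "user": "u3", "product": "umbrella"},
]

pairs = correlate(events)
triggered = rules_to_trigger(pairs)
```

In a real setup the last step would raise an event for a process server rather than return a list, but the shape of the rule is the same: a window, a pairing and a threshold.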
With a SQL RDB, doing this programmatically will take you weeks or months. More importantly, it is very unlikely you will have the data in an RDB in the first place, because the cost of keeping data in an RDB is so high that nobody would stream news data, the detail of everything people look at, or every event in your network into one. So some of the things that are easy in the NoSQL world are probably impossible with an RDB.
This easy adaptability to new ideas and new requirements is characteristic of Agile.
What you need:
To build a very flexible big data architecture that lets you create new automation quickly, you need a set of open source components in addition to the big data database. There are many open source alternatives to the following:
1) Cassandra, HBase or MongoDB
2) Hadoop
3) Hive
4) Pentaho
5) WSO2 BAM (includes adapters for files, capability to configure metrics and new data sources)
6) WSO2 CEP for real time event processing
7) WSO2 Business Process Server (to build processes around the events and correlations, metrics)
The other unspoken advantage of these NoSQL tools is that they are all open source, proven at billions of records per day, scale easily to arbitrary size, and carry no software license fees.
Where SQL still makes sense
When you look at the cost of commercial enterprise databases (Oracle 10 and later can cost millions and millions of dollars per year), you need an awfully good reason to put something in so expensive a storage vehicle. Transactional semantics would seem to be the “key advantage” of RDBs. If what you mainly need is reliability, though, it is easy to configure Cassandra to keep multiple copies of your data and guarantee reliability to whatever level you want. Complex joins are a good reason to use an RDB. The comparable approach in NoSQL is to do the join with map/reduce and put the result set into an RDB. A huge advantage of NoSQL over RDBs is that it runs these queries in parallel over massive data sets that would be impossible for an RDB, but the results are usually not available immediately. Depending on how fast you need the result, an RDB might be the better choice. If you use Hadoop Map/Reduce to do the joins, open source RDBs are a good place to store the result set.
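Here is a rough sketch of that "join with map/reduce, store the result in an RDB" pattern. It runs in a single Python process to show the shape of the computation, not on Hadoop, and it uses SQLite from the standard library as a stand-in for an open source RDB. The data sets, table name and column names are made up for the example.

```python
import sqlite3
from collections import defaultdict

# Hypothetical raw data sets that would normally live in the NoSQL store.
views = [("u1", "storm-warning"), ("u2", "storm-warning"), ("u3", "sports")]
purchases = [("u1", "umbrella"), ("u2", "umbrella"), ("u2", "boots")]


def map_phase():
    """Emit (join key, tagged value) pairs from both data sets."""
    for user, article in views:
        yield user, ("view", article)
    for user, product in purchases:
        yield user, ("purchase", product)


def shuffle(pairs):
    """Group mapped values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """For each user, pair every viewed article with every purchased product."""
    for user, values in groups.items():
        articles = [v for tag, v in values if tag == "view"]
        products = [v for tag, v in values if tag == "purchase"]
        for article in articles:
            for product in products:
                yield user, article, product


rows = sorted(reduce_phase(shuffle(map_phase())))

# Store the (much smaller) joined result set in a relational table for fast SQL queries.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE view_purchase (user_id TEXT, article TEXT, product TEXT)")
conn.executemany("INSERT INTO view_purchase VALUES (?, ?, ?)", rows)
conn.commit()

summary = conn.execute(
    "SELECT article, COUNT(*) FROM view_purchase GROUP BY article ORDER BY article"
).fetchall()
```

The expensive join happens once, in bulk, over the raw data. The relational table only has to hold the result, which is the part people want to query interactively.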
A lot of effort has gone into building data warehouses for data analysis over the last 20 years. Depending on the processing to be done on the results, many of these could easily be built on NoSQL databases instead, more scalably and at vastly lower cost.
Other things that may be interesting to read on this topic:
look-at-what-google-and-amazon-are-doing-with-databases-thats-your-future