At its core, Hadoop is a distributed file system (HDFS): it lets you store a large amount of file data across many cloud machines and handles data redundancy for you.
Comparing SQL databases and Hadoop:
Hadoop is a framework for processing data. What makes it better than standard relational databases, the workhorse of data processing in most of today’s applications? One reason is that SQL (Structured Query Language) is by design targeted at structured data, while many of Hadoop’s initial applications deal with unstructured data such as text. From this aspect, Hadoop provides a more general paradigm than SQL.
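To make the point about unstructured text concrete, here is a minimal sketch of the map/reduce style of processing in plain Python. It does not use any Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and the in-memory shuffle stands in for the grouping a real cluster would do across machines.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of raw text."""
    for word in line.lower().split():
        # Strip common punctuation; no schema is needed to read the input.
        yield word.strip(".,;:!?\"'()"), 1


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        if key:  # skip tokens that were pure punctuation
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all counts emitted for one word."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["Big data is big.", "Data locality matters."]
    print(word_count(sample))
```

The input is just lines of text with no table definition, which is the kind of data the paragraph above has in mind.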
SQL and Hadoop can be complementary: SQL is a query language, and it can be implemented on top of Hadoop as the execution engine. But in practice, “SQL databases” tends to refer to a whole set of legacy technologies, with several dominant vendors, optimized for a historical set of applications. Many of these existing commercial databases are a mismatch for the requirements that Hadoop targets.
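As a rough illustration of SQL running on top of a map/reduce-style engine, the sketch below evaluates one simple aggregate query as a map step, a shuffle step, and a reduce step. This is a toy in plain Python, not how any particular SQL-on-Hadoop product works; the query, the requests table, and the run_group_by_count function are all hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def run_group_by_count(rows, column):
    """Evaluate `SELECT column, COUNT(*) ... GROUP BY column` in three steps.

    Hypothetical query this stands in for:
        SELECT status, COUNT(*) FROM requests GROUP BY status
    """
    # Map: project each row to a (group key, 1) pair.
    mapped = [(row[column], 1) for row in rows]
    # Shuffle: bring equal keys together (here, by sorting in memory).
    mapped.sort(key=itemgetter(0))
    # Reduce: sum the ones emitted for each key.
    return {
        key: sum(count for _, count in group)
        for key, group in groupby(mapped, key=itemgetter(0))
    }


if __name__ == "__main__":
    requests = [{"status": 200}, {"status": 404}, {"status": 200}]
    print(run_group_by_count(requests, "status"))
```

The query language stays declarative; only the engine underneath changes.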
Comparing Oracle and Hadoop:
Oracle is a marvelous generic database and it can be used for many things.
You can process your web logs and figure out which paths customers are likely to take through your online store and which are more likely to lead to a sale. You can build a recommendation engine with Oracle. You can definitely do ETL in Oracle.
But just because it is possible to do something doesn’t mean you should. There are good reasons to use Oracle as your default solution, especially when you are an experienced Oracle DBA.
But do you really want to use Oracle to store millions of emails and scanned documents? I have a few customers who do it, and I think it causes more problems than it solves. After you have stored the data, do you really want to spend your network and storage bandwidth so that the application servers can keep reading the data from the database? Big data is... big. It is best not to move it around too much, and instead to run the processing on the servers that store the data. After all, the code takes fewer packets than the data. But Oracle makes cores very expensive, so running that processing on the database servers is costly.
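A back-of-the-envelope calculation shows why moving the code beats moving the data. The sizes and the link speed below are assumptions picked purely for illustration, and the estimate ignores latency, protocol overhead, and disk speed.

```python
def transfer_seconds(size_bytes, bandwidth_bits_per_second):
    """Time to push size_bytes over a link, ignoring latency and overhead."""
    return size_bytes * 8 / bandwidth_bits_per_second


# All figures below are assumptions chosen for illustration.
DATA_BYTES = 10 * 10**12   # 10 TB of stored emails and scanned documents
CODE_BYTES = 10 * 10**6    # 10 MB of processing code
LINK_BPS = 10 * 10**9      # a single 10 Gbit/s network link

data_time = transfer_seconds(DATA_BYTES, LINK_BPS)
code_time = transfer_seconds(CODE_BYTES, LINK_BPS)

if __name__ == "__main__":
    print(f"moving the data: {data_time:.0f} s")
    print(f"moving the code: {code_time * 1000:.0f} ms")
```

Under these assumptions, shipping the data takes 8,000 seconds (over two hours) per pass, while shipping the code takes 8 milliseconds.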
Then there’s the issue of actually programming the processing code. If your big data is in Oracle and you want to process it efficiently, PL/SQL is pretty much the only option. PL/SQL is a nice language, but I don’t think anyone seriously considers writing their data mining programs in it. And once you write your code in Java or Python to access mostly unstructured data in Oracle, you go through many layers of abstraction that result in slower code.
Big data is partially a license issue: Oracle Database is expensive, and MySQL isn’t good at data warehouse work. It is partially a storage and network issue: as data volumes grow, locality of data becomes more critical. But I see it mostly as using the right tool for the job. Just because Oracle can do something doesn’t make it the best way to do it.