Users today are spoiled by the speed and ease of use of publicly available Internet applications. They can search for anything and get instant results from search engines like Google, connect with their friends across the world using social applications like Facebook, find professionals for any skill or expertise as well as jobs on networking applications like LinkedIn, and share their life experiences in real-time using applications like Twitter.
In contrast to these fast, easy-to-use applications, conventional enterprise applications now seem restrictive, slow, and cumbersome. The Internet applications are built on new technologies that enable fast and reliable performance. Best of all, although many person-years of effort have gone into them, most of these technologies are available as open source.
This gives enterprises a major opportunity to use open source platforms such as Hadoop and related technologies to build reliable, highly scalable data repositories on a distributed set of low-cost commodity servers.
Hadoop is capable of processing large amounts of data, running into petabytes, at very high speeds. It can scale with the volume of data without significantly affecting performance because neither the storage nor the processing of data is centralized in a large database server; both are distributed across a scalable network of low-cost commodity servers.
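To make the idea of distributed storage and processing concrete, here is a minimal sketch in plain Python. It does not use Hadoop or any of its APIs; the function names (`split_into_blocks`, `count_words_locally`, `merge_counts`) and the sample records are hypothetical and chosen only for illustration. Each "node" holds one block of the data, processes it locally, and the partial results are merged at the end.

```python
from collections import Counter


def split_into_blocks(records, num_nodes):
    """Assign records to nodes round-robin, standing in for distributed storage."""
    blocks = [[] for _ in range(num_nodes)]
    for i, record in enumerate(records):
        blocks[i % num_nodes].append(record)
    return blocks


def count_words_locally(block):
    """The work a single node does on the block it stores."""
    counts = Counter()
    for line in block:
        counts.update(line.lower().split())
    return counts


def merge_counts(partials):
    """Combine the per-node results into one overall answer."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Hypothetical sample data, spread over two simulated nodes.
records = ["big data is big", "data is distributed", "servers are cheap"]
blocks = split_into_blocks(records, num_nodes=2)
result = merge_counts(count_words_locally(block) for block in blocks)
```

In this sketch the merged result does not depend on how many nodes the records are spread over, which is the property that lets such a system grow by adding servers rather than by buying a larger central one.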
As businesses become aware that the Big Data trend is here to stay, publishers are looking for reliable support. There are many forms of reliability, all of which affect the overall reliability of the instrument and therefore of the data collected. Reliability is an essential prerequisite for validity: a measure can be reliable without being valid, but a valid measure must also be reliable.
Scalability is the ability of a system, network, or process to handle a growing amount of work in a capable manner, or its ability to be enlarged to accommodate that growth. For example, it can refer to the capability of a system to increase total throughput under an increased load when resources (typically hardware) are added.
– Multiple architectures and use cases
– Focus today: using multiple servers, each working on part of the job, each doing the same task
– Key Challenges:
- Work distribution and orchestration
- Error recovery
- Scalability and management
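The points above (several workers, each doing the same task on its own part of the job, with work distribution and error recovery as the hard parts) can be sketched as follows. This is an illustration under stated assumptions, not how Hadoop implements it: threads stand in for servers, and `split_job`, `run_with_retry`, `run_job`, and the deliberately flaky `flaky_sum` task are hypothetical names introduced for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_job(items, num_parts):
    """Work distribution: cut the job into contiguous parts, one per worker."""
    size = max(1, -(-len(items) // num_parts))  # ceiling division, at least 1
    return [items[i:i + size] for i in range(0, len(items), size)]


def run_with_retry(task, part, max_attempts=3):
    """Error recovery: rerun a failed part instead of failing the whole job."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(part)
        except Exception:
            if attempt == max_attempts:
                raise  # give up only after the last attempt


def run_job(task, items, num_workers=4):
    """Orchestration: every worker applies the same task to its own part."""
    parts = split_job(items, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(lambda part: run_with_retry(task, part), parts))


# Hypothetical task that fails once on one part to simulate a worker failure.
failed_once = set()


def flaky_sum(part):
    key = part[0]
    if key == 4 and key not in failed_once:
        failed_once.add(key)
        raise RuntimeError("simulated worker failure")
    return sum(part)


partial_sums = run_job(flaky_sum, list(range(10)), num_workers=3)
total = sum(partial_sums)  # 45: the failed part was retried and succeeded
```

With three workers the ten items are split into parts of sizes 4, 4, and 2, so `partial_sums` is `[6, 22, 17]`. Raising `num_workers` changes only how the job is cut up, not the task itself, which is what makes this pattern straightforward to scale.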