Big Data, n. (computer technology) any
complex dataset that includes large amounts of dynamic, growing, unstructured
data where relationships between data are frequently inferred rather than
declared.
If you ask Gartner, Big Data is
best understood by the 3 V’s: volume, velocity and variety. And the MIKE2.0
open source project seemingly tries to confuse everyone by stating that “Big data can be very small and not all
large datasets are big.” In my most recent glance at Wikipedia, Big Data was
simply defined as any dataset that is “beyond the ability of commonly used
software tools to capture, manage, and process the data within a tolerable
elapsed time.” Like so many buzzwords introduced into the technology dialogue,
Big Data is a marketing term. If you have any doubt, just look at what Oracle is
saying. According to Oracle, Big Data is just unstructured metadata lying
around your business that you can analyze, organize and utilize to create
business benefit.
Selling Big Data to business is
going to be a marketing trend that continues for a while. Strip away the
common commercial messages, though, and the reality is that most Big Data solutions are
really solutions for dealing with large amounts of unstructured data.
I propose a more useful
construct than the 3 V’s for defining the characteristics of Big Data:
structured vs. unstructured, defined relationships vs. inferred relationships,
static vs. dynamic, and stable vs. growing. In each case the second term in
the pair is more characteristic of Big Data, though a Big Data set typically
still includes some data at both ends of each spectrum.
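To make that framing concrete, here is a minimal sketch in Python. The class, its field names, and the example profile are assumptions made for this post, not an established taxonomy.

```python
# A minimal, illustrative sketch of the four spectra proposed above.
from dataclasses import dataclass

@dataclass
class DataProfile:
    structure: str      # "structured" ... "unstructured"
    relationships: str  # "defined" ... "inferred"
    change: str         # "static" ... "dynamic"
    growth: str         # "stable" ... "growing"

# A dataset that leans toward the second term on every axis is a strong
# Big Data candidate, e.g. a multiple listing service feed:
mls_feed = DataProfile(
    structure="largely unstructured",
    relationships="mostly inferred",
    change="dynamic",
    growth="growing",
)
```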
If you look at a dataset such as
the combined real estate multiple listing databases in the United States, you
have a truly complex set of Big Data. At any given time, there are
about 2 million real estate listings in the US. Something like 85% of these
listings are being updated every fifteen minutes (that’s 6.8 million updates
per hour). 85% of all those listings are served up by an application running on
the Magic xpa Application Platform. That’s something like 5.5 million updates per
hour, or more than 1,500 updates per second. Why so many updates? Because of the
complex interrelationships of data in a real estate multiple listing. It’s not
that the seller is changing the price every fifteen minutes or the agent is changing
the advertising description of the property. Rather, each record is linked to all
sorts of related data and metadata, as well as community data and averages
that are constantly changing.
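For anyone who wants to check those figures, the back-of-the-envelope arithmetic looks like this (the counts and percentages are the rough estimates quoted above):

```python
# Back-of-the-envelope check of the update figures quoted above.
listings = 2_000_000          # active US listings at any given time (rough estimate)
updated_share = 0.85          # share of listings refreshed each 15-minute cycle

updates_per_hour = listings * updated_share * 4           # four 15-minute cycles per hour
print(f"{updates_per_hour:,.0f} updates/hour")            # 6,800,000

magic_share = 0.85            # share of listings served by the Magic xpa application
magic_updates_per_hour = updates_per_hour * magic_share
print(f"{magic_updates_per_hour:,.0f} updates/hour")      # roughly 5.8 million, "something like 5.5 million"
print(f"{magic_updates_per_hour / 3600:,.0f} updates/second")  # well over 1,500
```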
In-memory computing and data grid
computing provide an interesting means for dealing with Big Data. The paradigm
shift of treating memory as primary storage lets you access data randomly with
near-zero latency, whereas disk-based storage depends on sequential access
patterns to keep latency down; a toy sketch at the end of this post illustrates
the contrast. As more and more enterprise and cloud architectures incorporate
in-memory computing, the currently accepted definition of Big Data becomes
problematic. If Big Data only includes datasets that are “beyond the ability of
commonly used software tools,” then the bar for the Big Data definition keeps
rising as in-memory approaches proliferate.

I like the fact that the Big Data buzzword focuses attention on the
problems associated with the proliferation and increasing complexity of data. I
am not convinced, however, that the commonly accepted definitions are useful. Big
Data is any complex dataset that includes large amounts of dynamic, growing,
unstructured data where relationships between data are frequently inferred
rather than declared.
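Here is the toy sketch mentioned above: a minimal Python comparison of random lookups against an in-memory structure versus a sequential scan over a file on disk. The file name and record layout are made up for the example; real in-memory data grids add distribution, replication, and indexing on top of this basic idea.

```python
# Toy contrast between in-memory random access and a sequential disk scan.
# "listings.csv" and its layout (listing_id in the first column) are
# hypothetical, chosen only to illustrate the access patterns.
import csv

listings_in_memory = {}   # listing_id -> record, held entirely in RAM

def lookup_in_memory(listing_id):
    # Random access: a hash lookup, effectively constant time.
    return listings_in_memory.get(listing_id)

def lookup_on_disk(listing_id, path="listings.csv"):
    # Without an index, the file has to be scanned sequentially
    # until the matching record turns up.
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if row and row[0] == listing_id:
                return row
    return None
```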