In reality it didn’t take too long for the Large Hadron Collider (LHC) at CERN to find the Higgs boson, the particle the press love to call the God particle. In the course of that search, though, it was generating rather large quantities of data: around 40 terabytes a day, the equivalent of 40,000 HD movies. CERN uses a Grid network to make that data, around 15 petabytes a year, accessible to an 8,000-strong team of physicists distributed across the globe.
That is just one area where huge amounts of data are being generated. Others include more big science, such as climate modelling, along with commerce and social networking. Of all the data generated in human history, the vast majority (around 90 percent) has been produced in the last 2.5 years.
Naturally this data comes in many different forms: text from all the web pages, social media posts and blogs produced every day; images from the surveillance cameras that populate our streets; conversations monitored by government agencies; the millions of financial transactions made every minute; the 150 billion email messages sent every day; and more. Most of this data is unstructured and multidimensional, and that poses real problems when it comes to analysing it and extracting meaningful, useful information.
The ability to do so has major commercial implications, and in recent years substantial resources have been put into developing techniques and systems for it. Real advances have been made: Tesco now sends members of its Clubcard scheme individually tailored offers that match their shopping habits; Amazon is particularly adept at suggesting your next purchase, especially when it comes to books; and government agencies are likely to know far more about you than you might wish them to.
Data flow is commonly characterised by the three Vs: Volume, Variety and Velocity. Big data is simply data that, once accumulated, cannot be analysed using traditional tools such as relational databases and SQL. Instead, exceptional approaches are needed to process it within useful time frames: clusters running massively parallel software, backed by perhaps hundreds of servers.
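To make the contrast with traditional SQL-style analysis concrete, here is a minimal sketch of the divide-and-aggregate (map-reduce) pattern that such parallel clusters typically run, scaled down to a single machine using Python's multiprocessing module. The data, function names and chunking are illustrative assumptions, not any particular framework's API.

```python
# Minimal map-reduce sketch: count word frequencies across "log" chunks in
# parallel -- the same divide-and-aggregate pattern that cluster frameworks
# spread across hundreds of servers. All data here is made up for illustration.
from collections import Counter
from multiprocessing import Pool

def map_chunk(lines):
    """Map step: turn one chunk of raw text lines into partial word counts."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

def reduce_counts(partials):
    """Reduce step: merge the partial counts produced by every worker."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

if __name__ == "__main__":
    # Stand-in for data far too large for a single relational database table.
    chunks = [
        ["user login from london", "payment accepted"],
        ["user login from paris", "payment declined"],
        ["sensor reading normal", "user login from london"],
    ]
    with Pool() as pool:                        # one worker per CPU core
        partials = pool.map(map_chunk, chunks)  # map phase, in parallel
    print(reduce_counts(partials).most_common(3))  # reduce phase
```

On a real cluster the chunks would live on distributed storage and the workers would be separate machines, but the shape of the computation is the same.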
However, making sense of all that unstructured data and extracting value from it remains a major challenge. Mimecast, a SaaS provider that specialises in email management, is currently developing ways of mining contextual information and knowledge from cloud email and document archives. All the data retained by an organisation can be viewed as a kind of corporate memory; the trick is to tap into it and extract value.
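As a generic illustration of what mining an email archive might involve, and emphatically not Mimecast's actual technology, the sketch below indexes archived messages by the terms in their subject lines so the "corporate memory" can be queried later. The sample messages and the indexing scheme are assumptions made purely for the example.

```python
# Illustrative only: build a tiny searchable index over an email archive,
# treating it as a queryable "corporate memory". Generic sketch, not any
# vendor's product or API.
from collections import defaultdict
from email import message_from_string

raw_emails = [  # stand-ins for messages pulled from a cloud archive
    "From: alice@example.com\nSubject: Q3 contract renewal\n\nDraft attached.",
    "From: bob@example.com\nSubject: Renewal terms for Q3\n\nSee notes below.",
]

index = defaultdict(set)  # subject term -> set of message ids
for msg_id, raw in enumerate(raw_emails):
    msg = message_from_string(raw)
    for term in (msg["Subject"] or "").lower().split():
        index[term].add(msg_id)

# Query the "memory": which archived messages mention renewals?
print(sorted(index["renewal"]))  # -> [0, 1]
```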
Of one thing we can be certain: big data is here to stay and will only get bigger. It presents a huge challenge, yet also huge opportunities in many fields, commercial, scientific, medical and social. But make no mistake, there are also threats, possibly of an Orwellian nature; it all depends on what we do with it.