Nowadays we hear a lot about so-called Big Data – but what is it, and why is it so important? Big Data can be defined as a vast pool of information that we generate and make use of every day – whether we know it or not. It comes from our phones, our computers, and the machine sensors which help us in our tech-driven lives.
Devices connected to the Internet of Things produce data; so do cameras monitoring traffic flow, or security systems, weather satellites and a whole host of other devices. This data is used by companies around the world who make decisions based on it, to improve the products or services they provide. Their customer-centric services and our experiences as consumers depend upon it. It’s described as ‘big’ not simply because there’s a lot of it, but because it’s made up of a large variety of often complex data.
Big Data – A Timeline
Computer technology has grown exponentially in recent years – and with it, the amount and complexity of data that is generated. It’s hard to believe that the technology needed to send the first spaceships to the moon operated with fewer than 80 kilobytes of memory. It has been estimated that the world’s capacity to store data has doubled every three years since the 1980s. The entirety of the world’s data in 1950 could have fitted on one laptop computer; compare that with the present day, where total global data stands at 44 zettabytes (that’s 44 trillion gigabytes). And by 2025, the world is predicted to have generated 163 zettabytes – a staggering amount of data, and the growth shows no signs of slowing.
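The unit conversion and the doubling estimate quoted here can be checked with a few lines of arithmetic. The sketch below assumes decimal (SI) units, where one zettabyte is 10^21 bytes and one gigabyte is 10^9 bytes; the function names are illustrative only.

```python
# Decimal (SI) units are assumed: 1 GB = 10**9 bytes, 1 ZB = 10**21 bytes.
BYTES_PER_GB = 10**9
BYTES_PER_ZB = 10**21


def zettabytes_to_gigabytes(zb: int) -> int:
    """Convert zettabytes to gigabytes using decimal units."""
    return zb * BYTES_PER_ZB // BYTES_PER_GB


def growth_factor(years: float, doubling_period_years: float = 3.0) -> float:
    """Growth multiple after `years` if capacity doubles every period."""
    return 2 ** (years / doubling_period_years)


gigabytes = zettabytes_to_gigabytes(44)
print(f"{gigabytes:,} GB")   # 44,000,000,000,000 GB, i.e. 44 trillion
print(growth_factor(30))     # ten doublings in 30 years: 1024.0
```

In other words, a capacity that doubles every three years grows roughly a thousandfold in three decades.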
As a result of this advancement of technology, non-digital systems are becoming less and less viable. Increasingly advanced systems are needed to make use of the data generated by today’s digital society. The era of Big Data that we live in is driven by hugely popular social media platforms, widespread use of smart phones with advanced technical capabilities, and a growing and well-connected Internet of Things.
Why Is Big Data So Important?
Large amounts of complex data are only useful if they can be analysed, understood and acted upon. Real-time, actionable insights are increasingly created by Artificial Intelligence (AI), including machine learning. Companies make their data work for them, so they can develop their business models and grasp new opportunities. Companies operating without the benefit of Big Data are essentially operating blind, without sight or sound of their customers and their business environments.
Structured and Unstructured Data Types
There are generally three types of data based on the structure of the information and its ease of indexing; these are structured, unstructured and semi-structured.
- Structured data: data that is simple to organise and search, such as that generated by machine logs or financial systems (think of an Excel spreadsheet, for example), is known as structured data. It features easily categorised components, and can be searched using relatively simple algorithms. Large volumes of structured data do not always qualify as Big Data – they may not meet the criteria if they are simple and easily searchable. The query language SQL (Structured Query Language) has traditionally been used to manage structured data. IBM developed the language in the 1970s to let developers create and manage table-based data (also known as relational datasets), which was then becoming popular.
- Unstructured data: this type of data can’t be easily stored in standard row/column databases, and includes data such as audio files, images, customer comments or social media posts. In the recent past, this sort of data had to undergo laborious manual processing before it could be searched and used; often it was too costly to do this, or it took too long, so that the results were already out of date by the time they were generated. Unstructured data is commonly stored in data warehouses, lakes or NoSQL databases.
- Semi-structured data: this is hybrid data – partly structured, and partly unstructured. Examples include emails – the data in the main message text will be unstructured, whereas information such as the sender, time and date is more structured and organisational in nature. Sometimes, a device will create structured data alongside unstructured content. Think of an image taken with a smartphone camera; the image is unstructured but you can see the time and place it was taken and the file type from the structured data it generates. Therefore, modern AI technology must be able to instantly recognise the different types of data, and generate real-time algorithms which can manage and analyse it.
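The difference between these data types can be sketched in Python using only the standard library. The table name, column names and email content below are made up for illustration; the point is that structured rows can be filtered directly with SQL, while an email splits into structured headers and a free-text body that needs further analysis.

```python
import sqlite3
from email import message_from_string

# Structured data: rows and columns that SQL can filter directly.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO payments VALUES (?, ?)",
    [("alice", 20.0), ("bob", 5.5), ("alice", 12.5)],
)
total_for_alice = conn.execute(
    "SELECT SUM(amount) FROM payments WHERE customer = ?", ("alice",)
).fetchone()[0]
conn.close()

# Semi-structured data: an email has structured headers and a free-text body.
raw_email = (
    "From: customer@example.com\n"
    "Date: Mon, 06 Jan 2025 09:30:00 +0000\n"
    "Subject: Late delivery\n"
    "\n"
    "My parcel arrived two days late and the box was damaged.\n"
)
message = message_from_string(raw_email)
sender = message["From"]        # structured: a named field
body = message.get_payload()    # unstructured: free text

print(total_for_alice)  # 32.5
print(sender)           # customer@example.com
print(body.strip())
```

The header lookup is a simple field access, whereas working out that the body is a complaint about a damaged parcel would need text analysis of the kind described above.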
Sources of Big Data
Nowadays, anything from kitchen appliances to drones can generate data. There are three main sources that data can come from: machine data, social data and transactional data.
- Machine data: there has been rapid growth around the world in the number of data-generating machines – everything from sensors monitoring traffic or rail networks, to security systems. Devices connected to the Internet of Things (IoT) have sensors which allow them to send and receive digital data. By 2025, it’s estimated that the number of IoT devices on earth will have exceeded 40 billion, accounting for nearly half of the world’s total digital data.
- Social data: as the name suggests, this is data created by social media. It includes images, posts and video content, and this last type is growing particularly quickly with the spread of 5G data networks. Billions of people now regularly use their phones to view video content, and that number is growing all the time.
- Transactional data: data that records customer transactions – whether in banking, retail or other sectors – is amongst the fastest moving and quickest growing type of data in the world. More and more, it is now made up of semi-structured data; it might contain comments or images, increasing how complex it is to manage and utilise.
Defining Big Data – the five V’s
As mentioned above, mere size alone does not mean a dataset qualifies as Big Data. The following five properties must be present for data to truly be Big Data:
- Value: It’s crucial for businesses that their Big Data can produce insights which help them become more resilient and competitive, and thus better able to serve their customers. Business profitability and operational capabilities can be improved significantly by modern Big Data technologies.
- Veracity: Simply being able to generate and store lots of data is not enough on its own; the data needs to be accurate, up to date, and delivered at the right time. Problems of veracity when dealing with structured data are usually things like typos or syntax errors. But for unstructured data, there are many more risks to its veracity. For example, the humans dealing with it may be biased (whether consciously or not), and social noise and uncertain provenance can also affect the validity and truthfulness of the data.
- Variety: rather than simply being large, a data set that is defined as Big Data must typically comprise structured, unstructured and semi-structured data. Managing such complex and disparate datasets requires modern systems.
- Velocity: Whereas in the past data had to be put into a traditional database (often entered manually) in order to be retrieved or analysed, nowadays Big Data technology enables data to be processed and analysed whilst it is actually being generated. Sometimes it takes only a fraction of a second to do this. Speed is often critical for businesses, which may have to stop fraud or respond to customer demands at pace.
- Volume: due to the large volumes of Big Data that are generated and stored, advanced analytics, often driven by AI technology, are needed to make full use of the data available. Large companies may hold many terabytes of data, and it all needs to be organised, stored and retrieved for the processing to happen.
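The idea behind velocity, analysing data while it is still arriving, can be illustrated with a small Python generator. This is a minimal sketch: the rule used here (flag any amount more than three times the running average of earlier amounts) is a made-up stand-in and not a real fraud-detection method, and the function name and threshold are assumptions.

```python
from typing import Iterable, Iterator


def flag_unusual(amounts: Iterable[float], factor: float = 3.0) -> Iterator[float]:
    """Yield each amount exceeding `factor` times the running mean of earlier amounts."""
    count = 0
    total = 0.0
    for amount in amounts:
        # Decide immediately, using only what has been seen so far.
        if count > 0 and amount > factor * (total / count):
            yield amount
        count += 1
        total += amount


# The iterator stands in for a live feed of transaction amounts.
stream = iter([20.0, 25.0, 18.0, 400.0, 22.0])
flagged = list(flag_unusual(stream))
print(flagged)  # [400.0]
```

Each value is examined once, as it arrives, and nothing has to be loaded into a database first; only a running count and total are kept in memory.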
The Benefits of Big Data
Being able to create relevant insights from raw data both quickly and accurately is what Big Data management solutions are all about. Their benefits include:
- Customer experience: by processing Big Data, companies can draw conclusions which allow them to improve their customers’ experiences. Those companies who grow most effectively are the ones that collect and use their customer experience data.
- Resilience and risk management: Preparing for the unexpected and anticipating risk are two important business strategies that can be achieved by using Big Data. The coronavirus pandemic that emerged in 2020 exposed vulnerabilities in many businesses’ operations and highlighted the need for agility and resilience.
- Cost savings and greater efficiency: Applying advanced Big Data analytics across an organisation’s processes makes it possible to spot inefficiencies and implement solutions quickly.
- Improved competitiveness: Organisations can please their customers, save money, be innovative and improve their products, all through the insights gleaned from Big Data.
- Product and service development: Analysing reviews, trends and other unstructured data is a crucial part of a business’s strategy, made possible through Big Data.
- Predictive maintenance: Consultants McKinsey discovered in a recent survey that when a business analyses Big Data from IoT-connected machines, it can reduce equipment maintenance costs by up to 40%.
AI and Big Data
Data is often seen as the blood that circulates through an AI system. In order to fulfil its function, an AI system must be able to learn from the data, too. The relationship between AI and Big Data is reciprocal; without AI capabilities, Big Data would not have much practical use. And AI itself depends on Big Data to be able to produce the sort of robust and actionable analytics that are needed.
Machine Learning and Big Data
One of the most important things to come from analysis of Big Data is actionable insight. Machine learning technology is ideally placed to define data and pick out patterns within it; the better the datasets being analysed, the more scope there is for the machine learning system to learn and improve its processes.
How Big Data Works
Any business wanting to work with Big Data needs to have systems and processes that are able to collect, store and analyse the data. Only then will the system yield the valuable insight that can be used to improve the business. The three steps to using Big Data successfully are:
- Collect Big Data: standard traditional databases based on disks are simply not suitable for handling Big Data, which will flood in from many different sources. To manage this type of data, you need in-memory databases and software which is designed with Big Data in mind.
- Store Big Data: bespoke, unlimited cloud storage right from the start of the Big Data journey is the best way to store data; simply repurposing existing on-premise storage may not be enough to meet the processing requirements that will arise.
- Analyse Big Data: AI and machine learning help to get the most out of the analysis of Big Data, and are an indispensable part of this process. Insights must also be generated at high speed, and processes used must be capable of self-optimising and learning from experience.
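The three steps above can be sketched as a toy pipeline in Python. Everything here is a simplified assumption: the source names and record fields are invented, and an in-process text buffer stands in for real cloud storage.

```python
import io
import json
from collections import defaultdict


def collect(sources: dict) -> list:
    """Tag every record with the name of the source it came from."""
    return [
        {"source": name, **record}
        for name, records in sources.items()
        for record in records
    ]


def store(records: list, sink: io.StringIO) -> None:
    """Append records as JSON lines; the buffer stands in for real storage."""
    for record in records:
        sink.write(json.dumps(record) + "\n")


def analyse(sink: io.StringIO) -> dict:
    """Total the `value` field per source from the stored JSON lines."""
    totals = defaultdict(float)
    for line in sink.getvalue().splitlines():
        record = json.loads(line)
        totals[record["source"]] += record["value"]
    return dict(totals)


sources = {
    "sensor": [{"value": 1.5}, {"value": 2.5}],
    "web": [{"value": 10.0}],
}
sink = io.StringIO()
store(collect(sources), sink)
print(analyse(sink))  # {'sensor': 4.0, 'web': 10.0}
```

Keeping the three stages as separate functions mirrors the separation described above: each stage could be replaced by a larger-scale system without changing the other two.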
Big Data Applications
Large organisations with complex operations usually benefit the most from Big Data, but virtually any business of any size or type can make some gains from it. Some common examples include:
- Healthcare: diagnoses are becoming more accurate with the adoption of Big Data analysis by healthcare professionals. Such analysis also enables the healthcare sector to manage risk, react to trends and cut out unnecessary spending. There have also been specific benefits during the Covid-19 pandemic; allowing teams to collaborate and analyse together is transforming the face of clinical science, allowing better treatments and ways of managing illnesses not possible before the advent of Big Data.
- Finance: Customer satisfaction and experience have been among the main beneficiaries of Big Data in the financial services sector. Advanced Big Data management systems also mean data is stored more securely. Other areas it can benefit include tax reform, risk analysis, automation, fraud detection and investigation, trade and investment.
- Transport and Logistics: Customers now routinely demand speedy next-day delivery for items they buy online; this is made possible in part by Big Data, which can optimise delivery routes, consolidate orders and delivery loads, and implement fuel efficiency measures.
- Energy and Utilities: instead of having to rely on outdated analogue meters and manual readings, utility companies can now use smart meters to gain intelligence on customer behaviour and usage. This, in turn, means more accurate pricing and forecasting. And if employees no longer have to read meters themselves, they are free to spend more time carrying out essential tasks such as repairs and upgrades.
- Education: the coronavirus pandemic has meant that remote learning has been relied on more than ever before. But it brings many new challenges, such as how best to provide online learning and how to analyse and assess students’ performance remotely. Big Data helps with this; developing blended learning, new assessment methods and personalising the education offering to each student are all made possible through the use of Big Data.
Big Data Technologies
Big Data architecture: just as in the construction industry, architecture and planning is important in the world of Big Data. A blueprint sets out how a business will manage and process their data. Architecture defines the journey the Big Data will go on, over four ‘layers’, from the source of the data, to its storage, analysis and then on to the consumption layer where business intelligence is produced from the results.
Big Data analytics: when it comes to Big Data, the more an organisation is engaged in developing their management capabilities and analytics, the better their business results are. Analytics enable valuable visualisation to happen through the use of data modelling and algorithms.
Big Data and Apache Hadoop: Hadoop is an open-source framework designed to manage the processing of Big Data when it is spread across a network of multiple connected computers. Imagine trying to find something in a large box filled with items, compared to ten smaller boxes with only a few items in each. It will be easier to find what you’re looking for in the smaller, less densely-packed boxes. This is the principle that Hadoop works on. By harnessing the power of an almost unlimited number of computers, it can analyse data in parallel, often using a programming model known as MapReduce.
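The MapReduce model mentioned above can be illustrated with the classic word-count example. The sketch below is plain Python and does not use Hadoop's own API; the function names are illustrative, and a thread pool on one machine stands in for the separate computers of a real cluster.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk: str) -> list:
    """Emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks: list) -> dict:
    """Group the emitted values by key across all chunks."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for word, count in pairs:
            grouped[word].append(count)
    return grouped


def reduce_phase(grouped: dict) -> dict:
    """Sum the grouped values for each key."""
    return {word: sum(counts) for word, counts in grouped.items()}


# Each chunk plays the role of one of the "smaller boxes".
chunks = ["big data is big", "data lakes hold raw data"]

# Each chunk is mapped independently; threads stand in for cluster nodes.
with ThreadPoolExecutor() as pool:
    mapped = list(pool.map(map_phase, chunks))

word_counts = reduce_phase(shuffle(mapped))
print(word_counts["big"], word_counts["data"])  # 2 3
```

Because the map step never needs to see more than one chunk at a time, the chunks can be processed wherever they happen to be stored, and only the small intermediate results need to be brought together.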
Data lakes, warehouses and NoSQL: these are ways of storing unstructured and semi-structured data (so-called ‘non-traditional datasets’) that can’t be stored in spreadsheets, for example. A data lake is simply a pool of raw data waiting to be processed. A data warehouse is where data that has already been processed is kept. And a NoSQL database provides a mechanism for storing and retrieving data that cannot be stored in traditional relational databases. Often, a business will use a combination of lakes, warehouses and NoSQL to manage their data.
In-memory databases: instead of having to retrieve data from a hard disk before processing and analysis, an in-memory database does this entirely in RAM (in effect, a computer’s short-term memory). This type of database is much faster because reading from and writing to RAM is far quicker than accessing a disk, as traditional disk-based database models must.
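As a small-scale illustration of the in-memory idea, Python's built-in `sqlite3` module can keep an entire database in RAM by connecting to the special name `:memory:`. SQLite is a lightweight embedded database and not a Big Data platform, so this is only a sketch of the principle; the table and sensor names are invented.

```python
import sqlite3

# ":memory:" asks SQLite to keep the whole database in RAM instead of a file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, temperature REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("a", 20.0), ("a", 22.0), ("b", 19.0)],
)

# The query runs without touching the disk at any point.
averages = dict(
    conn.execute(
        "SELECT sensor, AVG(temperature) FROM readings "
        "GROUP BY sensor ORDER BY sensor"
    )
)
conn.close()
print(averages)  # {'a': 21.0, 'b': 19.0}
```

The trade-off is that data held only in RAM is lost when the process ends, which is why in-memory systems are usually paired with some form of persistent storage.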