How big is big data? What is the difference between big data and lots of data? What are the best architectures for different businesses? What skills and experience should recruiters and IT department heads be looking for when thinking about evolving a big data solution?
My main industry, oil and gas, has been dealing with lots of data since the mid-1980s (e.g. well log data, which is measured at 6″ intervals, and seismic data, which has always been voluminous).
As an example, my software product EssRisk stores reservoir simulation results in a desktop relational database (H2, an excellent pure Java DBMS). In a single database I might store 50 billion floating-point values (1000 time steps x 1000 wells x 50 measurements per well x 1000 simulations). This may be a lot of data, but it can be perfectly well managed in a conventional DBMS, and performance is not an issue. The trick is to store arrays using the object extensions of SQL3 (defined in the SQL:1999 standard but only implemented in some DBMSs, such as Postgres and H2).
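As a concrete illustration, here is a minimal sketch of the idea (the table and column names are invented for this example, not the actual EssRisk schema, and it assumes H2 2.x and its JDBC driver on the classpath): one row per (simulation, well, measurement), with the full time series held in a single array column rather than one row per time step.

```java
import java.sql.*;

// Hypothetical sketch: store a 1000-step time series as a single SQL array value in H2.
public class ArrayStorageSketch {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:h2:mem:demo", "sa", "")) {
            try (Statement st = con.createStatement()) {
                // One row per (simulation, well, measurement); the time-step values
                // go into one array column instead of 1000 separate rows.
                st.execute("CREATE TABLE sim_result(" +
                           " sim_id INT, well_id INT, meas_id INT," +
                           " vals DOUBLE PRECISION ARRAY," +
                           " PRIMARY KEY (sim_id, well_id, meas_id))");
            }

            Double[] series = new Double[1000];
            for (int t = 0; t < series.length; t++) series[t] = Math.random();

            try (PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO sim_result VALUES (?, ?, ?, ?)")) {
                ps.setInt(1, 1);          // simulation 1
                ps.setInt(2, 42);         // well 42
                ps.setInt(3, 7);          // measurement 7
                ps.setObject(4, series);  // H2 maps a Java array onto the ARRAY column
                ps.executeUpdate();
            }

            try (Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery(
                     "SELECT vals FROM sim_result" +
                     " WHERE sim_id = 1 AND well_id = 42 AND meas_id = 7")) {
                rs.next();
                Object[] vals = (Object[]) rs.getArray(1).getArray();
                System.out.println("time steps stored: " + vals.length);
            }
        }
    }
}
```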
I have looked at some Kaggle data, for example the Springleaf data set, which is about 150,000 rows and 1200 columns. Hardly big: I can do all the work on a single desktop and implement algorithms using GPUs and concurrent Java.
Then you have reservoir simulators (which can be viewed as complex fluid dynamics simulators) that can run one-billion-cell models using GPUs in a small server. No trace of any ‘big data’ architecture there; communication and memory are the key.
There is no doubt that moving to big data architectures introduces complexity and constraints. We start to have distributed data and distributed algorithms, which put limitations on what algorithms are feasible, and we start to use DBMSs such as Cassandra, which many would hesitate to call a DBMS at all, as it lacks so much of what would normally be expected of one. CQL is not SQL, and a DBMS which does not allow ad-hoc updates to data is liable to give you a donkey kick when you turn around.
So, let’s have a discussion. Some starting points:
- Many ‘big data’ projects are implemented in ‘big data’ architectures when a much simpler tiny data architecture is both possible and more effective. ‘Big data’ architectures are bleeding edge, and there are many surprises in store for those implementing them for the first time.
- By ‘tiny data’ I mean conventional architectures using conventional SQL DBMSs such as MySQL and Postgres, and languages such as Java. It does not exclude distributed physical implementation of databases, or concurrent and functional programming.
- Software developers love the latest toys and push ‘big data’ solutions because they look good on their résumés.
- Machine learning in big data frameworks such as Spark uses algorithms designed for distributed data, whereas much better algorithms will run much faster on a tiny data architecture. For example, gradient descent optimisation sucks: it was superseded in the 1970s by quasi-Newton methods, but we seem to be using it again solely because tiny data is being implemented on big data distributed architectures, quasi-Newton methods are not so amenable to distributed data, and gradient descent methods are easier to parallelise on a distributed system (see the sketch after this list). Solution? Implement tiny data in a tiny data architecture.
- You have thousands of columns in your data, so you can’t use normal linear regression techniques; the matrix becomes too large to factorise efficiently; you are worried about the ‘curse of dimensionality’; so you have to use steepest descent methods without matrices? Stop, turn around, go back: consider methods to select the subset of features (say 100) which gives the best fit. There is a long and useful history among statisticians of how to do this, and stepwise methods have rightly been left in the dust. Bayesian averaging of multiple small models is something every data scientist should be familiar with.
- Big data DBMSs like Cassandra force an implementation of denormalised tables, which comes at a high cost in future maintainability, data quality, and database size. Moreover, it prevents the implementation of flexible generic data models: rather than having a generic framework for storing measurements, we have a separate column for each measurement, which takes us back to the awful ‘spreadsheet as a database’ world of the 1980s that I thought had been kicked into the long grass.
- Yes, we have lots of data; sensors are taking readings every second. But does this mean we have to use it all in our machine learning algorithms? No: filter the data, summarise the data, transform the data, compress the data, understand the data, and recognise that the data may have patterns or be repetitive. You may find that filtered hourly data holds all the information you need, so big data has turned into tiny data (see the downsampling sketch after this list).
- At least consider a simple two-stage system: store the raw data in a first big data stage, then transform and transfer it into a second tiny data stage for analysis.
- There seems to be a mindset which says ‘let’s not bother trying to improve or tune the algorithms, or benefit from established statistical insights, or even try some feature engineering or variable selection; we have so much data it will look after itself’. This is merely the cri de coeur of the uneducated and inexperienced who have oversold big data to management and are hoping for a silver bullet to save them.
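To make the gradient descent point above concrete, here is a toy, self-contained comparison on the classic two-dimensional Rosenbrock test function. It is only an illustrative sketch (for real problems you would reach for an established optimisation library, and the starting point and tolerances are arbitrary), but it shows what happens once curvature information is exploited by a quasi-Newton method such as BFGS.

```java
// Toy comparison: steepest descent vs the BFGS quasi-Newton method on the
// 2-D Rosenbrock function f(x, y) = (1 - x)^2 + 100 (y - x^2)^2.
public class QuasiNewtonSketch {

    static double f(double[] p) {
        double x = p[0], y = p[1];
        return (1 - x) * (1 - x) + 100 * (y - x * x) * (y - x * x);
    }

    static double[] grad(double[] p) {
        double x = p[0], y = p[1];
        return new double[] { -2 * (1 - x) - 400 * x * (y - x * x), 200 * (y - x * x) };
    }

    static double norm(double[] v) { return Math.sqrt(v[0] * v[0] + v[1] * v[1]); }

    // Backtracking line search satisfying a simple Armijo (sufficient decrease) condition.
    static double lineSearch(double[] p, double[] d) {
        double alpha = 1.0, f0 = f(p);
        double[] g = grad(p);
        double slope = g[0] * d[0] + g[1] * d[1];
        while (f(new double[] { p[0] + alpha * d[0], p[1] + alpha * d[1] })
                > f0 + 1e-4 * alpha * slope && alpha > 1e-12) {
            alpha *= 0.5;
        }
        return alpha;
    }

    public static void main(String[] args) {
        // 1. Plain gradient descent: the search direction is always the negative gradient.
        double[] p = { -1.2, 1.0 };
        int gdIters = 0;
        while (norm(grad(p)) > 1e-6 && gdIters < 100_000) {
            double[] g = grad(p);
            double[] d = { -g[0], -g[1] };
            double a = lineSearch(p, d);
            p[0] += a * d[0];
            p[1] += a * d[1];
            gdIters++;
        }
        System.out.println("gradient descent iterations: " + gdIters);

        // 2. BFGS: maintain an approximation H of the inverse Hessian and use it
        //    to turn the gradient into a curvature-aware search direction.
        double[] x = { -1.2, 1.0 };
        double[][] H = { { 1, 0 }, { 0, 1 } };
        int qnIters = 0;
        while (norm(grad(x)) > 1e-6 && qnIters < 1_000) {
            double[] g = grad(x);
            double[] d = { -(H[0][0] * g[0] + H[0][1] * g[1]),
                           -(H[1][0] * g[0] + H[1][1] * g[1]) };
            double a = lineSearch(x, d);
            double[] s = { a * d[0], a * d[1] };                 // step taken
            double[] xNew = { x[0] + s[0], x[1] + s[1] };
            double[] gNew = grad(xNew);
            double[] yv = { gNew[0] - g[0], gNew[1] - g[1] };    // change in gradient
            double sy = s[0] * yv[0] + s[1] * yv[1];
            if (sy > 1e-12) {                                    // standard BFGS update of H
                double[] Hy = { H[0][0] * yv[0] + H[0][1] * yv[1],
                                H[1][0] * yv[0] + H[1][1] * yv[1] };
                double yHy = yv[0] * Hy[0] + yv[1] * Hy[1];
                for (int i = 0; i < 2; i++)
                    for (int j = 0; j < 2; j++)
                        H[i][j] += (sy + yHy) * s[i] * s[j] / (sy * sy)
                                 - (Hy[i] * s[j] + s[i] * Hy[j]) / sy;
            }
            x = xNew;
            qnIters++;
        }
        System.out.println("BFGS iterations: " + qnIters);
    }
}
```

On problems like this, steepest descent typically needs orders of magnitude more iterations than BFGS to reach the same tolerance, which is exactly the point about curvature.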
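And to illustrate the sensor-data point: a minimal sketch (the reading type and the synthetic signal are invented, and it assumes a recent JDK with records) that collapses a day of per-second readings into hourly means before any modelling is attempted.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.*;
import java.util.stream.Collectors;

// Hypothetical sketch: reduce per-second sensor readings to hourly means.
public class DownsampleSketch {

    record Reading(Instant time, double value) {}

    // Group readings by the hour they fall in and average each group.
    static Map<Instant, Double> hourlyMeans(List<Reading> raw) {
        return raw.stream().collect(Collectors.groupingBy(
                r -> r.time().truncatedTo(ChronoUnit.HOURS),
                TreeMap::new,
                Collectors.averagingDouble(Reading::value)));
    }

    public static void main(String[] args) {
        List<Reading> raw = new ArrayList<>();
        Instant start = Instant.parse("2020-01-01T00:00:00Z");
        for (int s = 0; s < 24 * 3600; s++) {                  // one day of per-second data
            raw.add(new Reading(start.plusSeconds(s), Math.sin(s / 3600.0)));
        }
        Map<Instant, Double> hourly = hourlyMeans(raw);
        System.out.println(raw.size() + " raw readings -> " + hourly.size() + " hourly values");
    }
}
```

If the hourly series carries the signal, the downstream analysis is back in tiny data territory.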
So, how do we avoid these pitfalls?
- Some big data projects, but not many, are truly big data and require all the overhead and inefficiencies of big data architectures and algorithms. Good luck; you will need it. Hire the most seasoned, experienced, pragmatic professionals you can; they will be worth every penny.
- Understand the data. Produce a conceptual data model. Understand generic framework data models and ways to implement flexible data management (a sketch of a generic measurement framework follows this list).
- Don’t trust anybody who turns structured data into unstructured data. Are you trying to read unstructured text from reports? Where did the reports come from? Have you asked your suppliers to provide data in standard formats? As an example, within oil and gas the WITSML digital standard is now being used to transfer data from the wellsite to operators, whereas in the past this was a paper system.
- Produce data flow diagrams, define requirements, and look at different options; don’t think that ‘agile’ means you can change requirements for a complex system a week before release. It is well known that design flaws can be the most expensive to correct.
- Simplify, simplify, simplify. Why are you using Mongo/Kafka/Cassandra? Is it really necessary? Why not do everything with Postgres + concurrent Java + H2O? Can we simplify our time series data into simple arrays? Do we really need to analyse data recorded to 8 significant figures every millisecond?
- Question at each point whether you are over-engineering.
- If you are a startup, start with a tiny data implementation, get it to market, get some revenue, learn, and then maybe consider a future big data implementation.
- Do not let big data architects drive the architecture. Use a seasoned software professional as project manager, someone who can question everything dispassionately.
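Finally, the generic measurement framework mentioned above, as a minimal sketch (hypothetical table, column, and measurement names, run here against an in-memory H2 database): new kinds of measurement become rows in a type table rather than new columns, so the schema does not have to change every time a new sensor or property appears.

```java
import java.sql.*;

// Hypothetical sketch of a generic measurement framework: a narrow measurement
// table plus a type table, instead of one column per measurement.
public class GenericModelSketch {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:h2:mem:model", "sa", "");
             Statement st = con.createStatement()) {

            st.execute("CREATE TABLE measurement_type(" +
                       " type_id INT PRIMARY KEY, name VARCHAR(64), unit VARCHAR(16))");
            st.execute("CREATE TABLE measurement(" +
                       " entity_id INT, type_id INT REFERENCES measurement_type," +
                       " measured_at TIMESTAMP, meas_value DOUBLE PRECISION," +
                       " PRIMARY KEY (entity_id, type_id, measured_at))");

            // Adding a new kind of reading is an INSERT, not an ALTER TABLE.
            st.execute("INSERT INTO measurement_type VALUES (1, 'oil_rate', 'bbl/d')");
            st.execute("INSERT INTO measurement_type VALUES (2, 'tubing_pressure', 'psi')");
            st.execute("INSERT INTO measurement VALUES (42, 1, TIMESTAMP '2020-01-01 00:00:00', 950.0)");
            st.execute("INSERT INTO measurement VALUES (42, 2, TIMESTAMP '2020-01-01 00:00:00', 1450.0)");

            try (ResultSet rs = st.executeQuery(
                    "SELECT t.name, t.unit, m.meas_value FROM measurement m" +
                    " JOIN measurement_type t ON m.type_id = t.type_id WHERE m.entity_id = 42")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " = " + rs.getDouble(3) + " " + rs.getString(2));
                }
            }
        }
    }
}
```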