Usage of relational database systems is reducing in high latency batch processing or datawarehouse systems. Most of the companies are replacing their SQL batch processing systems with MapReduce processing.
I personally like MapReduce processing because of elastic scalability it provides.
It’s going to be difficult for relational database systems to compete in high latency data processing space.
There are still companies out there who will continue to use SQL batch processing either due to lower data volume or due to characteristics of data.
The way relational database systems works should be changed to better support batch processing and larger data volumes.
Relational database systems should behave differently in batch processing when compared to OLTP systems, OLTP systems need strict ACID properties, whereas batch processing do not need to be ACID compliant.
Here are some thoughts on how relational database systems should process data in a high latency batch processing systems.
Take out transaction logs for data writes
This may sound stupid, but YES, there is no need of writing data to transaction log in a batch processing systems. We can still have transactional file to log schema changes.
Usual workflow of batch processing systems is to get data from different sources, do data cleansing, process data which results into smaller data sets for business analysis.
If there is any data loss due to outage or whatever reason, data is always available at source and can be re-processed. There is no need to have point in time recovery in batch processing databases, which are high latency systems, it means, writes to transaction logs can be avoided, no more log backups.
Only risk is, if data at source is deleted and data loss happens during batch processing, if there are any such cases, developers should keep copy of data in files or in any other form until batch processing is successfully completed. Full database backup after recent processing should be good enough for recovery.
Currently different recovery models are provided by relational database systems, but they make it so difficult to avoid writing to transaction logs.
What happens to transactional support in batch processing?
No more transactional support if there is no transactional log, developers should not depend on explicit transactions, they should avoid exposing any incomplete batch to external users, this can be done using views. Failure scenarios should be handled without depending on rollback mechanism provided by explicit transactions.
Does it slow down writes to data files ?
No, writes response time improves, with transaction logs, data is written to disk twice, once to write to transaction log file and later to write to data file. Currently existing minimal logging avoids writing complete data to transaction log, but systems takes a hit when transaction log backup is taken.
What happens to replication or writes to secondary database ?
Built-in replication depends on transactional logs, it won’t work , it should be re-written or users should build their own process to replicate data to secondary servers.
Users building their custom replication process might be better in some scenarios, they know the tolerant latency / SLA's to get data at secondary database and can design the way it works best for them.
There should be configuration at server level to indicate batch processing or OLTP system. OLTP configuration should support full ACID properties, batch processing configuration should take out ACIDity.