Monday, September 15, 2014

Is ACIDity killing relational database systems?

Usage of relational database systems is declining in high-latency batch processing and data warehouse workloads. Many companies are replacing their SQL batch processing systems with MapReduce processing.
I personally like MapReduce processing because of the elastic scalability it provides.
It's going to be difficult for relational database systems to compete in the high-latency data processing space.
There are still companies out there that will continue to use SQL batch processing, either due to lower data volumes or due to the characteristics of their data.


The way relational database systems work should change to better support batch processing and larger data volumes.


Relational database systems should behave differently for batch processing than for OLTP. OLTP systems need strict ACID properties, whereas batch processing does not need to be ACID compliant.


Here are some thoughts on how relational database systems should process data in high-latency batch processing systems.


Take out transaction logs for data writes
This may sound stupid, but yes: there is no need to write data to the transaction log in a batch processing system. We can still keep a transaction log file for schema changes.

The usual workflow of a batch processing system is to get data from different sources, cleanse it, and process it into smaller data sets for business analysis.
If there is any data loss due to an outage or any other reason, the data is still available at the source and can be re-processed. There is no need for point-in-time recovery in batch processing databases, which are high-latency systems; this means writes to transaction logs can be avoided, and log backups go away.

The only risk is data at the source being deleted while data loss happens during batch processing. In such cases, developers should keep a copy of the data in files or some other form until batch processing completes successfully. A full database backup after each processing run should be good enough for recovery.

Relational database systems currently provide different recovery models, but none of them makes it easy to avoid writing to the transaction log.
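The closest you can get in SQL Server today is a sketch along these lines (BatchDb, dbo.StagingSales and dbo.SourceExtract are hypothetical names); the log is still written, but log backups are no longer required and bulk loads can be minimally logged:

-- SIMPLE recovery: no log backups, log space is reused automatically
ALTER DATABASE BatchDb SET RECOVERY SIMPLE;

-- Loading into a heap with TABLOCK qualifies for minimal logging
INSERT INTO dbo.StagingSales WITH (TABLOCK)
SELECT *
FROM dbo.SourceExtract;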


What happens to transactional support in batch processing?
There is no transactional support without a transaction log. Developers should not depend on explicit transactions, and they should avoid exposing any incomplete batch to external users; this can be done using views. Failure scenarios should be handled without depending on the rollback mechanism provided by explicit transactions.
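As a minimal sketch of the view approach, assuming a hypothetical staging table tagged with a batch ID and a hypothetical batch status table, external users query only the view:

-- dbo.SalesStaging and dbo.BatchStatus are hypothetical tables
CREATE VIEW dbo.SalesForReporting
AS
SELECT s.*
FROM dbo.SalesStaging AS s
JOIN dbo.BatchStatus  AS b ON b.BatchId = s.BatchId
WHERE b.Status = 'Complete';   -- readers never see a partially loaded batch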


Does it slow down writes to data files?
No, write response time improves. With transaction logs, data is written to disk twice: once to the transaction log file and later to the data file. Today's minimal logging avoids writing the complete data to the transaction log, but the system still takes a hit when the transaction log backup is taken.


What happens to replication or writes to a secondary database?
Built-in replication depends on the transaction log, so it won't work; it would have to be re-written, or users would have to build their own process to replicate data to secondary servers.
Users building a custom replication process might actually be better off in some scenarios: they know the latency/SLA they can tolerate for getting data to the secondary database and can design whatever works best for them.

There should be a server-level configuration to indicate whether the system is used for batch processing or OLTP. The OLTP configuration should support full ACID properties; the batch processing configuration should take the ACIDity out.




Friday, January 10, 2014

B+ tree data structure explained

It seems there is a lot of confusion among developers about the data structure used to create indexes in database systems.

Most people think indexes in database systems are created using B-trees; in reality, the B+ tree is the structure used for indexes.

Tables in database systems are only logical structures. In my opinion, it is very important to know the physical data structure used to store data; developers can write more efficient code when they know how their data is stored.

I thought I would put together a blog post that takes typical sample data stored in database tables and indexes it with a B+ tree. This post doesn't walk through implementation code; the intention is to provide a high-level explanation of the B+ tree.


Let's take sample data and start building an index using a B+ tree.




Above is the sample employee data; Employee ID (EID) is the key in the data set.
A B+ tree is a combination of a B-tree (the index set) and a sequential data set (where data pages are stored sequentially).

I'll introduce the properties of the B-tree as we make progress on building the index.

Follow these steps carefully.
1) Insert data for EID (Employee ID) 12 ...

Allocate a block to store employee data; assume this block is 8 KB in size.
Save the record for EID=12 in the allocated block as shown below.
             




This 8 KB block is called a data page, which is the "+" part of the B+ tree.
Let's assume each record in our sample data takes 4 KB, which means a maximum of two records can be stored in a data page.


Now is the time to introduce an important property of a data page.
A data page is also a node with a capacity order; the capacity order is the minimum number of records that can be stored in a node. In our sample, the data page has a capacity order of one: a minimum of one record and a maximum of two records can be stored.
If the capacity order of a node is "x", then the minimum number of records in the node is "x", the maximum is "2x", and the node cannot be empty.

We need a data structure that points us to the correct data page when the index key is used; this data structure is the B-tree. Let's create a B-tree with index key 12 and connect it to the data page.

Allocate another 8 KB block and store ONLY the index key (12) in it; this is an index node. An 8 KB block can hold many values, since an index node contains only keys.

Here is the B+ tree with an index node and a data page. Note: the blue index node is also 8 KB in size. The blue index node in the image below is a single-node B-tree, and the data page is the "+" part of the B+ tree. The left pointer points to the data page containing data for employees with IDs less than or equal to 12.

To keep it simple, our index node has a capacity order of two, which means a minimum of two keys and a maximum of four keys per node, except for the root node.
In a B-tree, the root node is allowed to hold a single key; all non-root nodes must have at least as many keys as their capacity order (two in our case for index nodes).

In our sample, the capacity order of a data page is one and the capacity order of an index node is two.

2) Insert the record with EID=1 from our sample set into the B+ tree.

The record with EID=1 goes into the existing page; there is no change to the index pointer.
The data page becomes full with this insertion.

3) Insert the next two records, with IDs 14 & 13, sequentially.
Here is how the tree looks after these two insertions. I'm showing them in a single image, but the insertions happen one after another.
The image of the data page is reduced due to space constraints; the othercolumns section represents all columns except the key for each employee in our sample.

The existing data page was filled by the insertion in step (2), so a new data page is created to hold the new records.
These two data pages are linked sequentially in sorted order of the index key.
The index node is updated with a right-side pointer pointing to the data page containing records with keys greater than 12.

4) Insert data for EID=4.
To keep the data ordered in the data pages, the record with EID=4 should go into the first page, but the first page is already full, so it needs a split. When a page split happens, the existing records are split into two halves and the higher half is moved into a new page.

Here is how it looks after splitting the first page and inserting the new record.


The record with EID=12 is moved to the new page and the EID=4 record is inserted into the first page.
Key values in the index are adjusted to point to the correct data pages.
The pointer to the left of 4 points to the data page containing keys less than or equal to 4.
The pointer to the right of 4 and left of 12 points to the data page containing keys greater than 4 and less than or equal to 12.
The pointer to the right of 12 points to the data page containing keys greater than 12.

5) Follow the same process and insert data for EIDs 6, 7, 8, 3 & 5 sequentially from our sample set.
Here is how the B+ tree looks after these insertions...

At this stage, the root node is full.
Let's see what happens when a new record is added.

6) Insert data for EID=2.
Data for EID=2 should go into the first page, which is full.
A page split happens, and the data for EID=3 is moved to a new page.

With the addition of the new data page, the root node must be adjusted, but the root node at this stage is full.
The capacity order of the root node is two, which means it can contain only up to four keys.
The image above shows how the root node would look if it could contain five keys.
The root node must be split: the middle key 5 (in green) becomes the new root, key 5 moves up, keys 2 & 3, which are smaller than 5, go to the left of the root (5), and keys 7 & 12 go to the right of the root (5).

Here is how it looks after the split of the root node.


In the above image, the nodes enclosed in the orange triangle form the B-tree, which is also called the index set.
The data pages linked sequentially in the logical order of the index keys are called the sequential set.
The combination of the index set (B-tree) and the sequential set of data pages is called the B+ tree.

This completes the creation of the B+ tree from our sample data. If you look at it, the tree was built bottom-up.
When the tree is implemented, the pointers are nothing but addresses of storage blocks on a disk (you can imagine them as offsets in a file).

In relational database terminology, when someone says the leaf-level pages of a clustered index are data pages, they mean the data pages of the sequential set of a B+ tree, as we have seen in our sample tree.
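If you want to see these levels on a real SQL Server index, here is a quick check (dbo.Employee is a hypothetical table name); DETAILED mode returns one row per level of the B+ tree, where index_level 0 is the leaf (the sequential set) and the higher levels are the index set:

-- dbo.Employee is a hypothetical table; index_id 1 is its clustered index
SELECT index_level, page_count, record_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.Employee'), 1, NULL, 'DETAILED');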

Let's see how we can use the tree to query data; here are a couple of queries...

Fetch a single record
Q1) What are the SSN, Address, City and State of the employee with EID = 8?
A) Start at the root node; 8 is higher than the key (5) in the root node. Take the pointer to the right of key 5 in the root node and go one level down to the node with keys 7 & 12. EID=8 is greater than 7 and less than 12, so take the pointer to the right of 7 and go to the page containing the records for EIDs 8 & 12. Load this page into memory, look for the record with EID=8, fetch the required columns (data) from the record, and return them to the client.
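In SQL, assuming a hypothetical dbo.Employee table whose clustered index key is EID, this walk is what the engine does to satisfy a single-key query like:

-- dbo.Employee is a hypothetical table clustered on EID
SELECT SSN, Address, City, State
FROM dbo.Employee
WHERE EID = 8;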


Fetch multiple records using a range scan
Q2) What are the SSN, Address, City and State of the employees with EIDs between 4 and 12 (inclusive)?
A) There are multiple ways to fetch this data; let's use the most commonly used access path.

Range scan:
The starting value in the range is 4. Access the page containing the record for EID=4, then sequentially load pages starting from that page until we hit a page with an EID greater than 12. Because the pages are sequentially ordered, we can stop loading pages once we see a key greater than 12.

In our sample, start at the root node. The starting value in the range, 4, is smaller than key 5 in the root node, so take the pointer to the left of 5 and go one level down to the index node containing keys 2 & 3. Take the pointer to the right of 3 and reach the page containing data for EID=4, then start loading data pages sequentially until you see a page with a value greater than 12 (i.e. the last page in our sample tree). Process the pages loaded into memory by searching for records with EIDs between 4 & 12 and return them to the client.
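And the range query against the same hypothetical dbo.Employee table; the engine seeks to EID=4 and then reads the linked leaf pages sequentially until it passes 12:

SELECT SSN, Address, City, State
FROM dbo.Employee
WHERE EID BETWEEN 4 AND 12;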

Read-ahead: Here I would like to mention the read-ahead operation. Most relational databases fetch not only the referenced page but also some adjacent pages along with it. This avoids additional trips to disk during a range scan and improves performance, provided the data is not fragmented, i.e. the physical and logical ordering of data pages are the same. You cannot take advantage of read-ahead when the data is fragmented.

Read-ahead in relational databases is similar to exploiting spatial locality when accessing data from main memory.

Some other characteristics of a B+ tree:
  • An index node is also a block of keys; the number of keys that can fit in an index block depends on the size of the node (an 8 KB block in our sample). Many keys fit in an index page, so page splits happen less often in index nodes than in data pages. This property keeps the height of the B+ tree short and under control.
  • The presence of a key in an index node doesn't guarantee the presence of data in the data pages. When we search for EID=0 in our sample B+ tree, the index set takes us to the first data page, but the first data page doesn't contain EID=0. You still need to load the data page by traversing the tree and search it for the key you are interested in.
  • A new page created by a page split may be allocated far away from the page being split, yet it gets connected in logical order. This causes random reads and affects performance.
  • Keeping data pages physically adjacent to each other in the logical order of the keys improves performance.
  • A B+ tree enables faster range scans.
  • Most database systems connect index nodes and data pages using doubly linked lists.
  • If the index set is in memory, loading a data page from disk is a faster operation. Keeping the root node of the index in memory improves performance, since the root node is probed in every operation.
  • A B+ tree is built bottom-up.

There are various ways to build a system that enables faster access to data on secondary storage; the B+ tree is the data structure commonly used in RDBMS implementations.
You can create your own storage system that suits your needs.
Here is a white paper on a storage system called Haystack, implemented by Facebook to store their photos:
http://www.stanford.edu/class/cs240/readings/haystack.pdf
Their implementation is simple; I liked it so much that I have probably read it three times.

You don't need to use a relational database if it doesn't suit your application. Create your own storage system, which can be as simple as a file divided into blocks with an index, and you have full control to tweak it as it suits your needs.

If you like this post, please do not forget to share it on social networks by clicking buttons below.


Monday, December 23, 2013

Impact of technology on businesses




(If you are not interested in reading this post completely, scroll all the way down to read the side note.)
I had been planning to blog about some experiences I had on a single day a while back.
I couldn't do it for a long time due to time constraints; I finally got some time away from my work, family, and the app development I do in my free time. One of my apps, Koya, is published in the Windows Store; try it out if you are using Windows 8 or a higher version. Currently I'm working on a prototype for another Windows Phone app, this one about photos.

I live in Redmond, WA, very close to Redmond Town Center, and work as a software developer at Microsoft.


Here are three different experiences I had on the same day.
#1 Borders
I used to go to the Borders bookstore located in Redmond Town Center regularly on weekends to read books and have coffee while my kids and wife spent time in the play area next to the bookstore.
During one of our visits, I saw that all the items in the store had been put on sale and the store was closing. Some of the books were being sold for a dollar or less; even their furniture was up for sale.
I felt sad to see the store employees lose their jobs, particularly having seen them regularly at the store.

#2 Blockbuster
On my way back home on Redmond Way, there was another store, with its huge blue trademark sign still intact, that had already closed (it's now occupied by an Overlake Medical Clinic urgent care). Yes, this one was a Blockbuster video rental store. I used to go there too when they were in business.

#3 Impact of online content on the newspaper industry
After reaching home, I turned on Netflix on my TV and started searching for something to watch. I found a documentary called "Page One: Inside The New York Times"; it looked interesting and I started watching. It was about how badly the newspaper industry has been affected by free online content and how it has had to lay off people due to cost cutting.


Experiencing these three incidents on the same day led me to write this blog post.

I started thinking about these three businesses...
All three were disrupted by the evolution of technology, mainly the internet.

Blockbuster's online competitor, Netflix, is doing well.
Borders' online competitor, Amazon, is doing excellently.
News content available on the internet lets people get news from multiple sources; people now get the latest news digitally without any delay.

Some businesses closed because they could not adjust to changing trends in technology. At the same time, technology has created huge new businesses and new jobs. Just imagine the number of jobs online retailers have created, from the production of packaging material all the way until the product reaches the customer. Sometimes businesses get so busy with what they do that they miss the next big thing.

I started putting my thoughts together on the different things these businesses could do or should have done.

This post is primarily focused on the print edition of newspapers.
It is possible that some of my thoughts may not be feasible to implement, may cost more than I assume, or might have already been implemented.
I did not do any research for this post; I'm just putting down what I have in my mind.

  
Static content of newspapers:
Once a newspaper is printed, the content remains the same for each and every reader; only some local news may differ based on the distribution location.

How can we make this content dynamic based on the user's preferences, location, and time of day?

Here are some thoughts...
  • Kiosk printing
In addition to printing newspapers at centralized locations on huge printing machines, we may need small, fast printers set up as kiosks at various locations. These printers would be connected to the networks of various news agencies, which would be updated regularly with the latest news.
Kiosks would be able to provide news from different news agencies like "New York Times", "USA Today", "Wall Street Journal", "Washington Post", and others, even international news agencies; this would help in sharing costs.
The whole kiosk setup could even be franchised to third parties.

With this setup, the expensive printing infrastructure each news agency maintains could be reduced. This helps reduce printing costs and also minimizes transportation costs.

  • Newspaper content can be changed at printing time.
Kiosk printing makes the print edition of a newspaper dynamic.
The news can be changed at the time of printing.
If there is any breaking news, people don't need to wait until the next day to read about it in the print edition. They can go to any kiosk, select the newspaper they are interested in, and the printer prints the latest news, including the breaking news.
It helps people who prefer the print edition over digital content.

  • The cost of the print edition can change depending on the user's preferences.
If a reader is interested in sports news, he can buy only sports-related news and be charged only for the piece of content he is buying. The cost of a newspaper is not fixed anymore; it changes based on what you are interested in reading. If you want to read news about the president's speech from the "New York Times", just pay for the president's-speech-related news and get it printed.

  • Impact on advertisement.
Advertising will change completely with this model.
The system knows the user's preferences, location, and the exact time of purchase, and additional details can be requested from the user as optional input.
Targeted advertising works perfectly with such useful information.
The advertisement that gets printed will change based on the user's information.
This makes advertising in the print edition more dynamic and targeted.
Advertisers can even buy slots only for specific times of the day and specific locations.
News agencies can even price advertising based on the number of prints (i.e. cost per 1,000 prints, 10K prints, etc.).
That means more advertisers, and the cost of advertising can come down.

  • Size of the newspaper, kiosk & printer
With the kiosk model, the size of a newspaper doesn't need to stay the same as we see today. It can change to fit the new model. The crux of the solution is faster printers; these printers could be larger than the ones we see in our offices. The size of the kiosk may end up being larger than a photo-printing kiosk.

 
All of the above thoughts will not work for newspapers delivered to our homes. If the kiosk model works, there are various ways to reduce costs on those too. Those thoughts will come later :)

As a side note:

Amazon's CEO, Jeff Bezos, who has talked about using drones to ship Amazon products, also owns The Washington Post newspaper.
Drones might work perfectly for getting newspapers to customers quickly, irrespective of weather conditions. The drones I'm envisioning are ones that carry a whole bunch of newspapers and drop them one at a time. There may not be many complaints about accidental drops of newspapers on people; it won't hurt much. The buzzing sound of drones will wake people up early in the morning and signal that their newspaper is at their doorstep, which might be good for some :).
I think experimenting with drones to ship products is a great idea; the package shipping industry may need to get ready for those changes.

If you like this post, please do not forget to share it on social networks by clicking buttons below.

Koya (Mailbox Analytics)






Koya is a Windows Store app, and its mission is to keep emails under control.
Currently Koya supports Gmail accounts only.
Koya helps users identify the email accounts that are filling up their mailboxes.
It counts the number of emails sent by each sender to your account and displays the counts in descending order under each folder.
It lets users delete emails or move emails to a different folder.

How does Koya work?
When Koya is launched by the user, it connects to the Gmail server and gets the sender's email address & UID for each email; a simple aggregation is done and the results are cached in memory along with the list of UIDs. These UIDs are used to perform delete and move operations in each folder.


How often does Koya delete and move emails?
Koya tries to run delete and move operations as soon as it can.
In the current version, Koya checks every 5 seconds (while the app is running) to see if there are deletes or moves to perform and kicks off a clean-up operation if required.
The duration of the clean-up depends on the number of emails to clean up.

Clean-up refers to a delete, a move, or both.

How often is Koya's data refreshed?
Koya is not a real-time system.
Koya's main mission is to do clean-up on the mail server as soon as it can while the app is running.
Koya's data is refreshed EVERY TWO HOURS if the user resumes the app from a suspended state (i.e. Windows did not take the app out of memory since the last run).

This means the folder list and email counts you see in Koya are not real time; they are updated every two hours, provided the app was not closed either manually or by the operating system and the user resumes the app after two hours.

This is done to minimize resource utilization on the user's device.
Again, Koya's purpose is to do clean-up on the mail server soon, rather than provide real-time counts or an up-to-date folder list.

Note: You can always close and restart Koya if you wish to see updated data earlier than scheduled.
Koya always gets fresh data from the mail server when it's started or restarted.

How do I know if a new folder has been created on the mail server?
1) The fastest way to verify is to log in to www.gmail.com and check for the folder.
Here is how it looks in the web interface; sometimes you may need to pull down the circled arrow.


2) You can see your new folder in Koya's folders list after an hour (you can revisit Koya after an hour).


How do I know the clean-up of emails is done?
There are a couple of ways to check this:
1) You can go to the main page (home page) and click the "Deleted/Moved Count" button; it will show the amount of clean-up done. It may take a minute or so to do the clean-up and update the Action Analysis dashboard.
2) Go to www.gmail.com and log in to your account to verify the deleted or moved emails.
Deleted emails are moved to Trash.


Koya V2 Released on 12/10/2013

Thanks to the users of the Windows Store app “Koya (Mailbox Analytics)” for your valuable feedback.
I got many emails with feature requests and feedback on existing features.
Thanks for all your support.

A new version of Koya incorporating user feedback has been published in the Windows Store.
Download the latest version of Koya from the Windows Store and try it out.

Here are the new features in the latest release:
1) Save and Cancel buttons in the email counts grid.
2) Koya reads the most recent 3000 emails when a folder has more than 3000 messages, with an option to count all emails on the user's request.
3) Notifications within the app when a clean-up of emails happens.
4) Performance improvements.

Keep your emails under control using Koya.



--Arif Khan




Sunday, December 22, 2013

Random vs Sequential I/O and database systems


The main aim of this post is to set the foundation for explaining how random I/O influences database systems.

Before going into the details, we need to understand how data is stored at the physical layer (the hard disk). It's important to know storage at the physical layer to better understand the data access techniques used by RDBMS systems.

Without going into too many details of the hard disk, the important components for this post are: platter, read/write head, tracks, and sectors.


The platter (below image) consists of a set of tracks (the red and black rings); a track is divided into sectors (the yellow sections in the red track).



The sector is the lowest physical storage unit; the typical sector size is 512 bytes, and data is stored in sectors (there are disks with more than 512 bytes per sector).

To save a file of 900 bytes, two 512-byte sectors are allocated on the physical layer.

Sequential Access
Let's walk through a sample (the sector size is 512 bytes in our sample)...
1) A file is created with an initial size of 1024 bytes; two contiguous sectors are allocated on the outermost track (say track1).
2) Append 1024 bytes to the file; it needs two more sectors, which are allocated adjacent to the previous two.
3) Append 1024 bytes to the file; it needs two more sectors, which are allocated adjacent to the previous four.

The data blocks for the file are allocated sequentially on track1.

To read this data, certain operations have to be performed...
 i)   Move the read/write head to track1 (seek time)
 ii)  Wait for the sectors to come under the read/write head (rotational latency)
 iii) Transfer the data to the OS (data transfer rate)

Typical seek times are anywhere between 9 and 15 ms; let's say 10 ms in our case.
Rotational latency depends on the rpm of the disk; it typically runs from 2 to 5 ms (http://en.wikipedia.org/wiki/Rotational_latency#Rotational_latency); assume 2 ms in our case.
Data transfer rates have increased over time; typical transfer rates run from 50 MB/sec to 170 MB/sec.
In our case, assume it's 50 MB/sec (50 * 1024 KB/sec).

Total cost to read the file: 10 ms (seek time) + 2 ms (rotational latency) + ~0.06 ms (time to transfer 3 KB of data) ≈ 12.06 ms.

The seek time is for moving the read/write head to track1 (if it is not already there); from then on, data is read sequentially.

Among the three costs, seek time is the most expensive one.


Random Access
Let's walk through the same sample when data is saved randomly.

1) A file is created with an initial size of 1024 bytes; it needs two sectors, and these two sectors are allocated on two different tracks (track1 & track100).
2) Append 1024 bytes to the file; it needs two more sectors, allocated on two different tracks (track2 & track200).
3) Append 1024 bytes to the file; the two new sectors are allocated on two different tracks (track3 & track300).

The data blocks for the file are allocated randomly at different locations.


Read/write head movement sequence: track1 -> track100 -> track2 -> track200 -> track3 -> track300

Sequence of physical operations and the cost of each:
A) Cost for track1: 10 ms (seek time) + 2 ms (rotational latency) + ~0.01 ms (time to transfer 512 bytes) ≈ 12.01 ms

B) The read/write head is on track1; it has to move to track100 to read the second block, which means another seek.
Cost for track100: 10 ms (seek time) + 2 ms (rotational latency) + ~0.01 ms (time to transfer 512 bytes) ≈ 12.01 ms

C) The read/write head is on track100; it has to move to track2 to read the third block.
Cost for track2: 10 ms (seek time) + 2 ms (rotational latency) + ~0.01 ms (time to transfer 512 bytes) ≈ 12.01 ms

D) The read/write head is on track2; it has to move to track200 to read the fourth block.
Cost for track200: 10 ms (seek time) + 2 ms (rotational latency) + ~0.01 ms (time to transfer 512 bytes) ≈ 12.01 ms

E) The read/write head is on track200; it has to move to track3 to read the fifth block.
Cost for track3: 10 ms (seek time) + 2 ms (rotational latency) + ~0.01 ms (time to transfer 512 bytes) ≈ 12.01 ms

F) The read/write head is on track3; it has to move to track300 to read the sixth block.
Cost for track300: 10 ms (seek time) + 2 ms (rotational latency) + ~0.01 ms (time to transfer 512 bytes) ≈ 12.01 ms

Total cost to read the file = 6 × 12.01 ms ≈ 72.06 ms

Access time increases roughly six-fold when the data is randomly distributed.

This is the cost you pay when data is fragmented.
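If you want to play with these assumptions, here is the same arithmetic as a quick T-SQL calculation (the numbers are this post's assumptions, not measurements):

DECLARE @seek_ms float = 10, @rotation_ms float = 2, @transfer_kb_per_ms float = 50 * 1024 / 1000.0;
SELECT sequential_ms = @seek_ms + @rotation_ms + 3 / @transfer_kb_per_ms,          -- one seek, 3 KB read
       random_ms     = 6 * (@seek_ms + @rotation_ms + 0.5 / @transfer_kb_per_ms);  -- six seeks, 512 bytes each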

There have been a lot of improvements in data transfer rates over the years, but only a small improvement in seek times, because of the mechanical movement of the read/write head.
Improvements in data transfer rates help systems like MapReduce, which scan data sequentially.


My experience with database systems has been mostly with Microsoft SQL Server; when I refer to specific aspects of database systems in my posts, it will be with reference to SQL Server.

Fragmentation is the silent killer of query performance in database systems. If the execution plan looks perfect as far as physical access to the data goes and the query still takes long to execute, most likely your data is fragmented, either in an index or in a heap.

Fragmentation leads to random I/O and hurts query performance.
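A quick way to check fragmentation in SQL Server (dbo.MyTable is a hypothetical table name):

-- avg_fragmentation_in_percent near 0 means the logical order of pages matches
-- their physical order; high values mean range scans degenerate into random I/O
SELECT index_id, index_type_desc, avg_fragmentation_in_percent, page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.MyTable'), NULL, NULL, 'LIMITED');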

Continue to my next post at http://technicalpostsbyarifkhan.blogspot.com/2013/12/influence-of-random-ios-on-key-look-ups.html to learn more about how random I/O influences database systems.


If you like this post, please do not forget to share it on social networks by clicking buttons below.

Random I/O and key look-ups


If you haven't read my previous post on sequential vs random access, please go to this link

http://technicalpostsbyarifkhan.blogspot.com/2013/12/random-vs-sequential-io-and-effects-on.html before proceeding further with this post.

This post focuses primarily on key lookups when a non-clustered index is used to access data, using Microsoft SQL Server as the database system.

Here is the key difference between clustered and non-clustered indexes.
Clustered index: the leaf-level pages of a clustered index are data pages (all columns of a row, and possibly multiple rows per page).

Non-clustered index: the leaf-level pages of a non-clustered index contain the non-clustered index key and a reference to the corresponding row in a data page. This reference is the clustered index key if there is a clustered index; otherwise it is a reference to the physical location of the row.

I could spend a whole lot of time describing clustered and non-clustered indexes, but that's not the purpose of this post.

Data consistency
Clustered and non-clustered indexes are two separate physical structures, and maintaining consistency between them is a key property of RDBMS systems: the 'C for consistency' in the ACID properties.
When data is written to a table, it's the responsibility of the RDBMS to update all indexes to maintain consistency; if the update to any index fails, the write operation fails and the data is rolled back to its previous state.


Now let's get to the core topic of this post, i.e. the key lookups used in SQL Server.

Here is a sample script to simulate key lookups...

-- Table used to simulate key lookups
CREATE TABLE dbo.TestNCIUsage (
      ci    int NOT NULL               -- clustered index key (random values)
    , c2    int NOT NULL
    , c3    int NOT NULL
    , c4    int NOT NULL
    , c5    varchar(100)
    , nci   int NOT NULL               -- non-clustered index key (decreasing values)
    , unqid uniqueidentifier DEFAULT (NEWID())
);

-- Insert 2M rows of sample data
DECLARE @i        int = 1
      , @nciValue int = 2000000;

WHILE (@i <= 2000000)
BEGIN
    INSERT INTO dbo.TestNCIUsage (ci, c2, c3, c4, c5, nci)
    SELECT CAST(RAND() * 10000000 AS int)   -- random number for ci
         , 2000
         , 3000
         , 4000
         , 'SQL SERVER 2012 TESTING'
         , @nciValue;

    SET @nciValue -= 1;
    SET @i += 1;
END

-- Clustered index is on the randomly generated value
CREATE CLUSTERED INDEX CIX1 ON dbo.TestNCIUsage (ci);

CREATE NONCLUSTERED INDEX NCI_TestNCIUsage__nci ON dbo.TestNCIUsage (nci);

The clustered index on the table is on ci and the non-clustered index is on the nci column.
The table is populated with 2M rows; the value of ci is randomly generated and nci is a decreasing value starting from 2M.


1) Key lookups
select nci,c2 from DBO.TestNCIUsage
where nci between 1 and 4888

In the above select statement, columns nci & c2 are selected; nci has a non-clustered index, but c2 is NOT in the index.

Here is the execution plan for the above query..




If we analyze this plan, two index structures are used to fetch data from disk and a nested loops operator is used to join the data.
Data access starts with the non-clustered index, which returns nci values between 1 and 4888. Each nci value in the non-clustered index also has its corresponding clustered index key (ci) at the leaf level; this clustered index key (ci) is used to traverse the clustered index's B+ tree and fetch c2 from the data pages. This is what is called a key lookup.

Here is where the previous post on sequential vs random access helps...
Each lookup into the clustered index to get c2 may potentially end up as a random read.
This random read is not caused by fragmentation; it happens because each key lookup has to traverse the clustered index B+ tree and fetch data from pages that are not logically ordered. The key lookups in the sample select statement may cause 4888 random reads; random reads are the most expensive I/O operations, and the cost of each random read was calculated in the previous post.

Now let's change the select statement

2) Clustered index scan
select nci,c2 from DBO.TestNCIUsage
where nci between 1 and 4889

Here is the execution plan when the upper bound in the where clause is changed to 4889.





Apparently, the SQL Server optimizer estimated too many random reads when the value in the where clause was changed to 4889 and decided to go for a clustered index scan (which is almost the same as a table scan).
It turns out that scanning the whole 2M-row table sequentially is cheaper than performing 4889 random reads.

Key lookups cause random reads, and random reads are the most expensive operations.

Key lookups are fine if you are reading a small set of rows; they become a problem when you are fetching more data. The general threshold seems to be around 5000 rows; more than 5000 tends to make SQL Server go for a clustered index scan, even if the table is far larger than the sample shown in this post. These numbers are approximate; you need to test with your own data.

Using non-clustered indexes with key lookups works fine in transactional systems, where reads are small. It becomes an issue in data warehouses, where large reads are performed; a non-clustered index will be inefficient if it does not cover all the columns used in the query while you are reading a large number of rows.

There are many other factors that determine the cost of I/O operations, such as row size, the number of rows in the table, and the number of pages. We are not going into that discussion in this post.

There are a couple of ways to make a non-clustered index cover all the selected columns:
i) Include all the columns as part of the non-clustered index key, which is inefficient and may not work in the case of unique indexes.
ii) Use SQL Server's INCLUDE clause to make the columns part of the leaf-level pages instead of the index key (see the example below).
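Here is a sketch of option (ii) for the sample table above; with this index in place, the earlier queries on nci and c2 no longer need key lookups:

-- c2 is stored only in the leaf-level pages of the non-clustered index,
-- not in the index key, so the index covers "select nci, c2 ... where nci between ..."
CREATE NONCLUSTERED INDEX NCI_TestNCIUsage__nci_covering
ON dbo.TestNCIUsage (nci)
INCLUDE (c2);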

Key points from my two posts:
(1) Minimize fragmentation of data.
(2) Minimize operations that cause random reads.
(3) A non-clustered index is inefficient for large reads; if the index is not covering, it is even worse, because you are paying the cost of updating the index on writes but not using it for reads.

I hope these two posts were helpful.


If you like this post, please do not forget to share it on social networks by clicking buttons below.