Monday, December 23, 2013

Impact of technology on businesses




(If you don't want to read this post in full, scroll all the way down to the side note.)
I had been meaning to blog about some experiences I had on a single day a while back.
I couldn't do it for a long time due to time constraints, but I finally found some time away from work, family, and the app development I do in my free time. One of my apps, Koya, is published in the Windows Store; try it out if you are using Windows 8 or a higher version. Currently I'm working on a prototype for another Windows Phone app, this one focused on photos.

I live in Redmond, WA, very close to Redmond Town Center, and work as a software developer at Microsoft.


Here are three different experiences I had on the same day.
#1 Borders
I used to go to the Borders bookstore in Redmond Town Center regularly on weekends to read books and have coffee while my wife and kids spent time in the play area next to the store.
During one of our visits, I saw that everything in the store was on sale and the store was closing. Some books were being sold for a dollar or less; even the furniture was up for sale.
It was sad to see the store employees lose their jobs, particularly having seen them there regularly.

#2 Blockbuster 
On my way back home on Redmond Way, there was another store, already closed, with its huge blue trademark sign still intact (the space is now occupied by an Overlake Medical Clinic urgent care). Yes, this one was a Blockbuster video rental store. I used to go there too when they were in business.

#3 Impact of online content on the newspaper industry
After reaching home, I turned on Netflix on my TV and started searching for something to watch. I found a documentary called "Page One: Inside The New York Times" that looked interesting and started watching. It was about how badly the newspaper industry has been hit by free online content, and the layoffs that have come with the cost cutting.


Experiencing these three incidents on the same day led me to write this blog post.

I started thinking about these three businesses...
All three were disrupted by the evolution of technology, mainly the internet.

Blockbuster's online competitor, Netflix, is doing well.
Borders' online competitor, Amazon, is doing excellently.
News content on the internet lets people get news from multiple sources; people now get the latest news digitally without any delay.

Some businesses closed because they could not adjust to changing trends in technology. At the same time, technology has created huge new businesses and new jobs. Just imagine the number of jobs online retailers have created, from the production of packaging material all the way until the product reaches the customer. Sometimes businesses get so busy with what they do that they miss the next big thing.

I started putting together my thoughts on the different things these businesses could do, or should have done.

This post is primarily focused on the print edition of newspapers.
Some of my thoughts may not be feasible to implement, may cost more than I assume, or may already have been implemented.
I did not do any research for this post; I'm simply putting down what's on my mind.

  
Static content of newspapers:
Once a newspaper is printed, the content remains the same for every reader, though some local news may differ based on the distribution location.

How can we make this content dynamic based on the reader's preferences, location, and time of day?

Here are some thoughts...
  • Kiosk printing
In addition to printing newspapers at centralized locations on huge printing machines, we could set up small, fast printers as kiosks at various locations. These printers would be connected to the networks of different news agencies, which would update them regularly with the latest news.
Kiosks would be able to provide news from different agencies like "New York Times", "USA Today", "Wall Street Journal", "Washington Post", and others, even international news agencies; this would help share costs.
The whole kiosk setup could even be franchised to third parties.

With this setup, the expensive printing infrastructure each news agency maintains can be reduced, cutting printing costs and also minimizing transportation costs.

  • Newspaper content can change at print time
Kiosk printing makes the print edition of a newspaper dynamic: the news can change right up to the moment of printing.
If there is breaking news, people don't need to wait until the next day to read about it in print. They can go to any kiosk, select the newspaper they are interested in, and the printer prints the latest news, breaking news included.
This helps people who prefer the print edition over digital content.

  • The price of the print edition can vary with the reader's preferences
If a reader is interested in sports news, he can buy only sports-related content and be charged only for what he is buying. The cost of a newspaper is no longer fixed; it changes based on what you want to read. If you want the "New York Times" coverage of the president's speech, just pay for that coverage and get it printed.

  • Impact on advertising
Advertising would change completely with this model.
The system knows the user's preferences, location, and exact time of purchase, and additional details can be requested from the user as optional input.
Targeted advertising works perfectly with such information: the ads that get printed change based on the user's information, making print advertising more dynamic and targeted.
Advertisers could even buy slots for specific times of day and specific locations, and news agencies could price advertising by the number of prints (e.g., cost per 1,000 prints, per 10,000 prints, etc.).
That means more advertisers, and the cost of advertising can come down.
  
  • Size of the newspaper, kiosk, and printer
With the kiosk model, the newspaper doesn't need to stay the size we see today; it can change to fit the new model. The crux of the solution is fast printers, which could be larger than the ones we see in our offices. The kiosk itself may end up larger than a photo-printing kiosk.

 
The thoughts above won't work for newspapers delivered to our homes. If the kiosk model works, there are various ways to reduce costs there too. Those thoughts later :)

As a side note:

Amazon's CEO, Jeff Bezos, who has been talking about using drones to ship Amazon products, also owns The Washington Post.
Drones might work perfectly for getting newspapers to customers quickly, irrespective of weather conditions. The drones I'm envisioning carry a whole bunch of newspapers and drop one at a time. There may not be many complaints about accidental drops on people; a newspaper won't hurt much. The drones' buzzing will wake people early in the morning and signal that their newspaper is at the doorstep, which might be good for some :).
I think experimenting with drones to ship products is a great idea; the package shipping industry may need to get ready for those changes.

If you like this post, please don't forget to share it on social networks using the buttons below.

Koya (Mailbox Analytics)






Koya is a Windows Store app whose mission is to keep emails under control.
Currently Koya supports Gmail accounts only.
Koya helps users identify the senders that are filling up their mailboxes.
It counts the number of emails sent by each sender to your account and displays the counts in descending order under each folder.
It lets users delete emails or move them to a different folder.

How does Koya work?
When Koya is launched by the user, it connects to the Gmail server and gets the sender's email address and the UID for each email; a simple aggregation is done, and the results are cached in memory along with the list of UIDs. These UIDs are used to perform the delete and move operations in each folder.
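Conceptually, the aggregation is a simple GROUP BY. Here is an illustrative sketch only (Koya does this in memory, not in SQL; the table and column names below are hypothetical):

--hypothetical staging table with one row per fetched header: (FolderName, Sender, Uid)
SELECT FolderName
      ,Sender
      ,COUNT(*) AS EmailCount
FROM DBO.FetchedHeaders
GROUP BY FolderName, Sender
ORDER BY FolderName, EmailCount DESC;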


How often does Koya delete and move emails?
Koya tries to run delete and move operations as soon as it can.
In the current version, Koya checks every 5 seconds (while the app is running) to see if there are deletes or moves to perform and kicks off a clean-up operation if required.
The duration of the clean-up depends on the number of emails to clean up.

Clean-up refers to delete, move, or both.

How often is Koya's data refreshed?
Koya is not a real-time system.
Koya's main mission is to do clean-up on the mail server as soon as it can while the app is running.
Koya's data is refreshed EVERY TWO HOURS, if the user resumes the app from a suspended state (i.e., Windows has not terminated the app since its last run).

This means the folders list and email counts you see in Koya are not real time; they are updated every two hours, provided the app has not been closed (either manually or by the operating system) and you resume it after two hours.

This is done to minimize resource utilization on the user's device.
Again, Koya's purpose is to do clean-up on the mail server quickly, rather than to provide real-time counts or an up-to-date folders list.

Note: You can always close and restart Koya if you wish to see updated data earlier than scheduled.
Koya always gets fresh data from the mail server when it's started or restarted.

How do I know if a new folder was created on the mail server?
1) The fastest way to verify is to log in to www.gmail.com and check for the folder.
Here is how it looks in the web interface; sometimes you may need to click the circled refresh arrow.


2) You can see your new folder in Koya's folders list after an hour (you can revisit Koya after an hour).


How do I know when the clean-up of emails is done?
There are a couple of ways to check:
1) Go to the main page (home page) and click the "Deleted/Moved Count" button; it will show the amount of clean-up done. It may take a minute or so to do the clean-up and update the Action Analysis dashboard.
2) Go to www.gmail.com and log in to your account to verify the deleted or moved emails.
Deleted emails are moved to Trash.


Koya V2 Released on 12/10/2013

Thanks to the users of the Windows Store app "Koya (Mailbox Analytics)" for your valuable feedback.
I got many emails with feature requests and feedback on existing features.
Thanks for all your support.

A new version of Koya incorporating user feedback is published in the Windows Store.
Download the latest version of Koya from the Windows Store and try it out.

Here are the new features in the latest release:
1) Save and Cancel buttons in the email counts grid.
2) When a folder has more than 3000 messages, Koya reads the most recent 3000 emails, with an option to count all emails on the user's request.
3) In-app notifications when email clean-up happens.
4) Performance improvements.

Keep your emails under control using Koya.



--Arif Khan




Sunday, December 22, 2013

Random vs Sequential I/O and database systems


The main aim of this post is to set the foundation for explaining how random I/Os influence database systems.

Before going into the details, we need to understand data storage at the physical layer (the hard disk). Knowing storage at the physical layer gives a better understanding of the data access techniques used by RDBMS systems.

Without going into too many details of how hard disks work, the components that matter for this post are: the platter, the read/write head, tracks, and sectors.


A platter (image below) consists of a set of tracks (the red and black rings); a track is divided into sectors (the yellow sections in the red track).



A sector is the lowest physical storage unit. The typical sector size is 512 bytes, and data is stored in sectors (some disks have more than 512 bytes per sector).

To save a file of 9000 bytes, 18 sectors are allocated at the physical layer (9000 / 512 rounds up to 18).

Sequential Access
Let's walk through a sample (sector size is 512 bytes in our example)...
1) A file is created with an initial size of 1024 bytes; two contiguous sectors are allocated on the outermost track (say track1).
2) 1024 bytes are appended to the file; it needs two more sectors, which are allocated adjacent to the previous two.
3) Another 1024 bytes are appended; two more sectors are allocated adjacent to the previous four.

The file's data blocks are allocated sequentially on track1.

To read this data, a few operations have to be performed...
 i)   Move the read/write head to track1 (seek time)
 ii)  Wait for the sectors to come under the read/write head (rotational latency)
 iii) Transfer the data to the OS (limited by the data transfer rate).

Typical seek times are anywhere between 9 and 15ms; let's say 10ms in our case.
Rotational latency depends on the disk's rpm and typically ranges from 2 to 5ms (http://en.wikipedia.org/wiki/Rotational_latency#Rotational_latency); assume 2ms in our case.
Data transfer rates have increased over time; typical rates range from 50MB/sec to 170MB/sec.
In our case, assume 50MB/sec (50 * 1024 KB/sec).

Total cost to read the file: 10ms (seek time) + 2ms (rotational latency) + ~0.06ms (time to transfer 3KB of data) ≈ 12.06ms.

Seek time is incurred to move the read/write head to track1 (if it is not already there); from then on, the data is read sequentially.

Of the three costs, seek time is the expensive one.


Random Access
Let's walk through the same sample when the data is saved randomly.

1) A file is created with an initial size of 1024 bytes; it needs two sectors, which are allocated on two different tracks (track1 & track100).
2) 1024 bytes are appended to the file; the two additional sectors are allocated on two different tracks (track2 & track200).
3) Another 1024 bytes are appended; the two additional sectors are allocated on two different tracks (track3 & track300).

The file's data blocks are scattered randomly across different locations.


Read/Write head movement sequence: track1->track100->track2->track200->track3->track300

Sequence of physical operations and the cost of each:
A) Cost for track1: 10ms (seek time) + 2ms (rotational latency) + ~0.01ms (time to transfer 512 bytes) ≈ 12.01ms

B) The read/write head is on track1 and must move to track100 to read the second block, which means another seek.
Cost for track100: 10ms (seek time) + 2ms (rotational latency) + ~0.01ms ≈ 12.01ms

C) The head is on track100 and must move to track2 to read the third block.
Cost for track2: 10ms + 2ms + ~0.01ms ≈ 12.01ms

D) The head is on track2 and must move to track200 to read the fourth block.
Cost for track200: 10ms + 2ms + ~0.01ms ≈ 12.01ms

E) The head is on track200 and must move to track3 to read the fifth block.
Cost for track3: 10ms + 2ms + ~0.01ms ≈ 12.01ms

F) The head is on track3 and must move to track300 to read the sixth block.
Cost for track300: 10ms + 2ms + ~0.01ms ≈ 12.01ms

Total cost to read the file = 6 × 12.01ms ≈ 72.06ms

Access time increases roughly sixfold when the data is randomly distributed.
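The numbers above follow a simple back-of-the-envelope model, using our assumed 10ms seek, 2ms rotational latency, and 50MB/sec transfer rate. Both reads transfer the same 3KB, so the entire difference comes from the extra seeks and rotational waits:

$$t_{read} = n_{accesses} \times (t_{seek} + t_{rot}) + \frac{bytes}{transfer\ rate}$$

$$\text{Sequential: } 1 \times (10 + 2) + 0.06 \approx 12.06\,ms \qquad \text{Random: } 6 \times (10 + 2) + 0.06 \approx 72.06\,ms$$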

This is the cost you pay when data is fragmented.

Data transfer rates have improved a lot over the years, but seek times have improved only slightly, because of the mechanical movement of the read/write head.
Improvements in data transfer rates help systems like MapReduce, which scan data sequentially.


My experience with database systems has been mostly with Microsoft SQL Server, so when I refer to specific aspects of database systems in my posts, it will be with reference to SQL Server.

Fragmentation is the silent killer of query performance in database systems. If the execution plan looks perfect as far as physical access to the data goes, and the query still takes long to execute, most likely your data is fragmented, either in an index or in the heap.

Fragmentation leads to random I/Os and hurts query performance.
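If you suspect fragmentation in SQL Server, the sys.dm_db_index_physical_stats DMV reports it per index. A minimal sketch (substitute your own table; DBO.TestNCIUsage is the sample table from my next post):

--avg_fragmentation_in_percent shows logical fragmentation for each index;
--'LIMITED' mode scans only the upper levels of the B+ tree, so it is cheap to run
SELECT i.name AS index_name
      ,ps.index_type_desc
      ,ps.avg_fragmentation_in_percent
      ,ps.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('DBO.TestNCIUsage'), NULL, NULL, 'LIMITED') AS ps
LEFT JOIN sys.indexes AS i
  ON i.object_id = ps.object_id
 AND i.index_id = ps.index_id;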

Continue to my next post at http://technicalpostsbyarifkhan.blogspot.com/2013/12/influence-of-random-ios-on-key-look-ups.html to learn more about how random I/Os influence database systems.


If you like this post, please don't forget to share it on social networks using the buttons below.

Random I/O and key look-ups


If you haven't read my previous post on sequential vs. random access, please read http://technicalpostsbyarifkhan.blogspot.com/2013/12/random-vs-sequential-io-and-effects-on.html before proceeding with this post.

This post is primarily focused on the key lookups that occur when a non-clustered index is used to access data, with Microsoft SQL Server as the database system.

Here is the key difference between clustered and non-clustered indexes:
Clustered index: The leaf-level pages of a clustered index are the data pages themselves (all columns of a row, with possibly multiple rows per page).

Non-clustered index: The leaf-level pages of a non-clustered index hold the non-clustered index key and a reference to the row in a data page. This reference is the clustered index key if the table has a clustered index; otherwise it is a reference to the physical location of the row (a RID).

I could spend a whole lot of time describing clustered and non-clustered indexes, but that's not the purpose of this post.

Data consistency
Clustered and non-clustered indexes are two separate physical structures, and maintaining consistency between them is a key property of RDBMS systems, the 'C' (consistency) in ACID.
When data is written to a table, it's the RDBMS's responsibility to update all indexes to maintain consistency; if the update to any index fails, the write operation fails and the data is rolled back to its previous state.


Now let's get to the core of this post: the key lookups used in SQL Server.

Here is the sample script to simulate key lookups...

CREATE TABLE DBO.TestNCIUsage (ci INT NOT NULL
                              ,c2 INT NOT NULL
                              ,c3 INT NOT NULL
                              ,c4 INT NOT NULL
                              ,c5 VARCHAR(100)
                              ,nci INT NOT NULL
                              ,unqid UNIQUEIDENTIFIER DEFAULT (NEWID()))

--insert sample data
DECLARE @i INT = 1
       ,@nciValue INT = 2000000

WHILE (@i <= 2000000)  --<= so the table ends up with exactly 2M rows
BEGIN
    INSERT INTO DBO.TestNCIUsage (ci, c2, c3, c4, c5, nci)
    SELECT CAST(RAND() * 10000000 AS INT) AS ci --random number
          ,2000 AS c2
          ,3000 AS c3
          ,4000 AS c4
          ,'SQL SERVER 2012 TESTING' AS c5
          ,@nciValue

    SET @nciValue -= 1;
    SET @i += 1;
END

--clustered index is on the randomly generated value
CREATE CLUSTERED INDEX CIX1 ON DBO.TestNCIUsage(ci)

CREATE NONCLUSTERED INDEX NCI_TestNCIUsage__nci ON TestNCIUsage(nci)

The clustered index is on ci and the non-clustered index is on the nci column.
The table is populated with 2M rows; the value of ci is randomly generated, and nci is a decreasing value starting from 2M.


1) Key lookups
select nci,c2 from DBO.TestNCIUsage
where nci between 1 and 4888

In the above select statement, columns nci & c2 are selected; nci has a non-clustered index, but c2 is NOT in the index.

Here is the execution plan for the above query..




If we analyze this plan, two index structures are used to fetch data from disk, and a nested loops join combines them.
Data access starts with the non-clustered index, which returns the nci values between 1 and 4888. Each nci entry in the non-clustered index also carries its corresponding clustered index key (ci) at the leaf level; that key is used to traverse the clustered index's B+ tree and fetch c2 from the data pages. This is what's called a key lookup.

Here is where the previous post on sequential vs. random access helps...
Each lookup into the clustered index to get c2 can potentially end up as a random read.
This random read is not because of fragmentation; it's because each key lookup has to traverse the clustered index B+ tree and fetch data from pages that are not in the order the lookups arrive. The key lookups in the sample select statement may cause up to 4888 random reads; random reads are the most expensive I/O operations, and the cost of each one was calculated in the previous post.

Now let's change the select statement

2) Clustered index scan
select nci,c2 from DBO.TestNCIUsage
where nci between 1 and 4889

Here is the execution plan when the upper bound in the where clause is changed to 4889.





The SQL Server optimizer apparently estimated too many random reads when the value was changed to 4889 in the where clause and decided to go for a clustered index scan (which is essentially a table scan on a clustered table).
It turns out that scanning the whole 2M-row table sequentially is cheaper than incurring 4889 random reads.
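A back-of-the-envelope check with the previous post's numbers supports this (a rough sketch: real systems cache pages, so the absolute numbers are much smaller in practice, but the ratio explains the optimizer's choice; a row in this table is roughly 75 bytes):

$$4889 \text{ random reads} \times 12\,ms \approx 59\,s \qquad \text{vs.} \qquad \frac{2M \text{ rows} \times \sim\!75\,bytes}{50\,MB/s} \approx 3\,s$$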

Key lookups cause random reads, and random reads are the most expensive I/O operations.

Key lookups are fine if you are reading a small set of rows; they become problematic when you fetch more data. The general threshold seems to be around 5000 rows; beyond that, SQL Server tends to go for a clustered index scan, even if the table is far larger than the sample in this post. These numbers are approximate; you need to test with your own data. One way to observe the difference yourself is sketched below.
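A sketch of how to compare the two plans' I/O on the sample table (SET STATISTICS IO reports logical reads in the Messages tab; the FORCESEEK table hint forces the seek-plus-key-lookup plan the optimizer rejected):

SET STATISTICS IO ON;

--optimizer's choice at 4889: clustered index scan
select nci,c2 from DBO.TestNCIUsage
where nci between 1 and 4889

--force the non-clustered index seek + key lookups for comparison;
--expect many more logical reads (roughly one lookup per qualifying row)
select nci,c2 from DBO.TestNCIUsage WITH (FORCESEEK)
where nci between 1 and 4889

SET STATISTICS IO OFF;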

Non-clustered indexes with key lookups work fine in transactional systems, where reads are small. They are an issue in data warehouses, where large reads are performed; a non-clustered index will be inefficient if it does not cover all the columns used in the query while you are reading a large number of rows.

There are many other factors that determine the cost of I/O operations, such as row size, the number of rows in a table, and the number of pages. We won't go into those in this post.

There are a couple of ways to make a non-clustered index cover all the selected columns:
i) Include all the columns in the non-clustered index key, which is inefficient and may not work in the case of unique indexes.
ii) Use SQL Server's INCLUDE clause to store the columns in the leaf-level pages instead of the index key (see the example below).
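For the sample query in this post, a covering index would look like this (a sketch; the index name is my own choice):

--c2 is stored only in the leaf-level pages of this index, so a seek on nci
--returns c2 directly, with no key lookup into the clustered index
CREATE NONCLUSTERED INDEX NCI_TestNCIUsage__nci_covering
ON DBO.TestNCIUsage(nci) INCLUDE (c2)

With this index in place, both of the earlier selects can be answered entirely from the non-clustered index, regardless of the range size.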

Key points from my two posts:
(1) Minimize fragmentation of data.
(2) Minimize operations that cause random reads.
(3) A non-clustered index is inefficient for large reads. If the index is not covering, it's even worse: you pay the cost of updating the index on every write but don't benefit from it on reads.

I hope these two posts were helpful.


If you like this post, please don't forget to share it on social networks using the buttons below.