Wednesday, 18 July 2012

Optimized single pass -random record selection from a flat file


Problem Statement: 
Consider we have a file which has millions of unique records in it. We need to process these records in random order following some distribution process i.e. randomly select a unique record from the file every time, process it using any one of the available processing method and ensure that the final output records are spread as per the required distribution. How to achieve it in Single pass of the input file?

Possible Solutions:
a    a.    Create an index which can be used for picking a random record, and then corresponding entire record can be fetched from the input file and processed. But this is somewhat similar to double pass.

b    b. Another way is to add some extra field to all the records such as a running counter. The counter values i.e. 0 to number of records available in file can be stored in a list. This list can be shuffled to get a random order of values in it. Now follow the random order in the shuffled list and fetch the corresponding records from the input file for processing.

In above methods, processing speed will be slow if we do sequential searching to fetch the selected random record from the input file for processing every time. This can be avoided by using fixed record lengths/structure which can be used in calculating the exact record data location in the file. But this results in wastage of memory which can be gained by usage of variable record length. So how to achieve unique random record selection from a flat file with good speed and that too in a single pass?

If you know the solution, please post your reply in the comment. Stay tuned to know the solution in the next post.

Monday, 9 July 2012

Insert Vs Update Query


Assume that I maintain a huge denormalized table in the database which is used to maintain the details of user’s login session. Right now I am not interested in maintaining the user’s history, though it can be made as a future enhancement. It is expected that the system will be accessed by 1000+ users everyday more than once. Is it faster to have the user’s record updated each time when he/she logs into my system or is it better to insert new rows into my system for every session of his/her?

I believe we need to look this from two perspectives, the speed of the present operation and second the impact it might cause on the other future operations. Generally, the insert operation involves three or four stages at max, i.e. create a record, check for referential integrity or primary key constraints/insert trigger checks, third store the record in the database and fourth make an entry in respective indices. Moreover, we are not deleting the old records as database size is not a constraint, otherwise we need to think about ‘insert……..on duplicate key update’ scenario too. There is huge possibility in near future that our database size can grow out of control, and index size will not fit in our buffer pool. In that case we need to resort to distributed data storage or otherwise called data sharding which involves partitioning and replication. But smart sharding is tough to achieve and best solution available till now depends on the usage statistics to identify the hotspots.

Whereas the update involves search operation as a subset work. First you need to search for the individual record, fetch it, update it, check for the update trigger/referential integrity constraints, and then store it in the database and finally update the indices too if they are affected. There are lots of faster index structures like B+ tree which make the search operation less time consuming, but still it is an overhead.

Thus, I feel we can come to the conclusion that insert a new record would be better than update for logging in one record at a time when database size is not a constraint. But when we examine it from the other perspective, i.e. when a select/fetch operation has to be done, then fetching one matching record would be faster than many. This would play a huge role in web related systems where we display user information. So, it can be handled by attaching new field to the record which will keep track of the record creation time-stamp which can be used as a select criterion to fetch the latest/active record.

But does this hold good in case of bulk updates i.e. update without condition? This time update scores over insert, as it is mere operation of changing selected set of fields over all the records in the table. It is just like one multiple row insert when compared to multiple single row insert of related data.

There can be only one winner in a competition and he has to be an all-rounder or the guy who meets most of our needs. Thus in today’s world, memory has become insignificant whereas speed isn’t, so Insert has emerged victorious!