Wednesday, 18 July 2012

Optimized single pass -random record selection from a flat file


Problem Statement: 
Consider we have a file which has millions of unique records in it. We need to process these records in random order following some distribution process i.e. randomly select a unique record from the file every time, process it using any one of the available processing method and ensure that the final output records are spread as per the required distribution. How to achieve it in Single pass of the input file?

Possible Solutions:
a    a.    Create an index which can be used for picking a random record, and then corresponding entire record can be fetched from the input file and processed. But this is somewhat similar to double pass.

b    b. Another way is to add some extra field to all the records such as a running counter. The counter values i.e. 0 to number of records available in file can be stored in a list. This list can be shuffled to get a random order of values in it. Now follow the random order in the shuffled list and fetch the corresponding records from the input file for processing.

In above methods, processing speed will be slow if we do sequential searching to fetch the selected random record from the input file for processing every time. This can be avoided by using fixed record lengths/structure which can be used in calculating the exact record data location in the file. But this results in wastage of memory which can be gained by usage of variable record length. So how to achieve unique random record selection from a flat file with good speed and that too in a single pass?

If you know the solution, please post your reply in the comment. Stay tuned to know the solution in the next post.

Monday, 9 July 2012

Insert Vs Update Query


Assume that I maintain a huge denormalized table in the database which is used to maintain the details of user’s login session. Right now I am not interested in maintaining the user’s history, though it can be made as a future enhancement. It is expected that the system will be accessed by 1000+ users everyday more than once. Is it faster to have the user’s record updated each time when he/she logs into my system or is it better to insert new rows into my system for every session of his/her?

I believe we need to look this from two perspectives, the speed of the present operation and second the impact it might cause on the other future operations. Generally, the insert operation involves three or four stages at max, i.e. create a record, check for referential integrity or primary key constraints/insert trigger checks, third store the record in the database and fourth make an entry in respective indices. Moreover, we are not deleting the old records as database size is not a constraint, otherwise we need to think about ‘insert……..on duplicate key update’ scenario too. There is huge possibility in near future that our database size can grow out of control, and index size will not fit in our buffer pool. In that case we need to resort to distributed data storage or otherwise called data sharding which involves partitioning and replication. But smart sharding is tough to achieve and best solution available till now depends on the usage statistics to identify the hotspots.

Whereas the update involves search operation as a subset work. First you need to search for the individual record, fetch it, update it, check for the update trigger/referential integrity constraints, and then store it in the database and finally update the indices too if they are affected. There are lots of faster index structures like B+ tree which make the search operation less time consuming, but still it is an overhead.

Thus, I feel we can come to the conclusion that insert a new record would be better than update for logging in one record at a time when database size is not a constraint. But when we examine it from the other perspective, i.e. when a select/fetch operation has to be done, then fetching one matching record would be faster than many. This would play a huge role in web related systems where we display user information. So, it can be handled by attaching new field to the record which will keep track of the record creation time-stamp which can be used as a select criterion to fetch the latest/active record.

But does this hold good in case of bulk updates i.e. update without condition? This time update scores over insert, as it is mere operation of changing selected set of fields over all the records in the table. It is just like one multiple row insert when compared to multiple single row insert of related data.

There can be only one winner in a competition and he has to be an all-rounder or the guy who meets most of our needs. Thus in today’s world, memory has become insignificant whereas speed isn’t, so Insert has emerged victorious!

Saturday, 14 April 2012

Concurrency & Parallel processing

How different is concurrent from parallel in computer science concepts? I got similar doubt on difference between function and method  which I was able to understand from my critiques comments. This doubt arose when Prof Tony Giavargis explained in his Software for embedded systems class that ‘Concurrent is an illusion of being parallel’. How far is this true, how far are they different?

Can concurrency be achieved using a single processor(share resources) whereas parallel processing requires at least two processors(independent resources)? Does this mean multithreading/multitasking is an extended concept of concurrency while distributed processing is derived from parallel processing? Is concurrency subset of Parallel processing? Is there any better way to distinguish them? What are the objectives met by adopting them - is it high availability & speed? Both involve communication cost for synchronization and aggregation of results, former acquires less communication overhead than later, but doesn't guarantee the same level of speed as later. Does this play a crucial role in deciding which one should be used?

Please share your opinion in the comments section below, if possible with example.

Reference: http://www.linux-mag.com/id/7411/

Sunday, 1 April 2012

Reverse string with reduced time complexity?


I was asked to write a program for reversing a string in java using brute force method. I wrote the very general answer which uses string length and charAt() in a for loop. This method has O(n) time complexity, so I was asked to reduce its time complexity. We can use two counters, one running from first position of the string and second from last position of the string running till the middle position of the string, construct two reversed substrings which can then be concatenated to produce complete reversed string. But this will give only O(n/2) which is same as O(n). Are there any another approach to reduce the time complexity for the above program?

Please describe your approach in the below comments section.

Friday, 9 March 2012

Modern age Horoscope: Data Mining


Everybody is curious to know to what is in store for them in future and surprised to see some strangers know our past. But it is not always so, sometimes it makes us nervous and feel insecure too. Some people consider this age old fortune telling and horoscope to be a crap. But science has made significant progress in this field too and this has helped many industries especially the marketing segment. Today with such advanced statistic and number crunching formulas, Petabyte databases, sophisticated information retrieval techniques, highly co-ordinated distributed systems, one can easily deduce what your past was and what your future most likely would be. This is very important for vendors to concentrate their ad campaign on potential buyers like a directional antenna, instead of being an omnidirectional. You can see this being implemented in modern search engine results, ads flashed in web pages you visit, ad mail in your mail box, suggestion provided by eCommerce sites, etc. I hope soon matrimonial and dating sites would adopt this and help saving many marriages by bringing close most compatible souls. All are customized for you in lieu of making you to feel a special person based on the data you shared about your past.

It was rightly termed as Modern Age – Age of information. If we have or gather sufficient raw data, we can convert them into potential information using above mentioned scientific tools and expand our presence in the market. To get this raw data and that too voluntarily from you, so that you don’t sue them for privacy infringement, the vendors try to bribe you with special discounts. This often takes place in case of retail stores; where they give you a special card using which you can get certain discount on special items they sell. You fall for it, use it every time you purchase items to get that minimal discount and this sucks or entices you into vicious cycle of purchase and re-purchase. In this way the vendor can collect data about your past purchase trend and can market items you might need in future which he identified using the data mining tools. Thus they save money in unnecessary ad campaign and also retain their customer base. Retaining customer base is a crucial factor as it provides the vendor ‘word of mouth’ marketing free of cost. Whatever be the advancement done by science in the field of media, gossiping and reference have their place sealed forever.

It is good to feel special, but not always. I came across this funny incident at yahoo about how Target retailer was able to deduce that one of their customers, a high school teen was pregnant before her dad knew about it. They had sent some special discount mails to her about items needed for next stage in pregnancy after analyzing her past purchases. This had enraged her father who called them and demanded an apology. Later on when the truth came to light, he had called them back and apologized. The social networks too have business operations on their user information, since nothing comes for free in this world of capitalism. It would be great if vendors do not exploit their customer information, privacy in the name of marketing and governments direct them in their usage. Responsibility lies on customer/user too, and they should be careful about what information they share and how they share it. So let’s acknowledge this age’s horoscope power – range of its reach and be wary of it.

Wednesday, 8 February 2012

Farhannitrate & Prerajulisation || Bisentilator& Devenkatrination revealed


Almost all of us would have watched the movie ‘3 idiots’or its tamil version – ‘Nanban’. We would have certainly enjoyed the teaching comedy scene of Ranchod(Farhannitrate & Prerajulisation) or Pari (Bisentilator& Devenkatrination). It was funny to see all students including the dean fumbling with the book to decipher those terms. If there were any software engineers in the class they could have easily found the answer for the toughie posed by the protagonist of the film well within 30 seconds.

A small application of database concept would have done the trick. Are you able to recollect what it might be? If you had deduced the answer then you must be a software engineer. Yes, it is the index page of book. If anyone had skimmed thru the index page, they could have easily realized that those were junk terms. This shows the application power of software engineers who go for optimized results which is often called as smart work!

Tuesday, 10 January 2012

Wacky summary of Gale Shapley Algorithm


Prerequisite: You need to know what this algorithm is about; else it will appear to you another aimless paper which you tried to understand during the eleventh hour of your course.

Attentively listening to a lecture on this algorithm, one can realize how true this algorithm is. Following are the facts which could be observed:

1.       It is male dominated society at outside, so only man proposes.
2.       It is female dominated society at inside, so only female has the power to reject an offer. She can keep the relationship alive or break it.
3.       Male are always greedy, they run behind the best. Hence many guys will propose to the same best lady and try their luck. It happens mostly to all Indian cinema Heroines.  Female are always pragmatic and are very cautious in choosing their partner. As a result man gets his best (strictly literal) valid partner and a woman settles with her worst (not strictly literal) valid partner.
4.       Due to unknown reasons if invalid partners enter into a relationship it is always weak, unstable and will dissolve as soon as one (mostly female) gets a better offer.This is often the story line of many Indian TV soaps.

Thus this algorithm has demystified the secret of stable relationship by strongly recommending to choose a partner who rates you equally as you do him/her. So next time avoid being filmy, apply GS algorithm before proposing to increase your success rate.
 
“Will you Marry me………….?”

Reference: Chapter 1, Kleinberg & Tardoss Book on Algorithm
Pic Source: Internet