Wednesday, 18 July 2012

Optimized single-pass random record selection from a flat file


Problem Statement: 
Suppose we have a file containing millions of unique records. We need to process these records in random order while following some required distribution, i.e. randomly select a unique record from the file each time, process it using one of the available processing methods, and ensure that the final output records are spread according to the required distribution. How can this be achieved in a single pass over the input file?

Possible Solutions:
a. Create an index that can be used for picking a random record; the corresponding full record can then be fetched from the input file and processed. But building the index effectively amounts to a second pass. (See the sketch after this list.)

b. Another way is to add an extra field to every record, such as a running counter. The counter values, i.e. 0 to the number of records in the file minus one, can be stored in a list. This list can be shuffled to get a random order of values. Then follow the order in the shuffled list and fetch the corresponding records from the input file for processing, as shown in the sketch below.
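Here is a minimal Python sketch combining approaches (a) and (b): one scan builds a byte-offset index, the record numbers (the running counter) are shuffled, and each selected record is fetched with a direct seek. The function names and the print call standing in for real processing are only illustrative; note that the index-building scan is itself the extra pass we would like to avoid.

```python
import random

def process(record):
    # Hypothetical processing hook; real code would apply one of the
    # available processing methods here.
    print(record)

def process_via_offset_index(path):
    # Approach (a): one scan builds a byte-offset index of every record.
    offsets = []                      # offsets[i] = start position of record i
    pos = 0
    with open(path, "rb") as f:
        for line in f:                # the index-building scan
            offsets.append(pos)
            pos += len(line)
        # Approach (b): shuffle the running counter 0..N-1.
        order = list(range(len(offsets)))
        random.shuffle(order)
        # Fetch records in the shuffled order with direct seeks.
        for i in order:
            f.seek(offsets[i])
            record = f.readline().rstrip(b"\r\n")
            process(record)
```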

In both of the above methods, processing will be slow if we do a sequential search through the input file every time to fetch the selected random record. This can be avoided by using fixed-length records, which let us compute the exact byte offset of any record in the file. But fixed-length records waste space that variable-length records would save. So how do we achieve unique random record selection from a flat file, with good speed, and in a single pass?
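For contrast, here is a minimal sketch assuming fixed-length records of a known size (RECORD_LENGTH is an illustrative value): the byte offset of record i is simply i * RECORD_LENGTH, so no index or search is needed, at the cost of padding every record to that length.

```python
import os
import random

RECORD_LENGTH = 64   # assumed fixed record size in bytes, newline included

def process_fixed_length_randomly(path):
    # With fixed-length records, record i starts at byte i * RECORD_LENGTH,
    # so any record can be fetched with one seek and no index or search.
    num_records = os.path.getsize(path) // RECORD_LENGTH
    order = list(range(num_records))
    random.shuffle(order)                      # random, non-repeating order
    with open(path, "rb") as f:
        for i in order:
            f.seek(i * RECORD_LENGTH)          # exact offset, computed directly
            record = f.read(RECORD_LENGTH).rstrip(b"\r\n")
            print(record)                      # stand-in for real processing
```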

If you know the solution, please post it in the comments. Stay tuned for the solution in the next post.
