Problem Statement:
Consider we have a file which has
millions of unique records in it. We need to process these records in random
order following some distribution process i.e. randomly select a unique record
from the file every time, process it using any one of the available processing
method and ensure that the final output records are spread as per the required
distribution. How to achieve it in Single pass of the input file?
Possible Solutions:
a a. Create an index which can be used for picking a
random record, and then corresponding entire record can be fetched from the
input file and processed. But this is somewhat similar to double pass.
b b. Another way is to add some extra field to all
the records such as a running counter. The counter values i.e. 0 to number of
records available in file can be stored in a list. This list can be shuffled to
get a random order of values in it. Now follow the random order in the shuffled
list and fetch the corresponding records from the input file for processing.
In above methods, processing speed will be slow if we do
sequential searching to fetch the selected random record from the input file
for processing every time. This can be avoided by using fixed record
lengths/structure which can be used in calculating the exact record data
location in the file. But this results in wastage of memory which can be gained
by usage of variable record length. So how to achieve unique random record
selection from a flat file with good speed and that too in a single pass?
If you know the solution, please post your reply in the comment. Stay tuned to know the solution in the next post.