This is the fifth challenge in the series of programming job interview challenges. As usual, we’ll provide an answer to the previous challenge and give you a new challenge to keep you busy this week. Other than commenting with the solution, I remind you that you can post the solution on your blog and let us know about it using our contact form; we’ll link back to your solution in our next challenge. I would also like to ask you to comment on the post where the question was asked: for example, comments with answers or questions regarding the solution to challenge #4 should be posted to challenge #4 only, to prevent confusion.

First, let’s see the solution. Almost all of you got it right; that was an easy one. The goal here was to provide the simplest solution, so there are two winners:

    return 24-x;
    return x^22;

Other nice solutions, but with one more operation:

    return (x+10)%20;
    return -1*x+24;

Note for those who used 119/x – you must also check for division by zero!

Thanks to Ngu Soon Hui, Jendrik Illner, Keith Rousseau, Honest Illusion and Jon von Gillern (loved the highlight-the-text-to-reveal-the-answer trick you used!) who provided answers on their blogs!

The first ten to comment with a correct answer were: Kirill Sorokin, Jacek, Menno, Chris Jobson, Poul Foged Nielsen, Anton Irinev, Eric, pavan kumar, drea and Haha.

You’ll be surprised by the number of original and interesting solutions, so go ahead and read the comments. There are more than 200!

Well, last week’s was an easy one. This week is a whole new story, as the riddle is a bit more challenging…

Today’s challenge:

You are asked to sort a collection of records. The records are stored in a file on disk, and the size of the file is much bigger than the memory available for you to work with. Assume a record size of 10 bytes and a file size of 1GB. You read and write data to/from disk in units of pages, where each page is 1KB. You have a few (fewer than 10) available pages of physical memory. Assume you are given a method to compare two records.
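
(For scale: 1GB read in 1KB pages is about one million page reads; each 1KB page holds 102 ten-byte records, so the file contains on the order of 100 million records, while fewer than 10 pages means under 10KB of record buffers in memory at any time.)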

How can you sort the records under the described conditions?

* Hint – Using virtual memory is not the correct answer.

Note: no source code is required; you can describe your algorithm in words or pseudo-code. If something is unclear, don’t hesitate to comment and we will clarify.

The solution is now published and can be found in the 6th challenge.


26 Responses to “A Programming Job Interview Challenge #5 – Records Sorting”

  1. Kirill Sorokin

    Said on May 26, 2008 :

    I believe D. Knuth described a couple of algorithms for this in his Art of Computer Programming… Some of them even take into consideration the constraint of not being able to access the data randomly (i.e. we have a tape instead of a HDD). Anyway, I do not remember them. :)

    My take is the following (I assume you can create and access other files on disk besides the original one):

    You will need two temporary files: {intermediate} and {sorted}. I’ll refer to the input file as {input}. Both {sorted} and {intermediate} are considered to be empty at the beginning (i.e. not containing any records).

    While there are still records in {input}:
    – read a page of records and sort it with whatever algorithm you like, so that you have a sorted page of records in memory
    – move {sorted} to {intermediate}
    – read from {intermediate} record by record (this can definitely be done even if you can only read page by page) and merge it with the sorted page, outputting to {sorted}. The merge is also pretty standard: you start at the beginning of both sequences and proceed record by record, outputting whichever of the two current “heads” is smaller.

    In the end you will have a sorted array in {sorted}, so move {sorted} to {input}.

    This satisfies the memory constraints, as you have 1 page read from {input}, 1 page read from {intermediate} and finally 1 page being prepared to write to {sorted}.
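
    A minimal C# sketch of the merge step described above, assuming 10-byte records and a caller-supplied Comparison<byte[]> as stand-ins for the record format and the given compare method: one page sorted in memory is merged with the already-sorted {intermediate} file into a new {sorted} file, using only the two file streams plus the single in-memory page.

    using System;
    using System.Collections.Generic;
    using System.IO;

    static class IncrementalMerge
    {
        const int RecordSize = 10;

        // Reads the next 10-byte record from a stream, or returns null at end of file.
        static byte[] ReadRecord(Stream s)
        {
            var buf = new byte[RecordSize];
            return s.Read(buf, 0, RecordSize) == RecordSize ? buf : null;
        }

        // Merges 'page' (sorted in memory) with 'intermediate' (sorted on disk) into 'sorted'.
        public static void MergePageWithFile(List<byte[]> page, string intermediate, string sorted,
                                             Comparison<byte[]> compare)
        {
            using (var inFile = File.OpenRead(intermediate))
            using (var outFile = File.Create(sorted))
            {
                int i = 0;                           // position in the in-memory page
                byte[] fromDisk = ReadRecord(inFile);
                while (i < page.Count || fromDisk != null)
                {
                    // take the smaller of the two current "heads"
                    bool takePage = fromDisk == null ||
                                    (i < page.Count && compare(page[i], fromDisk) <= 0);
                    if (takePage) { outFile.Write(page[i], 0, RecordSize); i++; }
                    else          { outFile.Write(fromDisk, 0, RecordSize); fromDisk = ReadRecord(inFile); }
                }
            }
        }
    }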

  2. Poul Foged Nielsen

    Said on May 26, 2008 :

    Assuming that you have enough time and disk space, use bubble sort:

    Iterate over all records reading two at a time
    If (record 1 > record 2)
    Swap record 1 & record 2
    Register that you made a change
    Write the two records to disk to a new file
    If any changes were made in the latest pass, start over on the newly created file.

    Return the last created file, delete all other temporary files.
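
    A minimal C# sketch of one pass of this file-based bubble sort, assuming 10-byte records and a byte-wise comparison as a stand-in for the given compare method: adjacent records are compared, the larger one is carried forward, and the pass reports whether another pass is needed. The caller alternates between two temp files and repeats while the pass returns true.

    using System.IO;

    static class FileBubbleSort
    {
        const int RecordSize = 10;

        // One bubble pass: copies inputFile to outputFile with adjacent out-of-order
        // records swapped; returns true if anything changed (i.e. another pass is needed).
        public static bool BubblePass(string inputFile, string outputFile)
        {
            bool changed = false;
            using (var input = File.OpenRead(inputFile))
            using (var output = File.Create(outputFile))
            {
                byte[] prev = ReadRecord(input);
                byte[] next;
                while (prev != null && (next = ReadRecord(input)) != null)
                {
                    if (Compare(prev, next) > 0)     // carry the larger record forward
                    {
                        var tmp = prev; prev = next; next = tmp;
                        changed = true;
                    }
                    output.Write(prev, 0, RecordSize);
                    prev = next;
                }
                if (prev != null) output.Write(prev, 0, RecordSize);   // the last record
            }
            return changed;
        }

        static byte[] ReadRecord(Stream s)
        {
            var buf = new byte[RecordSize];
            return s.Read(buf, 0, RecordSize) == RecordSize ? buf : null;
        }

        static int Compare(byte[] a, byte[] b)       // stand-in for the given compare method
        {
            for (int i = 0; i < RecordSize; i++)
                if (a[i] != b[i]) return a[i] - b[i];
            return 0;
        }
    }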

  3. Jonathan Gilbert

    Said on May 26, 2008 :

    Many algorithms can work with constrained space, as they need only compare and swap 2 records at a time (bubble sort, shell sort, selection sort, insertion sort, quick sort), but the algorithm of choice is probably going to be the merge sort because of its good interaction with system caches (a result of linear reads and writes).

    If 10 pages are available, then the sort can be “primed” by sorting 10 pages in memory at a time, so that the “initial” file consists of pre-sorted blocks of 10. This, of course, requires adjusting the algorithm so that it works bottom-up, putting the data back together from blocks of an expected size, instead of top-down, dividing the data in half and putting those arbitrarily-sized halves back together. The bottom-up algorithm becomes most efficient when the size of the blocks is a power of two, so depending on the speed of the storage, it could be faster to pre-sort blocks of 8 instead of blocks of 10. This advantage does not apply if the routine is primarily I/O bound, though, as it is stated to be in this question.
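
    (For scale: with the stated sizes, a pre-sorted 10-page run holds about 1,020 records, so the 1GB file starts out as roughly 100,000 runs, and since each bottom-up pass doubles the run length, about 17 further passes over the whole file are needed.)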

  4. Jonas Christensen

    Said on May 26, 2008 :

    A simple but disk-heavy solution could be a bubble-sort-like algorithm.

    step 1. Load as many pages as you can, sort them and write them to disk, then load the next set of pages and sort them. Do that for the whole file.

    step 2. Now swap the highest pages with the lowest. For example, assume you could load 8 pages. You sorted pages 1,2,3,4,5,6,7,8 and then the next set 9,10,11,12,13,14,15,16. Swap pages 5,6,7,8 with 9,10,11,12 (13,14,15,16 with 17,18,19,20, etc.). Do that for the whole file.

    Do step 1 and step 2 again. Let’s assume that we have 1 million pages and we can load 8. Then we would need to do this 124,999 times and the list would be sorted.

  5. Heiko Hatzfeld

    Said on May 26, 2008 :

    I think this would call for a merge sort.

    You could further optimize it by reading 1K records at a time, sorting them in place, and then writing them to alternating tmp files (only during the initialisation step).

    So pseudo code:

    Read 1 K records while not EOF
    Sort in place using Quicksort
    Dump to alternating Tmp Files

    repeat until only 1 final file exists…:
    Open the 2 current Tmp Files
    Mergesort tmp Files to 2 new alternating tmp files (switching target file after pass * 2 * 1000 records)
    Delete old tmp files

    This approach should minimize disk IO, and the data should be fully sorted after reading it 21 times. If we “skip” the quicksort step, then we would be forced to read the data 31 times, which would mean about 32% more disk IO (which is most likely our bottleneck).

  6. Kris Warkentin

    Said on May 26, 2008 :

    Really enjoying the problems so far but I’m a little disappointed with this one. I prefer ‘thinking’ problems to ‘research’ ones. Sorting is a well studied problem that is, by and large, considered to be “solved”. Pretty much any introductory C.S. book has several well-known and efficient in-place sorting algorithms such as heapsort or mergesort along with sample implementations. That being said, keep up the good work.

  7. Tim C

    Said on May 26, 2008 :

    Sounds like the classic merge sort case. Sort what you can in memory (using whatever sort you want), then write the results out to disk. Say you end up with 16 files that are each sorted. Take those two at a time and merge them. Now you have 8 files that are each sorted. Etc. Go until you have two files that are each sorted, then merge those together into your final sorted file.
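
    A minimal C# sketch of the pairwise step described above: merging two sorted files of records into one, again with a 10-byte record size and a byte-wise comparison standing in for the given record format and compare method.

    using System.IO;

    static class TwoWayMerge
    {
        const int RecordSize = 10;

        // Merges two sorted record files into one sorted output file.
        public static void MergeFiles(string fileA, string fileB, string outputFile)
        {
            using (var a = File.OpenRead(fileA))
            using (var b = File.OpenRead(fileB))
            using (var output = File.Create(outputFile))
            {
                byte[] recA = ReadRecord(a), recB = ReadRecord(b);
                while (recA != null || recB != null)
                {
                    // take from A when B is exhausted or A's record sorts first
                    bool takeA = recB == null || (recA != null && Compare(recA, recB) <= 0);
                    if (takeA) { output.Write(recA, 0, RecordSize); recA = ReadRecord(a); }
                    else       { output.Write(recB, 0, RecordSize); recB = ReadRecord(b); }
                }
            }
        }

        static byte[] ReadRecord(Stream s)
        {
            var buf = new byte[RecordSize];
            return s.Read(buf, 0, RecordSize) == RecordSize ? buf : null;
        }

        static int Compare(byte[] x, byte[] y)       // stand-in for the given compare method
        {
            for (int i = 0; i < RecordSize; i++)
                if (x[i] != y[i]) return x[i] - y[i];
            return 0;
        }
    }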

  8. Chris Jobson

    Said on May 26, 2008 :

    Step 1: Read in the first n pages (where n is the maximum number of pages you can sort in the available memory), sort them using your favourite sort algorithm, and write them out to a “first level” temporary file.

    Step 2: Read in the next n pages, sort them and write them out to a “first level” temporary file.

    Step 3: Repeat step 2 until there are no more pages left to read.

    Step 4: Merge the first m temporary files (where m is one less than the number of pages of physical memory available) into a new sorted “second level” temporary file. (Do this by reading in one page from each of the m files, then copying the records in order into the spare page of physical memory, writing out this page as necessary and reading in the next page of each of the m files as necessary.)

    Step 5: Repeat step 4 for the next m “first level” temporary files, until all the “first level” temporary files have been processed.

    Step 6: Keep repeating steps 4 and 5 with the “p’th level” temporary files to create a smaller number of sorted larger “(p+1)’th level” temporary files until eventually you end up with a single sorted file.
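
    A minimal C# sketch of the m-way merge in steps 4 to 6, assuming 10-byte records and a byte-wise comparison as stand-ins for the given record format and compare method. For brevity it pulls one record at a time from each run; the per-file page buffers described above sit behind the 1KB BufferedStreams and the ReadRecord helper.

    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    static class MultiWayMerge
    {
        const int RecordSize = 10;

        public static void MergeRuns(IList<string> runFiles, string outputFile)
        {
            var inputs = runFiles.Select(f => (Stream)new BufferedStream(File.OpenRead(f), 1024)).ToList();
            var heads = inputs.Select(ReadRecord).ToList();      // current front record of each run

            using (var output = new BufferedStream(File.Create(outputFile), 1024))
            {
                while (true)
                {
                    int best = -1;                               // run whose head record is smallest
                    for (int i = 0; i < heads.Count; i++)
                        if (heads[i] != null && (best < 0 || Compare(heads[i], heads[best]) < 0))
                            best = i;
                    if (best < 0) break;                         // all runs exhausted

                    output.Write(heads[best], 0, RecordSize);
                    heads[best] = ReadRecord(inputs[best]);      // refill from that run
                }
            }
            foreach (var s in inputs) s.Dispose();
        }

        static byte[] ReadRecord(Stream s)
        {
            var buf = new byte[RecordSize];
            return s.Read(buf, 0, RecordSize) == RecordSize ? buf : null;
        }

        static int Compare(byte[] x, byte[] y)                   // stand-in for the given compare method
        {
            for (int i = 0; i < RecordSize; i++)
                if (x[i] != y[i]) return x[i] - y[i];
            return 0;
        }
    }

    With m input buffers of 1KB plus one 1KB output buffer this uses m+1 pages, which matches the constraint when m is one less than the number of available pages.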

  9. Per

    Said on May 26, 2008 :

  10. Jonas Christensen

    Said on May 26, 2008 :

    The solution I posted doesn’t work and needs a small bug fix (I will update it tomorrow).

    Anyway… how much space is there on the storage media, and is it allowed to make temp files on it? Or would that be considered virtual memory?

  11. Jim

    Said on May 26, 2008 :

    Use Pointers!!!

  12. Shahar A

    Said on May 27, 2008 :

    Hi Jonas,
    You have unlimited disk space and temp files are allowed.

  13. Martin Ankerl

    Said on May 27, 2008 :

    Here is how I would approach this problem. It was just a quick direct writeup of the thoughts that popped up in my mind:

    Easiest solution: use Unix’s sort tool, with -S option to specify max memory usage.
    If this is not allowed, the second easiest solution is to implement a simple Merge sort, and use multiple passes with one temporary file until everything is sorted.
    If speed is an issue, you can add a pre-pass before the merge sort that loads as many chunks of data into memory as possible, sorts each chunk with quicksort, and writes each sorted chunk back into the temporary file.
    If you want to go even faster, you can go crazy and process 10 runs at once, using a sorting network to merge these 10 streams. This is the point where you should benchmark before coding, because caching, hard disk properties etc. might have an enormous influence.

    I have written a blog entry about this here:

  14. Sh4DoW

    Said on May 27, 2008 :

    1) Load into memory the maximum number of pages that the memory allows.
    2) Sort the records in memory.
    3) Write this newly ordered block, called a “run”, to disk.
    4) Do this for all the data.
    5) Now you have a number n of sorted “runs” on disk.
    6) Load into memory just the first page of each of x runs (x less than 10), read the first record of each page in memory and simply merge them in order.
    7) When the number of ordered records reaches a page, write this page to disk.
    6-7 6-7 6-7…
    8) Continue with the algorithm until all the records reach the disk sorted.

    Sorry for my English. Bye.

  15. Edward Shen

    Said on May 27, 2008 :

    The traditional approach seems to be mergesort. It will work. However, with some profiling, building a “skip list” would appear to do nicely as well.

  16. Angel

    Said on May 27, 2008 :

    Pick your favorite sorting algorithm.
    Replace the memory reads with reading the record from the file into a variable, and the assignments with writes directly to disk.

    It will be really slow, as you’re using the disk as memory (several orders of magnitude slower), but the memory footprint is really low. In an extreme case you could work storing only two records in memory at a time.

    For better efficiency, you should use a merge sort algorithm, so records can be grabbed in groups for partial sorting.
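
    A minimal C# sketch of the extreme case mentioned above: a selection sort that treats the file itself as the array and keeps only two records in memory at a time (the 10-byte record size and byte-wise comparison are assumptions standing in for the given compare method). It needs on the order of n² record reads, so it is very slow, but the memory footprint is tiny.

    using System;
    using System.IO;

    static class OnDiskSelectionSort
    {
        const int RecordSize = 10;

        public static void Sort(string path)
        {
            using (var file = File.Open(path, FileMode.Open, FileAccess.ReadWrite))
            {
                long count = file.Length / RecordSize;
                var current = new byte[RecordSize];
                var candidate = new byte[RecordSize];

                for (long i = 0; i < count - 1; i++)
                {
                    long minIndex = i;
                    ReadAt(file, i, current);                    // current minimum so far
                    for (long j = i + 1; j < count; j++)
                    {
                        ReadAt(file, j, candidate);
                        if (Compare(candidate, current) < 0)
                        {
                            minIndex = j;
                            Array.Copy(candidate, current, RecordSize);
                        }
                    }
                    if (minIndex != i)                           // swap records i and minIndex on disk
                    {
                        ReadAt(file, i, candidate);
                        WriteAt(file, i, current);
                        WriteAt(file, minIndex, candidate);
                    }
                }
            }
        }

        static void ReadAt(FileStream f, long index, byte[] buf)
        {
            f.Seek(index * RecordSize, SeekOrigin.Begin);
            f.Read(buf, 0, RecordSize);
        }

        static void WriteAt(FileStream f, long index, byte[] buf)
        {
            f.Seek(index * RecordSize, SeekOrigin.Begin);
            f.Write(buf, 0, RecordSize);
        }

        static int Compare(byte[] x, byte[] y)                   // stand-in for the given compare method
        {
            for (int i = 0; i < RecordSize; i++)
                if (x[i] != y[i]) return x[i] - y[i];
            return 0;
        }
    }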

  17. Mark Bradshaw

    Said on May 28, 2008 :

    I would use a two-stage sorting solution. I would chunk the data logically into sections that would fill the available memory (say 8 pages of data). On a grand scale I’d do something like a bubble sort on the logical sections.

    Starting with the first 8 pages I’d load them into memory. Once the pages were loaded I’d sort them using an efficient sort algorithm that worked in place, like a heap sort, so that extra memory wouldn’t be used and it’d be fairly quick. Since memory is tight we’re constrained in the sort options we can use.

    When that 8-page section is sorted, I’d write the first half (4 pages) to disk, shift the second half down into the first half’s place in memory, and load another 4 pages at the end. I’d then re-sort the entire 8-page section.

    I’d continue this process until the end of the file was reached, and then restart at the beginning, basically a bubble sort. On each iteration I’d stop processing an additional 4 pages from the end (1 logical unit), since as I sort the sections the highest records will move toward the end, until the final logical unit is made up of the highest records in the file.

    I decided on this approach due to the relative speed of memory and disk. I wanted to avoid reading/writing to disk as much as possible, and so am favoring pulling as much data into memory at a time as I can. I believe the heap sort is one of the most efficient and quickest sorts when you have to sort in place. Since the entire data set can’t fit into memory, I think you’d have to treat it logically as smaller chunks which can be sorted individually. Unfortunately this necessitates using some of the less efficient sorts, such as a bubble or insertion sort.

  18. Heiko Hatzfeld

    Said on May 28, 2008 :

    Since my Pseudocode was kinda ugly, I wrapped up a real implementation.

    I am using C#.
    My “Objects” are some plain old number strings, which contain 10 digits, padded with ‘0’s
    The data is stored in a plain text file. This is NOT the most optimized version, since I am creating a gazillion strings in my code, but that’s not the point of this prototype.

    How does this chunk of code work?

    It will read an input file (c:\temp\mess.txt) and split it into 2 separate output files. It does this by reading chunks of data that fit into memory (MaxInitRecords = 100) and sorting those records in memory.

    MaxInitRecords defines the maximum number of records that can be held in memory at a given time. Those chunks are written to alternating files.

    So we arrive at a point with 2 tmp files, each containing blocks of MaxInitRecords that are sorted.
    Example for max=3 (each column is one tmp file; each block of 3 records is sorted):
    File A: 1 3 5 | 2 4 9
    File B: 3 4 7 | 3 5 11

    After that we start to merge those 2 files into 2 new temp files. Each file will now contain twice the amount of sorted records.

    We repeat this until we have not written any data into the 2nd tmp file. This means we have sorted all data.

    So… That’s about it… Here is the code:

    P.S.: No error handling, since we live in a perfect world… OK, OK… I just didn’t have time to implement it.

    – CreateDataFile will create an input file to sort.

    using System;
    using System.IO;

    namespace Dev102
    internal class Interview5
    /// Maximum Number of Records that could be held in Memory
    /// Simulates the memory constrain
    private const int MaxInitRecords = 100;

    /// Number of test Records
    private const int totalRecords = 100000;
    private const string RandomFile = @"C:\Temp\Mess.txt";
    const string targetFile = @"C:\Temp\Mess_Cleaned.txt";
    private const int SizeWithNewline = 12;

    public static void CreateDataFile()
    if (File.Exists(RandomFile))
    using (StreamWriter output = File.CreateText(RandomFile))
    Random rnd = new Random();
    for (int i = 0; i < totalRecords; i++)

    public static void SortRandomData()
    string tmpFileA = Path.GetTempFileName();
    string tmpFileB = Path.GetTempFileName();

    using (StreamReader source = File.OpenText(RandomFile))
    StreamWriter outputA = File.CreateText(tmpFileA);
    StreamWriter outputB = File.CreateText(tmpFileB);
    StreamWriter current = outputA;

    while (! source.EndOfStream)
    int MaximumMemoryLimit = SizeWithNewline*MaxInitRecords;
    char[] initialBlock = new char[MaximumMemoryLimit];
    //Read the First Block of Data to sort.
    //This uses more Memory than allowed (Stores the input twice),
    //but I am using the simple version so I can process the whole chunk at once
    //If you are picky, then imagine I loop over the input, and read one line at a time ;)
    int dataRead;
    dataRead = source.ReadBlock(initialBlock, 0, MaximumMemoryLimit);

    //Handle "short" input… Trim the array down so it only holds the elements that were actually read
    if (dataRead 0)
    Console.Out.WriteLine("Run Nr. = {0}", Runs);
    //Remove Old tmp Files
    //Open Files for Reading and Writing
    StreamReader readerA = File.OpenText(fileNames[0]);
    StreamReader readerB = File.OpenText(fileNames[1]);
    StreamWriter writerA = File.CreateText(fileNames[2]);
    StreamWriter writerB = File.CreateText(fileNames[3]);
    StreamWriter current = writerA;
    string currentRecordA = String.Empty, currentRecordB = String.Empty,lastRecord = String.Empty;
    while ((!(readerA.EndOfStream && readerB.EndOfStream)))
    //Read new Record
    if (currentRecordA == string.Empty && !readerA.EndOfStream)
    currentRecordA = readerA.ReadLine();
    if (currentRecordB == string.Empty && !readerB.EndOfStream)
    currentRecordB = readerB.ReadLine();
    if (CompareRecords(currentRecordA, currentRecordB, lastRecord))
    lastRecord = currentRecordA;
    currentRecordA = string.Empty;
    else if (CompareRecords(currentRecordB, currentRecordA, lastRecord))
    lastRecord = currentRecordB;
    currentRecordB = string.Empty;
    current = (current == writerA) ? writerB : writerA;
    lastRecord = string.Empty;
    //ONLY One record might be left…. Either A or B, but not both
    if (currentRecordB != String.Empty)
    if (currentRecordA != String.Empty)
    DisposeStreams(writerA, readerA, readerB, writerB);

    File.Move(fileNames[2], targetFile);
    //Remove Source Files
    //Remove Empty File
    if (File.Exists(fileNames[3]))

    /// Disposes the streams.
    /// The writer A.
    /// The reader A.
    /// The reader B.
    /// The writer B.
    private static void DisposeStreams(StreamWriter writerA, StreamReader readerA, StreamReader readerB, StreamWriter writerB)

    /// Record comparer. Checks if record A is smaller than B, and detects when it's time for a "switch"
    /// The current record A.
    /// The current record B.
    /// The last record.
    private static bool CompareRecords(string currentRecordA, string currentRecordB, string lastRecord)

    //Don't write empty records
    if (currentRecordA == String.Empty)
    return false;
    //We have to switch files… Both are smaller than the last record
    if (currentRecordA.CompareTo(lastRecord) < 0 && currentRecordB.CompareTo(lastRecord) = 0;
    if (currentRecordA.CompareTo(currentRecordB) = 0 || lastRecord == String.Empty;
    return currentRecordB.CompareTo(lastRecord) <0;

    /// Swaps the filenames and get new TMP filenames.
    /// The file names.
    private static void SwapFilenamesAndGetTmpFilenames(string[] fileNames)
    fileNames[0] = fileNames[2];
    fileNames[1] = fileNames[3];
    fileNames[2] = Path.GetTempFileName();
    fileNames[3] = Path.GetTempFileName();

    /// Removes the old TMP files.
    /// The file names.
    private static void RemoveOldTmpFiles(string[] fileNames)
    if (File.Exists(fileNames[0]))
    if (File.Exists(fileNames[1]))

  19. Heiko Hatzfeld

    Said on May 28, 2008 :

    Since the Blog hates source code… Here is an online Version:

  20. Alex

    Said on May 29, 2008 :

    Read entire file once to determine max & min, keep those in memory
    (20 bytes of memory used)

    Programmatically determine the number of buckets required to cover the range of values between max & min, at 56 items per bucket,
    i.e. roughly (max – min) / 56 buckets, each spanning a range of 56 values.
    This count assumes distinct values, as we will be keeping structs in each file containing the record (10 bytes) + a “count” of that record’s appearances in the file (a 4-byte integer) + a node for the linked list (a 4-byte pointer).
    The number “56” comes from 1024 bytes of memory / 18-byte structs.

    Using those max & min values, read through the file of records again, 100 records at a time.

    For each record in the file:
    determine which bucket it goes in;
    if a file corresponding to that bucket (e.g. “2347456.txt”) exists,
    open it for writing;
    otherwise create the file
    and open it for writing.

    Check the file for a linked list of structs.
    If one isn’t found:
    The file is new.
    Create a new linked list, and add a struct with (rec = current record, count=1) as the head.
    If one IS found:
    Search through the linked list. If the record is already in there, increment the counter. otherwise, add a struct to the end, (rec = current record, count = 1).

    Now, all the records have been sorted into “buckets”, each bucket represented as a temp file.

    Open a new file, “sorted.txt”, for writing.
    Specify a memory range “tempPage”.

    For each bucket file “bucket”
    sort the linked list according to values of “record”.
    for each node in the linked list
    for (int i = 0; i < node->count; i++)
    write (node->record) to tempPage
    if( temp page is full)
    write temp page to sorted.txt
    wipe temp page.
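
    A tiny C# sketch of the bucket assignment step above, under the assumption that a record's ordering can be approximated by a numeric key (here its first 8 bytes read as a big-endian unsigned value); min, max and bucketCount stand for the values determined in the first pass.

    using System;

    static class Buckets
    {
        // Approximates a record's ordering with a numeric key: its first 8 bytes, big-endian.
        static ulong Key(byte[] record)
        {
            ulong k = 0;
            for (int i = 0; i < 8; i++) k = (k << 8) | record[i];
            return k;
        }

        // Maps a record to its bucket index given the key range and the number of buckets.
        public static long BucketIndex(byte[] record, ulong min, ulong max, long bucketCount)
        {
            double width = (max - min + 1.0) / bucketCount;      // value range covered by one bucket
            return Math.Min(bucketCount - 1, (long)((Key(record) - min) / width));
        }
    }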

  21. Heiko Hatzfeld

    Said on June 2, 2008 :

    “Heiko Hatzfeld provided a nice solution in his blog including c# implementation, notice you used 4 pages in the MergeSort stage, you could write one output file each time it is full.”

    Yes… but opening/closing the streams all the time would make it more complex than it needs to be. If I can keep 2 streams open, and by doing so simplify my code, then I think it is worth it. If I really must reduce the footprint even more, then the code could be rewritten, but since I am still within the limits, I see no need to complicate things.

  22. Shahar A

    Said on June 2, 2008 :

    Your solution is very good, I just pointed out this minor issue – you passed the interview :-)

  23. Aaron Arnone

    Said on July 25, 2010 :

    I would suggest the following:

    1. Read a random 500 records from the file. Sort them using quicksort.

    2. Since the file is 1 GB, no more than 2 pages should be between any two records of the sorted list above. Ten pages existing between any two of these records is pretty much impossible.

    3. Read through the file, adding any records that exist between record 1 and 2 in the above sorted list to memory. Sort that list using quicksort, then write back to the output file on the disk. Repeat until the entire file is processed. If for any reason the 10 page limit is exceeded on a segment (statistically, it won’t happen) go to step 1 to pick a new set of random segments to work with.

    This approach is a LOT faster than bubble sort and does almost all of the sorting in memory.

3 Trackback(s)

  1. May 26, 2008: Honest Illusion : DEV102's Programming Job Interview Challenge #5
  2. May 26, 2008: Arjan`s World » LINKBLOG for May 26, 2008
  3. May 27, 2008: Martin Ankerl » Blog Archive » Job Interview Question: Sorting Records
