[PATCH v10] make maps in parallel

Changed default number of threads to be 1. If you specify --max-jobs without a value, you get one thread per core. --max-jobs=N means use N threads.

With regard to comparing the output with known good maps to see if the parallel processing is corrupting anything, one problem is that the files contain timestamps. I have test code that zeros the timestamps and have been able to compare the output from different runs. What I have seen is that sometimes there are differences that appear to be due to the order in which the labels are written to the output file. If only the order is changing, that is harmless, but it would be nice to understand how it's happening (I have a theory about this, yet to be proven).

---------

Now preserves order in which files are combined (thanks Steve for the tweak).

---------

Now serialises reading of style files and map source to avoid reentrancy issue in GType.

Reworked top-level loop that waits for the parallel jobs to complete. Appears to use a lot less CPU and could possibly influence the weird problems some were reporting on Windows/Mac - please retest with this version.

Steve, I haven't incorporated your changed options handling stuff yet but will do in the future if (a) you don't commit it separately and (b) we can fix the reliability issues with this parallelisation code.

---------

Now respects --num-jobs again (broken in last patch).

---------

Now reports exceptions in the worker threads.

---------

Here's a better fix than last night's effort for the problem where the mapname and description for each job were getting clobbered due to the way that the command args are processed. Each job now gets a "snapshot" of the command args so it doesn't matter if they subsequently get changed.

---------

Whoops! Fixed a bad bug whereby each map was being output to the same file. Not sure if the fix is very elegant but at least it's not being silly any more.

Now limits the default value of max-jobs to 4 no matter how many cores you have, as further testing shows that having more threads just burns CPU cycles but doesn't actually finish any quicker. I guess the memory system is limiting the performance and the CPUs are spinning waiting for access.

Now showing a real speedup of around 240% (my earlier higher claim was based on CPU usage and I now realise that was erroneous, sorry).

---------

Now defaults to creating a thread per core, so without doing anything you should see a speedup on an SMP box when processing multiple maps. You can use --max-jobs=N to limit the concurrency - you may want to specify that if you can't increase the VM size to what is required. However, it occurs to me that if you can afford a box with more than 2 cores, then you can probably afford a reasonable amount of memory (otherwise, what's the point in having more cores?)

Added help blurb.

---------

OK, let it not be said that I don't listen to others!

The attached patch provides support for making maps in parallel. By default, the behaviour is the same as before but if you specify --num-threads=N where N is greater than 1, it will process N maps at the same time and then combine the results (if required). Don't forget to increase the heap size appropriately.

A quick test on the big box shows good speedup - specifying --num-threads=4 and 2GB VM size. I was seeing better than 380% utilisation with 8 cores in use. I suspect the performance limitation here will be VM size and memory system bandwidth.

BTW - I don't think num-threads is actually the best name for the option, so please suggest alternatives.

Cheers, Mark
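
For readers who want a picture of the behaviour this changelog describes, here is a minimal Java sketch, not the patch itself. It assumes a fixed-size thread pool and shows three of the points above: how the --max-jobs value could map to a thread count, how each job can take a "snapshot" of the shared args when it is created, and how results can be collected in submission order with worker exceptions reported to the caller. The class name, method names and the "mapname" key are all hypothetical and are not taken from the mkgmap source.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: none of these names come from the mkgmap source.
final class ParallelMapJobs {

    // Thread count as the changelog describes --max-jobs: option absent -> 1,
    // option given without a value -> one per core, --max-jobs=N -> N.
    static int resolveMaxJobs(boolean optionGiven, String value, int cores) {
        if (!optionGiven) {
            return 1;
        }
        if (value == null || value.isEmpty()) {
            return Math.max(1, cores);
        }
        return Math.max(1, Integer.parseInt(value));
    }

    // The job copies the shared args when it is created, so later changes
    // to sharedArgs cannot clobber the mapname this job will use.
    static Callable<String> jobFor(Map<String, String> sharedArgs, String inputFile) {
        final Map<String, String> snapshot = new HashMap<>(sharedArgs);
        return () -> snapshot.get("mapname") + ":" + inputFile;
    }

    // Runs all jobs on a fixed-size pool and returns the results in the
    // order the jobs were submitted, whatever order they finish in.
    static <T> List<T> runInOrder(List<Callable<T>> jobs, int maxJobs) {
        ExecutorService pool = Executors.newFixedThreadPool(maxJobs);
        try {
            List<Future<T>> futures = new ArrayList<>();
            for (Callable<T> job : jobs) {
                futures.add(pool.submit(job));
            }
            List<T> results = new ArrayList<>();
            for (Future<T> future : futures) {
                try {
                    results.add(future.get());
                } catch (ExecutionException e) {
                    // Surface the failure from the worker thread to the caller.
                    throw new IllegalStateException("worker failed: " + e.getCause(), e.getCause());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting", e);
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

Walking the list of futures in submission order is one simple way to keep the combine step deterministic; whether the patch does it this way is not stated in the thread.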

Further testing shows that you can get different output on subsequent runs on the same input (ignoring the time stamps). So far, all of the differences appear to be caused by data being output in a different order rather than the data itself being different. So, I believe that the resulting maps are still good. You should understand that the ordering of some of the map's data structures is not important as long as all the cross-references match up. I believe I have fixed the order in which labels are output and a future patch will incorporate the changes needed to do that. With my current test map, I still see the occasional change in the order in which NOD data is written but I haven't determined why that happens yet. So it would be good if people with multi-core boxes can continue to try out this patch (v10) and report any badness (or goodness!) Cheers, Mark

I've been using this patch, and I've not seen any problems - although I've not travelled far from home recently, so I can't say I've really stressed it.

And some performance numbers for you. First, the baseline run:

/--------
| real 14m57.309s
| user 16m40.606s
| sys 0m9.813s
\--------

Then with --max-jobs:

/--------
| real 6m54.229s
| user 18m14.016s
| sys 0m13.125s
\--------

This is on an Intel Q6600 (a Core2 Quad) @ 2.4GHz with 6GB of RAM.

Hi Toby,
> And some performance numbers for you. First, the baseline run:
>
> /--------
> | real 14m57.309s
> | user 16m40.606s
> | sys 0m9.813s
> \--------
>
> Then with --max-jobs:
>
> /--------
> | real 6m54.229s
> | user 18m14.016s
> | sys 0m13.125s
> \--------
>
> This is on an Intel Q6600 (a Core2 Quad) @ 2.4GHz with 6GB of RAM.

Thanks for the numbers. They are similar to what I see on my mega machine (i.e. the real-time speedup is around x2) with some increase in the user time. I have plotted real time versus number of cores (aka --max-jobs) and for the v10 patch the time diminishes as expected as the number of cores increases. Going from 1 to 4 cores provides a worthwhile speedup but going from 4 to 8 doesn't help very much and going above 8 makes no difference. It's interesting to note how machines differ. I have an AMD dual-core job that is only clocked at 1GHz and it outperforms (when running parallel mkgmap) a machine that uses a dual-core 3GHz Intel Pentium 4. In terms of raw CPU power the Intel machine (a very cheap Acer job) is actually quite fast but as soon as you try to do anything useful with it that involves accessing memory it slows right down to a crawl. Cheers, Mark
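
The "around x2" figure can be checked directly from the real times quoted above. The following is a small Java sketch for doing that arithmetic; the class and method names are hypothetical, and it assumes the `NmS.SSSs` format that the shell's `time` printed in this thread.

```java
// Hypothetical helper for reading the figures printed by the shell's `time`.
final class Speedup {

    // Converts a value such as "14m57.309s" to seconds.
    static double parseSeconds(String t) {
        int m = t.indexOf('m');
        int s = t.lastIndexOf('s');
        if (m < 0 || s != t.length() - 1 || s < m) {
            throw new IllegalArgumentException("expected a value like 14m57.309s, got: " + t);
        }
        double minutes = Double.parseDouble(t.substring(0, m));
        double seconds = Double.parseDouble(t.substring(m + 1, s));
        return minutes * 60.0 + seconds;
    }

    // Ratio of baseline time to parallel time; values above 1 mean faster.
    static double speedup(String baseline, String parallel) {
        return parseSeconds(baseline) / parseSeconds(parallel);
    }

    public static void main(String[] args) {
        // Real (wall-clock) times from the Q6600 run quoted in this thread.
        System.out.printf("real-time speedup: %.2f%n", speedup("14m57.309s", "6m54.229s"));
        // User time goes the other way: the parallel run burns slightly more CPU.
        System.out.printf("user-time ratio:   %.2f%n", speedup("18m14.016s", "16m40.606s"));
    }
}
```

For the Q6600 figures this gives a real-time speedup of about 2.17, with user time about 9% higher in the parallel run.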

On Tue, May 19, 2009 at 10:33:01AM +0100, Mark Burton wrote:
> Further testing shows that you can get different output on subsequent runs on the same input (ignoring the time stamps).
>
> So far, all of the differences appear to be caused by data being output in a different order rather than the data itself being different. So, I believe that the resulting maps are still good. You should understand that the ordering of some of the map's data structures is not important as long as all the cross-references match up.

Has anyone written a "lint" program for *.img files that would validate all the cross-references? Or a program to pretty-print the data structures in sorted format? The sorted pretty-print should be identical across runs. Marko
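
The sorted pretty-print idea can be illustrated with a minimal Java sketch. It assumes each record (a label, say) has already been dumped as one line of text with any cross-references resolved to the thing they point at; that dumping step is the hard part and is not shown. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: compares two dumps while ignoring record order.
final class SortedDump {

    // Returns a sorted copy, leaving the caller's list untouched.
    static List<String> canonical(List<String> records) {
        List<String> sorted = new ArrayList<>(records);
        Collections.sort(sorted);
        return sorted;
    }

    // True when both runs contain the same records, in any order.
    static boolean sameIgnoringOrder(List<String> runA, List<String> runB) {
        return canonical(runA).equals(canonical(runB));
    }
}
```

Because duplicates survive the sort, this also catches a record that appears a different number of times in the two runs.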

Hi Marko,
> Has anyone written a "lint" program for *.img files that would validate all the cross-references? Or a program to pretty-print the data structures in sorted format? The sorted pretty-print should be identical across runs.

There are various programs around for printing out the stuff in IMG files. I have a hacked version of imgdecode that is somewhat more useful than the original and there is always the "display" code written by Steve. I am simply using cmp -l to do a byte-for-byte comparison of the files generated by various runs of mkgmap and then using imgdecode and display to locate where the differences reside. Cheers, Mark
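
The byte-for-byte comparison described above, combined with ignoring the timestamps, can be sketched in Java as follows. This is an illustration only: the skipped range stands in for wherever the timestamp bytes live, and no real IMG offsets are implied. The class and method names are hypothetical. Note that `cmp -l` reports 1-based byte numbers, whereas this sketch uses 0-based offsets.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a cmp -l style comparison with one ignored range.
final class ImgCompare {

    // Returns the 0-based offsets at which the two images differ,
    // skipping the half-open range [skipFrom, skipTo).
    // Bytes present in only one of the arrays count as differences.
    static List<Integer> differingOffsets(byte[] a, byte[] b, int skipFrom, int skipTo) {
        List<Integer> diffs = new ArrayList<>();
        int longest = Math.max(a.length, b.length);
        for (int i = 0; i < longest; i++) {
            if (i >= skipFrom && i < skipTo) {
                continue; // e.g. a timestamp field that legitimately changes
            }
            boolean bothPresent = i < a.length && i < b.length;
            if (!bothPresent || a[i] != b[i]) {
                diffs.add(i);
            }
        }
        return diffs;
    }
}
```

The returned offsets are then the places to look up with a decoder such as imgdecode or the display code mentioned above.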

Mark Burton wrote:
> Changed default number of threads to be 1. If you specify --max-jobs without a value, you get one thread per core. --max-jobs=N means use N threads.
>
> With regard to comparing the output with known good maps to see if the parallel processing is corrupting anything, one problem is that the files contain timestamps. I have test code that zeros the time stamps and have been able to compare the output from different runs.

I have just run this against today's great_britain.osm and using --max-jobs I had a runtime of 397s and without was 676s. This is on my dual core laptop with 4GB RAM. The 2 files created had exactly the same number of bytes and in my admittedly very brief and not overly scientific testing inter-tile routing seems to be the same for both. I haven't dug any deeper than that though as I'm meant to be working :) Paul

Hi Paul,
> I have just run this against today's great_britain.osm and using --max-jobs I had a runtime of 397s and without was 676s. This is on my dual core laptop with 4GB RAM. The 2 files created had exactly the same number of bytes and in my admittedly very brief and not overly scientific testing inter-tile routing seems to be the same for both. I haven't dug any deeper than that though as I'm meant to be working :)

Yes, get back to work, you slacker! Many thanks for the feedback. That's a good speedup for a 2 core machine. When you have the time, please also try the quick-distance patch. Cheers, Mark

> Yes, get back to work, you slacker!
>
> Many thanks for the feedback. That's a good speedup for a 2 core machine.
>
> When you have the time, please also try the quick-distance patch.

real 4m13.508s
user 4m39.725s
sys 0m16.205s
Time duration: 254 secs.

real 3m25.462s
user 5m21.408s
sys 0m15.697s
Time duration: 205 secs.

The first is without --max-jobs and the second is with, and both are a considerable improvement on the times before the quick-distance patch. Cheers, Paul

Hi Paul, Thanks for the figures.
> real 4m13.508s
> user 4m39.725s
> sys 0m16.205s
> Time duration: 254 secs.
>
> real 3m25.462s
> user 5m21.408s
> sys 0m15.697s
> Time duration: 205 secs.
>
> The first is without --max-jobs and the second is with, and both are a considerable improvement on the times before the quick-distance patch.

Yes, it's a good time saver and when it's combined with the multicore patch I'm seeing around 350% speedup! Cheers, Mark

On Mon, May 18, 2009 at 4:42 PM, Mark Burton <markb@ordern.com> wrote:
> Changed default number of threads to be 1. If you specify --max-jobs without a value, you get one thread per core. --max-jobs=N means use N threads.

I have also tested this patch on my Windows machine: the error which I previously reported regarding missing files no longer occurs. Sorry for not responding earlier, but I was away. A superficial examination of the map revealed no noticeable differences or problems compared to maps compiled without the parallel code. I'll also test later on with Mac OS. Thanks! The patch looks good so far.

Hi Clinton,
> I have also tested this patch on my Windows machine: the error which I previously reported regarding missing files no longer occurs. Sorry for not responding earlier, but I was away.

No problem, glad it has fixed the issue.
> A superficial examination of the map revealed no noticeable differences or problems compared to maps compiled without the parallel code. I'll also test later on with Mac OS.

OK.
> Thanks! The patch looks good so far.

Excellent, we now have several reports that the missing file problem has been fixed (methinks that there is either a bug in the Java Futures stuff or it works somewhat differently than the documentation suggests). Cheers, Mark

On Tue, May 19, 2009 at 6:37 PM, Mark Burton <markb@ordern.com> wrote:
> A superficial examination of the map revealed no noticeable differences or problems compared to maps compiled without the parallel code. I'll also test later on with Mac OS.

I can also confirm that the patch appears to work correctly on Mac OS X. I tested with this patch and with the "quick distance calculation v1" patch also applied. On a 2 GB, 2 GHz Intel Core 2 Duo machine, a combined map of France, Germany, Switzerland, and Italy took about 45 minutes to compile. Without these patches, it took about 1 hour and 40 minutes. Sorry, I didn't test the patches separately, so I don't have data on the individual performance improvements. At any rate, with superficial testing, both patches seem to be stable. Thanks! Cheers.

Hi Clinton,
> I can also confirm that the patch appears to work correctly on Mac OS X. I tested with this patch and with the "quick distance calculation v1" patch also applied.
>
> On a 2 GB, 2 GHz Intel Core 2 Duo machine, a combined map of France, Germany, Switzerland, and Italy took about 45 minutes to compile. Without these patches, it took about 1 hour and 40 minutes.

Good, that's a x2 speedup.
> Sorry, I didn't test the patches separately, so I don't have data on the individual performance improvements.

Don't worry about that, I am more concerned with correctness at this time.
> At any rate, with superficial testing, both patches seem to be stable. Thanks!

That's good. If you're happy to keep using the patches, please do and that will give you a chance to spot any problems. Cheers, Mark
participants (6)
- Clinton Gladstone
- Mark Burton
- Marko Mäkelä
- Martin Marinus
- Paul
- Toby Speight