On Fri, Jun 11, 2010 at 6:52 PM, Chris Miller <chris_overseas@hotmail.com> wrote:
Hi Scott,
 
This probably explains why
no one from the mkgmap/splitter community noticed or commented on your original
post on the osm-dev list - if they're anything like me they don't pay too
much attention to what's going on there, so I guess it got missed.


It was ready for feedback, but IMHO not yet ready to be used.

As far as the splitter is concerned, I'd be happy to use it internally instead
of the current cache format given the speed boost it appears to offer. I've
always considered the cache to be a short-lived/temporary thing anyway, purely
as an aid to speeding up the split, so anything that helps that is a win.

 
Using a binary format between the splitter and mkgmap would speed things
up immensely too; in fact we've discussed it a bit in the past on the mailing
list. Before we jump in too deep however, can you comment on how well your
format has been accepted/integrated into Osmosis or any other osm projects,
and how stable the file format is now? (ie, has it been battle tested? ;)

No use that I know of. I got busy after my initial emails, so I haven't had time to clean it up and package it as a true osmosis plugin. Once I do that, the chances that it gets more use will increase, especially if other tools are engineered to use it. I've been using it myself for the last couple of months.
 
If
it looks like becoming a standard/supported format in the osm community then
supporting it as both input and output formats in the splitter is a bit of
a no-brainer, likewise as an input format for mkgmap (Steve and co, correct
me if I'm wrong!). I admit I'm a bit wary of the complexity of the format
and the potential maintenance effort required, especially if the format isn't
used elsewhere in the osm community other than splitter and mkgmap.

I've had a look at this patch - I'd be interested to hear your thinking behind
the changes you made (because I'm too lazy to try and get my head around
it by studying the code!). Are you sure it's behaving as you intended?

Yes.
 
The
--max-threads parameter is being ignored; instead you're creating as many
threads as there are areas in a pass (up to 255). This seems to result in
fairly unpredictable bursts in activity, sometimes quite aggressive, but generally
most of the threads are sitting idle.

Yes, to some extent the activity is variable. However, it seems the current design has the worker threads effectively busy-waiting, polling each of the queues to see if there is data to process?
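
For comparison, here's a minimal stand-alone sketch of a blocking hand-off (the class and variable names are invented for the example, not the splitter's actual code). The worker parks on a BlockingQueue and uses no CPU until a bundle actually arrives:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class WorkerSketch {
        static final BlockingQueue<int[]> queue =
                new ArrayBlockingQueue<int[]>(64);

        public static void main(String[] args) throws InterruptedException {
            Thread worker = new Thread(new Runnable() {
                public void run() {
                    try {
                        while (true) {
                            // take() parks the thread until a bundle arrives,
                            // instead of looping over every per-area queue with
                            // poll() and burning CPU while they are all empty.
                            int[] bundle = queue.take();
                            System.out.println("processed " + bundle.length + " nodes");
                        }
                    } catch (InterruptedException e) {
                        // exit quietly when told to stop
                    }
                }
            });
            worker.setDaemon(true);
            worker.start();
            queue.put(new int[] {1, 2, 3});  // producer hands off one bundle
            Thread.sleep(100);               // give the worker time to drain it
        }
    }

With a single shared queue like this, a pool of --max-threads workers can all block on the same take(), and the scheduler only wakes as many as there is work for.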
 
In my benchmarks I'm seeing your patch
run a bit quicker and with higher CPU utilisation than the r109 code. It's
about 20% faster for a UK extract, 9% faster for the planet (on a core i7).
Note that WanMil's original patch uses (cores - 1) for maxthreads.

I suspect the reason the planet doesn't do as well as a smaller area is that
the planet has more areas in the split and so takes longer to fill up a given
'bundle' to trigger some processing.

I suspect it has a different cause. A planet split has proportionally more serial code than a UK extract: summing over each pass, every node in a planet split is tested against all 800 areas to see if it should be included, while on the UK extract each node is only tested against 20 or so areas. Now that reading is so fast, I strongly suspect the system is bottlenecked in this serial code.
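
To make the serial cost concrete, here's a rough stand-alone sketch (Area is a simplified stand-in for the splitter's real class, and the grid of areas is faked for the example):

    public class DispatchSketch {
        static class Area {
            final double minLat, minLon, maxLat, maxLon;
            Area(double minLat, double minLon, double maxLat, double maxLon) {
                this.minLat = minLat; this.minLon = minLon;
                this.maxLat = maxLat; this.maxLon = maxLon;
            }
            boolean contains(double lat, double lon) {
                return lat >= minLat && lat < maxLat
                    && lon >= minLon && lon < maxLon;
            }
        }

        public static void main(String[] args) {
            // a planet split has on the order of 800 areas; fake a 40x20 grid
            Area[] areas = new Area[800];
            for (int i = 0; i < areas.length; i++) {
                double lon = (i % 40) * 9 - 180;
                double lat = (i / 40) * 9 - 90;
                areas[i] = new Area(lat, lon, lat + 9, lon + 9);
            }
            double nodeLat = 51.5, nodeLon = -0.1;  // one node near London
            int hits = 0;
            // This loop runs in the single reader thread for *every* node:
            // ~800 bounds checks per node on a planet, ~20 on a UK extract.
            for (Area a : areas) {
                if (a.contains(nodeLat, nodeLon)) {
                    hits++;  // here the real code would enqueue to that area
                }
            }
            System.out.println("node matched " + hits + " of "
                    + areas.length + " areas");
        }
    }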

This reduces the amount of CPU used
on average, since we'll get more idle periods (when many areas are still
filling their buffers), but the maximum possible work being done when buffers
fill up is still throttled by the number of cores.


I don't believe so. On a core i7 the system is rarely using more than 2 cores, which indicates the parallel code cannot be the bottleneck. In the steady state each bundle can be processed in a few ms (my core2 duo can do 500k nodes/sec), so the aggregate unprocessed work across all incomplete bundles is only a second or two of CPU time. That work is not avoided, only delayed, and in the steady state the delay is lost in the noise. Reducing the bundle size will only change the delay before the work is done, at the cost of increasing the number of memory barriers in the concurrent queue code.
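
A quick sketch of that trade-off, with a made-up BUNDLE_SIZE (not the splitter's real default): every put() on the concurrent queue is a synchronization point, so halving the bundle size doubles the number of barriers for the same node stream without reducing the total work.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    public class BundleSketch {
        static final int BUNDLE_SIZE = 1000;  // hypothetical value

        public static void main(String[] args) throws InterruptedException {
            BlockingQueue<List<Integer>> queue =
                    new LinkedBlockingQueue<List<Integer>>();
            List<Integer> bundle = new ArrayList<Integer>(BUNDLE_SIZE);
            for (int node = 0; node < 10000; node++) {
                bundle.add(node);
                if (bundle.size() == BUNDLE_SIZE) {
                    queue.put(bundle);  // one synchronization point per bundle
                    bundle = new ArrayList<Integer>(BUNDLE_SIZE);
                }
            }
            if (!bundle.isEmpty()) {
                queue.put(bundle);      // flush the final partial bundle
            }
            // halving BUNDLE_SIZE doubles the number of put() barriers
            // for the same 10000 nodes
            System.out.println("hand-offs: " + queue.size());
        }
    }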

I know people leave the splitter
running as a background task on their desktop PC, so if we're periodically
bursting 10+ threads at full CPU I can imagine they might get a bit annoyed
at times. Thoughts?


Given that so many cores are idle, I don't see how we could be getting bursts of 10+ threads unless the serial code gets sped up. This could become a risk if the density-map idea goes in and massively reduces the serial work. If this worries you, maybe hold off on this part of the patch until after the densitymap is in, or put it in and be prepared to revert later?
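
For what it's worth, here's my guess at how such an index could cut the serial cost; this is just an illustration of the principle, not necessarily what the densitymap patch does. Each coarse grid cell precomputes which areas overlap it, so the reader thread does full bounds checks against a handful of candidates instead of all ~800:

    import java.util.ArrayList;
    import java.util.List;

    public class GridIndexSketch {
        public static void main(String[] args) {
            // 360 one-degree longitude bands; each band records which areas
            // overlap it (faked here as area i covering band i % 360)
            List<List<Integer>> candidates = new ArrayList<List<Integer>>(360);
            for (int band = 0; band < 360; band++) {
                candidates.add(new ArrayList<Integer>());
            }
            for (int area = 0; area < 800; area++) {
                candidates.get(area % 360).add(area);
            }
            double nodeLon = -0.1;
            int band = (int) Math.floor(nodeLon + 180);
            // Only the couple of areas registered in this band need the
            // full bounds check, instead of all 800.
            System.out.println("candidate areas: " + candidates.get(band));
        }
    }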

A couple of ideas I have here... perhaps reducing the bundle size will help
keep the workers fed (especially when there is a larger number of areas),
and also using a threadpool of max-threads to do the work on, rather than
one thread per area, would help prevent CPU starvation for other applications,
without adversely affecting performance?

If CPU starvation is a problem, can't the splitter be set to run with a lower priority?
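
On Unix something like "nice -n 19 java -jar splitter.jar ..." does it at the process level. Inside the code, a sketch like this lowers the worker threads themselves (how much the OS honours Java thread priorities varies by platform):

    public class LowPrioritySketch {
        public static void main(String[] args) throws InterruptedException {
            Thread worker = new Thread(new Runnable() {
                public void run() {
                    long sum = 0;
                    for (long i = 0; i < 500000000L; i++) {
                        sum += i;  // stand-in for CPU-bound splitting work
                    }
                    System.out.println("done: " + sum);
                }
            });
            worker.setPriority(Thread.MIN_PRIORITY);  // yield to interactive apps
            worker.start();
            worker.join();
        }
    }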

Scott