Increasing the performance of whole-planet splits.

Hello! I would like to submit the first of a set of performance optimizations to the splitter that improve its performance and reduce its memory usage. The end result is that a Core 2 Duo can split the entire planet in 3.5 hours with -Xmx1700m and without using any temporary space (67 minutes on an 8GB Core i7). Everything is up on GitHub. I'm sending the first set of patches now.

Perhaps the biggest change is that I've made the splitter accept my binary OSM format. I created a binary OSM format that is 30-50% smaller than the bzipped planet (about 5.3GB without metadata) and 5x-10x faster to parse than the gzipped planet. As that format is substantially smaller than the splitter's cache, and to simplify my patch, I removed the cache before refactoring the splitter. My original announcement is at http://lists.openstreetmap.org/pipermail/dev/2010-April/019370.html Since that post I have fixed the issue of unwanted extra precision. I want your thoughts on the binary format before submitting it. Please have a look and tell me what you think.

In this email I am submitting some other optimizations and patches. The performance numbers I quote assume that I was using the binary format; I am unsure what benefit they have on trunk.

In IntIntMap, the current offset value for rehashing is way too low, given the dense nature of OSM node ID values and a sequential insertion order. This problem is most obvious on a whole-planet split. It results in huge numbers of collisions, requiring 30+ probes on average. Note when benchmarking that the collision rate is irregular throughout the planet file; with a --max-nodes=4000000 whole-planet split, there's virtually no performance difference on the first 90M nodes, yet by 120M nodes, using a random 31-bit prime number is over 3x faster.

My multithread changes reduced the real and user CPU time by 30-50% on a UK extract. Other tests, e.g., a whole planet, show almost no effect. I think this is because a lot of my changes increase the efficiency of code that is parallelized, which paradoxically ends up reducing the relative benefits of parallelization.

The single biggest remaining slowdown is that for each point, the splitter sends each Node to each area (represented by an OSMWriter), which writes it to the output file if it is in the bounds of the region (with overlap), or does nothing otherwise. On a whole planet, with 840 areas and 550M nodes, that is over 450 billion bounds-checks that are not run in parallel. An index that can map a node to the regions that may contain it could avoid 99% of these checks.

Scott
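(For illustration, a minimal open-addressing sketch of the probing scheme described above; the names and constants are assumed, not the splitter's actual IntIntMap. With dense sequential node IDs, a small rehash offset walks each collision into the next, equally crowded slots, while a large odd step such as a random 31-bit prime spreads the chain across the whole table:)

    class IntIntMapSketch {
        // Assumed constants, for illustration only. Any large odd step
        // visits every slot of a power-of-two table; Scott's mail
        // suggests a random 31-bit prime.
        private static final int STEP = 0x3A8F05C5;
        private static final long EMPTY = Long.MIN_VALUE; // sentinel; not a valid node ID

        private final long[] keys;
        private final int[] values;
        private final int mask; // capacity - 1

        IntIntMapSketch(int capacityPowerOfTwo) {
            keys = new long[capacityPowerOfTwo];
            values = new int[capacityPowerOfTwo];
            mask = capacityPowerOfTwo - 1;
            java.util.Arrays.fill(keys, EMPTY);
        }

        // No resizing in this sketch: the table must be sized so it never
        // fills completely, otherwise the probe loops would not terminate.
        void put(long key, int value) {
            int slot = (int) (key & mask);   // dense IDs land in dense home slots...
            while (keys[slot] != EMPTY && keys[slot] != key) {
                slot = (slot + STEP) & mask; // ...so scatter each collision chain widely
            }
            keys[slot] = key;
            values[slot] = value;
        }

        int get(long key, int notFound) {
            int slot = (int) (key & mask);
            while (keys[slot] != EMPTY) {
                if (keys[slot] == key) return values[slot];
                slot = (slot + STEP) & mask;
            }
            return notFound;
        }
    }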

Hello Scott,
Hello! I would like to submit the first of a set of performance optimizations to the splitter that improve its performance and reduce its memory usage.
Fantastic!
Perhaps the biggest change is that I've made the splitter accept my binary OSM format.
It'll be a few days before I have time to look at this in much detail and give you my thoughts but from what I can see it sounds promising. We actually use protobufs at my job for shunting around vast amounts of data so I'm pretty familiar with the pros and cons of them. One concern is that even if your binary format becomes reasonably standardised in the OSM world we'll still need to maintain .osm support for the foreseeable future. Perhaps that can be tackled by swapping out the existing cache format with your protobuf format. Even that I'm not sure about in the longer term; tackling some of the harder problems with the splitter will require more than just a linear protobuf or cache format. Data structures such as Btree indexes and so on will likely be required, though I'm sure we can cross that bridge when we come to it.
In this email I am submitting some other optimizations and patches. The performance numbers I quote assume that I was using the binary format; I am unsure what benefit they have on trunk.
From what I've seen so far of your patches, most of them look beneficial to the trunk too :)
In IntIntMap, the current offset value for rehashing is way too low,
This is a very good catch, I'll get this checked in. Your 'node too many' fix sounds like a good one too, as do the 'infinite recursion' fixes. I'll get these checked in as soon as I've had a chance to test them.

I'm not so sure about your lazy HashMap initialisation in Element.java, though it does highlight a bug that was introduced by the recent r109 multithreaded patch! Prior to patch r109, Element objects were never recreated; rather, .reset() was called to just reset their state for each new node/way/rel. r109 changed that to create new instances each time, hence some of the object churn you were probably seeing that led you to your patch. I'll have a think about what to do here and probably run a few benchmarks to see what works best (eg keeping one current Node/Way/Rel per thread might be best).

[As an aside, note that in your lazy HashMap patch Collections.EMPTY_SET results in an unchecked assignment warning. That of course is no big deal, easily fixed with Collections.<String, String>emptyMap().entrySet(). The sting though is that the call to Collections.EMPTY_SET.iterator() actually creates a new iterator object each time(!). Better would be to create a single static immutable iterator and return that.]
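(A sketch of what the aside above could look like in code; the names are assumed and the real Element.java differs. The shared iterator is safe to reuse because an empty iterator carries no state - hasNext() is always false:)

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.Map;

    class ElementSketch {
        // One shared iterator for all untagged elements, instead of a
        // fresh object from Collections.EMPTY_SET.iterator() per call.
        private static final Iterator<Map.Entry<String, String>> NO_TAGS =
                Collections.<String, String>emptyMap().entrySet().iterator();

        private Map<String, String> tags; // stays null until the first tag arrives

        void addTag(String key, String value) {
            if (tags == null) tags = new HashMap<String, String>();
            tags.put(key, value);
        }

        Iterator<Map.Entry<String, String>> tagsIterator() {
            return tags == null ? NO_TAGS : tags.entrySet().iterator();
        }
    }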
My multithread changes reduced the real and user CPU time by 30-50% on a UK extract. Other tests, e.g., a whole planet, show almost no effect. I think this is because a lot of my changes increase the efficiency of code that is parallelized, which paradoxically ends up reducing the relative benefits of parallelization.
This sounds promising, I'll look at what you've done further in the next couple of days and hopefully get it committed this weekend.
The single biggest remaining slowdown is that for each point, the splitter sends each Node to each area (represented by an OSMWriter), which writes it to the output file if it is in the bounds of the region (with overlap), or does nothing otherwise. On a whole planet, with 840 areas and 550M nodes, that is over 450 billion bounds-checks that are not run in parallel. An index that can map a node to the regions that may contain it could avoid 99% of these checks.
This is something I've been aware of for a long time but haven't found the time to tackle yet. One idea I have is to generate an r-tree in memory and run the node comparisons against that to find out what area(s) each node falls into. I'd estimate it would reduce the number of comparisons required from 840/node to less than 10. I suspect that would be a win over generating a node-to-area index, which would be expensive to create initially, plus expensive in terms of disk access (or alternatively, require lots of RAM). Also, with a bit of effort these r-tree comparisons could be parallelised too.

I'll follow up in a while once I've had a bit more time to investigate and respond. Many many thanks for your work on all this, very much appreciated!

Chris

It'll be a few days before I have time to look at this in much detail and give you my thoughts but from what I can see it sounds promising. We actually use protobufs at my job for shunting around vast amounts of data so I'm pretty familiar with the pros and cons of them. One concern is that even if your binary format becomes reasonably standardised in the OSM world we'll still need to maintain .osm support for the foreseeable future. Perhaps that can be tackled by swapping out the existing cache format with your protobuf format.
I envisioned it as an alternate format that could exist in parallel with the .osm format, but also be smaller and faster. I still need to clean up the osmosis patches and turn them into a plugin.
Even that I'm not sure about in the longer term; tackling some of the harder problems with the splitter will require more than just a linear protobuf or cache format. Data structures such as Btree indexes and so on will likely be required, though I'm sure we can cross that bridge when we come to it.
What harder problems do you think the splitter needs solved?

I'm not so sure about your lazy HashMap initialisation in Element.java, though it does highlight a bug that was introduced by the recent r109 multithreaded patch! Prior to patch r109, Element objects were never recreated; rather, .reset() was called to just reset their state for each new node/way/rel.
The reason for this was that the costs of initialization were showing up in the profile.
r109 changed that to create new instances each time, hence some of the object churn you were probably seeing that led you to your patch. I'll have a think about what to do here and probably run a few benchmarks to see what works best (eg keeping one current Node/Way/Rel per thread might be best).
May I suggest holding off on this for a bit? The branch that adds support for reading the binary file format refactors this interface so that node/way/rel objects are passed directly instead of using an interface that looks like XML callbacks.

[As an aside, note that in your lazy HashMap patch Collections.EMPTY_SET results in an unchecked assignment warning. That of course is no big deal, easily fixed with Collections.<String, String>emptyMap().entrySet(). The sting though is that the call to Collections.EMPTY_SET.iterator() actually creates a new iterator object each time(!). Better would be to create a single static immutable iterator and return that.]
Good catch. Should I respin the patch or will you?
My multithread changes reduced the real and user CPU time by 30-50% on a UK extract. Other tests, e.g., a whole planet, show almost no effect. I think this is because a lot of my changes increase the efficiency of code that is parallelized, which paradoxically ends up reducing the relative benefits of parallelization.
This sounds promising, I'll look at what you've done further in the next couple of days and hopefully get it committed this weekend.
The single biggest remaining slowdown is that for each point, the splitter sends each Node to each area (represented by an OSMWriter), which writes it to the output file if it is in the bounds of the region (with overlap), or does nothing otherwise. On a whole planet, with 840 areas and 550M nodes, that is over 450 billion bounds-checks that are not run in parallel. An index that can map a node to the regions that may contain it could avoid 99% of these checks.
This is something I've been aware of for a long time but haven't found the time to tackle yet. One idea I have is to generate an r-tree in memory and run the node comparisons against that to find out what area(s) each node falls into.
josm already includes a quadtree implementation. I wish it could be disentangled and re-used across the different OSM applications. It'd be a perfect solution here.
I'd estimate it would reduce the number of comparisons required from 840/node to less than 10. I suspect that would be a win over generating a node-to-area index, which would be expensive to create initially, plus expensive in terms of disk access (or alternatively, require lots of RAM). Also, with a bit of effort these r-tree comparisons could be parallelised too.
Perfection isn't necessary. How about a 64x64 array. Divide the world into 4096 regions. Each region contains a list of all areas --- including their overlap --- that intersect it. Then, for each node, you only need to iterate the OSMWriters in that region. Getting the round-offs right would be tricky, but it should be good enough to filter out almost all non-matching areas.

I'll follow up in a while once I've had a bit more time to investigate and respond. Many many thanks for your work on all this, very much appreciated!
Chris
Thanks for a useful program. Glad I could help make it better. Scott

Hello Scott,
I envisioned it as an alternate format that could exist in parallel with the .osm format, but also be smaller and faster. I still need to clean up the osmosis patches and turn them into a plugin.
OK sounds like we're both thinking along the same lines then. Note that the current cache is basically an alternate format too which is (mostly) interchangeable with an .osm file. That's why I suggested swapping the current cache with your binary format as an option.
What harder problems do you think the splitter needs solved?
One example is the current handling of ways and relations that cross tile boundaries. This is tackled via the overlap handling, which is really just a hack to try to hang on to enough information so mkgmap will always be able to eg cut ways correctly as they cross the boundaries. For the most part it works well, but there are still cases where it isn't enough. Ideally for a given tile we'd hang on to ALL the nodes that made up (certain) ways and relations, regardless of how far outside the tile the nodes fell. That would eliminate some problem cases that mkgmap currently has to guess at fixing, or can't deal with at all. Hanging on to all the nodes is hard, because on a dataset the size of the planet it's not easy to quickly find all the nodes (including their associated metadata) that belong to a given way or relation.
The reason for this was that the costs of initialization were showing up in the profile.
Do you mean the cost of initializing the Element instances? If so then yes, that's something that I'd caught previously in profiling too, hence why they were being reused and reset (prior to r109 that is :).
May I suggest holding off on this for a bit? The branch that adds support for reading the binary file format refactors this interface so that node/way/rel objects are passed directly instead of using an interface that looks like XML callbacks.
Sure thing, I'll get the other patches tested and checked in first.
Good catch. Should I respin the patch or will you?
I've already made the change locally, and given that it's an improvement over r109 I'll check it in anyway, but I think it might be worth revisiting this later to see if it's worth recycling/pooling the Element objects somehow. The threading & buffering obviously make this trickier and maybe it'll prove not to be worth it.
josm already includes a quadtree implementation. I wish it could be disentangled and re-used across the different OSM applications. It'd be a perfect solution here.
Not quite perfect I think... a quadtree doesn't allow overlapping rectangles whereas an r-tree does, which would allow us to deal with the overlap much more efficiently.
Perfection isn't necessary. How about a 64x64 array. Divide the world into 4096 regions. Each region contains a list of all areas --- including their overlap --- that intersect it. Then, for each node, you only need to iterate the OSMWriters in that region. Getting the round-offs right would be tricky, but it should be good enough to filter out almost all non-matching areas.
Agreed, what you describe would still be a good improvement. This is basically how the DensityMap works already so that could probably/possibly be tweaked and reused. I still prefer the raw performance (and challenge!) of the r-tree approach but realistically it'll be some time before I get around to implementing that, so would be more than happy to go with the above in the meantime.
Thanks for a useful program. Glad I could help make it better.
Don't thank me, thank Steve for starting this project in the first place! I'm much like you, just got involved because I had an itch to scratch, and it looked like an interesting problem :) Chris

The reason for this was that the costs of initialization were showing up in the profile.
Do you mean the cost of initializing the Element instances? If so then yes, that's something that I'd caught previously in profiling too, hence why they were being reused and reset (prior to r109 that is :).
No, the tag storage. Most nodes don't have tags.
josm already includes a quadtree implementation. I wish it could be disentangled and re-used across the different OSM applications. It'd be a perfect solution here.
Not quite perfect I think... a quadtree doesn't allow overlapping rectangles whereas an r-tree does, which would allow us to deal with the overlap much more efficiently.
AFAICT, their quadtree lets you import objects that have a bounding-box, which gets stored at the deepest node wholly containing the bounding box. This would be enough.
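(For the curious, a compact sketch of the behaviour Scott describes, in the style of an MX-CIF quadtree; hypothetical code, not JOSM's actual implementation. Each box is stored at the deepest cell that wholly contains it, so overlapping boxes coexist, and a point query walks one root-to-leaf path collecting candidates:)

    import java.util.ArrayList;
    import java.util.List;

    class BBoxQuadtree {
        // A box is {minLat, minLon, maxLat, maxLon} in degrees.
        private static final int MAX_DEPTH = 16;
        private final Node root = new Node(-90, -180, 90, 180);

        void insert(double[] box) { root.insert(box, 0); }

        // Boxes straddling a split line sit at the shallower node and
        // are still found on the way down.
        List<double[]> query(double lat, double lon) {
            List<double[]> hits = new ArrayList<double[]>();
            for (Node n = root; n != null; n = n.childContaining(lat, lon)) {
                for (double[] b : n.boxes) {
                    if (lat >= b[0] && lon >= b[1] && lat <= b[2] && lon <= b[3]) hits.add(b);
                }
            }
            return hits;
        }

        private static class Node {
            final double la0, lo0, la1, lo1;
            final List<double[]> boxes = new ArrayList<double[]>();
            Node[] kids; // four quadrants, created on demand

            Node(double la0, double lo0, double la1, double lo1) {
                this.la0 = la0; this.lo0 = lo0; this.la1 = la1; this.lo1 = lo1;
            }

            void insert(double[] b, int depth) {
                if (depth < MAX_DEPTH) {
                    split();
                    for (Node k : kids) {
                        if (b[0] >= k.la0 && b[1] >= k.lo0 && b[2] <= k.la1 && b[3] <= k.lo1) {
                            k.insert(b, depth + 1); // a quadrant wholly contains the box
                            return;
                        }
                    }
                }
                boxes.add(b); // deepest node wholly containing the box
            }

            Node childContaining(double lat, double lon) {
                if (kids == null) return null;
                for (Node k : kids) {
                    if (lat >= k.la0 && lat < k.la1 && lon >= k.lo0 && lon < k.lo1) return k;
                }
                return null;
            }

            void split() {
                if (kids != null) return;
                double lam = (la0 + la1) / 2, lom = (lo0 + lo1) / 2;
                kids = new Node[] {
                    new Node(la0, lo0, lam, lom), new Node(la0, lom, lam, lo1),
                    new Node(lam, lo0, la1, lom), new Node(lam, lom, la1, lo1) };
            }
        }
    }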
Perfection isn't necessary. How about a 64x64 array. Divide the world into 4096 regions. Each region contains a list of all areas --- including their overlap --- that intersect it. Then, for each node, you only need to iterate the OSMWriters in that region. Getting the round-offs right would be tricky, but it should be good enough to filter out almost all non-matching areas.
Agreed, what you describe would still be a good improvement. This is basically how the DensityMap works already so that could probably/possibly be tweaked and reused.
That was exactly what I thought too.
I still prefer the raw performance (and challenge!) of the r-tree approach but realistically it'll be some time before I get around to implementing that, so would be more than happy to go with the above in the meantime.
If you do, could I offer a suggestion? Define interfaces for every concept it uses, including coordinates and bounding boxes. That way it can be used in other OSM software. About a month ago, I lamented that josm's quadtree implementation could be refactored to support anything that has getBBox() and isUsable() functions, but using it requires importing josm's notion of bounding boxes and coordinates, which brings in dependencies on java.awt.geom.Rectangle2D and org.openstreetmap.josm.Main (via LatLon).
Thanks for a useful program. Glad I could help make it better.
Don't thank me, thank Steve for starting this project in the first place! I'm much like you, just got involved because I had an itch to scratch, and it looked like an interesting problem :)
Well Steve, thank you for the good program! Scott

Hello Scott,
No, the tag storage. Most nodes don't have tags.
Ah ok fair enough. Still the same problem source, caused by the lack of object recycling.
AFAICT, their quadtree lets you import objects that have a bounding-box, which gets stored at the deepest node wholly containing the bounding box. This would be enough.
So multiple objects are allowed on each node then I guess? Regardless, a density map or r-tree approach would be preferable anyway. A density map because you can jump straight to the set of candidate areas without a traversal, or r-tree because the tree's answer would be exact without any further tests required.

The limitation I see with a density map approach is that parts of the planet that have the highest node density would by definition contain the smallest split areas. This means the density map would need to be high enough resolution that we didn't have to test against too many candidate split areas in these dense parts (especially since these would be the most common tests). To reduce the memory requirements, sets of candidate areas would probably need to be pooled and reused across multiple points on the density map. I hope that makes sense, I'm finding it hard to put my thoughts into words due to a lack of precise terminology!
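(One possible reading of the pooling idea, sketched under assumed names rather than an actual design: many map points end up with identical candidate-area sets, so equal sets can be interned and shared instead of stored once per point:)

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class CandidateSetPool {
        // List.equals()/hashCode() compare by contents, so equal
        // candidate lists map to one shared instance. Interned lists
        // must never be mutated afterwards.
        private final Map<List<Integer>, List<Integer>> pool =
                new HashMap<List<Integer>, List<Integer>>();

        List<Integer> intern(List<Integer> candidates) {
            List<Integer> shared = pool.get(candidates);
            if (shared == null) {
                pool.put(candidates, candidates);
                shared = candidates;
            }
            return shared;
        }
    }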
If you do, could I offer a suggestion? Define interfaces for every concept it uses, including coordinates and bounding boxes. That way it can be used in other OSM software.
Sure, though the implementation I had in mind is nowhere near a full featured one. No dynamic inserts/updates/deletes, no rebalancing etc which would greatly restrict its general purpose use. Not to say that couldn't be added later if someone wanted to put the effort in but I don't see myself doing it unless it turns out to be needed in the splitter.
About a month ago, I lamented that josm's quadtree implementation could be refactored to support anything that has getBBox() and isUsable() functions, but using it requires importing josm's notion of bounding boxes and coordinates, which brings in dependencies on java.awt.geom.Rectangle2D and org.openstreetmap.josm.Main (via LatLon).
Definitely something to bear in mind. Shouldn't be a problem, I usually try to avoid such dependencies anyway. Cheers, Chris

On Fri, Jun 11, 2010 at 4:17 AM, Chris Miller <chris_overseas@hotmail.com> wrote:
The limitation I see with a density map approach is that parts of the planet that have the highest node density would by definition contain the smallest split areas. This means the density map would need to be high enough resolution that we didn't have to test against too many candidate split areas in these dense parts (especially since these would be the most common tests).
A 256x256 map would have 65536 cells about 120km on a side. Looking at the overview map of a max-nodes=1000000 split in qlandkarte, which shows tile boundaries, I'd guess the worst case would be 10-15 tiles in a cell. That would prune away at least 97.5% of the bounds-checks. With the planet file, at this node count and cell size, an r-tree traversal would likely have more bounds-checks than that. I could see an r-tree being a win if tile sizes got a lot smaller, say if people started to run max-nodes=50000 splits, or split files with insane node densities, such as contour lines.

To reduce the memory requirements, sets of candidate areas would probably need to be pooled and reused across multiple points on the density map.
I wouldn't worry too much about memory. The absolute worst case is each cell holding all 255 areas, which would be 4 bytes * 255 areas (IntList) * (256*256 cells) = 64MB. Use a ByteList and it's 16MB. A bit vector and it's 2MB. More realistically, I'd expect it to be 1/10th of the worst case. Scott
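(A sketch of the grid index being discussed, using the 256x256 sizing from this mail; class and method names are assumed, not actual splitter code:)

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    class AreaGridIndex {
        private static final int GRID = 256;
        @SuppressWarnings("unchecked")
        private final List<Integer>[] cells = new List[GRID * GRID];

        // Register an area under its overlap-expanded bounding box
        // (degrees), touching every cell the box intersects.
        void addArea(int areaIndex, double minLat, double minLon,
                     double maxLat, double maxLon) {
            for (int y = latToCell(minLat); y <= latToCell(maxLat); y++) {
                for (int x = lonToCell(minLon); x <= lonToCell(maxLon); x++) {
                    int i = y * GRID + x;
                    if (cells[i] == null) cells[i] = new ArrayList<Integer>();
                    cells[i].add(areaIndex);
                }
            }
        }

        // Candidate areas for one node. The caller still runs the exact
        // bounds check, so rounding at cell edges costs only wasted
        // tests, never missed areas.
        List<Integer> candidates(double lat, double lon) {
            List<Integer> c = cells[latToCell(lat) * GRID + lonToCell(lon)];
            return c != null ? c : Collections.<Integer>emptyList();
        }

        private static int latToCell(double lat) { return clamp((int) ((lat + 90.0) * GRID / 180.0)); }
        private static int lonToCell(double lon) { return clamp((int) ((lon + 180.0) * GRID / 360.0)); }
        private static int clamp(int v) { return Math.max(0, Math.min(GRID - 1, v)); }
    }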

Hello Scott, I was doing some similar back-of-envelope calcs earlier today and came to a similar conclusion. I do think there's a lot of people splitting quite a way below 1,000,000 nodes/tile, but even still a map's far easier to implement than r-trees and is a big win over what we have now, so it's probably not worth worrying too much about where the crossover point is.
A 256x256 map would have 65536 cells about 120km on a side. Looking at the overview map of a max-nodes=1000000 split in qlandkarte, which shows tile boundaries, I'd guess the worst case would be 10-15 tiles in a cell. That would prune away at least 97.5% of the bounds-checks.
With the planet file, at this node count and cell size, an r-tree traversal would likely have more bounds-checks than that. I could see an r-tree being a win if tile sizes got a lot smaller, say if people started to run max-nodes=50000 splits, or split files with insane node densities, such as contour lines.
To reduce the memory requirements, sets of candidate areas would probably need to be pooled and reused across multiple points on the density map.

I wouldn't worry too much about memory. The absolute worst case is each cell holding all 255 areas, which would be 4 bytes * 255 areas (IntList) * (256*256 cells) = 64MB. Use a ByteList and it's 16MB. A bit vector and it's 2MB. More realistically, I'd expect it to be 1/10th of the worst case.
Scott

Hi Scott,
I created a binary OSM format that is 30-50% smaller than the bzipped planet (about 5.3gb without metadata) and 5x-10x faster to parse than the gzipped planet. As that format is substantially smaller than the splitter's cache, and to simplify my patch, I removed the cache before refactoring the splitter. My original announcement is at http://lists.openstreetmap.org/pipermail/dev/2010-April/019370.html Since that post I have fixed the issue of unwanted extra precision. I want your thoughts on the binary format before submitting it.
I've finally had a bit of time to read over your original post and the followups to it. (Note that, as I understand it, the OSM dev mailing list and Osmosis don't really have any direct relationship with the splitter or mkgmap. mkgmap is a tool originally written by Steve Ratcliffe purely to port osm data to Garmin devices, and the splitter was an additional tool he wrote to help achieve that goal and it has evolved from there. This probably explains why no one from the mkgmap/splitter community noticed or commented on your original post on the osm-dev list - if they're anything like me they don't pay too much attention to what's going on there so I guess it got missed.)

I really like what you've done and can see the benefits of having a compact & fast binary format when it comes to processing large amounts of osm data. I can also see you've put a lot of thought and effort into how the file is structured. Having something like this as a standard for tools like mkgmap and splitter would be very beneficial in terms of performance and interoperability.

As far as the splitter is concerned, I'd be happy to use it internally instead of the current cache format given the speed boost it appears to offer. I've always considered the cache to be a short lived/temporary thing anyway, purely as an aid to speeding up the split, so anything that helps that is a win. Using a binary format between the splitter and mkgmap would speed things up immensely too, in fact we've discussed it a bit in the past on the mailing list.

Before we jump in too deep however, can you comment on how well your format has been accepted/integrated into Osmosis or any other osm projects, and how stable the file format is now? (ie, has it been battle tested? ;) If it looks like becoming a standard/supported format in the osm community then supporting it as both input and output formats in the splitter is a bit of a no-brainer, likewise as an input format for mkgmap (Steve and co, correct me if I'm wrong!). I admit I'm a bit wary of the complexity of the format and the potential maintenance effort required, especially if the format isn't used elsewhere in the osm community other than splitter and mkgmap.

Also, I see you complained about the lack of common libraries/data structures between Osmosis, splitter and mkgmap. I guess it's the way it is because they've all evolved independently with no real incentive to extract common code. If your file format were to become standardised, there's no reason why the splitter couldn't be refactored so it was based off a standard library and data structures. Hopefully other projects could too, but I'm not involved in them so I'll leave that aspect for others to comment on separately.
My multithread changes reduced the real and user CPU time by 30-50% on a UK extract. Other tests, e.g., a whole planet, show almost no effect. I think this is because a lot of my changes increase the efficiency of code that is parallelized, which paradoxically ends up reducing the relative benefits of parallelization.
I've had a look at this patch - I'd be interested to hear your thinking behind the changes you made (because I'm too lazy to try and get my head around it by studying the code!). Are you sure it's behaving as you intended? The --max-threads parameter is being ignored; instead you're creating as many threads as there are areas in a pass (up to 255). This seems to result in fairly unpredictable bursts in activity, sometimes quite aggressive, but generally most of the threads are sitting idle.

In my benchmarks I'm seeing your patch run a bit quicker and with higher CPU utilisation than the r109 code. It's about 20% faster for a UK extract, 9% faster for the planet (on a Core i7). Note that WanMil's original patch uses (cores - 1) for max-threads.

I suspect the reason the planet doesn't do so well as a smaller area is because the planet has more areas in the split and so takes longer to fill up a given 'bundle' to trigger some processing. This reduces the amount of CPU used on average, since we'll get more idle periods (when many areas are still filling their buffers), but the maximum possible work being done when buffers fill up is still throttled by the number of cores.

A couple of ideas I have here... perhaps reducing the bundle size will help keep the workers fed (especially when there is a larger number of areas), and also using a threadpool of max-threads to do the work on, rather than one thread per area, would help prevent CPU starvation for other applications, without adversely affecting performance? I know people leave the splitter running as a background task on their desktop PC so if we're periodically bursting 10+ threads at full CPU I can imagine they might get a bit annoyed at times. Thoughts?

Chris

On Fri, Jun 11, 2010 at 6:52 PM, Chris Miller <chris_overseas@hotmail.com> wrote:
Hi Scott,
This probably explains why no one from the mkgmap/splitter community noticed or commented on your original post on the osm-dev list - if they're anything like me they don't pay too much attention to what's going on there so I guess it got missed).
It was ready to get feedback, but not yet ready IMHO to be used.

As far as the splitter is concerned, I'd be happy to use it internally instead of the current cache format given the speed boost it appears to offer. I've always considered the cache to be a short lived/temporary thing anyway, purely as an aid to speeding up the split, so anything that helps that is a win.
Using a binary format between the splitter and mkgmap would speed things up immensely too, in fact we've discussed it a bit in the past on the mailing list. Before we jump in too deep however, can you comment on how well your format has been accepted/integrated into Osmosis or any other osm projects, and how stable the file format is now? (ie, has it been battle tested? ;)
No use that I know of. I got busy after my initial emails, so I haven't had time to clean it up and package it as a true osmosis plugin. After I do that, the chances that it will get more use increase, especially if other tools are engineered to use it. I've been using it for my own purposes for the last couple of months.
If it looks like becoming a standard/supported format in the osm community then supporting it as both input and output formats in the splitter is a bit of a no-brainer, likewise as an input format for mkgmap (Steve and co, correct me if I'm wrong!). I admit I'm a bit wary of the complexity of the format and the potential maintenance effort required, especially if the format isn't used elsewhere in the osm community other than splitter and mkgmap.
I've had a look at this patch - I'd be interested to hear your thinking behind the changes you made (because I'm too lazy to try and get my head around it by studying the code!). Are you sure it's behaving as you intended?
Yes.
The --max-threads parameter is being ignored; instead you're creating as many threads as there are areas in a pass (up to 255). This seems to result in fairly unpredictable bursts in activity, sometimes quite aggressive, but generally most of the threads are sitting idle.
Yes, to some extent the activity is variable. However it seems that the current design has the worker threads effectively busy-waiting, trying each of the queues to see if there is data to process?
In my benchmarks I'm seeing your patch run a bit quicker and with higher CPU utilisation than the r109 code. It's about 20% faster for a UK extract, 9% faster for the planet (on a Core i7). Note that WanMil's original patch uses (cores - 1) for max-threads.
I suspect the reason the planet doesn't do so well as a smaller area is because the planet has more areas in the split and so takes longer to fill up a given 'bundle' to trigger some processing.
I suspect it is from a different cause. A planet split has proportionally more serial code than a UK extract. On a planet, summing over each pass, each node is sent to all 800 areas to see if it is to be included. On the UK, each node is only sent to the 20 or so areas. Now that reading is so fast, I strongly suspect the system is bottlenecked in this serial code.

This reduces the amount of CPU used on average, since we'll get more idle periods (when many areas are still filling their buffers), but the maximum possible work being done when buffers fill up is still throttled by the number of cores.
I don't believe so. On a Core i7, the system is rarely using more than 2 cores, which indicates that parallel code cannot be the bottleneck. In the steady state, a bundle can be processed in a few ms (my Core 2 Duo can do 500k nodes/sec); the aggregate unprocessed work of all incomplete bundles is a second or two of CPU time. This work is not avoided, only delayed; in the steady state, this is lost in the noise. Reducing the bundle size will only change the delay before the work is done, at the cost of increasing the number of memory barriers in the concurrent queue code.

I know people leave the splitter running as a background task on their desktop PC so if we're periodically bursting 10+ threads at full CPU I can imagine they might get a bit annoyed at times. Thoughts?
Given how many cores are idle, I don't see how we could be getting bursts of 10+ threads unless the serial code gets sped up. This could be a risk if the density-map idea goes in and massively reduces the serial work. If this worries you, maybe hold off on this part of the patch until after the density map is in, or put it in and be prepared to revert later?
A couple of ideas I have here... perhaps reducing the bundle size will help keep the workers fed (especially when there is a larger number of areas), and also using a threadpool of max-threads to do the work on, rather than one thread per area, would help prevent CPU starvation for other applications, without adversely affecting performance?
If CPU starvation is a problem, can't the splitter be set to run with a lower priority? Scott

Hi Chris, hi Scott, I like new ideas and the lively discussion about your new concepts!! This is good for the evolution of the OSM project!
On Fri, Jun 11, 2010 at 6:52 PM, Chris Miller <chris_overseas@hotmail.com> wrote:
Hi Scott,
...
Using a binary format between the splitter and mkgmap would speed things up immensely too, in fact we've discussed it a bit in the past on the mailing list. Before we jump in too deep however, can you comment on how well your format has been accepted/integrated into Osmosis or any other osm projects, and how stable the file format is now? (ie, has it been battle tested? ;)
No use that I know of. I got busy after my initial emails, so I haven't had time to clean it up and package it as a true osmosis plugin. After I do that, the chances that it will get more use increase, especially if other tools are engineered to use it. I've been using it for my own purposes for the last couple of months.
So that's the typical chicken-and-egg situation. As far as I understand, mkgmap's design supports multiple input formats. At the moment most of the effort is spent on the OSM format, but it should be possible to write some new reader classes to support the new format. There will be some work to move basic handling out of the OSM reader so that it can be shared by the new reader, but the need to refactor is no reason to say no.
If it looks like becoming a standard/supported format in the osm community then supporting it as both input and output formats in the splitter is a bit of a no-brainer, likewise as an input format for mkgmap (Steve and co, correct me if I'm wrong!). I admit I'm a bit wary of the complexity of the format and the potential maintenance effort required, especially if the format isn't used elsewhere in the osm community other than splitter and mkgmap.
I've had a look at this patch - I'd be interested to hear your thinking behind the changes you made (because I'm too lazy to try and get my head around it by studying the code!). Are you sure it's behaving as you intended?
Yes.
The --max-threads parameter is being ignored; instead you're creating as many threads as there are areas in a pass (up to 255). This seems to result in fairly unpredictable bursts in activity, sometimes quite aggressive, but generally most of the threads are sitting idle.
Yes, to some extent the activity is variable. However it seems that the current design has the worker threads effectively busy-waiting, trying each of the queues to see if there is data to process?
Yes, that's true. I observed in tests that all threads could be fed with data all the time on my 4-thread system, but of course that need not be true for all systems. It's easy to write a patch that adds some sleep time to the busy-waiting when no data is available; please let me know if this patch is needed. A better fix would be to use an ExecutorService with a thread pool.
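(A minimal sketch of that ExecutorService approach, with illustrative names only: a fixed pool bounds the thread count at max-threads regardless of the number of areas, and idle workers block on the pool's queue instead of busy-waiting:)

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    class BundleProcessor {
        private final ExecutorService pool;

        BundleProcessor(int maxThreads) {
            // At most max-threads workers, however many areas a pass has.
            pool = Executors.newFixedThreadPool(maxThreads);
        }

        // Called whenever a per-area bundle fills up.
        void submit(Runnable writeBundleToArea) {
            pool.execute(writeBundleToArea);
        }

        void finish() throws InterruptedException {
            pool.shutdown();                          // accept no new bundles
            pool.awaitTermination(1, TimeUnit.HOURS); // drain the queued work
        }
    }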
In my benchmarks I'm seeing your patch run a bit quicker and with higher CPU utilisation than the r109 code. It's about 20% faster for a UK extract, 9% faster for the planet (on a Core i7). Note that WanMil's original patch uses (cores - 1) for max-threads.
One requirement I tried to achieve is the possibility to define the exact number of threads used by the splitter. I think this is interesting and important for server-hosted map providers (Lambertus, AllInOne etc.) who are not alone on their host. Splitter uses cores-1 for the writing threads to ensure that the reading thread always has one core. This might be improved by different thread priorities (but everything I've read about Java thread priorities tells me they're quite useless... do you know more?)
I suspect the reason the planet doesn't do so well as a smaller area is because the planet has more areas in the split and so takes longer to fill up a given 'bundle' to trigger some processing.
I suspect it is from a different cause. A planet split has proportionally more serial code than a UK extract. On a planet, summing over each pass, each node is sent to all 800 areas to see if it is to be included. On the UK, each node is only sent to the 20 or so areas. Now that reading is so fast, I strongly suspect the system is bottlenecked in this serial code.
Sounds like the need for a profiling session...
This reduces the amount of CPU used on average, since we'll get more idle periods (when many areas are still filling their buffers), but the maximum possible work being done when buffers fill up is still throttled by the number of cores.
I don't believe so. On a Core i7, the system is rarely using more than 2 cores, which indicates that parallel code cannot be the bottleneck. In the steady state, a bundle can be processed in a few ms (my Core 2 Duo can do 500k nodes/sec); the aggregate unprocessed work of all incomplete bundles is a second or two of CPU time. This work is not avoided, only delayed; in the steady state, this is lost in the noise. Reducing the bundle size will only change the delay before the work is done, at the cost of increasing the number of memory barriers in the concurrent queue code.
I know people leave the splitter running as a background task on their desktop PC so if we're periodically bursting 10+ threads at full CPU I can imagine they might get a bit annoyed at times. Thoughts?
Given how many cores are idle, I don't see how we could be getting bursts of 10+ threads unless the serial code gets sped up. This could be a risk if the density-map idea goes in and massively reduces the serial work. If this worries you, maybe hold off on this part of the patch until after the density map is in, or put it in and be prepared to revert later?
A couple of ideas I have here... perhaps reducing the bundle size will help keep the workers fed (especially when there is a larger number of areas), and also using a threadpool of max-threads to do the work on, rather than one thread per area, would help prevent CPU starvation for other applications, without adversely affecting performance?
If CPU starvation is a problem, can't the splitter be set to run with a lower priority?
That should be possible. But in my experience, well-dimensioned thread pools are a better solution. WanMil