splitter - performance and memory

First off, let me say a quick thanks to everyone involved in this project. I've only just discovered it recently but wish I had found it much earlier - it really is incredible to see what has been done so far.

I've downloaded the complete planet-090715.osm.bz2 file and have been looking at splitting it. I read the description and limitations of the splitter.jar tool but decided to give it a go anyway, since I have a 64-bit OS with 6 GB of RAM. Unfortunately it still failed with a -Xmx5200m heap. I have a 16 GB machine at work that might cope, but instead I decided to take a look at the source code to see if there's any possibility of reducing the memory requirements.

I've only spent a short time looking at the code, but as far as I can tell the whole first step (computing the areas.list file) is using far more memory than it actually needs. The SplitIntMap (which is what takes up all the memory) isn't required here, for two reasons. One is that the code never retrieves entries via .get(); it only uses an iterator, so a list/array would suffice. Second, the node IDs aren't used in this stage, so we don't even need to parse them, let alone hold on to them. Assuming we replace the SplitIntMap with a wrapper around an int[] (or multiple int[]s, to mitigate the double-memory-on-copy problem), we'd be looking at memory savings of >50%.

Does that make sense, or have I missed something? If it sounds sensible I'd be happy to have a go at implementing it. Also, given the nature of the algorithm, it wouldn't be too hard on performance if the lat+long values were written out to disk rather than held in memory, which would mean splitting the whole dataset would be feasible even on a 32-bit machine.

I haven't yet looked at possibilities for tuning the second step, but I assume that some sort of map/lookup is still required there. I figure there are a few options: perform multiple passes, processing a subset of the splits at a time (limited by the total number of nodes we can hold in memory); optimise the existing data structures further; page some out to disk; etc.

I was also thinking a little about performance. Given the enormous size of the full .osm file, I'd suggest a move away from SAX to a pull parser (http://www.extreme.indiana.edu/xgws/xsoap/xpp/mxp1/index.html). It's even faster than SAX and uses very little memory; in my job we use it to parse many GB of XML daily with very good results. Another idea is to parallelise the code by running parts of the split on different threads to take advantage of multi-core CPUs. Possibly the best gain here would be when writing the files, since gzip compression is fairly CPU intensive.

What do people think? I'm happy to work on the above though I must confess up front that my spare time is quite limited so please don't expect too much too soon!

Chris
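For illustration, the multiple-int[] idea might look something like the chunked list below (a minimal sketch; class and constant names are hypothetical, not the actual patch):

import java.util.ArrayList;
import java.util.List;

// Growable int list backed by fixed-size chunks. Growing adds one chunk at a
// time, so there is never a moment where both an old and a new full-size
// backing array are live (the double-memory-on-copy problem of one int[]).
public class ChunkedIntList {
    private static final int CHUNK_SIZE = 1 << 20; // 1M ints = 4 MB per chunk

    private final List<int[]> chunks = new ArrayList<int[]>();
    private int size;

    public void add(int value) {
        if (size % CHUNK_SIZE == 0)
            chunks.add(new int[CHUNK_SIZE]); // grow by one chunk, no copying
        chunks.get(size / CHUNK_SIZE)[size % CHUNK_SIZE] = value;
        size++;
    }

    public int get(int index) {
        return chunks.get(index / CHUNK_SIZE)[index % CHUNK_SIZE];
    }

    public int size() {
        return size;
    }
}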

By all means go ahead. I think many people would welcome a splitter that can handle the whole world. I think Steve Ratcliffe also announced that he would be working on a new version, but he will no doubt respond himself. If your expectation of less than 50% memory usage is correct then you should be able to handle the world on your 6 GB machine, as North and South America (~2/3 of the data) can be split using 8 GB of RAM with the current splitter.

If your expectation of less than 50% memory usage is correct then you should be able to handle the world on your 6 GB machine, as North and South America (~2/3 of the data) can be split using 8 GB of RAM with the current splitter.
Unfortunately the estimated 50% saving is only for the first stage; the split will still fail a bit later, when it creates the split files. I'll try to look at addressing that too - there are a few different options. Thanks for the info about the memory requirements, useful to know - I don't have any real data myself yet!

Hi
What do people think? I'm happy to work on the above though I must confess up front that my spare time is quite limited so please don't expect too much too soon!
It all sounds good. Well worth trying each idea out to see if it makes an improvement.

Although .get() is not used at the moment, I think that counting the ways in each area may be more accurate, in which case you will need the node IDs again.

I also have plans to remove the limitation that an element can only be in four areas.

Regards,
..Steve

It all sounds good. Well worth trying each idea out to see if it makes an improvement.
OK thanks, I'll go ahead with some of those changes then and post a patch or two once I have something useful for you to look at. I've already created a SplitIntList to replace the SplitIntMap and am running it now with the whole dataset and a 5GB heap. Will let you know what happens...

So far it's been running ~30 mins and appears to be using ~40% of the memory that SplitIntMap was, which is about in line with what I'd expect. One CPU core is pegged at ~70%, I'm assuming due to the XML parsing, though I haven't profiled it yet to verify that (I'm running against an uncompressed osm file). If that is indeed the case, the next thing I'll try is replacing SAX with XPP to see what improvement that brings.
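The pull-parsing loop would be roughly this shape (a sketch against the XmlPull API that MXP1 implements; not the actual splitter code):

import java.io.FileReader;
import org.xmlpull.v1.XmlPullParser;
import org.xmlpull.v1.XmlPullParserFactory;

// Stream through the .osm file and pull out node coordinates. Unlike SAX
// there are no callbacks; the loop asks the parser for the next event.
public class PullParseNodes {
    public static void main(String[] args) throws Exception {
        XmlPullParser parser = XmlPullParserFactory.newInstance().newPullParser();
        parser.setInput(new FileReader(args[0]));

        for (int event = parser.getEventType();
                 event != XmlPullParser.END_DOCUMENT;
                 event = parser.next()) {
            if (event == XmlPullParser.START_TAG && "node".equals(parser.getName())) {
                double lat = Double.parseDouble(parser.getAttributeValue(null, "lat"));
                double lon = Double.parseDouble(parser.getAttributeValue(null, "lon"));
                // feed lat/lon into the area-calculation pass here
            }
        }
    }
}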
Although .get() is not used at the moment, I think that counting the ways in each area may be more accurate, in which case you will need the node IDs again.
I'm at a bit of a disadvantage here as I don't yet really know much about the file format (though I'm looking at http://wiki.openstreetmap.org/wiki/OSM_Protocol_Version_0.6 and the raw XML currently). You're saying that splitting based on the number of ways, rather than the number of nodes, would provide a more balanced split? Hmm, OK - in that case it will probably make more sense in the long run to build an index on disk. I'll hopefully get around to looking at something like that for the second step anyway, and it should be reusable for both.
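One simple shape such a disk index could take, assuming node IDs are dense enough that fixed-width records addressed by ID don't waste too much space (names hypothetical):

import java.io.IOException;
import java.io.RandomAccessFile;

// Fixed-width on-disk node index: a record's offset is nodeId * RECORD_SIZE,
// so a lookup is one seek plus a read, and nothing needs to stay in memory.
public class DiskNodeIndex {
    private static final int RECORD_SIZE = 8; // lat (int) + lon (int)

    private final RandomAccessFile file;

    public DiskNodeIndex(String path) throws IOException {
        file = new RandomAccessFile(path, "rw");
    }

    public void put(long nodeId, int lat, int lon) throws IOException {
        file.seek(nodeId * RECORD_SIZE);
        file.writeInt(lat);
        file.writeInt(lon);
    }

    public int readLat(long nodeId) throws IOException {
        file.seek(nodeId * RECORD_SIZE);
        return file.readInt();
    }

    public int readLon(long nodeId) throws IOException {
        file.seek(nodeId * RECORD_SIZE + 4);
        return file.readInt();
    }
}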
I also have plans to remove the limitation that an element can only be in four areas.
OK - I haven't dug deep enough into the code to understand this limitation yet but I don't think any of the changes I'm considering will affect this. Cheers, Chris

Good news - with my changes the areas.list file was generated fine, peaking at maybe a 3GB heap. I've attached a patch with the changes. Note there's also a unit test which requires TestNG, though you can just exclude that from your IDEA project if you like. I also use IDEA but haven't included the changes I made to the IDEA project files - not really sure what's best to do wrt that. I've also attached the areas.list file that was generated when run against the full dataset, in case that's useful to anyone. Chris

On 19/07/09 23:58, Chris Miller wrote:
Good news - with my changes the areas.list file was generated fine, peaking at maybe a 3GB heap.
Cool.

It is also worth noting something that osmcut (another map splitter) does. It has two modes: one where a map is used when the input file is small, and another where a list is used and the node ID is the index into the list. The latter is used for splitting the whole planet file.

The second method uses less memory whenever the memory wasted on unused node IDs is less than the memory spent storing the node IDs themselves plus the free-space overhead of the hash maps. As currently most nodes are in use (about 88%), this is a big win on the whole planet file, while still giving access to the node IDs.

I've committed the patch.

..Steve
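The two modes amount to something like this (a sketch of the idea, not osmcut's actual code; the packed lat/lon value is an assumption):

import java.util.HashMap;
import java.util.Map;

interface NodeStore {
    void put(int nodeId, int packedLatLon);
    int get(int nodeId);
}

// Small extracts: only pay for the nodes actually present, at the cost of
// boxing plus hash-map free space per entry.
class MapNodeStore implements NodeStore {
    private final Map<Integer, Integer> map = new HashMap<Integer, Integer>();
    public void put(int nodeId, int packedLatLon) { map.put(nodeId, packedLatLon); }
    public int get(int nodeId) { return map.get(nodeId); }
}

// Planet file: with ~88% of IDs in use, a flat array indexed by node ID
// wastes less on the ~12% of empty slots than the map does on overhead,
// and the ID is still available implicitly as the index.
class ArrayNodeStore implements NodeStore {
    private final int[] values;
    ArrayNodeStore(int maxNodeId) { values = new int[maxNodeId + 1]; }
    public void put(int nodeId, int packedLatLon) { values[nodeId] = packedLatLon; }
    public int get(int nodeId) { return values[nodeId]; }
}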

It is also worth noting something that osmcut (another map splitter) does. It has two modes: one where a map is used when the input file is small, and another where a list is used and the node ID is the index into the list. The latter is used for splitting the whole planet file.
Ah I didn't even realise there was another map splitter! Thanks for the pointer and info. I've already been considering various ways to trade off memory & performance based on the size and content of the dataset (including using an indexed array), but comparing my ideas with what osmcut does looks helpful.
The second method uses less memory whenever the memory wasted on unused node IDs is less than the memory spent storing the node IDs themselves plus the free-space overhead of the hash maps. As currently most nodes are in use (about 88%), this is a big win on the whole planet file, while still giving access to the node IDs.
Makes sense. From what I can see osmcut appears to be only subdividing into evenly sized tiles though, rather than quadtree partitioning. The advantage they have there is they only need to hold 2 bytes/node (as a tile ID) rather than 4 bytes/node (lat+lon), and they don't need to perform a secondary pass to determine which way belongs in which tile. That's OK though, it's a more interesting challenge taking the quadtree approach :)
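A fixed grid can get away with 2 bytes per node because the tile ID is just the cell's position in a row-major grid, e.g. (illustrative only, assuming at most 65536 cells):

// Map a coordinate straight to a tile ID in a regular grid. With
// cols * rows <= 65536 the ID fits in a short, i.e. 2 bytes per node,
// and no second pass is needed to assign ways to tiles.
public class TileGrid {
    private final double minLat;
    private final double minLon;
    private final double cellSize; // degrees per cell
    private final int cols;

    public TileGrid(double minLat, double minLon, double cellSize, int cols) {
        this.minLat = minLat;
        this.minLon = minLon;
        this.cellSize = cellSize;
        this.cols = cols;
    }

    public short tileId(double lat, double lon) {
        int row = (int) ((lat - minLat) / cellSize);
        int col = (int) ((lon - minLon) / cellSize);
        return (short) (row * cols + col);
    }
}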
I've committed the patch.
Thanks! The next patch I produce will likely include a third-party XPP parser jar file and associated changes to the IDEA project file. What's the best way to deal with this - create a build.xml so non-IDEA users can build it too? Chris
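A minimal build.xml along these lines would do it (a sketch only; the paths, jar names, and main class are assumptions about the project layout, not the actual files):

<project name="splitter" default="dist" basedir=".">
    <target name="compile">
        <mkdir dir="build/classes"/>
        <javac srcdir="src" destdir="build/classes" classpath="lib/xpp3.jar"/>
    </target>

    <target name="dist" depends="compile">
        <jar destfile="dist/splitter.jar" basedir="build/classes">
            <manifest>
                <attribute name="Main-Class" value="uk.me.parabola.splitter.Main"/>
            </manifest>
        </jar>
    </target>
</project>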

Hi Steve,

I have some questions about what you think should be the correct behaviour of the splitter in certain situations:

1) If a way covers more than one area, should ALL nodes for the way belong to all of those areas, or should each area only contain the nodes that lie within it, regardless of ways? Currently the latter seems to be the case but that feels wrong to me.

2) Does the XML have to contain nodes first, then ways, then relations, or can they be intermingled? (I think osmcut produced files like this.) If it is permitted, what are the consequences of intermingling them for other tools etc.? I know that as it stands splitter.jar expects them to be in node/way/relation order and wouldn't work very well with intermingled input, especially forward references, but if it's not a big deal then it may help the performance of the splitter (especially if we should be dealing with 1 above).

Chris

Hi Chris

On 22/07/09 12:17, Chris Miller wrote:
1) If a way covers more than one area, should ALL nodes for the way belong to all of those areas, or should each area only contain the nodes that lie within it, regardless of ways? Currently the latter seems to be the case but that feels wrong to me.
You need at least one node outside the area for each time a way crosses the boundary, so that mkgmap can cut the way on the boundary. Otherwise you need to manufacture the nodes that sit exactly on the boundary yourself. The latter option is practically impossible(*), as you need to know whether something is a line or a polygon to do it correctly. The current splitter works by including all nodes in an extended area and letting mkgmap split the ways on the declared boundary. So it could fail if the spacing between nodes is larger than the extra overlapping area. Fortunately, large node spacings are impractical for all sorts of tools, so they don't occur much.
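The extended-area approach boils down to a bounds test with a margin, something like this (field names and the fixed-margin choice are assumptions, not the splitter's actual code):

// Keep every node inside the tile's declared bounds grown by an overlap
// margin, so mkgmap has a node on each side of the boundary to cut against.
// This fails only if consecutive way nodes are further apart than the margin.
public class ExtendedArea {
    private final int minLat, minLon, maxLat, maxLon; // declared bounds
    private final int overlap; // extra margin, same units as the bounds

    public ExtendedArea(int minLat, int minLon, int maxLat, int maxLon, int overlap) {
        this.minLat = minLat;
        this.minLon = minLon;
        this.maxLat = maxLat;
        this.maxLon = maxLon;
        this.overlap = overlap;
    }

    public boolean contains(int lat, int lon) {
        return lat >= minLat - overlap && lat <= maxLat + overlap
            && lon >= minLon - overlap && lon <= maxLon + overlap;
    }
}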
2) Does the XML have to contain nodes first, then ways, then relations, or can they be intermingled? (I think osmcut produced files like this.) If it is permitted, what are the consequences of intermingling them for other tools etc.? I know that as it stands splitter.jar expects them to be in node/way/relation order
Well, for maximum compatibility they should be in that order. Splitter doesn't actually require it, though, as long as nodes occur before the ways that need them. I added that tolerance because there is some software that intermingles them like that.

(*) I am considering being able to read a style file into splitter, so that you could much more accurately estimate the size of the resulting files. In that case you would know whether a way is a polygon in the given style.

..Steve

It is good to have a more accurate mechanism for determining the optimal tile size but, imho, it is better to have a splitter that can handle the planet dump on more moderate machines. The advantages of more people providing Garmin maps outweigh the little extra handwork to get all the tiles rendered and still have reasonably large tile sizes. So I would really welcome a version of splitter that uses less memory.

People who are looking for (partially) hand-optimized area lists can download them from http://garmin.na1400.info/routable.php (scroll down to the bottom of that page).

It is good to have a more accurate mechanism for determining the optimal tile size but, imho, it is better to have a splitter that can handle the planet dump on more moderate machines. The advantages of more people providing Garmin maps outweigh the little extra handwork to get all the tiles rendered and still have reasonably large tile sizes. So I would really welcome a version of splitter that uses less memory.
I can't see any reason why the splitter can't be optimised to the point where it will handle the planet file on any modest 32-bit machine, and that is my eventual goal. My first round of changes are just quick fixes really, but already they've given good results. With the patch I posted you should be able to generate the areas.list file slightly faster and with far less memory than before (the actual splitting hasn't changed yet though). Last night I also moved the code over to using a different XML parser, which has resulted in some further gains.

Here's where I currently stand when splitting the united_kingdom.osm.bz2 file (165MB) on a Core i7 CPU:

Original splitter.jar - generate areas.list: 4m 24s
Original splitter.jar - splitting stage: 6m 02s
Original splitter.jar - total time taken: 10m 26s

New splitter.jar - generate areas.list: 3m 20s (~60% memory reduction for this stage)
New splitter.jar - splitting stage: 5m 41s (still requires lots of memory)
New splitter.jar - total time taken: 9m 01s

I've got plenty of ideas on ways to further reduce memory and hopefully also improve performance - stay tuned.

Chris