Splitter --cache parameter

I've just checked in some changes to the splitter that add a new --cache parameter. This is designed to speed up the splitting process, especially on large splits that require multiple passes over the .osm file, or in situations where you run the splitter several times on the same .osm file with different parameters each time.

If you want to enable the disk caching, you specify the cache location as follows:

--cache=<directory>

This will cause the splitter to generate several files in the specified directory during the first stage of the split (the areas.list calculation). These files contain the same information as the source .osm file(s), but in an optimised format that allows subsequent passes over the data to happen much more quickly. The more passes that happen in the second stage of the split, the greater the speedup you will see. Some benchmarks on my PC have shown the following speed improvements when running against uncompressed .osm files:

1 pass - 5% faster
2 passes - 25% faster
3 passes - 35% faster
4 passes - 40% faster
5 passes - 45% faster

If you are using compressed .osm files (bz2 compression especially), the speed improvement should be greater still, since the decompression will only need to happen once rather than on each pass.

Note however that these figures are very approximate; the actual performance will vary depending on your disk and CPU speed, the particular map being processed, and what other disk and CPU activity is taking place on your PC at the same time. In some cases you might find that splits that only require a single pass run faster without the disk cache enabled.

The disk cache can also be used across multiple runs of the splitter, as long as you are splitting the same .osm file(s) each time. For example, suppose you ran the splitter as follows:

java -Xmx4000m -jar splitter.jar --cache=. --max-nodes=1500000 europe.osm

If you then run mkgmap and discover the max-nodes setting is too high, you can run the splitter again with a lower max-nodes value like so:

java -Xmx4000m -jar splitter.jar --cache=. --max-nodes=1200000

Because the cache files already exist for europe.osm as a result of the first run, there's no need to specify europe.osm on the rerun. The data will be loaded from the cache instead and the split will run much faster.

Be careful to delete the cache files if you want to rerun the splitter on a different .osm file, otherwise the previously cached data from the original .osm file will be used instead. (I'll probably add a check for this situation, but there's nothing in place to prevent it just yet.)

Note that the disk cache can require a lot of disk space, typically about 20-25% of the space the uncompressed .osm file takes up. For example, the 27GB europe.osm file generates a cache of just over 5GB.

The --cache parameter is entirely optional. If you don't specify it, the splitter will work in exactly the same way it did previously.

I hope the above explanation makes sense. Any questions, comments or suggestions are welcome.

Cheers,
Chris
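As a back-of-the-envelope aid for the 20-25% disk-space figure above, here is a tiny helper (not part of the splitter; purely illustrative arithmetic using the upper 25% bound):

```shell
#!/bin/bash
# Illustrative helper, not part of the splitter: estimate the disk space
# the cache may need from the uncompressed .osm size, using the ~25%
# upper bound quoted above. Real usage (e.g. europe.osm) came in lower,
# at roughly 19%.
cache_estimate_mb() {
    local osm_mb=$1
    echo $(( osm_mb / 4 ))    # 25% of the uncompressed size
}

cache_estimate_mb 27648       # ~27 GB europe.osm -> prints 6912 (~6.8 GB)
```

The actual europe.osm cache was "just over 5GB", so treat this as a conservative upper bound when checking free disk space.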

Something else I probably should have mentioned: enabling the disk cache does NOT reduce the memory required to perform the split, though it does make multiple passes during the second stage much quicker, and the more passes that are used (via smaller --max-areas values), the less memory is required during that stage.

I'm still looking into the best way to reduce the memory required during the first (area subdivision) stage, since this is the one thing still preventing people from splitting the planet on a 32-bit VM.

Chris

Chris Miller wrote:
Something else I probably should have mentioned. Enabling the disk cache does NOT reduce the memory required to perform the split, though it does make multiple passes during the second stage much quicker, and the more passes that are used (via smaller --max-areas values) the less memory required during that stage.
I'm still looking into the best way to reduce the memory required during the first (area subdivision) stage, since this is the one thing still preventing people from splitting the planet on a 32 bit VM.
I don't know what change made it possible, but I finally succeeded in processing all of North/South America with the latest splitter and 3.9 GB heap space. I used the cache option and max-nodes=1.2 million. I've tried this a few times before with older splitter versions, but this is the first time the split finished without fatal errors. So I finally have correct tiles for North America (at least the ones that got rendered by mkgmap successfully!): http://garmin.na1400.info/routable.php

I guess it's time for me to start hacking on an areas.kml editing tool to get rid of all those red tiles :-) Or is the node density code already functional?

L> I don't know what change made it possible, but I finally succeeded to
L> process all of North/South America with the latest splitter and 3.9
L> GB heap space. I used the cache option and max-node=1.2 million. I've
L> tried this a few times before with older splitter versions, but this
L> is the first time the split finished without fatal errors. So I
L> finally have correct tiles for North America (at least the ones that
L> got rendered by Mkgmap successfully!):
L> http://garmin.na1400.info/routable.php

That's great news. I've made various small changes to the splitter over the past couple of weeks in an attempt to squeeze out as much performance as possible while using as little memory as possible, but nothing too significant, so perhaps it's just the combination of everything that has got you under the threshold.

L> I guess it's time for me to start hacking on a areas.kml editing tool
L> to get rid of all those red tiles :-) Or is the node density code
L> already functional?

Have you tried the .kml import? I checked this in a few days ago - you can just pass in a .kml file instead of areas.list, and everything should behave as it would with the areas.list file.

The node density stuff isn't done yet, but yesterday I found time to do some fairly big internal refactorings of the splitter in preparation for it. It hopefully won't take me too much longer to get the density map working, albeit with nodes only initially. This 'nodes only' approach will still split in exactly the same way as the current splitter; the big advantage is that it will require much, much less memory to do so without any performance cost. Once I have the nodes-only approach working I'll start to look at calculating densities for the ways and relations too. At this stage I can't see how to avoid making some big performance/memory tradeoffs to get that working, though, so I'll probably keep it optional.

Chris
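The thread later mentions a --split-file parameter, so presumably the .kml goes in the same way areas.list does. A dry-run sketch of that assumption (the echo prints the command instead of running it; the flag usage is inferred from this thread, not confirmed):

```shell
#!/bin/bash
# Dry-run sketch: reuse a hand-edited areas.kml in place of areas.list.
# The --split-file usage here is an assumption based on this thread;
# remove the echo wrapper to actually invoke the splitter.
cmd="java -Xmx4000m -jar splitter.jar --split-file=areas.kml europe.osm"
echo "$cmd"
```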

Chris Miller wrote:

Hi Chris,
If you want to enable the disk caching, you specify the cache location as follows:
--cache=<directory>

The --cache parameter is entirely optional. If you don't specify it, the splitter will work in exactly the same way it did previously.
I hope the above explanation makes sense. Any questions, comments or suggestions are welcome.
I'm testing it against Switzerland.

1st run: with the option --cache enabled, these are the results (local time):

20h 20mns 17s - On va decouper la carte switzerland.osm.bz2 en plusieurs morceaux (start: "we are going to split the map switzerland.osm.bz2 into several pieces")
20h 23mns 35s - On a fini le decoupage de la carte switzerland.osm.bz2. (end: "we have finished splitting the map switzerland.osm.bz2")

2nd run: same as the 1st run, so --cache enabled, but I have not removed the previously created nodes.*, ways.* and relations.* files:

-rw-r--r-- 1 fm users     8200 2009-08-23 20:22 nodes.bin.keys
-rw-r--r-- 1 fm users 69494780 2009-08-23 20:22 nodes.bin
-rw-r--r-- 1 fm users     6481 2009-08-23 20:22 ways.bin.keys
-rw-r--r-- 1 fm users 25194598 2009-08-23 20:22 ways.bin
-rw-r--r-- 1 fm users     1312 2009-08-23 20:22 relations.bin.roles
-rw-r--r-- 1 fm users     1693 2009-08-23 20:22 relations.bin.keys
-rw-r--r-- 1 fm users   474197 2009-08-23 20:22 relations.bin

For that same run, it gives me:

20h 28mns 08s - On va decouper la carte switzerland.osm.bz2 en plusieurs morceaux (start)
20h 29mns 05s - On a fini le decoupage de la carte switzerland.osm.bz2. (end)

Problem: it doesn't write new nodes.* etc. files; the files are the same as before. It seems that the splitter sees the already created files and doesn't create new ones by overwriting the previous ones. This gave me a problem, as I first ran the splitter against Andorra ( ;-) a very tiny osm file) and then against Switzerland: the splitter used the nodes.*, ways.* and relations.* files created for Andorra.

3rd run: without the option --cache enabled:

20h 35mns 33s - On va decouper la carte switzerland.osm.bz2 en plusieurs morceaux
20h 39mns 51s - On a fini le decoupage de la carte switzerland.osm.bz2.

So the cached run is much faster.

Francois

Hi Francois, thanks for the feedback. Sounds like the caching makes quite a difference for you, which is great news. In your case I'd say that most of the gain is because the cache prevents having to uncompress the .bz2 file multiple times - that's a very time-consuming process.

As for the problem you mention with changing to different .osm files, it's something I'm already aware of and mentioned in my earlier mail:

"Be careful to delete the cache files if you want to rerun the splitter on a different .osm file, otherwise the previously cached data will be used from the original .osm file instead. (I'll probably add a check for this situation, but there's nothing in place to prevent it just yet.)"

Basically, if you specify --cache and there are already some cache files in existence, they'll get used and any .osm files that you specify on the command line will be ignored. Until I address this, you could try creating a different directory to use as a cache for each .osm file you want to process, e.g.:

java -Xmx2000m -jar splitter.jar --cache=switzerland switzerland.osm.bz2
java -Xmx2000m -jar splitter.jar --cache=andorra andorra.osm.bz2

Hope that helps,
Chris
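The per-file cache suggestion above can be wrapped in a few lines of shell, deriving the cache directory name from the input file so a rerun on a different map can never pick up stale data. A sketch (the echo makes it a dry run; remove it to invoke the splitter for real; heap size and file names are examples):

```shell
#!/bin/bash
# Sketch of the per-file cache workaround: one cache directory per
# input file. The echo prints the command instead of executing it;
# drop it to actually run the splitter.
split_with_cache() {
    local f=$1
    local dir="${f%%.osm*}.cache"    # switzerland.osm.bz2 -> switzerland.cache
    mkdir -p "$dir"
    echo java -Xmx2000m -jar splitter.jar --cache="$dir" "$f"
}

split_with_cache switzerland.osm.bz2
split_with_cache andorra.osm.bz2
```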

Chris Miller wrote:

Hello,
"Be careful to delete the cache files if you want to rerun the splitter on a different .osm file, otherwise the previously cached data will be used from the original .osm file instead. (I'll probably add a check for this situation, but there's nothing in place to prevent it just yet.)"
Oops, I read your message too fast ;-)
Basically if you specify --cache and there's already some cache files in existence, they'll get used and any .osm files that you specify on the command line will be ignored. Until I address this, you could try creating different directories to use as a cache for each .osm file you want to process. eg:
java -Xmx2000m -jar splitter.jar --cache=switzerland switzerland.osm.bz2 java -Xmx2000m -jar splitter.jar --cache=andorra andorra.osm.bz2
Until you address that, a "rm -f nodes.bin* ways.bin* relations.bin*" at the end of my batch file did the trick. ;-)

Thank you again, Chris.

Francois

Hi Francois,

Have a go with splitter r77. It should now detect what a cache from a previous splitter run contains, and will then reuse or regenerate it as appropriate for the parameters you have provided to the current run. I've tested just about every combination of --split-file, --cache and .osm file parameters I can think of, so hopefully it'll do the right thing in all situations now. Give me a shout if you have any problems.

Chris
"Be careful to delete the cache files if you want to rerun the splitter on a different .osm file, otherwise the previously cached data will be used from the original .osm file instead. (I'll probably add a check for this situation, but there's nothing in place to prevent it just yet.)"
f> Oups, I read your message too fast ;-) f>
Basically if you specify --cache and there's already some cache files in existence, they'll get used and any .osm files that you specify on the command line will be ignored. Until I address this, you could try creating different directories to use as a cache for each .osm file you want to process. eg:
java -Xmx2000m -jar splitter.jar --cache=switzerland switzerland.osm.bz2 java -Xmx2000m -jar splitter.jar --cache=andorra andorra.osm.bz2
f> Until you address that, a "rm -f nodes.bin* ways.bin* relations.bin*" f> at f> the end of my batch file did the trick. ;-)

Hi Chris,

I haven't tested the --cache parameter yet, but I have written some tools that have dealt with the same problem.

On Wed, Aug 26, 2009 at 12:00:50AM +0000, Chris Miller wrote:
Hi Francois,
Have a go with splitter r77. It should now detect what a cache from a previous splitter run contains. It will then reuse or regenerate it as is appropriate for the parameters you have provided to the current run. I've tested just about every combination of --split-file, --cache, and .osm file parameters I can think of, hopefully it'll do the right thing in all situations now. Give me a shout if you have any problems.
Are you caching the command line parameters and the file sizes and time stamps of all input files? That should be rather safe. To be even safer, you should perhaps also cache the splitter revision number.

Marko

Hi Marko,

MM> Are you caching the command line parameters and the file sizes and
MM> time stamps of all input files? That should be rather safe. To be
MM> even safer, you should perhaps also cache the splitter revision
MM> number.

Yes, I cache the file size, timestamp and canonical path of each .osm input file. I don't cache the command line parameters because they don't affect the content of the cache; the cache is really just an optimised copy of the input .osm files.

I don't take the splitter revision number into account. However, if I need to make a breaking change to the cache file format, my intention is to add version numbering to the cache format itself rather than use the splitter version. That way, new versions of the splitter won't require the cache to be regenerated unless there has been a breaking change to the cache format.

Chris
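The fingerprinting idea described above can be sketched in a few lines of shell (illustrative only - this is not the splitter's actual Java code; it assumes GNU coreutils stat and readlink):

```shell
#!/bin/bash
# Illustrative sketch of the validity check described above: fingerprint
# each input file by size, mtime and canonical path when the cache is
# built, then compare the stored fingerprint on the next run.
fingerprint() {
    stat -c '%s %Y' "$1"    # size in bytes, mtime as epoch seconds
    readlink -f "$1"        # canonical path
}

build_cache_meta() {        # $1 = cache dir, $2 = input .osm file
    mkdir -p "$1"
    fingerprint "$2" > "$1/cache.meta"
    # ... real cache generation would go here ...
}

cache_is_valid() {          # $1 = cache dir, $2 = input .osm file
    [ -f "$1/cache.meta" ] && fingerprint "$2" | cmp -s - "$1/cache.meta"
}
```

Any change to the input file's size, timestamp or location then invalidates the cache, which matches the behaviour Chris describes.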

Chris Miller wrote:

Hello Chris,

Sorry for the delay in replying; I'm just back from work.
Have a go with splitter r77. It should now detect what a cache from a previous splitter run contains. It will then reuse or regenerate it as is appropriate for the parameters you have provided to the current run. I've tested just about every combination of --split-file, --cache, and .osm file parameters I can think of, hopefully it'll do the right thing in all situations now. Give me a shout if you have any problems.
I tested it and it works great; everything is OK on my side. Thank you, Chris.

Francois

Chris Miller wrote:
If you are using compressed .osm files (bz2 compression especially), the speed improvement should be greater still, since the decompression will only need to happen once rather than on each pass.
Does that mean that with --cache the splitter will take the same time whether the input osm files are bz2-compressed or uncompressed? Thanks for this new feature.

Cheers,
Carlos

CD> Does it mean with --cache splitter will take the same time using bz2
CD> compressed or uncompressed input osm files?
CD> Thanks for this new feature.
CD> Cheers Carlos

The first time you use the --cache option, the splitter still needs to uncompress the osm file once so it can parse it and generate the cache. As a result, the initial cache creation will still take longer with a compressed osm file than with an uncompressed one. Once the cache has been built, however, the splitter doesn't need the osm file any more, so the second stage of the split will take the same amount of time regardless of the compression. Additionally, if you run the splitter and the cache already exists from a previous run, the splitter will run very quickly regardless of the compression, because it doesn't need the osm file at all in this scenario.

If you don't use --cache, the splitter has to uncompress the bz2 file at least twice during the split (once during the first stage, once or more during the second), which slows things down significantly.

Based on the tests I've done, I'd recommend always using --cache in the following situations:

- if you are splitting compressed osm files
- if you intend to run the splitter more than once on the same osm file
- if you are splitting a file into lots of areas and the second stage of the split requires more than one pass

The one situation where --cache might not make sense is if you are doing a one-off split on an uncompressed osm file and the second stage only requires a single pass to write out the split files. In this situation --cache isn't much faster and may even slow things down a fraction, since the overhead of creating the cache can outweigh the benefits it provides.

Chris
participants (5)

- Carlos Dávila
- Chris Miller
- frmas
- Lambertus
- Marko Mäkelä