Splitter --cache parameter

I've just checked in some changes to the splitter that add a new --cache parameter. This is designed to speed up the splitting process, especially on large splits that require multiple passes over the .osm file, or in situations where you run the splitter several times on the same .osm file with different parameters each time.

If you want to enable the disk caching, you specify the cache location as follows:

--cache=<directory>

This will cause the splitter to generate several files in the specified directory during the first stage of the split (the areas.list calculation). These files contain the same information as the source .osm file(s), but in an optimised format that allows subsequent passes over the data to happen much more quickly. The more passes that happen in the second stage of the split, the greater the speedup you will see. Some benchmarks on my PC have shown the following speed improvements when running against uncompressed .osm files:

1 pass - 5% faster
2 passes - 25% faster
3 passes - 35% faster
4 passes - 40% faster
5 passes - 45% faster

If you are using compressed .osm files (bz2 compression especially), the speed improvement should be greater still, since the decompression will only need to happen once rather than on each pass.

Note however that these figures are very approximate; the actual performance will vary depending on your disk and CPU speed, the particular map being processed, and what other disk and CPU activity is taking place on your PC at the same time. In some cases you might find that splits that only require a single pass run faster without the disk cache enabled.

The disk cache can also be used across multiple runs of the splitter, as long as you are splitting the same .osm file(s) each time. For example, suppose you ran the splitter as follows:

java -Xmx4000m -jar splitter.jar --cache=. --max-nodes=1500000 europe.osm

If you then run mkgmap and discover the max-nodes setting is too high, you can run the splitter again with a lower max-nodes value like so:

java -Xmx4000m -jar splitter.jar --cache=. --max-nodes=1200000

Because the cache files already exist for europe.osm as a result of the first run, there's no need to specify europe.osm on the rerun. The data will be loaded from the cache instead and the split will run much faster.

Be careful to delete the cache files if you want to rerun the splitter on a different .osm file, otherwise the previously cached data from the original .osm file will be used instead. (I'll probably add a check for this situation, but there's nothing in place to prevent it just yet.)

Note that the disk cache can require a lot of disk space, typically about 20-25% of the space the uncompressed .osm file takes up. For example, the 27GB europe.osm file generates a cache of just over 5GB.

The --cache parameter is entirely optional. If you don't specify it, the splitter will work in exactly the same way it did previously.

I hope the above explanation makes sense. Any questions, comments or suggestions are welcome.

Cheers,
Chris
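As a back-of-the-envelope aid for the 20-25% disk-space figure above, here is a tiny helper (not part of the splitter; purely illustrative arithmetic using the upper 25% bound):

```shell
#!/bin/bash
# Illustrative helper, not part of the splitter: estimate the disk space
# the cache may need from the uncompressed .osm size, using the ~25%
# upper bound quoted above. Real usage (e.g. europe.osm) came in lower,
# at roughly 19%.
cache_estimate_mb() {
    local osm_mb=$1
    echo $(( osm_mb / 4 ))    # 25% of the uncompressed size
}

cache_estimate_mb 27648       # ~27 GB europe.osm -> prints 6912 (~6.8 GB)
```

The actual europe.osm cache was "just over 5GB", so treat this as a conservative upper bound when checking free disk space.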

Something else I probably should have mentioned: enabling the disk cache does NOT reduce the memory required to perform the split, though it does make multiple passes during the second stage much quicker, and the more passes that are used (via smaller --max-areas values), the less memory is required during that stage.

I'm still looking into the best way to reduce the memory required during the first (area subdivision) stage, since this is the one thing still preventing people from splitting the planet on a 32-bit VM.

Chris

Chris Miller wrote:
Something else I probably should have mentioned. Enabling the disk cache does NOT reduce the memory required to perform the split, though it does make multiple passes during the second stage much quicker, and the more passes that are used (via smaller --max-areas values) the less memory required during that stage.
I'm still looking into the best way to reduce the memory required during the first (area subdivision) stage, since this is the one thing still preventing people from splitting the planet on a 32 bit VM.
I don't know what change made it possible, but I finally succeeded in processing all of North/South America with the latest splitter and 3.9 GB heap space. I used the cache option and max-nodes=1.2 million. I've tried this a few times before with older splitter versions, but this is the first time the split finished without fatal errors. So I finally have correct tiles for North America (at least the ones that got rendered by mkgmap successfully!): http://garmin.na1400.info/routable.php

I guess it's time for me to start hacking on an areas.kml editing tool to get rid of all those red tiles :-) Or is the node density code already functional?

L> I don't know what change made it possible, but I finally succeeded to
L> process all of North/South America with the latest splitter and 3.9
L> GB heap space. I used the cache option and max-node=1.2 million. I've
L> tried this a few times before with older splitter versions, but this
L> is the first time the split finished without fatal errors. So I
L> finally have correct tiles for North America (at least the ones that
L> got rendered by Mkgmap successfully!):
L> http://garmin.na1400.info/routable.php

That's great news. I've made various small changes to the splitter over the past couple of weeks in an attempt to squeeze out as much performance as possible while using as little memory as possible, but nothing too significant, so perhaps it's just the combination of everything that has got you under the threshold.

L> I guess it's time for me to start hacking on a areas.kml editing tool
L> to get rid of all those red tiles :-) Or is the node density code
L> already functional?

Have you tried the .kml import? I checked this in a few days ago - you can just pass in a .kml file instead of areas.list, and everything should behave as it would with the areas.list file.

The node density stuff isn't done yet, but yesterday I found time to do some fairly big internal refactorings of the splitter in preparation for it. It hopefully won't take me too much longer to get the density map working, albeit with nodes only initially. This 'nodes only' approach will still split in exactly the same way as the current splitter; the big advantage is that it will require much, much less memory to do so without any performance cost. Once I have the nodes-only approach working I'll start to look at calculating densities for the ways and relations too. At this stage I can't see how to avoid making some big performance/memory tradeoffs to get that working, though, so I'll probably keep it optional.

Chris
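The thread later mentions a --split-file parameter, so presumably the .kml goes in the same way areas.list does. A dry-run sketch of that assumption (the echo prints the command instead of running it; the flag usage is inferred from this thread, not confirmed):

```shell
#!/bin/bash
# Dry-run sketch: reuse a hand-edited areas.kml in place of areas.list.
# The --split-file usage here is an assumption based on this thread;
# remove the echo wrapper to actually invoke the splitter.
cmd="java -Xmx4000m -jar splitter.jar --split-file=areas.kml europe.osm"
echo "$cmd"
```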

Chris Miller wrote:

Hi Chris,
If you want to enable the disk caching, you specify the cache location as follows:
--cache=<directory>

The --cache parameter is entirely optional. If you don't specify it, the splitter will work in exactly the same way it did previously.
I hope the above explanation makes sense. Any questions, comments or suggestions are welcome.
I'm testing it against Switzerland.

1st run: with the option --cache enabled, these are the results (local time):

20h 20mns 17s - On va decouper la carte switzerland.osm.bz2 en plusieurs morceaux (start: "we are going to split the map switzerland.osm.bz2 into several pieces")
20h 23mns 35s - On a fini le decoupage de la carte switzerland.osm.bz2. (end: "we have finished splitting the map switzerland.osm.bz2")

2nd run: same as the 1st run, so --cache enabled, but I have not removed the previously created nodes.*, ways.* and relations.* files:

-rw-r--r-- 1 fm users     8200 2009-08-23 20:22 nodes.bin.keys
-rw-r--r-- 1 fm users 69494780 2009-08-23 20:22 nodes.bin
-rw-r--r-- 1 fm users     6481 2009-08-23 20:22 ways.bin.keys
-rw-r--r-- 1 fm users 25194598 2009-08-23 20:22 ways.bin
-rw-r--r-- 1 fm users     1312 2009-08-23 20:22 relations.bin.roles
-rw-r--r-- 1 fm users     1693 2009-08-23 20:22 relations.bin.keys
-rw-r--r-- 1 fm users   474197 2009-08-23 20:22 relations.bin

For that same run, it gives me:

20h 28mns 08s - On va decouper la carte switzerland.osm.bz2 en plusieurs morceaux (start)
20h 29mns 05s - On a fini le decoupage de la carte switzerland.osm.bz2. (end)

Problem: it doesn't write new nodes.* etc. files; the files are the same as before. It seems that the splitter sees the already created files and doesn't create new ones by overwriting the previous ones. This gave me a problem, as I first ran the splitter against Andorra ( ;-) a very tiny osm file) and then against Switzerland: the splitter used the nodes.*, ways.* and relations.* files created for Andorra.

3rd run: without the option --cache enabled:

20h 35mns 33s - On va decouper la carte switzerland.osm.bz2 en plusieurs morceaux
20h 39mns 51s - On a fini le decoupage de la carte switzerland.osm.bz2.

So the cached run is much faster.

Francois

Hi Francois, thanks for the feedback. Sounds like the caching makes quite a difference for you, which is great news. In your case I'd say that most of the gain is because the cache prevents having to uncompress the .bz2 file multiple times - that's a very time-consuming process.

As for the problem you mention with changing to different .osm files, it's something I'm already aware of and mentioned in my earlier mail:

"Be careful to delete the cache files if you want to rerun the splitter on a different .osm file, otherwise the previously cached data will be used from the original .osm file instead. (I'll probably add a check for this situation, but there's nothing in place to prevent it just yet.)"

Basically, if you specify --cache and there are already some cache files in existence, they'll get used and any .osm files that you specify on the command line will be ignored. Until I address this, you could try creating a different directory to use as a cache for each .osm file you want to process, e.g.:

java -Xmx2000m -jar splitter.jar --cache=switzerland switzerland.osm.bz2
java -Xmx2000m -jar splitter.jar --cache=andorra andorra.osm.bz2

Hope that helps,
Chris
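The per-file cache suggestion above can be wrapped in a few lines of shell, deriving the cache directory name from the input file so a rerun on a different map can never pick up stale data. A sketch (the echo makes it a dry run; remove it to invoke the splitter for real; heap size and file names are examples):

```shell
#!/bin/bash
# Sketch of the per-file cache workaround: one cache directory per
# input file. The echo prints the command instead of executing it;
# drop it to actually run the splitter.
split_with_cache() {
    local f=$1
    local dir="${f%%.osm*}.cache"    # switzerland.osm.bz2 -> switzerland.cache
    mkdir -p "$dir"
    echo java -Xmx2000m -jar splitter.jar --cache="$dir" "$f"
}

split_with_cache switzerland.osm.bz2
split_with_cache andorra.osm.bz2
```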

Chris Miller wrote:

Hello,
"Be careful to delete the cache files if you want to rerun the splitter on a different .osm file, otherwise the previously cached data will be used from the original .osm file instead. (I'll probably add a check for this situation, but there's nothing in place to prevent it just yet.)"
Oops, I read your message too fast ;-)
Basically if you specify --cache and there's already some cache files in existence, they'll get used and any .osm files that you specify on the command line will be ignored. Until I address this, you could try creating different directories to use as a cache for each .osm file you want to process. eg:
java -Xmx2000m -jar splitter.jar --cache=switzerland switzerland.osm.bz2 java -Xmx2000m -jar splitter.jar --cache=andorra andorra.osm.bz2
Until you address that, a "rm -f nodes.bin* ways.bin* relations.bin*" at the end of my batch file did the trick. ;-)

Thank you again, Chris.

Francois

Hi Francois,

Have a go with splitter r77. It should now detect what a cache from a previous splitter run contains, and will then reuse or regenerate it as appropriate for the parameters you have provided to the current run. I've tested just about every combination of --split-file, --cache and .osm file parameters I can think of, so hopefully it'll do the right thing in all situations now. Give me a shout if you have any problems.

Chris
"Be careful to delete the cache files if you want to rerun the splitter on a different .osm file, otherwise the previously cached data will be used from the original .osm file instead. (I'll probably add a check for this situation, but there's nothing in place to prevent it just yet.)"
f> Oups, I read your message too fast ;-) f>
Basically if you specify --cache and there's already some cache files in existence, they'll get used and any .osm files that you specify on the command line will be ignored. Until I address this, you could try creating different directories to use as a cache for each .osm file you want to process. eg:
java -Xmx2000m -jar splitter.jar --cache=switzerland switzerland.osm.bz2 java -Xmx2000m -jar splitter.jar --cache=andorra andorra.osm.bz2
f> Until you address that, a "rm -f nodes.bin* ways.bin* relations.bin*" f> at f> the end of my batch file did the trick. ;-)

Hi Chris,

I haven't tested the --cache parameter yet, but I have written some tools that have dealt with the same problem.

On Wed, Aug 26, 2009 at 12:00:50AM +0000, Chris Miller wrote:
Hi Francois,
Have a go with splitter r77. It should now detect what a cache from a previous splitter run contains. It will then reuse or regenerate it as is appropriate for the parameters you have provided to the current run. I've tested just about every combination of --split-file, --cache, and .osm file parameters I can think of, hopefully it'll do the right thing in all situations now. Give me a shout if you have any problems.
Are you caching the command line parameters and the file sizes and time stamps of all input files? That should be rather safe. To be even safer, you should perhaps also cache the splitter revision number.

Marko

Hi Marko,

MM> Are you caching the command line parameters and the file sizes and
MM> time stamps of all input files? That should be rather safe. To be
MM> even safer, you should perhaps also cache the splitter revision
MM> number.

Yes, I cache the file size, timestamp and canonical path of each .osm input file. I don't cache the command line parameters because they don't affect the content of the cache; the cache is really just an optimised copy of the input .osm files.

I don't take the splitter revision number into account. However, if I need to make a breaking change to the cache file format, my intention is to add version numbering to the cache format itself rather than use the splitter version. That way, new versions of the splitter won't require the cache to be regenerated unless there has been a breaking change to the cache format.

Chris
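The fingerprinting idea described above can be sketched in a few lines of shell (illustrative only - this is not the splitter's actual Java code; it assumes GNU coreutils stat and readlink):

```shell
#!/bin/bash
# Illustrative sketch of the validity check described above: fingerprint
# each input file by size, mtime and canonical path when the cache is
# built, then compare the stored fingerprint on the next run.
fingerprint() {
    stat -c '%s %Y' "$1"    # size in bytes, mtime as epoch seconds
    readlink -f "$1"        # canonical path
}

build_cache_meta() {        # $1 = cache dir, $2 = input .osm file
    mkdir -p "$1"
    fingerprint "$2" > "$1/cache.meta"
    # ... real cache generation would go here ...
}

cache_is_valid() {          # $1 = cache dir, $2 = input .osm file
    [ -f "$1/cache.meta" ] && fingerprint "$2" | cmp -s - "$1/cache.meta"
}
```

Any change to the input file's size, timestamp or location then invalidates the cache, which matches the behaviour Chris describes.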

Chris Miller wrote:

Hello Chris,

Sorry for the delay in replying; I'm just back from work.
Have a go with splitter r77. It should now detect what a cache from a previous splitter run contains. It will then reuse or regenerate it as is appropriate for the parameters you have provided to the current run. I've tested just about every combination of --split-file, --cache, and .osm file parameters I can think of, hopefully it'll do the right thing in all situations now. Give me a shout if you have any problems.
I tested it and it works great; everything is OK on my side. Thank you, Chris.

Francois

Chris Miller wrote:
If you are using compressed .osm files (bz2 compression especially), the speed improvement should be greater still, since the decompression will only need to happen once rather than on each pass.
Does that mean that with --cache the splitter will take the same time whether the input osm files are bz2-compressed or uncompressed? Thanks for this new feature.

Cheers,
Carlos

CD> Does it mean with --cache splitter will take the same time using bz2
CD> compressed or uncompressed input osm files?
CD> Thanks for this new feature.
CD> Cheers Carlos

The first time you use the --cache option, the splitter still needs to uncompress the osm file once so it can parse it and generate the cache. As a result, the initial cache creation will still take longer with a compressed osm file than with an uncompressed one. Once the cache has been built, however, the splitter doesn't need the osm file any more, so the second stage of the split will take the same amount of time regardless of the compression. Additionally, if you run the splitter and the cache already exists from a previous run, the splitter will run very quickly regardless of the compression, because it doesn't need the osm file at all in this scenario.

If you don't use --cache, the splitter has to uncompress the bz2 file at least twice during the split (once during the first stage, once or more during the second), which slows things down significantly.

Based on the tests I've done, I'd recommend always using --cache in the following situations:

- if you are splitting compressed osm files
- if you intend to run the splitter more than once on the same osm file
- if you are splitting a file into lots of areas and the second stage of the split requires more than one pass

The one situation where --cache might not make sense is if you are doing a one-off split on an uncompressed osm file and the second stage only requires a single pass to write out the split files. In this situation --cache isn't much faster and may even slow things down a fraction, since the overhead of creating the cache can outweigh the benefits it provides.

Chris
participants (5)

- Carlos Dávila
- Chris Miller
- frmas
- Lambertus
- Marko Mäkelä