
Hello list,

for a few days I have been trying to understand the idea behind the class SparseInt2ShortMapInline in splitter, because it seems to waste memory although it is documented to save it. If I got this right, the amount of memory needed depends directly on the highest id that is saved. For example, the data for a small country like the German Bundesland Saarland (from Geofabrik) cannot be processed with

    java -Xmx700m -jar splitter.jar --max-areas=2 --max-nodes=200000 e:\dwnload\saarland.osm.pbf

I tried that with R181 and R180. The program runs without problems when I change the code so that SparseInt2ShortMultiMap always uses Int2ShortOpenHashMap and never SparseInt2ShortMapInline. Was this code optimized for a special input (e.g. with small id values)?

ciao,
Gerd

Hello,

I cannot help with details about the SparseMultiMap data types, but I can confirm that the memory usage of the splitter depends on the maximum node id within the input file; it should be nearly the same regardless of the total size of your input. If you, for example, add a faked node with an id above 2^31, the memory usage of the splitter explodes (observed on a 64-bit Linux machine); below that, memory usage seems to be nearly constant, even on very big input files. This will become a problem in the near future (half a year to a year) when the OSM node ids reach this limit.

hasemann
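A minimal standalone sketch of the limit Olaf describes (a hypothetical snippet, not splitter code): once node ids pass 2^31 - 1 they no longer fit into a Java int, so any code that narrows the id corrupts the key:

    public class NodeIdLimit {
        public static void main(String[] args) {
            long okId = Integer.MAX_VALUE;   // 2147483647, the largest id an int can hold
            long bigId = (1L << 31) + 42;    // a faked id just above 2^31
            System.out.println((int) okId);  // 2147483647
            System.out.println((int) bigId); // -2147483606: the id has wrapped around
        }
    }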

Hello Olaf,

thanks for the confirmation. So if this high memory usage is considered a feature rather than a bug, I think it would be a good idea to avoid the resize actions when the highest node id is known in advance. If I understood it right, the first pass could save this information for each processed area. This information could also be used to determine whether the available maximum heap is large enough to use this scheme; otherwise the normal hash map could be used. I'll try this and post a patch if it helps.

Ciao,
Gerd
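A rough sketch of that idea, with made-up names and per-entry cost constants (assumptions for illustration, not measured splitter figures): once the first pass has recorded the highest node id, the program could estimate whether the sparse layout fits into the heap and otherwise fall back to the hash map:

    public class MapChooser {
        // decide after the first pass which map implementation to use
        static boolean useSparseMap(long highestNodeId, long distinctNodes) {
            long maxHeapBytes = Runtime.getRuntime().maxMemory();
            long sparseBytes = highestNodeId / 2;  // assumed ~4 bits per possible id
            long hashBytes = distinctNodes * 20L;  // assumed ~20 bytes per stored id
            return sparseBytes < hashBytes && sparseBytes < maxHeapBytes / 2;
        }
    }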

Hello list,

I think I found a way to improve the splitter's memory usage. I now have a version on my PC that is

- a bit faster: at least for my largest test data, germany.osm.pbf, it saves ~29% runtime, 725 sec. instead of 1009 sec. (AMD dual-core 5050e, 4 GB RAM, Windows XP)
- able to handle node ids up to 137,438,953,471 (= 2^37-1), from both XML and osm.pbf input as far as I can tell
- still less memory consuming, especially for data with very high node ids

I achieved that by using a long for the id values and by replacing the two ArrayLists in SparseInt2ShortMapInline with an Int2ObjectOpenHashMap (fastutil), so that the hash map stores objects like this:

    class Storer {
        private long chunkmask;
        private short[] chunk;
    }

The meaning of these two fields is the same as in the existing code in SparseInt2ShortMapInline. I did not test the change with a whole planet because my machine is too slow for that and I have problems downloading the large file. Since the new code needs a different fastutil.jar, I can't simply add a patch. If anybody is interested in testing with the whole planet or reviewing my code, what is the proposed way to do this?

Ciao,
Gerd
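A simplified sketch of the layout described above (the details are assumptions: the real patch presumably packs the chunk using the chunkmask, while this version stores full chunks and uses the mask only to mark occupied slots). The top bits of the node id select a Storer, the low bits index into its chunk:

    import it.unimi.dsi.fastutil.ints.Int2ObjectOpenHashMap;
    import java.util.Arrays;

    class ChunkedLong2ShortMap {
        private static final int CHUNK_BITS = 6;                  // 64 entries per chunk
        private static final short UNASSIGNED = Short.MIN_VALUE;  // assumed sentinel

        static class Storer {
            long chunkmask;                      // bit i set => slot i holds a value
            short[] chunk = new short[1 << CHUNK_BITS];
        }

        private final Int2ObjectOpenHashMap<Storer> map = new Int2ObjectOpenHashMap<>();

        void put(long id, short val) {
            int key = (int) (id >>> CHUNK_BITS); // fits in an int for ids below 2^37
            Storer s = map.get(key);
            if (s == null) {
                s = new Storer();
                Arrays.fill(s.chunk, UNASSIGNED);
                map.put(key, s);
            }
            int slot = (int) (id & ((1 << CHUNK_BITS) - 1));
            s.chunk[slot] = val;
            s.chunkmask |= 1L << slot;
        }

        short get(long id) {
            Storer s = map.get((int) (id >>> CHUNK_BITS));
            if (s == null) return UNASSIGNED;
            int slot = (int) (id & ((1 << CHUNK_BITS) - 1));
            return (s.chunkmask & (1L << slot)) != 0 ? s.chunk[slot] : UNASSIGNED;
        }
    }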

On Thu, Oct 27, 2011 at 4:56 AM, GerdP <gpetermann_muenchen@hotmail.com> wrote:
> Hello list,
> for a few days I have been trying to understand the idea behind the class SparseInt2ShortMapInline in splitter, because it seems to waste memory although it is documented to save it. If I got this right, the amount of memory needed depends directly on the highest id that is saved. For example, the data for a small country like the German Bundesland Saarland (from Geofabrik) cannot be processed with
>
>     java -Xmx700m -jar splitter.jar --max-areas=2 --max-nodes=200000 e:\dwnload\saarland.osm.pbf
There are two main algorithms. One is O(A * #distinct ids), using a hash table, with A being around 64 bits (with 32-bit node ids) or 100 bits (with 64-bit node ids). But this would result in too much memory usage for a whole planet because of the constant factor A. The current algorithm is O(B * #maximum_id), and the constant factor B is a lot smaller than A: around 4 bits per unused entry and 20 bits per used entry. For a whole-planet split, this is a third or half of the memory of the first choice.
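These estimates can be checked with a quick back-of-the-envelope calculation; the planet figures below are assumptions for a 2011 planet file, not numbers from the thread:

    public class MemoryEstimate {
        public static void main(String[] args) {
            long used = 1_200_000_000L;    // assumed distinct node ids in the planet
            long maxId = 1_400_000_000L;   // assumed highest node id
            long unused = maxId - used;

            double hashBits = used * 100.0;                  // upper hash estimate
            double sparseBits = used * 20.0 + unused * 4.0;  // sparse layout

            System.out.printf("hash:   ~%.1f GB%n", hashBits / 8 / (1 << 30));
            System.out.printf("sparse: ~%.1f GB%n", sparseBits / 8 / (1 << 30));
            // prints roughly 14.0 GB vs 2.9 GB; against the lower 64-bit hash
            // estimate (~8.9 GB) the sparse map needs about a third
        }
    }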
> I tried that with R181 and R180. The program runs without problems when I change the code so that SparseInt2ShortMultiMap always uses Int2ShortOpenHashMap and never SparseInt2ShortMapInline. Was this code optimized for a special input (e.g. with small id values)?
No. The code is tuned for whole-planet splits, where minimizing memory consumption is paramount. The right algorithm would adapt to the number of nodes, using Int2ShortOpenHashMap for small splits (say, country-sized or smaller) and SparseInt2ShortMapInline for whole-planet splits. Maybe do it as a command line option? I recommend against your proposed change later in the thread because it will increase the memory usage by at least 500 MB for whole-planet splits: 1.5B nodes requires at least 25M Storers, with 8 bytes object overhead + 4 bytes hashtable key + 4 bytes hashtable value pointer + 50% hashtable overhead, or about 20 bytes each.

Scott
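Scott's 500 MB figure can roughly be reproduced from the numbers he gives; the 64-entry chunk size is an assumption here:

    public class StorerOverhead {
        public static void main(String[] args) {
            long nodes = 1_500_000_000L;
            long storers = nodes / 64;   // ~23M Storers with 64-entry chunks
            long bytesPerStorer = 20;    // ~8 header + 4 key + 4 pointer + hash slack
            System.out.println(storers * bytesPerStorer / (1 << 20) + " MB");
            // prints ~447 MB for the hash table alone, before counting the
            // Storer bodies and chunk arrays themselves, hence "at least 500 MB"
        }
    }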

Hello Scott,

thanks for the detailed analysis. (I started looking at this because I "played" with the program on my netbook and wasn't able to split even small files like the one for Niedersachsen; that's why I thought the program was in error. In my job I coded many programs that handle mass data on IBM mainframes, but I am a bloody beginner in Java.)

I think I now understand the idea behind your implementation, and I agree that it is probably the best solution for a whole planet, where the ratio of highest node id to number of ids is rather small. On the other hand, this ratio is going to get worse in the future for "normal" splits (e.g. Germany), especially when such a high node id is also saved in the first overflow map. Interesting for me: I tried to split europe.osm.pbf with default parameters and -Xmx2000m; r181 crashed with a GC message, my version finished.

Regarding a parameter: I assume the program can decide which algorithm is better after the first pass, but a parameter could also be used.

Regarding my Storer class: I think it reduces space. The straightforward approach would be to save each id in the Int2Short hash map directly, but I liked your trick with the chunks, as they save space and reduce both the problem of hash collisions and the future problem that node ids will exceed 2^31. My first change was to store chunkmask and chunk in their own hash maps, and that caused a lot of overhead compared to the Storer.

Besides that: why is a new chunk initialized with the value 4 in chunkMake: Arrays.fill(out, (short) 4)? I used Arrays.fill(out, (short) unassigned) instead, and I think it works fine. In the original code, the first 4 shorts in a chunk are never used. Correct?

Ciao,
Gerd
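A reconstruction of the allocation in question (the mail does not show how the sentinel is defined, so its value is an assumption):

    private static final short UNASSIGNED = Short.MIN_VALUE; // assumed sentinel

    private short[] chunkMake(int size) {
        short[] out = new short[size];
        // fill every slot with the sentinel instead of the literal 4,
        // so an unused slot cannot collide with a real stored value
        java.util.Arrays.fill(out, UNASSIGNED);
        return out;
    }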

Hello Scott,

while digging deeper into this problem I noticed that the data we store for each node id is extremely redundant. I think it should be possible to build a directory of all possible combinations of values, so that we just have to store one int value which addresses the entry in the directory. Since we are not even interested in the order of the stored values (the writer ids), this should be a very simple dictionary. I am just not sure whether this dictionary should be built "on the fly" in the add method, like in zip algorithms, or whether we can compute some tables first, like lex+yacc. Any comments?

Gerd
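A toy sketch of the dictionary idea, built "on the fly" as suggested (all names hypothetical): each combination of writer ids is interned once, and every node only stores the int index of its combination:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Set;
    import java.util.TreeSet;

    class WriterSetDictionary {
        private final HashMap<Set<Short>, Integer> index = new HashMap<>();
        private final List<Set<Short>> sets = new ArrayList<>();

        // return the dictionary id for this combination, adding it if new;
        // the TreeSet copy makes the key independent of insertion order
        int intern(Set<Short> writerIds) {
            return index.computeIfAbsent(new TreeSet<>(writerIds), k -> {
                sets.add(k);
                return sets.size() - 1;
            });
        }

        Set<Short> lookup(int id) {
            return sets.get(id);
        }
    }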
participants (3):
- GerdP
- Olaf Hasemann
- Scott Crosby