Patch to reduce memory usage by interning strings.

I noticed that mkgmap does not intern any strings. In particular, this tile, generated by the splitter, fails to build with -Xmx3000m on 64-bit jdk under linux. With my patch, mkgmap generates the tile with -Xmx1000m. <bounds minlat='55.1953125' minlon='9.4921875' maxlat='56.6015625' maxlon='11.513671875'/> This tile has 1m nodes. Among the nodes and ways on this tile, there are 12m tags, yet only 100k distinct tag key/value pairs; on average each value occurs 120 times. I explicitly do not use normal string interning because String.intern() strings are kept forever, and I want these strings to be GC'able after the tile is done. I trade GCability for having the occasional string duplicated in memory by flushing the interning table every 10k unique strings. This code is not presently multithread safe; Ideally there should be one string interning table for each parser/thread. Scott

I noticed that mkgmap does not intern any strings. In particular, this tile, generated by the splitter, fails to build with -Xmx3000m on 64-bit jdk under linux. With my patch, mkgmap generates the tile with -Xmx1000m.
<bounds minlat='55.1953125' minlon='9.4921875' maxlat='56.6015625' maxlon='11.513671875'/>
This tile has 1m nodes. Among the nodes and ways on this tile, there are 12m tags, yet only 100k distinct tag key/value pairs; on average each value occurs 120 times.
I explicitly do not use normal string interning because String.intern() strings are kept forever, and I want these strings to be GC'able after the tile is done. I trade GCability for having the occasional string duplicated in memory by flushing the interning table every 10k unique strings.
This code is not presently multithread safe; Ideally there should be one string interning table for each parser/thread.
Scott
Hi Scott! I think that's a good idea to intern the strings. As far as I know the LossyIntern class is not needed. The .intern() function of a string does exactly the same. Some time ago I sent a very similar patch to the mailing list which is not yet committed. Could you please test with your use case if it performs a similar memory reduction? The patch is thread safe and does not intern all strings. In my opinion the value of a name tag should not be interned because there is a high probability that this tag is used once only. WanMil

On Wed, 31 Mar 2010 21:13:49 +0200, WanMil <wmgcnfg@web.de> writes:
I noticed that mkgmap does not intern any strings. In particular, this tile, generated by the splitter, fails to build with -Xmx3000m on 64-bit jdk under linux. With my patch, mkgmap generates the tile with -Xmx1000m.
<bounds minlat='55.1953125' minlon='9.4921875' maxlat='56.6015625' maxlon='11.513671875'/>
This tile has 1m nodes. Among the nodes and ways on this tile, there are 12m tags, yet only 100k distinct tag key/value pairs; on average each value occurs 120 times.
I explicitly do not use normal string interning because String.intern() strings are kept forever, and I want these strings to be GC'able after the tile is done. I trade GCability for having the occasional string duplicated in memory by flushing the interning table every 10k unique strings.
This code is not presently multithread safe; Ideally there should be one string interning table for each parser/thread.
Scott
Hi Scott!
I think that's a good idea to intern the strings. As far as I know the LossyIntern class is not needed. The .intern() function of a string does exactly the same.
You are right. String intern does not intern forever at least since Java 1.2.
Some time ago I sent a very similar patch to the mailing list which is not yet committed. Could you please test with your use case if it performs a similar memory reduction?
You can run it if you want, but from the numbers I gave above for this tile, interning values as in my patch will decrease the number of strings in RAM from 12M to <100k values. Interning only keys would reduce the number of Strings in RAM from 24M to 12M.
The patch is thread safe and does not intern all strings. In my opinion the value of a name tag should not be interned because there is a high probability that this tag is used once only.
Thats probably true for many or most tiles, but not for the tile I referenced above, where on average each value occurs 120 times. That tile is unbuildable with a 3gb heap without my patch and buildable with 1gb heap with my patch. Shall I post an updated patch without FuzzyIntern? Scott

Am 31.03.2010 22:10, schrieb Scott A Crosby:
On Wed, 31 Mar 2010 21:13:49 +0200, WanMil<wmgcnfg@web.de> writes:
I noticed that mkgmap does not intern any strings. In particular, this tile, generated by the splitter, fails to build with -Xmx3000m on 64-bit jdk under linux. With my patch, mkgmap generates the tile with -Xmx1000m.
<bounds minlat='55.1953125' minlon='9.4921875' maxlat='56.6015625' maxlon='11.513671875'/>
This tile has 1m nodes. Among the nodes and ways on this tile, there are 12m tags, yet only 100k distinct tag key/value pairs; on average each value occurs 120 times.
I explicitly do not use normal string interning because String.intern() strings are kept forever, and I want these strings to be GC'able after the tile is done. I trade GCability for having the occasional string duplicated in memory by flushing the interning table every 10k unique strings.
This code is not presently multithread safe; Ideally there should be one string interning table for each parser/thread.
Scott
Hi Scott!
I think that's a good idea to intern the strings. As far as I know the LossyIntern class is not needed. The .intern() function of a string does exactly the same.
You are right. String intern does not intern forever at least since Java 1.2.
Some time ago I sent a very similar patch to the mailing list which is not yet committed. Could you please test with your use case if it performs a similar memory reduction?
You can run it if you want, but from the numbers I gave above for this tile, interning values as in my patch will decrease the number of strings in RAM from 12M to<100k values. Interning only keys would reduce the number of Strings in RAM from 24M to 12M.
The patch is thread safe and does not intern all strings. In my opinion the value of a name tag should not be interned because there is a high probability that this tag is used once only.
Thats probably true for many or most tiles, but not for the tile I referenced above, where on average each value occurs 120 times. That tile is unbuildable with a 3gb heap without my patch and buildable with 1gb heap with my patch.
Shall I post an updated patch without FuzzyIntern?
Scott
Scott, my patch interned all keys and additionally the values of a limited number of keys. Maybe it's not necessary to limit the interning of values. So I have attached the very simple but hopefully very effective patch regarding the memory footprint of mkgmap. Regarding your patch: I don't understand the function of the FuzzyIntern class. You build a HashMap from (uninterned) Strings to the interned String. Then you are looking up new strings in this HashMap and use the interned variant. Where's the difference to the (hopefully) very performance optimized intern() method?
String intern does not intern forever I didn't know that. Do you have any link where this is specified?
WanMil

On Wed, 31 Mar 2010 23:58:45 +0200, WanMil <wmgcnfg@web.de> writes:
Am 31.03.2010 22:10, schrieb Scott A Crosby:
On Wed, 31 Mar 2010 21:13:49 +0200, WanMil<wmgcnfg@web.de> writes:
my patch interned all keys and additionally the values of a limited number of keys. Maybe it's not necessary to limit the interning of values. So I have attached the very simple but hopefully very effective patch regarding the memory footprint of mkgmap.
My opinion is that its simpler and more robust to intern or pseudointern every tag value. The bad tile had a lot of duplicate values, what if those tags were not on your list of the 'only intern value strings for some keys'?
Regarding your patch: I don't understand the function of the FuzzyIntern class. You build a HashMap from (uninterned) Strings to the interned String. Then you are looking up new strings in this HashMap and use the interned variant. Where's the difference to the (hopefully) very performance optimized intern() method?
Note that this code is not actually interning any strings in with String.intern(). Call it psuedointerning. The purpose of FuzzyIntern was when I believed that String.intern() interned forever, which I considered very undesirable. I wanted semantics that would remove (most) duplicate strings in memory without forcing those strings to remain in RAM forever.
String intern does not intern forever
I didn't know that. Do you have any link where this is specified?
A google search 'java intern weak reference' seems to indicate that Java since version 1.2 uses a weak reference table for string intern, which means that they can be removed. That search offers alternative implementation ideas, such as using a real weak reference hashtable. Scott

Note that Java's String.intern() method can be pretty slow, so while you'll save a fair chunk of memory you'll potentially suffer a noticable performance hit too if you're calling it a lot. By adding a barrier-free caching layer in front of the String.intern() calls you can gain a reasonable performance boost in this situation. As an example of how this can be implemented, take a look at Lucene's SimpleStringInterner which does exactly this: http://github.com/apache/lucene/blob/1c5c409241a2b8b9e64dc8c253791b497a66c36... It's threadsafe in that it guarantees just enough visibility to never generate invalid results, yet also avoids any blocking. Might be worth benchmarking something like this against the normal String.intern() with mkgmap. Chris
On Wed, 31 Mar 2010 21:13:49 +0200, WanMil <wmgcnfg@web.de> writes:
I noticed that mkgmap does not intern any strings. In particular, this tile, generated by the splitter, fails to build with -Xmx3000m on 64-bit jdk under linux. With my patch, mkgmap generates the tile with -Xmx1000m.
<bounds minlat='55.1953125' minlon='9.4921875' maxlat='56.6015625' maxlon='11.513671875'/>
This tile has 1m nodes. Among the nodes and ways on this tile, there are 12m tags, yet only 100k distinct tag key/value pairs; on average each value occurs 120 times.
I explicitly do not use normal string interning because String.intern() strings are kept forever, and I want these strings to be GC'able after the tile is done. I trade GCability for having the occasional string duplicated in memory by flushing the interning table every 10k unique strings.
This code is not presently multithread safe; Ideally there should be one string interning table for each parser/thread.
Scott
Hi Scott!
I think that's a good idea to intern the strings. As far as I know the LossyIntern class is not needed. The .intern() function of a string does exactly the same. You are right. String intern does not intern forever at least since Java 1.2.
Some time ago I sent a very similar patch to the mailing list which is not yet committed. Could you please test with your use case if it performs a similar memory reduction?
You can run it if you want, but from the numbers I gave above for this tile, interning values as in my patch will decrease the number of strings in RAM from 12M to <100k values. Interning only keys would reduce the number of Strings in RAM from 24M to 12M.
The patch is thread safe and does not intern all strings. In my opinion the value of a name tag should not be interned because there is a high probability that this tag is used once only.
Thats probably true for many or most tiles, but not for the tile I referenced above, where on average each value occurs 120 times. That tile is unbuildable with a 3gb heap without my patch and buildable with 1gb heap with my patch.
Shall I post an updated patch without FuzzyIntern?
Scott

On Wed, 31 Mar 2010 22:12:12 +0000 (UTC), Chris Miller <chris_overseas@hotmail.com> writes:
Note that Java's String.intern() method can be pretty slow, so while you'll save a fair chunk of memory you'll potentially suffer a noticable performance hit too if you're calling it a lot. By adding a barrier-free caching layer in front of the String.intern() calls you can gain a reasonable performance boost in this situation. As an example of how this can be implemented, take a look at Lucene's SimpleStringInterner which does exactly this:
http://github.com/apache/lucene/blob/1c5c409241a2b8b9e64dc8c253791b497a66c36...
It's threadsafe in that it guarantees just enough visibility to never generate invalid results, yet also avoids any blocking. Might be worth benchmarking something like this against the normal String.intern() with mkgmap.
We don't have to get rid of every duplicate string; only most of them, so approximation techniques such as a per-thread or per-parser FuzzyIntern will work fine, and require no concurrent access. Now that I know that String. intern() uses weak references, it seems to be the most trivial way to reduce the memory usage of that tile of planet.osm by at least 3x. Scott

On 31/03/10 14:56, Scott A Crosby wrote:
I noticed that mkgmap does not intern any strings. In particular, this
This is true. There is a pending patch that deals with excessive memory use in a slightly different way which is on the style-speed branch, with a pre-built one available at the bottom of the mkgmap download page. I drops all tags that are not used by the current style and as an extra feature it ensures that there is only copy of all the strings on the key side. Could you try it out please. It may be worth also interning the values but when I was looking at it there was much less benefit for the maps I was looking at (but it might vary with the input). I'd be happy to apply a patch that interned the values too if there was a decent memory saving without a significant performance drop. I think the point that Chris brought up applies mostly in cases where there is a lot of contended access between threads. If that is the case then it won't matter much for us as reading the input files is currently single threaded. I can't recall why reading maps *is* single threaded. I'm not sure there is really a problem and if there is I am sure it can be fixed easily. Does anyone know or can find out? ..Steve

I think the point that Chris brought up applies mostly in cases where there is a lot of contended access between threads. If that is the case then it won't matter much for us as reading the input files is currently single threaded.
It's true, the biggest gains by far in the approach I was pointing out are when you have heavy concurrent access. I just saw Scott mention his code wasn't threadsafe so figured a multithreaded solution was desirable. However String.intern() can still be expensive even in a single-threaded environment. Replacing String.intern() with something simple like if (!hashSet.contains()) hashSet.add(str) should be roughly 2x quicker. Probably negligible in the grand scheme of things :) Chris

On Thu, 01 Apr 2010 12:10:22 +0100, Steve Ratcliffe <steve@parabola.me.uk> writes:
On 31/03/10 14:56, Scott A Crosby wrote:
I noticed that mkgmap does not intern any strings. In particular, this
This is true. There is a pending patch that deals with excessive memory use in a slightly different way which is on the style-speed branch, with a pre-built one available at the bottom of the mkgmap download page.
I pulled it out of SVN@1569.
I drops all tags that are not used by the current style and as an extra feature it ensures that there is only copy of all the strings on the key side.
Could you try it out please. It may be worth also interning the values but when I was looking at it there was much less benefit for the maps I was looking at (but it might vary with the input). I'd be happy to apply a patch that interned the values too if there was a decent memory saving without a significant performance drop.
The style-speed branch builds the tile in question within 1gb of ram using the default style. I suspect that that style isn't using the problematic tags. A different style that did use the tags would likely still blow up. This seems fragile. I benchmarked the problematic tile on the style-speed branch with and without interning all keys and values in the Tag() constructor using ordinary String.intern(). The 18 character change implementing interning seems to increase performance by about a second, from 66 to 65 seconds. I'd say to go for it. Scott

Am 02.04.2010 06:16, schrieb Scott A Crosby:
On Thu, 01 Apr 2010 12:10:22 +0100, Steve Ratcliffe<steve@parabola.me.uk> writes:
On 31/03/10 14:56, Scott A Crosby wrote:
I noticed that mkgmap does not intern any strings. In particular, this
This is true. There is a pending patch that deals with excessive memory use in a slightly different way which is on the style-speed branch, with a pre-built one available at the bottom of the mkgmap download page.
I pulled it out of SVN@1569.
I drops all tags that are not used by the current style and as an extra feature it ensures that there is only copy of all the strings on the key side.
Could you try it out please. It may be worth also interning the values but when I was looking at it there was much less benefit for the maps I was looking at (but it might vary with the input). I'd be happy to apply a patch that interned the values too if there was a decent memory saving without a significant performance drop.
The style-speed branch builds the tile in question within 1gb of ram using the default style. I suspect that that style isn't using the problematic tags. A different style that did use the tags would likely still blow up. This seems fragile.
I benchmarked the problematic tile on the style-speed branch with and without interning all keys and values in the Tag() constructor using ordinary String.intern(). The 18 character change implementing interning seems to increase performance by about a second, from 66 to 65 seconds.
I'd say to go for it.
Scott
Scott, thanks for posting hard performance values! That's good news because it makes possible to use a simple intern() solution. Before I read your post I compared some german tiles. The single threaded SVN r1625 took 328s with the default style. Using the intern() patch it took 335s. That seems to be an unsignificant change. I am not able to post multithreaded comparisons because my geriatric CPU has one core only... I also used the VisualGC plugin of JVisualVM to view the GC. The permanent space (where all the interned Strings are stored) did not exceed 11mb. So interning all Strings will probably not exceed the permanent space. However we might add a note to the README file that in some special cases it is necessary to increase the permanent size with -XX:MaxPermSize=128m (or more) to avoid OutOfMemoryExceptions. WanMil
participants (4)
-
Chris Miller
-
Scott A Crosby
-
Steve Ratcliffe
-
WanMil