character repertoires

Robert Joop

13 Feb 2013

8:04 p.m.

There is support for a number of character sets in mkgmap, but do we have any collected wisdom about what character repertoires are supported on our devices?

While at the hack weekend, I tried some latin2 on my new device and was pleasantly surprised that “šř” appeared in a street name – on my old GPSmap 60CSx with the very same gmapsupp.img, I see them as what looks like “.Ø” on the map and “◊Ø” in the tooltip when hovering over it.

rj

Steve Ratcliffe

19 Feb

9:04 p.m.

Hi

On 13/02/13 20:04, Robert Joop wrote:

...

There is support for a number of character sets in mkgmap, but do we have any collected wisdom about what character repertoires are supported on our devices? While at the hack weekend, I tried some latin2 on my new device and was pleasantly surprised that “šř” appeared in a street name – on my old GPSmap 60CSx with the very same gmapsupp.img, I see them as what looks like “.Ø” on the map and “◊Ø” in the tooltip when hovering over it.

I hoped someone would reply with a detailed answer! I'm not an expert on the different devices.

I am sure that the 'old' devices such as the Legend Cx and ones of that era, only supported the one character set. There may have been different versions sold in different regions with different character sets. But the standard European one was code-page 1252 of course.

Newer devices seem to support a number of character sets. I have verified that my Nuvi 1490 can do Arabic script (code page 1256) for example. I would guess that it includes many/all of the western European from Greek to Cyrillic.

I believe it is still true that only devices sold in Asia are capable of Chinese/Japanese etc. characters - although you can download replacement firmware that includes them from Garmin. But I'd be glad to be corrected by someone who has tried it all out.

..Steve

Robert Joop

23 Feb

9:47 p.m.

On 13-02-19 22:04:14 CET, Steve Ratcliffe wrote:

...

Hi

I hoped someone would reply with a detailed answer! I'm not an expert on the different devices. I am sure that the 'old' devices such as the Legend Cx and ones of that era, only supported the one character set. There may have been different versions sold in

… with “the one” being CP1252 and not ASCII, fortunately. It actually is the CP1252 superset of ISO 8859-1, I see the printable characters in the range 128–159 (which ISO 8859 reserves for a second set of control characters).

Funny difference: code pages 125x all have the double dagger U+2021 ‡ at position 0x87, and most maps with these code pages show the “‡”. Only the map with CP1252 differs: it shows “++” instead! Is it the Garmin’s fault, or can mkgmap be the reason?

Same for the Euro sign U+20AC €: CP1252 shows “Eu” at its position 0x80, while the other maps having it at the same position show the “€”.

The micro sign U+00B5 µ becomes a ? on most code page maps, except for the Greek one, even though it is at the same position in all code pages.
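The byte positions mentioned here can be checked against the JDK's own code page tables. The following is only a small sketch: the class and method names are made up for illustration, and it assumes the JDK in use ships all of the `cp125x` charsets (some of them live in the optional `jdk.charsets` module).

```java
import java.nio.charset.Charset;

class CodePagePositions {
    /** Decodes one byte value in the given Windows code page, e.g. (1252, 0x87). */
    static char decodeByte(int codePage, int value) {
        Charset cs = Charset.forName("cp" + codePage);
        // Bytes that are undefined in a code page decode to U+FFFD.
        return new String(new byte[] {(byte) value}, cs).charAt(0);
    }

    public static void main(String[] args) {
        int[] positions = {0x80, 0x87, 0xB5};
        for (int cp = 1250; cp <= 1258; cp++) {
            StringBuilder line = new StringBuilder("cp" + cp + ":");
            for (int pos : positions) {
                char c = decodeByte(cp, pos);
                line.append(String.format(" 0x%02X=U+%04X", pos, (int) c));
            }
            System.out.println(line);
        }
    }
}
```

Running `main` prints one line per code page, which makes it easy to see where a position holds the same character everywhere and where it does not.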

...

different regions with different character sets. But the standard European one was code-page 1252 of course.

Newer devices seem to support a number of character sets. I have verified that my Nuvi 1490 can do Arabic script (code page 1256) for example. I would guess that it includes many/all of the western European from Greek to Cyrillic.

While I wouldn’t consider Greek and Cyrillic to be western European ;-), the Montana supports them.

Interestingly, I get to see Arabic characters (CP1256), but not any Hebrew characters (CP1255). And in the Arabic map’s upper half, the latin based characters show up as “?”.

Another peculiar thing: while the Garmin does its usual weird upper/lower casing, TWO LABELS ARE ALL CAPS, namely those containing the ª feminine and º masculine ordinal indicators.

...

I believe it is still true that only devices sold in Asia are capable of Chinese/Japanese etc. characters - although you can download replacement firmware that includes them from Garmin.

Asian: A map with CP1258 shows up with totally unlabeled streets, not even anything from the ASCII range.

As for CP874, the U+0Exx characters in the code page’s upper half do not show up, but the ASCII half looks complete.

I haven’t tried any of the larger Asian code pages.

rj

Steve Ratcliffe

25 Feb

8:03 p.m.

...

It actually is the CP1252 superset of ISO 8859-1, I see the printable characters in the range 128–159 (which ISO 8859 reserves for a second set of control characters).

Your observations neatly illustrate the way that the code works. This is the current algorithm:

1a. if ascii (no code page): all characters > 0x7f are transliterated into ascii characters
1b. if code-page=1252: all characters > 0xff are transliterated into latin1 characters.
1c. all other code pages: no transliteration.
2. Create a character set name by prepending "cp" to the code-page (eg. cp1252).
3. Use the standard java character set conversion with that name to convert the result of step 1. Any character that cannot be converted is replaced with a '?' symbol. This may possibly vary with java version and platform.

That explains most of the observations I think. U+2021 is transliterated to ++ for 1252, but not for any other 125x. Same for the Euro symbol to Eu.
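The steps of this algorithm can be sketched in plain Java. This is an illustration only, not mkgmap's actual code: the `TO_ASCII` table is a tiny hypothetical stand-in for the real transliteration tables, and using `0` to mean "ascii, no code page" is a convention of this sketch.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Map;

class CurrentLabelEncoding {
    // Hypothetical three-entry stand-in for the real transliteration tables.
    static final Map<Character, String> TO_ASCII =
            Map.of('\u2021', "++", '\u20AC', "Eu", '\u0159', "r");

    // Step 1: replace every character above 'limit' by its transliteration.
    static String transliterateAbove(String s, int limit) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(c > limit ? TO_ASCII.getOrDefault(c, "?") : String.valueOf(c));
        }
        return sb.toString();
    }

    // codePage == 0 stands for "ascii, no code page" in this sketch only.
    static byte[] encode(String label, int codePage) {
        if (codePage == 0) {
            return transliterateAbove(label, 0x7f).getBytes(StandardCharsets.US_ASCII);
        }
        // Steps 1b and 1c: only code page 1252 gets transliteration.
        String step1 = codePage == 1252 ? transliterateAbove(label, 0xff) : label;
        // Step 2: build the charset name from the code page number.
        Charset cs = Charset.forName("cp" + codePage);
        // Step 3: String.getBytes replaces unmappable characters with the
        // charset's default replacement, which is '?' for these code pages.
        return step1.getBytes(cs);
    }
}
```

With this sketch, U+2021 comes out as the two bytes `++` for code page 1252 but as the single byte 0x87 for code page 1250, which matches the behaviour described above.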

...

The micro sign U+00B5 μ becomes a ? on most code page maps, except for the Greek one, even though it is at the same position in all code pages.

U+00b5 is upper cased to GREEK CAPITAL LETTER MU, which is only present in the Greek code page.
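This effect can be reproduced with the standard library alone. A minimal sketch, assuming that upper-casing is done with Java's ordinary `String.toUpperCase` (the thread does not say which call mkgmap uses); the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.util.Locale;

class MicroSignUpperCase {
    /** True if the upper-cased form of s can still be encoded in the code page. */
    static boolean survivesUpperCasing(String s, int codePage) {
        String upper = s.toUpperCase(Locale.ROOT);
        return Charset.forName("cp" + codePage).newEncoder().canEncode(upper);
    }

    public static void main(String[] args) {
        String micro = "\u00B5"; // MICRO SIGN
        String upper = micro.toUpperCase(Locale.ROOT);
        System.out.printf("upper-cased: U+%04X%n", (int) upper.charAt(0));
        for (int cp : new int[] {1250, 1252, 1253}) {
            System.out.println("cp" + cp + ": " + survivesUpperCasing(micro, cp));
        }
    }
}
```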

...

And in the Arabic map’s upper half, the latin based characters show up as “?”.

That's because only lower case characters are included.

...

Another peculiar thing: while the Garmin does its usual weird upper/lower casing, TWO LABELS ARE ALL CAPS, namely those containing the ª feminine and º masculine ordinal indicators.

I don't know about this. Possibly a device thing?

...

Asian: A map with CP1258 shows up with totally unlabeled streets, not even anything from the ASCII range.

Strange - are labels correct in the file? If you run strings on the img do you see the ascii labels? If so then it is a device thing.

So currently ascii and 1252 are better than the other code pages since just about every unicode character can be represented, whereas in the other code pages you are limited to characters from that page.

It looks possible to fix this by removing the transliteration step from where it is and only using it when a character that is un-mappable into the target code page is encountered.

..Steve

Robert Joop

10:18 p.m.

On 13-02-25 21:03:40 CET, Steve Ratcliffe wrote:

...

This is the current algorithm:

1a. if ascii(no-code-page): all characters > 0x7f are transliterated into ascii characters
1b. if code-page=1252: all characters > 0xff are transliterated into latin1 characters.

I guess here’s the little weakness (which you also hint at yourself elsewhere in your mail): “all characters > 0xff” is decided by their unicode code point, not by their code point in the target code page. Well, I mean by whether they’ve got any code point in the target code page. :-)

I wonder how to improve the algorithm without making it much more CPU intensive. Does Java offer a fast code page mappability lookup?

If it were programmed in C (I haven’t written any Java code this century), I might throw some RAM at it: initialize 64 KiB to zeroes (to cover 16 bit unicode), and set to 1 all those entries for the unicode code points reverse mapped from the printable character code points of the target code page.
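The lookup table described here translates to Java fairly directly. The sketch below is hypothetical and only valid for single-byte code pages; it uses a `BitSet` (one bit per code point, about 8 KiB) instead of one byte per code point, which is a swapped-in detail rather than what the mail proposes.

```java
import java.nio.charset.Charset;
import java.util.BitSet;

class MappableTable {
    /** Builds a 16-bit lookup table of the printable characters of a single-byte code page. */
    static BitSet build(int codePage) {
        Charset cs = Charset.forName("cp" + codePage);
        BitSet mappable = new BitSet(0x10000);
        for (int b = 0; b < 256; b++) {
            // Reverse-map each byte of the code page to its Unicode character.
            char c = new String(new byte[] {(byte) b}, cs).charAt(0);
            // Skip undefined bytes (decoded as U+FFFD) and control characters.
            if (c != '\uFFFD' && !Character.isISOControl(c)) {
                mappable.set(c);
            }
        }
        return mappable;
    }

    public static void main(String[] args) {
        BitSet cp1252 = build(1252);
        System.out.println("printable characters in cp1252: " + cp1252.cardinality());
    }
}
```

After the table is built once per map, testing a character is a single `mappable.get(c)` call.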

...

...
Asian: A map with CP1258 shows up with totally unlabeled streets, not even anything from the ASCII range.

Strange - are labels correct in the file? If you run strings on the img do you see the ascii labels? If so then it is a device thing.

Yes – strings on the generated cp1258.img look pretty similar to the output of `strings cp1251.img`. You can all try it yourselves, using the attached little package.

rj

Steve Ratcliffe

26 Feb

1:19 p.m.

...

I wonder how to improve the algorithm without making it much more CPU intensive.

...

Does Java offer a fast code page mappability lookup?

I think the way I outlined would be no less efficient than what happens now. The low level CharsetEncoder in java can be set to replace unmappable characters with a '?' (as now) or to return on finding an unmappable character. The character can then be transliterated to the ascii range before looping back to the Encoder.

It's all just array lookups so quicker than it sounds!

..Steve
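The approach outlined here can be sketched with `java.nio.charset.CharsetEncoder`. This is a hedged illustration, not the change that went into mkgmap: the `TO_ASCII` table is a hypothetical stand-in for the real transliteration data, and the class name is made up.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.util.Map;

class FallbackEncoder {
    // Hypothetical stand-in for the real transliteration tables.
    static final Map<Character, String> TO_ASCII =
            Map.of('\u0159', "r", '\u2021', "++", '\u20AC', "Eu");

    static byte[] encode(String label, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        CharsetEncoder enc = cs.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                // Return to the caller instead of writing '?'.
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        CharBuffer in = CharBuffer.wrap(label);
        ByteBuffer out = ByteBuffer.allocate(64);
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        while (true) {
            CoderResult r = enc.encode(in, out, true);
            drain(out, result);
            if (r.isUnderflow()) {
                break; // all input consumed
            }
            if (r.isUnmappable()) {
                // The input position is at the offending character(s):
                // consume them and emit a transliteration instead.
                for (int i = 0; i < r.length(); i++) {
                    byte[] repl = TO_ASCII.getOrDefault(in.get(), "?").getBytes(cs);
                    result.write(repl, 0, repl.length);
                }
            }
            // On overflow the buffer has just been drained, so loop again.
        }
        enc.flush(out);
        drain(out, result);
        return result.toByteArray();
    }

    private static void drain(ByteBuffer out, ByteArrayOutputStream result) {
        out.flip();
        result.write(out.array(), 0, out.limit());
        out.clear();
    }
}
```

Characters that the target code page can represent never touch the transliteration table, so a label such as "Dvořák" keeps its "á" in cp1252 and only the "ř" falls back to "r".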

Robert Joop

28 Feb

1:05 a.m.

On 13-02-26 14:19:28 CET, Steve Ratcliffe wrote:

...

happens now. The low level CharsetEncoder in java can be set to replace unmappable characters with a '?' (as now) or to return on finding an unmappable character. The character can be transliterated to the ascii range and then loop back to the Encoder. It's all just array lookups so quicker than it sounds!

With unicode maps (I just got a tiny example working, see the “unicode” thread), there is another problem: Since all characters are mappable, no transliteration would take place. But what would be desirable is to base the decision whether to transliterate on whether the target device has the characters in its repertoire (since the device simply shows nothing at all for characters outside its repertoire).

For my device this means:

Latin, Arabic, Cyrillic, Greek -> map
Hebrew and many other -> transliterate

More array lookups… But the contents for these arrays would need researching…

rj
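One coarse way to express such a per-device repertoire is by Unicode script, using `Character.UnicodeScript` from the standard library. This is a sketch under a strong assumption: that a device supports whole scripts, which the thread has not established. The script set below simply mirrors the list above, and the class name is hypothetical.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumSet;
import java.util.Set;

class DeviceRepertoire {
    // Scripts reported as displayable on one device in this thread.
    // COMMON and INHERITED cover digits, punctuation and combining marks.
    // This set is an assumption, not a researched table.
    static final Set<UnicodeScript> SUPPORTED = EnumSet.of(
            UnicodeScript.LATIN, UnicodeScript.ARABIC, UnicodeScript.CYRILLIC,
            UnicodeScript.GREEK, UnicodeScript.COMMON, UnicodeScript.INHERITED);

    /** True if the code point lies outside the assumed device repertoire. */
    static boolean needsTransliteration(int codePoint) {
        return !SUPPORTED.contains(UnicodeScript.of(codePoint));
    }

    public static void main(String[] args) {
        int[] samples = {'a', 0x044F, 0x0634, 0x03A9, 0x05D0};
        for (int cp : samples) {
            System.out.printf("U+%04X %s transliterate=%b%n",
                    cp, UnicodeScript.of(cp), needsTransliteration(cp));
        }
    }
}
```

A real table would probably need to be finer grained than whole scripts, since a device font may cover only part of a script.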

Steve Ratcliffe

4:11 p.m.

On 28/02/13 01:05, Robert Joop wrote:

...

Since all characters are mappable, no transliteration would take place.

Yes this is a case where transliteration would have to be done first with a mask of characters to process. Which leads to the question that perhaps that is how it should be done in all cases.

..Steve

Michał Rogala

10:06 p.m.

Is it possible to override java's remapping tables in mkgmap? I have a map in the CP1250 charset which also covers part of Ukraine, and Cyrillic characters are shown as '?'. I would like to have them remapped to their Latin equivalents, as with --latin1.

2013/2/28 Steve Ratcliffe <steve@parabola.me.uk>

...

On 28/02/13 01:05, Robert Joop wrote:

...
Since all characters are mappable, no transliteration would take place.

Yes this is a case where transliteration would have to be done first with a mask of characters to process. Which leads to the question that perhaps that is how it should be done in all cases.

..Steve
_______________________________________________
mkgmap-dev mailing list
mkgmap-dev@lists.mkgmap.org.uk
http://lists.mkgmap.org.uk/mailman/listinfo/mkgmap-dev

Steve Ratcliffe

10:18 p.m.

On 28/02/13 22:06, Michał Rogala wrote:

...

Is it possible to override java's remapping tables in mkgmap? I have a map in CP1250 charset which covers also part of Ukraine and Cyrillic characters are shown as '?' - I would like to have them remapped to their latin equivalents as in --latin1.

There is no re-mapping if you use code-page 1250. You only get the characters that are in 1250.

If I make the change I was discussing, then that is almost exactly what will happen: the Cyrillic characters will be mapped to their ascii equivalents. But this has not been done yet.

..Steve

Robert Joop

25 Feb

11:09 p.m.

Steve,

digging through your code a little, I stumbled upon the forceUppercase() method and traced its initialization back to the --lower-case command line option. I had the impression that this option was considered useless, but I tried it nevertheless to see for myself: on the contrary! Perhaps for the first time ever, I see a proper “ß” on a Garmin.

Since until only recently¹ this German character did not have any upper case counterpart, it could only be transliterated to SS and so I always see streets ending in -strasse instead of -straße everywhere on Garmins. It always feels like Swiss German (who abandoned the ß long ago).

¹ http://en.wikipedia.org/wiki/Capital_%E1%BA%9E
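The “ß” to “SS” effect is easy to reproduce with Java's standard string upper-casing. Whether mkgmap's upper-casing path uses exactly this call is an assumption; the sketch only shows what the JDK itself does, and the class name is hypothetical.

```java
import java.util.Locale;

class SharpS {
    /** Upper-cases a label with the JDK's locale-independent rules. */
    static String upper(String label) {
        return label.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // The sharp s has no single-character upper case form in these rules,
        // so it expands to two letters.
        System.out.println(upper("Hauptstra\u00DFe")); // prints HAUPTSTRASSE
    }
}
```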

...

...
The micro sign U+00B5 μ becomes a ? on most code page maps, except for the Greek one, even though it is at the same position in all code pages.

U+00b5 is upper cased to GREEK CAPITAL LETTER MU, which is only present in the Greek code page.

With --lower-case, I now also get the μ micro sign all over, not just in the Greek code page. I guess … --code-page=1250 --lower-case … will become my favorite options. ;-)

...

...
And in the Arabic map’s upper half, the latin based characters show up as “?”.

That's because only lower case characters are included.

With --lower-case, the accented latin characters work fine in the Arabic map.

Given these results, I’d like to ask you to reconsider your proposed action to remove the --lower-case option and code.
http://wiki.openstreetmap.org/wiki/Mkgmap/dev/option-review
I got the impression that it _does_ work, and I enjoy the effect very much so far.

...

...
Another peculiar thing: while the Garmin does its usual weird upper/lower casing, TWO LABELS ARE ALL CAPS, namely those containing the ª feminine and º masculine ordinal indicators.

I don't know about this. Possibly a device thing?

I guess I’ve got a lead on this. With --lower-case, the input label “62 0062 b LATIN SMALL LETTER B” shows up the same on the Garmin, whereas the input label “42 0042 B LATIN CAPITAL LETTER B” shows up as “42 0042 B Latin Capital Letter B” on the Garmin, i.e. partially lower-cased.

My theory:
• when the Garmin sees all upper case characters in a label, it folds the case into title caps style
• when the Garmin sees a lower case character in a label, it leaves the label characters’ case as-is.

With this theory, it probably considered the ª and º characters as lower case.

rj
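The theory can be written down as a small function and checked against the two labels quoted above. This models only the guess made in this mail; the real firmware behaviour is unknown, and the class and method names are hypothetical.

```java
class GarminCaseTheory {
    /** Models the guessed device behaviour: title-case a label only if it has no lower case character. */
    static String displayCase(String label) {
        if (label.codePoints().anyMatch(Character::isLowerCase)) {
            return label; // a lower case character is present: leave as-is
        }
        StringBuilder sb = new StringBuilder(label.length());
        boolean startOfWord = true;
        for (int i = 0; i < label.length(); i++) {
            char c = label.charAt(i);
            // Keep the first character of each word, lower-case the rest.
            sb.append(startOfWord ? c : Character.toLowerCase(c));
            startOfWord = !Character.isLetterOrDigit(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(displayCase("42 0042 B LATIN CAPITAL LETTER B"));
        System.out.println(displayCase("62 0062 b LATIN SMALL LETTER B"));
    }
}
```

Whether a label containing ª or º is left alone by this model depends on how `Character.isLowerCase` classifies those two characters in the JDK in use, which is worth checking before relying on it.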

Minko

26 Feb

9:03 a.m.

The option --lower-case might look better on newer generation GPS (the Dutch letter IJ finally gets rendered as it should, not Ij) but in older gps models the labels of the roads are not visible anymore, only dots and the first letter: a..... If this could be solved (or you don't care about older units), lower-case would be a good choice.

Minko

9:25 a.m.

See screenshot of an older nüvi (3xx). Lower-case renders only the POI names correctly, even better than without lower-case (capital letters IJ instead of Ij). Street names, however, show only the first letter. Is it possible to use lower-case only for POIs and not for lines/polygons?

Marko Mäkelä

9:59 a.m.

On Tue, Feb 26, 2013 at 10:03:23AM +0100, Minko wrote:

...

The option --lower-case might look better on newer generation GPS (the Dutch letter IJ finally gets rendered as it should, not Ij) but in older gps models the labels of the roads are not visible anymore, only dots and the first letter: a.....

IIRC, on the Garmin Edge 705 the lower-case works in 'tooltips' but not in the rotated fonts that are used for road names. There, it would only display the upper-case letters as is, and show all lower-case letters as '?' (it could be '.' too, but I think it was '?').

Would it be useful to implement a lower-case option that upper-cases only the road names and keeps everything else in mixed case?

Marko

Minko

10:06 a.m.

Yes, on the older Etrex series it was also displayed as "????". Road names are indeed the only rotated names, so if this could be implemented, that would be great.

...

IIRC, on the Garmin Edge 705 the lower-case works in 'tooltips' but not in the rotated fonts that are used for road names. There, it would only display the upper-case letters as is, and show all lower-case letters as '?' (it could be '.' too, but I think it was '?').

Would it be useful to implement a lower-case option that upper-cases only the road names and keeps everything else in mixed case?

Marko

Robert Joop

9:52 p.m.

On 13-02-26 10:59:09 CET, Marko Mäkelä wrote:

...

On Tue, Feb 26, 2013 at 10:03:23AM +0100, Minko wrote:

...
The option --lower-case might look better on newer generation GPS (the Dutch letter IJ finally gets rendered as it should, not Ij) but in older gps models the labels of the roads are not visible anymore, only dots and the first letter: a.....

IIRC, on the Garmin Edge 705 the lower-case works in 'tooltips' but not in the rotated fonts that are used for road names. There, it would only display the upper-case letters as is, and show all lower-case letters as '?' (it could be '.' too, but I think it was '?').

Similar on a GPSmap 60CSx. With --lower-case, the lower case characters in street names show up as something similar to “.” in the map, but show up fine in the tooltip, including the “ß”. In POIs, the lower case characters including ß show up fine in both map and tooltip.

...

Would it be useful to implement a lower-case option that upper-cases only the road names and keeps everything else in mixed case?

I agree: the mixed letter case in POI names looks nicer IMHO and takes up less space.

rj

Steve Ratcliffe

1:29 p.m.

...

Given these results, I’d like to ask you to reconsider your proposed action to remove the --lower-case option and code. http://wiki.openstreetmap.org/wiki/Mkgmap/dev/option-review

Yes, you are right it should remain. I've just had to explain so many times why it doesn't work on the Legend era devices that I've got a thing against it!

...

My theory:
• when the Garmin sees all upper case characters in a label, it folds the case into title caps style
• when the Garmin sees a lower case character in a label, it leaves the label characters’ case as-is.

With this theory, it probably considered the ª and º characters as lower case.

Oh I see, yes that is probably it.

..Steve

Robert Joop

10:10 p.m.

On 13-02-26 14:29:43 CET, Steve Ratcliffe wrote:

...

Hi

...
Given these results, I’d like to ask you to reconsider your proposed action to remove the --lower-case option and code. http://wiki.openstreetmap.org/wiki/Mkgmap/dev/option-review

Yes, you are right it should remain. I've just had to explain so many times why it doesn't work on the Legend era devices that I've got a thing against it!

Isn’t --lower-case a little bit of a misnomer, as it actually does not lower-case but inhibits the upper-casing?

Perhaps these three options would be adequate:

* Absence of option upper-cases everything. This is the safest default to get maps that work on all devices, I assume.
* --upper-case-streets option provides working maps on older devices like the GPSmap 60CSx but with the nicer mixed-case POI names. Could this be the default or are there devices that do not support mixed case POI names?
* --keep-case option for recent devices. This is --lower-case renamed.

rj
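The three proposed behaviours could be modelled as a single mode switch. The sketch below is purely illustrative: the enum values mirror the option names suggested in this mail, none of them are existing mkgmap options, and the `isRoad` flag is a hypothetical way of telling road labels from everything else.

```java
import java.util.Locale;

class CaseOptions {
    // Hypothetical modes mirroring the three proposals above.
    enum CaseMode { UPPER_ALL, UPPER_STREETS, KEEP_CASE }

    /** Applies the chosen case policy to one label. */
    static String apply(CaseMode mode, String label, boolean isRoad) {
        boolean upper = mode == CaseMode.UPPER_ALL
                || (mode == CaseMode.UPPER_STREETS && isRoad);
        return upper ? label.toUpperCase(Locale.ROOT) : label;
    }

    public static void main(String[] args) {
        System.out.println(apply(CaseMode.UPPER_STREETS, "Hauptweg", true));
        System.out.println(apply(CaseMode.UPPER_STREETS, "Hotel Krone", false));
    }
}
```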

participants (5)

Marko Mäkelä
Michał Rogala
Minko
Robert Joop
Steve Ratcliffe