Re: [mkgmap-dev] Twülpstedt, Normalisation of unicode strings

16 Nov 2021

Hi Ticker,

I think combining characters only exist in Unicode. See https://en.wikipedia.org/wiki/Combining_character

I'll do some performance tests, but I don't think normalisation will cause trouble for the rather few strings that are encoded as labels.

Gerd

________________________________________
From: mkgmap-dev <mkgmap-dev-bounces@lists.mkgmap.org.uk> on behalf of Ticker Berkin <rwb-mkgmap@jagit.co.uk>
Sent: Tuesday, 16 November 2021 11:42
To: Development list for mkgmap
Subject: Re: [mkgmap-dev] Twülpstedt, Normalisation of unicode strings

Hi Gerd

If it is standard that in, for example, cp1252, rendering a letter
followed by an accent looks the same as the equivalent Unicode sequence
(i.e. it merges them), normalisation could be delayed until after an
attempt has been made to encode the whole string, i.e. in
AnyCharSetEncoder, after
        if (result.isUnmappable()) {
do the normalisation and try to encode the whole string again, before
going on to transliterate the normalised string if that fails.
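
A minimal sketch of the "encode first, normalise only on failure" idea
described above. This is not the mkgmap code: the class and method names
are hypothetical, it uses the plain JDK CharsetEncoder rather than
AnyCharSetEncoder, and ISO-8859-1 stands in for cp1252 because every JVM
is guaranteed to provide it.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

// Hypothetical helper, for illustration only.
class LazyNormalisingEncoder {

    static byte[] encode(String text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            // First attempt: the string exactly as it came from the input.
            return toArray(encoder.encode(CharBuffer.wrap(text)));
        } catch (CharacterCodingException firstFailure) {
            // Only now pay for normalisation, then retry the whole string.
            String normalised = Normalizer.normalize(text, Normalizer.Form.NFC);
            try {
                return toArray(encoder.encode(CharBuffer.wrap(normalised)));
            } catch (CharacterCodingException secondFailure) {
                // Real code would go on to transliterate the normalised string here.
                throw new IllegalArgumentException("unmappable even after NFC", secondFailure);
            }
        }
    }

    private static byte[] toArray(ByteBuffer buffer) {
        byte[] out = new byte[buffer.remaining()];
        buffer.get(out);
        return out;
    }

    public static void main(String[] args) {
        // u + COMBINING DIAERESIS is unmappable as-is, but encodes after NFC.
        byte[] bytes = encode("Twu\u0308lpstedt", StandardCharsets.ISO_8859_1);
        System.out.println(bytes.length); // 10
    }
}
```

With this ordering, strings that already fit the target charset never go
through the Normalizer at all.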

I couldn't find any pointers to expected behaviour for these circumstances,
so it is probably best to use your version.

Agree with Format6Encoder.

Utf8Encoder: Be consistent with AnyCharSetEncoder, i.e. agree with your
version if you keep your version of AnyCharSetEncoder. If you change it
as above, then don't normalise here.

In the "latin1" part of the test, depending on the editor, it might be
difficult to see that the test result contains the single char "ü", or
that the starting string contains 2 chars, "u" and "¨". Worse, an
editor might change them. Maybe there should be a test on the string lengths.
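
One way such a length check might look, as a sketch only. The class and
method names are made up for illustration and are not from the patch.
Writing both strings with \u escapes also avoids the problem of an editor
silently changing them.

```java
import java.text.Normalizer;

// Hypothetical check, for illustration only.
class UmlautLengthCheck {
    // "u" followed by U+0308 COMBINING DIAERESIS: two chars.
    static final String DECOMPOSED = "u\u0308";
    // Precomposed U+00FC: one char.
    static final String COMPOSED = "\u00fc";

    static boolean lengthsAsExpected() {
        String result = Normalizer.normalize(DECOMPOSED, Normalizer.Form.NFC);
        return DECOMPOSED.length() == 2
                && result.length() == 1
                && result.equals(COMPOSED);
    }

    public static void main(String[] args) {
        System.out.println(lengthsAsExpected()); // true
    }
}
```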

Ticker

On Tue, 2021-11-16 at 09:27 +0000, Gerd Petermann wrote:
...
Patch was missing...
________________________________________
From: mkgmap-dev <mkgmap-dev-bounces@lists.mkgmap.org.uk> on behalf
of Gerd Petermann <gpetermann_muenchen@hotmail.com>
Sent: Tuesday, 16 November 2021 10:27
To: Development list for mkgmap
Subject: Re: [mkgmap-dev] Twülpstedt, Normalisation of unicode
strings
Hi,
please review my patch. I had some problems adding the Twülpstedt
example to the existing unit test. I think the new code is closer to
what should be tested.
Did I miss something?
Gerd
________________________________________
From: mkgmap-dev <mkgmap-dev-bounces@lists.mkgmap.org.uk> on behalf
of Gerd Petermann <gpetermann_muenchen@hotmail.com>
Sent: Monday, 15 November 2021 17:22
To: Development list for mkgmap
Subject: Re: [mkgmap-dev] Twülpstedt, Normalisation of unicode
strings
Hi Ticker,
OK, I had the same thoughts.
Gerd
________________________________________
From: mkgmap-dev <mkgmap-dev-bounces@lists.mkgmap.org.uk> on behalf
of Ticker Berkin <rwb-mkgmap@jagit.co.uk>
Sent: Monday, 15 November 2021 16:19
To: Development list for mkgmap
Subject: Re: [mkgmap-dev] Twülpstedt, Normalisation of unicode
strings
Hi
I'd vote for normalisation when the label is generated.
If the un-normalised string can be represented in the target charset,
no need for normalisation.
I don't see that styles should be testing names like this, and, if they
really need to, clauses for alternate representations could be added.
The proportion of input tag values that never make it into the final
.img must be quite high, so doing it early could be costly.
Ticker
On Mon, 2021-11-15 at 11:01 +0000, Gerd Petermann wrote:
...
Hi all,
see also https://forum.openstreetmap.org/viewtopic.php?id=74231
mkgmap sometimes fails to encode correct strings for a given codepage
like 1252 (latin1).
I've uploaded a file that contains an area in Germany where the u-umlaut
in the name Twülpstedt is encoded in two different ways, either with
ü (0xfc) or as u + "COMBINING DIAERESIS" (0x75 + 0x308).
See umlaut.osm at https://files.mkgmap.org.uk/detail/537
With the current code the 2nd variant is displayed as Twu?lpstedt.
This 1-liner
name = Normalizer.normalize(name, Normalizer.Form.NFC);
helps to change the name to the usual encoding which works well with
the codepage translation.
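
For reference, a small self-contained sketch of what that call does to
the two variants of the name. The class and method names are hypothetical;
only Normalizer.normalize and Normalizer.Form.NFC come from the JDK.

```java
import java.text.Normalizer;

// Hypothetical demo class, for illustration only.
class NfcLabelDemo {

    /** Returns the NFC (composed) form; already-composed input comes back unchanged. */
    static String toNfc(String name) {
        return Normalizer.normalize(name, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "Twu\u0308lpstedt"; // u (0x75) + COMBINING DIAERESIS (0x308)
        String composed = "Tw\u00fclpstedt";    // precomposed u-umlaut (0xfc)

        System.out.println(decomposed.length());                 // 11
        System.out.println(toNfc(decomposed).length());          // 10
        System.out.println(toNfc(decomposed).equals(composed));  // true
        System.out.println(toNfc(composed).equals(composed));    // true
    }
}
```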
So far so good. Now I wonder where exactly this call should be placed.
My first idea was the code where the string is converted to a Garmin
label, but maybe it should happen much earlier so that the style rules
also "see" the normalized form.
Any thoughts?
Gerd
_______________________________________________
mkgmap-dev mailing list
mkgmap-dev@lists.mkgmap.org.uk
https://www.mkgmap.org.uk/mailman/listinfo/mkgmap-dev