Transliteration


Lambertus

21 Oct 2009

8:27 a.m.

A question came up on the OSM forum about the many ??? in my maps. They result from using a single code page for the whole world, combined with the lack of automatic transliteration. I'm starting to understand what automatic transliteration needs: detection of the source language, and various methods for transliterating to another language (for some languages one-to-one character replacement suffices, others need lookup tables, etc.). The subject has been mentioned on this mailing list before, but it doesn't seem very active, so: is anyone working on this? The discussion on the OSM forum starts here (and continues further down): http://forum.openstreetmap.org/viewtopic.php?pid=42175#p42175
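To make the two approaches mentioned above concrete, here is a minimal Python sketch, not taken from mkgmap. The tables are tiny hypothetical samples for a handful of Cyrillic letters; a real table would be much larger and language-specific.

```python
# Hypothetical sample tables for illustration only; not mkgmap's data.

# Letters that map to exactly one Latin letter (1-to-1 replacement).
ONE_TO_ONE = str.maketrans({"б": "b", "в": "v", "г": "g", "д": "d", "и": "i", "л": "l"})

# Letters that need more than one output character (lookup table).
LOOKUP = {"ж": "zh", "ч": "ch", "ш": "sh", "я": "ya"}


def transliterate(text: str) -> str:
    """Expand multi-character entries first, then apply 1-to-1 replacements."""
    expanded = "".join(LOOKUP.get(ch, ch) for ch in text)
    return expanded.translate(ONE_TO_ONE)
```

Characters missing from both tables pass through unchanged, so plain ASCII input is left as it is.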


Steve Ratcliffe

22 Oct

11:22 a.m.

Hi Lambertus

...

A question popped up on the OSM forum about the many ??? in my maps. They are a result of the single code page which is used for the whole world and missing automatic transliteration.

I'm starting to understand what's needed for automatic transliteration: detection of source language and various methods for transliteration to another language (some languages suffice with 1-on-1 character replacement, others need lookup tables etc).

The subject has been mentioned on this mailing list before but doesn't seem like it's very active, so: is anyone working on this?

There is some transliteration support, but it only kicks in if the target character set is ascii. I know I said a few months back that we should also do the same thing when --latin1 is supplied, and I suppose now is the time! We can also improve the range of transliterations using the perl script that Ævar Arnfjörð Bjarmason recently posted. This led me to find a number of resources that might be useful, including http://site.icu-project.org/ which claims to be comprehensive but may or may not be useful for our purposes; I've not investigated it.

...

The discussion on the OSM forum starts here (and further down): http://forum.openstreetmap.org/viewtopic.php?pid=42175#p42175

The thing I would like to add is that code-page is not particularly useful for a general-purpose map such as the one you produce. See, for example, http://wiki.openstreetmap.org/index.php/Mkgmap/i18n where you can see a screenshot with Russian characters. It looks great, but you need firmware that supports that character set, and even units that you buy in the country in question may not have it.

There is also a unicode/mbcs format, which I have never seen and which, as far as I know, again depends on firmware support, so it would be of no use to you either, even if implemented.

As you say, until we can split tiles along country borders, varying parameters between different tiles is not going to be satisfactory, even if there were an automated way of determining what they should be. (As an aside, I looked into what is needed to do that and I think I understand it now.)

..Steve

Lambertus

11:49 a.m.

Thanks Steve for having a look at it. Steve Ratcliffe wrote:

...

The thing I would like to add is that code-page is not particularly useful for a general purpose map such as the one you produce. See for example, http://wiki.openstreetmap.org/index.php/Mkgmap/i18n where you can see a screen shot with Russian characters. Looks great, but you need firmware that supports that character set and even units that you buy in the country in question may not.

Ok, so if I understand correctly, I should remove the --latin1 and --code-page parameters (the code-page already being redundant)?

...

There is also a unicode/mbcs format, which I have never seen and is as far as I know again dependent on having firmware support and so would be of no use to you either, even if implemented.

Well indeed, there's no point in supporting apparently obscure features in a general-purpose map...

In your opinion, what's the best thing we can say to (impatient) map users who want readable names in maps? Wait for improved support in Mkgmap, or is adding transliterated names to the OSM data an option?

Btw, it seems that automatic language detection is the biggest problem, and that most tools expect you to provide them with a source and target language setting. It should be fairly straightforward to compile a list of country polys and default source languages to work around the detection step.
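The country-to-default-language list suggested above could be sketched roughly as follows in Python. This is an illustration under loud assumptions: crude bounding boxes stand in for real country polygons, and the coordinates, country codes and language codes are hypothetical sample values.

```python
from typing import Optional

# Hypothetical, very rough bounding boxes (min_lon, min_lat, max_lon, max_lat)
# standing in for real country polygons. Values are illustrative only.
COUNTRY_DEFAULTS = [
    ("GR", "el", (19.0, 34.5, 28.5, 42.0)),
    ("JP", "ja", (129.0, 31.0, 146.0, 46.0)),
]


def default_language(lon: float, lat: float) -> Optional[str]:
    """Return the default source language for a coordinate, or None if unknown."""
    for _country, lang, (min_lon, min_lat, max_lon, max_lat) in COUNTRY_DEFAULTS:
        if min_lon <= lon <= max_lon and min_lat <= lat <= max_lat:
            return lang
    return None
```

Bounding boxes of neighbouring countries overlap, so a real version would need a proper point-in-polygon test against the country polys.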

Ivan Kostoski

12:47 p.m.

...

What's the best thing we can say to (impatient) map users that want readable names in maps in your opinion? Wait for improved support in Mkgmap or is adding transliterated names in the OSM data an option?

My suggestion would be: if --code-page is given, transliterate to ascii all characters that fall outside the code page range. I.e. if I want to build a Cyrillic map (--code-page=1251), I would transliterate all characters that fall outside the (0x0020-0x007f; 0x0400-0x04ff) ranges, so if any, e.g., Greek characters are present in the input data, they will be "mapped" to ascii. Similarly with cp1252 (i.e. latin1), transliterate everything outside the 0x0020-0x007f; 0x00a0-0x00ff ranges. For cp1250, transliterate everything > 0x007f.
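A minimal Python sketch of this idea, under stated assumptions: instead of hard-coded Unicode ranges it asks Python's own codec whether a character is encodable, which may not match what Garmin firmware actually supports. The fallback table and the function names are hypothetical.

```python
import unicodedata

# Hypothetical fallback table for a few Greek capitals; illustration only.
FALLBACK = {
    "\u0391": "A",   # GREEK CAPITAL LETTER ALPHA
    "\u0398": "TH",  # GREEK CAPITAL LETTER THETA
    "\u0397": "I",   # GREEK CAPITAL LETTER ETA
    "\u039d": "N",   # GREEK CAPITAL LETTER NU
}


def to_ascii(ch: str) -> str:
    """Best-effort ASCII replacement for a single character."""
    if ch in FALLBACK:
        return FALLBACK[ch]
    # Strip accents: decompose, then drop everything that is not ASCII.
    decomposed = unicodedata.normalize("NFKD", ch)
    stripped = decomposed.encode("ascii", "ignore").decode("ascii")
    return stripped or "?"


def fit_to_code_page(text: str, code_page: str) -> str:
    """Keep characters the code page can encode; transliterate the rest."""
    out = []
    for ch in text:
        try:
            ch.encode(code_page)
            out.append(ch)
        except UnicodeEncodeError:
            out.append(to_ascii(ch))
    return "".join(out)
```

With "cp1251" as the code page, Cyrillic text is kept as-is while the Greek capitals in the table are mapped to ASCII; anything with no known replacement becomes "?".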

Cyrillic, in most cases, is easy to transliterate (or romanize, etc.) by simple mapping, as there are few exceptions. Try properly transliterating, e.g., Greek (see http://en.wikipedia.org/wiki/Romanization_of_Greek), taking into account letter combinations and positioning in the word, and assuming you know which standard of Greek romanization you will use...

What I did to work around it is to use a small program that pre-processes the osm data, adding name:ascii tags whenever it encounters non-ascii characters, and then use --name-tag-order=name:ascii, name as options. However, I think that similar functionality could be implemented in mkgmap...

Regards, Ivan
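The pre-processing step described above might look roughly like the following Python sketch. It is not Ivan's program: the function name is hypothetical, the transliteration function is supplied by the caller, and it assumes the OSM XML is small enough to parse in memory.

```python
import xml.etree.ElementTree as ET
from typing import Callable


def add_ascii_names(osm_xml: str, translit: Callable[[str], str]) -> str:
    """Add a name:ascii tag to every element whose name is not pure ASCII."""
    root = ET.fromstring(osm_xml)
    for element in root:  # nodes, ways and relations
        tags = {t.get("k"): t for t in element.findall("tag")}
        name = tags.get("name")
        if name is None or "name:ascii" in tags:
            continue  # nothing to do, or already tagged
        value = name.get("v", "")
        if not value.isascii():
            ET.SubElement(element, "tag", k="name:ascii", v=translit(value))
    return ET.tostring(root, encoding="unicode")
```

The original name tag is left untouched, so the output can still be used with a name tag order that prefers name:ascii and falls back to name.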

Steve Ratcliffe

1:28 p.m.

On 22/10/09 12:49, Lambertus wrote:

...

Ok, so if I understand correctly, I should remove the --latin1 and --code-page parameters (the code-page already being redundant)?

Yes that will currently give the best results on a worldwide basis, but you will have no accents on characters at all. It won't take long to make the current transliteration support available to --latin1, but the code needs a bit of re-arrangement to make it possible.

...

What's the best thing we can say to (impatient) map users that want readable names in maps in your opinion? Wait for improved support in Mkgmap or is adding transliterated names in the OSM data an option?

If someone wants to go beyond what is currently possible and transliterate based on attempting to detect country/language, then it might be a good idea to pre-process (although if you are going to do that I would just change name, rather than adding an extra tag). If something really good comes from it, I won't mind adding it to mkgmap if it is appropriate.

..Steve

Lambertus

23 Oct

2:30 p.m.

Steve Ratcliffe wrote:

...

On 22/10/09 12:49, Lambertus wrote:

...
Ok, so if I understand correctly, I should remove the --latin1 and --code-page parameters (the code-page already being redundant)?

Yes that will currently give the best results on a worldwide basis, but you will have no accents on characters at all.

It won't take long to make the current transliteration support available to --latin1, but the code needs a bit of re-arrangement to make it possible.

I'll leave the code-page at Latin1 then, assuming that Latin1 is supported by most GPS devices.

...

...
What's the best thing we can say to (impatient) map users that want readable names in maps in your opinion? Wait for improved support in Mkgmap or is adding transliterated names in the OSM data an option?

If someone wants to go beyond what is currently possible and transliterate based on attempting to detect country/language then it might be a good idea to pre-process (although if you are going to do that I would just change name, rather than adding an extra tag). If something really good comes from it I won't mind adding it into mkgmap if it is appropriate.

Ok, I've sent this forward to the forum and we'll see what comes out of this :) Thanks.

Marko Mäkelä

22 Oct

6:44 p.m.

On Thu, Oct 22, 2009 at 12:22:23PM +0100, Steve Ratcliffe wrote:

...

I know I said a few months back that we should also do the same thing when --latin1 is supplied, and I suppose now is the time!

We can also improve the range of transliterations using the perl script that Ævar Arnfjörð Bjarmason recently posted. This led me to find a number of resources that might be useful, including http://site.icu-project.org/ which claims to be comprehensive but may or may not be useful for our purposes, I've not investigated it.

Note that Garmin's latin1 is not exactly ISO 8859-1 or Microsoft CP1252: http://www.mkgmap.org.uk/pipermail/mkgmap-dev/2009q2/001862.html One notable deficiency is that curly quotes are not working. I tried to convert them to straight ' and " back in May, but failed: http://www.mkgmap.org.uk/pipermail/mkgmap-dev/2009q2/001905.html
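For reference, the curly-to-straight quote conversion mentioned above is simple to express as a character map. The Python sketch below only illustrates the mapping itself; it says nothing about why the attempt inside mkgmap failed, and the function name is hypothetical.

```python
# Map the four common curly quotation marks to their straight ASCII forms.
CURLY_TO_STRAIGHT = str.maketrans({
    "\u2018": "'",  # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK
    "\u201c": '"',  # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',  # RIGHT DOUBLE QUOTATION MARK
})


def straighten_quotes(text: str) -> str:
    """Replace curly quotes with straight ones; other characters are untouched."""
    return text.translate(CURLY_TO_STRAIGHT)
```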

...

...
The discussion on the OSM forum starts here (and further down): http://forum.openstreetmap.org/viewtopic.php?pid=42175#p42175

The thing I would like to add is that code-page is not particularly useful for a general purpose map such as the one you produce. See for example, http://wiki.openstreetmap.org/index.php/Mkgmap/i18n where you can see a screen shot with Russian characters. Looks great, but you need firmware that supports that character set and even units that you buy in the country in question may not.

I came across a download link to Cyrillic firmware some time ago. "UNOFFICIAL VERSION!!! AS IS!!! NO WARRANTIES!!! NOT FOR SALE!!!" :-) http://www.gps-forum.ru/cgi-bin/forum/showpost.pl?Board=gpsgeneral&Number=11... Best regards, Marko

Ævar Arnfjörð Bjarmason

24 Oct

2:12 p.m.

On Thu, Oct 22, 2009 at 11:22 AM, Steve Ratcliffe <steve@parabola.me.uk> wrote:

...

We can also improve the range of transliterations using the perl script that Ævar Arnfjörð Bjarmason recently posted. This led me to find a number of resources that might be useful, including http://site.icu-project.org/ which claims to be comprehensive but may or may not be useful for our purposes, I've not investigated it.

You just committed the Greek fixes I sent a patch for, though those were really meant more as an example. It would be possible to extract a lot more from the Text::Unidecode database: http://cpansearch.perl.org/src/SBURKE/Text-Unidecode-0.04/lib/Text/Unidecode...
