Name Substitution not correctly working

Felix Hartmann

25 Jul 2014

12:12 p.m.

1. There seems to be a problem with Chinese tone marks, e.g. on http://www.openstreetmap.org/node/244080668

name:zh_pinyin ~ '.*Shì' { set name:zh_pinyin='${name:zh_pinyin|subst:"Shì=>"}'; echo "rule working" }

does not work, nor does

name:zh_pinyin ~ '.*Shi' { set name:zh_pinyin='${name:zh_pinyin|subst:"Shi=>"}'; echo "rule 2 working" }

2. How can I remove parentheses from the name?

name=* { set name='${name|subst:"(=>"'}
name=* { set name='${name|subst:")=>"'}

is not working. Or rather, it gets called but I lose all characters in Japanese or Chinese, e.g. here: http://www.openstreetmap.org/node/569005420

--
keep on biking and discovering new trails

Felix
openmtbmap.org & www.velomap.org

Steve Ratcliffe

25 Jul

3:38 p.m.

...

1. There seems to be a problem with Chinese Tonemarks... e.g. on http://www.openstreetmap.org/node/244080668

name:zh_pinyin ~ '.*Shì' { set name:zh_pinyin='${name:zh_pinyin|subst:"Shì=>"}'; echo "rule working" }

This seems to work for me.

...

name:zh_pinyin ~ '.*Shi' { set name:zh_pinyin='${name:zh_pinyin|subst:"Shi=>"}'; echo "rule 2 working" } work.

This one does not match the given node.

...

2. How can I remove parentheses from the name? name=* { set name='${name|subst:"(=>"'} name=* { set name='${name|subst:")=>"'}

Perhaps it was just a typo in your mail, but these should be:

name=* { set name='${name|subst:"(=>"}' }
name=* { set name='${name|subst:")=>"}' }

...

Is not working. Or better it gets called but I loose all characters in Japanese or Chinese...

With these corrections it does seem to remove the brackets without losing anything else. ..Steve
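What the corrected rules do to a name can be sketched in Python. This is only an illustration, assuming that subst performs a literal (non-regex) replacement, which is consistent with an unescaped ( working in the rules above. The function name subst and the sample name are hypothetical and are not taken from mkgmap or from the node linked above.

```python
def subst(value: str, spec: str) -> str:
    """Apply a 'from=>to' spec as a literal, non-regex replacement.

    Assumption: this mirrors the behaviour described in the thread;
    it is not mkgmap's actual implementation.
    """
    old, sep, new = spec.partition("=>")
    if not sep or not old:
        raise ValueError(f"expected 'from=>to', got {spec!r}")
    return value.replace(old, new)


# Hypothetical tag value mixing CJK characters and parentheses.
name = "東京 (Tokyo)"
for spec in ("(=>", ")=>"):
    name = subst(name, spec)

print(name)
```

With a literal replacement only the two bracket characters disappear, so the CJK characters are left untouched.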

Felix Hartmann

26 Jul

10:48 a.m.

On 25.07.2014 17:38, Steve Ratcliffe wrote:

...

Hi

...
1. There seems to be a problem with Chinese Tonemarks... e.g. on http://www.openstreetmap.org/node/244080668

name:zh_pinyin ~ '.*Shì' { set name:zh_pinyin='${name:zh_pinyin|subst:"Shì=>"}'; echo "rule working" }

This seems to work for me.

Hmm, have you tried it out on China from Geofabrik? I copied the "Shì" directly from OSM in order to be sure to get it right (not sure if there is a difference between the Chinese 4th tone and the French accent on my keyboard, even though they look identical).

name:zh_pinyin ~ '.* Shì' { set name:zh_pinyin='${name:zh_pinyin|subst:" Shì=>"}'; echo "rule XX working" }
name:zh_pinyin ~ '.* Shì' { set name:zh_pinyin='${name:zh_pinyin|subst:"Shì=>"}'; echo "rule working" }

I cannot see any echo, even though there is plenty of Shì in zh_pinyin in China.

The rule

name:zh_pinyin ~ '.* Shì' {echo "found Shì"}

consequently also never outputs an echo against the Geofabrik China extract, and neither does

name:zh_pinyin ~ '.*Shì' {echo "found .Shì"}

I also thought I could try to just run the rule against all name:zh_pinyin, but of course also to no avail:

name:zh_pinyin=* { set name:zh_pinyin='${name:zh_pinyin|subst:" Shì=>"}'; echo "rule XX working" }
name:zh_pinyin=* { set name:zh_pinyin='${name:zh_pinyin|subst:"Shì=>"}'; echo "rule working" }

This of course gives loads of echo, but I still end up with Shì in all my name:zh_pinyin. Nothing is substituted.

So clearly I have a problem with ì. Other rules using exactly the same notation with Shi for Japan work fine without problems.

Could there be a bug on Windows only, if the rule works for you (assuming under Linux)?

Is there a possible different notation for ì that could be used? Unicodelookup.com gives:

0354 236 0xEC ì

But then I don't know if the Chinese 4th tone is really this character, or maybe another. Is there a possibility to use a Unicode character in an mkgmap style file? I also see hex: 00EC, but that should be the same, I would guess...

...

...
name:zh_pinyin ~ '.*Shi' { set name:zh_pinyin='${name:zh_pinyin|subst:"Shi=>"}'; echo "rule 2 working" } work. This one does not match the given node.

I know - it was in there to make sure that maybe there is no mashing together of i and ì.

...

...
2. How can I remove parentheses from the name? name=* { set name='${name|subst:"(=>"'} name=* { set name='${name|subst:")=>"'}

Perhaps it was just a typo in your mail, but these should be:

name=* { set name='${name|subst:"(=>"}' } name=* { set name='${name|subst:")=>"}' }

...
Is not working. Or better it gets called but I loose all characters in Japanese or Chinese...

With these corrections it does seem to remove the brackets without losing anything else.

Oh yeah - that was a typo in my style. Rule is working correctly now...

...

..Steve

-- keep on biking and discovering new trails Felix openmtbmap.org & www.velomap.org
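On the question of whether two identical-looking ì can differ: Unicode does allow two spellings, a single precomposed code point and a plain i followed by a combining accent. The sketch below uses only Python's standard unicodedata module to show both and to confirm that the three numbers quoted from Unicodelookup.com are the same code point in octal, decimal and hex. It makes no claim about which form the OSM data or any particular keyboard produces.

```python
import unicodedata

composed = "\u00ec"      # ì as one precomposed code point
decomposed = "i\u0300"   # plain i followed by COMBINING GRAVE ACCENT

# Official Unicode name of the precomposed character.
print(unicodedata.name(composed))

# The same code point written in octal, decimal and hexadecimal.
print(oct(ord(composed)), ord(composed), hex(ord(composed)))

# The two spellings render alike but compare unequal as strings...
looks_same_but_differs = composed != decomposed

# ...until both are normalised to the same form (NFC here).
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed
print(looks_same_but_differs, same_after_nfc)
```

If the style file and the data ever used different spellings, normalising both to one form before comparing would be one way to rule that out.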

Steve Ratcliffe

5 p.m.

...

...
...
1. There seems to be a problem with Chinese Tonemarks...

...

...
...
http://www.openstreetmap.org/node/244080668

...

...
...
name:zh_pinyin ~ '.*Shì' { set name:zh_pinyin='${name:zh_pinyin|subst:"Shì=>"}'; echo "rule working" }

This seems to work for me. Hmm, have you tried it out on China from Geofabrik? I copied the "Shì"

I downloaded a small area around the point you gave directly from the OSM website.

...

Could there be a bug on Windows only, if the rule works for you (assuming under Linux)?

I suppose it is possible, but if the style file is not actually in utf-8 that would explain it. If the file is in cp1252 the character ì would look the same, but it would not work. I have uploaded the exact file I used to http://files.mkgmap.org.uk/download/218/s.style (Can be used as --style-file=s.style) and the map extract: http://files.mkgmap.org.uk/download/219/map.osm

...

Is there a possible different notation for ì that could be used? Unicodelookup.com gives: 0354 236 0xEC ì

Yes, that is the character I used, copied from the OSM file. ..Steve
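The cp1252 explanation can be reproduced outside mkgmap. The Python sketch below encodes the same text both ways, then reads the cp1252 bytes as if they were UTF-8, which is roughly what happens when an ANSI style file is read as UTF-8. The sample tag value is hypothetical, and Python's re module merely stands in for the style file's ~ operator here.

```python
import re

text = "Shì"
as_utf8 = text.encode("utf-8")      # ì becomes the two bytes C3 AC
as_cp1252 = text.encode("cp1252")   # ì becomes the single byte EC

# Reading the cp1252 bytes as UTF-8: the lone EC byte is invalid,
# so it is swapped for U+FFFD, the Unicode replacement character.
misread = as_cp1252.decode("utf-8", errors="replace")

# Hypothetical name:zh_pinyin value, correctly decoded from the data.
tag_value = "Běijīng Shì"

matches_with_utf8_style = re.fullmatch(".*" + text, tag_value) is not None
matches_with_ansi_style = re.fullmatch(".*" + misread, tag_value) is not None

print(as_utf8, as_cp1252, ascii(misread))
print(matches_with_utf8_style, matches_with_ansi_style)
```

The pattern from the misread file ends in U+FFFD rather than ì, so it can never match the tag value, even though both files look the same in an editor.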

Felix Hartmann

5:43 p.m.

On 26.07.2014 19:00, Steve Ratcliffe wrote:

...

Hi

...
...
...
1. There seems to be a problem with Chinese Tonemarks...

...
...
...
http://www.openstreetmap.org/node/244080668

...
...
...
name:zh_pinyin ~ '.*Shì' { set name:zh_pinyin='${name:zh_pinyin|subst:"Shì=>"}'; echo "rule working" }

This seems to work for me. Hmm, have you tried it out on China from Geofabrik? I copied the "Shì"

I downloaded a small area around the point to gave directly from the OSM website.

...
Could there be a bug on Windows only, if the rule works for you (assuming under Linux)?

I suppose it is possible, but if the style file is not actually in utf-8 that would explain it. If the file is in cp1252 the character ì would look the same, but it would not work.

Okay - I used ANSI. Could there maybe be a check for this in the check-styles routine, or in general? I suppose that must have been the problem.

However, when I open the mkgmap default style file, I notice that Notepad++ on Windows also opens it as ANSI... I thought Notepad++ picked a reasonable codepage without destroying anything, unlike Windows' own Notepad...

I'll have a look at what happens now - after changing the encoding to UTF-8 I get '.* ShxEC' shown - so there might be some bug...

...

I have uploaded the exact file I used to http://files.mkgmap.org.uk/download/218/s.style (Can be used as --style-file=s.style) and the map extract: http://files.mkgmap.org.uk/download/219/map.osm

...
Is there a possible different notation for ì that could be used? Unicodelookup.com gives: 0354 236 0xEC ì

Yes, that is the character I used, copied from the OSM file.

..Steve

-- keep on biking and discovering new trails Felix openmtbmap.org & www.velomap.org

Steve Ratcliffe

27 Jul

10:32 a.m.

On 26/07/14 18:43, Felix Hartmann wrote:

...

Okay - I used ANSI. Could there maybe be a check for this in the check styles routine, or in general? I do suppose that must have been the problem.

Although it is not always possible to tell if a file is in the wrong encoding, it should have been in this case. I see that the ì character gets converted to a unicode replacement character (0xfffd)

If you had done: echo 'Shì'

it would have come out something like: Sh� (hope that works in email) and shown the problem.

There are a couple of ways to make bad characters an error, rather than getting replaced. The attached patch allows them to be replaced and then throws an error when seen. This has the advantage of giving you file name and line number of the error. It might interfere with something valid, so give it a try.

I don't use notepad++, but these links might be useful:

http://superuser.com/questions/292086/how-can-i-enforce-so-notepad-uses-utf-...
http://stackoverflow.com/questions/5090845/change-the-default-encoding-for-n...

..Steve
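The approach described for the patch (let bad bytes be replaced, then raise an error on seeing the replacement character, with file name and line number) can be sketched in Python. This is not the attached patch, which is not reproduced in the thread; the function name check_style_encoding and the temporary file are hypothetical, and only the error wording is borrowed from the message mkgmap prints.

```python
import tempfile
from pathlib import Path

REPLACEMENT = "\ufffd"  # what invalid bytes turn into when decoding leniently


def check_style_encoding(path):
    """Raise ValueError naming file and line if the file is not valid UTF-8.

    Like the approach described above, bad bytes are first replaced and
    then detected, so a file that legitimately contains U+FFFD is also
    rejected.
    """
    text = Path(path).read_bytes().decode("utf-8", errors="replace")
    for number, line in enumerate(text.split("\n"), start=1):
        if REPLACEMENT in line:
            raise ValueError(
                f"{path}:{number}: bad character in input, "
                "file probably not in utf-8"
            )


# Demo: a two-line style fragment saved as cp1252 ("ANSI").
with tempfile.TemporaryDirectory() as tmp:
    style = Path(tmp) / "points"
    content = "# comment\nname:zh_pinyin ~ '.*Shì' { echo 'found' }\n"
    style.write_bytes(content.encode("cp1252"))
    try:
        check_style_encoding(style)
        message = "no problem found"
    except ValueError as err:
        message = str(err)

print(message)
```

The check works on whole lines, so a bad character inside a comment is reported as well, and the line number points at the first offending line.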

Felix Hartmann

1:54 p.m.

1. Yes - had I set notepad to default to UTF-8 I would probably have avoided the bug (as long as you don't use the create-new-document dialog on right click in Windows - those files will always be in ANSI unless you do some registry hacks).

And yes - the mkgmap style-file is in UTF-8 - but as a Windows user you usually don't notice, because it is without BOM. As long as there is no umlaut or other special character in it, notepad++ (or probably most Windows users) will open the file as ANSI, because without any such character the two encodings are actually identical. Were the mkgmap style-file in UTF-8 with BOM, it would be clearer... (but I don't want to start a with-or-without-BOM discussion here).

So right now only the address file in the style is quite safe - because recently some special characters were added:

mkgmap:country=POL & mkgmap:region!=* & mkgmap:admin_level4=* { set mkgmap:region='${mkgmap:admin_level4|subst:województwo =>}' }

But as long as there is no working check - and the mkgmap default style-file comes in UTF-8 without BOM - there is quite a big danger that the bug will happen to others too...

For my style I have now set it to UTF-8, plus for added security (though it won't matter) I added a line:

#this is a UTF-8 check - ÖÄÜè

so should any editor actually change the encoding to ANSI, I would notice directly. Such a line at the start could be an alternative to UTF-8 with BOM.

2. About the patch: Mmmh - that patch goes a bit too far...
- it actually stops at errors on input file (not style) too I think (note the time stamp 30 seconds later):

14:49:25 china cn 6555 this is run101 starting to compile openmtmbap with mkgmap
Exception in thread "main" uk.me.parabola.mkgmap.scan.SyntaxException: Error: (stream:10089): Bad character in input, file probably not in utf-8
    at uk.me.parabola.mkgmap.scan.TokenScanner.readChar(TokenScanner.java:239)
    at uk.me.parabola.mkgmap.scan.TokenScanner.readTok(TokenScanner.java:189)
    at uk.me.parabola.mkgmap.scan.TokenScanner.fillTok(TokenScanner.java:154)
    at uk.me.parabola.mkgmap.scan.TokenScanner.ensureTok(TokenScanner.java:150)
    at uk.me.parabola.mkgmap.scan.TokenScanner.isEndOfFile(TokenScanner.java:111)
    at uk.me.parabola.mkgmap.srt.SrtTextReader.read(SrtTextReader.java:145)
    at uk.me.parabola.mkgmap.srt.SrtTextReader.<init>(SrtTextReader.java:105)
    at uk.me.parabola.mkgmap.srt.SrtTextReader.<init>(SrtTextReader.java:97)
    at uk.me.parabola.mkgmap.srt.SrtTextReader.sortForCodepage(SrtTextReader.java:126)
    at uk.me.parabola.mkgmap.main.Main.getSort(Main.java:638)
    at uk.me.parabola.mkgmap.main.Main.processFilename(Main.java:246)
    at uk.me.parabola.mkgmap.CommandArgsReader$Filename.processArg(CommandArgsReader.java:256)
    at uk.me.parabola.mkgmap.CommandArgsReader.readArgs(CommandArgsReader.java:125)
    at uk.me.parabola.mkgmap.main.Main.mainStart(Main.java:134)
    at uk.me.parabola.mkgmap.main.Main.main(Main.java:105)
Could Not Find C:\OpenMTBMap\maps\ovm_6555*.img
14:49:55 china cn 6555 Finished Compiling Openmtbmap - this is run101
mapsetbuilding failed - to few maxnodes??
Press any key to continue . . .
vs (input file in ANSI):

15:11:38 china cn 6555 this is run101 starting to compile openmtmbap with mkgmap
Exception in thread "main" uk.me.parabola.mkgmap.scan.SyntaxException: Error: (stream:10089): Bad character in input, file probably not in utf-8
    at uk.me.parabola.mkgmap.scan.TokenScanner.readChar(TokenScanner.java:239)
    at uk.me.parabola.mkgmap.scan.TokenScanner.readTok(TokenScanner.java:189)
    at uk.me.parabola.mkgmap.scan.TokenScanner.fillTok(TokenScanner.java:154)
    at uk.me.parabola.mkgmap.scan.TokenScanner.ensureTok(TokenScanner.java:150)
    at uk.me.parabola.mkgmap.scan.TokenScanner.isEndOfFile(TokenScanner.java:111)
    at uk.me.parabola.mkgmap.srt.SrtTextReader.read(SrtTextReader.java:145)
    at uk.me.parabola.mkgmap.srt.SrtTextReader.<init>(SrtTextReader.java:105)
    at uk.me.parabola.mkgmap.srt.SrtTextReader.<init>(SrtTextReader.java:97)
    at uk.me.parabola.mkgmap.srt.SrtTextReader.sortForCodepage(SrtTextReader.java:126)
    at uk.me.parabola.mkgmap.main.Main.getSort(Main.java:638)
    at uk.me.parabola.mkgmap.main.Main.processFilename(Main.java:246)
    at uk.me.parabola.mkgmap.CommandArgsReader$Filename.processArg(CommandArgsReader.java:256)
    at uk.me.parabola.mkgmap.CommandArgsReader.readArgs(CommandArgsReader.java:125)
    at uk.me.parabola.mkgmap.main.Main.mainStart(Main.java:134)
    at uk.me.parabola.mkgmap.main.Main.main(Main.java:105)
Could Not Find C:\OpenMTBMap\maps\ovm_6555*.img
15:11:42 china cn 6555 Finished Compiling Openmtbmap - this is run101
mapsetbuilding failed - to few maxnodes??

However, now that I once had a file in ANSI (even though I changed it back to UTF-8), some residue in memory means I always get the error directly, even with the default style...
C:\OpenMTBMap\maps>start /low /b /wait java -jar -XX:StringTableSize=100003 -Xms6000M -Xmx10300M c:\openmtbmap\mkgmap.jar --max-jobs=8 "--generate-sea" "--code-page=65001" "--precomp-sea=c:\openmtbmap\maps\sea.zip" --nsis --index --levels="0:24, 1:23, 2:22, 3:21, 4:20, 5:19, 6:18" --overview-levels="7:17, 8:16, 9:15, 10:14, 11:13, 12:12" --adjust-turn-headings --add-pois-to-areas --reduce-point-density=3.4 --reduce-point-density-polygon=6 --housenumbers --link-pois-to-ways --ignore-turn-restrictions --polygon-size-limits="24:16, 23:14, 22:12, 21:11, 20:10, 19:9, 18:8, 17:7, 16:6, 15:5, 14:4, 13:3, 12:2, 11:0, 10:0" --description=openmtbmap_gcc --show-profiles=1 --location-autofill=bounds,is_in,nearest --bounds=c:\openmtbmap\maps\bounds.zip --route --country-abbr=gcc --country-name=gcc-states --mapname=65560000 --family-id=6556 --product-id=1 --series-name=openmtbmap_gcc-states_27.07.2014 --family-name=mtbmap_gcc_27.07.2014 --tdbfile --overview-mapname=mapsetc --keep-going --area-name="gcc-states_27.07.2014_openmtbmap.org" -c e:\openmtbmap\maps\template.gcc-states 7*.img 1>NUL
Exception in thread "main" uk.me.parabola.mkgmap.scan.SyntaxException: Error: (stream:10089): Bad character in input, file probably not in utf-8
    at uk.me.parabola.mkgmap.scan.TokenScanner.readChar(TokenScanner.java:239)
    at uk.me.parabola.mkgmap.scan.TokenScanner.readTok(TokenScanner.java:189)
    at uk.me.parabola.mkgmap.scan.TokenScanner.fillTok(TokenScanner.java:154)
    at uk.me.parabola.mkgmap.scan.TokenScanner.ensureTok(TokenScanner.java:150)
    at uk.me.parabola.mkgmap.scan.TokenScanner.isEndOfFile(TokenScanner.java:111)
    at uk.me.parabola.mkgmap.srt.SrtTextReader.read(SrtTextReader.java:145)
    at uk.me.parabola.mkgmap.srt.SrtTextReader.<init>(SrtTextReader.java:105)
    at uk.me.parabola.mkgmap.srt.SrtTextReader.<init>(SrtTextReader.java:97)
    at uk.me.parabola.mkgmap.srt.SrtTextReader.sortForCodepage(SrtTextReader.java:126)
    at uk.me.parabola.mkgmap.main.Main.getSort(Main.java:638)
    at uk.me.parabola.mkgmap.main.Main.processFilename(Main.java:246)
    at uk.me.parabola.mkgmap.CommandArgsReader$Filename.processArg(CommandArgsReader.java:256)
    at uk.me.parabola.mkgmap.CommandArgsReader.readArgs(CommandArgsReader.java:125)
    at uk.me.parabola.mkgmap.main.Main.mainStart(Main.java:134)
    at uk.me.parabola.mkgmap.main.Main.main(Main.java:105)

On 27.07.2014 12:32, Steve Ratcliffe wrote:

...

On 26/07/14 18:43, Felix Hartmann wrote:

...
Okay - I used ANSI. Could there maybe be a check for this in the check styles routine, or in general? I do suppose that must have been the problem.

Although it is not always possible to tell if a file is in the wrong encoding, it should have been in this case. I see that the ì character gets converted to a unicode replacement character (0xfffd)

If you had done: echo 'Shì'

it would have come out something like: Sh� (hope that works in email) and shown the problem.

yes - clearly. (and works in email somehow).

There are a couple of ways to make bad characters an error, rather than getting replaced. The attached patch allows them to be replaced and then throws and error when seen. This has the advantage of giving you file name and line number of the error. It might interfere with something valid, so give it a try.

I don't use notepad++, but these links might be useful:

http://superuser.com/questions/292086/how-can-i-enforce-so-notepad-uses-utf-...

http://stackoverflow.com/questions/5090845/change-the-default-encoding-for-n...

..Steve

-- keep on biking and discovering new trails Felix openmtbmap.org & www.velomap.org
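The point about BOM-less UTF-8 made in the message above (a file with no special characters is byte-for-byte identical in ANSI and UTF-8, so an editor cannot tell them apart) can be checked with a few lines of Python. This is a minimal sketch: the function name describe_encoding is hypothetical, cp1252 stands in for "ANSI", and the canary line is the one quoted in the message.

```python
import codecs


def describe_encoding(data: bytes) -> str:
    """Classify raw file bytes the way an editor might have to guess."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8 with BOM"
    try:
        data.decode("ascii")
        return "pure ASCII (identical in ANSI and UTF-8)"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8 without BOM"
    except UnicodeDecodeError:
        return "not utf-8 (possibly ANSI/cp1252)"


canary = "#this is a UTF-8 check - ÖÄÜè\n"

print(describe_encoding(b"highway=primary [0x02 resolution 20]\n"))
print(describe_encoding(canary.encode("utf-8")))
print(describe_encoding(codecs.BOM_UTF8 + canary.encode("utf-8")))
print(describe_encoding(canary.encode("cp1252")))
```

A canary line only helps once it contains non-ASCII characters: those are what make the UTF-8 and cp1252 byte sequences differ, so a later re-save in the wrong encoding becomes detectable.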

Steve Ratcliffe

28 Jul

10:26 a.m.

On 27/07/14 14:54, Felix Hartmann wrote:

...

2. about the patch: Mmmh - that patch goes a bit too far... - it actually stops at errors on input file (not style) too I think (note the time stamp 30 seconds later): 14:49:25 china cn 6555 this is run101 starting to compile openmtmbap with mkgmap Exception in thread "main" uk.me.parabola.mkgmap.scan.SyntaxException: Error: (stream:10089): Bad character in input, file probably not in utf-8

OK, yes, we have that character in three different places in the files under resources. They can all be removed without harm though. New patch attached. ..Steve

Felix Hartmann

7:40 p.m.

Hi Steve,

Great, that patch is working now. Only the line number confused me at first, as it pointed to a commented-out line. But I think that's even better - that way you will notice even earlier that you should use UTF-8.

So it's ready to be pushed into trunk, I think.

On 28.07.2014 12:26, Steve Ratcliffe wrote:

...

On 27/07/14 14:54, Felix Hartmann wrote:

...
2. about the patch: Mmmh - that patch goes a bit too far... - it actually stops at errors on input file (not style) too I think (note the time stamp 30 seconds later): 14:49:25 china cn 6555 this is run101 starting to compile openmtmbap with mkgmap Exception in thread "main" uk.me.parabola.mkgmap.scan.SyntaxException: Error: (stream:10089): Bad character in input, file probably not in utf-8

OK, yes, we have that character in three different places in the files under resources. They can all be removed without harm though.

New patch attached.

..Steve

-- keep on biking and discovering new trails Felix openmtbmap.org & www.velomap.org

Felix Hartmann

26 Jul

6:39 p.m.

Yes - that was the solution. The problem here seems to be with Notepad++: it will only choose UTF-8 or UTF-8 without BOM if it is needed. Otherwise it will read the file as ANSI (if you then choose UTF-8 manually, it seems to put an additional marker somewhere that Linux editors do not put, I would guess). Therefore I always assumed I was using the correct codepage.

----

So out of my latest series of problems (all of which were on my side/system), I still don't understand this one. I mean, I'm more or less sure the syntax is correct. But mkgmap doesn't seem to like ~ if it is comparing two values to each other instead of one value to one tag?

I want to change it so I can filter names, e.g. here: http://www.openstreetmap.org/node/369495900

In this case I want to delete name:en based on the condition that name is fully present in name:en. So I created the rules below according to the above scheme:

name:en=* & name=* & (name ~ '.*$name:en' ) {delete name:en; echo "Beginning name matched"}
name:en=* & name=* & (name ~ '$name:en.*' ) {delete name:int; echo "end name matched"}
name:en=* & name=* & (name:en ~ '.*$name' ) {delete name; echo "beginning name:int matched"}
name:en=* &name=* & (name:en ~ '$name.*' ) {delete name; echo "end name:int matched"}

On 26.07.2014 19:00, Steve Ratcliffe wrote:

...

I suppose it is possible, but if the style file is not actually in utf-8 that would explain it. If the file is in cp1252 the character ì would look the same, but it would not work.
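The condition in the last question above (drop name:en when name is fully contained in it) can be written out in Python to make the intended logic explicit. This is only a sketch of the intent: it does not show how mkgmap evaluates ~, and whether the style language can interpolate one tag's value into a regex is not settled in this thread. The function name and the tag values are hypothetical.

```python
import re


def name_en_is_redundant(tags: dict) -> bool:
    """True if the value of 'name' occurs in full inside 'name:en'."""
    name = tags.get("name")
    name_en = tags.get("name:en")
    if not name or not name_en:
        return False
    # re.escape matters: a name may contain regex metacharacters such as
    # parentheses, which would otherwise be read as a group.
    pattern = ".*" + re.escape(name) + ".*"
    return re.fullmatch(pattern, name_en) is not None


# Hypothetical tag sets, not taken from the node linked above.
tags_a = {"name": "Al Ain (East)", "name:en": "City of Al Ain (East)"}
tags_b = {"name": "العين", "name:en": "Al Ain"}

for tags in (tags_a, tags_b):
    if name_en_is_redundant(tags):
        tags.pop("name:en")

print(tags_a)
print(tags_b)
```

Any rule that builds a regex from a tag value needs some form of escaping, otherwise names containing brackets or dots silently change the meaning of the pattern.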
