
Hi Gerd
Here is an updated patch that closes the file, although I find many files in mkgmap that don't have an explicit close(); I presume .finalize() will close them eventually.
I'll do another patch for other text file handling, using StandardCharsets where possible, fixing the TokenScanner message for bad characters if the file is not utf-8 and, if reasonable, allowing a BOM even if the file is opened as utf-8 anyway.
Ticker
On Tue, 2020-01-14 at 08:21 +0000, Gerd Petermann wrote:
Hi Ticker,
thanks for the patch.
Please review TypCompiler.CharsetProbe. BufferedReader br is not closed. Is that intended?
I see that we have a mix of "utf-8" and "UTF-8" in the mkgmap sources. I think it would be good to use StandardCharsets.UTF_8 where possible and unify the rest.
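Both points can be sketched together. This is only a minimal illustration, assuming a hypothetical helper (ProbeSketch.firstLine is not part of mkgmap): try-with-resources closes the reader on every path, and StandardCharsets.UTF_8 replaces the "utf-8"/"UTF-8" string names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration; not the actual TypCompiler.CharsetProbe.
class ProbeSketch {
    /** Returns the first line of the file, or null if the file is empty. */
    static String firstLine(Path file) throws IOException {
        // try-with-resources closes the reader even if readLine() throws.
        try (BufferedReader br = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            return br.readLine();
        }
    }
}
```

Using the constant also avoids the lookup by name, so a misspelt charset name cannot fail at run time.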
Gerd
________________________________________
From: mkgmap-dev <mkgmap-dev-bounces@lists.mkgmap.org.uk> on behalf of Ticker Berkin <rwb-mkgmap@jagit.co.uk>
Sent: Monday, 13 January 2020 11:34
To: Development list for mkgmap
Subject: Re: [mkgmap-dev] TYP files and character encoding
Hi Gerd
I've updated this patch with changes to TypCompiler CharsetProbe:
1/ looks for a unicode BOM in various encodings near the start of the file.
2/ looks for a line containing "-*- coding: charset -*-" near the start of the file.
3/ retains the check for "CodePage=" coding for compatibility.
4/ in the absence of the above, sets the reading charset to utf-8 if the file is valid utf-8, otherwise to Cp1252.
5/ fixes the bad character message from the scanner to say what the charset really is rather than saying "utf-8" regardless.
6/ removes the logic that checks if String... lines, read in the charset it is currently trying, can be encoded in the presumed output CodePage.
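The order of checks in 1/ to 4/ can be sketched roughly as follows. This is not the patch itself: the class and method names are hypothetical, only the UTF-8 and UTF-16 BOMs are handled, the BOM is only looked for at the very start, and the 1024-byte window for "near the start" is an assumption.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: names are hypothetical and the real CharsetProbe
// may order or implement these checks differently.
class CharsetProbeSketch {
    private static final Pattern CODING =
            Pattern.compile("-\\*-\\s*coding:\\s*([A-Za-z0-9][\\w.-]*)\\s*-\\*-");
    private static final Pattern CODE_PAGE =
            Pattern.compile("(?i)CodePage\\s*=\\s*(\\d+)");

    static Charset probe(byte[] data) {
        // 1/ Unicode BOM (only UTF-8 and UTF-16 are shown here).
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(data, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        if (startsWith(data, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;

        // Decode the head as ISO-8859-1 so that every byte maps to a char
        // and the ASCII markers can be found without knowing the charset.
        int n = Math.min(data.length, 1024); // assumed size of "near the start"
        String head = new String(data, 0, n, StandardCharsets.ISO_8859_1);

        // 2/ "-*- coding: charset -*-" line.
        Matcher m = CODING.matcher(head);
        if (m.find() && Charset.isSupported(m.group(1))) {
            return Charset.forName(m.group(1));
        }

        // 3/ "CodePage=" kept for compatibility.
        m = CODE_PAGE.matcher(head);
        if (m.find()) {
            String cp = m.group(1);
            if (cp.equals("65001")) return StandardCharsets.UTF_8;
            if (Charset.isSupported("cp" + cp)) return Charset.forName("cp" + cp);
        }

        // 4/ utf-8 if the bytes are valid utf-8, otherwise Cp1252.
        return isValidUtf8(data) ? StandardCharsets.UTF_8 : Charset.forName("windows-1252");
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    private static boolean isValidUtf8(byte[] data) {
        try {
            // A strict decoder throws on the first malformed byte sequence.
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

Pure ASCII files are valid utf-8, so they fall through to utf-8 in step 4/ and decode the same either way.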
The final result of this patch should be that:
a/ No existing usage is broken.
b/ Two methods of indicating the charset/encoding of the file that are commonly used by text editors can be used and are taken notice of. Previously, just the UTF-8 BOM was detected.
c/ Typ files can, and should from now on, be written in utf-8.
d/ Labels for languages not supported in the --code-page of the output img just generate a warning in mkgmap.log.x
Ticker
On Sat, 2019-12-21 at 16:11 +0000, Ticker Berkin wrote:
Hi Gerd
Attached is a patch that:
Doesn't use the 'CodePage=' command in the typ-file to determine output character encoding of the typ-file, rather it uses the main map encoding from the --code-page argument.
log.warn's any typ labels that can't be encoded in the --code-page, rather than just giving up with message like:
TYP file cannot be written in code page 1252
The message:
WARNING: SortCode in TYP txt file different from command line setting
that was written direct to System.out is changed to a log.warn, and it shouldn't happen anyway now.
For the moment, the 'CodePage=' command in the typ-file is, under some circumstances, used to determine the encoding of the typ-file itself and I've left this alone for compatibility with existing usage. Sometime in January I'll provide a better method for this.
Ticker
On Wed, 2019-12-18 at 19:54 +0000, Ticker Berkin wrote:
Hi Gerd
I think it is best to continue with the ideas for typ-files that:
1/ they can be in any character set and we just need a better way of working out the correct one - see my posting earlier today.
2/ it can include as many languages as anyone can be bothered to add, and so has to be in a character set that allows the languages to be added, implying unicode for a common one (more particularly, UTF-8)
3/ the codepage= statement should be redundant and ignored for controlling the output character set, which should be taken from the map, but its use for determining the input coding might need to be kept for a while for compatibility.
4/ the messages my hack generates should be turned into 1 warning or information message per language or maybe suppressed altogether. If someone is generating a map with a character set that doesn't support a particular language, they really won't care that the data for other languages, whose representation is incompatible with their own, won't be there.
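Point 4/ could look something like the sketch below, which counts the labels that the target charset cannot encode, grouped by language code, so that a caller can emit one message per language. The class, the method and the input shape (a map from language code to label texts) are all hypothetical and not taken from mkgmap.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: not mkgmap code.
class LabelWarningSketch {
    /** Returns language code -> number of labels that cannot be encoded in target. */
    static Map<Integer, Integer> countUnencodable(Charset target,
                                                  Map<Integer, List<String>> labelsByLang) {
        CharsetEncoder encoder = target.newEncoder();
        Map<Integer, Integer> skipped = new TreeMap<>();
        for (Map.Entry<Integer, List<String>> e : labelsByLang.entrySet()) {
            for (String text : e.getValue()) {
                // canEncode is false if any character has no mapping in target.
                if (!encoder.canEncode(text)) {
                    skipped.merge(e.getKey(), 1, Integer::sum);
                }
            }
        }
        return skipped;
    }
}
```

A caller would then log once per map entry, for example "2 labels for language 0x15 cannot be represented in windows-1252", instead of once per label.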
Ticker
On Wed, 2019-12-18 at 19:08 +0000, Gerd Petermann wrote:
Hi Ticker,
I think I understand now why we didn't have a default typ file ;) If I got that right I should revert the changes in r4395 and mkgmap should not allow or warn loudly when a typ file with a different codepage is merged? Or should we force the usage of unicode codepage? Or is it possible to compile mapnik.txt with cp 1252 (or any other) in a way that only those lines which contain non-matching characters are ignored?
Gerd
________________________________________
From: mkgmap-dev <mkgmap-dev-bounces@lists.mkgmap.org.uk> on behalf of Ticker Berkin <rwb-mkgmap@jagit.co.uk>
Sent: Wednesday, 18 December 2019 19:46
To: mkgmap development
Subject: [mkgmap-dev] TYP files and character encoding
Hi
A couple of problems with typ-files and unicode.
With 'Codepage=65001' the final contents of the labels in mapnik.typ that is included with the composite map is unicode, but if the map is codepage 1252, the unicode characters with the top bit set are simply displayed as if in 1252.
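The effect can be reproduced off the device. The sketch below assumes the display simply interprets each utf-8 byte as a windows-1252 character; the class and method names are made up for the illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demonstration; not mkgmap code.
class MojibakeSketch {
    /** Decodes the utf-8 bytes of text as if they were windows-1252. */
    static String utf8ShownAs1252(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }
}
```

For example, "é" (U+00E9) is the two bytes C3 A9 in utf-8, which read back as "Ã©" in windows-1252, while plain ASCII text comes through unchanged.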
Removing the codepage statement from mapnik.txt and making fixes elsewhere to ensure that the file is read correctly as utf-8 and then generating a map with --code-page=1252, it gives the error:
SEVE: uk.me.parabola.imgfmt.MapFailedException ../svn/trunk/resources/typ-files/mapnik.txt: (thrown in TypCompiler.makeMap()) TYP file cannot be written in code page 1252
Changing the exception handling in imgfmt/app/typ/TypElement.java, so that makeLabelBlock() reads as
...
    CharBuffer cb = CharBuffer.wrap(tl.getText());
    try {
        ByteBuffer buffer = encoder.encode(cb);
        out.put((byte) tl.getLang());
        out.put(buffer);
        out.put((byte) 0);
    } catch (CharacterCodingException ignore) {
        // ignore.printStackTrace();
        String name = encoder.charset().name();
        System.out.println("Cannot represent String=" + tl.getLang() + "," + tl.getText() + " in CodePage=" + name);
        // throw new TypLabelException(name);
    }
...
It gives output like:
Cannot represent String=21,Gara|e in CodePage=windows-1252
Cannot represent String=21,Obszar przemysBowy in CodePage=windows-1252
Cannot represent String=21,ZieleD in CodePage=windows-1252
Cannot represent String=21,Zaro[la in CodePage=windows-1252
Cannot represent String=21,MokradBa in CodePage=windows-1252
Cannot represent String=21,Droga wojew\363dzka (B^Ecznik) in CodePage=windows-1252
Cannot represent String=21,Droga szybkiego ruchu (B^Ecznik) in CodePage=windows-1252
Cannot represent String=21,Droga szybkiego ruchu (B^Ecznik) in CodePage=windows-1252
Cannot represent String=21,Zcie|ka rowerowa in CodePage=windows-1252
Cannot represent String=21,Wybrze|e in CodePage=windows-1252
Cannot represent String=21,Zcie|ka in CodePage=windows-1252
Cannot represent String=21,StrumieD in CodePage=windows-1252
Cannot represent String=21,Granica paDstwa in CodePage=windows-1252
Cannot represent String=21,Rzeka, KanaB in CodePage=windows-1252
Cannot represent String=21,StrumieD in CodePage=windows-1252
Cannot represent String=21,Ruroci^Eg in CodePage=windows-1252
Cannot represent String=21,Kabel wysokiego napi^Ycia in CodePage=windows-1252
Cannot represent String=21,Tor wy[cigowy in CodePage=windows-1252
Cannot represent String=21,Droga szybkiego ruchu (B^Ecznik) in CodePage=windows-1252
Cannot represent String=21,Droga krajowa (B^Ecznik) in CodePage=windows-1252
Cannot represent String=21,Droga wojew\363dzka (B^Ecznik) in CodePage=windows-1252
Cannot represent String=21,Wie[ (>5 tys.) in CodePage=windows-1252
Cannot represent String=21,Wie[ (>5 tys.) in CodePage=windows-1252
Cannot represent String=21,Restauracja (AmerykaDska) in CodePage=windows-1252
Cannot represent String=21,Restauracja (ChiDska) in CodePage=windows-1252
Cannot represent String=21,Restauracja (Mi^Ydzynarodowa) in CodePage=windows-1252
Cannot represent String=21,Restauracja (WBoska) in CodePage=windows-1252
Cannot represent String=21,Restauracja (MeksykaDska) in CodePage=windows-1252
Cannot represent String=21,Restauracja (P^Eczki) in CodePage=windows-1252
Cannot represent String=21,Restauracja (WegetariaDska) in CodePage=windows-1252
Cannot represent String=21,Kr^Ygle in CodePage=windows-1252
Cannot represent String=21,Sklep odzie|owy in CodePage=windows-1252
Cannot represent String=21,Wypo|yczalnia samochod\363w in CodePage=windows-1252
Cannot represent String=21,Gara| in CodePage=windows-1252
Cannot represent String=21,Sprzeda| samochod\363w in CodePage=windows-1252
Cannot represent String=21,Sklep |eglarski in CodePage=windows-1252
Cannot represent String=21,S^Ed in CodePage=windows-1252
Cannot represent String=21,O[rodek kultury in CodePage=windows-1252
Cannot represent String=21,Wi^Yzienie in CodePage=windows-1252
Cannot represent String=21,Stra| po|arna in CodePage=windows-1252
Cannot represent String=21,SBupek in CodePage=windows-1252
Cannot represent String=21,PrzystaD in CodePage=windows-1252
Cannot represent String=21,L^Edowisko helikopterowe in CodePage=windows-1252
Cannot represent String=21,Wie|a in CodePage=windows-1252
Cannot represent String=21,yr\363dBo in CodePage=windows-1252
Cannot represent String=21,Pla|a in CodePage=windows-1252
Cannot represent String=21,Przyl^Edek in CodePage=windows-1252
Cannot represent String=21,SkaBa in CodePage=windows-1252
Which makes sense if codepage 1252 doesn't handle Polish (hex 0x15, decimal 21).
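This is easy to confirm with the JDK's own encoder. A minimal check, with a made-up helper name:

```java
import java.nio.charset.Charset;

// Hypothetical helper; not mkgmap code.
class CanEncodeSketch {
    /** True if every character of text has a mapping in the named charset. */
    static boolean fitsIn(String charsetName, String text) {
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }
}
```

fitsIn("windows-1252", "\u00E9") is true for the French e-acute, while fitsIn("windows-1252", "\u0142") is false for the Polish l with stroke, and both fit in "UTF-8".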
NB the non-ASCII characters in the above are messed up by my cutting and pasting.
Checking the French, on my Garmin device, the type descriptions now display accents correctly.
Ticker
_______________________________________________
mkgmap-dev mailing list
mkgmap-dev@lists.mkgmap.org.uk
http://www.mkgmap.org.uk/mailman/listinfo/mkgmap-dev