Re: [mkgmap-dev] unicode

28 Feb 2013

On 13-02-27 23:35:49 CET, Robert Joop wrote:
...
What I haven’t been able to find out after over an hour of searching:
Is it publicly known how such “unicode maps” are encoded, or is this a
mystery hidden in encrypted maps?
At least getting Unicode into street names turned out to be easy:
I brutally patched the code to use codepage 65001 and hardcoded the UTF-8
bytes for “äαЯب”, i.e. German, Greek, Cyrillic and Arabic characters,
which would normally be spread over four codepages (1250, 1253, 1251, 1256).
The street names show up with all four characters at the same time.
Yeah!
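
The eight byte values hardcoded in the Utf8Encoder patch further down are simply the UTF-8 encoding of “äαЯب”, two bytes per character. A minimal standalone sketch to check this; the class and method names are made up for illustration and are not part of mkgmap:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8BytesDemo {
    // Returns the UTF-8 bytes of the text as unsigned decimal values.
    static int[] utf8Bytes(String text) {
        byte[] raw = text.getBytes(StandardCharsets.UTF_8);
        int[] out = new int[raw.length];
        for (int i = 0; i < raw.length; i++) {
            out[i] = raw[i] & 0xFF; // Java bytes are signed; mask to 0..255
        }
        return out;
    }

    public static void main(String[] args) {
        // Unicode escapes for ä, α, Я, ب so the source file encoding does not matter
        String sample = "\u00E4\u03B1\u042F\u0628";
        System.out.println(Arrays.toString(utf8Bytes(sample)));
        // prints [195, 164, 206, 177, 208, 175, 216, 168]
    }
}
```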

I can’t program Java, so:
- I don’t see why Utf8Encoder.encodeText() seemingly produces ASCII
  instead of UTF-8.
- my patch doesn’t go beyond demonstrating a minimal example.
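
One thing worth ruling out (a guess, not something confirmed in the mkgmap source): if the text has already been reduced to plain ASCII before it reaches encodeText(), the UTF-8 output is byte-for-byte identical to ASCII, because UTF-8 encodes every character below 128 as a single byte with the same value. A small standalone sketch of such a check; AsciiCheckDemo and isPureAscii are hypothetical names:

```java
import java.nio.charset.StandardCharsets;

public class AsciiCheckDemo {
    // True if every UTF-8 byte of the text is below 128, i.e. plain ASCII.
    static boolean isPureAscii(String text) {
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            if ((b & 0x80) != 0) {
                return false; // high bit set: part of a multi-byte sequence
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isPureAscii("Hauptstrasse"));     // true
        System.out.println(isPureAscii("Hauptstra\u00DFe")); // false: ß needs two bytes
    }
}
```

Calling something like this on the string handed to the encoder would show whether the non-ASCII characters are lost before or during encoding.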

The option was “--code-page=65001”.

I believe these few lines are all it took:

Index: src/uk/me/parabola/imgfmt/app/labelenc/CodeFunctions.java
===================================================================
--- src/uk/me/parabola/imgfmt/app/labelenc/CodeFunctions.java	(revision 2501)
+++ src/uk/me/parabola/imgfmt/app/labelenc/CodeFunctions.java	(working copy)
@@ -97,6 +97,11 @@
 			funcs.setEncodingType(ENCODING_FORMAT10);
 			funcs.setEncoder(new Utf8Encoder());
 			funcs.setDecoder(new Utf8Decoder());
+		} else if ("cp65001".equals(charset)) {
+			funcs.setEncodingType(ENCODING_FORMAT10);
+			funcs.setEncoder(new Utf8Encoder());
+			funcs.setDecoder(new Utf8Decoder());
+			funcs.setCodepage(65001);
 		} else if ("simple8".equals(charset)) {
 			funcs.setEncodingType(ENCODING_FORMAT9);
 			funcs.setEncoder(new Simple8Encoder());
Index: src/uk/me/parabola/imgfmt/app/labelenc/Utf8Encoder.java
===================================================================
--- src/uk/me/parabola/imgfmt/app/labelenc/Utf8Encoder.java	(revision 2501)
+++ src/uk/me/parabola/imgfmt/app/labelenc/Utf8Encoder.java	(working copy)
@@ -43,9 +43,22 @@
 			byte[] res = new byte[buf.length + 1];
 			System.arraycopy(buf, 0, res, 0, buf.length);
 			res[buf.length] = 0;
+			if (buf.length >= 8){
+			res[0] = (byte)195;
+			res[1] = (byte)164;
+			res[2] = (byte)206;
+			res[3] = (byte)177;
+			res[4] = (byte)208;
+			res[5] = (byte)175;
+			res[6] = (byte)216;
+			res[7] = (byte)168;
+			}
+//System.out.println("copied utf-8 bytes "+res[0]+" "+res[1]+" "+res[2]);
 			et = new EncodedText(res, res.length);
+//System.out.println("encoded utf-8 bytes: "+et);
 		} catch (UnsupportedEncodingException e) {
 			// As utf-8 must be supported, this can't happen
+System.out.println(" // As utf-8 must be supported, this can't happen");
 			byte[] buf = uctext.getBytes();
 			et = new EncodedText(buf, buf.length);
 		}
Index: src/uk/me/parabola/imgfmt/app/srt/Sort.java
===================================================================
--- src/uk/me/parabola/imgfmt/app/srt/Sort.java	(revision 2501)
+++ src/uk/me/parabola/imgfmt/app/srt/Sort.java	(working copy)
@@ -253,6 +253,8 @@
 		this.codepage = codepage;
 		if (codepage == 0)
 			charset = Charset.forName("cp1252");
+		else if (codepage == 65001)
+			charset = Charset.forName("UTF-8");
 		else if (codepage == 932)
 			// Java uses "ms932" for code page 932
 			// (Windows-31J, Shift-JIS + MS extensions)
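
The Sort.java change boils down to mapping a numeric codepage to a Java charset. A standalone sketch of that mapping follows. The class and method names are hypothetical, and the fallback of prefixing "cp" to the number is an assumption, since the rest of the original method is not shown in the diff:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CodepageDemo {
    // Hypothetical helper, not mkgmap's actual API.
    static Charset charsetFor(int codepage) {
        if (codepage == 0) {
            return Charset.forName("cp1252");
        } else if (codepage == 65001) {
            // Windows codepage 65001 is UTF-8
            return StandardCharsets.UTF_8;
        } else if (codepage == 932) {
            // Java uses "ms932" for code page 932
            return Charset.forName("ms932");
        } else {
            // Assumption: other codepages resolve via the "cpNNNN" alias
            return Charset.forName("cp" + codepage);
        }
    }

    public static void main(String[] args) {
        System.out.println(charsetFor(65001).name()); // UTF-8
    }
}
```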