Hi
So finally I will merge the mixed index branch. I have added a temporary option that will be required to enable it - it will be off by default as there may still be a problem with the gmapsupp version.
I think it would be best to selectively enable it per country along with lists of names to avoid. This would be best done by people from or familiar with the countries in question.
I'll write another message when it is done.
..Steve
On Thu, Feb 12, 2015 at 01:24:29PM +0000, Steve Ratcliffe wrote:
So finally I will merge the mixed index branch.
I believe that the database terminology for this is 'inverted index' or 'fulltext index'.
I think it would be best to selectively enable it per country along with lists of names to avoid. This would be best done by people from or familiar with the countries in question.
In fulltext search, these are called 'stopwords'.
It might not be necessary to do anything for countries where street names are commonly written as a single word. Example: "Main Street" would be "Hauptstrasse" in German, "Huvudgatan" in Swedish and "Päätie" in Finnish. Only if the first part of the street name is a proper name, such as a person's name, could the second part be written as a separate word, separated by a space or dash.
That said, I guess it would still make sense to introduce some stopwords. Words that I can think of:
Swedish: gata, gatan, gränd, gränden, stig, stigen, (stråk, stråket)
Finnish: tie, katu, polku, kuja, (raitti, taival)
German: Straße, Strasse, Weg, Allee, Chaussee
Estonian: mnt, maantee, tn, tänav, pst, puiestee
In Estonia, it seems to be common to write the tn, mnt or pst as a separate word.
I could be missing some stopwords in Estonian and for German-speaking countries. Also, it could be that the French loan words Allee and Chaussee are sometimes accented.
The Finnish and Swedish words that I have put in parentheses should be very rare, typically used for ways for non-motorized traffic. I don't think that including them would pollute the index much. You might in fact want to search for such a name when you are looking for a nice walking or cycling route (i.e., you expect there to exist some random-famous-person-name-stråket, but you do not know the random name).
Marko
Hi guys,
Stopwords are very important for Brazilian maps, because more than 90% of our street names are prefixed with their kind. Examples: Rua Paris, Avenida Antônio de Castro, Avenida Afonso Pena, etc. Avenida (avenue), Rua (street), etc. are prefixes. These prefixes would be included in the index, increasing its size unnecessarily.
I believe that you don't need to care about the country for which the maps are compiled. Firstly, it would be very difficult to identify, understand and apply the particular rules for every country. Moreover, you would spend too much time creating these rules, and users would lose the flexibility to define their own stopwords.
So, my suggestion is exactly that: allow the users to define their own stopwords. A feature should be developed in mkgmap that allows users to pass the stopwords through a new parameter/file, for example:
--index_stopwords=file.csv
file.csv example:
"rua","avenida", "tie", "katu", "polku", "kuja"
mkgmap must ignore case. That's it.
Regards, Alexandre
_______________________________________________
mkgmap-dev mailing list mkgmap-dev@lists.mkgmap.org.uk http://www.mkgmap.org.uk/mailman/listinfo/mkgmap-dev
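As a rough illustration of Alexandre's proposal, the sketch below reads a comma-separated stopword list and drops matching words from a street name without regard to case. It is written in Python for brevity and is not mkgmap code; the function names `load_stopwords` and `index_words` are made up for this example, and `--index_stopwords` exists only as a suggestion in this thread.

```python
import csv
import io


def load_stopwords(csv_text):
    """Parse comma-separated, optionally quoted words into a case-folded set."""
    words = set()
    # skipinitialspace lets '"rua", "tie"' parse with a space after the comma
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        for word in row:
            word = word.strip()
            if word:
                words.add(word.casefold())
    return words


def index_words(name, stopwords):
    """Return the words of a street name that are not stopwords."""
    return [w for w in name.split() if w.casefold() not in stopwords]


stopwords = load_stopwords('"rua","avenida", "tie", "katu", "polku", "kuja"\n')
print(index_words("Rua Paris", stopwords))
print(index_words("AVENIDA Afonso Pena", stopwords))
```

Note that this naive filter removes a stopword wherever it appears in the name, which is exactly the behaviour questioned later in the thread.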
On Sat, Feb 14, 2015 at 09:57:49AM -0200, Alexandre Loss wrote:
So, my suggestion is exactly that: allow the users to define their own stopwords.
Yes, I agree on that. But, could it be done in the style files based on the administrative boundaries, similar to how the admin levels are processed? This would still allow customization, while providing a sanely working default style. Marko
The stopword processing should be language-specific and not (solely) based on admin boundaries... One man's stopword is another man's significant proper name.
I agree that the languages which prefix the "road type" (like French and Spanish) are most in need of this, but it is a little less desperate for suffixes as in German and Dutch. In the latter case the roads still end up in the right place in the index when searching. There is another aspect as well: multi-part street names, often using titles and personal names of local heroes.
On Sat, Feb 14, 2015 at 01:23:18PM +0100, Colin Smale wrote:
The stopword processing should be language-specific and not (solely) based on admin boundaries... One man's stopword is another man's significant proper name.
Sure, but I would guess that there is an admin boundary around a bilingual or multilingual area where you would not want to use the normal stopwords for any of the languages. Generally, I would agree with you that country!=language, and I dislike misusing a national flag for designating language. But, we are only talking about official street name signs here. Can you think of any actual examples where one language's stopword is another language's significant proper name? Such that both languages can be used in the street name signs within the same admin boundary?
In the latter case the roads still end up in the right place in the index when searching. There is another aspect as well: multi-part street names, often using titles and personal names of local heroes.
Right. In the languages that I am aware of (other than Spanish and French and maybe Italian), the stopwords tend to be at the end, not at the start of the name. But still, you would not want to get a lot of "[stopword], local hero name" entries somewhere in the index, for each name="local hero name [stopword]" entry. Marko
Hi all,
In French, off the top of my head, I can think of:
Rue, Ruelle, Avenue, Boulevard, Quai, Chaussée, Route, Cour, Cours, Cité, Chemin, Place, Esplanade, Passage, Allée, Carrefour, Sentier, Square, Villa.
This list is without a doubt not complete but should cover more than 95% of named addresses in France.
They should only be omitted from the index if they come first and are followed by something else.
Cheers, Paco
Hi all, wouldn't it be easier to let mkgmap report those words which appear in more than n (e.g. 20) roads and use that list to produce a user-defined list of stop-words? Gerd
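A minimal sketch of the report Gerd describes might look like the following. It is Python rather than mkgmap's Java, the name `frequent_words` and the sample road names are invented for illustration, and the threshold corresponds to the "n" in his message.

```python
from collections import Counter


def frequent_words(road_names, min_roads=20):
    """Return (word, count) pairs for words used in more than min_roads names."""
    counts = Counter()
    for name in road_names:
        # count each word once per road, ignoring case
        counts.update({w.casefold() for w in name.split()})
    return [(w, n) for w, n in counts.most_common() if n > min_roads]


names = ["Rua Paris", "Rua Lisboa", "Avenida Afonso Pena", "Rua Afonso Pena"]
print(frequent_words(names, min_roads=1))
```

The output is only a candidate list: frequent proper names (here "afonso" and "pena") show up next to real street-type words, so a person would still have to pick the stopwords from it.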
Hi,
in my opinion stopwords should depend on the country code and should be defined in the style or in the definition of local parameters, in the same way as the zip code before the street name.
The style would be preferred, since definitions can be included in the default style, to which many people contribute. We could probably get definitions for many countries quite fast. I think it could be something like:
mkgmap:country=POL {set mkgmap:stopwords='ulica;ul.'}
When looking for a solution, please take into consideration maps which include many countries at once.
-- Best regards, Andrzej
What about multi-lingual countries such as Belgium or Switzerland? I think the primary factor should be language, not country. Belgian French, French French and Swiss French are probably similar enough for these purposes that they can share a solution. But Belgian French and Belgian Dutch are completely different animals....
Of course you can try to map countries (or at least a generic area like "territories") to default languages, but the user may still want to override the default. How about this (sorry the abbreviations are wrong but it is only to illustrate my point):
mkgmap:country=POL {set mkgmap:lang=polish;}
mkgmap:region=Vlaanderen {set mkgmap:lang=dutch;}
mkgmap:country=NED {set mkgmap:lang=dutch;}
mkgmap:region=Wallonie {set mkgmap:lang=french;}
mkgmap:country=FRA {set mkgmap:lang=french;}
mkgmap:lang=french {set mkgmap:stopwords="rue,place,..."}
mkgmap:lang=dutch {set mkgmap:stopwords="de,het,..."}
mkgmap:lang=polish {set mkgmap:stopwords="ulica;ul."}
If the stopwords were also defined to be regular expressions, then it could also handle prefixes and suffixes as well as whole words.
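The two-step lookup Colin outlines (area to default language, then language to stopwords) can be sketched as plain table lookups. This is a Python illustration only, not mkgmap style syntax; the tables, the ISO-style country codes and the function name `stopwords_for` are assumptions made for the example, and the word lists are the deliberately incomplete ones from his message.

```python
# Hypothetical tables; codes and word lists are illustrative only.
REGION_LANG = {"Vlaanderen": "dutch", "Wallonie": "french"}
COUNTRY_LANG = {"POL": "polish", "NLD": "dutch", "FRA": "french"}
STOPWORDS = {
    "french": {"rue", "place"},
    "dutch": {"de", "het"},
    "polish": {"ulica", "ul."},
}


def stopwords_for(country=None, region=None, lang=None):
    """Pick a stopword set: explicit language, then region, then country."""
    if lang is None:
        # a region default takes precedence over a country default
        lang = REGION_LANG.get(region) or COUNTRY_LANG.get(country)
    return STOPWORDS.get(lang, set())


print(stopwords_for(country="BEL", region="Wallonie"))
print(stopwords_for(country="POL"))
```

An explicit `lang` argument stands in for the user override he mentions; a country without a default language, such as Belgium here, simply yields no stopwords unless a region matches.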
Hi,
What about multi-lingual countries such as Belgium or Switzerland?
Good remark. I think we can set "mkgmap:stopwords" directly too, without defining an intermediate variable for the language. These definitions would be evaluated for each object, so we could change the stopwords when using, for example, the "name_fr" tag instead of "name".
-- Best regards, Andrzej
On Sat, Feb 14, 2015 at 03:57:21PM +0100, Colin Smale wrote:
What about multi-lingual countries such as Belgium or Switzerland?
Or multi-lingual cities, such as Montréal in Canada? But, is this really an issue? Street signs may be in two or more languages, saying "Foo Street" and "Rue Foo" for example. Can anyone name a multi-lingual area where a stopword in one language would be a non-stopword in the other language? For example, could there be a highway=* with name="Rue Street" in a French/English area? I would not think so. For what it is worth, there are a lot of bilingual street signs in Finland, using Finnish (name:fi), Swedish (name:sv) or in the north, Sámi (name:se). It depends on the share of the minority population whether multiple languages are used. The majority language appears first in the signs. So, usually it is Finnish first, then Swedish, or Swedish first, then Finnish. Sometimes the signs could be Finnish or Swedish only.
How about this (sorry the abbreviations are wrong but it is only to illustrate my point):
mkgmap:country=POL {set mkgmap:lang=polish;}
AFAIU, your suggestion wrongly assumes that only one language will be used in a given region. And I think it should be based on administrative regions, not necessarily countries. How would you represent an area that has multiple official languages that can appear on street signs? I think that the OSM convention would be something like this:
{ set mkgmap:lang:fi=yes; mkgmap:lang:sv=yes; }
or the (more tricky for our style rules)
{ set mkgmap:lang='fi;sv' }
If the stopwords were also defined to be regular expressions, then it could also handle prefixes and suffixes as well as whole words.
I agree that defining stopwords as regular expressions would provide some necessary flexibility. Like someone said, we do not want to omit Straße (or other stopwords) at the start of a street name in languages that usually put the stopword at the end of the name. But, in French and Spanish the stopword is always at the start of the name. An anchored regexp (\<Straße$ or ^Calle\>) would nicely express this. Maybe the regexp could also facilitate a rewriting system for abbreviating the index entries, such as replacing "street" with "st" in English, "Straße" with "Str" in German, "puiestee" with "pst" in Estonian, "katu" with "k" in Finnish and so on. Marko
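To make the anchored-regexp idea concrete, here is a small Python sketch. Python's `re` module has no `\<` and `\>` word anchors, so `\b` is used instead; the rule table, the language keys and the function name `apply_rules` are hypothetical, and an empty replacement stands for "omit" while a non-empty one stands for the abbreviation rewriting suggested above.

```python
import re

# (pattern, replacement) pairs; "" removes the match. Illustrative only.
RULES = {
    "de": [(re.compile(r"\s*\bStraße$", re.IGNORECASE), "")],
    "es": [(re.compile(r"^Calle\b\s*", re.IGNORECASE), "")],
    "et": [(re.compile(r"\bpuiestee$", re.IGNORECASE), "pst")],
}


def apply_rules(name, lang):
    """Apply each anchored rule for lang; keep the name if nothing would remain."""
    result = name
    for pattern, replacement in RULES.get(lang, []):
        result = pattern.sub(replacement, result)
    return result.strip() or name


print(apply_rules("Berliner Straße", "de"))
print(apply_rules("Straße des 17. Juni", "de"))
print(apply_rules("Calle Mayor", "es"))
print(apply_rules("Pärnu puiestee", "et"))
```

Because of the anchors, "Straße" at the start of a German name is left alone, and the word boundary keeps single-word names such as "Hauptstraße" intact.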
On 2015-02-14 20:45, Marko Mäkelä wrote:
On Sat, Feb 14, 2015 at 03:57:21PM +0100, Colin Smale wrote:
What about multi-lingual countries such as Belgium or Switzerland?
Or multi-lingual cities, such as Montréal in Canada?
But, is this really an issue? Street signs may be in two or more languages, saying "Foo Street" and "Rue Foo" for example. Can anyone name a multi-lingual area where a stopword in one language would be a non-stopword in the other language?
"de" is "the" in Dutch, "of" in French - both (candidate) stopwords in their own way, but you would want different rules for keeping or omitting "de" in street names. It also means "South" in Welsh, which you probably would not want to omit in most cases.....
For example, could there be a highway=* with name="Rue Street" in a French/English area? I would not think so.
For what it is worth, there are a lot of bilingual street signs in Finland, using Finnish (name:fi), Swedish (name:sv) or in the north, Sámi (name:se). It depends on the share of the minority population whether multiple languages are used. The majority language appears first in the signs. So, usually it is Finnish first, then Swedish, or Swedish first, then Finnish. Sometimes the signs could be Finnish or Swedish only.
How about this (sorry the abbreviations are wrong but it is only to illustrate my point): mkgmap:country=POL {set mkgmap:lang=polish;}
AFAIU, your suggestion wrongly assumes that only one language will be used in a given region. And I think it should be based on administrative regions, not necessarily countries.
I intended to suggest that each area would have a single "default" language. Main reason is to select the correct stopword treatment in the absence of explicit name:xx tags. In most cases roads are just tagged with "name=*" - so this mechanism would define the mapping of "name" to a language. Then you only need a single stopword treatment for the language, which can be shared by all territories which use that language.
How would you represent an area that has multiple official languages that can appear on street signs? I think that the OSM convention would be something like this:
{ set mkgmap:lang:fi=yes; mkgmap:lang:sv=yes; } or the (more tricky for our style rules) { set mkgmap:lang='fi;sv' }
Well, I assume that the maps produced by mkgmap are targeted to a language (or ordered list of languages) chosen by the mkgmap user. I can't imagine someone wanting all the languages in the map at the same time. Can the Garmin format even handle that?
Hi,
I can't imagine someone wanting all the languages in the map at the same time. Can the Garmin format even handle that?
City Navigator maps support multiple languages. Street names change when you change the language setting on the GPS. For example, if you use German, the street name is "Waldheimer Weg", but in Italian you see "Via Villa del Bosco". I found this example in the South Tyrol–Trentino region.
This is not possible with mkgmap, but you can add multiple labels to a street, and all of them should be included in the search index. I think in most cases we could define stopwords for 2-3 languages. For the case above, we can set both "via" and "weg" as stopwords in Italy.
-- Best regards, Andrzej
On Sun, Feb 15, 2015 at 01:02:37AM +0100, Colin Smale wrote:
But, is this really an issue? Street signs may be in two or more languages, saying "Foo Street" and "Rue Foo" for example. Can anyone name a multi-lingual area where a stopword in one language would be a non-stopword in the other language?
"de" is "the" in Dutch, "of" in French - both (candidate) stopwords in their own way, but you would want different rules for keeping or omitting "de" in street names.
It also means "South" in Welsh, which you probably would not want to omit in most cases.....
Is there an administrative area where both Welsh and Dutch or French are used in street name signs? I would not expect so.
AFAIU, your suggestion wrongly assumes that only one language will be used in a given region. And I think it should be based on administrative regions, not necessarily countries.
I intended to suggest that each area would have a single "default" language. Main reason is to select the correct stopword treatment in the absence of explicit name:xx tags. In most cases roads are just tagged with "name=*" - so this mechanism would define the mapping of "name" to a language. Then you only need a single stopword treatment for the language, which can be shared by all territories which use that language.
Right, this suggestion should be a reasonable default, if it is selectable by any administrative level (such as country, state, province or municipality).
How would you represent an area that has multiple official languages that can appear on street signs? I think that the OSM convention would be something like this:
{ set mkgmap:lang:fi=yes; mkgmap:lang:sv=yes; } or the (more tricky for our style rules) { set mkgmap:lang='fi;sv' }
Well, I assume that the maps produced by mkgmap are targeted to a language (or ordered list of languages) chosen by the mkgmap user.
I do not think that it is always a reasonable assumption. Most areas in Finland use Finnish as the primary official language. For some places or streets in a Swedish-speaking area there could exist Finnish names, but maybe the Finnish-speaking minority is so small that the signs are only in Swedish. It could be more useful to have the street names in the same language on the map as you have on the signs. Only if the street signs displayed each language would it make more sense to let the user override the primary language (name=* labels) at map translation time. Similarly, when travelling in former Finnish-language areas that were made part of the Soviet Union, it could be more useful to have the current Russian names on the map, because the street signs would not be in Finnish any more.
I can't imagine someone wanting all the languages in the map at the same time. Can the Garmin format even handle that?
AFAIU, only the Garmin NT format (which mkgmap does not support) allows you to define labels in multiple languages. Marko
Hi Gerd,
I would prefer that kind of solution. Maybe, if this list is generated after mkgmap:country has been processed, the country could also be used, giving a list for each country. Of course it would be nice to have the list written to a file and also to be able to give mkgmap a specific list. The format could be something like:
[DE]
Straße
Weg
[US]
Drive
Street
Bye Henning
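Reading a file in the shape Henning proposes could be sketched as below. This assumes one bracketed country code per section and one word per line, which is a guess at the intended layout; the function name `parse_stopword_file` is invented, and nothing here reflects an actual mkgmap file format.

```python
def parse_stopword_file(text):
    """Parse '[CC]' section headers followed by one word per line."""
    lists = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            # a new country section starts here
            current = line[1:-1].upper()
            lists.setdefault(current, [])
        elif current is not None:
            lists[current].append(line)
    return lists


sample = "[DE]\nStraße\nWeg\n[US]\nDrive\nStreet\n"
print(parse_stopword_file(sample))
```

The same structure could serve in both directions: written by a frequency report, then edited by hand and passed back in.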
Hi
There are some interesting comments here.
I did have code to count the number of times certain words appeared in a name in an attempt to automatically create a stop word list for a map. It turned out that it wasn't all that useful, for England at least.
From the numbers you get stop words such as 'The', 'Avenue' and 'Road' as you would expect. However many streets have names such as 'The Avenue', 'Avenue Road' and so on that consist entirely of likely stop words. And these are not theoretical names that occur infrequently, these are names of streets that I know.
I think we really need to be able to identify which parts of the name are useful to index, rather than which parts are not.
So for England I think that the only rule required is to index from the beginning of the name, as now.
For places where streets are named after people and there is no word for 'street' included, and the street is generally referred to by the second name, then probably adding entries for all parts of the name will work.
For places where there is a word for street at the beginning then we have to step over that word and any following prepositions etc. So for France not just "Rue", but any following "de", "des", "d'" etc.
The required action does of course depend on language rather than country, but we don't in general have the language, so we will have to start out using the country (or perhaps region) and see how that goes. I suspect it will work quite well, but if not we can think of something else when the problems are better known.
I guess we will start out having configurable rule types and word lists, but we need to gather sensible defaults once a working system is developed for each country.
..Steve
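The "step over the street word and any following prepositions" rule for French can be sketched as follows. This is a Python illustration under stated assumptions, not the code in the mixed index branch: the word lists are short samples, "du", "la", "le", "les" and "l'" are additions beyond the words named in the message, and `index_start` is a made-up name.

```python
# Hypothetical French word lists, lower case; not a complete set.
STREET_WORDS = {"rue", "avenue", "boulevard", "place", "chemin"}
SKIP_WORDS = {"de", "des", "du", "la", "le", "les"}
ELISIONS = ("d'", "l'")


def index_start(name):
    """Return the part of a name starting at the first useful word."""
    words = name.split()
    if len(words) < 2 or words[0].lower() not in STREET_WORDS:
        return name                      # no leading street word: index as is
    i = 1
    while i < len(words) - 1 and words[i].lower() in SKIP_WORDS:
        i += 1                           # step over prepositions and articles
    rest = " ".join(words[i:])
    for prefix in ELISIONS:              # "d'Alsace" -> "Alsace"
        if rest.lower().startswith(prefix) and len(rest) > len(prefix):
            rest = rest[len(prefix):]
            break
    return rest


for street in ["Rue de la Paix", "Rue d'Alsace", "Avenue Foch", "The Avenue"]:
    print(street, "->", index_start(street))
```

A name without a leading street word is returned unchanged, which matches the "index from the beginning of the name" rule described for England.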
Hi Steve,
I fear I don't understand what problem you see with roads like 'The Avenue'. My understanding is that we put the full name into the index, so the road can be found. On the other hand, nobody would expect to find this road by typing just avenue, right?
Gerd
On 16.02.2015 at 14:05, Gerd Petermann wrote:
Hi Steve,
I fear I don't understand what problem you see with roads like 'The Avenue' My understanding is that we put the full name into the index, so the road can be found. On the other hand, nobody would expect to find this road by typing just avenue, right?
I don't think so. Think about The Avenue and Greenwich Avenue. The first one I would expect to find by Avenue, the second one not. But I think there could be hard stop-words and soft ones. So first the hard ones like "The" or "Straße" would be removed. Soft stop-words would only be removed if there are other parts in the name. Another possibility could be a white list for such combinations.
Henning
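Henning's hard/soft distinction could be sketched like this. It is a Python illustration with invented names (`HARD`, `SOFT`, `index_key`) and tiny sample word lists, not a proposal for the actual implementation.

```python
HARD = {"the", "straße"}              # always dropped (hypothetical list)
SOFT = {"avenue", "road", "street"}   # dropped only if something else remains


def index_key(name):
    """Drop hard stopwords, then soft ones unless nothing else would be left."""
    words = [w for w in name.split() if w.lower() not in HARD]
    useful = [w for w in words if w.lower() not in SOFT]
    # keep soft words when they are all that is left, e.g. "The Avenue"
    return " ".join(useful or words) or name


for street in ["The Avenue", "Greenwich Avenue", "Avenue Road"]:
    print(street, "->", index_key(street))
```

With these lists, "The Avenue" is indexed under "Avenue" and "Greenwich Avenue" under "Greenwich", which is the behaviour described in the message; "Avenue Road" keeps both words because neither would survive otherwise.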
Hi Gerd
I wasn't trying to say it was a problem, but that having a stop word list for England is not useful. Just adding the full name is probably enough; certainly no one has complained before.
For France, say, the full name will not be added (as a searchable phrase), just the part from the first useful word. This will require a word list to know which words to skip.
There may be other rules needed. So we need a way of expressing which rule(s) to apply.
..Steve
participants (9)
- Alexandre Loss
- Andrzej Popowski
- Colin Smale
- Gerd Petermann
- Henning Scholland
- Marko Mäkelä
- Paco Tyson
- Steve Ratcliffe
- thesurveyor@wolke7.net