
Hi,

here is the next splitter bug report:

Splitter r73
europe.osm from today

java -Xmx3800m -verbose:gc -jar ../splitter.jar --max-areas=255 --max-nodes=600000 ../../europe.osm

[GC 98554K->51194K(99008K), 0.0009550 secs]
12,500,000 nodes processed...
[GC 96378K->51194K(96896K), 0.0009430 secs]
[GC 94330K->51194K(95040K), 0.0009170 secs]
[GC 92410K->51194K(93120K), 0.0009060 secs]
[GC 90554K->51194K(91392K), 0.0008910 secs]
[GC 88826K->51194K(89728K), 0.0008880 secs]
[GC 87162K->51194K(88128K), 0.0008940 secs]
[GC 85562K->51194K(86656K), 0.0009170 secs]
[GC 84090K->51194K(85248K), 0.0009010 secs]
[GC 82682K->51194K(83904K), 0.0008870 secs]
Exception in thread "main" java.lang.NullPointerException
        at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:991)
        at java.lang.Double.parseDouble(Double.java:510)
        at uk.me.parabola.splitter.DivisionParser.startElement(DivisionParser.java:64)
        at uk.me.parabola.splitter.AbstractXppParser.parse(AbstractXppParser.java:38)
        at uk.me.parabola.splitter.Main.calculateAreas(Main.java:183)
        at uk.me.parabola.splitter.Main.split(Main.java:108)
        at uk.me.parabola.splitter.Main.main(Main.java:87)

Paul

--
Don't take life too seriously; you will never get out of it alive.
    -- Elbert Hubbard

Hi Paul,

This is because the .osm file you are splitting has a node with no 'lat' attribute. As far as I'm aware this shouldn't happen; something is probably wrong with your .osm file? I downloaded the europe.osm file a few hours after you posted your message and it processes without a problem. My europe.osm file is 28,626,280,448 bytes in size.

Anyway I've put in a check for this since in this case the change is simple and it doesn't have much effect on performance. The splitter will now output details of the problem, ignore the node and carry on with the split.

Generally speaking though it's not such a good idea to put too much validation of the XML into the splitter because it will just complicate the code and slow things down. I guess if we hit any further problems like this we'll have to decide what's best on a case-by-case basis.

Chris
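For anyone curious what such a check looks like, here is a minimal sketch of the idea. It is illustrative only; the method name and the way attributes are handed in are assumptions, not the splitter's actual DivisionParser code.

    // Minimal sketch of the check described above; not the splitter's real code.
    // Attributes are assumed to arrive as a simple map of name -> value.
    static void handleNode(java.util.Map<String, String> attrs) {
        String latStr = attrs.get("lat");
        String lonStr = attrs.get("lon");
        if (latStr == null || lonStr == null) {
            // Report the problem, skip the node, and carry on with the split.
            System.err.println("Ignoring node " + attrs.get("id") + ": missing lat/lon attribute");
            return;
        }
        double lat = Double.parseDouble(latStr);
        double lon = Double.parseDouble(lonStr);
        // ... continue processing the coordinates as before ...
    }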

2009/8/15 Chris Miller <chris.miller@kbcfp.com>:
Hi Paul,
This is because the .osm file you are splitting has a node with no 'lat' attribute. As far as I'm aware this shouldn't happen; something is probably wrong with your .osm file? I downloaded the europe.osm file a few hours after you posted your message and it processes without a problem. My europe.osm file is 28,626,280,448 bytes in size.
My file was about 200KB smaller...
Anyway I've put in a check for this since in this case the change is simple and it doesn't have much effect on performance. The splitter will now output details of the problem, ignore the node and carry on with the split. Generally speaking though it's not such a good idea to put too much validation of the XML into the splitter because it will just complicate the code and slow things down. I guess if we hit any further problems like this we'll have to decide what's best on a case-by-case basis.
I think that putting additional validation in the splitter, at least to report and ignore bad data, is useful. I understand the performance concern you mentioned -- it is almost always a compromise between robustness and speed.

The alternative would be a separate app that filters out the bad data ("bad" as defined by the splitter, not necessarily by OSM) and writes valid XML to a file for processing with the splitter. You can almost never assume that data coming from outside your own framework is valid. The use case I could imagine: process the data with the splitter; if it fails, preprocess it with the "cleaner" application and run the splitter again on the cleaned data. It is just an idea. If necessary robustness ever causes a significant slowdown, then splitting validation/cleaning from processing might be a good way to go.

Thanks for the fix :)

Paul

--
Don't take life too seriously; you will never get out of it alive.
    -- Elbert Hubbard
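To make the "cleaner" idea concrete, a streaming filter along these lines would do it. This is only a rough sketch under the assumption that "bad" means a node missing its lat or lon attribute; the class name and file handling are made up for the example.

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import javax.xml.namespace.QName;
    import javax.xml.stream.XMLEventReader;
    import javax.xml.stream.XMLEventWriter;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLOutputFactory;
    import javax.xml.stream.events.StartElement;
    import javax.xml.stream.events.XMLEvent;

    // Rough sketch of a standalone "cleaner": copies an .osm file, dropping any
    // <node> element that lacks a lat or lon attribute. Not part of the splitter.
    public class OsmCleaner {
        public static void main(String[] args) throws Exception {
            XMLEventReader reader = XMLInputFactory.newInstance()
                    .createXMLEventReader(new FileInputStream(args[0]));
            XMLEventWriter writer = XMLOutputFactory.newInstance()
                    .createXMLEventWriter(new FileOutputStream(args[1]), "UTF-8");
            int skipDepth = 0; // > 0 while inside a node that is being dropped
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (skipDepth > 0) {
                    if (event.isStartElement()) skipDepth++;
                    else if (event.isEndElement()) skipDepth--;
                    continue; // swallow everything inside the bad node
                }
                if (event.isStartElement()) {
                    StartElement se = event.asStartElement();
                    if ("node".equals(se.getName().getLocalPart())
                            && (se.getAttributeByName(new QName("lat")) == null
                                || se.getAttributeByName(new QName("lon")) == null)) {
                        skipDepth = 1; // drop this node and any child tags
                        continue;
                    }
                }
                writer.add(event);
            }
            writer.close();
            reader.close();
        }
    }

Run as e.g. "java OsmCleaner europe.osm europe-clean.osm"; since it streams the events through, memory use stays flat even on a planet-sized extract.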

Can't you put in the check as an option? With a -check_for_bad_data option, the data passed into the splitter would be checked against a growing number of known XML bugs, and without the option no checks would be done and the app would be faster?

lg Dirk

That doesn't really buy us much unfortunately; given the way the parsing works, the checks need to be scattered about in various places (otherwise a separate validation pass would be required, taking even longer, and I think a separate validation step is outside the scope of the splitter).

Experience in dealing with huge volumes of unreliable external data feeds at my current job has taught me there's only so much you can do to tolerate and recover from bad input. If the source data is corrupt or unexpected, it's often better to fail fast and let a human deal with the problem, rather than make assumptions about what is wrong with the data and carry on, since that can just introduce further problems downstream. And something else I've learned is that no matter how many checks you have in place, something will always slip through the cracks anyway!

That aside, I don't think we have anything to worry about here. The splitter is already tolerant of unrecognised XML tags and attributes, and for the most part this is all we really need. As I said in a previous post, any further problems of this nature are best dealt with on a case-by-case basis.

Chris

On Sat, Aug 15, 2009 at 5:57 AM, Chris Miller<chris.miller@kbcfp.com> wrote:
That doesn't really buy us much unfortunately; given the way the parsing works, the checks need to be scattered about in various places (otherwise a separate validation pass would be required, taking even longer, and I think a separate validation step is outside the scope of the splitter).
Is there an XML schema definition for the 0.6 XML dumps? A bit of googling showed one for the 0.5 API, but I didn't see anything for 0.6. If we had an XML schema definition, a separate XML parser/validator could be used to check inputs.

--
Jeff Ollie

I can't say I've looked for one, though I'd hope there's a schema/DTD out there somewhere. As far as the splitter is concerned, it's only interested in a limited subset of the information anyway, so strict conformance to a schema isn't essential. It certainly wouldn't do any harm for people to validate their XML one way or another before passing it to the splitter though.
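If someone does want to validate up front, the standard Java validation API can stream a file against an XSD. This sketch assumes you have a schema for the 0.6 format saved as "osm-0.6.xsd", which is a placeholder name, since I don't know of an officially published one.

    import java.io.File;
    import javax.xml.XMLConstants;
    import javax.xml.transform.stream.StreamSource;
    import javax.xml.validation.Schema;
    import javax.xml.validation.SchemaFactory;
    import javax.xml.validation.Validator;

    // Sketch of a separate validation step; "osm-0.6.xsd" is a placeholder for
    // whatever schema you can find for the 0.6 format.
    public class OsmValidate {
        public static void main(String[] args) throws Exception {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new File("osm-0.6.xsd"));
            Validator validator = schema.newValidator();
            // StreamSource keeps memory use low even for very large dumps.
            validator.validate(new StreamSource(new File(args[0])));
            System.out.println(args[0] + " validates against the schema");
        }
    }

Validating a 28 GB dump this way will still take a while, but it keeps the checking out of the splitter itself.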
participants (4)
- Chris Miller
- FlaBot
- Jeffrey Ollie
- Paul Ortyl