PNelly
Member
I've landed a real fun one guys.
A little context in the first few paragraphs, then the spooky stuff.
I have a fairly complex networking project going: it's got a meet-up server that facilitates UDP hole punching, a reliable UDP implementation, all kinds of neat stuff. Recently I made some changes to how the meet-up server and meet-up clients talk to each other, to reduce the bandwidth consumed by that part of the application. Since then I've been encountering a very strange fault that I haven't been able to track down, and I'm beginning to wonder whether packets are being corrupted in transit, or something else is going wrong outside of the GML. I'll explain why.
The problem manifests as a crash caused by trying to access a map that doesn't exist, or a non-existent key within a map that does exist, with a low reproducibility of about 5%. Not the best, but it should still be straightforward, right? Capture the packet metadata in the headers I created (a message id and some other parameters), and use that information to pinpoint where I'm writing or reading data incorrectly.
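For concreteness, the failing lookup is along these lines (simplified; global.sessions, the "peers" key, and _session_id are placeholders for my actual structures):

```
// Simplified sketch of the lookup that blows up (names are placeholders)
var _session_map = global.sessions[? _session_id]; // undefined if the id is bad

// Guards that would catch both failure modes before the crash:
if (is_undefined(_session_map) || !ds_exists(_session_map, ds_type_map)) {
    show_debug_message("No session map for id " + string(_session_id));
    exit; // the map itself doesn't exist
}
if (!ds_map_exists(_session_map, "peers")) {
    show_debug_message("Session map is missing the 'peers' key");
    exit; // the key doesn't exist within the map
}
var _peers = _session_map[? "peers"]; // safe to read now
```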
My message ids are declared as two sets of enums (one for TCP and one for UDP) that span the ranges 0-18 and 1000-1010. Any time a packet is to be sent, the buffer containing the data is passed to a script that fills in all the header information, with the message id enum as an argument. Any time a packet is received, the header information is consumed and the appropriate action is taken with the data. It's been a pretty rock-solid system so far.
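For reference, the declarations and the header script look roughly like this, trimmed way down (the entry names are placeholders, and the u16 id field is just how I'll draw it here):

```
// Trimmed-down sketch of the message id enums (entry names are placeholders)
enum TCP_MSG {
    HANDSHAKE,       // 0
    PEER_LIST,       // 1
    // ...continues up through 18
}
enum UDP_MSG {
    PING = 1000,
    SESSION_BREAKUP, // 1001
    // ...continues up through 1010
}

// scr_write_header(buffer, message_id)
// Writes the header fields at the start of the buffer, before the payload goes in
var _buf    = argument0;
var _msg_id = argument1;
buffer_seek(_buf, buffer_seek_start, 0);
buffer_write(_buf, buffer_bool, true);   // leading boolean flag (first header field)
buffer_write(_buf, buffer_u16, _msg_id); // message id field
// ...the other header parameters get written here
return _buf;
```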
Skipping some of the intermediate detective work, I discovered an invalid message id is being passed around somehow as part of the problem, with a value of 44000 or some such nonsense causing a bad map access. Since that doesn't tell me where to go looking for the fault, my next step was to capture the IP and port associated with the Network Async event itself to try and gather more clues. I figured it'd at least point me towards which program instances were talking to each other, and what state they were in when things went wrong.
Now it gets weird. I was able to capture some of these packets this morning and got the following for the IP and port receiving the Network Async event, which, if you recall, come from async_load[] and not from reading the buffer associated with the data event:
- ip: 148.245.24.0
- port: 0
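For reference, this is roughly how I'm pulling those values in the async Networking event (standard async_load keys, nothing exotic):

```
// Async - Networking event: log the metadata GMS hands over with the packet
if (async_load[? "type"] == network_type_data) {
    var _ip     = async_load[? "ip"];     // sender ip as reported by GMS
    var _port   = async_load[? "port"];   // sender port as reported by GMS
    var _buffer = async_load[? "buffer"]; // the payload buffer itself
    show_debug_message("data event from " + string(_ip) + ":" + string(_port));
    // ...header parsing and dispatch happen after this
}
```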
I've been testing by running multiple program instances on the same machine, so they're all using the IP 127.0.0.1. The IP 148.245.24.0 is not an address on my local network (I logged into my router to double check) and my internet connection was disabled at the time. The port value of 0 makes no sense at all to begin with; on top of that, the meet-up server uses a port in the 4000 range and the client sockets are all placed in the ephemeral port range (49152-65535). Simply bizarre.
Of course there's probably something wrong with my code or system design that contributes to (or outright causes) the problem, but those weird values give me some doubts. It stands to reason that if I were writing or reading buffer data with the wrong format or type, I could reproduce the crash very consistently, rather than once in a blue moon with everything (appearing to be) working flawlessly the rest of the time. Further, I think the funky IP and port values in async_load[] would have to come from somewhere under the hood, and not from the GML itself.
A few points of information that might be relevant:
- When the crash does appear, it happens when an established UDP session is broken up. That process entails the (ex) UDP host sending UDP data to tell the clients to pack up; the (ex) UDP clients then close their UDP sockets, open TCP sockets, and connect back to the meet-up server, which leads to more information being exchanged (there's a rough sketch of this teardown just after the list). Could all of those machinations contribute to data being interpreted incorrectly?
- Like I said in the previous bullet, there's often TCP and UDP traffic happening at the same time. Is there some nuance in how the Network Async event treats the two that could be contributing?
- Each time I've seen the crash happen, the invalid message id is the same. Additionally, the very first item in the buffer header is supposed to be a boolean value, but reading it as a u8 shows it contains 173. That evaluates to true in a condition check but is clearly wrong. It would seem to indicate I've written bad data somewhere, but even so, what's up with the crazy associated IP address and port number?
- I'm not using the *_raw network functions, so whatever's going on, I think the GMS header has to be intact for this to appear, and the information I capture is the same each time it happens. I think this ought to rule out packet corruption.
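Here's the rough shape of the teardown mentioned in the first bullet (heavily trimmed; the socket and address variables are placeholders, and the message id comes from the earlier enum sketch):

```
// (Ex) udp host: tell a client to pack up, using the header script from above
// (in reality this goes out to every client in the session)
var _buf = buffer_create(64, buffer_grow, 1);
scr_write_header(_buf, UDP_MSG.SESSION_BREAKUP); // placeholder id from the earlier sketch
network_send_udp(udp_socket, client_ip, client_port, _buf, buffer_tell(_buf));
buffer_delete(_buf);

// (Ex) udp client, upon receiving the pack-up message:
network_destroy(udp_socket);                            // close the udp side
tcp_socket = network_create_socket(network_socket_tcp); // open a tcp socket
network_connect(tcp_socket, meetup_ip, meetup_port);    // back to the meet-up server
```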
Any insights on what in the world might be going on would be awesome. I think the next thing I'll try when I get home is simply ignoring any packets with bad metadata (along the lines of the snippet below), then seeing what information, if any, is missing in the application that received it. I really don't like the idea of that approach becoming a band-aid though; I'd very much like to get to the bottom of this.
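Concretely, the stop-gap would look something like this, with the range checks mirroring the enum spans from earlier (_buffer being the buffer from the data event):

```
// Early-out on metadata that can't possibly be valid, logging it for later study
buffer_seek(_buffer, buffer_seek_start, 0);
var _flag   = buffer_read(_buffer, buffer_u8);  // leading boolean, read raw (should be 0 or 1)
var _msg_id = buffer_read(_buffer, buffer_u16); // message id field
var _valid  = (_flag == 0 || _flag == 1)
           && ((_msg_id >= 0 && _msg_id <= 18) || (_msg_id >= 1000 && _msg_id <= 1010));
if (!_valid) {
    show_debug_message("Dropped packet with bad metadata, id " + string(_msg_id));
    exit; // ignore it rather than crash on a bad map access downstream
}
```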
More generally, if packet corruption is possible (whether or not it's happening here), should I be looking out for bad metadata all the time? My understanding is that that sort of thing is already taken care of with checksums etc. at the transport layer, and that I shouldn't have to worry about it?
Any help appreciated!
Cheers,
Patrick