Hi Guys,
Thanks for the replies.
@Bill - I agree, it's unlikely that nobody else would have found such a bug given how widely used lwIP is. My first assumption is always that I've made an error somewhere :-) but there is no harm in asking the question while I search for the answer. I should have mentioned that I have LWIP_CHECKSUM_ON_COPY = 1 and CHECKSUM_CHECK_TCP = 1. So my build takes the #if TCP_CHECKSUM_ON_COPY path and therefore never calls inet_chksum_pseudo(); see below for more.
@Simon - I'll apply the patch and re-test, but I can see from a debug run that this bit of code is not executed in my implementation.
If you saw my follow-up email, you will have noticed that I identified the code that is causing my problem. It is line 1146 in tcp_out.c (note I have LWIP_CHECKSUM_ON_COPY = 1):
"acc += (u16_t)~(seg->chksum);"
acc is a one's complement checksum obtained from a call to inet_chksum_pseudo_partial() and seg->chksum is a checksum of the payload.
What is happening is that occasionally during operation acc ends up with a value M and seg->chksum happens, by coincidence, to hold the same value M. Then M + (~M) always gives 0xFFFF.
Why hasn't it been seen by others before?
As I'm sure you are aware (I have just been reading up on it!), some checksum checkers accept 0xFFFF as a valid checksum and others do not, depending on how they validate it: either recalculate the checksum and compare it to the inserted one, or calculate over the data including the checksum value and check that the result is 0. In my application Windows 7 seems to recalculate and compare, so it expects 0x0000 (Wireshark does too!). This combination of lwIP options and checksum validation method might explain why others have not seen this error before now.
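To make the difference between the two validation styles concrete, here is a minimal sketch in C. The names ones_sum, verify_by_sum, verify_by_compare and validation_selfcheck are made up for this illustration; this is not code from lwIP, Windows or Wireshark, and it treats the packet as a plain array of 16-bit words.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One's complement sum of n 16-bit words, starting from init (hypothetical helper). */
static uint16_t ones_sum(const uint16_t *words, size_t n, uint16_t init)
{
    uint32_t acc = init;
    for (size_t i = 0; i < n; i++) {
        acc += words[i];
        acc = (acc & 0xFFFFu) + (acc >> 16); /* end-around carry */
    }
    return (uint16_t)acc;
}

/* Style 1: sum the data together with the received checksum field;
 * the packet is accepted when the sum is all ones (its complement is 0). */
static int verify_by_sum(const uint16_t *words, size_t n, uint16_t rx_chksum)
{
    return ones_sum(words, n, rx_chksum) == 0xFFFFu;
}

/* Style 2: recompute the checksum over the data alone and compare it
 * bit for bit with the received checksum field. */
static int verify_by_compare(const uint16_t *words, size_t n, uint16_t rx_chksum)
{
    return (uint16_t)~ones_sum(words, n, 0) == rx_chksum;
}

/* Data whose one's complement sum is 0xFFFF, so the correct checksum is 0x0000. */
static void validation_selfcheck(void)
{
    const uint16_t data[2] = { 0x1234u, 0xEDCBu };

    assert(verify_by_sum(data, 2, 0x0000u));      /* correct value accepted */
    assert(verify_by_sum(data, 2, 0xFFFFu));      /* 0xFFFF also accepted   */
    assert(verify_by_compare(data, 2, 0x0000u));  /* correct value accepted */
    assert(!verify_by_compare(data, 2, 0xFFFFu)); /* 0xFFFF rejected        */
}
```

Calling validation_selfcheck() from a test program should pass all four assertions: a receiver using the first style cannot tell 0x0000 from 0xFFFF, while a receiver using the second style rejects 0xFFFF.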
Mathematically speaking, using one's complement maths, ~(sum(a+b+c+d)) is not the same as [(~sum(a+b)) + (~sum(c+d))] for the special corner case where sum(a+b) = ~sum(c+d). In this special case the answer will be 0xFFFF instead of 0x0000, which is what is happening in my case!
example (using 4 bit numbers for simplicity):
let a = 1, b = 2, c = 4, d = 8.
checksum = ~sum(a+b+c+d) = ~(0xF) = 0x0
sum(a+b) = 3
sum(c+d) = 0xC
Calculated by code = [(~sum(a+b)) + (~sum(c+d))] = [~(3) + ~(0xC)] = [0xC + 3] = 0xF
QED!?
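
The same corner case can be reproduced with 16-bit words. This is only a sketch of the arithmetic described above, not the actual lwIP code: fold16, oc_add, chksum_whole, chksum_split and corner_case_selfcheck are hypothetical names, and sum_hdr / sum_payload stand for the two partial one's complement sums.

```c
#include <assert.h>
#include <stdint.h>

/* Fold a 32-bit accumulator into 16 bits with end-around carry. */
static uint16_t fold16(uint32_t acc)
{
    while (acc >> 16) {
        acc = (acc & 0xFFFFu) + (acc >> 16);
    }
    return (uint16_t)acc;
}

/* One's complement addition of two 16-bit words. */
static uint16_t oc_add(uint16_t x, uint16_t y)
{
    return fold16((uint32_t)x + y);
}

/* Reference: complement of the sum over everything, ~(sum(a+b+c+d)). */
static uint16_t chksum_whole(uint16_t sum_hdr, uint16_t sum_payload)
{
    return (uint16_t)~oc_add(sum_hdr, sum_payload);
}

/* Split form: add the two complemented partial sums, (~sum(a+b)) + (~sum(c+d)). */
static uint16_t chksum_split(uint16_t sum_hdr, uint16_t sum_payload)
{
    return oc_add((uint16_t)~sum_hdr, (uint16_t)~sum_payload);
}

static void corner_case_selfcheck(void)
{
    /* Ordinary case: both forms agree. */
    assert(chksum_whole(0x0001u, 0x0002u) == 0xFFFCu);
    assert(chksum_split(0x0001u, 0x0002u) == 0xFFFCu);

    /* Corner case, sum_hdr == ~sum_payload: the forms differ. */
    assert(chksum_whole(0x0003u, 0xFFFCu) == 0x0000u);
    assert(chksum_split(0x0003u, 0xFFFCu) == 0xFFFFu);
}
```

In one's complement terms 0x0000 and 0xFFFF are both representations of zero, which is why the two forms agree everywhere else and disagree only in how they spell this one result.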
I'm more convinced that this is a coding issue in lwIP that doesn't handle this special corner case, but am happy to be proved wrong!
Regards,
Niall.