From: Fabian Koch
Subject: [lwip-devel] trying to make closing-from-different-task-while-in-accept/select/connect possible
Date: Tue, 1 Apr 2014 18:36:36 +0200

Hey all,

Disclaimer: I know that lwIP currently does not support what we are trying to achieve here, so if your answer is "lwIP doesn't support this", save yourself the time. Still, I hope to get some ideas and feedback from you, because we really need this.

We now have a version running that more or less works. We are not yet sure whether we have thought of everything, which is part of the reason for this mail.

1) In the sys_arch layer, we had a very paranoid implementation that we had to loosen a bit to get the current working behavior:
* sys_arch_mbox_fetch() and sys_arch_sem_wait() now return SYS_ARCH_TIMEOUT when the mailbox or semaphore they are waiting on is deleted during the wait. (This should really be a separate return value so that callers can react differently, but it works for now.)
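
To make the "separate return value" idea concrete, here is a minimal, single-threaded sketch. Everything prefixed with demo_ or DEMO_ is hypothetical and not part of lwIP; lwIP itself only defines SYS_ARCH_TIMEOUT for these functions. The wait is modelled as one non-blocking attempt; a real port would block and re-run the same check when it is woken up.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result codes, not lwIP API. */
typedef enum {
  DEMO_WAIT_OK = 0,    /* semaphore was taken */
  DEMO_WAIT_TIMEOUT,   /* nothing was signalled */
  DEMO_WAIT_DELETED    /* semaphore was freed while someone waited on it */
} demo_wait_result_t;

/* Simplified stand-in for a port's semaphore object. */
typedef struct {
  uint32_t count;  /* pending signals */
  bool deleted;    /* set by demo_sem_free() */
} demo_sem_t;

static void demo_sem_init(demo_sem_t *sem)
{
  sem->count = 0;
  sem->deleted = false;
}

static void demo_sem_signal(demo_sem_t *sem)
{
  if (!sem->deleted) {
    sem->count++;
  }
}

static void demo_sem_free(demo_sem_t *sem)
{
  sem->deleted = true;
  sem->count = 0;
}

/* One wait attempt. Deletion is reported separately from a timeout,
 * so the caller can tell "closed under me" apart from "nothing arrived". */
static demo_wait_result_t demo_sem_try_wait(demo_sem_t *sem)
{
  if (sem->deleted) {
    return DEMO_WAIT_DELETED;
  }
  if (sem->count == 0) {
    return DEMO_WAIT_TIMEOUT;
  }
  sem->count--;
  return DEMO_WAIT_OK;
}
```

The sketch assumes the semaphore object itself stays readable after deletion; in a real port that storage would have to outlive the waiter.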

2) Inside lwip_select(), where the socket->select_waiting counter is incremented, there is an assertion that fires when the socket is suddenly gone (scenario: close() from one task while another task is still in select()).
We changed it like this:

-        LWIP_ASSERT("sock != NULL", sock != NULL);
+        if(sock == NULL) {
+          continue;        //socket has been closed while we were in the select(). Continue the loop to get out of the select() cleanly
+        }

We did this in both for-loops that touch select_waiting.
This is also not a clean solution, but at least select() returns.
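
As a rough illustration of a slightly cleaner reaction, the loop could count the sockets that disappeared instead of skipping them silently, so that the caller can report an error (POSIX select() reports EBADF for an invalid descriptor). This is only a sketch: demo_table, demo_get_socket() and demo_mark_waiting() are hypothetical stand-ins, not lwIP's socket table or its get_socket().

```c
#include <stddef.h>

#define DEMO_MAX_SOCKETS 4

struct demo_sock {
  int select_waiting;  /* number of tasks currently select()ing on this socket */
};

/* Hypothetical socket table; a NULL slot means the socket has been closed. */
static struct demo_sock *demo_table[DEMO_MAX_SOCKETS];

static struct demo_sock *demo_get_socket(int fd)
{
  if (fd < 0 || fd >= DEMO_MAX_SOCKETS) {
    return NULL;
  }
  return demo_table[fd];
}

/* Increment select_waiting for every socket that still exists.
 * Returns how many of the requested sockets were already gone. */
static int demo_mark_waiting(const int *fds, int nfds)
{
  int gone = 0;
  for (int i = 0; i < nfds; i++) {
    struct demo_sock *sock = demo_get_socket(fds[i]);
    if (sock == NULL) {
      gone++;      /* closed from another task while we were in select() */
      continue;
    }
    sock->select_waiting++;
  }
  return gone;
}
```

A non-zero return value would let select() fail explicitly rather than return as if nothing had happened.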

3) We ran into a problem when two tasks called lwip_close() on the same socket in quick succession.
When close() is called a second time while the first call is still in progress, free_socket() has not been called yet, so the protections in netconn_delete() and do_delconn() do not kick in and the close is executed a second time.
When the second close reaches netconn_free(), sys_sem_free() operates on the already invalid semaphore and memp_free() on the already freed conn.
To avoid this, we made the following change in netconn_free():

-  sys_sem_free(&conn->op_completed);
-  sys_sem_set_invalid(&conn->op_completed);
-
-  memp_free(MEMP_NETCONN, conn);
+  if (sys_sem_valid(&conn->op_completed)) {
+    sys_sem_free(&conn->op_completed);
+    sys_sem_set_invalid(&conn->op_completed);
+    memp_free(MEMP_NETCONN, conn);
+  }
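
A different way to guard against the double close, which is not what the patch above does, would be to let exactly one caller claim the teardown with an atomic test-and-set. The sketch below uses C11 atomics and hypothetical demo_ names. It assumes the flag lives in memory that outlives the connection (for example in the socket table entry), because otherwise the second caller would read freed memory.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct demo_conn {
  atomic_flag closing;  /* set by the first task that starts closing */
  int free_count;       /* counts how often the teardown actually ran */
};

static void demo_conn_init(struct demo_conn *conn)
{
  atomic_flag_clear(&conn->closing);
  conn->free_count = 0;
}

/* Returns true if this caller performed the teardown, false if another
 * close() had already claimed it. */
static bool demo_conn_close(struct demo_conn *conn)
{
  if (atomic_flag_test_and_set(&conn->closing)) {
    return false;  /* someone else is closing or has already closed */
  }
  conn->free_count++;  /* stand-in for sys_sem_free() and memp_free() */
  return true;
}
```

The test-and-set is a single atomic step, so two tasks cannot both see the flag as clear, which a separate check followed by a free cannot guarantee.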

4) In netconn_accept() we added an initial value for the newconn pointer. This is not only better style (the pointer has a known value), it is also needed because of point 1): sys_arch_mbox_fetch() now returns SYS_ARCH_TIMEOUT when the socket is closed while we are stuck in accept(), and in that case nothing is ever written to newconn.

-  struct netconn *newconn;
+  struct netconn *newconn = NULL;
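
The reason the initialization matters can be shown with a small model. demo_mbox_fetch() and demo_accept() are hypothetical stand-ins, not the lwIP functions: the fetch only writes its output pointer when a message is available, so on the timeout path the caller's variable keeps whatever value it had before.

```c
#include <stddef.h>

#define DEMO_FETCH_OK      0
#define DEMO_FETCH_TIMEOUT 1

struct demo_conn {
  int id;
};

/* Writes *msg only when a message is pending; on timeout *msg is untouched. */
static int demo_mbox_fetch(void *pending, void **msg)
{
  if (pending == NULL) {
    return DEMO_FETCH_TIMEOUT;
  }
  *msg = pending;
  return DEMO_FETCH_OK;
}

/* Without the NULL initialization, the timeout path would return an
 * indeterminate pointer to the caller. */
static struct demo_conn *demo_accept(void *pending)
{
  void *msg = NULL;
  struct demo_conn *newconn = NULL;

  if (demo_mbox_fetch(pending, &msg) == DEMO_FETCH_OK) {
    newconn = msg;
  }
  return newconn;
}
```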

5) In the generic tcpip_apimsg() call, we added a guard against returning a possibly wrong error code (e.g. ERR_OK) when sys_arch_sem_wait() returns because the connection was closed:

@@ -316,6 +316,10 @@
     msg.msg.apimsg = apimsg;
     sys_mbox_post(&mbox, &msg);
     sys_arch_sem_wait(&apimsg->msg.conn->op_completed, 0);
+    /* do a quick check if the connection is still valid after we return from sem_wait */
+    if(apimsg->msg.conn == NULL){
+      return ERR_VAL;
+    }
     return apimsg->msg.err;
   }
   return ERR_VAL;

We could have checked for SYS_ARCH_TIMEOUT here, but I feel that this hack has to go away eventually: sys_arch_sem_wait() and sys_arch_mbox_fetch() should return a different value when their resources have been deleted.
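
If the wait did report deletion separately, the caller could map the wait result to an error code directly instead of inspecting the connection pointer afterwards. The following is a sketch under that assumption; the enums and demo_apimsg_result() are hypothetical and do not correspond to lwIP's err_t values.

```c
/* Hypothetical wait results and error codes, not lwIP definitions. */
typedef enum {
  DEMO_WAIT_OK,
  DEMO_WAIT_TIMEOUT,
  DEMO_WAIT_DELETED
} demo_wait_result_t;

typedef enum {
  DEMO_ERR_OK      = 0,
  DEMO_ERR_TIMEOUT = -1,
  DEMO_ERR_CLOSED  = -2
} demo_err_t;

/* Decide what the API call should return, based on how the wait ended.
 * msg_err is only trusted when the wait completed normally. */
static demo_err_t demo_apimsg_result(demo_wait_result_t waited, demo_err_t msg_err)
{
  switch (waited) {
    case DEMO_WAIT_OK:
      return msg_err;           /* the operation finished; its result is valid */
    case DEMO_WAIT_DELETED:
      return DEMO_ERR_CLOSED;   /* connection was closed from another task */
    default:
      return DEMO_ERR_TIMEOUT;  /* not expected when waiting without a timeout */
  }
}
```

This avoids touching the message or connection at all on the deleted path, where both may already be invalid.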

6) We are still looking for a good way to close a socket that is currently blocked in connect(). There are of course several LWIP_ASSERT() calls in do_delconn(), but even ignoring those, the netconn_drain() call is a problem with our sys_arch layer, and it gets pretty ugly from there.
I'll keep looking into that roadblock tomorrow (there is an "@todo TCP: abort running write/connect?" comment in there, so...)

7) Writing and reading from different tasks seems to work for now, as long as there is only one tx task and one rx task.

I would love any notes, hints, help, comments, anything!

kind regards,
Fabian