[Mldonkey-bugs] [bugs #11384] Source of Orphaned File Descriptor Bug


From: anonymous
Subject: [Mldonkey-bugs] [bugs #11384] Source of Orphaned File Descriptor Bug
Date: Thu, 06 Jan 2005 21:51:45 -0500
User-agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-us) AppleWebKit/125.5.5 (KHTML, like Gecko) Safari/125.12

This mail is an automated notification from the bugs tracker
 of the project: mldonkey, a multi-networks file-sharing client.

/**************************************************************************/
[bugs #11384] Latest Modifications:

Changes by: Anonymous user
Date: Thu 01/06/2005 at 21:26

------------------ Additional Follow-up Comments ----------------------------
Although I don't understand the root cause, I think I now know the reason for 
the orphaned socket on Mac OS X.  I believe the source of the problem is the 
OCaml Unix.accept routine.  I think this routine calls the C function 
unix_accept defined in ocaml-3.08.0/otherlibs/unix/accept.c, and the code 
looks like this:

CAMLprim value unix_accept(value sock)
{
  int retcode;
  value res;
  value a;
  union sock_addr_union addr;
  socklen_param_type addr_len;
  
  addr_len = sizeof(addr);
  enter_blocking_section();
  retcode = accept(Int_val(sock), &addr.s_gen, &addr_len);
  leave_blocking_section();
  if (retcode == -1) uerror("accept", Nothing);
  a = alloc_sockaddr(&addr, addr_len);
  Begin_root (a);
    res = alloc_small(2, 0);
    Field(res, 0) = Val_int(retcode);
    Field(res, 1) = a;
  End_roots();
  return res;
}

Notice that the system accept call creates a file descriptor and returns it 
in retcode (assuming no error).  unix_accept then calls alloc_sockaddr, which 
is defined in ocaml-3.08.0/otherlibs/unix/socketaddr.c and looks like this:

value alloc_sockaddr(union sock_addr_union * adr /*in*/,
                     socklen_param_type adr_len)
{
  value res;
  switch(adr->s_gen.sa_family) {
#ifndef _WIN32
  case AF_UNIX:
    { value n = copy_string(adr->s_unix.sun_path);
      Begin_root (n);
        res = alloc_small(1, 0);
        Field(res,0) = n;
      End_roots();
      break;
    }
#endif 
  case AF_INET: 
    { value a = alloc_inet_addr(&adr->s_inet.sin_addr);
      Begin_root (a);
        res = alloc_small(2, 1);
        Field(res,0) = a;
        Field(res,1) = Val_int(ntohs(adr->s_inet.sin_port));
      End_roots();
      break;
    }
#ifdef HAS_IPV6
  case AF_INET6:
    { value a = alloc_inet6_addr(&adr->s_inet6.sin6_addr);
      Begin_root (a);
        res = alloc_small(2, 1);
        Field(res,0) = a;
        Field(res,1) = Val_int(ntohs(adr->s_inet6.sin6_port));
      End_roots();
      break;
    }
#endif
  default:
    unix_error(EAFNOSUPPORT, "", Nothing);
  }
  return res;
}

Note that if sa_family doesn't match any case, the default branch calls 
unix_error with EAFNOSUPPORT, which is exactly the error that is seen in 
mldonkey.   I'm not sure exactly what happens when unix_error is called, but 
if alloc_sockaddr doesn't return to unix_accept, retcode is never put into 
the result of unix_accept, and mldonkey has no way to reach the file 
descriptor and close it when this error occurs.   If this is true, the 
problem has to be corrected in the OCaml library by closing retcode if 
alloc_sockaddr gets an error.

On the other hand, if alloc_sockaddr does return after calling unix_error and 
unix_accept does fill in the return value with the file descriptor (retcode), 
then the mldonkey code would have to "close s" in the error part of the 
routine, and I don't know OCaml well enough to figure out how to get to s to 
close it.

So the bottom line is that the OCaml library may need to be modified to close 
the file descriptor in retcode if alloc_sockaddr gets an error.
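
The first of the two cases above can be illustrated outside the OCaml 
runtime.  As far as I know, unix_error raises the OCaml Unix.Unix_error 
exception instead of returning, so the accepted descriptor would be lost in 
the way described.  What follows is a minimal plain C sketch of the suggested 
fix, not the actual OCaml library code: check_family and finish_accept are 
hypothetical names, and check_family only stands in for the switch in 
alloc_sockaddr.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical stand-in for the switch in alloc_sockaddr: returns 0 for an
   address family the caller can represent, or -1 with errno set to
   EAFNOSUPPORT for anything else. */
int check_family(const struct sockaddr *addr)
{
    switch (addr->sa_family) {
    case AF_UNIX:
    case AF_INET:
    case AF_INET6:
        return 0;
    default:
        errno = EAFNOSUPPORT;
        return -1;
    }
}

/* Hypothetical post-accept step.  conn_fd is a descriptor that accept() has
   already returned.  If the peer address cannot be converted, conn_fd is
   closed before the error is reported, so the caller is never left with a
   descriptor it cannot reach.  Returns conn_fd on success, -1 on failure. */
int finish_accept(int conn_fd, const struct sockaddr *addr)
{
    assert(conn_fd >= 0 && addr != NULL);
    if (check_family(addr) == -1) {
        int saved = errno;   /* close() may overwrite errno */
        close(conn_fd);
        errno = saved;
        return -1;
    }
    return conn_fd;
}
```

In unix_accept the equivalent change would presumably be to check the address 
family (or otherwise detect the failure) and close retcode before the error 
is raised; that placement is an assumption, not something confirmed against 
the OCaml sources.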

Shunga






/**************************************************************************/
[bugs #11384] Full Item Snapshot:

URL: <http://savannah.nongnu.org/bugs/?func=detailitem&item_id=11384>
Project: mldonkey, a multi-networks file-sharing client
Submitted by: Shunga
On: Thu 12/23/2004 at 10:46

Category:  Core
Severity:  5 - Average
Item Group:  Program malfunction
Resolution:  None
Privacy:  Public
Assigned to:  None
Status:  Open
Release:  None
Release:  2.5-22
Platform Version:  Mac OS X Jaguar
Binaries Origin:  CVS / Self compiled
CPU type:  PowerPC


Summary:  Source of Orphaned File Descriptor Bug

Original Submission:  On Mac OS X, and I assume on other systems, there are 
two bugs in mlnet which together generate hundreds of orphaned file 
descriptors, causing mlnet to eventually hang.  I worked with the mlnet 2.5-22 
source and ocaml 3.07-p12 to debug the problem.

I don't know the source code or OCaml well enough to say exactly why it is 
happening or how best to fix it, but I have done enough debugging to figure 
out the cause of the problem. There are two issues: 

1. The first is in src/daemon/common/commonChat.ml in the routine 
send_paquet_to_mlchat. The Unix.connect fails with "Connection refused : 
connect" but the error is not trapped and the socket is not closed. Trapping 
the error and closing the socket fixes this one. 

2. The rest of the orphaned file descriptors come from 
src/utils/net/tcpServerSocket.ml in the routine tcp_handler. The Unix.accept 
fails with "Exception tcp_handler: failed: Address family not supported by 
protocol family" but apparently has already created a new socket, which is 
never closed. If I trap the exception and issue "close t (Closed_for_error 
(Printexc2.to_string e));", lsof shows only one orphaned socket after hours 
of running. I assume that "close t" closes the original listening socket, 
which stops Unix.accept from being called again. I don't know why this error 
occurs, unless perhaps the previous bind failed and that wasn't trapped, but 
there may be some other reason. 
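
The fix for the first issue (trap the connect failure and close the socket) 
can be sketched in plain C as follows.  This is only an illustration of the 
pattern, not the commonChat.ml code: connect_or_close is a hypothetical 
helper, and it assumes a blocking socket (a non-blocking socket reporting 
EINPROGRESS would need different handling).

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: connect sock_fd to addr.  If connect() fails, the
   descriptor is closed here, so a refused connection cannot leave an
   orphaned descriptor behind.  Returns 0 on success, or -1 with errno set
   by connect() on failure.  Assumes sock_fd is in blocking mode. */
int connect_or_close(int sock_fd, const struct sockaddr *addr, socklen_t len)
{
    assert(addr != NULL);
    if (connect(sock_fd, addr, len) == -1) {
        int saved = errno;   /* keep connect()'s error across close() */
        close(sock_fd);
        errno = saved;
        return -1;
    }
    return 0;
}
```

After a failed call the caller must not use or close sock_fd again, since the 
helper has already released it.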

So if a developer who knows the code and OCaml can fix these two problems and 
get the patches into the current release, that should solve the orphaned file 
descriptor problem, which causes mlnet to hang after running for some hours.

Follow-up Comments
------------------


-------------------------------------------------------
Date: Thu 01/06/2005 at 11:55       By: spiralvoice <spiralvoice>
Just my observation: with 2-5-28i and this patch I always got LowIDs, without 
it HighIDs.

-------------------------------------------------------
Date: Thu 01/06/2005 at 09:47       By: Shunga <shunga>
Adding the begin/end around the "try" in tcp_handler fixed the compilation 
warning but didn't change anything.  It still appears that when "close t" is 
performed, the socket for port 4662 is closed, and because that port is now 
closed, future server connections get a low ID.   The error in the log, 
"Exception tcp_handler:  failed: Address family not supported by protocol 
family", appears to come from a Unix connect call and not from the accept 
call, although when I added a Unix.Unix_error match to the try, 
Unix.Unix_error was never matched, only "e".  What I would like to try, but 
don't know how to do, is to close the socket "s" that was created by the 
Unix.accept call when the error occurs, instead of closing t (which appears 
to be the socket for port 4662).  I expect that whatever the problem is, it 
is not actually in tcp_handler but somewhere else, perhaps related to a 
connect call.
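
The idea of closing the accepted socket "s" rather than the listening socket 
"t" can be sketched in plain C.  This is a hedged illustration of the pattern 
only, not mldonkey code: struct server, conn_handler, dispatch_connection and 
the two example handlers are all hypothetical names.

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical callback type: returns 0 on success, nonzero on failure. */
typedef int (*conn_handler)(int conn_fd);

/* Hypothetical server record, loosely mirroring "t" in tcp_handler. */
struct server {
    int listen_fd;               /* the listening socket */
    conn_handler on_connection;  /* called for each accepted connection */
};

/* Hand an accepted connection to the server's handler.  If the handler
   fails, only conn_fd is closed; srv->listen_fd is left open so that later
   connections can still be accepted.  Returns the handler's result. */
int dispatch_connection(const struct server *srv, int conn_fd)
{
    assert(srv != NULL && srv->on_connection != NULL);
    int rc = srv->on_connection(conn_fd);
    if (rc != 0)
        close(conn_fd);
    return rc;
}

/* Example handlers for trying the sketch out. */
int accept_all(int conn_fd) { (void)conn_fd; return 0; }
int reject_all(int conn_fd) { (void)conn_fd; return -1; }
```

With this shape a failing connection costs one descriptor at most for the 
duration of the call, and the listening port stays reachable, which is the 
behaviour that "close t" loses.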

Shunga


-------------------------------------------------------
Date: Thu 01/06/2005 at 09:06       By: 0 <None>
I compiled, with no changes, the tag-2-5-29ab source from schlumpf that you 
pointed to.  After I posted the bug and the "try" suggestion to illustrate 
the error, I did notice the compilation warning and added the begin/end 
combination just as you did in the patch.  As I recall, it didn't change 
anything, but I'll try it once more with the tag-2-5-29ab source to make sure.

Shunga

-------------------------------------------------------
Date: Wed 01/05/2005 at 23:55       By: Amorphous <amorphous>
did you try that with or without the patch i posted in the forum linked in my 
last comment to this bug? if without please try with it applied. (oh and no 
need to message me through savannah i get notified on changes of bugs i posted 
to)


-------------------------------------------------------
Date: Wed 01/05/2005 at 22:33       By: Shunga <shunga>
When I originally suggested the "try" in tcp_handler to illustrate the error, 
I noticed the following issue, which also occurs with 2.5.29ab and the patch.  
When the error occurs and the "close t" is issued, that close apparently 
closes the socket on port 4662.  After that, it is true that there are no 
more orphaned file descriptors; however, any additional connections to 
servers appear to result in a lowid.  For example, in my console log, after 
the error occurs, I start seeing the following when attaching to new servers:

+-- From server  [193.41.142.148:10000] ------
| WARNING : You have a lowid. Please review your network config and/or your settings.

+-- From server DonkeyServer No6  [62.241.53.4:4242] ------
| WARNING : You have a lowid. Please review your network config and/or your settings.

+-- From server www.MESSENGER7.NET [205.209.178.170:12933] ------
| WARNING : Your 4662 port is not reachable. Please review your network config.
| server version 17.1 (lugdunum)

Before the error occurred, servers did not report lowid.

Shunga.


-------------------------------------------------------
Date: Wed 01/05/2005 at 13:53       By: Amorphous <amorphous>
it's in the svn repository mentioned in another thread in that forum-group. i 
added a link to an archive of the source, schlumpf provided.

-------------------------------------------------------
Date: Wed 01/05/2005 at 04:34       By:  <d-b>
I would like to try this, but with what? I read the post at 
http://mldonkey.berlios.de/modules.php?name=Forums&file=viewtopic&t=3201&sid=6c52a2530f6046d72fdfbbb94c0c1d72
and I have looked at the CVS page, but I don't know which source to 
download. Where is the 29ab version?

-------------------------------------------------------
Date: Wed 01/05/2005 at 03:23       By: Amorphous <amorphous>
could you confirm if this is fixed with 2.5.29ab? see 
http://mldonkey.berlios.de/modules.php?name=Forums&file=viewtopic&t=3201&sid=6c52a2530f6046d72fdfbbb94c0c1d72

-------------------------------------------------------
Date: Thu 12/23/2004 at 15:51       By: Shunga <shunga>
Programmer asleep at the switch :-).  I was wrong about some other software 
change.  Turns out tcp_handler does fail if changed as follows:

let tcp_handler t sock event =
  match event with 
  | CAN_READ
  | CAN_WRITE ->
      (* No begin/end around this try, so the final "| _" case below is 
         parsed as part of the exception handler, not of the match. *)
      try 
        let s,id = Unix.accept (fd sock) in
        if !verbose_bandwidth > 1 then
          lprintf "[BW2 %6d] accept on %s\n" (last_time ()) t.name;
        (match t.accept_control with
            None -> () | Some cc ->
              cc.nconnections_last_second <- cc.nconnections_last_second + 1);
        incr nconnections_last_second;
        t.event_handler t (CONNECTION (s,id))
      with  e ->
        lprintf "Exception tcp_handler: %s\n" (Printexc2.to_string e);
        close t (Closed_for_error (Printexc2.to_string e)); 
        raise e
  | _ -> t.event_handler t (BASIC_EVENT event)

and it leaves one socket orphaned, which I assume is "s".  I don't know how to 
get "s" down into the "with e ->" so that it can be closed, and I don't know 
if I need to "close t" as is indicated in the code, which at the moment is 
closing one of the server sockets that is being listened on.  However, with 
this change "mlnet" has run for hours with only one orphaned socket.  This 
plus the commonChat change should get rid of the orphaned sockets.  I'll 
leave it up to the experts to figure out what is really going on and how best 
to fix it.
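
Claims such as "only one orphaned socket after hours" were checked with lsof 
in this thread.  A process can also count its own open descriptors, which 
makes a leak visible from inside a test.  The following is a rough C sketch 
under that assumption; fd_is_open and count_open_fds are hypothetical 
helpers, and the cap on the scan range is an arbitrary choice for the sketch.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns nonzero if fd refers to an open descriptor in this process.
   fcntl(F_GETFD) fails with EBADF for descriptor numbers that are unused. */
int fd_is_open(int fd)
{
    assert(fd >= 0);
    return fcntl(fd, F_GETFD) != -1;
}

/* Count the descriptors currently open in this process by probing every
   number below the per-process limit.  The scan is capped at 65536 so that
   a very large limit does not make it slow; descriptors above the cap are
   not counted. */
int count_open_fds(void)
{
    long max_fd = sysconf(_SC_OPEN_MAX);
    if (max_fd < 0 || max_fd > 65536)
        max_fd = 65536;
    int count = 0;
    for (int fd = 0; fd < (int)max_fd; fd++) {
        if (fd_is_open(fd))
            count++;
    }
    return count;
}
```

Comparing the count before and after a burst of failed accepts or connects 
should show whether descriptors are accumulating, without needing an external 
tool.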

-------------------------------------------------------
Date: Thu 12/23/2004 at 15:03       By: Shunga <shunga>
Well it would appear that the failure in tcp_handler was due to some other 
change that I must have made while attempting to debug this.  When I start over 
with a fresh copy of the source the handler doesn't fail and the file 
descriptors start building up.

Guess I have to go back and see if I can figure out what else it was that I 
changed.  :-(





CC List
-------

CC Address                          | Comment
------------------------------------+-----------------------------
hgd                                 | 
i97_bed --AT-- i --DOT-- kth --DOT-- se | 









For detailed info, follow this link:
<http://savannah.nongnu.org/bugs/?func=detailitem&item_id=11384>

_______________________________________________
  Message sent via/by Savannah
  http://savannah.nongnu.org/






