Re: Binary recognition is to narrow [new suggestion]
From: Hideki IWAMOTO
Subject: Re: Binary recognition is to narrow [new suggestion]
Date: Sat, 21 Nov 2009 19:21:25 +0900
Hi.
>     if (c <= 8)
>         return 1;
>     if (c >= 14 && c < 32)
>         return 1;
You had better use a table look-up, as in the attached patch.
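
The attached patch is not reproduced in this archive, so the following is only a minimal sketch of what a table look-up for this test could look like, not the patch itself. The names binary_chars and has_binary_char are hypothetical; the table simply encodes the two range checks quoted above (bytes 0-8 and 14-31).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical look-up table: a non-zero entry marks a byte value that
 * is treated as evidence of binary data (0-8 and 14-31).
 */
static const unsigned char binary_chars[256] = {
    /* 0x00-0x08: control characters */
    1, 1, 1, 1, 1, 1, 1, 1, 1,
    /* 0x09-0x0d: TAB, LF, VT, FF, CR are allowed in text */
    0, 0, 0, 0, 0,
    /* 0x0e-0x1f: control characters */
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
    /* 0x20-0xff: remaining entries are implicitly 0 */
};

/* Return 1 if any of the first len bytes of buf is marked in the table. */
static int
has_binary_char(const unsigned char *buf, size_t len)
{
    size_t i;

    assert(buf != NULL || len == 0);
    for (i = 0; i < len; i++)
        if (binary_chars[buf[i]])
            return 1;
    return 0;
}
```

Because the function takes an explicit length instead of stopping at a NUL byte, an embedded NUL is caught by the same table entry as the other control characters.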
On Fri, 20 Nov 2009 17:33:01 +0100 (CET), Erik Jonsson wrote...
> Hi again,
>
> I have done some more testing and calculations now. The probability that a
> binary file will pass as a text file is quite high if one tests only the
> first 32 bytes. I have therefore tested the performance if one were to
> use the first 512 bytes. I found that the performance hit was minimal,
> while the benefits are several.
>
> Instead of counting characters over 127, the only test is that the first
> 511 bytes don't contain any of the control characters 0-8 or 14-31. No
> normal text file would contain these.
>
> Assuming that binary data is random, the probability of an incorrectly
> tagged binary would be
>
> ((256-8-18)/256)^511=.00000000000000000000000170726
>
> Testing just 127 bytes would be a bit too little:
>
> ((256-8-18)/256)^127=.00000123868
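
The two figures quoted above can be checked with a few lines of C. This is a sketch only: pass_probability is a hypothetical helper, and it assumes, as the message does, that every byte of a binary file is independent and uniformly random.

```c
#include <assert.h>

/*
 * Probability that nbytes independent, uniformly random bytes all fall
 * within a set of `allowed` byte values out of 256.
 * Hypothetical helper for illustration; not part of GLOBAL.
 */
static double
pass_probability(int allowed, int nbytes)
{
    double p = 1.0;
    int i;

    assert(allowed >= 0 && allowed <= 256 && nbytes >= 0);
    /* Multiply step by step so no math library is needed. */
    for (i = 0; i < nbytes; i++)
        p *= allowed / 256.0;
    return p;
}
```

pass_probability(230, 511) and pass_probability(230, 127) correspond to the two expressions in the message. Note that the range 0-8 covers nine byte values, not eight, so 229 allowed values is the exact count; with 229 the 511-byte probability comes out at about a tenth of the 230 figure, which only strengthens the argument.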
>
> One of the benefits is that this will correctly tag Unicode files as
> text as well, since those control characters never appear in Unicode
> either.
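
For the UTF-8 encoding of Unicode this is easy to confirm: every byte of a multi-byte sequence is 0x80 or above, so non-ASCII characters can never produce a byte in the ranges 0-8 or 14-31. The sketch below illustrates this with a small hand-written encoder; utf8_encode and is_control_byte are hypothetical helpers, and surrogate code points are not rejected. UTF-16 and UTF-32 are a different matter, since they regularly contain zero bytes.

```c
#include <assert.h>
#include <stddef.h>

/* The control-character test from the message: bytes 0-8 and 14-31. */
static int
is_control_byte(unsigned char c)
{
    return c <= 8 || (c >= 14 && c < 32);
}

/*
 * Encode one code point as UTF-8 into out and return the byte count.
 * Hypothetical helper for illustration only.
 */
static size_t
utf8_encode(unsigned long cp, unsigned char out[4])
{
    assert(cp <= 0x10FFFFUL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

Only code points below 0x80 are emitted as a single byte equal to the code point itself, so the only way a UTF-8 file can trip the test is by containing the ASCII control characters directly.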
>
> The performance hit seems minimal on my computer.
>
> 511 byte version
> address@hidden:~/source/dps/src$ time ~/install/global-5.7.6/gtags/gtags
> real 0m34.425s
> user 0m8.337s
> sys 0m3.080s
>
> 32 byte version
> address@hidden:~/source/dps/src$ time gtags
> real 0m32.120s
> user 0m8.361s
> sys 0m2.820s
>
>
> I have tried to clear the cache as well as possible between the runs.
>
> Here is the 511 byte is_binary that I'm using.
>
> static int
> is_binary(const char *path)
> {
>     int ip;
>     char buf[512];
>     char *cp;
>     int c, size;
>
>     ip = open(path, O_RDONLY);
>     if (ip < 0)
>         die("cannot open file '%s' in read mode.", path);
>     size = read(ip, buf, sizeof(buf) - 1);
>     close(ip);
>
>     /* Check for errors and empty files before using size as an index. */
>     if (size <= 0)
>         return 1;
>     buf[size] = 0;    /* terminate the data */
>
>     if (size >= 7 && locatestring(buf, "!<arch>", MATCH_AT_FIRST))
>         return 1;
>     cp = buf;
>     /* Stop at the first NUL: the terminator or an embedded NUL byte. */
>     while ((c = (unsigned char) *cp)) {
>         if (c <= 8)
>             return 1;
>         if (c >= 14 && c < 32)
>             return 1;
>         cp++;
>     }
>
>     /* Stopping before the terminator means an embedded NUL: binary. */
>     return cp != buf + size;
> }
>
> Feel free to use the code as you like.
>
> /Erik J.
----
Hideki IWAMOTO address@hidden
20091121-binarychar.patch
Description: Binary data