Re: Cannot searh MessengerPlus xhtml chat logs
From: Paolo Bonzini
Subject: Re: Cannot searh MessengerPlus xhtml chat logs
Date: Mon, 26 Apr 2010 10:14:38 +0200
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.8) Gecko/20100301 Fedora/3.0.3-1.fc12 Lightning/1.0b2pre Thunderbird/3.0.3
On 04/25/2010 11:36 AM, Wuhtzu wrote:
> Hi everyone
> Forgive my possible lack of knowledge, but I've come across something
> strange using GNU grep 2.5.4 with Windows 7
> (http://gnuwin32.sourceforge.net/packages/grep.htm).
> I was trying to search xhtml chat logs generated by MessengerPlus! Live v.
> 4.83.0.376 but I have been completely unable to do so. Trying to search for
> something a little more complicated I came to a dead stop trying just this:
> grep -ic td *.html
> All the xhtml files in the current directory were listed but with count 0
> even though they contain hundreds of td-tags. In order to get matches within
> a file I had to open it, copy its content to a new file and save it again.
> Then it matched just as it's supposed to.
> Two file samples are available here:
> Original chat log. Not able to match anything:
> http://wuhtzu.dk/random/posts/ex-april-2009.html
This one is in UTF-16. It's very hard to match anything in this
encoding, since even "normal" Latin characters are not represented the
same way as in ASCII: each one takes two bytes (in UTF-16LE, the ASCII
byte followed by a NUL byte).
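To see the difference concretely, here is a minimal sketch that dumps the bytes of the string "td" in both encodings. It assumes iconv and od are installed; bytes_hex is a hypothetical helper name, not something from grep.

```shell
# Hypothetical helper: print the bytes of string $1 in encoding $2 as hex.
bytes_hex() {
  printf '%s' "$1" | iconv -f UTF-8 -t "$2" | od -An -tx1 | tr -d ' \n'
  echo
}

bytes_hex td UTF-8      # prints 7464
bytes_hex td UTF-16LE   # prints 74006400
```

A plain "td" pattern is the two bytes 74 64, which never occur next to each other in the UTF-16LE file.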
In particular, this won't work
LANG=en_US.UTF-16LE fgrep -c 't\x00d\x00' ex-april-2009.html
because grep does not handle \x escape sequences; maybe that could be
added as a feature. These three on the other hand work:
1) using Perl regular expressions:
LANG=en_US.UTF-16LE grep -Pc 't\x00d\x00' ex-april-2009.html
2) using tr or printf to print the regex, using bash <(...) syntax.
This won't work because the shell truncates the argument at the first
nul character, before echo ever sees it:
LANG=en_US.UTF-16LE grep -icf <(echo $'t\x00d\x00') ex-april-2009.html
however you can use these two:
LANG=en_US.UTF-16LE grep -icf <(echo t@d@ | tr @ '\0') ex-april-2009.html
LANG=en_US.UTF-16LE grep -icf <(printf '%c\0' t d) ex-april-2009.html
3) same as above using a temporary file.
echo -n td | iconv -f UTF-8 -t UTF-16LE > test-re
LANG=en_US.UTF-16LE grep -icf test-re ex-april-2009.html
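As a sanity check, the printf and iconv approaches above should produce the same pattern bytes. The sketch below compares them with od; the variable names are made up for illustration, and printf 'td' stands in for "echo -n td" so that it also runs in shells whose echo lacks -n.

```shell
# Pattern bytes from the printf trick: the format is reused for each argument.
from_printf=$(printf '%c\0' t d | od -An -tx1 | tr -d ' \n')

# Pattern bytes from converting "td" (no trailing newline) with iconv.
from_iconv=$(printf 'td' | iconv -f UTF-8 -t UTF-16LE | od -An -tx1 | tr -d ' \n')

echo "printf: $from_printf"   # prints printf: 74006400
echo "iconv:  $from_iconv"    # prints iconv:  74006400
```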
With some care, the full power of regular expressions can be used, for
example
LANG=en_US.UTF-16LE grep -icf <(printf '%c\0' t . t) ex-april-2009.html
However, there are a lot of tricky areas here, for example the \n _byte_
is used as a separator rather than the Unicode character \n (which would
be "\n\x0"), and that is why "echo -n" is needed in the example above.
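The newline pitfall can be made visible by dumping the pattern bytes with and without a trailing newline. This is only a sketch, assuming iconv and od are available; the variable names are hypothetical.

```shell
# With a trailing newline, the converted pattern ends in 0a 00.
with_newline=$(printf 'td\n' | iconv -f UTF-8 -t UTF-16LE | od -An -tx1 | tr -d ' \n')

# Without it, only the two characters remain.
without_newline=$(printf 'td' | iconv -f UTF-8 -t UTF-16LE | od -An -tx1 | tr -d ' \n')

echo "with newline:    $with_newline"      # prints with newline:    740064000a00
echo "without newline: $without_newline"   # prints without newline: 74006400
```

Since the pattern file is split at the 0a byte, the first dump would leave a lone 00 byte behind as a second pattern, which is presumably not what was intended.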
Paolo