[Octave-bug-tracker] [bug #51707] textscan seems to skip chunks of text


From: HJW
Subject: [Octave-bug-tracker] [bug #51707] textscan seems to skip chunks of text
Date: Wed, 9 Aug 2017 14:14:57 -0400 (EDT)

URL:
  <http://savannah.gnu.org/bugs/?51707>

                 Summary: textscan seems to skip chunks of text
                 Project: GNU Octave
            Submitted by: thrynae
            Submitted on: Wed 09 Aug 2017 06:14:56 PM UTC
                Category: Octave Function
                Severity: 3 - Normal
                Priority: 5 - Normal
              Item Group: Matlab Compatibility
                  Status: None
             Assigned to: None
         Originator Name: 
        Originator Email: 
             Open/Closed: Open
         Discussion Lock: Any
                 Release: 4.2.0
        Operating System: Microsoft Windows

    _______________________________________________________

Details:

(context: I'm downloading webpages and merging specific content)

When I open the downloaded file in Notepad++, all the text is there, but when
I read it with textscan, some text is missing. The behavior is reproducible
(the same text is missing each time). This does not happen under Matlab
(tested on R2017a and R2012b). My OS is 64-bit Windows 10, and my Octave
version is 4.2.0.

As far as I can tell, this has not been reported yet. Other files with longer
lines (77k on a single line) do not fail. I can't find any systematic cause,
but the problem occurs in multiple files.

MWE:

filename='HB_SNG3.html';
if ~exist(filename,'file')
  %download file
  url='http://web.archive.org/web/20170807165834/https://www.bible.com/nl/bible/75/SNG.3.htb';
  urlwrite(url,filename);
end
%load file
fid=fopen(filename,'rt','n');
data=textscan(fid,'%s','Delimiter','\n');
fclose(fid);
%convert file to a single long string
data=data{1};data(:,2)={' '};data=data';data=data(:)';data=cell2mat(data);
%remove the parts of the webpage that are not relevant for my goal.
pattern='<div class="book bk';
idx=strfind(data,pattern);
data=data(idx(1):end);
pattern='</div><div class="version-copyright"';
idx=strfind(data,pattern);
if isempty(idx),error('this is possibly a bug'),end
data=data(1:(idx(end)-1));
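
One way to narrow this down might be to check whether a pattern is present in
the raw file contents but absent from what textscan returns. The following is
only a sketch: pattern_survives is a hypothetical helper name that does not
appear in the MWE, and it assumes that fileread returns the file contents
unmodified.

```octave
% Hypothetical helper for diagnosis only.
function [in_raw, in_scanned] = pattern_survives(filename, pattern)
  % Read the whole file in one piece, bypassing textscan.
  raw = fileread(filename);
  in_raw = ~isempty(strfind(raw, pattern));

  % Read the same file the way the MWE does.
  fid = fopen(filename, 'rt', 'n');
  data = textscan(fid, '%s', 'Delimiter', '\n');
  fclose(fid);

  % Join the scanned lines with single spaces, as the MWE does.
  joined = strjoin(data{1}', ' ');
  in_scanned = ~isempty(strfind(joined, pattern));
end
```

Calling it as [in_raw, in_scanned] = pattern_survives('HB_SNG3.html',
'</div><div class="version-copyright"') would then show where the text goes
missing: if in_raw is true and in_scanned is false, the pattern was lost
during the textscan call rather than during the download.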





    _______________________________________________________

Reply to this item at:

  <http://savannah.gnu.org/bugs/?51707>

_______________________________________________
  Message sent via/by Savannah
  http://savannah.gnu.org/



