lynx-dev

Re: [Lynx-dev] Extract links from html with application/ld+json script


From: David Woolley
Subject: Re: [Lynx-dev] Extract links from html with application/ld+json script
Date: Sun, 17 Dec 2023 21:59:27 +0000
User-agent: Mozilla Thunderbird

Looking a bit further, ld+json is a database serialisation format, based on javascript, but it is declarative. It definitely isn't HTML, but one could render it by basically pretty printing it, without the need to handle the generalities of javascript. You may, though, have to extract it from the page manually, as I suspect general execution of javascript may be needed to actually find it reliably.

Lynx does not even have a JSON interpreter and I'm sure it doesn't have a JSON pretty printer.
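
As a rough illustration of the "extract it and pretty print it" idea, here is a minimal Python sketch using only the standard library. It assumes the ld+json block is present in the HTML as served; if the page only inserts it by running javascript, this will find nothing. The names LdJsonExtractor, extract_ld_json and pretty are made up for this example and have nothing to do with Lynx.

```python
import json
from html.parser import HTMLParser


class LdJsonExtractor(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json">."""

    def __init__(self):
        super().__init__()
        self._in_ld_json = False
        self._chunks = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            type_attr = (dict(attrs).get("type") or "").strip().lower()
            if type_attr == "application/ld+json":
                self._in_ld_json = True
                self._chunks = []

    def handle_data(self, data):
        # Script content may arrive in several pieces, so buffer it.
        if self._in_ld_json:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_ld_json:
            self.blocks.append("".join(self._chunks))
            self._in_ld_json = False


def extract_ld_json(html_text):
    """Return the parsed content of each ld+json block that is valid JSON."""
    parser = LdJsonExtractor()
    parser.feed(html_text)
    parser.close()
    parsed = []
    for block in parser.blocks:
        try:
            parsed.append(json.loads(block))
        except json.JSONDecodeError:
            continue  # skip blocks that are not well-formed JSON
    return parsed


def pretty(obj):
    """Pretty print a parsed JSON value with two-space indentation."""
    return json.dumps(obj, indent=2, ensure_ascii=False)
```

Something like `for item in extract_ld_json(open("page.html").read()): print(pretty(item))` would then dump each block in readable form (page.html being a saved copy of the page).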

Using <http://jsonprettyprint.net/json-pretty-print> to pretty print it, the core of one of the items comes out as (I've just used an extract to minimise copyright issues):

      {
        "@type": "VideoObject",
        "name": "The Chokepoint (EGC Finals)",
        "url": "https://clips.twitch.tv/ElatedIncredulousPepperOpieOP-oUeW6hXXZs8nmWtX",
        "description": "Watch EGCTV's clip of Age of Empires IV on Twitch!",
        "thumbnailUrl": [
          "https://clips-media-assets2.twitch.tv/A-IO1KFHluoV12bPJ5lrVw/AT-cm%7CA-IO1KFHluoV12bPJ5lrVw-preview-86x45.jpg",
          "https://clips-media-assets2.twitch.tv/A-IO1KFHluoV12bPJ5lrVw/AT-cm%7CA-IO1KFHluoV12bPJ5lrVw-preview-260x147.jpg",
          "https://clips-media-assets2.twitch.tv/A-IO1KFHluoV12bPJ5lrVw/AT-cm%7CA-IO1KFHluoV12bPJ5lrVw-preview-480x272.jpg"
        ],
        "uploadDate": "2023-12-17T16:16:18Z",
        "duration": "PT60S",
        "position": 2,
        "interactionStatistic": {
          "@type": "InteractionCounter",
          "interactionType": {
            "@type": "http://schema.org/WatchAction"
          },
          "userInteractionCount": 29
        },
        "embedUrl": "https://player.twitch.tv?video=1542310342&autoplay=true&parent=meta.tag"
      },

I'm pretty sure that most of the tags have no intrinsic meaning, and you still need the full javascript code, or to guess from the names, to correctly interpret them.
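
Since the key names can't be relied on, one crude way to get at the links is to ignore the keys entirely and collect every string value that looks like an http or https URL. This is only a sketch of that guesswork, under the assumption that the JSON has already been parsed into Python dicts and lists; collect_urls is a name invented for this example.

```python
from urllib.parse import urlparse


def collect_urls(node):
    """Return every string value that parses as an http(s) URL, in document order."""
    found = []
    if isinstance(node, dict):
        for value in node.values():
            found.extend(collect_urls(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_urls(item))
    elif isinstance(node, str):
        try:
            parts = urlparse(node)
        except ValueError:
            return found  # malformed URL-like string, ignore it
        if parts.scheme in ("http", "https") and parts.netloc:
            found.append(node)
    return found
```

Note that this also picks up vocabulary identifiers such as the "http://schema.org/WatchAction" value of "@type", which are not links anyone would want to follow, so the output still needs a human, or a guess based on the names, to sort out.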

The actual HTML doesn't include anything renderable. Everything is done as empty DIVs and relies on styling for any display, so it can't be considered foreground content. There is some directly renderable content, but it is SVG, with no accessible text fallback. SVG is an image format, so it is useless for a text-only browser.


On 17/12/2023 20:44, David Woolley wrote:
On 17/12/2023 19:31, Super Bonaci via Lynx-dev wrote:
> Lynx is not able to extract most html links inside the html file.


There are no HTML links in 9ed7a8bb! There are no anchor elements, and every occurrence of href is in a link element, which doesn't generate a visible inline hyperlink, except for one, which is in javascript code.  I think this is a Javascript application program, not an HTML document.  Lynx doesn't have a javascript interpreter and doesn't parse HTML in a way that creates a document object model in a format that would allow such an interpreter to do anything non-trivial.

Any links are created by manipulating the document in the browser, which Lynx can't do.

Supporting javascript applications would require a complete rewrite from first principles.  The result would not be Lynx.

I suspect the same is true of the other document.

> Since the Lynx version is from 2018

I don't think there have been major changes in HTML in the last five years that would break a real HTML document on Lynx.  The problem with web applications is over a decade old.  It goes back to the original Netscape, but was solidified when the Web Hypertext Application Technology Working Group (WHATWG) effectively took over control of HTML from W3C, leading to the creation of HTML5.  Although that can be used for pure documents, the name of the working group clearly indicates that the intention was otherwise.  That happened about 19 years ago.

Commercial artists and marketing managers don't buy into the TBL notion of HTML and want programs that can be run on the advertising consumer's machine.  Whilst there are some cases where this is valid, for technical or privacy reasons, most such applications are written for marketing reasons.

Some text mode browsers handle some javascript uses, but I'm pretty sure they would not cope with your examples.

The only certain way of finding the links in javascript code is to run the program.


