> Claude/LLMs in general are still pretty bad at the intricate details of layouts and visual things
Because the rendered output (pixels, not HTML/CSS) is not fed in as training data. You will find tons of UI snippets and questions, but they rarely include screenshots, and even when they do, the screenshots are not scraped.
Interesting thought. I wonder if Anthropic et al. could include some sort of render-HTML-to-screenshot step in the training pipeline, so that the rendered output gets included as training data alongside the source.
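Something like that could be done with a headless browser. A minimal sketch, assuming Playwright is installed; the HTML snippet and file name are just illustrative, not any actual training pipeline:

```python
from playwright.sync_api import sync_playwright

# Illustrative UI snippet; in practice this would be code scraped from the web.
html = "<button style='padding:8px 16px;border-radius:6px'>Sign up</button>"

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page(viewport={"width": 800, "height": 600})
    page.set_content(html)                 # render the snippet headlessly
    page.screenshot(path="rendered.png")   # pixels that could be paired with the source
    browser.close()
```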
Even better, a tool that can report the rendered bounding boxes of any set of elements and the distances between pairs of elements, so the model can make adjustments if the relative positioning doesn't match its expectation. This would be incredible for SVG generation for diagrams, too.
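A rough sketch of what such a layout-feedback tool could look like, again assuming Playwright; the element IDs and the distance metrics are made up for illustration:

```python
from math import hypot
from playwright.sync_api import sync_playwright

# Two absolutely positioned boxes standing in for real UI elements.
html = """
<div id="logo" style="position:absolute; left:20px;  top:20px; width:60px;  height:60px"></div>
<div id="nav"  style="position:absolute; left:200px; top:30px; width:300px; height:40px"></div>
"""

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.set_content(html)

    # Rendered bounding boxes: {x, y, width, height} in CSS pixels.
    logo = page.locator("#logo").bounding_box()
    nav = page.locator("#nav").bounding_box()

    # Horizontal gap and center-to-center distance between the two elements,
    # which the model could compare against its intended layout.
    gap = nav["x"] - (logo["x"] + logo["width"])
    centers = [(b["x"] + b["width"] / 2, b["y"] + b["height"] / 2) for b in (logo, nav)]
    dist = hypot(centers[1][0] - centers[0][0], centers[1][1] - centers[0][1])

    print(f"gap={gap:.1f}px  center distance={dist:.1f}px")
    browser.close()
```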
That's basically a VLM, but the problem is that describing the world requires a better understanding of the world. Hence why LeCun talks about world models (it's also cutting-edge work for teaching robots to manipulate objects and plan manipulations).
Well, I don't know, but many LLMs are multimodal and understand pictures. You can upload videos to Gemini and they're tokenised and fed into the LLM. If some programming blog post has a screenshot showing the result of some UI code, why would that not be scraped and used for training? Is there some reason that wouldn't be possible?