> Claude/LLMs in general are still pretty bad at the intricate details of layouts and visual things
Because the rendered output (pixels, not HTML/CSS) is not fed in as training data. You will find tons of UI snippets and questions, but they rarely include screenshots, and even when they do, the screenshots are not scraped.
Interesting thought. I wonder if Anthropic et al. could include some sort of render-HTML-to-screenshot step in the training pipeline, so that the rendered output gets included as training data alongside the source.
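Something like that could be done with a headless browser. A minimal sketch, assuming Playwright is installed; the HTML snippet and file name are just illustrative, not any actual training pipeline:

```python
from playwright.sync_api import sync_playwright

# Illustrative UI snippet; in practice this would be code scraped from the web.
html = "<button style='padding:8px 16px;border-radius:6px'>Sign up</button>"

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page(viewport={"width": 800, "height": 600})
    page.set_content(html)                 # render the snippet headlessly
    page.screenshot(path="rendered.png")   # pixels that could be paired with the source
    browser.close()
```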
Even better, a tool that can report the rendered bounding boxes of any set of elements and the distances between pairs of elements, so the model can make adjustments if the relative positioning doesn't match its expectation. This would be incredible for SVG generation for diagrams, too.
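A rough sketch of what such a layout-feedback tool could look like, again assuming Playwright; the element IDs and the distance metrics are made up for illustration:

```python
from math import hypot
from playwright.sync_api import sync_playwright

# Two absolutely positioned boxes standing in for real UI elements.
html = """
<div id="logo" style="position:absolute; left:20px;  top:20px; width:60px;  height:60px"></div>
<div id="nav"  style="position:absolute; left:200px; top:30px; width:300px; height:40px"></div>
"""

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.set_content(html)

    # Rendered bounding boxes: {x, y, width, height} in CSS pixels.
    logo = page.locator("#logo").bounding_box()
    nav = page.locator("#nav").bounding_box()

    # Horizontal gap and center-to-center distance between the two elements,
    # which the model could compare against its intended layout.
    gap = nav["x"] - (logo["x"] + logo["width"])
    centers = [(b["x"] + b["width"] / 2, b["y"] + b["height"] / 2) for b in (logo, nav)]
    dist = hypot(centers[1][0] - centers[0][0], centers[1][1] - centers[0][1])

    print(f"gap={gap:.1f}px  center distance={dist:.1f}px")
    browser.close()
```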
That's basically a VLM, but the problem is that describing the world requires a better understanding of the world. Hence why LeCun talks about world models (it's also cutting-edge work for teaching robots to manipulate objects and plan manipulations).
Well, I don't know, but many LLMs are multimodal and understand pictures. You can upload videos to Gemini and they're tokenised and fed into the LLM. If some programming blog post has a screenshot showing the result of some UI code, why would that not be scraped and used for training? Is there some reason that wouldn't be possible?