Unfortunately not, it was quite some time ago... I encountered this issue while processing Common Crawl data, but I do understand that having to download & parse a billion pages to find one that uses XHTML is a bit too much to ask 😄
OK, thank you. I guess it's not that hard to add. Maybe I am wrong, but I don't think there are many XHTML documents left out there. We will see if anyone else writes here.
While jusText correctly extracts the page encoding of an HTML page from the meta tag, it does not do so for XHTML, which declares the encoding in an XML header.
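To illustrate the kind of check that could cover both cases, here is a minimal sketch, not jusText's actual code. The helper name `sniff_declared_encoding`, the 1024-byte window, and the `utf-8` default are all assumptions made for this example. It looks for an `encoding="..."` pseudo-attribute in a leading XML declaration first, then falls back to a `charset` in a meta tag.

```python
import re

# Hypothetical helper for illustration only; not part of jusText's API.
# Matches the encoding pseudo-attribute of a leading XML declaration,
# e.g. <?xml version="1.0" encoding="ISO-8859-2"?>
XML_DECL_RE = re.compile(
    rb"""^\s*<\?xml\s[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._\-]*)["']"""
)

# Matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=...">
META_CHARSET_RE = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z][A-Za-z0-9._\-]*)""",
    re.IGNORECASE,
)


def sniff_declared_encoding(raw: bytes, default: str = "utf-8") -> str:
    """Return the encoding declared in the first bytes of a page.

    The XML declaration is checked before the meta tag; `default`
    (an assumption of this sketch) is returned when neither is found.
    """
    # Only inspect the start of the document and skip a UTF-8 BOM if present.
    head = raw[:1024].lstrip(b"\xef\xbb\xbf")
    match = XML_DECL_RE.search(head) or META_CHARSET_RE.search(head)
    if match is None:
        return default
    return match.group(1).decode("ascii").lower()
```

The result could then be used to decode the page before extraction, for example `raw.decode(sniff_declared_encoding(raw), errors="replace")`. A regex over raw bytes is a deliberate simplification here; it ignores HTTP headers and UTF-16 documents.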