On some pages, Readability succeeds at extracting a textual body but does such a poor job that it would have been better if it had failed. For example, consider this change, which scores 1.0 priority:
View in Scanner: https://monitoring.envirodatagov.org/page/080165d7-873d-4319-9a2f-8e5388a1933b/74127563-9ead-42b2-bf50-b354fbf4d3c5..29cd854c-b629-44ee-bd26-a5deed5b0620
Page in API: https://api.monitoring.envirodatagov.org/api/v0/pages/080165d7-873d-4319-9a2f-8e5388a1933b
Left version in API: https://api.monitoring.envirodatagov.org/api/v0/versions/74127563-9ead-42b2-bf50-b354fbf4d3c5
Right version in API: https://api.monitoring.envirodatagov.org/api/v0/versions/29cd854c-b629-44ee-bd26-a5deed5b0620
There are no textual changes in the page body! What happened? Well, the markup changed very slightly, causing Readability to ignore 80% of the page’s content in the newer version (technical details at bottom). So for text comparison purposes, most of the page was removed. That would definitely be significant, if only it had actually happened.
This case is probably extreme, since Readability parsed differently in the two versions. But I also know I’ve seen examples of pages that just have a lot of their main content excluded by Readability in both versions. Unfortunately, I don’t know how widespread or serious the issue is. I think it’s probably not a majority of cases, but I don’t know if it’s 1% or 45%.
We might be better off finding a more conservative method of separating main content from headers/footers/nav/etc.
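One possible shape for a more conservative approach is a coverage check: compare how much text the extractor returned against the page's total visible text, and fall back to the full text when too much is missing. This is only a sketch using the Python standard library. The names `visible_text`, `extraction_coverage`, and `choose_text`, and the 0.5 threshold, are assumptions made for illustration and are not part of Readability or of this project.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes, skipping elements whose content is never displayed."""

    SKIPPED = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def visible_text(html):
    """Return the page's text with whitespace collapsed (headers, nav, and all)."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def extraction_coverage(full_html, extracted_text):
    """Fraction of the page's words that survived extraction, capped at 1.0."""
    full_words = visible_text(full_html).split()
    if not full_words:
        return 1.0
    return min(1.0, len(extracted_text.split()) / len(full_words))


def choose_text(full_html, extracted_text, min_coverage=0.5):
    """Use the extracted text unless it covers suspiciously little of the page.

    min_coverage is an arbitrary, hypothetical threshold that would need tuning.
    """
    if extraction_coverage(full_html, extracted_text) < min_coverage:
        return visible_text(full_html)
    return " ".join(extracted_text.split())
```

A check like this would not fix the extraction itself, but it could flag or sidestep cases where most of the page silently disappears. The tradeoff is that the fallback text includes navigation and footers again.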
Specific Explanation of This Readability Failure
The page is mostly made of paragraph-sized bullet points. They used to be all inline in a big container:
But are now wrapped in <div>s (which has no visual or textual impact at all):
Readability is biased against lists (because they are often used for navigation, news feeds, etc.) and if an element is primarily composed of lists, it will tend to throw it out. Because the lists were previously included alongside the introductory text that occurred in normal paragraphs, they were considered part of the content. Once they were isolated in their own containers, however, Readability saw them as non-content elements.
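To make the effect concrete, here is a rough illustration of how a "mostly lists" heuristic could react to the extra wrapper. It is not Readability's actual scoring, which is more involved and is not reproduced here. The markup in `BEFORE` and `AFTER` is invented for the example (it is not the page's real markup), and `list_shares` is a hypothetical helper that reports, for each `<div>`, the fraction of its text that sits inside a list.

```python
from html.parser import HTMLParser


class _ListShareParser(HTMLParser):
    """For every <div>, tracks total text length and text length inside <ul>/<ol>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._open_divs = []   # one [total_chars, list_chars] pair per open <div>
        self._list_depth = 0   # how many <ul>/<ol> elements are currently open
        self.shares = []       # list-text fraction per <div>, in closing order

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self._open_divs.append([0, 0])
        elif tag in ("ul", "ol"):
            self._list_depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self._open_divs:
            total, in_list = self._open_divs.pop()
            self.shares.append(in_list / total if total else 0.0)
        elif tag in ("ul", "ol") and self._list_depth > 0:
            self._list_depth -= 1

    def handle_data(self, data):
        length = len(data.strip())
        # Text counts toward every <div> that currently contains it.
        for counts in self._open_divs:
            counts[0] += length
            if self._list_depth > 0:
                counts[1] += length


def list_shares(html):
    """Return the list-text fraction of each <div>, innermost first."""
    parser = _ListShareParser()
    parser.feed(html)
    parser.close()
    return parser.shares


# Invented markup: bullets inline with the intro, then the same bullets wrapped.
BEFORE = (
    "<div><p>Intro paragraph text.</p>"
    "<ul><li>First long bullet.</li><li>Second long bullet.</li></ul></div>"
)
AFTER = (
    "<div><p>Intro paragraph text.</p>"
    "<div><ul><li>First long bullet.</li><li>Second long bullet.</li></ul></div></div>"
)

if __name__ == "__main__":
    print(list_shares(BEFORE))  # a single container mixing paragraph and list text
    print(list_shares(AFTER))   # the new inner wrapper is list text only
```

In the wrapped version the inner container consists entirely of list text, so any rule that discards list-dominated elements would drop it, even though the visible text is identical in both versions.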
This is especially rough because many might consider the new version to be better markup (especially if they used <section> instead of <div>, but 🤷), even though Readability handles it poorly.