Rethink use of Readability #9

Open
Mr0grog opened this issue Dec 1, 2020 · 0 comments

On some pages, Readability succeeds at extracting a textual body, but does such a poor job that it would have been better if it had failed. For example, consider this change, which scores 1.0 priority:

(Screenshot: Screen Shot 2020-12-01 at 12 25 26 PM)

View in Scanner: https://monitoring.envirodatagov.org/page/080165d7-873d-4319-9a2f-8e5388a1933b/74127563-9ead-42b2-bf50-b354fbf4d3c5..29cd854c-b629-44ee-bd26-a5deed5b0620
Page in API: https://api.monitoring.envirodatagov.org/api/v0/pages/080165d7-873d-4319-9a2f-8e5388a1933b
Left version in API: https://api.monitoring.envirodatagov.org/api/v0/versions/74127563-9ead-42b2-bf50-b354fbf4d3c5
Right version in API: https://api.monitoring.envirodatagov.org/api/v0/versions/29cd854c-b629-44ee-bd26-a5deed5b0620

There are no textual changes in the page body! What happened? Well, the markup changed very slightly, causing Readability to ignore 80% of the page’s content in the newer version (technical details at bottom). So for text comparison purposes, most of the page was removed. That would definitely be significant, if only it had actually happened.

This case is probably extreme, since Readability parsed the two versions differently. But I also know I’ve seen examples of pages where Readability excludes a lot of the main content in both versions. Unfortunately, I don’t know how widespread or serious the issue is. I think it’s probably not a majority of cases, but I don’t know if it’s 1% or 45%.

We might be better off finding a more conservative method of separating main content from headers/footers/nav/etc.
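
As a rough illustration of what "more conservative" could mean, here is a minimal sketch that drops only elements that are explicitly non-content and keeps everything else. The function name `extract_main_text` and the `SKIPPED_TAGS` set are assumptions made up for this example, not anything the project currently uses, and real pages that don't use semantic elements would need more than this.

```python
from html.parser import HTMLParser

# Hypothetical list of elements to treat as non-content. Anything not
# listed here is kept, which is what makes the approach conservative.
SKIPPED_TAGS = {"header", "footer", "nav", "aside", "script", "style"}


class ConservativeTextExtractor(HTMLParser):
    """Collect text from everything except explicitly skipped elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside any skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text outside skipped elements, with whitespace
        # collapsed so indentation changes don't show up as text changes.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(" ".join(data.split()))

    def text(self):
        return "\n".join(self._chunks)


def extract_main_text(html):
    parser = ConservativeTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Because nothing is scored or guessed at, a list-heavy container is never discarded just for being list-heavy. The trade-off is that navigation built from plain `<div>`s would leak into the output.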


Specific Explanation of This Readability Failure

The page is mostly made of paragraph-sized bullet points. They used to be all inline in a big container:

<p>An introductory paragraph.</p>
<h2>Section Header</h2>
<ul>
  <li>A Bullet</li>
  <li>Point</li>
</ul>
<h2>Another Section Header</h2>
<ul>
  ...etc...

But now each section is wrapped in its own <div> (which has no visual or textual impact at all):

<div>
  <p>An introductory paragraph.</p>
</div>
<div>
  <h2>Section Header</h2>
  <ul>
    <li>A Bullet</li>
    <li>Point</li>
  </ul>
</div>
<div>
  <h2>Another Section Header</h2>
  <ul>
    ...etc...

Readability is biased against lists (because they are often used for navigation, news feeds, etc.), and it tends to throw out any element that is primarily composed of lists. Because the lists previously sat in the same container as the introductory text, which was in normal paragraphs, they were considered part of the content. Once they were isolated in their own containers, however, Readability saw those containers as non-content elements.

This is especially rough because many might consider the new version to be better markup (especially if the page’s authors had used <section> instead of <div>, but 🤷), even though Readability handles it poorly.
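
One way to confirm that the two versions differ only in markup is to compare their text with all tags ignored. The sketch below does that for shortened, closed-off copies of the two snippets above (the elided parts are left out, not filled in). `visible_text`, `OLD_MARKUP`, and `NEW_MARKUP` are names invented for this example.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect every non-empty text node, ignoring all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            # Collapse whitespace so indentation differences are ignored.
            self.chunks.append(" ".join(data.split()))


def visible_text(html):
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Shortened stand-ins for the old and new markup shown above.
OLD_MARKUP = """
<p>An introductory paragraph.</p>
<h2>Section Header</h2>
<ul>
  <li>A Bullet</li>
  <li>Point</li>
</ul>
"""

NEW_MARKUP = """
<div>
  <p>An introductory paragraph.</p>
</div>
<div>
  <h2>Section Header</h2>
  <ul>
    <li>A Bullet</li>
    <li>Point</li>
  </ul>
</div>
"""

# The added <div> wrappers contribute no text, so the two lists match.
assert visible_text(OLD_MARKUP) == visible_text(NEW_MARKUP)
```

A check like this could flag cases where the full text is unchanged but the extracted text differs, which would suggest the difference comes from extraction rather than from the page.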
