While working on the large FineWeb-2 corpus, I came across an inconsistency where extracted data does not seem reproducible. After chasing it down, with help from @guipenedo, I found that this is caused by trafilatura's `deduplicate` option.
When enabled, it does not seem to be deterministic: running `extract(..., deduplicate=True)` on the same data can return different results.
Some (destructive) caching seems to be going on. In the code below, I have a loop that runs 11 times; in each iteration, a file's HTML contents are read and trafilatura's `extract` is called (with deduplication). I find that after a few runs the contents sometimes change (often at runs 4 and 7), and at run 7 the extraction often ends up as `None`.
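For reference, a minimal sketch of the loop (the actual script is in the gist linked below; the file path here is a placeholder):

```python
import trafilatura

HTML_PATH = "page.html"  # placeholder: any saved HTML file

previous = None
for run in range(1, 12):  # 11 runs, as in the report
    # Re-read the file on every iteration, so the input to extract()
    # is byte-identical each time.
    with open(HTML_PATH, encoding="utf-8") as fh:
        html = fh.read()
    result = trafilatura.extract(html, deduplicate=True)
    if run > 1 and result != previous:
        length = "None" if result is None else f"{len(result)} chars"
        print(f"run {run}: output changed ({length})")
    previous = result
```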
So it seems to me that trafilatura is doing some caching, and after a while it updates the cache with new info in a destructive manner, so that with each cache update more data is removed. What I find problematic is that this happens even with freshly read data: as I said, at each iteration I read the file contents from disk again, yet some caching still seems to ignore the new input.
I am not entirely sure whether this is expected behavior of the deduplication algorithm, but I do not think many people would expect it. No such issue occurs when deduplication is disabled.
Note: I am on `trafilatura==1.11.0`, but I also tried 2.0.0 and the issue persists. Reproducible code (you can also decrease the number of items in `uid_and_file_paths`): https://gist.github.com/BramVanroy/c7e9778aa1f4259f7066e22e2cd1aa3a
Hi @BramVanroy, thanks for the detailed report. The deduplication component works with a Least Recently Used (LRU) cache, so its behavior depends on document order. It is not thread-safe, so there is one LRU cache per process, which adds to the complexity. I believe both factors explain the problem you describe.
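To illustrate the mechanism with a toy sketch (this is not trafilatura's actual code, just the general idea of an LRU-backed duplicate filter): because the cache keeps state across calls within a process, feeding it the same document repeatedly eventually flags every segment as a duplicate.

```python
from collections import OrderedDict

class ToyLRUDedup:
    """Toy LRU duplicate filter -- illustrative only."""

    def __init__(self, size=1024, max_repetitions=2):
        self.cache = OrderedDict()
        self.size = size
        self.max_repetitions = max_repetitions

    def is_duplicate(self, segment):
        count = self.cache.pop(segment, 0) + 1
        self.cache[segment] = count          # move to most-recent slot
        if len(self.cache) > self.size:
            self.cache.popitem(last=False)   # evict least recently used
        return count > self.max_repetitions

dedup = ToyLRUDedup()
doc = ["intro", "paragraph one", "paragraph two"]
for run in range(1, 5):
    kept = [s for s in doc if not dedup.is_duplicate(s)]
    print(f"run {run}: kept {kept}")
# The first runs keep everything; once the repetition threshold is
# crossed, every segment of the *same* input is dropped.
```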
I suggest you make processing order deterministic and/or set a higher limit for deduplication in the settings.cfg file.
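For example, a modified copy of the packaged `settings.cfg` can be loaded like this (a sketch; `my_settings.cfg` and `page.html` are placeholders):

```python
from trafilatura import extract
from trafilatura.settings import use_config

# Load a modified copy of the packaged settings.cfg (the path is a
# placeholder). Deduplication-related keys in the default file include
# MIN_DUPLCHECK_SIZE and MAX_REPETITIONS.
config = use_config("my_settings.cfg")

with open("page.html", encoding="utf-8") as fh:
    html = fh.read()

result = extract(html, deduplicate=True, config=config)
```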