
SeasonalNaive (and other models) spend a lot of time acquiring thread locks #898

Closed
caseytomlin opened this issue Sep 5, 2024 · 13 comments · Fixed by #901 or #915

@caseytomlin

What happened + What you expected to happen

Caveats: I'm not sure if this is expected behavior, and I'm not sure if it's new, but I think it is new.

The majority of wall clock time is spent (in the case of SeasonalNaive) acquiring thread locks; see below for an example. The per-series time also seems to grow as the number of series increases (e.g. 10,000 series take about 1 minute, but 100,000 series take about 20 minutes).

The issue is reproducible in multiple environments, but it is particularly problematic in Kubeflow pipeline executions (the resulting long run times cause the pods to stall and never finish). We have a fair number of time series (~O(10^6)), but SeasonalNaive with n_jobs=12 can take 20+ hours to finish (if it finishes at all).

Thanks for an otherwise great product!

Versions / Dependencies

Dependencies:
  • numpy = '1.23.5'
  • pandas = '2.1.1'
  • statsforecast = '1.7.6'

Reproducible example

import numpy as np
import pandas as pd
import statsforecast as sf
from statsforecast.core import StatsForecast
from statsforecast.models import SeasonalNaive
n = 10000
season_length = 12
freq = "M"

df = pd.DataFrame(
    {
        "unique_id": [f"id{i}" for i in range(n) for _ in range(season_length)],
        "ds": [*pd.date_range(start="2023-08-31", periods=season_length, freq=freq)] * n,
        "y": np.random.rand(season_length * n)
    }
)

sn = StatsForecast(models=[SeasonalNaive(season_length=season_length)], freq=freq, n_jobs=8)
%%prun
preds = sn.forecast(h=2*season_length, df=df)

Top of the prun output:

Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    49462   65.589    0.001   65.589    0.001 {method 'acquire' of '_thread.lock' objects}
    10000    0.353    0.000    1.416    0.000 process.py:678(submit)
    10025    0.308    0.000    0.316    0.000 threading.py:228(__init__)
        1    0.241    0.241   67.748   67.748 core.py:1287(_forecast_parallel)
    10001    0.236    0.000    0.236    0.000 {built-in method posix.write}
    59696    0.154    0.000    0.186    0.000 threading.py:256(__enter__)
    10001    0.098    0.000   65.974    0.007 _base.py:201(as_completed)
     9853    0.065    0.000   65.668    0.007 threading.py:280(wait)
        8    0.060    0.007    0.060    0.008 {built-in method posix.fork}
    19840    0.050    0.000    0.088    0.000 _base.py:179(_yield_finished_futures)
     9857    0.039    0.000   65.728    0.007 threading.py:556(wait)

Issue Severity

High: It blocks me from completing my task.

@caseytomlin caseytomlin added the bug label Sep 5, 2024
@rauerhans

This is affecting me, too! Exact same problem

@jmoralez
Member

jmoralez commented Sep 5, 2024

Hey @caseytomlin, thanks for raising this. This is expected for very simple models, since the overhead of multiprocessing is greater than the actual model training; if you're using SeasonalNaive, Naive, HistoricAverage, etc., it will always be faster to set n_jobs=1.
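For example, reusing the variables from your snippet above, the single-process run is just (a minimal sketch):

# single process: no multiprocessing/scheduling overhead for cheap models
sn = StatsForecast(
    models=[SeasonalNaive(season_length=season_length)],
    freq=freq,
    n_jobs=1,
)
preds = sn.forecast(h=2 * season_length, df=df)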

My guess is that you're seeing the locks because %prun can't capture code executed in other processes; the time is actually spent waiting for results that are computed elsewhere. You can verify this with line_profiler, for example, which will show that when n_jobs=1 the time is spent here:

res_i = model.forecast(
    h=h, y=y_train, X=X_train, X_future=X_f, fitted=fitted, **kwargs
)

and when n_jobs>1 the time is spent waiting for the results here:
for future in iterable:

Here's a colab notebook with the comparison. I don't get a nice speedup there with multiprocessing (4.5s with n_jobs=1 vs 4s with n_jobs=2), but on my local machine I get a reasonable speedup.
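If you want to check this yourself in a notebook, a minimal line_profiler sketch would be something like this (profile whichever function you're interested in; here it's the top-level forecast):

%load_ext line_profiler
# line-by-line timing of the forecast call: with n_jobs=1 the model calls dominate,
# with n_jobs>1 the time shows up in the parallel result collection
%lprun -f StatsForecast.forecast preds = sn.forecast(h=2 * season_length, df=df)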

@caseytomlin
Author

@jmoralez thanks for the quick reply and the hints! I will try line_profiler. Just to clarify, we see basically the same behavior with AutoARIMA and AutoETS: these exhibit a longer initial phase where the full allotted compute is used (according to htop), followed by a phase with the same multiprocessing "long tail" as SeasonalNaive. With these models, too, the run time is dominated by this second phase with very low CPU usage.

@jmoralez
Member

jmoralez commented Sep 6, 2024

Do your series differ a lot in length? The scheduling now works per series, to support progress bars and also (hopefully) better scheduling. Previously we just partitioned all series into n_jobs chunks and processed those separately, which produced less overhead but could also end up with just a few CPUs still running their partitions while all the others were done. Note that this can still happen if, for example, most of your series are short but some are very long, so the process has to wait until those finish, although that wouldn't happen with SeasonalNaive.

I think the locks you see are actually just how concurrent.futures.as_completed waits for the results to become ready, so they aren't really informative about what's going on. If you can provide an example with AutoARIMA or AutoETS where n_jobs>1 is slower than n_jobs=1, I can help debug it further.
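To make the scheduling difference concrete, here's a rough sketch of the two strategies (illustrative only, not the actual statsforecast code):

from concurrent.futures import ProcessPoolExecutor, as_completed

def fit_one(series):
    # placeholder "model": just reuse the last value, seasonal-naive style
    return series[-1]

def fit_chunk(chunk):
    return [fit_one(s) for s in chunk]

def per_series(series_list, n_jobs):
    # current (>=1.7.6) style: one task per series, collected as they complete
    # (enables progress bars, but submits len(series_list) tasks)
    with ProcessPoolExecutor(n_jobs) as pool:
        futures = [pool.submit(fit_one, s) for s in series_list]
        return [f.result() for f in as_completed(futures)]

def chunked(series_list, n_jobs):
    # 1.7.5 style: partition into n_jobs chunks, one task per chunk
    # (less overhead, but a slow chunk can leave other workers idle)
    chunks = [series_list[i::n_jobs] for i in range(n_jobs)]
    with ProcessPoolExecutor(n_jobs) as pool:
        results = []
        for part in pool.map(fit_chunk, chunks):
            results.extend(part)
        return results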

@caseytomlin
Author

caseytomlin commented Sep 7, 2024

The series in our original use case do differ somewhat in length, but I can see the behavior with dummy data like the above. I haven't gotten to the line profiler yet, but there seems to be a threshold in the number of series (n above) beyond which n_jobs>1 becomes a bad idea.

Using the same code as above on an 8-CPU machine (replacing SeasonalNaive with AutoETS) I see:

  • n = 10_000: n_jobs = 1 finished in ~2 minutes, while n_jobs = 8 took ~40 seconds
  • n = 100_000: n_jobs = 1 finished in ~14 minutes, while n_jobs = 8 is still running after 85 minutes!

Let me know if I can provide more info/tests. Very curious!

@caseytomlin
Author

The scheduling now works per series, to support progress bars and also (hopefully) better scheduling. Previously we just partitioned all series into n_jobs chunks and processed those separately, which produced less overhead but could also end up with just a few CPUs still running their partitions while all the others were done.

@jmoralez could you point to the package version that does the partitioning the "old" way?

@jmoralez
Member

1.7.5

@caseytomlin
Author

caseytomlin commented Sep 10, 2024

Rerunning the above example with AutoETS on 1.7.5 indeed brings the n = 100_000 n_jobs = 8 run time down to 3 minutes.

@jmoralez
Member

For the SeasonalNaive? How long does it take with n_jobs=1?

@caseytomlin
Author

Apologies, I meant with AutoETS - comment is edited.

With SeasonalNaive the run times are 3.7s with n_jobs = 1 and 2.8s with n_jobs = 8 using version 1.7.5 (as opposed to > 80 minutes using version 1.7.6).

@jmoralez
Member

jmoralez commented Sep 11, 2024

Thanks a lot for working through this issue with us. I've narrowed down the problems and I'll push a fix soon.

@caseytomlin
Author

Thanks @jmoralez for the quick fix, but unfortunately I am still running into similar problems with 1.7.7.1. However, the threshold number of series where the "crossover" occurs has gone up (and for my actual use cases this is the relevant scale). Here are some timings from 1.7.7.1:

Model          n          n_jobs  run time
SeasonalNaive  500_000    1       0:19
SeasonalNaive  500_000    8       2:41
SeasonalNaive  1_000_000  1       0:39
SeasonalNaive  1_000_000  8       > 40:00 (aborted)
AutoETS        500_000    1       24:00
AutoETS        500_000    8       7:30
AutoETS        1_000_000  1       48:00
AutoETS        1_000_000  8       > 2:00:00 (aborted)

With 1.7.5, on the other hand:

Model          n          n_jobs  run time
SeasonalNaive  1_000_000  1       0:39
SeasonalNaive  1_000_000  8       0:24
AutoETS        1_000_000  8       32:00

@jmoralez jmoralez reopened this Sep 18, 2024
@jmoralez
Member

Hey. Yeah, sorry, I only tested with 100k; it seems the process pool executor chokes on that many tasks. I'm working on keeping the number of in-flight tasks constant at any given time. It already works, but I'm trying to improve it a bit; I'll push another PR soon.
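The general idea is roughly the following (a sketch of the pattern, not the actual change): keep only a bounded number of futures alive at any moment and submit a replacement whenever one finishes, instead of creating one future per series up front.

from itertools import islice
from concurrent.futures import FIRST_COMPLETED, ProcessPoolExecutor, wait

def run_bounded(fn, tasks, n_jobs, backlog_per_worker=4):
    # keep at most n_jobs * backlog_per_worker tasks in flight at once
    tasks = iter(tasks)
    results = []
    with ProcessPoolExecutor(n_jobs) as pool:
        pending = {pool.submit(fn, t) for t in islice(tasks, n_jobs * backlog_per_worker)}
        while pending:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:
                results.append(fut.result())
                for t in islice(tasks, 1):  # refill one slot if tasks remain
                    pending.add(pool.submit(fn, t))
    return results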
