
SeasonalNaive (and other models) spend a lot of time acquiring thread locks #898

Closed
caseytomlin opened this issue Sep 5, 2024 · 13 comments · Fixed by #901 or #915

@caseytomlin

What happened + What you expected to happen

Caveats: I'm not sure if this is expected behavior, and I'm not sure if it's new, but I think it is new.

The majority of wall clock time is spent (in the case of SeasonalNaive) acquiring thread locks; see below for an example. The per-series time also seems to grow as the number of series increases (e.g. 10,000 series take about 1 minute, but 100,000 series take about 20 minutes).

The issue is reproducible in multiple environments, but it is particularly problematic in Kubeflow pipeline executions (the resulting long run times cause the pods to stall and never finish). We have a fair number of time series (~O(10^6)), but SeasonalNaive with n_jobs=12 can take 20+ hours to finish (if it finishes at all).

Thanks for an otherwise great product!

Versions / Dependencies

Dependencies:
  • numpy = '1.23.5'
  • pandas = '2.1.1'
  • statsforecast = '1.7.6'

Reproducible example

import numpy as np
import pandas as pd
import statsforecast as sf
from statsforecast.core import StatsForecast
from statsforecast.models import SeasonalNaive
n = 10000
season_length = 12
freq = "M"

df = pd.DataFrame(
    {
        "unique_id": [f"id{i}" for i in range(n) for _ in range(season_length)],
        "ds": [*pd.date_range(start="2023-08-31", periods=season_length, freq=freq)] * n,
        "y": np.random.rand(season_length * n)
    }
)

sn = StatsForecast(models=[SeasonalNaive(season_length=season_length)], freq=freq, n_jobs=8)
%%prun
preds = sn.forecast(h=2*season_length, df=df)

Top of the prun output:

Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    49462   65.589    0.001   65.589    0.001 {method 'acquire' of '_thread.lock' objects}
    10000    0.353    0.000    1.416    0.000 process.py:678(submit)
    10025    0.308    0.000    0.316    0.000 threading.py:228(__init__)
        1    0.241    0.241   67.748   67.748 core.py:1287(_forecast_parallel)
    10001    0.236    0.000    0.236    0.000 {built-in method posix.write}
    59696    0.154    0.000    0.186    0.000 threading.py:256(__enter__)
    10001    0.098    0.000   65.974    0.007 _base.py:201(as_completed)
     9853    0.065    0.000   65.668    0.007 threading.py:280(wait)
        8    0.060    0.007    0.060    0.008 {built-in method posix.fork}
    19840    0.050    0.000    0.088    0.000 _base.py:179(_yield_finished_futures)
     9857    0.039    0.000   65.728    0.007 threading.py:556(wait)

Issue Severity

High: It blocks me from completing my task.

@caseytomlin caseytomlin added the bug label Sep 5, 2024
@rauerhans

This is affecting me, too! Exact same problem

@jmoralez
Member

jmoralez commented Sep 5, 2024

Hey @caseytomlin, thanks for raising this. This is expected for very simple models, since the overhead of multiprocessing is greater than the actual model training; if you're using SeasonalNaive, Naive, HistoricAverage, etc., it will always be faster to set n_jobs=1.
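For example, reusing the variables from your snippet above, the single-process run is just (a minimal sketch):

# single process: no multiprocessing/scheduling overhead for cheap models
sn = StatsForecast(
    models=[SeasonalNaive(season_length=season_length)],
    freq=freq,
    n_jobs=1,
)
preds = sn.forecast(h=2 * season_length, df=df)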

My guess is that you're seeing the locks because %prun can't capture code executed in other processes; the time is actually spent waiting for results that are computed elsewhere. You can verify this with line_profiler, for example, which will show that when n_jobs=1 the time is spent here:

res_i = model.forecast(
    h=h, y=y_train, X=X_train, X_future=X_f, fitted=fitted, **kwargs
)

and when n_jobs>1 the time is spent waiting for the results here:
for future in iterable:

Here's a colab notebook with the comparison. I don't get a nice speedup there with multiprocessing (4.5s with n_jobs=1 vs 4s with n_jobs=2), but on my local machine I get a reasonable speedup.
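If you want to check this yourself in a notebook, a minimal line_profiler sketch would be something like this (profile whichever function you're interested in; here it's the top-level forecast):

%load_ext line_profiler
# line-by-line timing of the forecast call: with n_jobs=1 the model calls dominate,
# with n_jobs>1 the time shows up in the parallel result collection
%lprun -f StatsForecast.forecast preds = sn.forecast(h=2 * season_length, df=df)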

@caseytomlin
Author

@jmoralez thanks for the quick reply and the hints! I will try line_profiler. Just to clarify, we see basically the same behavior with AutoARIMA and AutoETS: these exhibit a longer initial phase where the full allotted compute is used (according to htop), followed by a phase with the same multiprocessing "long tail" as SeasonalNaive. With these models, too, the run time is dominated by this second phase with very low CPU usage.

@jmoralez
Member

jmoralez commented Sep 6, 2024

Do your series differ a lot in length? The scheduling now works per series, to support progress bars and also (hopefully) better scheduling. Previously we just partitioned all series into n_jobs chunks and processed those separately, which produced less overhead but could also end up with just a few CPUs still running their partitions while all the others were done. Note that this can still happen if, for example, most of your series are short but some are very long, so the process has to wait until those finish, although that wouldn't happen with SeasonalNaive.

I think the locks you see are actually just how concurrent.futures.as_completed waits for the results to become ready, so they aren't really informative about what's going on. If you can provide an example with AutoARIMA or AutoETS where n_jobs>1 is slower than n_jobs=1, I can help debug it further.
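To make the scheduling difference concrete, here's a rough sketch of the two strategies (illustrative only, not the actual statsforecast code):

from concurrent.futures import ProcessPoolExecutor, as_completed

def fit_one(series):
    # placeholder "model": just reuse the last value, seasonal-naive style
    return series[-1]

def fit_chunk(chunk):
    return [fit_one(s) for s in chunk]

def per_series(series_list, n_jobs):
    # current (>=1.7.6) style: one task per series, collected as they complete
    # (enables progress bars, but submits len(series_list) tasks)
    with ProcessPoolExecutor(n_jobs) as pool:
        futures = [pool.submit(fit_one, s) for s in series_list]
        return [f.result() for f in as_completed(futures)]

def chunked(series_list, n_jobs):
    # 1.7.5 style: partition into n_jobs chunks, one task per chunk
    # (less overhead, but a slow chunk can leave other workers idle)
    chunks = [series_list[i::n_jobs] for i in range(n_jobs)]
    with ProcessPoolExecutor(n_jobs) as pool:
        results = []
        for part in pool.map(fit_chunk, chunks):
            results.extend(part)
        return results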

@caseytomlin
Author

caseytomlin commented Sep 7, 2024

The series in our original use case do differ somewhat in length, but I can see the behavior with dummy data like the above. I haven't gotten to the line profiler yet, but there seems to be a threshold in the number of series (n above) beyond which n_jobs>1 becomes a bad idea.

Using the same code as above on an 8-CPU machine (replacing SeasonalNaive with AutoETS) I see:

  • n = 10_000: n_jobs = 1 finished in ~2 minutes, while n_jobs = 8 took ~40 seconds
  • n = 100_000: n_jobs = 1 finished in ~14 minutes, while n_jobs = 8 is still running after 85 minutes!

Let me know if I can provide more info/tests. Very curious!

@caseytomlin
Author

The scheduling now works per series, to support progress bars and also (hopefully) better scheduling. Previously we just partitioned all series into n_jobs chunks and processed those separately, which produced less overhead but could also end up with just a few CPUs still running their partitions while all the others were done.

@jmoralez could you point to the package version that does the partitioning the "old" way?

@jmoralez
Member

1.7.5

@caseytomlin
Author

caseytomlin commented Sep 10, 2024

Rerunning the above example with AutoETS on 1.7.5 indeed brings the n = 100_000 n_jobs = 8 run time down to 3 minutes.

@jmoralez
Member

For the SeasonalNaive? How long does it take with n_jobs=1?

@caseytomlin
Author

Apologies, I meant with AutoETS - comment is edited.

With SeasonalNaive the run times are 3.7s with n_jobs = 1 and 2.8s with n_jobs = 8 using version 1.7.5 (as opposed to > 80 minutes using version 1.7.6).

@jmoralez
Member

jmoralez commented Sep 11, 2024

Thanks a lot for working through this issue with us. I've narrowed down the problems and I'll push a fix soon.

@caseytomlin
Author

Thanks @jmoralez for the quick fix, but unfortunately I am still running into similar problems with 1.7.7.1. However, the threshold number of series where the "crossover" occurs has gone up (and for my actual use cases this is the relevant scale). Here are some timings from 1.7.7.1:

Model          n          n_jobs  run time
SeasonalNaive  500_000    1       0:19
SeasonalNaive  500_000    8       2:41
SeasonalNaive  1_000_000  1       0:39
SeasonalNaive  1_000_000  8       > 40:00 (aborted)
AutoETS        500_000    1       24:00
AutoETS        500_000    8       7:30
AutoETS        1_000_000  1       48:00
AutoETS        1_000_000  8       > 2:00:00 (aborted)

With 1.7.5, on the other hand:

Model          n          n_jobs  run time
SeasonalNaive  1_000_000  1       0:39
SeasonalNaive  1_000_000  8       0:24
AutoETS        1_000_000  8       32:00

@jmoralez jmoralez reopened this Sep 18, 2024
@jmoralez
Member

Hey. Yeah, sorry, I only tested with 100k; it seems the process pool executor chokes on that many tasks. I'm working on keeping the number of in-flight tasks constant at any given time. It already works, but I'm trying to improve it a bit; I'll push another PR soon.
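The general idea is roughly the following (a sketch of the pattern, not the actual change): keep only a bounded number of futures alive at any moment and submit a replacement whenever one finishes, instead of creating one future per series up front.

from itertools import islice
from concurrent.futures import FIRST_COMPLETED, ProcessPoolExecutor, wait

def run_bounded(fn, tasks, n_jobs, backlog_per_worker=4):
    # keep at most n_jobs * backlog_per_worker tasks in flight at once
    tasks = iter(tasks)
    results = []
    with ProcessPoolExecutor(n_jobs) as pool:
        pending = {pool.submit(fn, t) for t in islice(tasks, n_jobs * backlog_per_worker)}
        while pending:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:
                results.append(fut.result())
                for t in islice(tasks, 1):  # refill one slot if tasks remain
                    pending.add(pool.submit(fn, t))
    return results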
