Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Explicitly cleanup SqlTask on worker when no longer needed #24729

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

losipiuk
Copy link
Member

Currently SqlTask objects are removed from SqlTaskManager.tasks map (cache) after timeout (15 minutes by default). Even though the object is not huge, we observed increased memory pressure up to OOM on busy clusters.

With this PR entries are dropped form SqlTaskManager as soon as they are no longer needed, when coordinator will no longer query for the information

Note: I this code is prone to race condition if last worker to acknowlege task results retransmits acknowlegement. There is a chance that a this point task is already cleaned by cleanup request from coordinator, and entry will get resurected in tasks cache map. This should be extremly rare though, and not cause any problems.

@POST
@Path("{taskId}/cleanup")
public void cleanupTask(
@PathParam("taskId") TaskId taskId)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: wrapping not needed

@@ -124,6 +128,7 @@ public synchronized void updateDynamicFiltersVersionAndFetchIfNecessary(long new
private synchronized void stop()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this method is not called normally. only in erronous situations. You cannot rely on it to be executed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The external entry point for DynamicFiltersFetcher request is really updateDynamicFiltersVersionAndFetchIfNecessary.
So the conditions for DynamicFiltersFetcher being completed are:

  • there is no pending query
  • all places which could call updateDynamicFiltersVersionAndFetchIfNecessary are stopped (currenty it's continuous fetcher)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sopel39 @raunaqmorarka Yeah - you are right. Do you have quick clue if there is natural place where we know that DynFilFetcher finished its work. The way it is called is kinda convoluted

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes:

  • there is no pending request from DynamicFilterFetcher
  • all places which could call updateDynamicFiltersVersionAndFetchIfNecessary are stopped (it's only ContinuousTaskStatusFetcher atm IIRC)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PTAL now :)

Copy link
Member

@sopel39 sopel39 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add testing (with DF) to TestHttpRemoteTask

import static java.util.Objects.requireNonNull;
import static java.util.concurrent.TimeUnit.MILLISECONDS;

public class RemoteTaskCleaner
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a separate service for this? Cannot HttpRemoteTask do cleanup itself?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can - just having separate class is cleaner I think. I can put that as inner class too but I do not like that too much.

@losipiuk
Copy link
Member Author

@sopel39 PTAL

@losipiuk
Copy link
Member Author

@sopel39 PTAL

@sopel39 PTAL

@@ -121,6 +124,8 @@ public synchronized void stop()
future.cancel(true);
future = null;
}
remoteTaskCleaner.markTaskStatusFetcherStopped(taskStatus.get().getState());
dynamicFiltersFetcher.noMoreUpdateRequests();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could it be just DynamicFiltersFetcher#stop method?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no - as we want to wait for results of last after last DFF.updateDynamicFiltersVersionAndFetchIfNecessary call

@@ -253,7 +258,11 @@ void updateTaskStatus(TaskStatus newValue)
onFail.accept(new TrinoException(REMOTE_TASK_MISMATCH, format("%s (%s)", REMOTE_TASK_MISMATCH_ERROR, HostAddress.fromUri(getTaskStatus().getSelf()))));
}

dynamicFiltersFetcher.updateDynamicFiltersVersionAndFetchIfNecessary(newValue.getDynamicFiltersVersion());
synchronized (this) {
if (running) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can it happen in non-aborted query?

I think that's still wrong, because last DF update could be potentially received after ContinuousTaskStatusFetcher#stop was called. We need that last request for DFs to happen, otherwise we loose filtering opportunity.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean that running can get true bacasue stop() is called as a result of listener due to taskStatus.setIf above?

Maybe we should just move

dynamicFiltersFetcher.updateDynamicFiltersVersionAndFetchIfNecessary(newValue.getDynamicFiltersVersion());

to the top of updateTaskStatus. WDYT?

@@ -113,17 +118,38 @@ public synchronized void start()

public synchronized void updateDynamicFiltersVersionAndFetchIfNecessary(long newDynamicFiltersVersion)
{
if (noMoreUpdateRequests) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again, we need that last update to happen to fetch final DFs.

Just make dynamic filters fetcher report when it's finished.

Copy link
Member Author

@losipiuk losipiuk Jan 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see my previous comment.

What do you exactly mean by updateDynamicFiltersVersionAndFetchIfNecessary?
LIke return future from updateDynamicFiltersVersionAndFetchIfNecessary and mark it completed when relevant processing is completed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return future from updateDynamicFiltersVersionAndFetchIfNecessary and mark it completed when relevant processing is completed?

I tried that in other branch - and it gets extremly ugly and complidcated.

Copy link
Member

@sopel39 sopel39 Jan 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean essentially that the fetchers are not considered closed if there is still a pending request. It may happens that last ContinuousTaskStatusFetcher.java request could trigger DF fetch request, so the end condition is:

  1. ContinuousTaskStatusFetcher is REALLY closed (no pending request)
  2. DF doesn't have any pending request.

I think if you just put checkpoints in right places accounting for the above you should be fine

@@ -121,6 +124,8 @@ public synchronized void stop()
future.cancel(true);
future = null;
}
remoteTaskCleaner.markTaskStatusFetcherStopped(taskStatus.get().getState());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would postpone that only after pending request is finished, so that last DF update can be processed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how is this one related to pending requst in DF fetcher?

Currently SqlTask objects are removed from SqlTaskManager.tasks map
(cache) after timeout (15 minutes by default). Even though the object is
not huge, we observed increased memory pressure up to OOM on busy
clusters.

With this PR entries are dropped form SqlTaskManager as soon as they are
no longer needed, when coordinator will no longer query for the
information
@losipiuk
Copy link
Member Author

@sopel39 ?

@losipiuk
Copy link
Member Author

@sopel39 ?

actually wait - sth is messed up

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

4 participants