
SO0111-SHARR-sendNotifications lambda (via the layer) returns ResourceAlreadyExistsException when a log stream already exists #218

Open
kpapagno opened this issue Jan 9, 2025 · 4 comments
Labels: bug (Something isn't working)

Comments

kpapagno commented Jan 9, 2025

Describe the bug
Log streams are written with the help of _create_log_stream in applogger.py in the orchestrator lambda layer. If the log stream name already exists, the lambda fails.

To Reproduce
Run the orchestrator Step Function more than once per day.

Expected behavior

The Log Group SO0111-SHARR Log stream names are now custom to reflect the date. It appears that if the notify lambda is run multiple times in the same day, the log stream (SHARR-2025-01-09) will already exist. The expectation is that when a log stream already exists, the orchestrator lambdas will not raise an exception. Instead it should get the next sequence token from the log stream which will be used in subsequent put_log_events calls.

Please complete the following information about the solution:

  • Version: v2.2.0
  • Region: us-east-1
  • Was the solution modified from the version published on this repository? Yes
  • If the answer to the previous question was yes, are the changes available on GitHub? Yes
  • Have you checked your service quotas for the services this solution uses? Yes, N/A
  • Were there any errors in the CloudWatch Logs? Yes, see below.

Troubleshooting

FIRST RUN (logs of /aws/lambda/SO0111-SHARR-sendNotifications):
2025-01-09T03:46:57.384Z | INIT_START Runtime Version: python:3.11.v46 Runtime Version ARN: arn:aws:lambda:us-east-1::runtime:84bc5fe9641102b252ebbfd80d16f2791f7bcca59e26ce68d95066317adf4503
2025-01-09T03:46:57.797Z | INFO successfully patched module sqlite3
2025-01-09T03:46:57.850Z | INFO successfully patched module requests
2025-01-09T03:46:57.855Z | INFO successfully patched module botocore
2025-01-09T03:46:57.860Z | START RequestId: aae012ed-edc1-49db-8299-9a7e16dfd34b Version: $LATEST
2025-01-09T03:46:57.920Z | INFO Found credentials in environment variables.
2025-01-09T03:46:59.453Z | Sending metrics data <snipped for bug report....>
2025-01-09T03:47:00.975Z | Notification message ID 7ae791c3-f712-5231-918a-ce3abfe6a8c0 sent.
2025-01-09T03:47:00.975Z | Creating log stream SHARR-2025-01-09
2025-01-09T03:47:01.328Z | END RequestId: aae012ed-edc1-49db-8299-9a7e16dfd34b

SECOND RUN (logs of /aws/lambda/SO0111-SHARR-sendNotifications):

2025-01-09T03:46:57.384Z | INIT_START Runtime Version: python:3.11.v46 Runtime Version ARN: arn:aws:lambda:us-east-1::runtime:84bc5fe9641102b252ebbfd80d16f2791f7bcca59e26ce68d95066317adf4503
2025-01-09T03:46:57.810Z | INFO successfully patched module sqlite3
2025-01-09T03:46:57.864Z | INFO successfully patched module requests
2025-01-09T03:46:57.870Z | INFO successfully patched module botocore
2025-01-09T03:46:57.875Z | START RequestId: e241b664-f7e9-42e5-aa77-526a9742de15 Version: $LATEST
2025-01-09T03:46:57.912Z | INFO Found credentials in environment variables.
2025-01-09T03:46:59.503Z | Sending metrics data <snipped for bug report....>
2025-01-09T03:47:00.997Z | Notification message ID 07238755-5443-5ad3-a6ef-d8ac1cb734c9 sent.
2025-01-09T03:47:00.997Z | Creating log stream SHARR-2025-01-09
2025-01-09T03:47:01.409Z | LAMBDA_WARNING: Unhandled exception. The most likely cause is an issue in the function code. However, in rare cases, a Lambda runtime update can cause unexpected function behavior. For functions using managed runtimes, runtime updates can be triggered by a function change, or can be applied automatically. To determine if the runtime has been updated, check the runtime version in the INIT_START log entry. If this error correlates with a change in the runtime version, you may be able to mitigate this error by temporarily rolling back to the previous runtime version. For more information, see https://docs.aws.amazon.com/lambda/latest/dg/runtimes-update.html
2025-01-09T03:47:01.410Z | [ERROR] ResourceAlreadyExistsException: An error occurred (ResourceAlreadyExistsException) when calling the CreateLogStream operation: The specified log stream already exists
Traceback (most recent call last):
  File "/opt/python/lib/python3.11/site-packages/aws_lambda_powertools/tracing/tracer.py", line 324, in decorate
    response = lambda_handler(event, context, **kwargs)
  File "/var/task/send_notifications.py", line 171, in lambda_handler
    build_and_send_notification(
  File "/var/task/send_notifications.py", line 223, in build_and_send_notification
    notification.notify()
  File "/opt/python/layer/sechub_findings.py", line 323, in notify
    self.applogger.flush()
  File "/opt/python/layer/applogger.py", line 109, in flush
    log_stream = self._create_log_stream(log_stream=self.stream_name)
  File "/opt/python/layer/applogger.py", line 73, in _create_log_stream
    get_logs_connection(self.apiclient).create_log_stream(
  File "/opt/python/lib/python3.11/site-packages/botocore/client.py", line 569, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/opt/python/lib/python3.11/site-packages/wrapt/wrappers.py", line 669, in __call__
    return self._self_wrapper(self.__wrapped__, self._self_instance,
  File "/opt/python/lib/python3.11/site-packages/aws_xray_sdk/ext/botocore/patch.py", line 38, in _xray_traced_botocore
    return xray_recorder.record_subsegment(
  File "/opt/python/lib/python3.11/site-packages/aws_xray_sdk/core/recorder.py", line 456, in record_subsegment
    return_value = wrapped(*args, **kwargs)
  File "/opt/python/lib/python3.11/site-packages/botocore/client.py", line 1023, in _make_api_call
    raise error_class(parsed_response, operation_name)
2025-01-09T03:47:01.412Z | END RequestId: e241b664-f7e9-42e5-aa77-526a9742de15
2025-01-09T03:47:01.412Z | REPORT RequestId: e241b664-f7e9-42e5-aa77-526a9742de15 Duration: 3537.20 ms Billed Duration: 3538 ms Memory Size: 256 MB Max Memory Used: 103 MB Init Duration: 489.41 ms XRAY TraceId: 1-677f46a9-d153f7c536933a65a3a4aba9 SegmentId: 6245c5fe72c92874 Sampled: true

Screenshots
If applicable, add screenshots to help explain your problem (please DO NOT include sensitive information).

Additional context
I made my own change to work around this issue, but I'm not confident enough to submit a pull request at this time.

Updated code in layer applogger.py

    def _create_log_stream(self, log_stream):
        """Create a new log stream"""
        # append today's date to stream name
        log_stream = log_stream + "-" + str(date.today())
        try:
            print("Creating log stream {}".format(log_stream))
            get_logs_connection(self.apiclient).create_log_stream(
                logGroupName=self.log_group, logStreamName=log_stream
            )
            self._stream_token = "0"
        except Exception as e:
            # if the stream was created in between the call ignore the error
            if type(e).__name__ == "ResourceAlreadyExistsException":
                print("Log Stream already exists")
                # Get the sequence token for the existing stream
                response = get_logs_connection(self.apiclient).describe_log_streams(
                    logGroupName=self.log_group,
                    logStreamNamePrefix=log_stream,
                    limit=1,
                )
                if response["logStreams"]:
                    self._stream_token = response["logStreams"][0].get(
                        "uploadSequenceToken", "0"
                    )
            elif type(e).__name__ == "ResourceNotFoundException":
                if self._create_log_group():
                    get_logs_connection(self.apiclient).create_log_stream(
                        logGroupName=self.log_group, logStreamName=log_stream
                    )
                else:
                    raise FailedToCreateLogGroup
            else:
                raise e
        return log_stream

When catching ResourceAlreadyExistsException, we now fetch the current sequence token for the existing stream using describe_log_streams. This sequence token is stored in self._stream_token which will be used in subsequent put_log_events calls. This modification ensures that when you encounter an existing stream, you'll get the correct sequence token right away, which should prevent the error and allow for proper log writing to existing streams. The sequence token is important because CloudWatch Logs uses it to maintain the order of log events and prevent duplicate entries. When a stream already exists, we need to know the current sequence token to append new logs correctly.
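To illustrate the sequence-token flow described above end to end, here is a minimal, self-contained sketch. The FakeLogsClient stub stands in for the real boto3 CloudWatch Logs client so the flow is runnable without AWS; all names here (FakeLogsClient, fetch_sequence_token, flush_events) are illustrative, not the solution's actual code:

```python
def fetch_sequence_token(client, log_group, log_stream):
    """Return the upload sequence token for an existing stream, or '0'."""
    response = client.describe_log_streams(
        logGroupName=log_group, logStreamNamePrefix=log_stream, limit=1
    )
    streams = response.get("logStreams", [])
    if streams:
        return streams[0].get("uploadSequenceToken", "0")
    return "0"


def flush_events(client, log_group, log_stream, events, token):
    """Write buffered events using the current sequence token."""
    result = client.put_log_events(
        logGroupName=log_group,
        logStreamName=log_stream,
        logEvents=events,
        sequenceToken=token,
    )
    # CloudWatch Logs returns the token to use for the next call
    return result["nextSequenceToken"]


class FakeLogsClient:
    """Just enough of the CloudWatch Logs API surface for this sketch."""

    def __init__(self):
        self._token = "49000000000000000001"

    def describe_log_streams(self, logGroupName, logStreamNamePrefix, limit):
        return {
            "logStreams": [
                {
                    "logStreamName": logStreamNamePrefix,
                    "uploadSequenceToken": self._token,
                }
            ]
        }

    def put_log_events(self, logGroupName, logStreamName, logEvents, sequenceToken):
        assert sequenceToken == self._token, "stale sequence token"
        self._token = str(int(self._token) + len(logEvents))
        return {"nextSequenceToken": self._token}


client = FakeLogsClient()
token = fetch_sequence_token(client, "SO0111-SHARR", "SHARR-2025-01-09")
token = flush_events(
    client,
    "SO0111-SHARR",
    "SHARR-2025-01-09",
    [{"timestamp": 0, "message": "hello"}],
    token,
)
```

With the real client, the same pattern applies: fetch the token once when the stream already exists, then thread each call's nextSequenceToken into the following put_log_events.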

Additionally, I added the permission logs:DescribeLogStreams to all the policies which work with writing logs as the layer may use the common code.
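For reference, the added permission could be expressed as an IAM policy statement roughly like the following (the resource ARN shown is illustrative; scope it to the solution's actual log group in your account):

```json
{
  "Effect": "Allow",
  "Action": "logs:DescribeLogStreams",
  "Resource": "arn:aws:logs:*:*:log-group:SO0111-SHARR:*"
}
```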

@kpapagno kpapagno added the bug Something isn't working label Jan 9, 2025
jrgaray27 (Member) commented:

Hi Ken,
Thanks for pointing this out. I will look into this and get back to you once I have more info.

jrgaray27 (Member) commented:

After looking into it, the error in the logs you provided occurs on line 73 of applogger.py, during the second attempt at calling create_log_stream. Execution never enters the first if-clause (if type(e).__name__ == "ResourceAlreadyExistsException":); it enters the elif clause instead, which then raises the exception. For this reason, I don't think your code would be executed in the error scenario you're experiencing. That is, the first call to create_log_stream raises a ResourceNotFoundException, then the second call raises a ResourceAlreadyExistsException.

The flush method will indeed add the sequence token if the stream exists, since the _create_log_stream method will return the name of the existing log stream when a ResourceAlreadyExistsException is encountered, and the sequence token is added in line 122 of applogger.py.

Examining the timestamps of the logs you've provided, it seems like the first and second run of send_notifications are happening in parallel, which means a race condition is the likely cause of the exception here. In other words, both instances of the lambda are trying to create and use the log stream at the same time. Unfortunately there isn't an easy fix for this; however, as a temporary workaround you can set the concurrency of the send_notifications lambda function to 1, so that only one instance of the function can run at a time.
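For anyone applying the workaround, reserved concurrency can be set from the AWS CLI; the function name below is the one the solution deploys, so adjust it if yours differs:

```sh
aws lambda put-function-concurrency \
  --function-name SO0111-SHARR-sendNotifications \
  --reserved-concurrent-executions 1
```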

One important thing to note - the flush method is the final action of the send_notifications function, meaning this exception will not have any effect on the core functionality of the solution or notifications step.

We will prioritize a fix for this issue to include with the next enhancement release of the solution.


kpapagno commented Jan 10, 2025 via email

jrgaray27 (Member) commented:

No problem, glad to help. Feel free to reach out if you have any further questions about the solution.

I will leave this issue open until a fix is introduced in the next release.
