[docs] Add new Python multi-lang quickstart using the SchemaTransform framework #33360

Open

ahmedabu98 wants to merge 2 commits into apache:master from ahmedabu98:python_xlang_quickstart

+238 −4

Contributor

ahmedabu98 commented Dec 11, 2024 • edited

Part of #33358

Adding a new multi-lang quickstart and marking the old one as "legacy"


          new python multi-lang quickstart

79534af

github-actions bot added the website label

ahmedabu98 marked this pull request as ready for review

December 12, 2024 19:29

ahmedabu98 changed the title ~~Add new Python multi-lang quickstart using the SchemaTransform framework~~ [docs] Add new Python multi-lang quickstart using the SchemaTransform framework

Contributor

github-actions bot commented Dec 12, 2024

Assigning reviewers. If you would like to opt out of this review, comment `assign to next reviewer`:

R: @kennknowles for label website.

Available commands:

- `stop reviewer notifications` - opt out of the automated review tooling
- `remind me after tests pass` - tag the comment author after tests pass
- `waiting on author` - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

github-actions bot added the Next Action: Reviewers label

ahmedabu98 mentioned this pull request

[Task]: Add new Python SchemaTransform multi-language documentation #33358

Open

17 tasks

ahmedabu98 linked an issue

that may be closed by this pull request

[Task]: Add new Python SchemaTransform multi-language documentation #33358

Open

17 tasks


          update example code

Contributor

github-actions bot commented Dec 26, 2024

Reminder, please take a look at this PR: @kennknowles

github-actions bot added the slow-review label

Contributor

github-actions bot commented Dec 31, 2024

Assigning a new set of reviewers because the PR has gone too long without review. If you would like to opt out of this review, comment `assign to next reviewer`:

R: @damccorm for label website.

github-actions bot removed the slow-review label

damccorm reviewed

website/www/site/content/en/documentation/programming-guide.md

		@@ -8040,6 +8042,8 @@ input.apply(

		#### 13.2.2. Using cross-language transforms in a Python pipeline

		For Beam versions 2.60.0+, please follow [this guide](sdks/python-custom-multi-language-pipelines-guide.md#use-the-portable-transform-in-a-python-pipeline) instead.

Contributor

damccorm Jan 2, 2025

Does this section actually need this disclaimer? I think consuming schema transforms is basically the same, right? Nothing has changed for this section?

website/www/site/content/en/documentation/programming-guide.md

		@@ -7639,6 +7639,8 @@ In this section, we will use [KafkaIO.Read](https://beam.apache.org/releases/jav

		#### 13.1.1. Creating cross-language Java transforms

		For Beam versions 2.60.0+, please follow [this guide](sdks/python-custom-multi-language-pipelines-guide.md) instead.

Contributor

damccorm Jan 2, 2025

Does this apply to the whole section or just 13.1.1.2? Do we need to recommend away from JavaExternalTransform for cases where it works?

Contributor

damccorm Jan 2, 2025

Also, should we update this section to recommend the new way (even if it's just linking to the full doc) by default, and just link to the legacy page for <2.60.0 instead of leaving all the content here?

website/www/site/content/en/documentation/sdks/python-multi-language-pipelines-2.md


		## Create a cross-language transform

		Here's a Java transform provider, [ExtractWordsProvider](https://github.com/apache/beam/blob/master/examples/multi-language/src/main/java/org/apache/beam/examples/multilanguage/schematransforms/ExtractWordsProvider.java), that is uniquely identified with the URN `"beam:schematransform:org.apache.beam:extract_words:v1"`. Given a Configuration object, it will provide a transform:

Contributor

damccorm Jan 2, 2025

Could you describe what the URN does? (in this context allows the transform to be identified across the language barrier)

website/www/site/content/en/documentation/sdks/python-multi-language-pipelines-2.md

+              Beam uses this configuration to generate a Python transform with the following signature:
+              ```python
+              Extract(drop=["foo", "bar"])

Contributor

damccorm Jan 2, 2025

Suggested change

- Extract(drop=["foo", "bar"])
+ class Extract():
+    def __init__(self, drop: List[str])

Saying the existing code snippet is a signature is not quite right. Thoughts on providing the full Python class definition? This might be a bit clearer.

Alternatively, we could change "Beam uses this configuration to generate a Python transform with the following signature:" to "Beam uses this configuration to generate a Python transform which can be instantiated like:".
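To make the distinction between a signature and an instantiation concrete, here is a minimal sketch in plain Python. The `Extract` class below is a hypothetical stand-in written by hand for illustration; it is not the class Beam generates, and the meaning given to `drop` is an assumption.

```python
import inspect
from typing import List


class Extract:
    """Hypothetical stand-in for the generated transform; not Beam's class."""

    def __init__(self, drop: List[str]):
        # Assumed meaning: names of the fields to remove from each row.
        self.drop = list(drop)


# The signature describes which parameters the constructor accepts.
signature = inspect.signature(Extract)

# An instantiation is one concrete call that satisfies that signature.
transform = Extract(drop=["foo", "bar"])

print(signature)
print(transform.drop)
```

The first `print` shows the parameter list, while the second shows the values passed in one particular call, which is what the original snippet in the doc actually depicts.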

website/www/site/content/en/documentation/sdks/python-multi-language-pipelines-2.md

+              Extract(drop=["foo", "bar"])
+              ```
+              The transform can be any implementation of your choice, as long as it meets the requirements of a [SchemaTransform](../glossary.md#schematransform). For this example, the transform does the following:

Contributor

damccorm Jan 2, 2025

I think we need to similarly describe what a valid configuration is above. I assume not all field types are valid?

website/www/site/content/en/documentation/sdks/python-multi-language-pipelines-2.md


		When building a job for a multi-language pipeline, Beam uses an [expansion service](../glossary#expansion-service) to expand [composite transforms](../glossary#composite-transform). You must have at least one expansion service per remote SDK.

		Before running a multi-language pipeline, you need to build an expansion service that can access your Java transform. It’s often easier to create a single shaded JAR that contains both. Both Python and Java dependencies will be staged for the runner by the Python SDK.

Contributor

damccorm Jan 2, 2025

> It’s often easier to create a single shaded JAR that contains both

I'm not sure what this is saying - both of what?

Contributor

damccorm Jan 2, 2025

It might be nice to include an example command or additional info that shows how you can do this as well

website/www/site/content/en/documentation/sdks/python-multi-language-pipelines-2.md

+              Then, initialize the `ExternalTransformProvider` with your expansion service. This can take two parameters:
+              * `expansion_services`: an expansion service, or list of expansion services
+              * `urn_pattern`: (optional) a regex pattern to match valid transforms

Contributor

damccorm Jan 2, 2025

Suggested change

- * `urn_pattern`: (optional) a regex pattern to match valid transforms
+ * `urn_pattern`: (optional) a regex pattern to match valid transforms. If this is not provided...

It would be good to add information on what this does/what happens if it is missing
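As a rough illustration of what a regex filter over URNs could look like, the sketch below uses only the standard `re` module. The pattern, the `matching_urns` helper, and the second URN are assumptions made for this example; whether `ExternalTransformProvider` filters in exactly this way, and what its default pattern is, should be checked against the Beam source.

```python
import re

# Assumed pattern for illustration only; not confirmed as Beam's default.
URN_PATTERN = r"^beam:schematransform:org\.apache\.beam:([\w-]+):(\w+)$"


def matching_urns(urns, pattern=URN_PATTERN):
    """Return the URNs that match the pattern (hypothetical helper)."""
    compiled = re.compile(pattern)
    return [urn for urn in urns if compiled.match(urn)]


candidates = [
    # URN of the ExtractWordsProvider from the quickstart.
    "beam:schematransform:org.apache.beam:extract_words:v1",
    # Hypothetical URN that does not follow the schematransform scheme.
    "beam:transform:org.apache.beam:something_else:v1",
]

selected = matching_urns(candidates)

# The first capture group holds the transform's short name.
short_name = re.match(URN_PATTERN, selected[0]).group(1)
print(selected, short_name)
```

Under this assumed pattern, only transforms whose URN follows the `beam:schematransform:org.apache.beam:<name>:<version>` shape would be picked up, which is the kind of behavior the docs could spell out for the case where `urn_pattern` is omitted.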

website/www/site/content/en/documentation/sdks/python-multi-language-pipelines-2.md


		### Run with direct runner

		In the following command, `input1` is a file containing lines of text:

Contributor

damccorm Jan 2, 2025

Probably worth calling out that the expansion service needs to be started first (here and below in the Dataflow section)
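One way a reader could confirm that the expansion service is already listening before launching the pipeline is a small port check. This is a sketch using only the standard library; `is_port_open` is a hypothetical helper, and the host and port in the usage comment are placeholders, not values taken from the quickstart.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds (hypothetical helper)."""
    try:
        # create_connection raises OSError if nothing accepts the connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example usage (placeholder address; substitute your expansion service's):
# if not is_port_open("localhost", 12345):
#     raise RuntimeError("Start the expansion service before running the pipeline")
```

A successful TCP connection only shows that something is listening on that port, not that it is a working expansion service, so this is a sanity check rather than a guarantee.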

Contributor

github-actions bot commented Jan 10, 2025

Reminder, please take a look at this PR: @damccorm

github-actions bot added the slow-review label

Contributor

damccorm commented Jan 10, 2025

waiting on author

github-actions bot added Next Action: Author and removed Next Action: Reviewers slow-review labels

Labels: Next Action: Author, website