-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[docs] Add new Python multi-lang quickstart using the SchemaTransform framework #33360
base: master
Are you sure you want to change the base?
Conversation
Assigning reviewers. If you would like to opt out of this review, comment R: @kennknowles for label website. Available commands:
The PR bot will only process comments in the main thread (not review comments). |
Reminder, please take a look at this pr: @kennknowles |
Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment R: @damccorm for label website. Available commands:
|
@@ -8040,6 +8042,8 @@ input.apply( | |||
|
|||
#### 13.2.2. Using cross-language transforms in a Python pipeline | |||
|
|||
For Beam versions 2.60.0+, please follow [this guide](sdks/python-custom-multi-language-pipelines-guide.md#use-the-portable-transform-in-a-python-pipeline) instead. | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this section actually need this disclaimer? I think consuming schema transforms is basically the same, right/nothing has changed for this section?
@@ -7639,6 +7639,8 @@ In this section, we will use [KafkaIO.Read](https://beam.apache.org/releases/jav | |||
|
|||
#### 13.1.1. Creating cross-language Java transforms | |||
|
|||
For Beam versions 2.60.0+, please follow [this guide](sdks/python-custom-multi-language-pipelines-guide.md) instead. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this apply to the whole section or just 13.1.1.2
? Do we need to recommend away from JavaExternalTransform
for cases where it works?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, should we update this section to recommend the new way (even if its just linking to the full doc) by default, and just link to the legacy page for <2.60.0 instead of leaving all the content here?
|
||
## Create a cross-language transform | ||
|
||
Here's a Java transform provider, [ExtractWordsProvider](https://github.com/apache/beam/blob/master/examples/multi-language/src/main/java/org/apache/beam/examples/multilanguage/schematransforms/ExtractWordsProvider.java), that is uniquely identified with the URN `"beam:schematransform:org.apache.beam:extract_words:v1"`. Given a Configuration object, it will provide a transform: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you describe what the URN does? (in this context allows the transform to be identified across the language barrier)
|
||
Beam uses this configuration to generate a Python transform with the following signature: | ||
```python | ||
Extract(drop=["foo", "bar"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extract(drop=["foo", "bar"]) | |
class Extract(): | |
def __init__(self, drop: List[str]) |
Saying the existing code snippet is a signature is not quite right. Thoughts on providing the full Python class definition? This might be a bit clearer.
Alternately, we could change Beam uses this configuration to generate a Python transform with the following signature:
to Beam uses this configuration to generate a Python transform which can be instantiated like:
.
Extract(drop=["foo", "bar"]) | ||
``` | ||
|
||
The transform can be any implementation of your choice, as long as it meets the requirements of a [SchemaTransform](../glossary.md#schematransform). For this example, the transform does the following: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to similarly describe what a valid configuration is above. I assume not all field types are valid?
|
||
When building a job for a multi-language pipeline, Beam uses an [expansion service](../glossary#expansion-service) to expand [composite transforms](../glossary#composite-transform). You must have at least one expansion service per remote SDK. | ||
|
||
Before running a multi-language pipeline, you need to build an expansion service that can access your Java transform. It’s often easier to create a single shaded JAR that contains both. Both Python and Java dependencies will be staged for the runner by the Python SDK. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It’s often easier to create a single shaded JAR that contains both
I'm not sure what this is saying - both of what?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be nice to include an example command or additional info that shows how you can do this as well
Then, initialize the `ExternalTransformProvider` with your expansion service. This can take two parameters: | ||
|
||
* `expansion_services`: an expansion service, or list of expansion services | ||
* `urn_pattern`: (optional) a regex pattern to match valid transforms |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* `urn_pattern`: (optional) a regex pattern to match valid transforms | |
* `urn_pattern`: (optional) a regex pattern to match valid transforms. If this is not provided... |
It would be good to add information on what this does/what happens if it is missing
|
||
### Run with direct runner | ||
|
||
In the following command, `input1` is a file containing lines of text: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably worth calling out that the expansion service needs to be started first (here and below in the Dataflow section)
Reminder, please take a look at this pr: @damccorm |
waiting on author |
Part of #33358
Adding a new multi-lang quickstart and marking the old one as "legacy"