Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[docs] Add new Python multi-lang quickstart using the SchemaTransform framework #33360

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

ahmedabu98
Copy link
Contributor

@ahmedabu98 ahmedabu98 commented Dec 11, 2024

Part of #33358

Adding a new multi-lang quickstart and marking the old one as "legacy"

@ahmedabu98 ahmedabu98 marked this pull request as ready for review December 12, 2024 19:29
@ahmedabu98 ahmedabu98 changed the title Add new Python multi-lang quickstart using the SchemaTransform framework [docs] Add new Python multi-lang quickstart using the SchemaTransform framework Dec 12, 2024
Copy link
Contributor

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @kennknowles for label website.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

Copy link
Contributor

Reminder, please take a look at this pr: @kennknowles

Copy link
Contributor

Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @damccorm for label website.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

@@ -8040,6 +8042,8 @@ input.apply(

#### 13.2.2. Using cross-language transforms in a Python pipeline

For Beam versions 2.60.0+, please follow [this guide](sdks/python-custom-multi-language-pipelines-guide.md#use-the-portable-transform-in-a-python-pipeline) instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this section actually need this disclaimer? I think consuming schema transforms is basically the same, right/nothing has changed for this section?

@@ -7639,6 +7639,8 @@ In this section, we will use [KafkaIO.Read](https://beam.apache.org/releases/jav

#### 13.1.1. Creating cross-language Java transforms

For Beam versions 2.60.0+, please follow [this guide](sdks/python-custom-multi-language-pipelines-guide.md) instead.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this apply to the whole section or just 13.1.1.2? Do we need to recommend away from JavaExternalTransform for cases where it works?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, should we update this section to recommend the new way (even if its just linking to the full doc) by default, and just link to the legacy page for <2.60.0 instead of leaving all the content here?


## Create a cross-language transform

Here's a Java transform provider, [ExtractWordsProvider](https://github.com/apache/beam/blob/master/examples/multi-language/src/main/java/org/apache/beam/examples/multilanguage/schematransforms/ExtractWordsProvider.java), that is uniquely identified with the URN `"beam:schematransform:org.apache.beam:extract_words:v1"`. Given a Configuration object, it will provide a transform:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you describe what the URN does? (in this context allows the transform to be identified across the language barrier)


Beam uses this configuration to generate a Python transform with the following signature:
```python
Extract(drop=["foo", "bar"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Extract(drop=["foo", "bar"])
class Extract():
def __init__(self, drop: List[str])

Saying the existing code snippet is a signature is not quite right. Thoughts on providing the full Python class definition? This might be a bit clearer.

Alternately, we could change Beam uses this configuration to generate a Python transform with the following signature: to Beam uses this configuration to generate a Python transform which can be instantiated like:.

Extract(drop=["foo", "bar"])
```

The transform can be any implementation of your choice, as long as it meets the requirements of a [SchemaTransform](../glossary.md#schematransform). For this example, the transform does the following:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to similarly describe what a valid configuration is above. I assume not all field types are valid?


When building a job for a multi-language pipeline, Beam uses an [expansion service](../glossary#expansion-service) to expand [composite transforms](../glossary#composite-transform). You must have at least one expansion service per remote SDK.

Before running a multi-language pipeline, you need to build an expansion service that can access your Java transform. It’s often easier to create a single shaded JAR that contains both. Both Python and Java dependencies will be staged for the runner by the Python SDK.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It’s often easier to create a single shaded JAR that contains both

I'm not sure what this is saying - both of what?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be nice to include an example command or additional info that shows how you can do this as well

Then, initialize the `ExternalTransformProvider` with your expansion service. This can take two parameters:

* `expansion_services`: an expansion service, or list of expansion services
* `urn_pattern`: (optional) a regex pattern to match valid transforms
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* `urn_pattern`: (optional) a regex pattern to match valid transforms
* `urn_pattern`: (optional) a regex pattern to match valid transforms. If this is not provided...

It would be good to add information on what this does/what happens if it is missing


### Run with direct runner

In the following command, `input1` is a file containing lines of text:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably worth calling out that the expansion service needs to be started first (here and below in the Dataflow section)

Copy link
Contributor

Reminder, please take a look at this pr: @damccorm

@damccorm
Copy link
Contributor

waiting on author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Task]: Add new Python SchemaTransform multi-language documentation
2 participants