Fix/enrichment docs #33012 #33561

Merged
merged 11 commits into from
Jan 16, 2025
31 changes: 31 additions & 0 deletions examples/notebooks/beam-ml/bigtable_enrichment_transform.ipynb
@@ -603,6 +603,37 @@
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### What is a Cross-Join?\n",
"A cross-join is a Cartesian product operation where each row from one table is combined with every row from another table. It is useful when we want to create all possible combinations of two datasets.\n",
"\n",
"**Example:**\n",
"- Table A:\n",
" | A1 | A2 |\n",
" |----|----|\n",
" | 1 | X |\n",
" | 2 | Y |\n",
"\n",
"- Table B:\n",
" | B1 | B2 |\n",
" |----|----|\n",
" | 10 | P |\n",
" | 20 | Q |\n",
"\n",
"**Result of Cross-Join:**\n",
" | A1 | A2 | B1 | B2 |\n",
" |----|----|----|----|\n",
" | 1 | X | 10 | P |\n",
" | 1 | X | 20 | Q |\n",
" | 2 | Y | 10 | P |\n",
" | 2 | Y | 20 | Q |\n",
"\n",
"Cross-joins can be computationally expensive for large datasets, so use them judiciously.\n"
Contributor

Could you combine this cell and the next one into a single cell? I think they're discussing the same thing.

]
},
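
The Cartesian-product behavior described in the notebook cell above can be reproduced in a few lines of plain Python. This is only an illustrative sketch: the names `table_a`, `table_b`, and `cross_join` are hypothetical and are not part of the notebook or of Apache Beam.

```python
from itertools import product

# Rows mirror Table A and Table B from the markdown cell.
table_a = [{"A1": 1, "A2": "X"}, {"A1": 2, "A2": "Y"}]
table_b = [{"B1": 10, "B2": "P"}, {"B1": 20, "B2": "Q"}]


def cross_join(left, right):
    """Combine every row of `left` with every row of `right`."""
    return [{**l_row, **r_row} for l_row, r_row in product(left, right)]


rows = cross_join(table_a, table_b)
for row in rows:
    print(row)
```

The result has `len(table_a) * len(table_b)` rows, which is why the cost grows quickly as either input grows.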
{
"cell_type": "markdown",
"metadata": {
@@ -46,6 +46,54 @@ The following examples demonstrate how to create a pipeline that use the enrichm
| Vertex AI Feature Store (Legacy) | [Enrichment with Legacy Vertex AI Feature Store](/documentation/transforms/python/elementwise/enrichment-vertexai/#example-2-enrichment-with-vertex-ai-feature-store-legacy) |
{{< /table >}}

## BigQuery Support

The enrichment transform supports integration with **BigQuery** to dynamically enrich data using BigQuery datasets. By leveraging BigQuery as an external data source, users can execute efficient lookups for data enrichment directly in their Apache Beam pipelines.

To use BigQuery for enrichment:
- Configure your BigQuery table as the data source for the enrichment process.
- Ensure your pipeline has the appropriate credentials and permissions to access the BigQuery dataset.
- Specify the query to extract the data to be used for enrichment.

This integration is particularly beneficial for use cases that require augmenting real-time streaming data with information stored in BigQuery.

---

## Batching

To optimize requests to external services, the enrichment transform uses batching. Instead of performing a lookup for each individual element, the transform groups multiple elements into a batch and performs a single lookup for the entire batch.

### Advantages of Batching:
- **Improved Throughput**: Reduces the number of network calls.
- **Lower Latency**: Fewer round trips to the external service.
- **Cost Optimization**: Minimizes API call costs when working with paid external services.

Users can configure the batch size by specifying parameters in their pipeline setup. Adjusting the batch size can help fine-tune the balance between throughput and latency.
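
As a rough illustration of the idea (not the enrichment transform's actual implementation), the sketch below groups keys into fixed-size batches and issues one lookup per batch. `batched`, `enrich_in_batches`, and `CountingService` are hypothetical names, and the in-memory dictionary stands in for a real external service.

```python
from typing import Callable, Dict, Iterable, Iterator, List


def batched(elements: Iterable[int], max_batch_size: int) -> Iterator[List[int]]:
    """Yield consecutive batches of at most `max_batch_size` elements."""
    batch: List[int] = []
    for element in elements:
        batch.append(element)
        if len(batch) == max_batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch


def enrich_in_batches(
    keys: Iterable[int],
    lookup_many: Callable[[List[int]], Dict[int, str]],
    max_batch_size: int,
) -> Dict[int, str]:
    """Perform one lookup call per batch instead of one per key."""
    enriched: Dict[int, str] = {}
    for batch in batched(keys, max_batch_size):
        enriched.update(lookup_many(batch))
    return enriched


class CountingService:
    """Hypothetical stand-in for an external service; counts its calls."""

    def __init__(self, table: Dict[int, str]) -> None:
        self.table = table
        self.calls = 0

    def lookup_many(self, keys: List[int]) -> Dict[int, str]:
        self.calls += 1
        return {key: self.table[key] for key in keys}


service = CountingService({1: "X", 2: "Y", 3: "Z", 4: "W", 5: "V"})
result = enrich_in_batches([1, 2, 3, 4, 5], service.lookup_many, max_batch_size=2)
print(result, service.calls)
```

With five keys and a batch size of two, the service is called three times rather than five. A larger batch size reduces the number of calls further, at the price of waiting longer to fill each batch.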

---

## Caching with `with_redis_cache`

For frequently used enrichment data, caching can significantly improve performance by reducing repeated calls to the remote service. Apache Beam's `with_redis_cache` method allows you to integrate a Redis cache into the enrichment pipeline.

### Benefits of Caching:
- **Reduced Latency**: Fetches enrichment data from the cache instead of making network calls.
- **Improved Resilience**: Minimizes the impact of network outages or service downtimes.
- **Scalability**: Handles large volumes of enrichment requests efficiently.

To enable caching:
1. Set up a Redis instance accessible by your pipeline.
2. Use the `with_redis_cache` method to configure the cache in your enrichment transform.
3. Specify the time-to-live (TTL) for cache entries to ensure data freshness.
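
The role of the TTL in step 3 can be sketched in plain Python. This is a conceptual, in-process stand-in only: it uses a dictionary rather than Redis, and `TTLCache`, `get_or_fetch`, and `fetch_from_service` are hypothetical names, not Apache Beam APIs.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Read-through cache whose entries expire after `ttl_seconds`."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}  # key -> (expires_at, value)

    def get_or_fetch(self, key: Hashable, fetch: Callable[[Hashable], Any]) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # fresh hit: no remote call
        value = fetch(key)  # miss or expired: call the remote service
        self._entries[key] = (now + self._ttl, value)
        return value


# A controllable clock keeps the demonstration deterministic.
fake_now = [0.0]
remote_calls = []


def fetch_from_service(key):
    """Hypothetical remote lookup; records each call it receives."""
    remote_calls.append(key)
    return f"details-for-{key}"


cache = TTLCache(ttl_seconds=60, clock=lambda: fake_now[0])
first = cache.get_or_fetch("user-1", fetch_from_service)   # miss, fetches
second = cache.get_or_fetch("user-1", fetch_from_service)  # hit, served from cache
fake_now[0] = 61.0                                         # advance past the TTL
third = cache.get_or_fetch("user-1", fetch_from_service)   # expired, fetches again
print(len(remote_calls))
```

A short TTL keeps data fresher but sends more requests to the remote service; a long TTL does the opposite.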

Example:
```python
from apache_beam.transforms.enrichment import with_redis_cache

# Enrichment pipeline with Redis cache
enriched_data = (input_data
                 | 'Enrich with Cache' >> with_redis_cache(redis_config=redis_config, enrichment_transform=my_enrichment_transform))
```

Contributor

I don't think this generated code is right, with_redis_cache should be appended to the enrichment transform like this -

| Enrichment(handler).with_redis_cache(self.host, self.port))


## Related transforms

Not applicable.