Skip to content

CodingMatte/spark-streaming-playground

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

20 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Spark Streaming Playground

Welcome to the Spark Streaming Playground! This project showcases various aspects of Apache Spark Streaming, providing a hands-on environment to explore stateful computations, structured streaming, utility functions, and integrations.

Project Structure

The project is organized into several packages, each focusing on different functionalities within Spark Streaming. Below is a detailed overview of each Java class and its purpose:

1. Advanced Package

  • StatefulComputations.java
    This class demonstrates how to perform stateful computations using Spark's MapGroupsWithStateFunction. It utilizes Spark's stateful operations to maintain and update state information across streaming batches, enabling complex processing tasks like aggregations and transformations that rely on historical data.

  • Watermarks.java
    In this class, watermarks are implemented to handle late data in streaming applications. It allows Spark to manage event time and define thresholds for when data can be considered too late for processing. This is crucial in scenarios where data arrival times are unpredictable.

  • EventTimeWindows.java
    This class focuses on processing data within defined event time windows. It leverages Spark's time-based windowing functions to group data by specified time intervals, allowing for aggregate computations over the window of time in which the events occur.

2. Structured Streaming Package

  • StreamingDatasets.java
    This class introduces the concept of streaming datasets, illustrating how to create and manipulate datasets that are continually updated. It integrates Spark's Dataset API to facilitate operations on streaming data in a structured format, enhancing readability and performance.

  • StreamingJoins.java
    Here, the class implements streaming joins between datasets. It demonstrates how to perform real-time joins on streaming data, ensuring that related data from different sources can be combined seamlessly during the streaming process.

  • StreamingDataFrames.java
    This class provides functionality to work with streaming DataFrames, focusing on the differences between static and streaming DataFrames. It emphasizes the importance of managing the schema and applying transformations to the streaming data efficiently.

  • StreamingAggregations.java
    This class covers various aggregation functions applied to streaming data. It showcases how to compute aggregations like counts, sums, and averages in real-time, emphasizing the continuous nature of streaming data processing.

3. Utilities Package

  • SparkUtils.java
    The utility class that simplifies the creation and management of Spark sessions. It provides a flexible approach to configure Spark applications, ensuring ease of scalability and testability for future extensions.

4. Integrations Package

  • IntegratingKafka.java
    This class illustrates the integration of Spark Streaming with Kafka. It demonstrates how to consume and produce messages from Kafka topics, facilitating the seamless handling of streaming data sourced from a message broker.

5. Common Package

  • Schemas.java
    This class defines schemas used across various DataFrames and Datasets. It encapsulates the structure of the data, ensuring consistency and reliability when processing structured data.

  • Stock.java
    This class represents a model for stock data, encapsulating the properties and behaviors related to stock information within the application.

  • Car.java
    Similar to the Stock class, this class models car data, defining its attributes and any necessary methods to manipulate car-related information in the streaming context.

6. Low Level Package

  • DStreams.java
    This class provides a low-level abstraction for working with DStreams (Discretized Streams) but it has been deprecated.

Conclusion

The Spark Streaming Playground serves as a valuable resource for exploring the capabilities of Apache Spark Streaming. By examining the various classes and their functionalities, users can gain insights into stateful processing, structured streaming, and integrations with external data sources like Kafka.

About

Spark Streaming Playground

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published