A stream of entities is called a discretized stream, or DStream
DStream = sequence of RDDs
Entities in a stream are grouped into batches
Each batch = 1 RDD
Batches are formed based on the batch interval
All entities received within one batch interval form one RDD
Within a DStream, Spark performs operations on the individual RDDs (see the sketch below)
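A minimal sketch of this model: foreachRDD hands each batch of a DStream to a function as one ordinary RDD (the socket source and port match the full example below; the app name DStreamBatches is illustrative)

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "DStreamBatches")
ssc = StreamingContext(sc, 10)  # batch interval of 10 seconds
lines = ssc.socketTextStream("localhost", 9999)

def show_batch(time, rdd):
    # Each batch arrives here as one ordinary RDD holding that interval's entities
    print("Batch at %s contains %d entities" % (time, rdd.count()))

lines.foreachRDD(show_batch)
ssc.start()
ssc.awaitTermination()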
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
# When running from the shell, the existing SparkContext must be stopped and a new one created
# Only one SparkContext can be active at a time
sc.stop()
# local[2]: at least two threads are needed, one for the receiver and one for processing
sc = SparkContext("local[2]", "NetworkWordCount")
# Batch interval of 10 seconds: each 10-second batch becomes one RDD
ssc = StreamingContext(sc, 10)
# Read data from the socket; each line is one entity in the DStream
lines = ssc.socketTextStream("localhost", 9999)
# Split each line into words
words = lines.flatMap(lambda line: line.split(" "))
# Pair each word with a count of 1
pairs = words.map(lambda word: (word, 1))
# Sum the counts per word within each batch
wordCounts = pairs.reduceByKey(lambda x, y: x + y)
# Print the first elements of each batch
wordCounts.pprint()
ssc.start()
# Block until the streaming job is stopped or fails
ssc.awaitTermination()
# Stop the streaming job
ssc.stop()
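For bounded runs (e.g., when testing), awaitTerminationOrTimeout can replace awaitTermination; a sketch assuming a 60-second time limit:

ssc.start()
# Returns True if the job terminated on its own within the timeout
if not ssc.awaitTerminationOrTimeout(60):
    # Stop the streaming job but keep the SparkContext for further work
    ssc.stop(stopSparkContext=False, stopGraceFully=True)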
We can use Netcat to create a streaming data source:
nc -lk 9999
-l: listen mode (listen on the given port for an incoming connection)
-k: keep listening after a connection closes (accept further connections)
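If Netcat is not available, a small Python script can play the same role; a minimal sketch (only the port 9999 comes from the example above, the rest is illustrative):

import socket
import time

# Listen on the port the streaming job reads from
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("localhost", 9999))
server.listen(1)
conn, _ = server.accept()  # blocks until the streaming job connects
# Send one line per second; each line becomes one entity in the stream
for i in range(100):
    conn.sendall(("hello spark streaming %d\n" % i).encode())
    time.sleep(1)
conn.close()
server.close()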