
Spark Streaming can be used to read data from a source in a streaming fashion. We simply create a read stream from the data source and then a write stream to load the data into a target data source.

For this demo, I will assume that different JSON payloads are coming into a Kafka topic, and that we need to transform them and write them to another Kafka topic.
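One way to wire this up is with the Kafka source and sink of Spark Structured Streaming. The sketch below is only illustrative: the broker address, the topic names input-topic and output-topic, the payload schema, and the doubling of the amount field are all assumptions, not details from the original demo.

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, to_json, struct, col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Requires the Kafka connector on the classpath, e.g.
# --packages org.apache.spark:spark-sql-kafka-0-10_2.12:<spark version>
spark = SparkSession.builder.appName("KafkaToKafka").getOrCreate()

# Assumed schema for the incoming JSON payloads
schema = StructType([
    StructField("id", StringType()),
    StructField("amount", IntegerType()),
])

# Read stream from the source Kafka topic
source = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")   # assumed broker
    .option("subscribe", "input-topic")                     # assumed topic name
    .load())

# Parse the JSON value, apply an example transformation, and re-serialize
transformed = (source
    .select(from_json(col("value").cast("string"), schema).alias("payload"))
    .select("payload.*")
    .withColumn("amount", col("amount") * 2)                # example transformation
    .select(to_json(struct(col("id"), col("amount"))).alias("value")))

# Write stream into the target Kafka topic
query = (transformed.writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("topic", "output-topic")                        # assumed topic name
    .option("checkpointLocation", "/tmp/kafka-to-kafka-checkpoint")
    .start())

query.awaitTermination()

The checkpoint location is required by the Kafka sink so the query can recover its progress after a restart.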

Source of the article on DZONE

I am excited to share my experience with Spark Streaming, a tool I have been exploring on my own. Before we get started, let's take a sneak peek at the code that lets you watch some data stream through a sample application.

from operator import add, sub
from time import sleep
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Set up the Spark context and the streaming context
sc = SparkContext(appName="PysparkNotebook")
ssc = StreamingContext(sc, 1)

# Input data
rddQueue = []
for i in range(5):
    rddQueue += [ssc.sparkContext.parallelize([i, i+1])]

inputStream = ssc.queueStream(rddQueue)
inputStream.map(lambda x: "Input: " + str(x)).pprint()
inputStream.reduce(add).map(lambda x: "Output: " + str(x)).pprint()

ssc.start()
sleep(5)
ssc.stop(stopSparkContext=True, stopGraceFully=True)

Spark Streaming has a different view of data than Spark. In non-streaming Spark, all data is put into a Resilient Distributed Dataset, or RDD. That isn't good enough for streaming. In Spark Streaming, the main noun is the DStream, or Discretized Stream, which is essentially a sequence of RDDs. The verbs are pretty much the same: just as we have actions and transformations on RDDs, we also have actions and transformations on DStreams.
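To make that "sequence of RDDs" idea concrete, here is a small sketch that peeks at the RDD behind each batch. The socket text source on localhost:9999 and the word-splitting logic are assumed examples, not part of the application above.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="DStreamPeek")
ssc = StreamingContext(sc, 1)   # 1-second batch interval

def handle_batch(time, rdd):
    # Each batch interval hands us a plain RDD, so the usual RDD
    # actions and transformations apply here.
    print("Batch at {}: {}".format(time, rdd.collect()))

lines = ssc.socketTextStream("localhost", 9999)       # assumed text source
words = lines.flatMap(lambda line: line.split(" "))   # transformation on the DStream
words.foreachRDD(handle_batch)                        # touch the underlying RDDs

ssc.start()
ssc.awaitTerminationOrTimeout(10)
ssc.stop(stopSparkContext=True, stopGraceFully=True)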

Source of the article on DZONE