

Introduction

While many of us are used to executing Spark applications with the ‘spark-submit’ command, the popularity of Databricks is pushing this once-routine activity into the background. Databricks has made it very easy to provision Spark-enabled VMs on the two most popular cloud platforms, namely AWS and Azure, and a couple of weeks ago Databricks announced availability on GCP as well. The beauty of the Databricks platform is how easy it is to get started on it. While Spark application development will continue to have its challenges, depending on the problem being addressed, the Databricks platform takes away the pain of having to establish and manage your own Spark cluster.

Using Databricks

Once registered on the platform, Databricks allows us to define a cluster of one or more VMs, with configurable RAM and executor specifications. We can also configure the cluster to launch a minimum number of VMs at startup and then scale up to a maximum number of VMs as required. After defining the cluster, we define jobs and notebooks. Notebooks contain the actual code executed on the cluster, and they need to be assigned to jobs because the Databricks cluster executes jobs (not notebooks). Databricks also allows us to set up the cluster so that it downloads additional JARs and/or Python packages during cluster startup, and we can upload and install our own packages as well (I used a Python wheel).
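For concreteness, here is a minimal sketch of what that kind of configuration can look like when driven from Python through the Databricks REST API, rather than through the UI. The workspace URL, token, node type, runtime version, wheel path, and worker counts below are placeholder values of my own, not taken from the article.

import requests

# Placeholder workspace URL and personal access token (hypothetical values).
WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"
headers = {"Authorization": "Bearer " + TOKEN}

# Example cluster spec with autoscaling between a minimum and maximum number of workers.
cluster_spec = {
    "cluster_name": "demo-autoscaling-cluster",
    "spark_version": "7.3.x-scala2.12",   # example runtime; pick one your workspace offers
    "node_type_id": "i3.xlarge",          # example AWS VM type; use an Azure/GCP type as needed
    "autoscale": {"min_workers": 1, "max_workers": 4},
}

# Create the cluster and capture its ID from the response.
resp = requests.post(WORKSPACE_URL + "/api/2.0/clusters/create",
                     headers=headers, json=cluster_spec)
cluster_id = resp.json()["cluster_id"]

# Attach a library (for example, an uploaded Python wheel) so it is installed on the cluster.
requests.post(WORKSPACE_URL + "/api/2.0/libraries/install",
              headers=headers,
              json={"cluster_id": cluster_id,
                    "libraries": [{"whl": "dbfs:/FileStore/wheels/my_package-0.1-py3-none-any.whl"}]})

The same settings can of course be made interactively in the workspace UI; the API calls simply make the min/max worker and library-installation options explicit.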

Source of the article on DZONE

I am excited to share my experience with Spark Streaming, a tool I have been playing with on my own. Before we get started, let’s have a sneak peek at the code that lets you watch some data stream through a sample application.

from operator import add, sub
from time import sleep
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Set up the Spark context and the streaming context
sc = SparkContext(appName="PysparkNotebook")
ssc = StreamingContext(sc, 1)

# Input data
rddQueue = []
for i in range(5):
    rddQueue += [ssc.sparkContext.parallelize([i, i + 1])]

inputStream = ssc.queueStream(rddQueue)
inputStream.map(lambda x: "Input: " + str(x)).pprint()
inputStream.reduce(add) \
    .map(lambda x: "Output: " + str(x)) \
    .pprint()

ssc.start()
sleep(5)
ssc.stop(stopSparkContext=True, stopGraceFully=True)

Spark Streaming has a different view of data than Spark. In non-streaming Spark, all data is put into a Resilient Distributed Dataset, or RDD. That isn’t good enough for streaming. In Spark Streaming, the main noun is the DStream, or Discretized Stream, which is essentially a sequence of RDDs. The verbs are pretty much the same: just as we have actions and transformations on RDDs, we also have actions and transformations on DStreams.
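As a small sketch of those verbs, here is a standalone variant of the example above (the queue contents, batch interval, and 5-second timeout are just illustrative choices of mine): filter and map are transformations that produce new DStreams, while pprint is an output operation that actually triggers processing on each batch.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="DStreamVerbs")
ssc = StreamingContext(sc, 1)  # 1-second batch interval

# Build a toy DStream from a queue of RDDs, as in the example above.
queue = [ssc.sparkContext.parallelize(range(i, i + 5)) for i in range(3)]
numbers = ssc.queueStream(queue)

# Transformations return new DStreams, just like RDD transformations...
evens = numbers.filter(lambda x: x % 2 == 0)
doubled = evens.map(lambda x: x * 2)

# ...and output operations (the streaming counterpart of actions) trigger the work.
doubled.pprint()

ssc.start()
ssc.awaitTerminationOrTimeout(5)
ssc.stop(stopSparkContext=True, stopGraceFully=True)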

Source of the article on DZONE