Transformations
Read a file from the local FS
For HDFS, specify the path as "hdfs://localhost:9000/lti/sample.txt"
lines = sc.textFile("file:///home/ak/sample.txt", <minPartitions>)
Create an RDD from a list; an optional second argument sets the number of partitions
sc.parallelize(['b', 'c', 'a'])
sc.parallelize(['b', 'c', 'a'], 4)
Flat Map
Applies a function to each element and flattens the results into a single RDD; one input element can produce zero or more output elements
words = textsrc.flatMap(lambda line: line.split())
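A minimal sketch, assuming a running SparkContext sc:
sc.parallelize(["hello world", "foo bar"]).flatMap(lambda line: line.split()).collect()
# ['hello', 'world', 'foo', 'bar']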
Flat Map Values
Flattens the values of each key-value pair; the function can only access the value field, and the key is repeated for each flattened element
x = sc.parallelize([("a", ["x", "y", "z"]), ("b", ["p", "r"])])
x.flatMapValues(lambda value: value).collect()
Map
Applies a function to each element, one output per input; commonly used to create (key, value) pairs. Returns a new RDD
wdmap = words.map(lambda word: (word, 1))
ratings = lines.map(lambda x: x.split()[2])
Map Values
Applies a function to the value of each key-value pair; the key is left unchanged
x = sc.parallelize([("a", ["apple", "banana", "lemon"]), ("b", ["grapes"])])
x.mapValues(lambda f: len(f)).collect()
Filter
The predicate returns a Boolean; only elements for which it returns True pass through
newRDD = rdd.filter(lambda x: "TMIN" in x[1])
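A minimal, self-contained sketch with hypothetical weather records:
data = sc.parallelize([("ITE001", "TMIN", -148), ("ITE001", "TMAX", 150)])
data.filter(lambda x: "TMIN" in x[1]).collect()
# [('ITE001', 'TMIN', -148)]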
Caching of Data
text_file.cache()      # cache with the default storage level (MEMORY_ONLY)
text_file.persist()    # like cache(); optionally takes a StorageLevel
text_file.unpersist()  # remove the RDD from the cache
NOTE
- An action needs to be performed before the data is actually cached (caching is lazy; see the sketch below)
- Cached RDDs are visible under the Storage tab in the Spark Web UI
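A minimal sketch (same sample file as above):
text_file = sc.textFile("file:///home/ak/sample.txt")
text_file.cache()      # marks the RDD for caching; nothing is materialized yet
text_file.count()      # first action computes the RDD and fills the cache
text_file.count()      # later actions read from the cache
text_file.unpersist()  # free the cached data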
Number of Partitions
textfile.getNumPartitions()
Partitions
rdd.repartition(4)   # increase or decrease the number of partitions; always shuffles data
rdd.coalesce(<num>)  # reduce the number of partitions; avoids a full shuffle
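A minimal sketch, assuming a running SparkContext sc:
rdd = sc.parallelize(range(100), 8)
rdd.getNumPartitions()                 # 8
rdd.repartition(4).getNumPartitions()  # 4 (data is shuffled)
rdd.coalesce(2).getNumPartitions()     # 2 (partitions merged, no full shuffle)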
Actions
Retrieve the RDD's contents
counts.collect()  # return all elements to the driver
rdd.take(<num>)   # return only the first <num> elements
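A minimal sketch:
nums = sc.parallelize([5, 3, 1, 4, 2])
nums.collect()  # [5, 3, 1, 4, 2]
nums.take(3)    # [5, 3, 1]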
Save the RDD as a text file
counts.saveAsTextFile("/lti/sparkwc")
Count By Value
Counts the occurrences of each unique value; returns a defaultdict of (value, count) pairs
result = ratings.countByValue()
for key, value in result.items():
    print(str(key) + " " + str(value))
Reduce by Key
Aggregates the values of each key with an associative, commutative function
Values are combined locally on each partition (a map-side combine) before being shuffled and merged, so less data crosses the network
counts = wdmap.reduceByKey(lambda a, b: a + b)
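Putting the snippets above together, a minimal end-to-end word-count sketch:
lines = sc.textFile("file:///home/ak/sample.txt")
words = lines.flatMap(lambda line: line.split())
wdmap = words.map(lambda word: (word, 1))
counts = wdmap.reduceByKey(lambda a, b: a + b)
counts.collect()  # list of (word, count) pairs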
Group By Key
Groups all values with the same key into an iterable; there is no map-side combine, so every value is shuffled across the network. Prefer reduceByKey for aggregations
groupByKey()
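A minimal sketch; the grouped values are iterables, materialized here with list():
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
pairs.groupByKey().mapValues(list).collect()
# e.g. [('a', [1, 3]), ('b', [2])]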
Sort By Key
Sort RDD by key values
wordCount.map(lambda kv: (kv[1], kv[0])).sortByKey()  # swap to (count, word), then sort by count
Create RDD of just Keys or Values
keys(), values()
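A minimal sketch:
pairs = sc.parallelize([("a", 1), ("b", 2)])
pairs.keys().collect()    # ['a', 'b']
pairs.values().collect()  # [1, 2]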