ICP2
Sub-Team Members Class ID: 5-2 15 Naga Venkata Satya Pranoop Mutha 5-2 23 Geovanni West
Spark Transformations and Actions:
A Spark job may be written in Scala or Python and can use SQL and Hive contexts. The job is packaged as a jar file or a Python script file.
A transformation is a function that produces a new RDD from existing RDDs: it takes an RDD as input and returns one or more RDDs as output. Transformations are lazy, so nothing is computed until an action is called. An action computes a result from an RDD, such as returning a value to the driver or writing to storage, without changing the RDD's data.
We use the following transformations and actions in this lab assignment.
- map(func) : Return a new distributed dataset formed by passing each element of the source through a function func.
- flatMap(func) : Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
- groupByKey([numTasks]) : When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance. Note: By default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional numTasks argument to set a different number of tasks.
- sortByKey([ascending], [numTasks]) : When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.
- saveAsTextFile(path) : Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.
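To make the semantics of these operations concrete, here is a minimal plain-Python sketch that mimics them on ordinary lists. It does not use Spark: the helper names (`rdd_map`, `rdd_flat_map`, `rdd_group_by_key`, `rdd_sort_by_key`) are hypothetical stand-ins for the RDD methods they are named after, and the sample data is made up for illustration.

```python
from collections import defaultdict


def rdd_map(func, items):
    """Like RDD.map: exactly one output element per input element."""
    return [func(x) for x in items]


def rdd_flat_map(func, items):
    """Like RDD.flatMap: func returns a sequence, and the sequences are flattened."""
    return [y for x in items for y in func(x)]


def rdd_group_by_key(pairs):
    """Like RDD.groupByKey: collect all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return list(groups.items())


def rdd_sort_by_key(pairs, ascending=True):
    """Like RDD.sortByKey: order (key, value) pairs by key only."""
    return sorted(pairs, key=lambda kv: kv[0], reverse=not ascending)


lines = ["to be", "or not"]
mapped = rdd_map(str.split, lines)          # [['to', 'be'], ['or', 'not']]
flattened = rdd_flat_map(str.split, lines)  # ['to', 'be', 'or', 'not']

pairs = [("b", 2), ("a", 1), ("b", 3)]
grouped = rdd_group_by_key(pairs)           # [('b', [2, 3]), ('a', [1])]
ordered = rdd_sort_by_key(grouped)          # [('a', [1]), ('b', [2, 3])]
```

Note the difference between `mapped` (a list of lists) and `flattened` (a single list of words). Unlike this local stand-in, the real groupByKey gives no guarantee about the order of keys or of values within a group, which is why sortByKey is applied afterwards when ordered output is wanted.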
This ICP is mainly about applying Spark Transformations and Actions.
The use case is as follows: we are given a set of words or a paragraph. We need to identify the words that start with the same letter and group them into tuples, so that each tuple contains the starting letter and the words that begin with it. Let us take an example for this ICP.
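As a small illustration of the target output shape, the sketch below groups four made-up words by their first letter in plain Python, with no Spark involved. The word list and variable names are assumptions chosen only for this example.

```python
from itertools import groupby

words = ["kangaroo", "apple", "kingfisher", "ant"]

# groupby only merges adjacent items, so sort by first letter first.
by_letter = sorted(words, key=lambda w: w[0])
grouped = [
    (letter, list(group))
    for letter, group in groupby(by_letter, key=lambda w: w[0])
]

print(grouped)
# [('a', ['apple', 'ant']), ('k', ['kangaroo', 'kingfisher'])]
```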

Now that we know how the output should look, we try it out on a sample paragraph of input data.
Input:
The 2007 United States Air Force nuclear weapons incident occurred on 29–30 August 2007. Six AGM-129 ACM cruise missiles, each loaded with a W80-1 variable yield nuclear warhead, were mistakenly loaded onto a United States Air Force (USAF) B-52H heavy bomber at Minot Air Force Base in North Dakota and transported to Barksdale Air Force Base in Louisiana. The nuclear warheads in the missiles were supposed to have been removed before the missiles were taken from their storage bunker. The missiles with the nuclear warheads were not reported missing, and remained mounted to the aircraft at both Minot and Barksdale for 36 hours. During this period, the warheads were not protected by the various mandatory security precautions for nuclear weapons. The incident was reported to the top levels of the United States military and referred to by observers as a Bent Spear incident, which indicates a nuclear weapon incident that is of significant concern but does not involve the immediate threat of nuclear war. In response to the incident, the United States Department of Defense (DoD) and USAF conducted an investigation, the results of which were released on 19 October 2007. The investigation concluded that nuclear weapons handling standards and procedures had not been followed by numerous USAF personnel involved in the incident. As a result, four USAF commanders were relieved of their commands, numerous other USAF personnel were disciplined and/or decertified to perform certain types of sensitive duties, and further cruise missile transport missions from – and nuclear weapons operations at – Minot Air Force Base were suspended. In addition, the USAF issued new nuclear weapons handling instructions and procedures. 
Separate investigations by the Defense Science Board and a USAF "blue ribbon" panel reported that concerns existed on the procedures and processes for handling nuclear weapons within the Department of Defense but did not find any failures with the security of United States nuclear weapons. Based on this and other incidents, on 5 June 2008, Secretary of the Air Force Michael Wynne and Chief of Staff of the Air Force General T Michael Moseley were asked for their resignations, which were given. In October 2008, in response to recommendations by a review committee, the USAF announced the creation of Air Force Global Strike Command to control all USAF nuclear bombers, missiles, and personnel.
Source Code:

Steps to reproduce the output:
- First, we load the text file into a variable named file.
- Then we split the text on blank spaces so that every word is separated into an individual word.
- Then we apply a flatMap transformation to the individual words.
- Then we apply a map transformation that pairs each word with its first letter, giving the first letter followed by the word. E.g.: (K, Kangaroo), (K, Kingfisher)
- Then we apply the groupByKey transformation and append the words of each group to a list. E.g.: (K, Kangaroo, Kingfisher)
- Finally, we write this list to a text file using the saveAsTextFile action.
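The steps above can be sketched in PySpark roughly as follows. The original source code is not reproduced in this text, so this is an assumption about its shape rather than the authors' actual code: the function names, the `input_path` and `output_path` parameters, and the choices to sort the groups by key and to turn each group into a list are all hypothetical. `group_words_local` is a plain-Python stand-in that applies the same helper functions without a cluster.

```python
from collections import defaultdict


def split_words(line):
    """Split a line on whitespace; str.split() drops empty strings."""
    return line.split()


def first_letter_pair(word):
    """Key each word by its first character, e.g. ('K', 'Kangaroo')."""
    return (word[0], word)


def group_words_spark(sc, input_path, output_path):
    """Run the pipeline on an existing SparkContext `sc` (assumed to be provided)."""
    file = sc.textFile(input_path)                  # load the text file
    words = file.flatMap(split_words)               # one record per word
    pairs = words.map(first_letter_pair)            # (first letter, word)
    groups = pairs.groupByKey().mapValues(list)     # (letter, [words...])
    groups.sortByKey().saveAsTextFile(output_path)  # action: triggers the job


def group_words_local(lines):
    """Plain-Python equivalent of the same steps, useful for checking the logic."""
    groups = defaultdict(list)
    for line in lines:
        for word in split_words(line):
            letter, value = first_letter_pair(word)
            groups[letter].append(value)
    return sorted(groups.items())


sample = ["Kangaroo and Kingfisher", "are animals"]
result = group_words_local(sample)
# [('K', ['Kangaroo', 'Kingfisher']), ('a', ['and', 'are', 'animals'])]
```

Splitting with `split()` rather than `split(" ")` is a deliberate choice in this sketch: it avoids empty strings when the text contains consecutive spaces, which would otherwise make `word[0]` fail. Keys are case-sensitive here, so "K" and "k" form separate groups. saveAsTextFile writes a directory of part files and fails if the output directory already exists.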
Output:
