Project1.2 Parallel Programming using EMR

Goal: filter and analyze a large dataset using Hadoop MapReduce through AWS EMR.

MapReduce configuration

Please do not use “.” (periods) or, in general, any other non-alphanumeric characters in your bucket name (the bucket to which your mapper and reducer code is uploaded); otherwise the EMR job might fail.
You may want to preserve your cluster by unchecking the “Terminate on failure” option and adding steps manually in the EMR web console.
Config:

jar -cvf Mapper.jar Mapper.class
jar -cvf Reducer.jar Reducer.class

phoenixpan@Ghost:~/Desktop$ javac -cp test.jar Main.java    # doesn't work
phoenixpan@Ghost:~/Desktop$ java -cp test.jar Main          # works

java -cp Mapper.jar Mapper
java -cp Reducer.jar Reducer

-files s3://ccproject0102/Mapper.jar,s3://ccproject0102/Reducer.jar

Extra practice: find common friends

mapr

Project1.1 Sequential Programming

Goal: write scripts and code to filter and analyze a medium-sized dataset locally.

Filter

First, we need to filter the data with our own program before analyzing it. I used Java, but you can use Python or Bash if you are comfortable with them. In my program, I used a BufferedReader to check each line. Since the resulting data set has to be sorted, I temporarily store all matches in an ArrayList, which is sorted at the end. For a larger data set this approach is not recommended, as it will exhaust your memory. If you’re confused by any of the filtering conditions, ask your TAs to clarify them, as they can affect your result greatly.
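
The read, collect, then sort approach described above can be sketched roughly as follows. This is only a minimal illustration, not my actual project code: the class name FilterSketch, the keep() condition (a plain keyword check), and the in-memory sample input are all assumptions made for the example. The real filtering conditions come from the project write-up.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class FilterSketch {
    // Hypothetical filter condition: keep lines that contain the keyword.
    // Replace this with the real conditions from the project description.
    static boolean keep(String line, String keyword) {
        return line.contains(keyword);
    }

    // Reads the source line by line, stores every matching line in memory,
    // and sorts the matches once at the end.
    static List<String> filterAndSort(Reader source, String keyword) {
        List<String> matches = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (keep(line, keyword)) {
                    matches.add(line);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Collections.sort(matches); // natural (lexicographic) String order
        return matches;
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for the real input file.
        String sample = "b keep\nx drop\na keep\n";
        for (String line : filterAndSort(new StringReader(sample), "keep")) {
            System.out.println(line);
        }
    }
}
```

To read a real file, pass a java.io.FileReader instead of the StringReader. Every match stays in the ArrayList until the final sort, which is exactly why this does not scale to a larger data set.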

Analysis

As for the analysis, we have to answer nine questions using the data set we got. If you think your answer-hunting programs are correct but the answers still look wrong, go back and check your filter program to see whether you’ve misunderstood some conditions. For the first several questions, such as counting the total number of lines, I strongly recommend using Bash. One reason is that the programs will be really simple: you may need only one line to do the job. Another is that you will have to use Bash in a future project anyway. Yes, you can’t run away from it, so enjoy. For the more complicated questions, you will use regular expressions to match the words. This site can check your expressions: https://regex101.com/. To fill in the answer sheet (runner.sh), open it with a text editor (I used vim) and put your code in.

For Java users, replace the “echo” with “javac” and “java” there. Please also upload your code to the same folder:

answer_2() {
        javac Project1_1.java
        java Project1_1
}

For Bash users, replace the “echo” with your command:

answer_1() {
	grep -P 'Keyword' FileName.csv | wc -l
}
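
If you would rather stay in Java for this kind of question, a rough equivalent of the grep-and-count one-liner might look like the sketch below. The class name GrepCount and the sample lines are made up for illustration. Note also that Java’s regex syntax is close to, but not identical to, the Perl-style syntax that grep -P uses.

```java
import java.util.regex.Pattern;
import java.util.stream.Stream;

class GrepCount {
    // Counts the lines that contain at least one match of the regex,
    // which is what `grep -P 'regex' file | wc -l` reports.
    static long countMatchingLines(Stream<String> lines, String regex) {
        Pattern pattern = Pattern.compile(regex);
        return lines.filter(line -> pattern.matcher(line).find()).count();
    }

    public static void main(String[] args) {
        // Made-up sample lines; for a real file, something like
        // java.nio.file.Files.lines(java.nio.file.Paths.get("FileName.csv"))
        // could supply the stream instead (it throws IOException).
        Stream<String> sample = Stream.of("a,Keyword,1", "b,other,2", "c,Keyword,3");
        System.out.println(countMatchingLines(sample, "Keyword"));
    }
}
```

The find() call matches anywhere in the line, the same way grep does. Using matches() instead would require the whole line to match the pattern.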

Test your answers by running

./runner.sh

If everything seems alright, submit it. You should have unlimited tries throughout all the projects, but please confirm this with your TAs.

Overall, it’s an easy project to warm you up. I have no programming background, but it took me only one day to finish. Cheers.
