016.3Sum Closest

Posted on 2017-09-13 | In LeetCode

Solution 1: skipped

Brute force with 3 loops, time complexity O(n^3)
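For reference, a minimal sketch of that brute-force approach (not submitted) could look like this:

// Brute-force sketch: try every triple and keep the sum closest to the target.
public int threeSumClosestBruteForce(int[] nums, int target) {
    int result = nums[0] + nums[1] + nums[2];
    for (int i = 0; i < nums.length - 2; i++) {
        for (int j = i + 1; j < nums.length - 1; j++) {
            for (int k = j + 1; k < nums.length; k++) {
                int sum = nums[i] + nums[j] + nums[k];
                if (Math.abs(sum - target) < Math.abs(result - target)) {
                    result = sum;
                }
            }
        }
    }
    return result;
}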

Solution 2: accepted 66%

Time: O(n^2), Space: O(1)
Two-pointer solution, similar to Question 015, 3Sum.

public class Solution {
    public int threeSumClosest(int[] nums, int target) {
        int len = nums.length;
        int gap = Integer.MAX_VALUE;
        int result = 0;
        if (len == 3)
            return nums[0] + nums[1] + nums[2];
        Arrays.sort(nums);
        for (int i = 0; i < len - 2; i++) {
            int j = i + 1;
            int k = len - 1;
            while (k > j) {
                int sum = nums[i] + nums[j] + nums[k];
                // Track the sum closest to the target so far
                if (Math.abs(sum - target) < gap) {
                    gap = Math.abs(sum - target);
                    result = sum;
                }
                if (sum == target) {
                    return sum;
                } else if (sum > target) {
                    k--;    // sum too large: move the right pointer left
                } else {
                    j++;    // sum too small: move the left pointer right
                }
            }
        }
        return result;
    }
}

017.Letter Combinations of a Phone Number

Posted on 2017-09-13 | In LeetCode

Solution 1: accepted 4ms

Pretty messy; it needs to be improved.

public List<String> letterCombinations(String digits) {
    List<String> result = new ArrayList<>();
    String[] map = {"", "", "abc", "def", "ghi", "jkl", "mno", "pqrs", "tuv", "wxyz"};
    for (int i = 0; i < digits.length(); i++) {
        char digit = digits.charAt(i);
        if (digit == '0' || digit == '1')   // 0 and 1 map to no letters
            return new ArrayList<>();
        result = appendAll(result, map[digit - '0']);
    }
    return result;
}

public List<String> appendAll(List<String> result, String chars) {
    List<String> update = new ArrayList<>();
    Iterator<String> it = result.iterator();
    if (result.size() == 0) {
        // First digit: seed the list with its letters
        for (int i = 0; i < chars.length(); i++) {
            update.add(chars.charAt(i) + "");
        }
    } else {
        // Append each letter of the current digit to every existing combination
        while (it.hasNext()) {
            String current = it.next();
            for (int i = 0; i < chars.length(); i++) {
                update.add(current + chars.charAt(i));
            }
        }
    }
    return update;
}
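One way to tidy this up (just a sketch, not the submitted solution) is the standard backtracking approach, which builds each combination recursively:

public List<String> letterCombinationsBacktrack(String digits) {
    List<String> result = new ArrayList<>();
    if (digits == null || digits.length() == 0) return result;
    String[] map = {"", "", "abc", "def", "ghi", "jkl", "mno", "pqrs", "tuv", "wxyz"};
    backtrack(result, new StringBuilder(), digits, 0, map);
    return result;
}

private void backtrack(List<String> result, StringBuilder current, String digits, int index, String[] map) {
    // A complete combination has one letter per digit.
    if (index == digits.length()) {
        result.add(current.toString());
        return;
    }
    String letters = map[digits.charAt(index) - '0'];
    for (char c : letters.toCharArray()) {
        current.append(c);
        backtrack(result, current, digits, index + 1, map);
        current.deleteCharAt(current.length() - 1);   // undo the choice
    }
}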

110.Balanced Binary Tree

Posted on 2017-09-13 | In LeetCode

Solution 1: accepted

Use -1 as a flag for an unbalanced tree. Using a separate variable would be better practice in production code.

class Solution {
    public boolean isBalanced(TreeNode root) {
        return maxDepth(root) != -1;
    }

    private int maxDepth(TreeNode root) {
        if (root == null) {
            return 0;
        }
        int left = maxDepth(root.left);
        int right = maxDepth(root.right);
        // -1 propagates up as soon as any subtree is unbalanced
        if (Math.abs(left - right) > 1 || left == -1 || right == -1) {
            return -1;
        }
        return Math.max(left, right) + 1;
    }
}
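For the "another variable" idea mentioned above, a sketch that uses a boolean field instead of the -1 sentinel might look like this:

class Solution {
    private boolean balanced = true;   // separate flag instead of the -1 sentinel

    public boolean isBalanced(TreeNode root) {
        maxDepth(root);
        return balanced;
    }

    private int maxDepth(TreeNode root) {
        if (root == null || !balanced) {
            return 0;
        }
        int left = maxDepth(root.left);
        int right = maxDepth(root.right);
        if (Math.abs(left - right) > 1) {
            balanced = false;
        }
        return Math.max(left, right) + 1;
    }
}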

Linux - screen

Posted on 2017-08-16 | In Linux

Screen

Basic

All commands start with ctrl+a; follow it with ? to see the full command reference.

  1. c: create a new window
  2. n and p: go to the next / previous window
  3. w: list all windows
  4. (ctrl+)": go directly to the wanted window
  5. A: change the window's name (default: bash)
  6. k: kill the current window
  7. \: quit screen and kill all windows
  8. d: detach from the screen session back to the main console; screen -r re-attaches to it
  9. What if we have multiple screen sessions? First, use screen -ls to list all sessions with their IDs, then use screen -r 1234 to attach to a specific session

Keep processes running after we disconnect

  1. Enter screen with the screen command
  2. Create a new window with c
  3. Run the process there
  4. Detach from screen with d
  5. Disconnect

Reference1

Twitter Analytics Web Service

Posted on 2017-07-28 | In 15619 Cloud Computing

Social Networking Timeline with Heterogeneous Backends

Posted on 2017-07-25 | In 15619 Cloud Computing

Social Networking Timeline with Heterogeneous Backends

Overall

(Figure: overall architecture)
The three data sources are MySQL, HBase, and MongoDB. We need to load the given data into the databases and optimize them to meet the required performance.

Source code: https://github.com/PhoenixPan/CloudComputing/tree/master/Project3.4

Task1: Implementing Basic Login with MySQL on RDS

Dataset:

  1. users.csv [UserID, Password]
  2. userinfo.csv [UserID, Name, Profile Image URL]

Request format: GET /task1?id=[UserID]&pwd=[Password]
Response format: {"name":"my_name", "profile":"profile_image_url"}

If the ID and password combination is found in the database, return and display the user's profile image (as in the image below); otherwise, return and display an error message. A JDBC sketch follows the screenshot.

(Figure: Task 1 result)
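The backend is a Java servlet, so the check could look roughly like the JDBC sketch below. The table and column names (users, userinfo, user_id, etc.), the connection URL, and the credentials are placeholders for illustration, not the project's exact schema.

// Hedged sketch: verify the credentials and fetch the profile via JDBC.
import java.sql.*;

public class LoginDao {
    private static final String URL = "jdbc:mysql://rds-endpoint:3306/minisite";   // placeholder endpoint

    public static String checkLogin(String id, String pwd) throws SQLException {
        String sql = "SELECT u.name, u.profile_image_url "
                   + "FROM users c JOIN userinfo u ON c.user_id = u.user_id "
                   + "WHERE c.user_id = ? AND c.password = ?";
        try (Connection conn = DriverManager.getConnection(URL, "user", "password");
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, id);
            ps.setString(2, pwd);
            try (ResultSet rs = ps.executeQuery()) {
                if (rs.next()) {
                    return "{\"name\":\"" + rs.getString(1)
                         + "\",\"profile\":\"" + rs.getString(2) + "\"}";
                }
                return "{\"error\":\"Unauthorized\"}";   // no matching id/password
            }
        }
    }
}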

Task2: Storing Social Graph using HBase

Dataset:
links.csv [Followee, Follower]

Request format: GET /task2?id=[UserID]
Response format:

{
  "followers": [
    {
      "name": "follower_name_1",
      "profile": "profile_image_url_1"
    },
    {
      "name": "follower_name_2",
      "profile": "profile_image_url_2"
    }
  ]
}

In HBase, we store the data with the follower as the row key and one column per followee: key: follower, columns: followee1, followee2, followee3, .... To serve a request (see the sketch after this list):

  1. Get all followees of the given user
  2. Extract the followees' IDs
  3. Sort them by name in ascending alphabetical order
  4. Look up their profile images in the MySQL database, as in Task 1, and return them
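A rough sketch of the HBase read, using the newer HBase Java client (the project itself ran HBase 0.94, whose client API differs slightly; the table name "links" and column family "followees" are assumptions):

// Hedged sketch: fetch the row keyed by the user ID; each column qualifier is one followee ID.
Configuration conf = HBaseConfiguration.create();
try (Connection conn = ConnectionFactory.createConnection(conf);
     Table table = conn.getTable(TableName.valueOf("links"))) {
    Get get = new Get(Bytes.toBytes(userId));
    Result row = table.get(get);
    List<String> followees = new ArrayList<>();
    for (byte[] qualifier : row.getFamilyMap(Bytes.toBytes("followees")).keySet()) {
        followees.add(Bytes.toString(qualifier));
    }
    // Then look up each ID's name and profile image in MySQL, as in Task 1.
}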

Eventually, we could display something like:
(Figure: Task 2 result)

Task3: Build Homepage using MongoDB

Dataset: posts.csv [posts in JSON format]
Request format: GET /task3?id=[UserID]
Response format: {"posts":[{post1_json}, {post2_json}, ...]}

Find the posts that match the required "uid", sort them in ascending timestamp order, and return them as the response (a sketch follows the list below).

  1. Remove the _id field when getting the JSON from MongoDB.
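A rough sketch with the MongoDB Java driver (the database/collection names and the connection URI are placeholders):

// Hedged sketch: find posts by uid, sort by ascending timestamp, and drop the _id field.
import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Projections.excludeId;
import static com.mongodb.client.model.Sorts.ascending;

import com.mongodb.client.*;
import org.bson.Document;
import java.util.*;

public class PostDao {
    public static List<Document> postsFor(String uid) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {   // placeholder URI
            MongoCollection<Document> posts =
                client.getDatabase("minisite").getCollection("posts");
            List<Document> result = new ArrayList<>();
            posts.find(eq("uid", uid))
                 .sort(ascending("timestamp"))
                 .projection(excludeId())   // remove the _id field
                 .into(result);
            return result;
        }
    }
}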

Example results:
(Figure: Task 3 example result)

Task4: Put Everything Together

Request format: http://backend-public-dns:8080/MiniSite/task4?id=99
Response format:
One single JSON object includes user name, user profile, an array of followers, and an array of posts.

jsonResponse.put("name", name);
jsonResponse.put("profile", profile);
jsonResponse.put("followers", followersArray);
jsonResponse.put("posts", postsArray);

As the title suggests, we put the previous three tasks together. We need to:

  1. Get the user's profile
  2. Get all the followers
  3. Get the most recent 30 posts for each user, sorted by timestamp and then by post ID
  4. Return the most recent 30 posts from all the posts selected in step 3 (see the merging sketch below)
  • This is a complicated system; I cover the details of designing such a news-feed system in a separate article.
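For steps 3 and 4, the merge can be a single sort over all collected posts. A hedged sketch (the field names "timestamp" and "pid" and the Document type are assumptions carried over from the MongoDB task):

// Hedged sketch: from all collected posts, keep the 30 most recent,
// ordered by timestamp (newest first) and then by post ID.
static List<Document> latest30(List<Document> allSelectedPosts) {
    List<Document> merged = new ArrayList<>(allSelectedPosts);
    merged.sort(Comparator.comparingLong((Document d) -> d.getLong("timestamp"))
                          .reversed()
                          .thenComparing((Document d) -> d.getString("pid")));
    return merged.subList(0, Math.min(30, merged.size()));
}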

Final Result:
(Figure: Task 4 result)

Bonus Task: Basic Recommendation

Request format: http://backend-public-dns:8080/MiniSite/task5?id=<user_id>
Response format: 10 users that appeared most frequently

We were asked to implement a very simple yet successful recommendation model: collaborative filtering. Simply speaking, in a directed graph, we need to find all qualified users R where min_distance(me, R) = 2. For example:

  1. Given:
    A follows {B, C, D}
    B follows {C, E, A}
    C follows {F, G}
    D follows {G, H}
  2. To recommend users to A, we combine the followees of B, C, and D and get:
    {A: 1, C: 1, E: 1, F: 1, G: 2, H: 1}
  3. Then we remove A's direct followees {B, C, D} and A itself
  4. Eventually we have {G: 2, E: 1, F: 1, H: 1}

In our project, we will:

  1. Find all followees (E) of the user
  2. Find all followees of the followees (EE)
  3. Count them (EE) in a HashMap
  4. Remove the direct followees (E)
  5. Return the 10 users with the most frequent appearances (see the sketch below)
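A hedged, method-level sketch of the counting step (getFollowees is a hypothetical helper that reads one HBase row, as in Task 2):

// Hedged sketch: count second-degree followees and return the top 10.
static List<String> recommend(String userId) {
    Set<String> direct = getFollowees(userId);                  // E
    Map<String, Integer> counts = new HashMap<>();
    for (String followee : direct) {
        for (String candidate : getFollowees(followee)) {       // EE
            counts.merge(candidate, 1, Integer::sum);
        }
    }
    counts.keySet().removeAll(direct);                          // drop direct followees
    counts.remove(userId);                                      // drop the user themselves
    return counts.entrySet().stream()
                 .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                 .limit(10)
                 .map(Map.Entry::getKey)
                 .collect(Collectors.toList());
}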

Final Result:
(Figure: Task 5 result)

15619 Project3.2 Guide

Posted on 2017-07-17 | In 15619 Cloud Computing

Project3.2 Partitioning (Sharding) And Replication

Basics

Replication: keep multiple copies of data to improve GET performance and recover from database outages.
Partitioning (sharding): separate data among many nodes to improve PUT performance. Horizontal partitioning shards by rows; vertical partitioning shards by columns.

Part 1: Implementing Replication with Strong Consistency

Strong consistency:

  1. Strict Consistency: same key has same value all the time
  2. Strict Ordering: fulfill requests in the order they arrive
  3. Atomic Operations: one fails all fail
  4. Controlled Access: lock the item while writing is being fulfilled

Create a lock shop, HashMap<String, ReentrantReadWriteLock> lockShop, to maintain a lock for each key (a sketch follows below).
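A sketch of how that lock shop could be used (assuming a ConcurrentHashMap so that lock creation itself is thread-safe; store is a hypothetical in-memory key-value map):

// Hedged sketch: one ReentrantReadWriteLock per key.
private final ConcurrentHashMap<String, ReentrantReadWriteLock> lockShop = new ConcurrentHashMap<>();

private ReentrantReadWriteLock lockFor(String key) {
    return lockShop.computeIfAbsent(key, k -> new ReentrantReadWriteLock());
}

public void put(String key, String value) {
    ReentrantReadWriteLock lock = lockFor(key);
    lock.writeLock().lock();        // block readers and writers of this key while writing
    try {
        store.put(key, value);      // write locally, then replicate to the other nodes
    } finally {
        lock.writeLock().unlock();
    }
}

public String get(String key) {
    ReentrantReadWriteLock lock = lockFor(key);
    lock.readLock().lock();         // concurrent reads of the same key are allowed
    try {
        return store.get(key);
    } finally {
        lock.readLock().unlock();
    }
}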

Part 2: Implementing Sharding with Even Distribution

Consistent hashing: assign keys to nodes evenly while tolerating node failures. The hashing algorithm must return the same value for the same key at all times (a sketch follows below).
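A minimal sketch of a consistent-hash ring built on a TreeMap (the hash function here is a simplistic placeholder; a production ring would use a stronger, well-distributed hash):

// Hedged sketch: map each node to several points on a ring; a key is served by the
// first node clockwise from its hash, so losing a node only remaps that node's keys.
import java.util.*;

public class ConsistentHashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();
    private static final int VIRTUAL_NODES = 100;   // virtual nodes smooth the distribution

    private int hash(String s) {
        return s.hashCode() & 0x7fffffff;           // placeholder hash; must be stable per key
    }

    public void addNode(String node) {
        for (int i = 0; i < VIRTUAL_NODES; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    public void removeNode(String node) {
        for (int i = 0; i < VIRTUAL_NODES; i++) {
            ring.remove(hash(node + "#" + i));
        }
    }

    public String nodeFor(String key) {
        if (ring.isEmpty()) return null;
        Map.Entry<Integer, String> e = ring.ceilingEntry(hash(key));
        return (e != null) ? e.getValue() : ring.firstEntry().getValue();   // wrap around
    }
}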

15619 Project3.1 Guide

Posted on 2017-07-14 | In 15619 Cloud Computing

Project3.1 Files and Databases

Goal: compare the efficiency of three kinds of queries:

  1. Query with .csv files using grep and awk
  2. Query with MySQL
  3. Query with HBase

Part 1. File query (skipped)

Part 2. MySQL

Configure database

  1. Install MySQL:

     sudo apt-get install mysql-server
  2. Connect as the root user; type your password directly after -p:

     mysql -u root -ppassword
     SHOW DATABASES;
  3. Create a new user:

     CREATE USER 'newuser'@'localhost' IDENTIFIED BY 'team';
  4. Grant some or all privileges and log in again; here we grant all privileges:

     GRANT ALL PRIVILEGES ON teamproj.* TO 'newuser'@'localhost';
  5. Create a database and table that support full Unicode:

     CREATE DATABASE blah;

Answer questions

Q8 & Q9

Use an index to dramatically decrease the query time.

Q10

  1. Here, we use LIKE for a case-insensitive pattern match and LIKE BINARY for a case-sensitive match, which is what this question requires. A binary comparison matches a character's byte (ASCII) code rather than applying collation rules, so it is an exact match.
  2. As release is a reserved word in MySQL, we have to qualify it as songs.release. Note that we can escape it with backticks (`release`) when creating a table, but we use tableName.columnName in a query.

Q11

At the end of the query, we add AS filtered; otherwise, we get the error "Every derived table must have its own alias".

SELECT artist_name FROM (SELECT artist_id, artist_name, COUNT(artist_id) AS frequency FROM songs GROUP BY artist_id ORDER BY frequency DESC LIMIT 2,1) AS filtered

Bonus: Load data

Log in with the --local-infile flag

mysql -u root -pdb15319root song_db --local-infile

and then execute

load data infile 'million_songs_metadata.csv' into table songs fields terminated by ','

Part 3. Disk performance test

We benchmark file I/O performance under four settings: t1.micro and m3.large instances, each with a magnetic disk and with an SSD.

t1.micro, magnetic:   136.24   179.84   193.00
t1.micro, SSD:        629.33   627.00   626.69
m3.large, magnetic:  1168.99  1336.60  1484.45
m3.large, SSD:       1959.00  2279.25  2328.66

Part 4. HBase

First, we build an EMR cluster with one master and one core (slave) node, and SSH into the master node. You should be able to check the hard drive condition with the two commands under step 3 below.
(PS: Don't mistakenly SSH into the slave. You can find the instance ID under EMR.)

  1. Launch an EMR cluster with 1 master and 1 core node.
    1. Select “Go to advanced options” on the top of create cluster page
    2. Make sure all instances are m1.large.
    3. Make sure that the EMR cluster is created within the same VPC as your project instance (the one with your runner.sh).
    4. Choose AMI version 3.11.0 (hadoop version 2).
    5. Remove any existing services such as Pig and Hive and choose to install HBase version 0.94.
    6. You must specify a key-pair to be able to SSH to the individual instances. Note that the username to use while SSHing to these instances is hadoop.
    7. Do not forget to set a tag that will be automatically propagated to the individual instances in the cluster.
    8. Enable “termination protection” and “keep-alive” for your cluster.
  2. Set the security group of both the master and the core node to allow ALL traffic. Use the master's public DNS as the SSH DNS.
  3. After you SSH into the master node, you can run the following command to verify that HDFS is healthy, as reported per datanode:
    hadoop dfsadmin -report

    sudo parted -l
    df -h
  4. Create a directory on the guest machine (where you are right now) and download the .csv file to that location

    mkdir P3_1
    cd P3_1
    wget https://s3.amazonaws.com/15319-p31/million_songs_metadata.csv
  5. Create a directory in HDFS to store the file

    hadoop fs -mkdir /csv
  6. Put the file into HDFS. Usage of -put: hadoop fs -put /local/directory /hdfs/directory

    hadoop fs -put /home/hadoop/P3_1/million_songs_metadata.csv /csv
  7. Create a new table in HBase by logging into the HBase shell:

    hbase shell
  8. Create a new table named songdata with a column family named data

    create 'songdata','data'
  9. Exit the shell and prepare the data for importing into HBase; you should see a progress bar showing the import progress, like:
    [======> ]45%

    hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns="HBASE_ROW_KEY,data:title,data:song_id,data:release,data:artist_id,data:artist_mbid,data:artist_name,data:duration,data:artist_familiarity,data:artist_hotttnesss,data:year" -Dimporttsv.separator="," -Dimporttsv.bulk.output=hdfs:///output songdata hdfs:///csv
  10. Load the data into the table

    hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles hdfs:///output songdata
  11. Now scan the table to make sure the data has been loaded successfully; use Ctrl+C to stop scanning

    scan 'songdata'
  12. Note down the private IP, enter it into the Java file, and run the demo with these commands:

    javac HbaseTasks.java
    java HbaseTasks demo

If you see a result of 96 rows, you can proceed to answer the questions now :)

15619 Project2.3 Guide

Posted on 2017-07-12 | In 15619 Cloud Computing

Project2.3 Advanced Scaling Concepts: Caching

Goal: Explain and implement a cache.

  1. Temporal locality: set a cache TTL
  2. Spatial locality: fetch a batch of data items that are likely to be accessed together

In this project, we create a simple cache using a HashMap. Some suggested that we should use a LinkedHashMap instead; why? When I tried it, I didn't see any performance improvement. We were supposed to apply a more sophisticated strategy, but I ended up with this simple one:

else {
    int start = targetInt - 30;
    int end = targetInt + 49;
    // Update most recent ID
    mostRecentID = targetInt;
    // Return the response of target ID
    return storeRangeIntoMap(start, end, targetInt);
}

I tried to distinguish between decreasing and increasing access patterns but didn't see any improvement either. Is there a better way to do this?
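For what it's worth, the usual reason LinkedHashMap gets suggested is that it can act as an LRU cache almost for free; a hedged sketch:

// Hedged sketch: access-ordered LinkedHashMap that evicts the least recently used entry.
import java.util.LinkedHashMap;
import java.util.Map;

public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruCache(int capacity) {
        super(16, 0.75f, true);      // accessOrder = true: get() moves an entry to the tail
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;    // evict once the capacity is exceeded
    }
}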

15619 Project2.1 Guide

Posted on 2017-07-10 | In 15619 Cloud Computing

Project2.1 Horizontal Scaling and Advanced Resource Scaling

Goal: Use APIs to programmatically control the AWS Elastic Load Balancer and Auto Scaling Group (a sketch follows the workflow below).

Workflow

  1. Create working instances
  2. Create an elastic load balancer LB1: set up health check page
  3. Create an auto-scaling launch configuration
  4. Create an auto-scaling group: attach the ELB just created (advanced - load balancing) and use the same subnet
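For the programmatic-control part of the goal, a rough sketch with the AWS SDK for Java v1 (the group name and capacity are placeholders):

// Hedged sketch (AWS SDK for Java v1): adjust the desired capacity of an Auto Scaling Group.
import com.amazonaws.services.autoscaling.AmazonAutoScaling;
import com.amazonaws.services.autoscaling.AmazonAutoScalingClientBuilder;
import com.amazonaws.services.autoscaling.model.SetDesiredCapacityRequest;

public class Scaler {
    public static void main(String[] args) {
        AmazonAutoScaling autoScaling = AmazonAutoScalingClientBuilder.defaultClient();
        // Scale the (placeholder) group out to 4 instances.
        autoScaling.setDesiredCapacity(new SetDesiredCapacityRequest()
                .withAutoScalingGroupName("my-asg")
                .withDesiredCapacity(4));
    }
}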

Elastic Load Balancer

Elastic Load Balancer automatically distributes traffic across connected instances and handles instance failures through health checks. The default distribution strategy is round-robin. If an instance is unhealthy, the load balancer stops routing requests to it.

Auto Scaling Group

Auto Scaling Group automatically adds or removes identical resources in response to changes in demand.

Auto Scaling Policy: the Auto Scaling Group reacts when user-specified thresholds (e.g. CPU utilization) are crossed (as detected by CloudWatch). It maintains a desired number of instances (between a specified minimum and maximum) at all times. It can also scale on a specific schedule (e.g. more servers during the weekend). The Auto Scaling Group performs health checks as well: if an instance is unhealthy, it launches new instances to replace it.

Auto Scaling Template: the instance AMI, instance type, key pair, security group, etc. The applications on the instances should start automatically when the instances come up.

Tips

  1. The auto-generated instances will NOT be tagged if you terminate them too soon after launching them!