Libby Hemphill

Principal Investigator

I use RStudio to simplify my R analysis work, and my data outgrew the analysis capacity of the database server a long time ago. To start my clusters, I followed Amazon’s advice, sort of. I ran into validation errors with the command line version, so instead I started my cluster “by hand” through the AWS web interface. I also ran into some security issues, which I resolved by making a new key pair. Here are the basic steps:

1. Start a cluster through the AWS interface

Amazon has some info about setting up EMR clusters. I often find their tutorials assume too much knowledge or experience for beginners, though. If you’ve had practice setting up EC2 instances, then setting up an EMR cluster will be a reasonable next step. If you’ve never set up EC2 or used AWS before, then their tutorial and my steps here are probably not detailed enough for you.

So, I used the “go to advanced options” link after the “Create Cluster” button on the EMR page of the AWS Console. That’s where you set your bootstrap actions (see below), in Step 3: General Cluster Settings. I don’t use any Task nodes, just Master and Core, and that works for my analysis (< 1 M nodes in a network, using igraph in RStudio).

  1. Create a new key pair for that cluster
  2. Add TCP for port 8787 to the ElasticMapReduce-master security group (this lets you log in to RStudio in a browser without an SSH tunnel)
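If you'd rather do those two steps from the command line than click through the console, something like the following should work. It's a rough sketch, not what I actually ran: the key name, the security group ID, and the IP range are placeholders I made up, and the script only prints the commands unless you set DRY_RUN=0.

```shell
#!/usr/bin/env bash
set -eu

DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands only; set to 0 to really run them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Create a key pair; AWS returns the private key only once, on stdout.
create_key_pair() {
  local key_name="$1"
  run aws ec2 create-key-pair --key-name "$key_name" \
    --query KeyMaterial --output text
}

# Allow inbound TCP on the RStudio port from one CIDR block.
open_rstudio_port() {
  local group_id="$1" cidr="$2" port="${3:-8787}"
  run aws ec2 authorize-security-group-ingress --group-id "$group_id" \
    --protocol tcp --port "$port" --cidr "$cidr"
}

# Placeholder values (assumptions): use your own key name, the ID of your
# ElasticMapReduce-master group, and your own IP range.
create_key_pair "my-emr-key"
open_rstudio_port "sg-xxxxxxxx" "203.0.113.10/32"
```

When you run it for real, redirect the output of create_key_pair into a .pem file and chmod 400 it, since that is the only time AWS hands you the private key.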

1a. Bootstrap

If you’re starting your own cluster, you can use this bootstrap command to install all the software you’ll need. You should set the username and password since you’re leaving it open to various IPs in that TCP step above. If you want to use an SSH tunnel and SOCKS approach, you can probably leave those as the defaults (hadoop/hadoop).

s3://aws-bigdata-blog/artifacts/aws-blog-emr-rstudio-sparklyr/rstudio_sparklyr_emr5.sh --sparklyr --rstudio --sparkr --rexamples --plyrmr --user <USER FOR LOGGING IN TO RSTUDIO> --user-pwd <PASSWORD FOR THAT USER> --rstudio-port <PORT YOU WANT TO USE>
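If you pass the bootstrap action on the command line instead of typing it into the console, the same options go into the Args list of --bootstrap-actions. Here is a small sketch that builds that JSON for you. The helper name bootstrap_json and the example credentials are mine, just for illustration, and it does no escaping, so keep quotes and backslashes out of the username and password.

```shell
#!/usr/bin/env bash
set -eu

SCRIPT="s3://aws-bigdata-blog/artifacts/aws-blog-emr-rstudio-sparklyr/rstudio_sparklyr_emr5.sh"

# Emit the JSON value for --bootstrap-actions, including the login options.
# Hypothetical helper; values are inserted as-is, with no JSON escaping.
bootstrap_json() {
  local user="$1" pass="$2" port="${3:-8787}"
  printf '[{"Path":"%s","Args":["--sparklyr","--rstudio","--sparkr","--rexamples","--plyrmr","--user","%s","--user-pwd","%s","--rstudio-port","%s"],"Name":"RStudio"}]\n' \
    "$SCRIPT" "$user" "$pass" "$port"
}

# Placeholder credentials (assumptions): pick your own.
bootstrap_json "rstudio-user" "change-me"
```

You could then hand the result to the CLI with --bootstrap-actions "$(bootstrap_json myuser mypassword 8787)".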

1b. Amazon CLI

When you start a cluster, Amazon generates a CLI command for you to start a similar one. Here’s what they generated for me:

aws emr create-cluster \
  --auto-scaling-role EMR_AutoScaling_DefaultRole \
  --termination-protected \
  --applications Name=Hadoop Name=Hive Name=Pig Name=Hue \
  --bootstrap-actions '[{"Path":"s3://aws-bigdata-blog/artifacts/aws-blog-emr-rstudio-sparklyr/rstudio_sparklyr_emr5.sh","Args":["--sparklyr","--rstudio","--sparkr","--rexamples","--plyrmr"],"Name":"RStudio"}]' \
  --ec2-attributes '{"KeyName":"<YOUR KEY>","InstanceProfile":"EMR_EC2_DefaultRole","AvailabilityZone":"us-east-1a","EmrManagedSlaveSecurityGroup":"sg-ad28b8bb","EmrManagedMasterSecurityGroup":"sg-ba27b7ac"}' \
  --service-role EMR_DefaultRole \
  --enable-debugging \
  --release-label emr-5.4.0 \
  --log-uri 's3n://aws-logs-192477427265-us-east-1/elasticmapreduce/' \
  --name 'Mil2.0' \
  --instance-groups '[{"InstanceCount":4,"InstanceGroupType":"CORE","InstanceType":"m3.xlarge","Name":"Core - 2"},{"InstanceCount":1,"InstanceGroupType":"MASTER","InstanceType":"m3.xlarge","Name":"Master - 1"}]' \
  --scale-down-behavior TERMINATE_AT_INSTANCE_HOUR \
  --region us-east-1

2. Log in to RStudio

Fire up your browser and head to

  • http://ec2-<YOUR MASTER PUBLIC DNS>.compute-1.amazonaws.com:<YOUR RSTUDIO PORT>

and log in with whatever username and password combination you chose during the bootstrap step.
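If you don't want to dig the master's public DNS name out of the console, you can probably ask the CLI for it and build the URL in one go. This is a sketch under a few assumptions: the function names are mine, CLUSTER_ID is a placeholder for your own cluster's ID, and the port defaults to 8787 unless you chose a different one at bootstrap.

```shell
#!/usr/bin/env bash
set -eu

# Build the RStudio URL from the master node's public DNS name and the port.
rstudio_url() {
  local dns="$1" port="${2:-8787}"
  printf 'http://%s:%s\n' "$dns" "$port"
}

# Ask EMR for the master's public DNS name (needs a configured AWS CLI).
master_dns() {
  aws emr describe-cluster --cluster-id "$1" \
    --query 'Cluster.MasterPublicDnsName' --output text
}

# Only contacts AWS when CLUSTER_ID is set, e.g. CLUSTER_ID=j-XXXXXXXXXXXXX
if [ -n "${CLUSTER_ID:-}" ]; then
  rstudio_url "$(master_dns "$CLUSTER_ID")" "${RSTUDIO_PORT:-8787}"
fi
```

Run it with CLUSTER_ID set to your cluster's ID and paste the printed URL into your browser.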