Setting up standalone Spark on Ubuntu (with Docker image provided)
Spark is a computational framework that has become quite popular recently. Compared with Hadoop, it offers much nicer abstractions (e.g. Resilient Distributed Datasets). Here’s how to configure a standalone-mode Spark instance on Ubuntu (I use 12.04). If you don’t have the patience and just want to play, you can download the Docker image from here and follow the instructions at the end of this post.
1. Set up Java
In this case the Oracle JDK works better than OpenJDK. To install the Oracle JDK, first run:
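One common way on Ubuntu 12.04 is the WebUpd8 PPA (the exact PPA is an assumption here; any source of Oracle JDK packages will do):

```bash
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
```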
Then install the desired version of the JDK; I used JDK 7:
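With the PPA above, the installer package looks like this (package name assumes the WebUpd8 PPA):

```bash
sudo apt-get install oracle-java7-installer
```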
If you have multiple installations of Java, you can run this command to choose the one you want to use:
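```bash
sudo update-alternatives --config java
```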
2. Set up Hadoop (Optional)
We can set up Hadoop and have Spark read files from HDFS in a later section. This is optional, because you can also specify a local path in the Spark shell. Still, this is a tutorial, so I try to be as comprehensive as possible. This section covers configuring a Hadoop instance (also in standalone mode). You can also refer to here for configuring the newest version of Hadoop and skip this part (though in my configuration the procedure is a little different).
First, download Hadoop from the official site. At the time of writing the newest version is 2.5.2; download it. Note that Spark’s official site only provides pre-built packages for Hadoop up to 2.4, but in my testing the build for 2.4 also works with 2.5. Extract the tar file to a local directory, say /usr/local/hadoop.
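For example (the Apache archive URL is an assumption; pick whatever mirror is closest to you):

```bash
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.5.2/hadoop-2.5.2.tar.gz
tar -xzf hadoop-2.5.2.tar.gz
sudo mv hadoop-2.5.2 /usr/local/hadoop
```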
1. Make sure your system has ssh and rsync:
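```bash
sudo apt-get install ssh rsync
```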
2. Configure environment variables:
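Something along these lines in ~/.bashrc works (the JAVA_HOME path assumes the Oracle JDK 7 from the PPA above; adjust it to your installation):

```bash
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
```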
3. Modify the configuration files:
etc/hadoop/core-site.xml:
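A minimal single-node configuration, as in the Hadoop setup guide, points the default filesystem at a local HDFS:

```xml
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
```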
etc/hadoop/hdfs-site.xml:
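Since there is only one node, set the replication factor to 1:

```xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
```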
Check that you can ssh to localhost without a passphrase:
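```bash
ssh localhost
```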
If you cannot ssh to localhost without a passphrase, execute the following commands:
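These generate a key pair and authorize it locally (the same steps given in the Hadoop single-node setup guide):

```bash
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
```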
4. Last step: start Hadoop
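From the Hadoop directory, format the namenode and start HDFS:

```bash
bin/hdfs namenode -format
sbin/start-dfs.sh
```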
Here you might run into the same problem I did while starting DFS: it complains that JAVA_HOME is not found. To fix this, edit libexec/hadoop-config.sh and add a line at the top that sets JAVA_HOME explicitly, for example:
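```bash
export JAVA_HOME=/usr/lib/jvm/java-7-oracle   # adjust to wherever your JDK lives
```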
3. Finally, Spark!
Download Spark from here, choosing the pre-built version for Hadoop 2.4, and extract it to a local directory.
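For example (the exact file name depends on the Spark version you downloaded):

```bash
tar -xzf spark-*-bin-hadoop2.4.tgz
```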
Before we actually start to play with Spark, let’s download some text file, say this page, and put it in HDFS as /user/(your user name)/input/Apache_Spark. You can also just leave it on the local disk and use its absolute path when reading the file.
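If you go the HDFS route, the upload looks roughly like this (assuming Hadoop’s bin directory is on your PATH and the downloaded file is named Apache_Spark):

```bash
hdfs dfs -mkdir -p /user/$(whoami)/input
hdfs dfs -put Apache_Spark /user/$(whoami)/input/Apache_Spark
```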
To run the Spark shell, execute:
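```bash
./bin/spark-shell
```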
Now let’s do the classic one-line word count:
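A sketch of what it looks like in the shell (the HDFS path assumes the file and the core-site.xml settings from above; swap in a local path like file:///... if you skipped the Hadoop section):

```scala
val counts = sc.textFile("hdfs://localhost:9000/user/<your user name>/input/Apache_Spark").flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

// Print a few of the resulting (word, count) pairs
counts.take(10).foreach(println)
```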
Just a random example. Spark supports many operators (actually a superset of Hadoop’s), which can be found in the official documentation.
Finally, here’s a Docker image with everything configured. You can download it from here. Execute /root/start-hadoop.sh first; Spark is in the /root/spark directory.