Coursera Spark week 1 Wikipedia

So this week marks the official start of the 4th course of the Scala Specialization on Coursera, the Spark course. This is an outstanding event, because so many students have been waiting for it for almost a year 🙂 What does it mean for me? Firstly, I need to stay focused on the video lectures and assignments. Secondly, I decided to share my impressions about the course here.

General impressions

As you may guess, the first week was entirely about Spark basics. In the video lectures, Heather Miller (the lecturer) covered a lot of material about the place of Spark in big data processing. There was a pretty comprehensive comparison of Spark and Hadoop, so as a student of this course I now have a more or less objective understanding of Spark’s advantages over Hadoop.

The course content is good enough: Heather’s English is clear, and the slides contain a lot of useful information. But sometimes the slides are blurred, which makes reading pretty tricky. Theory predominates over practice. This means that you will learn how Spark works and how it can be used more efficiently, but getting acquainted with the Spark API is left to students as self-study.

This is exactly the moment when you realize that the previous three courses of the Specialization come in handy! Yes, I’m talking about Scala. Knowledge of Scala is extremely important for Spark, and the Scala collections API is especially useful. Despite differences in their internals, the Spark API and the Scala collections API look similar.
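
Here is a tiny side-by-side sketch of that similarity (my own illustration, not part of the course; it assumes an already created SparkContext named sc):

// Counting how many words equal "spark" with Scala collections vs. a Spark RDD.
// `sc` is assumed to be an existing SparkContext, e.g. one created for local experiments.
val words = List("scala", "spark", "java", "spark")

// Scala collections: evaluated eagerly, in local memory
val localCount = words.filter(word => word == "spark").size

// Spark RDD: the chain of calls looks almost identical, but it is lazy and distributed
val distributedCount = sc.parallelize(words).filter(word => word == "spark").count()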

Impressions from the assignments

Well, after watching the video lectures you need to complete some practical assignments. Based on my previous experience, I prepared myself for the worst-case scenario (the tasks in the first three courses were rather hard to solve). But everything went smoothly! After reading the assignment attentively, I started browsing the Spark API and implemented all of the functions step by step.

Four out of five functions passed the tests on Sunday evening, and on Monday morning I completed the fifth one. Three… Two… One… I submitted the task for grading. And voila! ZERO! ZERO POINTS! The functions were running too long, and I got 0 out of 10!

So I quickly looked at the functions once again and reworked them. All the tasks were related to processing Wikipedia articles and counting how frequently programming language names occur in them. My new result was 9/10. Wow! Then I figured out what was still wrong and improved the code to get 10/10.

For those of you who are not taking this course, I want to share my solutions. So be kind and do not copy-paste them 🙂

Number of occurrences of a programming language in the Wikipedia articles:

def occurrencesOfLang(lang: String, rdd: RDD[WikipediaArticle]): Int = {
  rdd.map(article => article.text)
    .filter(text => text.split(" ").contains(lang))
    .collect()
    .length
}

Rank the programming languages using the occurrencesOfLang function:

def rankLangs(langs: List[String], rdd: RDD[WikipediaArticle]): List[(String, Int)] = {
  langs.map(lang => (lang, occurrencesOfLang(lang, rdd)))
    .sortBy(pair => pair._2)
    .reverse
}

Inverted index of the Wikipedia articles:

def makeIndex(langs: List[String], rdd: RDD[WikipediaArticle]): RDD[(String, Iterable[WikipediaArticle])] = {
  rdd.flatMap(article =>
      langs.filter(lang => article.text.split(" ").contains(lang))
        .map(lang => (lang, article)))
    .groupByKey()
}

Rank the programming languages using the inverted index:

def rankLangsUsingIndex(index: RDD[(String, Iterable[WikipediaArticle])]): List[(String, Int)] = {
  index.map(pair => (pair._1, pair._2.size))
    .sortBy(pair => pair._2)
    .collect()
    .toList
    .reverse
}

Rank the programming languages using the reduceByKey function:

def rankLangsReduceByKey(langs: List[String], rdd: RDD[WikipediaArticle]): List[(String, Int)] = {
  rdd.flatMap(article => langs
      .filter(lang => article.text.split(" ").contains(lang))
      .map(lang => (lang, 1)))
    .reduceByKey(_ + _)
    .collect()
    .toList
    .sortWith(_._2 > _._2)
}

Summary

The first week of the “Big Data Analysis with Scala and Spark” course is interesting. I recommend focusing on learning the Spark API and practicing on some small samples in order to see what each particular function can do (a minimal local setup for such experiments is sketched below). If you have no previous experience with Scala, it will definitely not be easy to move fast through the assignments.
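
For such experiments, a self-contained snippet like the following should be enough (the object name and the local master URL are just my own choices for a playground, not something from the course):

import org.apache.spark.{SparkConf, SparkContext}

object SparkPlayground extends App {
  // A local Spark context is enough for trying out individual API functions
  val conf = new SparkConf().setAppName("spark-playground").setMaster("local[*]")
  val sc = new SparkContext(conf)

  // A tiny data set to see what reduceByKey actually does
  val pairs = sc.parallelize(Seq(("Scala", 1), ("Java", 1), ("Scala", 1)))
  println(pairs.reduceByKey(_ + _).collect().toList) // e.g. List((Scala,2), (Java,1))

  sc.stop()
}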

I’m going to publish posts about the assignments for weeks #2 & #3 in the near future.

About The Author

Mathematician, programmer, wrestler, last action hero... Java / Scala architect, trainer, entrepreneur, author of this blog

  • Francisco

    Hi Alex, I did the assignment and scored 0/10 while all the tests are passing. Looking at your solution, mine is quite similar; indeed, it is just basic Spark.

    In my solution I just cached wikiRdd

    val wikiRdd: RDD[WikipediaArticle] = sc.textFile(WikipediaData.filePath).map(x => WikipediaData.parse(x)).cache

    Any suggestion on what could be wrong?

    Fran

    • Francisco,

      I’m not sure that caching the RDD would help in this particular assignment. But I may be wrong 🙂

      Apart from passing the tests, I suggest you run the `main` method of the `WikipediaRanking` class.
      If it fails, just rethink your implementation of the functions which cause the failures.

  • Francisco

    Wow, I just copied your first solution, which you implemented with a simple map and which I had implemented with aggregate as they suggested, and… 10/10

    Here is my code in case you can spot what the problem could be; as far as I can see, it works perfectly:

    rdd.aggregate(0)(
      (x, y) => if (y.text.split(" ").contains(lang)) x + 1 else x,
      _ + _)

    // I just do x + 1 in case an article of the rdd contains the reference, and later just add up the counts from the different partitions

    Thanks,

    Javi

  • Jesus de Diego

    Hi Alex,

    What about:

    def rankLangsUsingIndex(index: RDD[(String, Iterable[WikipediaArticle])]): List[(String, Int)] = {
      index.mapValues(i => i.toArray.length).collect().toList.reverse
    }

    It gives better metrics:
    “Processing Part 2: ranking using inverted index took 20945 ms.”

    Before (with index.map) was “Processing Part 2: ranking using inverted index took 24878 ms.”

    Of course it depends on the machine, etc., but mapValues can probably give us better performance.

    Cheers!

    Jesus

  • JB

    Very interesting page!
    I have a question about the line:

    def makeIndex(langs: List[String], rdd: RDD[WikipediaArticle]): RDD[(String, Iterable[WikipediaArticle])] =
      rdd.flatMap(article => langs.filter(lang => article.text.split(" ").contains(lang)).map(lang => (lang, article))).groupByKey()
    How does Scala know that article refers to your RDD and not to your list of arguments? In case you find two articles that contain the word JAVA, in the end you will produce (JAVA, "article1") and (JAVA, "article2"). How does it figure out that article comes from the RDD[WikipediaArticle]?

    Thanks and keep up the good work.

    • JB, I’m not sure that I entirely understood your question. But when I apply flatMap to the rdd, it means that I work with articles and not with the list of arguments.
