Big data processing and programming – TNG Group @ Politecnico di Torino

Professor: Martino Trevisan

Official description of the course: here

In the big data era, traditional data management and analytics systems are no longer adequate to analyze large amounts of data efficiently and effectively. Hence, novel data models, programming paradigms, and database management systems are needed. The course addresses the challenges arising in the big data era, examining big data processing and knowledge extraction from big data. Specifically, it covers how to collect, store, retrieve, and analyze big data to mine useful knowledge. Besides data analytics, the course covers novel programming paradigms (Spark RDD and DataFrame programs in particular) and discusses how they can help engineers and researchers extract knowledge from data. The course includes laboratory activities in which students complete assignments consisting of simple Apache Spark applications.

Lectures (15 hours)
• Introduction to big data: characteristics, problems, opportunities (1.5 hours)
• Hadoop and its ecosystem: infrastructure and basic components (1.5 hours)
• Apache Spark architecture (2 hours)
• Spark RDD programming (5 hours)
• Spark DataFrame programming (5 hours)

Laboratory activities (5 hours)
• Development of applications with Spark using Python (5 hours)
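
To give a flavor of the programming style covered in the RDD lectures, here is a minimal sketch of a word count written in plain Python. It does not use Spark itself: the helper functions `flat_map` and `reduce_by_key` are hypothetical stand-ins that only mimic, on a single machine, the behavior of the `flatMap` and `reduceByKey` operations of the Spark RDD API, and the input lines are made up for illustration.

```python
from collections import defaultdict
from functools import reduce


def flat_map(func, records):
    """Apply func to each record and flatten the results (mimics RDD.flatMap)."""
    return [item for record in records for item in func(record)]


def reduce_by_key(func, pairs):
    """Merge the values of each key with func (mimics RDD.reduceByKey)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}


# Made-up input; in a Spark application this would be a distributed dataset.
lines = ["big data processing", "big data programming"]

# Step 1: split every line into words.
words = flat_map(lambda line: line.split(), lines)

# Step 2: emit a (word, 1) pair for every word.
pairs = [(word, 1) for word in words]

# Step 3: sum the counts of each word.
word_counts = reduce_by_key(lambda a, b: a + b, pairs)

print(word_counts)
```

In an actual Spark program the same three steps would be chained on an RDD and executed in parallel across a cluster; the sketch only shows the shape of the computation.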