CS5312 - Big Data Analytics


Course Description:
This course introduces the essential mathematical concepts and computational methods to store, retrieve, process, and analyze data sets that are too large, complex, and heterogeneous for conventional methods. We will learn how to handle different types of big data, such as text, streaming, or network data. The course covers data preprocessing, data visualization, text analytics, graph analytics, recommendation systems, clustering, classification, regression, dimensionality reduction, matrix factorization, and locality-sensitive hashing.

Prerequisites:

  • Undergraduate courses in Discrete Mathematics, Data Structures, Algorithms, and Probability.
  • Undergraduate courses in Databases and Linear Algebra are useful but not required.
  • Strong problem-solving skills and some background in programming.

Textbook(s)/Supplementary Readings:

  • Mining of Massive Datasets, by Jure Leskovec, Anand Rajaraman, and Jeffrey D. Ullman

Lectures:

Topic Slides Notes
Introduction to Big Data Analytics
  • Big Data Generation and Growth
  • Industries benefiting from Data Analytics
  • Sources, Aspects and Types of Big Data
  • The Analytics Process
Slides
Getting to Know Data and EDA
  • EDA: Purpose & Benefits
  • Statistical EDA
  • Graphical EDA
Slides
Data Preprocessing and Transformation
  • Issues with Data
  • Data Cleaning: Missing Values, Noise, Outliers
  • Data Integration: Inconsistencies, Deduplication
  • Data Reduction: Sampling, Feature Selection
  • Data Transformation: Standardization, Numeric transformation
Slides
Vector Norms and Proximity Measures
  • Vector Norms and Unit Circles
  • Proximity Measures
  • Distance Between Non-numeric Vectors
  • Distance Between Mixed Feature Vectors
  • Distance Between Non-vector Data Objects
Slides
Data Analytics Tasks and Methods
  • Classification:
    • Motivation and Applications
    • Train-Validation Split & Cross-Validation
    • Evaluation Metrics and Class Imbalance
    • Overfitting
    • Classifiers: k-NN, Naïve Bayes, Decision Tree
    • Information Gain & Entropy
  • Regression:
    • Linear Regression
    • Fitting Zero-degree Function and Line
    • Interpreting Coefficients
    • Partitioning the Variance
    • Multiple Linear Regression
    • Polynomial Regression
    • Logistic Regression
  • Clustering:
    • Aspects and Definition
    • Point Assignment Methods:
      • k-Means, k-Medoid and k-Mode
    • Agglomerative Clustering
    • Clustering Validation and Evaluation
    • Internal and External Measures
  • Recommendation Systems:
    • Raw-Average-Based Recommendation
    • ANOVA and Bayesian Filtering
    • Content Based Filtering
    • Collaborative Filtering
    • Matrix Factorization
  • Text Analytics:
    • Applications, Concepts, & Terminology
    • Text EDA
    • Vector Space Modeling:
      • Set/Bag-of-Words
      • TF-IDF
      • Word Embedding
  • Graph Analytics:
    • Major Classes of Graphs
    • Network Descriptive, Connectivity and Centrality Analytics
    • Large-Scale Network Structure
    • Clustering and Communities
    • Graph Representation Learning

Slides
Data Visualization
  • Visual Perception
    • Gestalt Principles
    • Context, Preattention
    • Magnitude Estimation
  • Visual Attributes and Mapping
  • Evaluating Visualization
Slides
Information Retrieval (IR)
  • Term-Document Incidence Matrix
  • Inverted Index
  • Ranked Retrieval
Slides
Linear Algebra Review
  • Vector Operations
  • Linear Functions
  • Linear Transformation
  • Change of Bases
  • Eigenvalues & Eigenvectors
  • Powers of Matrices
  • Random Walk
  • Markov Chain
Slides
Data Structures Review
  • List, Stack, Queue, Set, Dictionary, Heaps
Slides
Web Search
  • Web Search Challenges
    • Content Spamming
  • Trustworthy Webpages and Node Centrality
  • PageRank: Dangling Nodes, Spider Traps, Random Teleporting
  • Link Spamming
  • Personalized and Topic-Sensitive PageRank
  • Hyperlink-Induced Topic Search
Slides
Spectral Clustering
  • Limitations of Distance Based Clustering
  • Graph Partition and Cuts
  • Spectral Graph Theory
  • (Un)Normalized Graph Laplacians
  • Relation of Graph Laplacian and Partition
  • Spectral Clustering into k-Clusters
Slides
Proximity Problems & Curse of Dimensionality
  • Proximity Problems on High Dimensional Data
    • Distance Matrix Computation
    • k-Nearest Neighbor Problem
    • Fixed-Radius Nearest Neighbors
    • Approaches for the k-NN Problem
  • Curse of Dimensionality
    • Processing and Storage
    • Data Sparsity
    • Issues for Nearest Neighbors
      • Huge Search Space
      • Diminishing Volume of the n-Ball
      • Nearest Neighbor Instability
    • Distance Concentration
    • Angle Concentration
Slides Notes
Data Preparation & Representation Learning
  • Data Preparation
    • Data Compression
    • Low Distortion Embedding
    • Dimensionality Reduction
    • Multi-dimensional Scaling
  • Dimensionality Reduction
    • Feature Selection and Extraction
    • Johnson-Lindenstrauss Lemma
Slides Notes
Locality Sensitive Hashing (LSH)
  • LSH for k-NN and Near Duplicate Problems
  • LSH for:
    • Hamming Distance
    • Jaccard Distance
    • Cosine Distance
    • Euclidean Distance
  • Constructing New LSH Families
  • Non-LSH-able Distance Measures
  • Data Dependent LSH
Slides Notes
Data Streams
  • Data Stream: Model of Computation
  • Synopsis Based Exact Stream Computation
  • Sliding Window, Histogram and Wavelets
  • Sketches: Count, Count-Min & AMS
Slides Notes
Principal Component Analysis (PCA)
  • Aims of PCA
  • PCA vs. JL-Transform
  • Variance-Covariance Matrix
  • PCA Objective:
    • Reconstruction Error and Projected Variance
    • Linear Algebraic Formulation
  • Eigen Decomposition of Covariance Matrix
  • Power Iteration Method
  • Eigenfaces
  • Limitations of PCA
Slides Notes
Matrix Factorization and SVD
  • Rank Factorization of a Matrix
  • Low Rank Approximation
  • Singular Value Decomposition (SVD)
  • SVD Applications:
    • Recommendation System
    • Latent Semantic Analysis
    • Data Denoising
  • PCA and SVD
Slides Notes
