CS5312 - Big Data Analytics
Course Description:
This course introduces the essential mathematical concepts and computational methods to store, retrieve,
process, and analyze data sets that are too large, complex, and heterogeneous
for conventional methods. We will learn how to handle different types of big data, such as text, streaming,
or network data. The course covers data preprocessing, data visualization, text analytics, graph analytics,
recommendation systems, clustering, classification, regression, dimensionality reduction, matrix factorization,
and locality-sensitive hashing.
Prerequisites:
- Undergraduate courses in Discrete Mathematics, Data Structures, Algorithms, and Probability.
- Undergraduate courses in Databases and Linear Algebra are useful but not required.
- Strong problem-solving skills and some background in programming.
Textbook(s)/Supplementary Readings:
- Mining of Massive Datasets, by Jure Leskovec, Anand Rajaraman, and Jeffrey D. Ullman
Lectures:
Topic |
Slides |
Notes |
Introduction to Big Data Analytics
- Big Data Generation and Growth
- Industries benefiting from Data Analytics
- Sources, Aspects and Types of Big Data
- The Analytics Process
|
Slides |
|
Getting to Know Data and EDA
- EDA: Purpose & Benefits
- Statistical EDA
- Graphical EDA
|
Slides |
|
Data Preprocessing and Transformation
- Issues with Data
- Data Cleaning: Missing Values, Noise, Outliers
- Data Integration: Inconsistencies, Deduplication
- Data Reduction: Sampling, Feature Selection
- Data Transformation: Standardization, Numeric transformation
|
Slides |
|
Vector Norms and Proximity Measures
- Vector Norms and Unit Circles
- Proximity Measures
- Distance Between Non-numeric Vectors
- Distance Between Mixed Feature Vectors
- Distance Between Non-vector Data Objects
|
Slides
|
|
Data Analytics Tasks and Methods
- Classification:
- Motivation and Applications
- Train-Validation Split & Cross-Validation
- Evaluation Metrics and Class Imbalance
- Overfitting
- Classifiers: k-NN, Naïve Bayes, Decision Tree
- Information Gain & Entropy
- Regression:
- Linear Regression
- Fitting Zero-degree Function and Line
- Interpreting Coefficients
- Partitioning the Variance
- Multiple Linear Regression
- Polynomial Regression
- Logistic Regression
- Clustering:
- Aspects and Definition
- Point Assignment Methods:
- k-Means, k-Medoid and k-Mode
- Agglomerative Clustering
- Clustering Validation and Evaluation
- Internal and External Measures
- Recommendation Systems:
- Raw Averages based Recommendation
- ANOVA and Bayesian Filtering
- Content Based Filtering
- Collaborative Filtering
- Matrix Factorization
- Text Analytics:
- Applications, Concepts, & Terminology
- Text EDA
- Vector Space Modeling:
- Set/Bag-of-Words
- TF-IDF
- Word Embedding
- Graph Analytics:
- Major Classes of Graphs
- Network Descriptive, Connectivity and Centrality Analytics
- Large-Scale Network Structure
- Clustering and Communities
- Graph Representation Learning
|
Slides
Slides
Slides
Slides
Slides
Slides
|
|
Data Visualization
- Visual Perception
- Gestalt Principles
- Context, Preattention
- Magnitude Estimation
- Visual Attributes and Mapping
- Evaluating Visualization
|
Slides
|
|
Information Retrieval (IR)
- Term-Document Incidence Matrix
- Inverted Index
- Ranked Retrieval
|
Slides
|
|
Linear Algebra Review
- Vector Operations
- Linear Functions
- Linear Transformation
- Change of Bases
- Eigenvalues & Eigenvectors
- Powers of Matrices
- Random Walk
- Markov Chain
|
Slides
|
|
Data Structures Review
- List, Stack, Queue, Set, Dictionary, Heaps
|
Slides
|
|
Web Search
- Web Search Challenges
- Trustworthy Webpages and Node Centrality
- PageRank: Dangling Nodes, Spider Traps, Random Teleporting
- Link Spamming
- Personalized and Topic-Sensitive PageRank
- Hyperlink-Induced Topic Search
|
Slides
|
|
Spectral Clustering
- Limitations of Distance Based Clustering
- Graph Partition and Cuts
- Spectral Graph Theory
- (Un)Normalized Graph Laplacians
- Relation of Graph Laplacian and Partition
- Spectral Clustering into k-Clusters
|
Slides |
|
|
Proximity Problems & Curse of Dimensionality
- Proximity Problems on High Dimensional Data
- Distance Matrix Computation
- k-Nearest Neighbor Problem
- Fixed-Radius Nearest Neighbors
- Approaches for kNN problem
- Curse of Dimensionality
- Processing and Storage
- Data Sparsity
- Issues for Nearest Neighbors
- Huge Search Space
- Diminishing volume of n-ball
- Nearest neighbor instability
- Distance Concentration
- Angle Concentration
|
Slides |
Notes
Notes |
|
Data Preparation & Representation Learning
- Data Preparation
- Data Compression
- Low Distortion Embedding
- Dimensionality Reduction
- Multi-dimensional Scaling
- Dimensionality Reduction
- Feature Selection and Extraction
- Johnson-Lindenstrauss Lemma
|
Slides |
Notes |
|
Locality Sensitive Hashing (LSH)
- LSH for k-NN and Near Duplicate Problems
- LSH for:
- Hamming Distance
- Jaccard Distance
- Cosine Distance
- Euclidean Distance
- Constructing New LSH Families
- Non-LSH-able Distance Measures
- Data Dependent LSH
|
Slides |
Notes |
|
Data Streams
- Data Stream: Model of Computation
- Synopsis Based Exact Stream Computation
- Sliding Window, Histogram and Wavelets
- Sketches: Count, Count-Min & AMS
|
Slides |
Notes |
|
Principal Component Analysis (PCA)
- Aims of PCA
- PCA vs. JL-Transform
- Variance-Covariance Matrix
- PCA Objective:
- Reconstruction Error and Projected Variance
- Linear Algebraic Formulation
- Eigen Decomposition of Covariance Matrix
- Power Iteration Method
- Eigenfaces
- Limitations of PCA
|
Slides |
Notes |
|
Matrix Factorization and SVD
- Rank Factorization of a Matrix
- Low Rank Approximation
- Singular Value Decomposition (SVD)
- SVD Applications:
- Recommendation System
- Latent Semantic Analysis
- Data Denoising
- PCA and SVD
|
Slides |
Notes |
|