Crosswalk

Assessment language

Outcome:

Students are able to access and import a dataset;
they are able to clean, transform, and visualize the dataset appropriately for a well-described analytic goal.

KB & PB

Crosswalk of learning goals – KB & PB – from KB’s 11 modules (slides), 12th module: project presentations

module	KB	PB	differences
1 – Intro	course expectations and policies	✓
	tools and resources	✓	PB: git, command-line, colab
	introduction to data science	✓
	introduction to R	Python (assumed)	PB: github-classroom (no notebooks)
2 – Data Viz	common statistical graphics	✓
	how to look at data	✓
	key ingredients of useful plots	✓
	grammar of graphics	seaborn vs matplotlib	ggplot vs OO nature of Python
3 – Data Processing	Types of data	✓	PB calls the module: 03-Tidy
	Structuring data for data science	✓
	Data wrangling and transformation	✓
	Summarizing data	✓
4 – EDA	What is EDA?	in PB modules 2 & 3	not a separate module
	Variation and covariation in data	’’	’’
	“Interesting” visualizations	’’	’’
5 – SQL	What is relational data?	✓	PB calls the module: 04-Relational
	What is SQL?	✓
	Basics of relational algebra	✓
	Types of joins	✓
6 – Modeling I	What are the goals of modeling?	✓	PB calls it: 05-Regression
	Why linear regression?	✓
	Fitting linear models	✓
	Model diagnostics	✓
7 – Modeling II	Criteria for evaluating models	✓	PB calls it: 07-Resampling
	Overfitting and how to avoid it	✓
	Performing cross-validation	✓
	Selecting models	✓
8 – Statistical Inference	What is statistical inference?	in 05-Regression	not a separate module
	Distributions of statistics	’’	’’
	Confidence intervals	’’	’’
	Hypothesis tests	’’	’’
9 – SupervisedML	What are the goals of supervised ML?	✓	PB calls it: 06-Classification
	Building classification models	✓
	Dealing with class imbalance	✓
10 – UnsupervisedML	Goals of unsupervised ML	✓	PB calls it: 09-Unsupervised
	Dimension reduction	✓
	Clustering	✓
11 – Text mining	Structuring text data	✓	PB calls it: 11-Text
	EDA using term frequency	✓
	Sentiment analysis	✓
	Topic models	✓
12 – Projects	✓

Following content only in PB’s version

module	KB	PB
10-Trees	N/A	beyond linear models with Trees & SVMs
	N/A	decision-tree basics, random forest, ensembles of weak learners
	N/A	SVM with nonlinear kernel (relationship to logistic regression)
	N/A	intro to image processing (faces and digits)
12-Deep	N/A	optional module (there is no related homework)
	N/A	comes before projects only when holidays and scheduling allow
	N/A	intro to neural networks (perceptron, relationship to logistic regression)
	N/A	function approximation, nonlinearity, stochastic GD (tensorflow playground)

PB outline (detail)

1 – Intro
- Course overview, expectations and policies
- Tools – Git, Github vs Colab, DS packages
- Intro to data visualization
- Reproducible analysis – Git & Github vs Jupyter & Colab
2 – DataViz
- Simple statistical plots (histograms, box plots, scatterplots)
- Relationship to probability distributions (normalizing histograms)
- Random number generators (central limit theorem demo)
- Seaborn & Matplotlib – OO nature of core DS packages
3 – Tidy
- Processing and managing tidy/tabular data – faceting
- Converting messy data to tidy table
- Pandas & Numpy – OO nature, troubleshooting indexing gotchas (mistakes without Errors)
4 – Relational
- Intro to relational databases & SQL
- Working with relational tables
- SQLite – loading and querying a database with SQLite3
5 – Regression
- Intro to linear regression with statsmodels
- Statistical hypothesis testing (p-values, $R^2$ , etc.) with statsmodels
- Data processing with 2-D arrays, processing pipelnes, the estimators API
- Visualizing residuals, assessing model assumptions (e.g., Gaussian errors)
6 – Classification
- Categorical data, one-hot encoding
- Linear regression vs logistic regression (polynomials vs sigmoid)
- Optimization criteria (relatiohship between least squares & maximum likelihood)
7 – Resampling
- Train/test split, validation set, cross validation
- Bias-variance tradeoff, estimating standard errors
- Troubleshooting non-random data, missing data, problematic resampling algorithms
8 – Selection
- high-dimensional data and the curse of dimensionality (ISLR2 Chapter 6.1-6.4)
- feature scaling, implications for knn vs logistic regression, algorithmic convergence
- model/feature selection, regularization
9 – Unsupervised
- PCA for dimension reduction and clustering (and imputation)
- K-means, silhouette analysis
- Processing image datasets as 2-D arrays
10 – Trees
- beyond linear models (trees and SVMs)
- decision-tree basics, random forest, ensembles of weak learners,
- SVM with nonlinear kernels
- intro to image processing (faces and digits)
11 – Text
- Intro to text as high-dimensional data (sparse matrices, timing code execution)
- Text mining – structuring, cleaning, tf-idf
- Text classification (naive Bayes), topic modeling (LSA & PCA)
12 – Projects
13 – Deep (optional, no related homework)
- Deep-learning module comes before projects only when holidays and scheduling allows
- Introduction to neural networks (perceptron and relatihsip to logistic regression)
- Function approximation, nonlinearity, stochastic GD (tensorflow playground)

Approach & Issues

Data visualization & EDA
- These aren’t separate modues – EDA is part of Data visualization
- starting point for every dataset/assignment
- material related to data types & structures are part of 5010 (NaN, int(), etc)
Reproducibility
- colab for prototyping, not permitted for assignments (github-classroom instead)
- concise communications – results summarized markdown
- command-line reproducility, modular code, DRY, Makefiles, nicely organized repo
Flipped
- ISL & PDS selected reading in advance of in-class exercises
- poll everywhere instead of quizzes
- in-class exercises in small groups
- homeworks reinforce and extend in-class exercises
- no midterm or final – emphasis on homeworks instead
- resubmission policy (max grade 90%)
Applications of OOP
- Estimator API
- Strengths and weaknesses (Matplotlib <-> Seaborn, Numpy <-> Pandas)
- Python vs C/C++
Projects
- data/story/stakeholder requirement
- no kaggle datasets
- these invariably emphasize data management and processing
Challenges – wide range of backgrounds
- some students don’t have align background
- some don’t have linear algebra
- many students have lots of SQL