DS 5110

Introduction to Data Management and Processing

Course description

Introduces students to the core tasks in data science, including data collection, storage, tidying, transformation, processing, management, and modeling for the purpose of extracting knowledge from raw observations. Programming is a cross-cutting aspect of the course. Offers students an opportunity to gain experience with data science tasks and tools through short assignments. Includes a term project based on real-world data. (This course description is from the Academic Catalog.)

Texts

Selected readings from two outstanding books provide context and analytic goals for case studies from a range of disciplines. The schedule refers to these books as ISL and PDS.

Schedule

Each week, selected readings provide context and analytic objectives for in-class sessions, which emphasize coding and troubleshooting. Classes involve collaborative coding with shared, executable (Jupyter) notebooks running in the cloud (Colab). Assignments, by contrast, emphasize reproducibility of the entire data-processing pipeline in version-controlled (GitHub) repositories.

Week  Module                 ISL                   PDS
1     Intro                  Ch 1                  Ch 1&2 (ipython & numpy)
2     DataViz                § 2.1                 Ch 4 (matplotlib & seaborn)
3     Tidy                   -                     Ch 3 (tidying & transformation with pandas)
4     Relational*            -                     -
5     Regression             § 3.1-3               § 5.06 (linear regression)
6     Classification         § 4.1-4.3             § 5.04-5 (categorical data & one-hot encoding)
7     Resampling             § 2.2, § 5.1          § 5.03 (cross validation)
9     Selection              § 6.1                 § 5.09 (PCA & dimension reduction)
10    Clustering             § 12.1-2, § 12.4      § 5.11 (K-means)
11    Nonlinearity           § 8.1-2, 9.1, 9.3     § 5.07-8 (random forest, SVMs)
12    Text Mining            -                     § 5.04-5 (feature engineering & naive Bayes)
13    Project Presentations  -                     -

* The “Relational” module introduces relational databases with SQLite using a selection of online readings.
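As a rough illustration of the kind of material the “Relational” module covers, the sketch below uses Python's built-in sqlite3 module to create two related tables in an in-memory database and join them. The table names, column names, and rows are made up for this example and are not taken from the course materials.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")  # temporary database that lives in RAM
conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE grades ("
    "student_id INTEGER REFERENCES students(id), module TEXT, score REAL)"
)

# Parameterized inserts keep data separate from the SQL text.
conn.executemany(
    "INSERT INTO students VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace")],
)
conn.executemany(
    "INSERT INTO grades VALUES (?, ?, ?)",
    [(1, "Tidy", 90.0), (1, "Regression", 80.0), (2, "Tidy", 70.0)],
)

# Join the two tables on the shared key and average each student's scores.
rows = conn.execute(
    """
    SELECT s.name, AVG(g.score)
    FROM students AS s
    JOIN grades AS g ON g.student_id = s.id
    GROUP BY s.name
    ORDER BY s.name
    """
).fetchall()
conn.close()

for name, mean_score in rows:
    print(name, mean_score)
```

The join on a shared key is the core idea: each fact is stored once, and related facts are combined at query time.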

Approach

By the end of the course students should be able to access and import a dataset, then clean, transform, and visualize the dataset appropriately for a well-described analytic goal. Case studies investigated in class use data types ranging from simple tables to relational databases, images and text.
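The import, clean, and transform steps described above might look like the following minimal sketch, which assumes pandas is installed. The inline CSV, its column names, and the load_and_tidy helper are hypothetical, chosen only to show a wide table being reshaped into tidy (one observation per row) form.

```python
import io

import pandas as pd

# Hypothetical raw data: one row per city, one column per year,
# with a missing value for Boston in 2022.
RAW_CSV = """city,2021,2022
Boston,10,
Seattle,7,9
"""


def load_and_tidy(csv_text):
    """Import a wide table and return it in tidy (long) form."""
    wide = pd.read_csv(io.StringIO(csv_text))
    # Transform: turn the year columns into a single "year" variable.
    tidy = wide.melt(id_vars="city", var_name="year", value_name="visits")
    # Clean: drop missing observations and fix the column types.
    tidy = tidy.dropna(subset=["visits"])
    tidy["year"] = tidy["year"].astype(int)
    tidy["visits"] = tidy["visits"].astype(int)
    return tidy.sort_values(["city", "year"]).reset_index(drop=True)


tidy = load_and_tidy(RAW_CSV)
totals = tidy.groupby("city")["visits"].sum()
print(tidy)
print(totals)
```

Once the data are in tidy form, grouping, plotting, and modeling steps can all work from the same table.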

Project

Term projects allow students to gain experience working in small teams on practical problems with real-world data. Typically, these are XN-style projects that involve external stakeholders who help review prototypes and provide feedback along the way.

Development environment

Students are responsible for installing and maintaining a standard Python development environment on their own computer, including a text editor and the ability to run code and manage a git repo from the command line. The first class reviews recommendations for easy installation on any modern laptop.

Assessment

Activity Contribution
Homework ~60%
Project ~30%
Class Participation ~10%