Introduces students to the core tasks in data science, including data collection, storage, tidying, transformation, processing, management, and modeling for the purpose of extracting knowledge from raw observations. Programming is a cross-cutting aspect of the course. Offers students an opportunity to gain experience with data science tasks and tools through short assignments. Includes a term project based on real-world data. (This course description is from the Academic Catalog.)
Selected readings from two outstanding books provide context and analytic goals for case studies from a range of disciplines.
Working scientists and data crunchers familiar with reading and writing Python code will find this comprehensive desk reference ideal for tackling day-to-day issues: manipulating, transforming, and cleaning data; visualizing different types of data; and using data to build statistical or machine learning models. Quite simply, this is the must-have reference for scientific computing in Python.
This book is intended for anyone who is interested in using modern statistical methods for modeling and prediction from data. This group includes scientists, engineers, data analysts, data scientists, and quants, but also less technical individuals with degrees in non-quantitative fields such as the social sciences or business.
Each week, selected readings provide context and analytic objectives for in-class sessions, which emphasize coding and troubleshooting. Classes involve collaborative coding with shared, executable (Jupyter) notebooks running in the cloud (Colab). Assignments, by contrast, emphasize reproducibility of the entire data-processing pipeline in version-controlled (GitHub) repositories.
Week | Module | ISL | PDS |
---|---|---|---|
1 | Intro | Ch 1 | Ch 1&2 (ipython & numpy) |
2 | DataViz | § 2.1 | Ch 4 (matplotlib & seaborn) |
3 | Tidy | — | Ch 3 (tidying & transformation with pandas) |
4 | Relational* | — | — |
5 | Regression | § 3.1-3 | § 5.06 (linear regression) |
6 | Classification | § 4.1-4.3 | § 5.04-5 (categorical data & one-hot encoding) |
7 | Resampling | § 2.2, § 5.1 | § 5.03 (cross validation) |
9 | Selection | § 6.1 | § 5.09 (PCA & dimension reduction) |
10 | Clustering | § 12.1-2, § 12.4 | § 5.11 (K-means) |
11 | Nonlinearity | § 8.1-2, 9.1, 9.3 | § 5.07-8 (random forest, SVMs) |
12 | Text Mining | — | § 5.04-5 (feature engineering & naive Bayes) |
13 | Project Presentations | — | — |
* The “Relational” module introduces relational databases with SQLite using a selection of online readings.
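As a rough illustration of the kind of task the "Relational" module covers, the sketch below builds a tiny SQLite database in memory with Python's built-in `sqlite3` module and joins two tables. The table names, column names, and rows are invented for this example and are not taken from the course materials.

```python
import sqlite3

# In-memory database; the schema and rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE station (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE reading (
        station_id INTEGER REFERENCES station(id),
        temp_c REAL
    );
    INSERT INTO station VALUES (1, 'North'), (2, 'South');
    INSERT INTO reading VALUES (1, 10.0), (1, 14.0), (2, 20.0);
""")

# Join the two tables and average the readings per station.
rows = conn.execute("""
    SELECT s.name, AVG(r.temp_c)
    FROM station AS s
    JOIN reading AS r ON r.station_id = s.id
    GROUP BY s.name
    ORDER BY s.name
""").fetchall()
conn.close()

print(rows)
```

Keeping station names in one table and readings in another avoids repeating the name on every reading; the `JOIN` recombines them only when a query needs both.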
By the end of the course, students should be able to access and import a dataset, then clean, transform, and visualize the dataset appropriately for a well-described analytic goal. Case studies investigated in class use data types ranging from simple tables to relational databases, images, and text.
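The import, clean, and transform steps named above might look something like the following minimal pandas sketch. The inline CSV, its column names, and the choice to drop missing values are assumptions made for illustration; a real assignment would read an actual file and justify its cleaning decisions against the analytic goal.

```python
import io

import pandas as pd

# Hypothetical "wide" dataset: one column per year, one missing value.
raw = io.StringIO(
    "site,2021,2022\n"
    "A,1.0,2.0\n"
    "B,,4.0\n"
)
wide = pd.read_csv(raw)

# Tidy: one row per (site, year) observation, missing values dropped.
tidy = (
    wide.melt(id_vars="site", var_name="year", value_name="value")
        .dropna(subset=["value"])
        .astype({"year": int})
        .sort_values(["site", "year"])
        .reset_index(drop=True)
)

# Transform: summarize the tidy table per site.
site_means = tidy.groupby("site")["value"].mean()

print(tidy)
print(site_means)
```

Once the data is in this long format, a visualization step is a short call away, for example `tidy.plot.scatter(x="year", y="value")` if matplotlib is installed.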
Term projects allow students to gain experience working in small teams on practical problems with real-world data. Typically, these are XN-style projects that involve external stakeholders who help review prototypes and provide feedback along the way.
Students are responsible for installing a standard Python development environment on their own computer, including a text editor, and for being able to run code and manage a git repository from the command line. The first class reviews recommendations for easy installation on any modern laptop.
Activity | Contribution |
---|---|
Homework | ~60% |
Project | ~30% |
Class Participation | ~10% |