DS 5110

Introduction to Data Management and Processing

COMING SOON: This course will be renamed “Foundations of the Data Science Pipeline” (same content)
Instructor: Philip Bogden, p.bogden@northeastern.edu
Office Hours: Fridays 2-4pm or by appointment (on Teams: https://teams.northeastern.edu)
Canvas: Schedule, assignments, grades
Github Classroom: Assignment submissions – you’ll be invited using your northeastern.edu email
Colab: Prototyping and in-class exercises only – use your husky.neu.edu email: https://colab.research.google.com

Course description

Introduces students to the core tasks in data science, including data collection, storage, tidying, transformation, processing, management, and modeling for the purpose of extracting knowledge from raw observations. Programming is a cross-cutting aspect of the course. Offers students an opportunity to gain experience with data science tasks and tools through short assignments. Includes a term project based on real-world data. (This course description is from the Academic Catalog.)

Texts

Selected readings from two outstanding books provide context and analytic goals for case studies from a range of disciplines.

PDS: Python Data Science Handbook (2022) by Jake VanderPlas

Working scientists and data crunchers familiar with reading and writing Python code will find this comprehensive desk reference ideal for tackling day-to-day issues: manipulating, transforming, and cleaning data; visualizing different types of data; and using data to build statistical or machine learning models. Quite simply, this is the must-have reference for scientific computing in Python.
- The entire book is available for free on github in the form of Jupyter notebooks that launch automatically in Colab.
- This text provides an introduction to core tasks in data science, implemented in Python using standard packages.
- Chapters 2-4 cover data management, processing and visualization with Numpy, Pandas and Matplotlib.
- Chapter 5 uses Scikit-Learn to implement many of the topics described in ISL.

ISL: Introduction to Statistical Learning with Applications in Python, 2nd Ed (2023) by Gareth James, Daniela Witten, Trevor Hastie, Rob Tibshirani and Jonathan Taylor

This book is intended for anyone who is interested in using modern statistical methods for modeling and prediction from data. This group includes scientists, engineers, data analysts, data scientists, and quants, but also less technical individuals with degrees in non-quantitative fields such as the social sciences or business.
- The entire book (pdf) is available for free from the website along with video lectures by Hastie and Tibshirani.
- ISL concentrates more on applications than mathematical details.
- Each chapter has a lab in R or Python that is independent of the concept sections.
- Selected readings from the concept sections are assigned weekly.
- For those new to Python, the lab in Chapter 3 provides a nice introduction.

Schedule

Each week, selected reading provides context and analytic objectives for in-class sessions, which emphasize coding and troubleshooting. Classes involve collaborative coding with shared, executable (Jupyter) notebooks running in the cloud (Colab). However, assignments emphasize reproducibility for the entire data-processing pipeline in version-controlled (Github) repositories.

Week	Module	ISL	PDS
1	Intro	Ch 1	Ch 1&2 (ipython & numpy)
2	DataViz	§ 2.1	Ch 4 (matplotlib & seaborn)
3	Tidy	—	Ch 3 (tidying & transformation with pandas)
4	Relational*	—	—
5	Regression	§ 3.1-3	§ 5.06 (linear regression)
6	Classification	§ 4.1-4.3	§ 5.04-5 (categorical data & one-hot encoding)
7	Resampling	§ 2.2, § 5.1	§ 5.03 (cross validation)
9	Selection	§ 6.1	§ 5.09 (PCA & dimension reduction)
10	Clustering	§ 12.1-2, § 12.4	§ 5.11 (K-means)
11	Nonlinearity	§ 8.1-2, 9.1, 9.3	§ 5.07-8 (random forest, SVMs)
12	Text Mining	—	§ 5.04-5 (feature engineering & naive Bayes)
13	Project Presentations	—	—

* The “Relational” module introduces relational databases with SQLite using a selection of online readings.
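
As a rough sketch of the kind of task this module covers, the snippet below uses Python's built-in sqlite3 module to create two related tables, join them, and aggregate the result. The table names, columns, and rows are hypothetical and chosen only for illustration; they are not taken from the course materials.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")  # in-memory database; nothing is written to disk
conn.executescript("""
    CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE grades (
        student_id INTEGER REFERENCES students(id),
        assignment TEXT NOT NULL,
        score REAL NOT NULL
    );
""")
conn.executemany("INSERT INTO students VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])
conn.executemany(
    "INSERT INTO grades VALUES (?, ?, ?)",
    [(1, "hw1", 90.0), (1, "hw2", 80.0), (2, "hw1", 70.0)],
)

# Join the two tables on the student key and compute the mean score per student.
mean_scores = conn.execute("""
    SELECT s.name, AVG(g.score)
    FROM students AS s
    JOIN grades AS g ON g.student_id = s.id
    GROUP BY s.id, s.name
    ORDER BY s.name
""").fetchall()
conn.close()

print(mean_scores)
```

Because the database lives in memory, the example can be rerun in a notebook without leaving files behind; passing a file path to sqlite3.connect would persist it instead.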

Approach

By the end of the course students should be able to access and import a dataset, then clean, transform, and visualize the dataset appropriately for a well-described analytic goal. Case studies investigated in class use data types ranging from simple tables to relational databases, images and text.
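
A minimal sketch of that import, clean, and transform sequence with pandas might look like the following. The inline CSV, its column names, and its values are made up for illustration and stand in for a real file or URL.

```python
import io

import pandas as pd

# Made-up inline data standing in for a real dataset (assumption for illustration).
raw_csv = io.StringIO(
    "city,2021,2022\n"
    "Boston,10,12\n"
    "Seattle,,9\n"
)

# Access and import: read the wide-format table.
wide = pd.read_csv(raw_csv)

# Tidy: reshape so each row is one (city, year) observation.
tidy = wide.melt(id_vars="city", var_name="year", value_name="count")

# Clean: drop observations with a missing count, then fix the year type.
tidy = tidy.dropna(subset=["count"])
tidy["year"] = tidy["year"].astype(int)

# Transform: total count per city.
totals = tidy.groupby("city")["count"].sum()
print(totals)
```

From here, a call such as totals.plot.bar() would chart the result, assuming matplotlib is installed.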

Reading – Selections from ISL and PDS should be read in advance of class.
Polling – Reading comprehension and class participation are measured with online polling during classroom discussions.
Lecture – Lectures provide context for in-class exercises and comprise a fraction of total class time.
Exercises – In-class activities involve collaborative coding and typically account for most of the class time.
Homework – Assignments give students practice coding, working with data, and creating reproducible pipelines on their own.
Project – Small groups (2-3 students) collaborate on a practical stakeholder-driven XN project.

Project

Term projects allow students to gain experience working in small teams on practical problems with real-world data. Typically, these are XN-style projects that involve external stakeholders who help review prototypes and provide feedback along the way.

Code development occurs in a shared GitHub repository using basic collaborative-coding practices: prototyping in branches, opening pull requests, merging after independent collaborator review, and discussing new functionality in “issues”.
Reproducibility, documentation, attribution and clear explanations are all critically important. Project documentation should have sufficient detail so that another technical team could pick up and expand upon the project at a later date.
Projects include a front-facing github-pages site that provides an overview understandable to a non-technical audience.
Ideally, the github repo and gh-pages site will contribute to student portfolios.

Development environment

Students are responsible for installing a standard Python development environment on their own computer, including a text editor and the ability to run code and manage a git repo from the command line. The first class reviews recommendations for easy installation on any modern laptop.

Assessment

Activity	Contribution
Homework	~60%
Project	~30%
Class Participation	~10%