DS 5110
Introduction to Data Management and Processing
- Instructor: Philip Bogden, p.bogden@northeastern.edu
- Office Hours: Fridays 2-4pm or by appointment (on Teams: https://teams.northeastern.edu)
- Piazza: Class-related Q&A, sign up here: piazza.com/northeastern/summer2022/ds5110
- Canvas: Schedule, assignments, grades: https://northeastern.instructure.com/courses/103324
- Github: Assignment submissions – you’ll be invited to github classroom using your northeastern email.
- Colab: Prototyping and in-class exercises. Use your husky.neu.edu email: https://colab.research.google.com
Introduction to Data Management & Processing
Introduces students to the core tasks in data science, including data collection, storage, tidying,
transformation, processing, management, and modeling for the purpose of extracting knowledge from raw observations.
Programming is a cross-cutting aspect of the course.
Offers students an opportunity to gain experience with data science tasks and tools through short assignments.
Includes a term project based on real-world data. (This course description is from the Academic Catalog.)
Texts
- ISL: Introduction to Statistical Learning with Applications in R, 2nd Ed (2021) by Gareth James, Daniela Witten, Trevor Hastie and Rob Tibshirani.
- The entire book (pdf) is available for free from the website along with video lectures by Hastie and Tibshirani.
- Weekly modules loosely follow the chapter progression in the text.
- We will not be covering the Labs and Exercises in each chapter; they use R.
- We will use Python throughout the course for all in-class exercises and outside assignments.
- PDS: Python Data Science Handbook (2022) by Jake VanderPlas
- The entire 1st edition is available for free on github as Jupyter notebooks that launch automatically in Colab.
- The 2nd edition updated in Dec 2022 is available on O’Reilly media with your Northeastern credentials
- We’ll use this text for implementation in Python
- Chapters 2-4 cover data management, processing and visualization with Numpy, Pandas and Matplotlib.
- Chapter 5 uses Scikit-Learn to implement many of the topics described in ISL.
- Other Python references and on-line documentation will be used as well, as needed.
- For those new to Python, A Whirlwind Tour of Python by Jake VanderPlas is freely available on Github as a collection of executable Jupyter notebooks.
Schedule
Week |
Topic |
ISL |
PDS |
1 |
Intro |
— |
– |
2 |
Data Viz |
Ch 1 |
Ch 2-4 |
3 |
Tidy Data |
Ch 2 |
Ch 2-3 |
4 |
Relational Data |
— |
Ch 3 |
5 |
Regression |
Ch 3 |
– |
6 |
Classification |
Ch 4 |
Ch 5 |
7 |
Resampling |
Ch 5 |
Ch 5 |
8 |
Selection |
Ch 6 |
Ch 5 |
9 |
Trees |
Ch 8 |
Ch 5 |
10 |
SVMs |
Ch 9 |
Ch 5 |
11 |
Unsupervised Learning |
Ch 12 |
Ch 5 |
12 |
Text Mining |
— |
– |
13 |
Deep Learning |
Ch 10 |
– |
14 |
Project Presentations |
— |
– |
- Weekly topics loosely follow the chapter progression in ISL.
- Complementary selected readings come from PDS.
Approach
By the end of the course students should be able to access and import a dataset,
then clean, transform, and visualize the dataset appropriately for a well-described analytic goal.
- Reading – Assignments from ISL and PDS should be read in advance of class.
- Polling – Reading comprehension is measured with online polling during classroom discussions.
- Lecture – Lectures provide context for in-class exercises and will comprise a small fraction of total class time.
- Exercises – In-class activities involve collaborative coding and account for most of the class time.
- Homework – Assignments are designed to give students practice coding and using tools on their own.
- Projects – Small groups (2-3 students) will work collaboratively on a practical data-science project.
Projects
Term projects allow students to gain experience working in small teams on practical problems with real-world data.
Ideally, these are XN-style projects that involve external stakeholders who help review prototypes
and provide feedback along the way.
- Code development occurs with a shared github repository using basic tools for collaborative coding
such as prototyping in branches, pull requests, merging after independent collaborator review,
discussing new functionality with “issues”, etc.
- Project documentation, attribution and reproducibility are critically important.
Documentation should have sufficient detail so that another technical teams could pick up and expand
upon the project at a later date.
- Projects include a front-facing github-pages site that provides an overview
understandable by a non-technical audience.
- The repo and gh-pages site can contribute to student porfolios.
Example from Spring 2022:
Development environment
Students should have a standard Python development environment installed on their computer, including
a text editor or IDE, and git (as described here).
All coding assignments will involve github repositories administered with github classroom.
Assessment
Activity |
Contribution |
Homework |
~55% |
Project |
~30% |
Class Participation |
~15% |