Introduction to Data Management and Processing
- Instructor: Philip Bogden, email@example.com
- Office Hours: Fridays 2-4pm or by appointment (on Teams: https://teams.northeastern.edu)
- Piazza: Class-related Q&A, sign up here: piazza.com/northeastern/summer2022/ds5110
- Canvas: Schedule, assignments, grades: https://northeastern.instructure.com/courses/103324
- Github: Assignment submissions – you’ll be invited to github classroom using your Northeastern email.
- Colab: Prototyping and in-class exercises. Use your husky.neu.edu email: https://colab.research.google.com
Introduction to Data Management & Processing
Introduces students to the core tasks in data science, including data collection, storage, tidying,
transformation, processing, management, and modeling for the purpose of extracting knowledge from raw observations.
Programming is a cross-cutting aspect of the course.
Offers students an opportunity to gain experience with data science tasks and tools through short assignments.
Includes a term project based on real-world data. (This course description is from the Academic Catalog.)
- ISLR2: An Introduction to Statistical Learning with Applications in R, 2nd Ed (2021) by Gareth James, Daniela Witten, Trevor Hastie and Rob Tibshirani.
- The entire book (pdf) is available for free from the website along with video lectures by Hastie and Tibshirani.
- Weekly topics closely follow the chapter progression in the text.
- We will not be covering the Labs and Exercises in each chapter; they use the R language.
- We will use Python throughout the course for all in-class exercises and outside assignments.
- For exercises and implementations, we’ll use the Python Data Science Handbook (2016) by Jake VanderPlas.
- The entire book is available for free on github in the form of Jupyter notebooks that launch automatically in Colab.
- Chapters 2-4 cover data management, processing and visualization with NumPy, Pandas and Matplotlib.
- Chapter 5 uses Scikit-Learn to implement many of the topics described in ISLR2.
- Other Python references and on-line documentation will be used as well, as needed.
- For those new to Python, A Whirlwind Tour of Python by Jake VanderPlas is freely available on Github as a collection of executable Jupyter notebooks.
- Reading – Assignments from ISLR2 should be read in advance of class.
- Quizzes – Frequent short (~15-minute) quizzes will be used to assess reading comprehension.
- In-class discussion – Immediately following quizzes we’ll discuss quiz answers, assignments, etc.
- Lecture – Lectures that provide context for in-class exercises will comprise a small fraction of total class time.
- In-class exercises – We’ll code collaboratively using a variety of datasets and online resources.
- Homework – Assignments are designed to give students practice coding and using tools on their own.
- Project – Small groups (2-3 students) will work collaboratively in github on a practical data-science project.
See the github repo for course notes and additional detail.
All assignments will be issued in Canvas using github classroom.
Projects allow students to gain experience working in small teams on practical problems with real-world data.
Ideally, these are XN projects that involve external stakeholders who help review prototypes
and provide feedback along the way.
- Code development occurs in a shared github repository using basic tools for collaborative coding
such as prototyping in branches, pull requests, merging after independent collaborator review,
discussing new functionality with “issues”, etc.
- Project documentation, attribution and reproducibility are critically important.
Documentation should have sufficient detail so that another technical team could pick up and expand
upon the project at a later date.
- Projects include a front-facing github-pages site that provides an overview
understandable by a non-technical audience.
- The repo and gh-pages site can contribute to student portfolios.
Example from Spring 2022:
Students should have a standard Python development environment installed on their computer, including
a text editor or IDE, and git (as described here).
All coding assignments will involve github repositories administered with github classroom.