DS 5010

Introduction to Programming for Data Science

Offers an introductory course on fundamentals of programming and data structures. Covers lists, arrays, trees, hash tables, etc.; program design, programming practices, testing, debugging, maintainability, data collection techniques, and data cleaning and preprocessing. Includes a class project, where students use the concepts covered to collect data from the web, clean and preprocess the data, and make it ready for analysis. (This course description is from the Academic Catalog.)

Instructor: Philip Bogden, p.bogden@northeastern.edu
Office Hours: Fridays 2-4pm or by appointment (on Teams: https://teams.northeastern.edu)
Canvas: Schedule, assignments, grades: https://northeastern.instructure.com/courses/
GitHub: Assignment submissions. You’ll be invited to GitHub Classroom using your Northeastern email.
Colab: Prototyping and in-class exercises. Use your husky.neu.edu email: https://colab.research.google.com

Course objectives

This course builds on the foundation provided by CS 5001, 5002 and 5003. Students will apply programming basics to the design and implementation of data science applications. In particular, students will be introduced to programming as a collaborative discipline. The primary language is Python. At the end of DS 5010, a student should be able to do the following:

Acquire data from various sources
- Design and write a program that uses Python’s built-in data structures to…
- Read and process a large collection (more than 100) of files on a local or remote file system
- Read and process data from a web-accessible API
- Compute basic statistics and create standard charts (histograms, scatterplots, etc.) with Matplotlib
Document code so others can use it, including:
- Detailed in-line documentation for functions and modules
- API documentation for packages
- Clean, well-organized, self-documenting code (and knowing when that’s sufficient)
Write modular code for an entire data processing pipeline that includes:
- Testing (and test-driven development when appropriate) with standard tools
- Error checking, raising exceptions that help with debugging data
- Automating tasks when appropriate
Manage version-controlled source code for a small project
- Write project documentation for reproducibility, with proper and authoritative attribution
- Develop code collaboratively using Git and GitHub command-line tools
- Use pull requests to communicate and discuss issues with collaborators
Use common packages for data manipulation and basic visualization (e.g., NumPy, Pandas, Matplotlib)
- Know when to use these packages for efficient implementation of tasks learned in the beginning of the course
- Know how to read and use the standard API documentation for usage and customization
- Know when to use (and not to use) Stack Overflow, Google and random blogs on the Internet
Read & process common non-tabular data formats (e.g., JSON, GeoJSON, shapefiles)
- Combine disparate datasets and data types (e.g., with FIPS codes) to facilitate geospatial analysis
- Perform basic geospatial data visualization using Matplotlib
- Use other standard tools for geospatial visualization and analysis
Read code in other common languages (e.g., R or JavaScript) for understanding
- Write functionally equivalent code in Python
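
As a rough illustration of how several of these objectives fit together (built-in data structures, JSON input, error checking that helps with debugging data, and basic statistics), here is a minimal sketch using only the Python standard library. The names DataFormatError, parse_records and summarize, and the "temp" field in the sample data, are hypothetical and chosen for illustration; they are not taken from the course materials.

```python
import json
import statistics


class DataFormatError(ValueError):
    """Hypothetical exception: a record is missing a field or is non-numeric."""


def parse_records(text, field):
    """Parse a JSON array of objects and return the values of `field` as floats."""
    records = json.loads(text)
    values = []
    for index, record in enumerate(records):
        # Report which record is bad so the data problem is easy to locate.
        if field not in record:
            raise DataFormatError(f"record {index} has no field {field!r}: {record}")
        try:
            values.append(float(record[field]))
        except (TypeError, ValueError) as exc:
            raise DataFormatError(
                f"record {index} has non-numeric {field!r}: {record[field]!r}"
            ) from exc
    return values


def summarize(values):
    """Return count, mean and median for a non-empty list of numbers."""
    if not values:
        raise DataFormatError("no values to summarize")
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = '[{"temp": 20.5}, {"temp": 22.0}, {"temp": 19.5}]'
    print(summarize(parse_records(sample, "temp")))
```

Keeping parsing and summarizing in separate functions makes each one easy to test on its own, and an exception that names the offending record index is usually more useful for debugging data than a bare traceback.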

Approach

Case study – Students will develop code collaboratively in class through one or more extended case studies.
Project – A final XN-style case study will span several weeks at the end of the course.
Lecture – Lectures will provide context for in-class activities, but account for a small fraction of total class time.
In-class exercises – Collaborative software development will comprise a large fraction of class time.
Homework assignments – Individual assignments and collaborative coding will continue between class sessions.
Reading – Reading assignments will be assigned in advance of related in-class exercises.
Quizzes – Short in-class polling/quizzing will be used to monitor progress and assess reading comprehension.

Case studies

Projects allow students to gain experience working in small teams on practical problems. Code development occurs in a shared GitHub repository using basic tools for collaborative coding, such as prototyping in branches, pull requests, merging after independent collaborator review, and discussing new functionality with “issues”. Project documentation, attribution and reproducibility are critically important. Documentation should have sufficient detail that another technical team could pick up and expand upon the project at a later date. Projects include a front-facing GitHub Pages site that provides an overview understandable by a non-technical audience. The repo and gh-pages site can contribute to student portfolios.

Examples from previous classes:

Texts

Python Data Science Handbook (2016) by Jake VanderPlas
- We’ll use Chapters 1-4, which cover data management, processing and visualization with NumPy, Pandas and Matplotlib. DS 5110 covers material in Chapter 5 on machine learning.
- The entire book is available for free on GitHub in the form of Jupyter notebooks that launch automatically in Colab.

Other texts

A Whirlwind Tour of Python (2016) by Jake VanderPlas. This “fast-paced introduction to essential features of the Python language” is freely available on GitHub as a collection of executable Jupyter notebooks. Most of this material should be review.
Python for Data Analysis, 3rd Ed (August 2022) by Wes McKinney. This alternative text by the lead developer of Pandas covers data wrangling with Pandas and NumPy. The latest (3rd) edition is open-access.
Learning Python, 5th Edition by Mark Lutz
- Previous versions of the course used this book, which was published in 2013 (i.e., it’s old). It has complementary material on OOP, but a lot has changed since 2013.
R for Data Science (R4DS) by Wickham & Grolemund
- This introductory book uses R, one of the two other programming languages commonly used in data science besides Python (the other is JavaScript).

Development environment

You should have a standard Python development environment installed on your computer, including a text editor or IDE. The texts by VanderPlas and McKinney provide modern recommendations.

Assessment

Activity	Contribution
Homework	~60%
Project	~30%
Class participation	~10%