A practical guide for students
2025-06-24
Day 1 of grad school vs. Day ???
| Principle | Tool | What You’ll Learn |
|---|---|---|
| Efficiency | FASRC | Remote computing |
| Transparency | Notebooks | Narrated and reproducible scientific analysis |
| Modularity | here() + renv | Robust file paths & environment isolation |
| Traceability | Git + GitHub | Version control and team collaboration |
| Flexibility | googledrive + pins | Reproducible I/O with shared cloud data |
All screenshots and examples are in the Google Drive folder: Climate-Smart-Public-Health/Lab Organization/Meetings & Communications/2024-2025/20250624_Tinashe_DataScienceWorkflows/Screenshots_Examples
https://drive.google.com/drive/folders/11wpDo_sJ434pPAh8maXA6TbOSke6WHZE?usp=sharing
Check your email!
Your work is now on a backed-up, high-performance server.
OH NO‼️ MY INTERNET WENT DOWN/FIREFOX CRASHED/MY LAPTOP WAS EATEN BY NEMATOADS ‼️
FASRC is a high-performance computing resource whose professional responsibility is to save you (and your data) from yourself [4, 5]
Let’s go back to our work from yesterday… wait, what was I doing again…?
library(ggplot2)
library(dplyr)
model <- lm(mpg ~ wt + hp, data = mtcars)
new_data <- data.frame(wt = c(2.5, 3.0, 3.5), hp = c(100, 150, 200))
predictions <- predict(model, newdata = new_data)
plot <- ggplot(new_data, aes(x = wt, y = hp)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  geom_text(aes(label = round(predictions, 2)), vjust = -1, color = "red") +
  labs(title = "Predicted MPG based on Weight and Horsepower",
       x = "Weight (1000 lbs)",
       y = "Horsepower")
print(plot)
---
title: "My fantastic analysis"
format:
  html:
    self-contained: true
date: now
author: "Squidward P. Tentacles"
---
In this analysis I'm going to use the `mtcars` dataset to
demonstrate regression and prediction.
## Libraries
Here are the necessary libraries you'll need to replicate this...
```{r}
library(ggplot2)
```
## Fitting the model
I chose to use the `wt` and `hp` variables as predictors because blah blah blah...
```{r}
model <- lm(mpg ~ wt + hp, data = mtcars)
```
## Predict on New Data
I'm creating some new data to predict on...
```{r}
new_data <- data.frame(wt = c(2.5, 3.0, 3.5), hp = c(100, 150, 200))
predictions <- predict(model, newdata = new_data)
```
And plotting it with ggplot:
```{r}
plot <- ggplot(new_data, aes(x = wt, y = hp)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  geom_text(aes(label = round(predictions, 2)), vjust = -1, color = "red") +
  labs(title = "Predicted MPG based on Weight and Horsepower",
       x = "Weight (1000 lbs)",
       y = "Horsepower")
plot
```
## Conclusion
In this notebook I ran an experiment and documented it so that next time
I or someone else sees it they can know exactly what I did
and why 😁
(Add 2_LiterateProgramming_step1_example2.qmd to your project)
“💡Documentation is like a love letter you write to your future self” (Damian Conway) [6]
Literate programming is the intertwining of written and machine language to self-document and explain code (Knuth, 1984) [7].
A program’s source code is made primarily to be read and understood by other people, and secondarily to be executed by the computer.
In data science, we use notebook tools like Quarto, R Markdown, or Jupyter to combine narrative, code, and outputs.












🤯🤯🤯
THIS PRESENTATION IS A NOTEBOOK
Your lab mate is not convinced that lm() and glm() are equivalent in
R, so you decide to demo it and share your notebook with them…
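A minimal sketch of such a demo, assuming the built-in mtcars data rather than whatever dataset the lab's example file uses. With its default Gaussian family and identity link, glm() fits the same model as lm(), so the estimates should agree up to numerical tolerance.

```r
# Fit the same linear model two ways on the built-in mtcars data
# (mtcars is an assumption here; any numeric outcome would do).
lm_fit  <- lm(mpg ~ wt + hp, data = mtcars)
glm_fit <- glm(mpg ~ wt + hp, data = mtcars, family = gaussian())

# Put the estimates side by side for a visual check.
comparison <- cbind(lm = coef(lm_fit), glm = coef(glm_fit))
print(comparison)

# all.equal() compares within numerical tolerance rather than bit-for-bit.
same_coefs  <- isTRUE(all.equal(coef(lm_fit), coef(glm_fit)))
same_fitted <- isTRUE(all.equal(fitted(lm_fit), fitted(glm_fit)))
print(c(coefficients = same_coefs, fitted_values = same_fitted))
```

The equivalence holds only for the Gaussian family with the identity link; other families, such as binomial(), fit different models.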
(Add 3_Modularity_step1_example3.qmd to your project and try to run it)
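A minimal sketch of the path and package management tools covered in this section, assuming the here and renv packages and a hypothetical data/salaries.csv file inside the project. here::here() builds paths from the project root (located through markers such as an .Rproj file or a .git directory), so the same code should run regardless of where the notebook is rendered from. The renv calls are left as comments because they modify the project library.

```r
# Hypothetical file location; nothing is read, we only build the path.
if (requireNamespace("here", quietly = TRUE)) {
  # Path is anchored at the project root, not the current working directory.
  data_path <- here::here("data", "salaries.csv")
} else {
  # Fallback so the sketch still runs without the here package installed.
  data_path <- file.path(getwd(), "data", "salaries.csv")
}
print(data_path)

# Fragile alternative (hypothetical absolute path) that breaks on other machines:
# read.csv("/Users/me/Desktop/project/data/salaries.csv")

# Package isolation with renv (run interactively, once per project):
# renv::init()      # create a project-local library and renv.lock
# renv::snapshot()  # record the package versions currently in use
# renv::restore()   # reinstall the recorded versions on another machine
```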


here(), configs, and renv
configs to manage paths outside of the project
renv and/or conda to manage packages
here() to manage file and document paths
renv or conda to manage package dependencies
It’s time to publish your groundbreaking paper on the equivalence of glm and lm! You’ve run the analysis three times now: fantastic_analysis_v3.R, final_FINAL.Rmd, and FINAL_revised_with_comments_v2.qmd.
During a meeting, the PI asks:
“Can you show me what changed between the version we worked on 6 weeks ago and this one you sent me yesterday?”



usethis::create_github_token()
usethis::use_git()
usethis::use_github()
Why Git?
No more file clutter: Replace final_v3_revised_REAL_FINAL.Rmd with clean version tracking
Precise change history: Git tracks edits line by line — you know what changed, when, and why.
Intentional work: Use git as a daily lab notebook where commits encourage reflection (what did I do today, and why?)
Safe collaboration: Work in parallel without overwriting each other’s code using branches and pull requests. Experimentation is encouraged
Work with the garage door up: Share your process early, even if it’s not polished. GitHub lets collaborators see your progress and offer feedback or help sooner
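The PI's question about what changed between two versions is the kind of thing Git answers directly. Here is a minimal sketch, assuming the git command-line tool is installed and calling it from R with system2(). The throwaway repository, file name, identity, and commit messages are all hypothetical.

```r
# Throwaway repository in a temp directory so nothing real is touched.
repo <- file.path(tempdir(), "git-demo")
unlink(repo, recursive = TRUE)
dir.create(repo)

# Run git inside the demo repo; the -c flags supply a demo identity
# so the commits work without any global git configuration.
git <- function(...) {
  system2("git",
          c("-C", shQuote(repo),
            "-c", "user.name=Demo", "-c", "user.email=demo@example.com",
            ...),
          stdout = TRUE, stderr = TRUE)
}

git("init", "--quiet")

# Version 1 of the analysis script.
script <- file.path(repo, "analysis.R")
writeLines("model <- lm(mpg ~ wt, data = mtcars)", script)
git("add", "analysis.R")
git("commit", "--quiet", "-m", shQuote("Fit baseline model"))

# Version 2: same file name, edited in place instead of copied to *_v2.R.
writeLines("model <- lm(mpg ~ wt + hp, data = mtcars)", script)
git("add", "analysis.R")
git("commit", "--quiet", "-m", shQuote("Add horsepower as a predictor"))

# What happened, and exactly which lines changed between the two commits?
history <- git("log", "--oneline")
changes <- git("diff", "HEAD~1", "HEAD", "--", "analysis.R")
cat(history, changes, sep = "\n")
```

On a real project, `git log --since="6 weeks ago"` lists the commits made in the window the PI asked about, and `git diff <old-commit> HEAD` shows the line-by-line changes since then.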
You’ve spent all of this time working on reproducibility, but your collaborators simply do not care for it…
“This is so much overkill for a small analysis”
“I’m not a computer scientist I really don’t care about any of this! I don’t want to learn a new package”
“Doing all of this is so slow! I can just use Excel, my old workflow has worked fine so far”
Labs that don’t… [14]
We want reproducibility.
They might want convenience.
You can’t force everyone to use Git, FASRC, Quarto, etc. all the time
You can compromise — and still keep your workflow clean — by finding and implementing creative middle-ground solutions.
For example, the pins package 📌
pins 📌
Google Drive (and friends) don’t neatly fit into the workflow (for reasons we’ve discussed)
How do you share data with collaborators outside Harvard? 🧐
How do you couple your data science to other academic workflows?
pins solves this by providing versioned, programmatic access to Google Drive, Box, OneDrive, etc.
Treats data objects in R/Python/JavaScript as “pins” and online locations as “pinup boards”
pins 📌
install.packages("pins")
install.packages("googledrive")
library(pins)
drive_id <- googledrive::as_id("https://drive.google.com/drive/folders/1MYoaffvU9ogu7nFo3Whz7XYVG1fq4wHY?usp=share_link")
board <- board_gdrive(drive_id, versioned = TRUE)
glm() results
pin_write(board, salary_glm_model, name = "glm results", description = "Showing that the glm results are identical to the lm results")
pins and Google Drive
pins gives you:
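A minimal sketch of the write/read round trip, assuming the pins package is installed. It uses board_temp(), a throwaway local board, as a stand-in for board_gdrive() so that it runs without Google authentication. The pin name and models are illustrative, not the lab's salary example.

```r
library(pins)

# Temporary local board standing in for board_gdrive(drive_id, versioned = TRUE).
board <- board_temp(versioned = TRUE)

# First version: a model with one predictor.
fit_v1 <- lm(mpg ~ wt, data = mtcars)
pin_write(board, coef(fit_v1), name = "mpg-coefficients", type = "rds",
          description = "Coefficients from mpg ~ wt")

# Version IDs start with a timestamp, so pause briefly to keep the
# two versions in a well-defined order.
Sys.sleep(1.1)

# Second version: same pin name, updated model.
fit_v2 <- lm(mpg ~ wt + hp, data = mtcars)
pin_write(board, coef(fit_v2), name = "mpg-coefficients", type = "rds",
          description = "Coefficients from mpg ~ wt + hp")

# A collaborator reads the latest version by name, with no file paths involved.
latest <- pin_read(board, "mpg-coefficients")
print(latest)

# Every write is kept, so earlier results remain retrievable.
versions <- pin_versions(board, "mpg-coefficients")
print(versions)
```

Swapping board_temp() for the board_gdrive() call above should leave the rest of the code unchanged, which is the point of the board abstraction.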
| Type of Data | Go To |
|---|---|
| 🔐 Raw data from collaborators | Google Drive & FASRC |
| 💻 Analysis code (.R, .Rmd, .qmd, .py, plain text files) | FASRC & GitHub |
| 🖼️ Notebooks, tables, plots, etc. for review (PII-safe) | GitHub or Google Drive |
| ⚠️ Any form of sensitive intermediate outputs or code | Google Drive |
| 🧼 Cleaned/shareable datasets | FASRC always; Google Drive if PII; Dataverse if anonymized |
| Principle | Tool | What We Learned |
|---|---|---|
| Efficiency | FASRC | Remote computing |
| Transparency | Notebooks | Narrated and reproducible scientific analysis |
| Modularity | here() + renv | Robust file paths & environment isolation |
| Traceability | Git + GitHub | Version control and team collaboration |
| Flexibility | googledrive + pins | Reproducible I/O with shared cloud data |
Congratulations! You’ve earned all 5 stars
[1] https://www.nature.com/articles/533452a
[2] https://www.mdpi.com/2072-4292/17/9/1482
[3] https://www.nature.com/articles/sdata201618
[4] https://www.rc.fas.harvard.edu/cluster/publications/
[5] https://docs.rc.fas.harvard.edu
[6] https://www.azquotes.com/quote/1463174
[7] https://www.cs.tufts.edu/~nr/cs257/archive/literate-programming/01-knuth-lp.pdf
[8] https://www.nature.com/articles/d41586-018-07196-1
[9] https://www.mdpi.com/2624-5175/6/1/1
[10] https://link.springer.com/article/10.3758/s13428-020-01436-x
[11] https://journals.lww.com/epidem/citation/2025/05000/advancing_reproducible_research_through_version.8.aspx
[12] https://link.springer.com/article/10.1186/1751-0473-8-7
[13] https://www.nature.com/articles/s41597-025-04451-9
[14] https://www.bbc.com/news/magazine-22223190