Course Description & Curriculum

Data Analysis with R

The Data Analysis with R online course consists of video lectures, an eBook and an R project to complete exercises, as well as 3 live sessions to provide personal instruction.

Course Description

Who is the intended audience of this course?

Active research scientists, primarily in the life sciences
Doctoral candidates in the first years of their research project
Individuals required to analyze their own data
Absolute beginners in programming with no experience in R

What will participants be able to do after this course?

Understand and use the fundamentals of base R and tidyverse syntax
Make basic multivariate plots
Perform statistical tests and extract the results
Develop your own data analysis workflows
Identify and correct the most common and frustrating problems that beginners encounter

What are the prerequisites?

Basic knowledge of the scientific process
An RStudio Cloud account
An installation of R and RStudio on your personal computer
Data from your own research project with a plan for how it should be analysed

Course Curriculum and Schedule

The course will run for approximately 1.5 – 2 weeks, depending on how the live sessions have been scheduled. The course is divided into three phases:

Phase 1 – Independent Study

You’ll be granted access to the online course content about 1 week prior to our first scheduled group session. During this period you should review the course material and practice with the exercises. We suggest that you set aside at least 1 hour a day for learning. You can cover the content for Day 6 & Day 7 after Live Session #1.

Day 1 – Getting started

Welcome to the Course
Before we Begin
Case Study Plant Growth → A small & relevant case study for biologists, introduces tidyverse and data visualization concepts

Day 2 – The Fundamentals

Element 1: Functions → Using functions in R
Element 2: Objects → Working with data structures in R

Day 3 – Case Study

Case Study SILAC → Applying basics of functions and objects to real data

Day 4 – Asking questions

Element 3: Logical Expressions → Asking & combining YES/NO questions with relational and logical operators, expand on SILAC case study

Day 5 – Query your data

Element 4: Indexing → Extract information according to position, expand on SILAC case study
Element 5: Factor Variables → Working with categorical variables in R

Day 6 – Tidy Data

Element 6: Tidyverse – tidyr → Pivot between wide and long data – why and how?

Day 7 – The Grammar of Data Analysis

Element 7: Tidyverse – dplyr → The 5 verbs of data analysis

Course accompanying ebook – Table of Contents

The following chapters are included in the eBook. Not all sections are explicitly discussed during the live group sessions.

Getting Started

Workshop Preparation → Links to software
Introduction → Broad overview
Getting Started → Basics of R & RStudio
Case Study Plant Growth → A small & relevant case study for biologists, introduces tidyverse and data visualization concepts
Reproducible Research → Writing reports with rmarkdown

Fundamentals I

Element 1: Functions → Using functions in R
Element 2: Objects → Working with data structures in R

Longer Case Study

Case Study SILAC → Applying basics of functions and objects to real data

Fundamentals II

Element 3: Logical Expressions → Asking & combining YES/NO questions with relational and logical operators, expand on SILAC case study
Element 4: Indexing → Extract information according to position, expand on SILAC case study
Element 5: Factor Variables → Working with categorical variables in R

The Tidyverse Review

The Limits of Untidy Data → What is tidy data and what problems does it solve?
Element 6: Tidyverse – tidyr → Pivot between wide and long data – why and how?
Element 7: Tidyverse – dplyr → The 5 verbs of data analysis

Additional Case Studies I

Case Study Effects of Meditation → Complete walk-through of a dataset simulated from a publication
Case Study SILAC as tidy → Revisiting the course’s main case study with a tidy mindset

Additional Topics

Regular Expressions → Working with patterns in text
Control Structures → Using conditional and reiterative statements

Appendix

Different Kinds of Data → Review of basic types covered in Statistical Literacy course
Short-cuts & Cheat Sheets → A collection of resources

Phase 2 – Group Learning

There are two 2-hour group sessions. We’ll cover some of the basics but I’ll also challenge you to apply your knowledge to some new exercises. Be sure to complete the exercises.

1st Live Session

Q & A with Instructor
Review content until Day 4/5
SILAC case study exercises

Break period

Complete exercises
Review content for days 5 – 7
Begin work on your own dataset

2nd Live Session

Q & A with Instructor
Complete SILAC Case Study
Tidyverse review & exercises

Phase 3 – 1:1 Mentoring

Each participant will have a 30-minute 1:1 call with the instructor.

Break period

Review content
Begin scripting on own data

3rd Live Session – 1:1 Mentoring

Ask specific questions about the course content
Present briefly what you’ve accomplished on your own data
Work with the instructor to develop your script and plan for future work

Administration

Registration & Admission

Your coordinator will administer the course and handle registrations.
Registered participants will receive an email invitation approximately 1 week before the first group session.
This email will contain instructions on attending the live sessions and accessing the course material.

Cancellation

Please inform your coordinator of cancellations promptly, so that someone from the waiting list can take your place.
Your access to the online material and group will be revoked once you withdraw.

Attendance

You are expected to attend all live sessions and to come prepared to participate in the discussion.
This includes completing the appropriate exercises beforehand.
Attendance reports will be provided to your coordinator.

FAQs

Suitability

I've never used a programming language in my life! Am I ready for this?

Yes! If you’ve never programmed before, R’s learning curve is very comfortable. Here are some reasons why you don’t need to worry.

R is a scripting language, which means we can begin with a single text file.
R can run in interactive mode, which means we can execute a single command and see the result immediately.
R relies heavily on functional programming, which is intuitive for beginners.

I’m amazed by how quickly course participants progress from complete beginners to working on their own data.

What do you think this code does? I bet you can explain it pretty easily.

> xx <- c(6,7,3,2,7,7,8)
> mean(xx)
[1] 5.714286

I have an old computer, how do I know if it's good enough?

This is hard to say, since it depends on the kind of work you want to do. Most beginners start with a small tabular dataset in a csv or txt file. These are seldom a problem for your laptop’s processor & memory to handle. Nonetheless, you probably already know if your computer’s performance has deteriorated without me listing system requirements. If you have serious concerns, try to attend the course on a desktop or laptop computer with better performance.

Will this course cover statistics?

We won’t cover the what & why parts of statistics. You should know what you want to do to your data and why you’re doing it before coming to class. That’s literally your job as the researcher and domain expert. My job as the instructor is to show you how to do it in R.

If you want to brush up on your statistics knowledge, look out for the Statistical Literacy course at your institute.

I just started my PhD and I don't have any data of my own, what should I do?

There are two opinions here and both are acceptable. I’ll let you decide what is appropriate for your skill and interest level.

Skill first, data second
This says you should develop the skills to handle data before you start collecting it. I completely agree with this. Actually, designing experiments is already performing statistics, i.e. data collection. Storing, naming and backing up your data are all crucial decisions in your data analysis workflow that can have severe consequences later on. So, knowing how you’ll work with your data in R before you even begin your project will help you out a lot. Bringing relevant data from a colleague (or even your own M.Sc.) would be an acceptable alternative to not bringing any data. Bringing in no data at all is not recommended!
The downside is that by the time you actually collect your own data your skills may have gotten rusty, and you may lack the motivation to learn when it’s not for your own data.
Data first, skills second
This says that analysing your own data, with the genuine interest you have as the researcher, is incredibly motivating. You just don’t get the same motivation working on generic case studies or a colleague’s data as you do working on your own. The 1:1 mentoring session can also be extremely beneficial at this stage.
The downside here is that your work may be unnecessarily difficult because of poor design choices you made because you didn’t know R.

Preparation

What is raw data and why does everyone go bonkers when I use Excel?

For our purposes, raw data means never touched by human hands. If your data collection was computer-automated (e.g. 96 & 384-well plates, mass spec, flow cytometry, etc.), your raw data is the original data file you obtained from the machine. If you have options to export the data, choose csv or txt instead of Excel. If you have many files exported as Excel already, that’s ok, but please make sure they are really the raw data, e.g. if you opened and edited a file in Excel, I’d prefer not to touch it. Manually-curated data, i.e. data you collected and entered into Excel by hand, will necessarily be in Excel, but I’d still prefer that you save it as csv or txt.
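
To make the "never touched" idea concrete, here is a minimal sketch in base R. The file names, column names and values are made up for illustration; the first step only creates a stand-in for a file that would normally come from your instrument.

```r
# Stand-in for a raw export from an instrument (hypothetical file and values)
raw_file <- file.path(tempdir(), "plate_reader_raw.csv")
write.csv(data.frame(well = c("A1", "A2", "A3"),
                     signal = c(0.12, NA, 0.47)),
          raw_file, row.names = FALSE)

# Read the raw file; reading does not change the file on disk
raw <- read.csv(raw_file)

# Do all clean-up in R, e.g. drop wells with a missing signal
clean <- raw[!is.na(raw$signal), ]

# Save the result under a new name so the raw file stays untouched
write.csv(clean, file.path(tempdir(), "plate_reader_clean.csv"),
          row.names = FALSE)
```

The point of the pattern is that every change to the data is recorded in a script, and the original file can always be re-read.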

What kind of data should I bring?

We’ll discuss this in the first group session and you’ll receive an email with more details. In short, it really depends on how comfortable you feel with R throughout the course. I’d recommend starting with something easy that you already know the solution to (e.g. because you have already worked with it in another program) and that fits into a single csv or txt file.

I want to work with data in a special format (FASTA, raw RNA-Seq, BAM, Microarray, etc.). What should I do?

Since you’re an absolute beginner, I want to build up your confidence. Ironically, that’s why it’s important to not get too ambitious too quickly. If you set your goals too high and don’t achieve them, you may become frustrated & discouraged enough to revert to your old habits and give up on R altogether.

Also, don’t underestimate the difficulty and complexity of small data sets. If you’re not comfortable with R, they are a good training ground to improve your skills. We’ll work with some “play” data, and simulating “fake” data is a common strategy that allows you to learn before working on real, large and messy data.
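
As a rough illustration of what simulating "fake" data can look like, here is a minimal base R sketch. The group names, sample sizes, means and standard deviations are arbitrary assumptions, not values from the course material.

```r
# Simulate a small "fake" data set: two groups of 10 measurements each.
# All names and numbers here are made up for illustration.
set.seed(136)  # fix the random seed so the result is reproducible

fake_data <- data.frame(
  group = rep(c("control", "treated"), each = 10),
  value = c(rnorm(10, mean = 5, sd = 1),
            rnorm(10, mean = 6, sd = 1))
)

# Look at the structure and at the mean of each group
str(fake_data)
aggregate(value ~ group, data = fake_data, FUN = mean)
```

Because you chose the true group means yourself, you can check whether your analysis recovers something close to them before you trust it on real data.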

You’ll get to more advanced analytics eventually, but it’s a progression that you’ll undergo at your own pace. So, if you feel comfortable or you have prior knowledge of R and are motivated, then go ahead and try out some Bioconductor packages for bioinformatics.

Can R combine data from multiple files into one big data set?

Yes! Nonetheless, working with multiple files can pose challenges and you may want to stick with a single file if it’s possible.
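
One possible way to do this in base R is sketched below. It assumes that every file has exactly the same columns; the folder, file names and values are hypothetical and are created here only so that the example runs on its own.

```r
# Create three small example files in a temporary folder (hypothetical data)
tmp_dir <- file.path(tempdir(), "multi_file_demo")
dir.create(tmp_dir, showWarnings = FALSE)
for (i in 1:3) {
  write.csv(data.frame(sample = i, value = c(1.5, 2.5) * i),
            file.path(tmp_dir, paste0("plate_", i, ".csv")),
            row.names = FALSE)
}

# List all csv files, read each one, and stack them into one data frame
files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)
all_data <- do.call(rbind, lapply(files, read.csv))

all_data
```

If the files differ in column names or column order, `rbind()` will stop with an error or give misleading results, which is one of the challenges mentioned above.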

NB. Almost every “Can R do this?” question is going to be answered with a “Yes!”. How exactly, and whether it’s a good idea, are different stories. But that’s what this class is for.