Course Description & Curriculum

Data Analysis with R

The Data Analysis with R online course consists of video lectures, an eBook and an R project to complete exercises, as well as 3 live sessions to provide personal instruction.

Course Description

Who is the intended audience of this course?

  • Active research scientists, primarily in the life sciences
  • Doctoral candidates in the first years of their research project
  • Individuals required to analyze their own data
  • Absolute beginners in programming with no experience in R

What will participants be able to do after this course?

  • Understand and use the fundamentals of base R and tidyverse syntax
  • Make basic multivariate plots
  • Perform statistical tests and extract the results
  • Develop your own data analysis workflows
  • Identify and correct the most common and frustrating problems that beginners encounter

What are the prerequisites?

  • Basic knowledge of the scientific process
  • An RStudio Cloud account
  • An installation of R and RStudio on your personal computer
  • Data from your own research project with a plan for how it should be analysed

Course Curriculum and Schedule

The course will run for approximately 1.5 to 2 weeks, depending on how the live sessions have been scheduled. The course is divided into three phases:

Phase 1 – Independent Study

You’ll be granted access to the online course content about 1 week prior to our first scheduled group session. During this period you should review the course material and practice with the exercises. We suggest that you set aside at least 1 hour a day for learning. You can cover the content for Day 6 & Day 7 after Live Session #1.

  • Welcome to the Course
  • Before we Begin
  • Case Study Plant Growth → A small & relevant case study for biologists, introduces tidyverse and data visualization concepts
  • Element 1: Functions → Using functions in R
  • Element 2: Objects → Working with data structures in R
  • Case Study SILAC → Applying basics of functions and objects to real data
  • Element 3: Logical Expressions → Asking & combining YES/NO questions with relational and logical operators, expand on SILAC case study
  • Element 4: Indexing → Extract information according to position, expand on SILAC case study
  • Element 5: Factor Variables → Working with categorical variables in R
  • Element 6: Tidyverse – tidyr → Pivot between wide and long data – why and how?
  • Element 7: Tidyverse – dplyr → The 5 verbs of data analysis

The following chapters are included in the eBook. Not all sections are explicitly discussed during the live group sessions.

Getting Started

  • Workshop Preparation → Links to software
  • Introduction → Broad overview
  • Getting Started → Basics of R & RStudio
  • Case Study Plant Growth → A small & relevant case study for biologists, introduces tidyverse and data visualization concepts
  • Reproducible Research → Writing reports with rmarkdown

Fundamentals I

  • Element 1: Functions → Using functions in R
  • Element 2: Objects → Working with data structures in R

Longer Case Study

  • Case Study SILAC → Applying basics of functions and objects to real data

Fundamentals II

  • Element 3: Logical Expressions → Asking & combining YES/NO questions with relational and logical operators, expand on SILAC case study
  • Element 4: Indexing → Extract information according to position, expand on SILAC case study
  • Element 5: Factor Variables → Working with categorical variables in R
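
To give a flavour of Elements 3 to 5, here is a minimal sketch in base R. The protein names and values are made up for illustration and are not taken from the SILAC case study; the object names are assumptions as well.

```r
# Hypothetical example data: protein names and log2 ratios (made-up values)
ratios <- c(ProtA = 1.8, ProtB = -0.2, ProtC = -2.1, ProtD = 0.4)

# Logical expressions: each comparison answers a YES/NO question per element
up      <- ratios > 1
down    <- ratios < -1
changed <- up | down          # combine two questions with OR

# Indexing: by position, by name and by logical vector
ratios[1:2]
ratios["ProtC"]
ratios[changed]

# Factor variables: categorical data with a fixed set of levels
direction <- factor(ifelse(up, "up", ifelse(down, "down", "unchanged")),
                    levels = c("down", "unchanged", "up"))
table(direction)
```

Indexing with a logical vector, as in the last indexing line, is how logical expressions and indexing fit together: the question is asked first, and its TRUE/FALSE answers then select the values.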

The Tidyverse Review

  • The Limits of Untidy Data → What is tidy data and what problems does it solve?
  • Element 6: Tidyverse – tidyr → Pivot between wide and long data – why and how?
  • Element 7: Tidyverse – dplyr → The 5 verbs of data analysis
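
As a rough preview of Elements 6 and 7, the sketch below pivots a small table from wide to long and then chains a few common dplyr verbs. It assumes the tidyr and dplyr packages are installed; the plant data and column names are invented for illustration.

```r
library(tidyr)
library(dplyr)

# Hypothetical wide data: one row per plant, one column per week (made-up values)
wide <- data.frame(
  plant  = c("p1", "p2", "p3"),
  week_1 = c(2.1, 1.9, 2.4),
  week_2 = c(3.0, 2.7, 3.5)
)

# tidyr: pivot from wide to long, so that each row holds one observation
long <- pivot_longer(wide, cols = c(week_1, week_2),
                     names_to = "week", values_to = "height")

# dplyr: filter rows, add a column, then summarise per group
result <- long %>%
  filter(plant != "p3") %>%
  mutate(height_mm = height * 10) %>%
  group_by(week) %>%
  summarise(mean_height_mm = mean(height_mm))

result
```

The long format is what makes the grouped summary a one-liner: the week is now a variable to group by rather than a set of column names.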

Additional Case Studies I

  • Case Study Effects of Meditation → Complete walk-through of a dataset simulated from a publication
  • Case Study SILAC as tidy → Revisiting the course’s main case study with a tidy mindset

Additional Topics

  • Regular Expressions → Working with patterns in text
  • Control Structures → Using conditional and iterative statements
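
The two additional topics can be sketched together in base R. The sample labels below are hypothetical, and the pattern is only one of several ways to pull out the replicate number.

```r
# Hypothetical sample labels (made-up)
samples <- c("ctrl_rep1", "ctrl_rep2", "treat_rep1", "treat_rep2")

# Regular expressions: detect and replace patterns in text
is_ctrl <- grepl("^ctrl", samples)                  # does the label start with "ctrl"?
rep_id  <- sub(".*_rep([0-9]+)$", "\\1", samples)   # keep only the replicate number

# Control structures: loop over the samples and branch on a condition
group <- character(length(samples))
for (i in seq_along(samples)) {
  if (is_ctrl[i]) {
    group[i] <- "control"
  } else {
    group[i] <- "treatment"
  }
}

data.frame(samples, group, rep_id)
```

In everyday R a vectorised `ifelse(is_ctrl, "control", "treatment")` would do the same job as the loop; the loop is written out here only to show the control structures themselves.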

Appendix

  • Different Kinds of Data → Review of basic types covered in the Statistical Literacy course
  • Short-cuts & Cheat Sheets → A collection of resources

Phase 2 – Group Learning

There are two 2-hour group sessions. We’ll cover some of the basics but I’ll also challenge you to apply your knowledge to some new exercises. Be sure to complete the exercises.

1st Live Session

  • Q & A with Instructor
  • Review content until Day 4/5
  • SILAC case study exercises

Break period

  • Complete exercises
  • Review content for days 5 – 7
  • Begin work on your own dataset

2nd Live Session

  • Q & A with Instructor
  • Complete SILAC Case Study
  • Tidyverse review & exercises

Phase 3 – 1:1 Mentoring

Each participant will have a 30-minute 1:1 call with the instructor.

Break period

  • Review content
  • Begin scripting on own data

3rd Live Session – 1:1 Mentoring

  • Ask specific questions about the course content
  • Present briefly what you’ve accomplished on your own data
  • Work with the instructor to develop your script and plan for future work

Administration

Registration & Admission

  • Your coordinator will administer the course and handle registrations.
  • Registered participants will receive an email invitation approximately 1 week before the first group session.
  • This email will contain instructions on attending the live sessions and accessing the course material.

Cancellation

  • Please inform your coordinator of cancellations promptly, so that someone from the waiting list can take your place.
  • Your access to the online material and group will be revoked once you withdraw.

Attendance

  • You are expected to attend all live sessions and to come prepared to participate in the discussion.
  • This includes completing the appropriate exercises beforehand.
  • Attendance reports will be provided to your coordinator.

FAQs

Suitability

Yes! If you’ve never programmed before, R’s learning curve is very comfortable. Here are some reasons why you don’t need to worry.

  • R is a scripting language, which means we can begin with a single text file.
  • R can run in interactive mode, which means we can execute a single command and see the result immediately.
  • R relies heavily on functional programming, which is intuitive for beginners.


I’m amazed by how quickly course participants progress from complete beginners to working on their own data.

What do you think this code does? I bet you can explain it pretty easily.

> xx <- c(6,7,3,2,7,7,8)
> mean(xx)
[1] 5.714286

This is hard to say, since it depends on the kind of work you want to do. Most beginners start with a small tabular dataset in a csv or txt file. These are seldom a problem for your laptop’s processor & memory to handle. Nonetheless, you probably already know if your computer’s performance has deteriorated without me listing system requirements. If you have serious concerns, try to attend the course on a desktop or laptop computer with better performance.

We won’t cover the what & why parts of statistics. You should know what you want to do to your data and why you’re doing it before coming to class. That’s literally your job as the researcher and domain expert. My job as the instructor is to show you how to do it in R.

If you want to brush up on your statistics knowledge, watch out for the Statistical Literacy course at your institute.

There are two opinions here and both are acceptable. I’ll let you decide what is appropriate for your skill and interest level.

  • Skill first, data second

    This says you should develop the skills to handle data before you start collecting it. I completely agree with this. Actually, designing an experiment, i.e. planning your data collection, is already part of doing statistics. Storing, naming and backing up your data are all crucial decisions in your data analysis workflow that can have severe consequences later on. So, knowing how you’ll work with your data in R before you even begin your project will help you out a lot. Bringing relevant data from a colleague (or even from your own M.Sc.) would be an acceptable alternative to not bringing any data. Bringing in no data at all is not recommended!

    The downside is that by the time you actually collect your own data your skills may have gotten rusty, and you may lack the motivation to learn when it’s not for your own data.

  • Data first, skills second

    This says that analysing your own data, with the genuine interest you have as the researcher, is incredibly motivating. You just don’t get the same motivation working on generic case studies or a colleague’s data as you do with working on your own. The 1:1 mentoring session can also be extremely beneficial at this stage.

    The downside here is that your work may be unnecessarily difficult because of poor design choices you made before you knew R.

Preparation

For our purposes, raw data means never touched by human hands. If your data collection was computer-automated (e.g. 96 & 384-well plates, mass spec, flow cytometry, etc.), your raw data is the original data file you obtained from the machine. If you have options to export the data, choose csv or txt instead of Excel. If you have many files exported as Excel already, that’s ok, but please make sure they are really the raw data, e.g. if you have already opened a file in Excel, I’d prefer not to touch it. Manually-curated data, i.e. data you collected and entered into Excel by hand, will necessarily be in Excel, but I’d still prefer that you save it as csv or txt.

We’ll discuss this in the first group session and you’ll receive an email with more details. In short, it really depends on how comfortable you feel with R throughout the course. I’d recommend starting with something easy that you already know the solution to (e.g. because you have already worked with it in another program) and that fits into a single csv or txt file.

Since you are an absolute beginner, I want to build up your confidence. Ironically, that’s why it’s important not to get too ambitious too quickly. If you set your goals too high and don’t achieve them, you may become frustrated & discouraged enough to revert to your old habits and give up on R altogether.

Also, don’t underestimate the difficulty and complexity of small data sets. If you’re not comfortable with R, they are a good training ground to improve your skills. We’ll work with some “play” data, and simulating “fake” data is a common strategy that allows you to learn before working on real, large and messy data.
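
To illustrate what simulating “fake” data can look like, here is a minimal sketch in base R. The group names, sample size, means and standard deviation are arbitrary assumptions chosen for the example, not values from any course dataset.

```r
set.seed(1)  # makes the random numbers reproducible

# Simulate two hypothetical groups of 20 measurements each
fake <- data.frame(
  group = rep(c("control", "treatment"), each = 20),
  value = c(rnorm(20, mean = 10, sd = 2), rnorm(20, mean = 12, sd = 2))
)

# Run a two-sample t-test and extract parts of the result
fit <- t.test(value ~ group, data = fake)
fit$p.value    # the p-value as a plain number
fit$estimate   # the two group means
```

Because you chose the true means yourself, you know what the analysis should find, which makes it easier to tell whether your script is doing what you intended.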

You’ll get to more advanced analytics eventually, but it’s a progression that you’ll undergo at your own pace. So, if you feel comfortable or you have prior knowledge of R and are motivated, then go ahead and try out some Bioconductor packages for bioinformatics.

Yes! Nonetheless, working with multiple files can pose challenges and you may want to stick with a single file if it’s possible.
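
As one possible approach, the sketch below reads every csv file in a folder and stacks them into a single data frame using base R only. The folder, file names and columns are hypothetical; the example writes two tiny files to a temporary folder first so that it can run on its own.

```r
# Create two small hypothetical csv files in a temporary folder
data_dir <- file.path(tempdir(), "plates")
dir.create(data_dir, showWarnings = FALSE)
write.csv(data.frame(well = c("A1", "A2"), od = c(0.10, 0.35)),
          file.path(data_dir, "plate1.csv"), row.names = FALSE)
write.csv(data.frame(well = c("A1", "A2"), od = c(0.12, 0.40)),
          file.path(data_dir, "plate2.csv"), row.names = FALSE)

# Find all csv files, read each one and record which file it came from
files <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)
plates <- lapply(files, function(f) {
  d <- read.csv(f)
  d$file <- basename(f)
  d
})

# Stack the individual data frames into one
all_plates <- do.call(rbind, plates)
all_plates
```

This only works smoothly when every file has the same columns, which is one of the challenges that makes a single file the easier starting point.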

NB. Almost every “Can R do this?” question is going to be answered with a “Yes!”. How exactly, and whether it’s a good idea, are different stories. But that’s what this class is for.
