Author: John Blischak
In this lesson we will learn about the basics of R by inspecting a
biological dataset. I have created a spreadsheet-like dataset using data on
the human genome from the Ensembl Biomart database. Each row contains the
Ensembl transcript that has the longest coding sequence for a given Ensembl
gene ID. I also removed any genes that were not translated at all in order to
keep the size of the dataset manageable. For the exact details, you can see
data/create_datasets.R
.
My goal is twofold. First is for you to become comfortable with the basics of R and get some intuition for how it works. Second I want to provide you the motivation to actually use R the next time you analyze data instead of using Excel by showing you how powerful it can be. Please ask a question directly to me or on the EtherPad if you need more explanation.
Learning objectives:
As a first step, let's clear out the working environment and verify that
we are in the correct working directory. We'll use the trick we learned in the
first section, rm(list = ls())
to ensure that there are no objects in the
working space that could potentially interfere with our analysis if we
accidentally referred to them. The other option would be to go to the Session
tab and click the option Restart R
.
rm(list = ls())
getwd()
## [1] "/home/johnubuntu/Dropbox/software_carpentry/2013-09_bootcamp/intro_to_R"
# Your code here
Instead of opening the file in Excel, we use R's read.table
function.
genes <- read.table("data/ensembl_human_genes.txt", header = TRUE, sep = "\t",
quote = "", stringsAsFactors = FALSE)
head(genes)
## ensembl_transcript_id ensembl_gene_id chromosome_name transcript_start
## 1 ENST00000373020 ENSG00000000003 X 99883667
## 2 ENST00000373031 ENSG00000000005 X 99839799
## 3 ENST00000367770 ENSG00000000457 1 169822215
## 4 ENST00000286031 ENSG00000000460 1 169764550
## 5 ENST00000374003 ENSG00000000938 1 27939180
## 6 ENST00000367429 ENSG00000000971 1 196621008
## transcript_end external_gene_id percentage_gc_content gene_biotype
## 1 99891803 TSPAN6 40.87 protein_coding
## 2 99854882 TNMD 40.80 protein_coding
## 3 169858029 SCYL3 40.34 protein_coding
## 4 169823221 C1orf112 39.22 protein_coding
## 5 27953080 FGR 52.97 protein_coding
## 6 196716634 CFH 35.20 protein_coding
## source name_1006 cds_length
## 1 ensembl integral to membrane 738
## 2 ensembl integral to membrane 954
## 3 ensembl ATP binding 2229
## 4 ensembl 2562
## 5 ensembl positive regulation of mast cell degranulation 1590
## 6 ensembl complement activation, alternative pathway 3696
We have provided R the following information to retrieve our dataset:
data/ensembl_human_genes.txt
"\t"
stands for “tab”)The function head
gave us a preview of the dataset, which does look very
much like a spreadsheet. There are column names like ensembl_transcript_id
and chromosome_name
, and there are row numbers for each gene entry.
But how is R representing the data? Let's use some R function to further inspect this dataset. As we explore what we have imported, we'll learn about the different data types in R.
class(genes)
## [1] "data.frame"
dim(genes)
## [1] 21056 11
str(genes)
## 'data.frame': 21056 obs. of 11 variables:
## $ ensembl_transcript_id: chr "ENST00000373020" "ENST00000373031" "ENST00000367770" "ENST00000286031" ...
## $ ensembl_gene_id : chr "ENSG00000000003" "ENSG00000000005" "ENSG00000000457" "ENSG00000000460" ...
## $ chromosome_name : chr "X" "X" "1" "1" ...
## $ transcript_start : int 99883667 99839799 169822215 169764550 27939180 196621008 143816614 53362139 41040684 24683527 ...
## $ transcript_end : int 99891803 99854882 169858029 169823221 27953080 196716634 143832827 53409927 41067715 24740230 ...
## $ external_gene_id : chr "TSPAN6" "TNMD" "SCYL3" "C1orf112" ...
## $ percentage_gc_content: num 40.9 40.8 40.3 39.2 53 ...
## $ gene_biotype : chr "protein_coding" "protein_coding" "protein_coding" "protein_coding" ...
## $ source : chr "ensembl" "ensembl" "ensembl" "ensembl" ...
## $ name_1006 : chr "integral to membrane" "integral to membrane" "ATP binding" "" ...
## $ cds_length : int 738 954 2229 2562 1590 3696 1404 1914 1044 1005 ...
Thus, the data is in a data.frame with 21,056 rows and 11 columns. The 11
columns each contain a different type of information, which is summarized by
the function str
, which shows the structure of an object. str
is an
amazing tool for both beginner and advanced programmers (it was voted the
most useful R trick on a StackOverflow post),
so use it often! For our data.frame genes
, it explains the data in each
column. The individual columns are each a specific type of vector.
One big difference between R and other programming languages you may be familiar with like C++, Java, Python, Perl, etc. is that there are no scalar data types. In other words, the base object is a vector. The code below demonstrates that even though a variable may look like a scalar, it is actually a vector with a length of one. We will see the ramifications of this soon.
x <- 3
is.vector(x)
## [1] TRUE
length(x)
## [1] 1
A discussion of data types can be an extremely advanced topic. As a pragmatic consideration, I will present a simplified view of the main types of vectors in R. Below I will describe four types of vectors:
A numeric vector in our dataset is percentage_gc_content
, which as the name
implies is the percentage of G and C bases in the transcript sequence. As the
output from str(genes)
hinted, we can access the columns of a data.frame
using the dollar sign character, $
. Let's learn more about what it contains:
is.numeric(genes$percentage_gc_content)
## [1] TRUE
head(genes$percentage_gc_content)
## [1] 40.87 40.80 40.34 39.22 52.97 35.20
summary(genes$percentage_gc_content)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 25.0 40.6 46.0 47.1 52.8 83.7
Note that R has designated some vectors as int for integers. In our simplified view, we are not going to worry about this distintion because R considers them to be numeric as well, e.g.:
is.numeric(genes$cds_length)
## [1] TRUE
Since R is designed for exploratory data analysis, it is very simple to visualize data.
plot(genes$percentage_gc_content)
The graph generated by plot
is not very informative in this case since the
x-axis is simply the order of the data in the vector, i.e. the index, and
also the plot is extremely over-plotted.
hist(genes$percentage_gc_content)
On the other hand, the histogram created by hist
provides a nice
visualization of the information returned by the summary
function.
R has the typical mathematical functions you would expect:
+
, -
addition, subtraction*
, /
multiplication, division^
exponentiation%/%
integer division%%
modulo (returns the remainder of a division operation)But first, let's create some toy vectors that are more manageable to work with. There are multiple ways to create a numeric vector.
(num_vec_one <- c(1, 5, 8, 3))
## [1] 1 5 8 3
(num_vec_two <- 1:8)
## [1] 1 2 3 4 5 6 7 8
(num_vec_three <- seq(from = 3, to = 12, by = 3))
## [1] 3 6 9 12
Note: I used a trick above. Normally for an assignment, R generates no output. In order to see what is contained in the variable as it is assigned, you can surround the expression with parentheses.
In other programming languages, in order to do something simple like add one to each number of a vector, you would have to use some sort of looping mechanism to add each number separately. However, because R is vector-based this is extremely simple.
num_vec_one + 1
## [1] 2 6 9 4
num_vec_two/num_vec_three
## [1] 0.3333 0.3333 0.3333 0.3333 1.6667 1.0000 0.7778 0.6667
num_vec_three/num_vec_three
## [1] 1 1 1 1
c(0, 0, 0, 0) + 1:3 # alternatively could have used rep(0, 4)
## Warning: longer object length is not a multiple of shorter object length
## [1] 1 2 3 1
The examples above demonstrate the comcept of recycling. The first element of the first vector is paired with the first element of the second, and then the second elements of each vector, and so on. If one vector is shorter than than the other, it starts again at the beginning of the vector. As we saw in the last example, if the shorter vector is not a multiple of the longer vector, R issues a warning since it is likely a mistake.
As you may have deduced, many R functions are "vectorized.”
sqrt(c(2, 4, 9, 16))
## [1] 1.414 2.000 3.000 4.000
abs(c(-4, 6, -987))
## [1] 4 6 987
Utilizing vectorized functions will not only make your code easier to understand, but it also will speed up your code. This is because many base R functions are actually written in C, so one function call to a C function is much faster than multiple calls to C and/or R functions (see this blog post) for details).
Lastly, you may have noticed that there is always a “[1]” next to the output. This tells us the index of the first element on the line. It is useful when the output spans more than one line, e.g.
1:100
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## [18] 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
## [35] 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51
## [52] 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
## [69] 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85
## [86] 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
This also implies that R indexing starts at 1, instead of at 0 like most other programming languages. We can use this index information to extract specific elements of a vector.
num_vec_three
## [1] 3 6 9 12
num_vec_three[2:4]
## [1] 6 9 12
num_vec_three[c(1, 4)]
## [1] 3 12
genes$external_gene_id[c(1, 4, 15)]
## [1] "TSPAN6" "C1orf112" "CFTR"
Exercise
The column cds_length
contains the number of base pairs that constitute
the protein-coding portion of the gene. Thus, we would expect each entry to
be a multiple of three base pairs. Check to see if our assumption is
correct.
# your code here
The other main mode of vector that we observe in the genes
data.frame is
character. Character vectors contain non-numeric data, often referred to as
“strings.”
Question: Why is chromosome_name
a character vector and not a numeric
vector?
Create character vectors using c
. As expected, the functions for strings
are also vectorized.
dna_seqs <- c("AGTCTATGCTAGC", "ACT", "ATCGTCTCTCGGCTGGCGGCAA")
dna_len <- nchar(dna_seqs) # counts the number of characters in each string
dna_len
## [1] 13 3 22
One very useful string function is paste
. It pastes together the vectors it
is supplied, recycling shorter vectors if necessary. Below we convert the
chromosomes to the necessary format to make a BED file.
chroms <- c(1:23, "X", "Y", "MT") # the numbers are converted to characters
# Add the prefix 'chr'
paste("chr", chroms, sep = "") # alternatively could use paste0
## [1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8"
## [9] "chr9" "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16"
## [17] "chr17" "chr18" "chr19" "chr20" "chr21" "chr22" "chr23" "chrX"
## [25] "chrY" "chrMT"
substr
extracts a substring of each element. Below we extract the last six
elements of each string (all the gene ID's start with ENSG00000) to create
shorter unique identifiers.
ensembl_gene_id_short <- substr(genes$ensembl_gene_id, 10, 15)
head(ensembl_gene_id_short)
## [1] "000003" "000005" "000457" "000460" "000938" "000971"
gsub
replaces a specific pattern in each string.
(biotypes <- unique(genes$gene_biotype))
## [1] "protein_coding" "polymorphic_pseudogene"
## [3] "IG_V_gene" "TR_V_gene"
## [5] "TR_J_gene" "TR_C_gene"
## [7] "TR_D_gene" "IG_D_gene"
## [9] "IG_C_gene" "IG_J_gene"
gsub("_", "-", biotypes)
## [1] "protein-coding" "polymorphic-pseudogene"
## [3] "IG-V-gene" "TR-V-gene"
## [5] "TR-J-gene" "TR-C-gene"
## [7] "TR-D-gene" "IG-D-gene"
## [9] "IG-C-gene" "IG-J-gene"
grep
searches for a specific pattern and returns the indexes that contain
that pattern. This is useful for selecting specific subsets of a dataset. Below
we search for the genes whose gene ontology (GO) description includes the
string “mast cell”.
head(genes$name_1006) # the cryptic name Ensembl uses for GO description
## [1] "integral to membrane"
## [2] "integral to membrane"
## [3] "ATP binding"
## [4] ""
## [5] "positive regulation of mast cell degranulation"
## [6] "complement activation, alternative pathway"
grep("mast cell", genes$name_1006)
## [1] 5 1934 14726
If you find yourself performing lots of grep
searches, you will want to
invest some time in learning regular expressions. These allow you to create
more complex search patterns. See ?regex
for details.
Factors represent categorical variables. Unfortunately they are complicated.
They can be very useful at times, especially when
building statistical models or graphing. However, they can behave in very
unexpected ways to produce strange results. This is why when we orginally
read in our dataset, we set the read.table option stringsAsFactors = FALSE
.
It is best to keep everything as strings, and only convert to a factor when
you need it (and usually R will automatically convert a character vector
to a factor when it makes sense).
Here we convert the character vector gene_biotype
to a factor. The main
feature of a factor is the levels, which correspond to the set of categorical
variables.
genes$gene_biotype <- factor(genes$gene_biotype)
str(genes$gene_biotype)
## Factor w/ 10 levels "IG_C_gene","IG_D_gene",..: 6 6 6 6 6 6 6 6 6 6 ...
levels(genes$gene_biotype)
## [1] "IG_C_gene" "IG_D_gene"
## [3] "IG_J_gene" "IG_V_gene"
## [5] "polymorphic_pseudogene" "protein_coding"
## [7] "TR_C_gene" "TR_D_gene"
## [9] "TR_J_gene" "TR_V_gene"
table(genes$gene_biotype)
##
## IG_C_gene IG_D_gene IG_J_gene
## 9 28 6
## IG_V_gene polymorphic_pseudogene protein_coding
## 45 29 20760
## TR_C_gene TR_D_gene TR_J_gene
## 5 3 74
## TR_V_gene
## 97
boxplot(genes$percentage_gc_content ~ genes$gene_biotype)
If you wanted to change the order of the x-axis on the boxplot, we can order the factor.
genes$gene_biotype <- ordered(genes$gene_biotype, levels = c("protein_coding",
"polymorphic_pseudogene", "IG_C_gene", "IG_D_gene", "IG_J_gene", "IG_V_gene",
"TR_C_gene", "TR_D_gene", "TR_J_gene", "TR_V_gene"))
boxplot(genes$percentage_gc_content ~ genes$gene_biotype)
To obtain the median value for the percent GC content for each of the factor
levels that we observe in the boxplot, we can use the function tapply
.
tapply(genes$percentage_gc_content, genes$gene_biotype, median)
## protein_coding polymorphic_pseudogene IG_C_gene
## 45.89 49.06 62.99
## IG_D_gene IG_J_gene IG_V_gene
## 45.08 59.16 53.36
## TR_C_gene TR_D_gene TR_J_gene
## 43.66 55.56 46.55
## TR_V_gene
## 47.44
Here tapply
takes the first argument, genes$percentage_gc_content
, and
splits these values into the categories as specified by the factor
genes$gene_biotype
provided as the second. Then it applies the third
argument, the function median
, to each of the subcategories.
So factors are clearly powerful. But again you'll need to be careful since
they are more complicated than other vectors. In fact, even though in many
ways they are similar to vectors, they aren't even a vector. This is because
factors have the additional attribute levels
.
is.vector(genes$gene_biotype)
## [1] FALSE
is.factor(genes$gene_biotype)
## [1] TRUE
We have already seen that some functions return TRUE
or FALSE
. In R these
are of mode “logical” (also more generally know as Boolean values). And of
course all the same vector properties apply to these as well.
relational operator
Logical operators
Here are some examples:
1:10 < 5
## [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
!c(TRUE, FALSE, TRUE, TRUE)
## [1] FALSE TRUE FALSE FALSE
c(TRUE, TRUE, FALSE) & c(TRUE, FALSE, FALSE)
## [1] TRUE FALSE FALSE
c(TRUE, TRUE, FALSE) | c(TRUE, FALSE, FALSE)
## [1] TRUE TRUE FALSE
xor(c(TRUE, TRUE, FALSE), c(TRUE, FALSE, FALSE))
## [1] FALSE TRUE FALSE
c(TRUE, TRUE, TRUE) && c(TRUE, FALSE, TRUE) && c(FALSE, FALSE, TRUE)
## [1] FALSE
Best practice: TRUE
and FALSE
can also be abbreviated to T
and F
,
respectively. However, you should avoid this. TRUE
and FALSE
cannot be
used as variables names, but T
and F
can. This could lead to strange errors.
T == TRUE
## [1] TRUE
T <- "abc"
T == TRUE
## [1] FALSE
# TRUE <- 'abc' # throws an error
rm(T)
Another useful aspect of logical vectors is that they can be used for counting.
This is because TRUE
is synonymous with 1, and FALSE
with 0.
TRUE == 1
## [1] TRUE
FALSE == 0
## [1] TRUE
# How many are numbers in my vector are greater than 10?
my_nums <- c(5, 54, 9, 4, 1, 84, 47, 35, 18)
my_nums > 10
## [1] FALSE TRUE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
sum(my_nums > 10)
## [1] 5
# How many are greater than 10 but less than 40?
sum(my_nums > 10 & my_nums < 40)
## [1] 2
Logical vectors are also great for indexing. We already saw how we can select specific subsets of data by referring to the index numbers. However, we can also select elements of a list with a logical vector: elements in the vector that match up with TRUE in the logical vector are selected.
x <- 1:10
(divisible_by_2 <- x%%2 == 0)
## [1] FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE
x[divisible_by_2]
## [1] 2 4 6 8 10
# This can also be done in just one line of code
x[x%%2 == 0]
## [1] 2 4 6 8 10
The various vector modes can be converted to other modes.
as.character(-5:5)
## [1] "-5" "-4" "-3" "-2" "-1" "0" "1" "2" "3" "4" "5"
as.character(c(TRUE, T, F, FALSE))
## [1] "TRUE" "TRUE" "FALSE" "FALSE"
as.numeric(c("1", "2", "3"))
## [1] 1 2 3
as.numeric(c(TRUE, T, F, FALSE))
## [1] 1 1 0 0
as.logical(c("TRUE", "T", "F", "FALSE"))
## [1] TRUE TRUE FALSE FALSE
as.logical(-10:10)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
## [12] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
But be careful! Especially when using factors. See below for an error that can be disatrous since R will not even give a warning.
chroms <- c(1, 5, 3, 1, "X")
chroms_fac <- as.factor(chroms)
as.numeric(chroms_fac)
## [1] 1 3 2 1 4
str(chroms_fac)
## Factor w/ 4 levels "1","3","5","X": 1 3 2 1 4
If you are sure it is supposed to be numeric, first convert it to a character vector before converting to a numeric vector.
numeric_fac <- factor(c(87, 3, 1, 7, 25, 3, 87))
as.numeric(as.character(numeric_fac))
## [1] 87 3 1 7 25 3 87
Recall the histogram we made earlier of the distribution of percent GC content
for human genes, hist(genes$percentage_gc_content)
. There are lots of
options available to customize the appearance of the plot. See ?hist
for all
of the available options. Here are some ideas to get you started:
# png('figs/distribution-gc-content.png')
# Customize the histogram here
hist(genes$percentage_gc_content)
# dev.off()
After you have customized the plot, uncomment the first and last lines of code
above. Before you were always sending the plot to the plotting area of
RStudio. The function png
directs the plot to be saved in the given filename.
The last line, dev.off
, tells R to stop sending information to the external
file.
Extra challenge: Use the function abline
to add a vertical line at
50% GC content. Then add another vertical line at the median percentage GC
content (the vertical lines should be different colors). Re-run the code and
confirm that the file figs/distribution-gc-content.png
was updated.
We've already seen that we can extract a specific column of a data.frame using
the dollar sign character, $
. Additionally we can extract specific rows of
that column using square brackets, []
. But a data.frame can also be indexed
specifically using its 2D structure. Again the square brackets are used, but
this time there are two arguments separated by a comma. The first vector
specifies which rows to select, and the second vector specifies the columns.
genes[158:165, 4:6]
## transcript_start transcript_end external_gene_id
## 158 133634113 133702765 CDKL3
## 159 158958382 158992478 UPP2
## 160 2867228 2871720 PRSS21
## 161 45754516 45806951 MARK4
## 162 15969849 16077741 PROM1
## 163 18043825 18054746 CCDC124
## 164 42082601 42092916 CEACAM21
## 165 26083792 26127525 NOS2
Also similar to vectors, we can choose rows and columns with logical vectors.
A row/column that matches TRUE
in the logical vector is kept and one that
matches FALSE
is removed.
# Only keep every other column
genes[1:10, c(TRUE, FALSE)]
## ensembl_transcript_id chromosome_name transcript_end
## 1 ENST00000373020 X 99891803
## 2 ENST00000373031 X 99854882
## 3 ENST00000367770 1 169858029
## 4 ENST00000286031 1 169823221
## 5 ENST00000374003 1 27953080
## 6 ENST00000367429 1 196716634
## 7 ENST00000002165 6 143832827
## 8 ENST00000229416 6 53409927
## 9 ENST00000341376 6 41067715
## 10 ENST00000337248 1 24740230
## percentage_gc_content source cds_length
## 1 40.87 ensembl 738
## 2 40.80 ensembl 954
## 3 40.34 ensembl 2229
## 4 39.22 ensembl 2562
## 5 52.97 ensembl 1590
## 6 35.20 ensembl 3696
## 7 39.90 ensembl 1404
## 8 40.16 ensembl 1914
## 9 40.00 ensembl 1044
## 10 44.09 ensembl 1005
If the row or column argument is omitted, all rows or columns are included, respectively.
dim(genes[3, ]) # select only the 3rd row, returns a data.frame
## [1] 1 11
length(genes[, 8]) # select only the 8th column, returns a vector
## [1] 21056
We usually want to subset the data based on the values contained in the data.frame.
# Select only genes with the biotype 'protein_coding'
dim(genes[genes$gene_biotype == "protein_coding", ])
## [1] 20760 11
The argument genes$gene_biotype == "protein_coding"
creates a logical vector,
thus only rows where this condition is TRUE
are added to the new variable. All
the columns are copied as well.
1) Create a new data.frame called genes_sub
that only contains genes that are
on the X chromosome and are labeled “protein_coding.”
# Your code here
2) Now make genes_sub
contain only genes who have a GO category with the
word “development” in it. Recall that the GO column is named “names_1006”
and you can search for words in strings using the function grep
.
# Your code here
Note: There is a convenience function called subset
that makes these
sorts of operations more succint. However, it comes with the warning that it is
“intended for use interactively,” so you do not want to become too dependent
on it. See this StackOverflow post if you are curious about the dangers of
subset.
Lastly, you can also subset a data.frame by specifically referring to the
row or column names. Our data.frame genes
does not have informative row
names (they are the same as the index), but it does have column names.
colnames(genes)
## [1] "ensembl_transcript_id" "ensembl_gene_id"
## [3] "chromosome_name" "transcript_start"
## [5] "transcript_end" "external_gene_id"
## [7] "percentage_gc_content" "gene_biotype"
## [9] "source" "name_1006"
## [11] "cds_length"
genes[1:10, c("ensembl_transcript_id", "cds_length")]
## ensembl_transcript_id cds_length
## 1 ENST00000373020 738
## 2 ENST00000373031 954
## 3 ENST00000367770 2229
## 4 ENST00000286031 2562
## 5 ENST00000374003 1590
## 6 ENST00000367429 3696
## 7 ENST00000002165 1404
## 8 ENST00000229416 1914
## 9 ENST00000341376 1044
## 10 ENST00000337248 1005
Referring to the column specifically by its name not only saves you from having to figure out which column it is, but it allows you to change the order of the columns of your input data without effecting the code exectution.
Let's create BED file
of our data.frame genes_sub
, which contains protein-coding genes on the X
chromosome that are involved in any developmental process. This BED file could
then be used for a variety of purposes, e.g. find the number of ChIP-seq reads
that fall within each gene.
First we need to add the prefix “chr” to all of our chromosome names.
genes_sub$chromosome_name <- paste0("chr", genes_sub$chromosome_name)
Second, we have to convert from Ensembl coordinates, which are 1-based and both start and end coordinats are inclusive, to UCSC coordinates, which are 0-based and teh end coordinate is exclusive (details). Thus we need to subtract one from our starting positions.
genes_sub$transcript_start <- genes_sub$transcript_start - 1
Third, we need to select the right columns in the right order.
genes_bed <- genes_sub[, c("chromosome_name", "transcript_start", "transcript_end",
"ensembl_gene_id")]
And fourth, we write the data to a file.
write.table(genes_bed, "data/x_dev_genes.bed", quote = FALSE, row.names = FALSE,
col.names = FALSE)
Instead of searching for ChIP-seq reads across the whole gene body, let's
instead focus just on the transcription start site (TSS). Modify genes_bed
so
that the intervals are +/- 1000 bp from the TSS, and then write the output to
a file called x_dev_genes_tss.bed
in the directory data
.
# Your code here