GEOfastq can be installed from Bioconductor as follows:
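The installation command itself is missing from this text. A minimal sketch, assuming the standard Bioconductor workflow through the BiocManager package, would be:

```r
# Install BiocManager from CRAN if it is not already available
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

# Install GEOfastq from Bioconductor
BiocManager::install("GEOfastq")
```

The version installed this way depends on the Bioconductor release that matches your R version.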
The NCBI Gene Expression Omnibus (GEO) offers a convenient interface to explore high-throughput experimental data such as RNA-seq. GEO deposits RNA-seq data as sra files to the Sequence Read Archive (SRA), which can be converted to fastq files using fastq-dump. This conversion can be quite slow, so it is usually more convenient to download the fastq files that the European Nucleotide Archive (ENA) generates for a GEO accession. GEOfastq crawls GEO to retrieve metadata and ENA fastq URLs, and then downloads the fastq files.
To get fastq data for a GEO series, we first retrieve the metadata for a GEO accession:
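The code for this step does not appear in this text. A sketch of what it might look like is given below, assuming that `crawl_gse()` takes a GEO series accession and returns the text that `extract_gsms()` consumes in the next step. The accession used here is an assumption chosen for illustration; substitute the series you are interested in.

```r
library(GEOfastq)

# Assumed example accession (not confirmed by the text above);
# replace with your own GEO series accession.
gse_name <- "GSE133758"

# Retrieve the series text from GEO (requires network access)
gse_text <- crawl_gse(gse_name)
```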
Next, we extract the sample accessions for this study and retrieve the GEO metadata and ENA fastq URL for one example sample:
gsm_names <- extract_gsms(gse_text)
gsm_name <- gsm_names[182]
srp_meta <- crawl_gsms(gsm_name)
#> 1 GSMs to process
Now that we have retrieved the necessary metadata, we are ready to download the fastq files for this sample:
data_dir <- tempdir()
# example using smaller file
srp_meta <- data.frame(
  run = 'SRR014242',
  row.names = 'SRR014242',
  gsm_name = 'GSM315559',
  ebi_dir = get_dldir('SRR014242'),
  stringsAsFactors = FALSE)
res <- get_fastqs(srp_meta, data_dir)
#> Warning in utils::download.file(files[i], destfile): URL
#> ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR014/SRR014242/SRR014242.fastq.gz: cannot
#> open destfile
#> 'F:\biocbuild\bbs-3.20-bioc\tmpdir\RtmpErObSe/SRR014242.fastq.gz', reason
#> 'Invalid argument'
#> Warning in utils::download.file(files[i], destfile): download had nonzero exit
#> status
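The URL in the warning above follows ENA's FTP directory layout: the first six characters of the run accession, an extra zero-padded subdirectory when the accession has more than six digits, and then the accession itself. The sketch below rebuilds that directory for a run accession. `ena_fastq_dir()` is a hypothetical helper written for illustration and is not part of GEOfastq; it assumes the layout just described and does not check that the files exist.

```r
# Hypothetical helper (not part of GEOfastq): build the ENA FTP directory
# that holds the fastq files for a single run accession.
ena_fastq_dir <- function(run) {
  stopifnot(
    is.character(run),
    length(run) == 1L,
    grepl("^[SED]RR[0-9]{6,9}$", run)
  )

  base <- "ftp://ftp.sra.ebi.ac.uk/vol1/fastq"
  prefix <- substr(run, 1, 6)

  # Accessions with exactly six digits sit directly under the prefix
  if (nchar(run) == 9L) {
    return(paste(base, prefix, run, sep = "/"))
  }

  # Longer accessions get a subdirectory made of the digits after the
  # first six, zero-padded to three characters (e.g. "7" becomes "007")
  extra <- substr(run, 10, nchar(run))
  subdir <- sprintf("%03d", as.integer(extra))
  paste(base, prefix, subdir, run, sep = "/")
}

ena_fastq_dir("SRR014242")
ena_fastq_dir("SRR1234567")
```

For a single-end run such as SRR014242, the file inside that directory is named after the run (`SRR014242.fastq.gz`), as in the warning above. Paired-end runs typically carry `_1` and `_2` suffixes instead.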
The following packages and versions were used in the production of this vignette.
#> R version 4.4.0 RC (2024-04-16 r86468 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows Server 2022 x64 (build 20348)
#>
#> Matrix products: default
#>
#>
#> locale:
#> [1] LC_COLLATE=C
#> [2] LC_CTYPE=English_United States.utf8
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United States.utf8
#>
#> time zone: America/New_York
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] GEOfastq_1.13.0
#>
#> loaded via a namespace (and not attached):
#> [1] digest_0.6.35 R6_2.5.1 codetools_0.2-20 fastmap_1.1.1
#> [5] doParallel_1.0.17 xfun_0.43 iterators_1.0.14 cachem_1.0.8
#> [9] parallel_4.4.0 knitr_1.46 RCurl_1.98-1.14 htmltools_0.5.8.1
#> [13] rmarkdown_2.26 lifecycle_1.0.4 bitops_1.0-7 cli_3.6.2
#> [17] foreach_1.5.2 sass_0.4.9 jquerylib_0.1.4 compiler_4.4.0
#> [21] plyr_1.8.9 tools_4.4.0 evaluate_0.23 bslib_0.7.0
#> [25] Rcpp_1.0.12 yaml_2.3.8 rlang_1.1.3 jsonlite_1.8.8