---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
# Hack: strip the trailing slash from TMPDIR under macOS, since the resulting
# double slashes in temporary file paths break duckdb
Sys.getenv("TMPDIR") |>
  stringr::str_remove("/$") |>
  Sys.setenv(TMPDIR = _)
```
# delaytional
<!-- badges: start -->
[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)
[![CRAN status](https://www.r-pkg.org/badges/version/delaytional)](https://CRAN.R-project.org/package=delaytional)
<!-- badges: end -->
`delaytional` provides a backend for [`DelayedArray`](https://bioconductor.org/packages/release/bioc/html/DelayedArray.html)
that stores the array data in an SQL database. The main motivation is to
leverage the impressive performance of [DuckDB](https://duckdb.org/) to store
array data in Parquet files and query it out of memory, without loading the
entire dataset into R. Other possible use cases include:

* Out-of-memory querying of CSV or NDJSON data, via DuckDB
* Using SQLite databases as array stores (see the sketch below)
* Re-using an existing database server (DBMS) to support matrix data
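
For instance, pointing `delaytional` at an SQLite database should look much
like the DuckDB example below. This is an untested sketch: the `arrays.sqlite`
file and `array_table` table are hypothetical, and it assumes that
`SqlArraySeed()` accepts any `dplyr` table backed by a DBI connection:

```r
# Hypothetical database and table names; requires the RSQLite package
con <- DBI::dbConnect(RSQLite::SQLite(), "arrays.sqlite")
delayed_sqlite <- con |>
  dplyr::tbl("array_table") |>
  delaytional::SqlArraySeed() |>
  DelayedArray::DelayedArray()
```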
## Installation
You can install the development version of `delaytional` from [GitHub](https://github.com/) with:
```r
# install.packages("remotes")
remotes::install_github("WEHI-ResearchComputing/delaytional")
```
## Example
Let's say we're working with a fairly large matrix:
```{r}
in_mem <- rnorm(n = 4E8) |> matrix(ncol = 2E4)
lobstr::obj_size(in_mem)
```
This isn't too bad (4 × 10⁸ doubles at 8 bytes each is about 3.2 GB), but you can easily imagine we might be in trouble if the matrix were 100 times larger.
Let's convert it into a `delaytional` array and see if we can improve that.
First we need to write the array to parquet using `array_to_table`:
```{r}
table_dest <- tempfile(fileext = ".parquet")
in_mem |>
  delaytional::array_to_table() |>
  arrow::write_parquet(table_dest)
```
Next, we can create an array from the parquet file.
This is also how you would create a `DelayedArray` from an existing parquet
dataset that you didn't make yourself:
```{r}
delayed_parquet <- duckdb::duckdb() |>
  DBI::dbConnect() |>
  dplyr::tbl(table_dest) |>
  delaytional::SqlArraySeed() |>
  DelayedArray::DelayedArray()
```
How big is it now?
```{r}
lobstr::obj_size(delayed_parquet)
```
Now let's pull out some random indices from the two arrays, and compare the results:
```{r}
set.seed(1)
x <- sample.int(n = nrow(in_mem), size = 5)
y <- sample.int(n = ncol(in_mem), size = 5)
```
```{r}
in_mem[x, y]
```
```{r}
delayed_parquet[x, y]
```
```{r}
all.equal(
  in_mem[x, y],
  as.array(delayed_parquet[x, y])
)
```
The same results! The only difference is that `delayed_parquet` has a much smaller
memory footprint.
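
Finally, it's good practice to close the database connection once you're done
with the array. The example above piped `DBI::dbConnect()` straight into
`dplyr::tbl()`, so here is a sketch of a variant that keeps the connection in
a variable (named `con` purely for illustration) so that DuckDB can be shut
down cleanly afterwards:

```r
con <- DBI::dbConnect(duckdb::duckdb())
delayed_parquet <- con |>
  dplyr::tbl(table_dest) |>
  delaytional::SqlArraySeed() |>
  DelayedArray::DelayedArray()

# Once finished with `delayed_parquet`, shut down the embedded DuckDB instance
DBI::dbDisconnect(con, shutdown = TRUE)
```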