---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
# Hack: strip the trailing slash from TMPDIR under macOS, since the resulting
# double slashes in temporary file paths break duckdb
Sys.getenv("TMPDIR") |>
  stringr::str_remove("/$") |>
  Sys.setenv(TMPDIR = _)
```
# delaytional
<!-- badges: start -->
[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)
[![CRAN status](https://www.r-pkg.org/badges/version/delaytional)](https://CRAN.R-project.org/package=delaytional)
<!-- badges: end -->
`delaytional` provides a backend for [`DelayedArray`](https://bioconductor.org/packages/release/bioc/html/DelayedArray.html)
that stores the array data in an SQL database. The main motivation is to
leverage the impressive performance of [DuckDB](https://duckdb.org/) to store
array data in Parquet files and query it out of memory, without loading the
entire dataset into R. Other possible use cases include:

* Out-of-memory querying of CSV or NDJSON data, via DuckDB
* Using SQLite databases as array stores (see the sketch below)
* Re-using an existing database server (DBMS) to support matrix data
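
For instance, pointing `delaytional` at an SQLite database should look much
like the DuckDB example below. This is an untested sketch: the `arrays.sqlite`
file and `array_table` table are hypothetical, and it assumes that
`SqlArraySeed()` accepts any `dplyr` table backed by a DBI connection:

```r
# Hypothetical database and table names; requires the RSQLite package
con <- DBI::dbConnect(RSQLite::SQLite(), "arrays.sqlite")
delayed_sqlite <- con |>
  dplyr::tbl("array_table") |>
  delaytional::SqlArraySeed() |>
  DelayedArray::DelayedArray()
```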
## Installation
You can install the development version of `delaytional` from [GitHub](https://github.com/) with:
```r
# install.packages("remotes")
remotes::install_github("WEHI-ResearchComputing/delaytional")
```
## Example
Let's say we're working with a fairly large matrix:
```{r}
in_mem <- rnorm(n = 4E8) |> matrix(ncol = 2E4)
lobstr::obj_size(in_mem)
```
This isn't too bad (4 × 10⁸ doubles at 8 bytes each is about 3.2 GB), but you can easily imagine we might be in trouble if the matrix were 100 times larger.
Let's convert it into a `delaytional` array and see if we can improve that.
First we need to write the array to parquet using `array_to_table`:
```{r}
table_dest <- tempfile(fileext = ".parquet")
in_mem |>
  delaytional::array_to_table() |>
  arrow::write_parquet(table_dest)
```
Next, we can create an array from the parquet file.
This is also how you would create a `DelayedArray` from an existing parquet
dataset that you didn't make yourself:
```{r}
delayed_parquet <- duckdb::duckdb() |>
  DBI::dbConnect() |>
  dplyr::tbl(table_dest) |>
  delaytional::SqlArraySeed() |>
  DelayedArray::DelayedArray()
```
How big is it now?
```{r}
lobstr::obj_size(delayed_parquet)
```
Now let's pull out some random indices from the two arrays, and compare the results:
```{r}
set.seed(1)
x <- sample.int(n = nrow(in_mem), size = 5)
y <- sample.int(n = ncol(in_mem), size = 5)
```
```{r}
in_mem[x, y]
```
```{r}
delayed_parquet[x, y]
```
```{r}
all.equal(
  in_mem[x, y],
  as.array(delayed_parquet[x, y])
)
```
The same results! The only difference is that `delayed_parquet` has a much smaller
memory footprint.
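
Finally, it's good practice to close the database connection once you're done
with the array. The example above piped `DBI::dbConnect()` straight into
`dplyr::tbl()`, so here is a sketch of a variant that keeps the connection in
a variable (named `con` purely for illustration) so that DuckDB can be shut
down cleanly afterwards:

```r
con <- DBI::dbConnect(duckdb::duckdb())
delayed_parquet <- con |>
  dplyr::tbl(table_dest) |>
  delaytional::SqlArraySeed() |>
  DelayedArray::DelayedArray()

# Once finished with `delayed_parquet`, shut down the embedded DuckDB instance
DBI::dbDisconnect(con, shutdown = TRUE)
```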