Introduction
Writing data structures from R to disk is an essential task in many applications, especially when working with large datasets. In this article, we compare writing objects to disk from R through custom C++ code with writing them through the fst package.
Background
The fst package is a popular choice for fast and efficient data serialization in R. However, it has some limitations: it can require more memory and CPU resources than simpler methods. In this article, we compare the write performance of custom C++ code, in an original and a block-wise variant, with that of the fst package.
Benchmarking SSD Write and Read Performance
Benchmarking SSD write and read performance is a tricky business due to various factors that can affect the results. These include DRAM caching, block sizes of write and read operations, and CPU cache effects. To get accurate results, we need to consider these factors when designing our benchmark.
Block Sizes of Write and Read Operations
The physical sector size of an SSD is typically 4 KB. Writing blocks smaller than that can hamper performance, but writing very large blocks can also lower performance due to CPU cache effects. The fst package writes data in relatively small chunks, which usually makes it faster than alternatives that write the data in a single large block.
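The relationship between sector size, block size, and payload size is plain integer arithmetic. The sketch below assumes 4096-byte sectors and a 2^18-byte block (the BLOCKSIZE value used later in this article). The names kSectorBytes, kBlockBytes, BlockPlan, and plan_blocks are illustrative and not part of fst or Rcpp.

```cpp
#include <cstdint>

// Assumed values for illustration: 4096-byte sectors, 2^18-byte blocks.
constexpr std::uint64_t kSectorBytes = 4096;
constexpr std::uint64_t kBlockBytes  = 262144;

// How a payload splits into full blocks plus a tail.
struct BlockPlan {
    std::uint64_t full_blocks;      // number of complete blocks
    std::uint64_t remaining_bytes;  // bytes left over after the last full block
};

// Split total_bytes into full blocks of block_bytes and a remainder.
// block_bytes must be non-zero.
constexpr BlockPlan plan_blocks(std::uint64_t total_bytes,
                                std::uint64_t block_bytes) {
    return BlockPlan{total_bytes / block_bytes, total_bytes % block_bytes};
}

// A block of 2^18 bytes covers a whole number of 4096-byte sectors.
static_assert(kBlockBytes % kSectorBytes == 0,
              "block size should be a multiple of the sector size");
static_assert(kBlockBytes / kSectorBytes == 64,
              "one block spans 64 sectors");
```

With these values, a vector of 2^27 doubles (2^30 bytes) splits into exactly 4096 full blocks with no remainder, while a vector of 1000 doubles (8000 bytes) fits entirely in the tail.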
Modifying the C++ Code
To get this block-wise writing to the SSD, we can modify our C++ code to write in fixed-size blocks instead of issuing a single large write. We define a constant BLOCKSIZE and adjust the code to write in blocks of this size.
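#include <Rcpp.h>
#include <fstream>

#define BLOCKSIZE 262144 // 2^18 bytes per block

// [[Rcpp::export]]
long test_blocks(SEXP x, Rcpp::String path) {
  // x is assumed to be a double vector; view its payload as raw bytes
  char* d = reinterpret_cast<char*>(REAL(x));

  std::ofstream outfile;
  outfile.open(path.get_cstring(), std::ios::out | std::ios::binary);

  // total number of bytes (8 per double); note that long is 32 bits on Windows
  long dl = Rf_xlength(x) * 8;
  long nr_of_blocks = dl / BLOCKSIZE;

  // write all full blocks
  for (long block_nr = 0; block_nr < nr_of_blocks; block_nr++) {
    outfile.write(&d[block_nr * BLOCKSIZE], BLOCKSIZE);
  }

  // write the tail that does not fill a whole block
  long remaining_bytes = dl % BLOCKSIZE;
  outfile.write(&d[nr_of_blocks * BLOCKSIZE], remaining_bytes);

  outfile.close();
  return dl;
}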
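To check the block-wise logic outside of R, the same idea can be sketched in plain C++ without Rcpp. This is a minimal sketch, not the article's function: write_blocks and read_all are hypothetical helper names, the block size is a parameter instead of a macro, and the input is a std::vector<double> rather than an R vector.

```cpp
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Write `values` to `path` in chunks of at most `block_bytes` bytes.
// Returns the number of bytes written, or -1 on failure.
long long write_blocks(const std::vector<double>& values,
                       const std::string& path,
                       std::size_t block_bytes) {
    if (block_bytes == 0) return -1;
    std::ofstream out(path, std::ios::out | std::ios::binary | std::ios::trunc);
    if (!out) return -1;

    const char* bytes = reinterpret_cast<const char*>(values.data());
    const std::size_t total = values.size() * sizeof(double);

    // Full blocks first; the last iteration writes the shorter tail, if any.
    std::size_t offset = 0;
    while (offset < total) {
        const std::size_t n = std::min(block_bytes, total - offset);
        out.write(bytes + offset, static_cast<std::streamsize>(n));
        offset += n;
    }
    out.close();
    return out ? static_cast<long long>(total) : -1;
}

// Read a file of raw doubles back into memory (empty vector on failure).
std::vector<double> read_all(const std::string& path) {
    std::ifstream in(path, std::ios::in | std::ios::binary | std::ios::ate);
    if (!in) return {};
    const std::streamsize size = in.tellg();
    in.seekg(0, std::ios::beg);
    std::vector<double> values(static_cast<std::size_t>(size) / sizeof(double));
    in.read(reinterpret_cast<char*>(values.data()),
            static_cast<std::streamsize>(values.size() * sizeof(double)));
    return values;
}
```

Writing a vector whose byte length is not a multiple of the block size and reading it back with read_all is a quick way to confirm that the tail after the last full block is not lost.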
Comparing Methods
Now we can compare the original method test, the block-wise method test_blocks, and fst::write_fst in a single benchmark.
x <- runif(134217728) # 2^27 doubles = 1 gigabyte
df <- data.frame(X = x)
fst::threads_fst(1) # use fst in single threaded mode
microbenchmark::microbenchmark(
  test(x, "test.bin"),
  test_blocks(x, "test.bin"),
  fst::write_fst(df, "test.fst", compress = 0),
  times = 10)
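A similar comparison can be sketched in plain C++ for readers who want to time the write loop without R in the picture. The helpers write_buffer and median_write_ms below are hypothetical names, not part of any package, and the sketch only measures the time until the stream is closed. The operating system may still hold the data in DRAM at that point, so the numbers say little about the SSD itself.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Write `values` to `path`. block_bytes == 0 means one single write call;
// any other value writes in chunks of at most block_bytes bytes.
bool write_buffer(const std::vector<double>& values,
                  const std::string& path,
                  std::size_t block_bytes) {
    std::ofstream out(path, std::ios::out | std::ios::binary | std::ios::trunc);
    if (!out) return false;
    const char* bytes = reinterpret_cast<const char*>(values.data());
    const std::size_t total = values.size() * sizeof(double);
    const std::size_t chunk = (block_bytes == 0) ? total : block_bytes;
    for (std::size_t offset = 0; offset < total; ) {
        const std::size_t n = std::min(chunk, total - offset);
        out.write(bytes + offset, static_cast<std::streamsize>(n));
        offset += n;
    }
    out.close();
    return static_cast<bool>(out);
}

// Time `times` writes and return the middle element of the sorted timings
// in milliseconds (the upper median for even counts), or -1 on failure.
double median_write_ms(const std::vector<double>& values,
                       const std::string& path,
                       std::size_t block_bytes,
                       int times) {
    std::vector<double> ms;
    for (int i = 0; i < times; ++i) {
        const auto start = std::chrono::steady_clock::now();
        const bool ok = write_buffer(values, path, block_bytes);
        const auto stop = std::chrono::steady_clock::now();
        if (!ok) return -1.0;
        ms.push_back(
            std::chrono::duration<double, std::milli>(stop - start).count());
    }
    if (ms.empty()) return -1.0;
    std::sort(ms.begin(), ms.end());
    return ms[ms.size() / 2];
}
```

Calling median_write_ms once with block_bytes set to 0 and once with 262144 gives a rough single-write versus block-wise comparison. As with the R benchmark, the outcome depends heavily on the system it runs on.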
Conclusion
The modified method test_blocks is about 40 percent faster than the original method and even slightly faster than the fst package. This is expected because fst has some overhead in storing column and table information, possible attributes, hashes, and compression information.
Please note that the difference between fst and the initial test method is much less pronounced on my system, which shows again how challenging it is to use benchmarks to optimize a system.
Recommendations
To achieve faster write speeds when writing objects to disk in R, consider the following:
- Write in moderately sized blocks (256 KiB in the example above) rather than in a single large write.
- Avoid using DRAM caching if it’s not necessary.
- Optimize your code for CPU cache effects.
- Consider using packages like fst that provide efficient data serialization.
By understanding the factors that affect SSD write and read performance and applying these recommendations, you can achieve faster write speeds when writing objects to disk in R.
Last modified on 2024-10-31