Convert a data frame to a data.table without copy

I have a large data frame (on the order of several GB) that I'd like to convert to a data.table. Using as.data.table creates a copy of the data frame, which means I need available memory of at least twice the size of the data. Is there a way to do the conversion without a copy? Here's a simple example to demonstrate:

library(data.table) N  
With output:
library(data.table)
# data.table 1.8.10 For help type: help("data.table")
N " data 0x31e4260]: copy as.data.table.data.frame as.data.table
gc()
#             used  (Mb) gc trigger   (Mb)  max used   (Mb)
# Ncells    304519  16.3     597831   32.0    306162   16.4
# Vcells 100444242 766.4  322342905 2459.3 200933219 1533.0
asked Dec 3, 2013 at 7:12

1 Answer

This is available from v1.9.0+. From NEWS:

o Following this S.O. post, a function setDT is now implemented that takes a list (named and/or unnamed), data.frame (or data.table) as input and returns the same object as a data.table by reference (without any copy). See ?setDT examples for more.

This is in accordance with the data.table naming convention: all set* functions modify their input by reference. The := operator is the only other construct that also modifies by reference.

require(data.table) # v1.9.0+
setDT(data) # converts data, which is a data.frame, to a data.table *by reference*
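As a rough check that the conversion really happens in place, one can compare a column's memory address before and after the call. This is only a sketch, assuming data.table v1.9.0 or later; the data frame `df` and its columns are made up for illustration, and `address()` is the data.table helper that reports where an object lives in memory.

```r
library(data.table)  # assumes v1.9.0+

# Hypothetical toy data frame standing in for the multi-GB one
df <- data.frame(x = c(1.5, 2.5, 3.5), y = c(10, 20, 30))

# Where the column vector x lives before the conversion
addr_before <- address(df$x)

# Convert in place; note there is no assignment of the result
setDT(df)

# Where the same column lives afterwards
addr_after <- address(df$x)

is.data.table(df)                   # class check
identical(addr_before, addr_after)  # same address means the column was not copied
```

Because setDT works on the object itself, `df` does not need to be reassigned, which is what avoids the second copy that as.data.table makes.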

See history for older (now outdated) answers.

answered Dec 3, 2013 at 8:58

@Arun: Thanks for a detailed answer. I was actually asking how to convert a data frame to a data.table, but was sloppy in creating the toy example, I'll update my question to make it a data frame. Will the same idea then work for a data frame, e.g., getting rid of the first two setattr since a data frame already has these and keeping the rest?

Commented Dec 3, 2013 at 10:34

@YT, if you mean converting a "data.frame" to a "data.table", then of course what you say is right. If you mean a list of data.frames, then you'll have to bind them (column-wise or row-wise) before setting the class and allocating.

Commented Dec 3, 2013 at 10:37

@Arun, I meant the former, a single data frame to a data.table, I edited the question to hopefully better reflect it. Thanks again for a clever solution, I can accept the answer as is or wait if you'd like to edit it to match the revised question.

Commented Dec 3, 2013 at 10:58

Maybe this post from Matthew will help shed more light on truelength.

Commented Dec 4, 2013 at 2:54

@eddi Before R 2.14.0, the truelength member of R's vector header wasn't initialized by R. In C, if you don't initialize a variable, it has undefined contents (whatever happened to be in that chunk of RAM previously). data.table() and similar creators initialize truelength to 0 before calling alloc.col, for compatibility with R versions before 2.14.0. alloc.col looks at truelength as an input (0 is taken to mean truelength==length). At one point I thought data.table would need to depend on R>=2.14.0 because of this, but I managed to keep it at R>=2.12.0. I test with R 2.12.0 before each release to CRAN.

Commented Dec 5, 2013 at 14:44
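The over-allocation discussed in these comments can be inspected directly. The following is a small sketch, assuming a data.table version that exports `truelength()`; the toy data frame `df` is hypothetical.

```r
library(data.table)

# Hypothetical toy data frame
df <- data.frame(a = 1:3, b = 4:6)

# setDT sets the class and over-allocates spare column slots
setDT(df)

length(df)      # number of columns currently in use
truelength(df)  # number of column slots allocated, including spare ones
```

The spare slots are what allow := to add a column later without reallocating the whole table.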
