Assignment by reference

Fast add, remove and update subsets of columns, by reference. The := operator can be used in two ways: the LHS := RHS form and the functional form. See Usage.

set is a low-overhead loop-able version of := . It is particularly useful for repetitively updating rows of certain columns by reference (using a for-loop). See Examples . It can not perform grouping operations.

let is an alias for the functional form and behaves exactly like `:=` .

LHS: A character vector of column names (or numeric positions) or a variable that evaluates as such. If the column doesn't exist, it is added, by reference.

RHS: A list of replacement values. It is recycled in the usual way to fill the number of rows satisfying i, if any. To remove a column use NULL.

x: A data.table. set() also accepts a data.frame.

i: Optional. Indicates the rows on which the values must be updated. If NULL or not provided, all rows are updated. Missing or zero values are ignored. The := form is more powerful as it allows adding/updating columns by reference based on subsets and joins. See Details.

In set, only integer type is allowed in i, indicating which rows value should be assigned to. NULL represents all rows more efficiently than creating a vector such as 1:nrow(x).

j: Column name(s) (character) or number(s) (integer) to be assigned value when column(s) already exist, and only column name(s) if they are to be created.

value: A list of replacement values to assign by reference to x[i, j].

:= is defined for use in j only. It adds or updates or removes column(s) by reference. It makes no copies of any part of memory at all. Please read vignette("datatable-reference-semantics") and follow with examples. Some typical usages are:
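As a minimal sketch of such usages, assuming a small made-up table `DT` with columns `a`, `b` and `c` (the table and the new column names are illustrative, not taken from the original page):

```r
library(data.table)

# Hypothetical example table
DT <- data.table(a = c(1L, 5L, 6L), b = c("x", "y", "z"), c = c(10, 20, 30))

DT[, d := a * 2L]                          # add a column (LHS := RHS form)
DT[a > 4L, c := 0]                         # update c only on rows where a > 4
DT[, c("e", "f") := .(c + 1, toupper(b))]  # add several columns at once
DT[, `:=`(g = a + c, h = nchar(b))]        # functional form
DT[, b := NULL]                            # remove a column
```

None of these statements needs `DT <- ...`; the table is changed in place.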

The .Last.updated variable contains the number of rows updated by the most recent := or set calls, which may be useful, for example, in production settings for testing assumptions about the number of rows affected by a statement; see .Last.updated for details.

Note that for efficiency no check is performed for duplicate assignments, i.e. if multiple values are passed for assignment to the same index, assignment to this index will occur repeatedly and sequentially; for a given use case, consider whether it makes sense to create your own test for duplicates, e.g. in production code.

All of the following result in a friendly error (by design) :

For additional resources, please read vignette("datatable-faq") . Also have a look at StackOverflow's data.table tag .

:= in j can be combined with all types of i (such as binary search), and all types of by. This is one reason why := has been implemented in j. Please see vignette("datatable-reference-semantics") and also FAQ 2.16 for analogies to SQL.

When LHS is a factor column and RHS is a character vector with items missing from the factor levels, the new level(s) are automatically added (by reference, efficiently), unlike base methods.
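A small sketch of the factor behaviour described above, using a made-up table (the names `DT`, `id` and `f` are illustrative):

```r
library(data.table)

DT <- data.table(id = 1:3, f = factor(c("a", "b", "a")))

# "c" is not yet a level of f; := adds it instead of producing NA
DT[2L, f := "c"]
levels(DT$f)
```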

Unlike <- for data.frame , the (potentially large) LHS is not coerced to match the type of the (often small) RHS. Instead the RHS is coerced to match the type of the LHS, if necessary. Where this involves double precision values being coerced to an integer column, a warning is given when fractional data is truncated. It is best to get the column types correct up front and stick to them. Changing a column type is possible but deliberately harder: provide a whole column as the RHS. This RHS is then plonked into that column slot and we call this plonk syntax , or replace column syntax if you prefer. By needing to construct a full length vector of a new type, you as the user are more aware of what is happening and it is clearer to readers of your code that you really do intend to change the column type; e.g., DT[, colA:=as.integer(colA)] . A plonk occurs whenever you provide a RHS value to `:=` which is nrow long. When a column is plonked , the original column is not updated by reference because that would entail updating every single element of that column whereas the plonk is just one column pointer update.
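The two cases can be sketched as follows, assuming a made-up table with a double column `colA` and an integer column `colB`:

```r
library(data.table)

DT <- data.table(colA = c(1, 2, 3), colB = 1:3)

# Sub-assignment: the RHS (a double) is coerced to the column's type (integer)
DT[2L, colB := 10]
class(DT$colB)

# "Plonk": a full-length RHS of a new type replaces the column and changes its type
DT[, colA := as.integer(colA)]
class(DT$colA)
```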

data.tables are not copied-on-change by :=, setkey or any of the other set* functions. See copy.

Advanced (internals):

It is easy to see how sub-assigning to existing columns is done internally. Removing columns by reference is also straightforward by modifying the vector of column pointers only (using memmove in C). However adding (new) columns is more tricky as to how the data.table can be grown by reference : the list vector of column pointers is over-allocated , see truelength . By defining := in j we believe update syntax is natural, and scales, but it also bypasses [<- dispatch and allows := to update by reference with no copies of any part of memory at all.

Since [.data.table incurs overhead to check the existence and type of arguments (for example), set() provides direct (but less flexible) assignment by reference with low overhead, appropriate for use inside a for loop. See examples. := is more powerful and flexible than set() because := is intended to be combined with i and by in single queries on large datasets.
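A minimal sketch of set() inside a for loop, here replacing missing values with 0 column by column (the table is made up for illustration):

```r
library(data.table)

DT <- data.table(x = c(1, NA, 3), y = c(NA, 5, 6))

# i must be integer row numbers; j can be a column number or name
for (j in seq_along(DT)) {
  set(DT, i = which(is.na(DT[[j]])), j = j, value = 0)
}
```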

DT[a > 4, b := c] is different from DT[a > 4][, b := c]. The first expression updates (or adds) column b with the value c on those rows where a > 4 evaluates to TRUE. DT is updated by reference, therefore no assignment is needed. Note that this does not apply when `i` is missing, i.e. DT[].

The second expression on the other hand updates a new data.table that's returned by the subset operation. Since the subsetted data.table is ephemeral (it is not assigned to a symbol), the result would be lost; unless the result is assigned, for example, as follows: ans <- DT[a > 4][, b := c] .
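The difference can be checked on a small made-up table (the column `b2` and the object `ans` are illustrative names):

```r
library(data.table)

DT <- data.table(a = 1:6, c = 11:16)

# Updates DT itself: b is filled where a > 4 and NA elsewhere
DT[a > 4, b := c]

# Updates the temporary subset only; DT never sees b2
ans <- DT[a > 4][, b2 := c]
```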

DT is modified by reference and returned invisibly. If you require a copy, take a copy first (using DT2 = copy(DT) ).

data.table , copy , setalloccol , truelength , set , .Last.updated

andrew brooks

Advanced tips and tricks with data.table

This is mostly a running list of data.table tricks that took me a while to figure out either by digging into the official documentation , adapting StackOverflow posts, or more often than not, experimenting for hours. I’d like to persist these discoveries somewhere with more memory than my head (hello internet) so I can reuse them after my mental memory forgets them. A less organized and concise addition to DataCamp’s sweet cheat sheet for the basics .

Most, if not all of these techniques were developed for real data science projects and provided some value to my data engineering. I’ve generalized everything to the mtcars dataset which might not make this value immediately clear in this slightly contrived context. This list is not intended to be comprehensive as DataCamp’s data.table cheatsheet is. OK, enough disclaimers!

Some more advanced functionality from data.table creator Matt Dowle here .

1. DATA STRUCTURES & ASSIGNMENT

Columns of lists, summary table (long and narrow).

This could be useful, but is easily achievable using traditional methods.

summary table (short and narrow)

Add all categories of gear for each cyl to original data.table as a list.

This is more nifty. It’s so simple, I find myself using this trick to quickly explore data ad hoc at the command line. Can also be useful for more serious data engineering.

Update 10/29/2015: Per these comments on StackOverlow referencing my post, t[,gearsL:=list(list(unique(gear))), by=cyl] can be more elegantly written as t[,gearsL:=.(list(unique(gear))), by=cyl] . Thanks for pointing out my unnecessarily verbose and unusual syntax! I think I wrote the first thing that worked when I posted this, not realizing the normal .( syntax was equivalent to the outer list.
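The post's original code blocks are not shown here, so the following is only a sketch of the three list-column variants described in this section, using mtcars:

```r
library(data.table)

dt <- as.data.table(mtcars)[, .(cyl, gear)]

# Summary table, long and narrow: one row per (cyl, unique gear)
dt[, unique(gear), by = cyl]

# Summary table, short and narrow: one row per cyl, gears held in a list column
dt[, .(gears = list(unique(gear))), by = cyl]

# Add the list of gears for each cyl to the original table
dt[, gearsL := .(list(unique(gear))), by = cyl]
```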

Extract the second element of each list in gearsL and create column gearL1. This isn’t that groundbreaking, but explores how to access elements of columns which are constructed of lists of lists. lapply is your friend.

Update 9/24/2015: Per Matt Dowle’s comments, a slightly more syntactically succinct way of doing this:

Calculate all the gears for all cars of each cyl (excluding the current row). This can be useful for comparing observations to the mean of groups, where the group mean is not biased by the observation of interest.

Update 9/24/2015: Per Matt Dowle’s comments, this achieves the same as above.

This is actually a base R trick that I didn’t discover until working with data.table. See ?`{` for some documentation and examples. I’ve only used it within the J slot of data.table, it might be more generalizable. I find it pretty useful for generating columns on the fly when I need to perform some multi-step vectorized operation. It can clean up code by allowing you to reference the same temporary variable by a concise name rather than rewriting the code to re-compute it.

Defaults to just returning the last object defined in the braces unnamed.

We can be more explicit by passing a named list of what we want to keep.

Can also write it like this without semicolons.
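A sketch of the three variants just described, with intermediate names (`tmp1`, `tmp2`, `mad`) chosen for illustration:

```r
library(data.table)

dt <- as.data.table(mtcars)

# The value of the last expression in the braces is returned (as V1)
dt[, {tmp1 = mean(mpg); tmp2 = mean(abs(mpg - tmp1)); round(tmp2, 2)}, by = cyl]

# Return a named list to keep several results, written without semicolons
res <- dt[, {
  tmp1 <- mean(mpg)
  tmp2 <- mean(abs(mpg - tmp1))
  list(mad = tmp2, mad_rounded = round(tmp2, 2))
}, by = cyl]
```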

This is trickier with := assignments… I don’t think := is intended to work when wrapped in { . Assigning multiple columns with := at once does not allow you to use the first columns you create when building the ones after them, as we did with = inside the { above. Chaining and then dropping unwanted variables is a messy workaround… still exploring this one.

I still haven’t worked much with the loop + set framework. I’ve been able to achieve pretty much everything with := which is more flexible and powerful. However, if you must loop, set is orders of magnitude faster than native R assignments within loops. Here’s a snippet from data.table news a while back:

data.table creators do favor set for some things , like this task which can also be done w/ lapply and .SD . I was actually directed to this solution after I posed this question on StackOverflow. I was also pleased to learn that the functionality I was looking for – applying a function to a subset of columns with .SDcols while preserving the untouched columns – was added as a feature request.

Note this feature is only available in version 1.9.5 (currently on Github, not CRAN). Base R surprisingly does not have great tools for dealing with leads/lags of vectors that most social science statistical software (Stata, SAS, even FAME which I used in my formative data years) come equipped with out of the box.

shift with by
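A short sketch of shift for lags and leads, with and without by (the new column names are illustrative):

```r
library(data.table)

dt <- as.data.table(mtcars)[, .(mpg, cyl)]

dt[, mpg_lag1  := shift(mpg, n = 1L)]                 # lag over the whole table
dt[, mpg_lead1 := shift(mpg, n = 1L, type = "lead")]  # lead over the whole table
dt[, mpg_lag1_by_cyl := shift(mpg, n = 1L), by = cyl] # lag restarted within each cyl
```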

This is useful, but note that the columns operated on must be atomic vectors or lists. That is, they must exist before running the computation. Columns that reference other columns created in the same statement need to be built individually or chained.

This is the advised way to assign a new column whose name you already have determined and saved as a character. Simply surround the character object in parentheses.
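For example, with a hypothetical name stored in `new_name`:

```r
library(data.table)

dt <- as.data.table(mtcars)[, .(mpg, cyl)]
new_name <- "mpg_sq"

# Parentheses make data.table evaluate new_name instead of using it literally
dt[, (new_name) := mpg^2]
```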

This is the old (now deprecated) way, which still works for now. Not advised.

Calculate a function over a group (using by ) excluding each entity in a second category.

This title probably doesn’t immediately make much sense. Let me explain what I’m going to calculate and why with an example. We want to compare the mpg of each car to the average mpg of cars in the same class (the same # of cylinders). However, we don’t want to bias the group mean by including the car we want to compare to the average in that average.

This assumption doesn’t appear useful in this example, but assume that gear + cyl uniquely identify the cars. In the real project where I faced this problem, I was calculating an indicator related to an appraiser relative to the average of all other appraisers in their zip3. ( cyl was really zipcode and gear was the appraiser’s ID).

0.a Biased mean: simple mean by cyl

However we want to know for each row, what is the mean among all the other cars with the same # of cyl s, excluding that car.

1.a .GRP without setting key

Update 9/24/2015: Per Matt Dowle’s comments, this also works with slightly less code. For my simple example, there was also a marginal speed gain. Time savings relative to the .GRP method will likely increase with the complexity of the problem.

1.b Same as 1.a, but a little faster

Why does this work, 1.b setting key.

{} is used for to suppress intermediate operations.

Building up

No surprises here.

Nested data.tables and by statements

This chunk shows what happens with two by statements nested within two different data.tables. Explanatory purposes only - not necessary for our task. n counts the # of cars in that cyl . N counts the number of cars by cyl and gear .

Calculating “unbiased mean”

This is in a summary table. This would need to be merged back onto dt if that is desired.

Non-function direct way

Using a vectorized approach to calculate the unbiased mean for each combination of gear and cyl . Mechanically, it calculates the “biased average” for all cars by cyl . Then subtract off the share of cars with the combination of gear and cyl that we want to exclude from the average and add that share. Then extrapolate out this pared down mean.
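The post's own code is not shown here; one way to sketch the same vectorized idea is to remove each (cyl, gear) cell's sum and count from the cyl totals. The helper column names below are assumptions made for illustration:

```r
library(data.table)

dt <- as.data.table(mtcars)[, .(cyl, gear, mpg)]

# Totals per cyl, and per (cyl, gear) cell
dt[, `:=`(n_cyl = .N, sum_cyl = sum(mpg)), by = cyl]
dt[, `:=`(n_cg = .N, sum_cg = sum(mpg)), by = .(cyl, gear)]

# Mean mpg of all cars with the same cyl but a different gear
dt[, unbiased_mean := (sum_cyl - sum_cg) / (n_cyl - n_cg)]
```

If a cyl group contained only one gear, the denominator would be zero and the result NaN; that case does not occur in mtcars.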

Wrapping up code below into a function

Method 3 is roughly 100x faster than the other two. Great for this narrow task with the vectorization built in, but less generalizable; The other two methods allow any function to be passed.

Without keyby

Categories are not sorted
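A quick comparison of by and keyby on mtcars:

```r
library(data.table)

dt <- as.data.table(mtcars)

# by: groups appear in order of first appearance, result has no key
unsorted <- dt[, .(avg_mpg = mean(mpg)), by = cyl]

# keyby: result is sorted by cyl and keyed on it
sorted <- dt[, .(avg_mpg = mean(mpg)), keyby = cyl]
key(sorted)
```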

take highest value of column A when column B is highest by group

Max of qsec for each category of cyl (this is easy)

value of qsec when mpg is the highest per category of cyl

(this is trickier)

value of qsec when mpg is the lowest per category of cyl

Value of qsec when mpg is the median per category of cyl, subset rows within by statement.
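A sketch of the first three lookups, assuming the table is sorted by mpg with setkey so that `[1]` and `[.N]` pick the lowest and highest mpg within each group:

```r
library(data.table)

dt <- as.data.table(mtcars)[, .(cyl, mpg, qsec)]

# Max of qsec for each cyl
dt[, .(max_qsec = max(qsec)), by = cyl]

# Sort by mpg, then take the last / first row of each group
setkey(dt, mpg)
at_max <- dt[, .(qsec_at_max_mpg = qsec[.N]), by = cyl]
at_min <- dt[, .(qsec_at_min_mpg = qsec[1]), by = cyl]
```

When several cars in a group tie on mpg, which of them is picked depends on their order after sorting.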

V1 is the standard deviation of mpg by cyl . V2 is the standard deviation of mpg for just the first half of mpg .

3. FUNCTIONS

Passing data.table column names as function arguments, method 1: no quotes, and deparse + substitute.

This way seems more data.table-ish because it maintains the practice of not using quotes on variable names in most cases.

However I tend to pass through column names as characters (quoted) and use get each time I reference that column. That can be annoying if you have a long function repeatedly reference column names, but I often need to write such few lines of code with data.table, it hasn’t struck me as terribly unslick, yet.
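Both methods can be sketched as below; the function names `mean_unquoted` and `mean_quoted` are made up for this example:

```r
library(data.table)

dt <- as.data.table(mtcars)

# Method 1: unquoted names, captured with deparse(substitute())
mean_unquoted <- function(d, value_col, group_col) {
  value_col <- deparse(substitute(value_col))
  group_col <- deparse(substitute(group_col))
  d[, .(avg = mean(get(value_col))), by = group_col]
}

# Method 2: quoted names, looked up with get()
mean_quoted <- function(d, value_col, group_col) {
  d[, .(avg = mean(get(value_col))), by = group_col]
}

res1 <- mean_unquoted(dt, mpg, cyl)
res2 <- mean_quoted(dt, "mpg", "cyl")
```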

Beware of scoping within data.table

When you add something to a data.frame within a function that exists in the global environment, it does not affect that object in the global environment unless you return and reassign it as such, or you use the <<- operator.

When we call the function, we see addcol1 in the output. But not addcol2 . That’s because it’s been added to the df in the global environment one level up.

Here is addcol2, but not addcol1.

Unlike data.frame, the := operator adds a column to both the object living in the global environment and used in the function. I think this is because these objects are actually the same object. data.table shaves computation time by not making copies unless explicitly directed to.

So something like this renaming the local version using copy bypasses this behavior, but is likely somewhat less efficient (and elegant). I suspect there’s a cleaner and/or faster way to do this: keep some variables local to the function while persisting and returning other columns.
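A minimal sketch of this behaviour, with illustrative function names:

```r
library(data.table)

dt <- data.table(a = 1:3)

# := inside the function modifies the caller's table as well
add_by_ref <- function(d) {
  d[, addcol1 := a * 2L]
  invisible(NULL)
}
add_by_ref(dt)
names(dt)

# Working on an explicit copy keeps the new column local to the result
add_on_copy <- function(d) {
  d <- copy(d)
  d[, addcol2 := a * 3L]
  d
}
res <- add_on_copy(dt)
```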

4. PRINTING

Nothing groundbreaking here, but a small miscellaneous piece of functionality. In data.frame world, wrapping an expression in () prints the output to the console. This also works with data.table, but there is another way. In data.table this is achieved by appending [] to the end of the expression. I find this useful because when I’m exploring at the console, I don’t usually decide to print the output until I’m almost done and I’m already at the end of the expression I’ve written.

It used to be that assignments using the := operator printed the object to console when knitting documents with knitr and rmarkdown. This is actually fixed in data.table v1.9.5. However, at the time of my writing, this is not yet available on CRAN… only Github. For 1.9.4 users, this StackOverflow post has some hacky solutions. The least-impedance approach I found was simply wrapping the expression in invisible. Other solutions alter the way you use data.table, which I didn’t like.


Applying Function to Multiple Columns of data.table in R (4 Examples)

In this R tutorial you’ll learn how to handle lapply with data.table objects.

Setting up the Examples

Start by installing and loading the data.table package. We also have an overview post of data.table here . You can find the GitLab repository of data.table here .

install.packages("data.table")    # Install & load data.table
library("data.table")

Take the iris data as an example dataset.

data(iris)    # Loading iris data set
head(iris)    # Printing head of data
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1          5.1         3.5          1.4         0.2  setosa
# 2          4.9         3.0          1.4         0.2  setosa
# 3          4.7         3.2          1.3         0.2  setosa
# 4          4.6         3.1          1.5         0.2  setosa
# 5          5.0         3.6          1.4         0.2  setosa
# 6          5.4         3.9          1.7         0.4  setosa

Copy the data and put it in the data.table format.

iris_DT <- data.table(data.table::copy(iris))    # Copying data as data.table

Example 1: Calculating the Sum Values of Multiple Variables

In this example, we want to use lapply to apply the sum() function to multiple data.table columns. We do this by use of .SD and .SDcols . .SD serves as a placeholder for each of the columns which we put in .SDcols .

iris_DT[ , lapply(.SD, sum), .SDcols = c("Sepal.Length", "Petal.Length")]    # Calculating sum values
#    Sepal.Length Petal.Length
# 1:        876.5        563.7

The previous line returns the sum of Sepal.Length and Petal.Length .

Example 2: Calculating the Sum Values of Multiple Variables by Groups

Now, we go one step further by calculating the sum of both variables for each category of column Species . For that, we simply add the “ , by = ” argument to the previous code as follows.

iris_DT[ , lapply(.SD, sum), by = .(Species), .SDcols = c("Sepal.Length", "Petal.Length")]    # Calculating group sums
#       Species Sepal.Length Petal.Length
# 1:     setosa        250.3         73.1
# 2: versicolor        296.8        213.0
# 3:  virginica        329.4        277.6

Example 3: Using Self-Defined Functions

Often, we additionally want to use further function arguments or use self-defined functions. We can easily add a self-defined function as follows.

iris_DT[ , lapply(.SD, function(x) { sum(sqrt(x) / 2) }), .SDcols = c("Sepal.Length", "Petal.Length")]    # Applying arbitrary function
#    Sepal.Length Petal.Length
# 1:     180.8488     140.5313

We calculated for each variable x the sum of half of the square root of its values.

Example 4: Defining New Columns

Furthermore, we can use lapply together with the definition of new columns, as shown in the following example.

iris_DT <- iris_DT[ , c("Sepal.Length_new", "Petal.Length_new") := lapply(.SD, function(x) { 4*x + 2 }), .SDcols = c("Sepal.Length", "Petal.Length")]    # Defining new variables
head(iris_DT)    # Printing data head
#    Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length_new
# 1:          5.1         3.5          1.4         0.2  setosa             22.4
# 2:          4.9         3.0          1.4         0.2  setosa             21.6
# 3:          4.7         3.2          1.3         0.2  setosa             20.8
# 4:          4.6         3.1          1.5         0.2  setosa             20.4
# 5:          5.0         3.6          1.4         0.2  setosa             22.0
# 6:          5.4         3.9          1.7         0.4  setosa             23.6
#    Petal.Length_new
# 1:              7.6
# 2:              7.6
# 3:              7.2
# 4:              8.0
# 5:              7.6
# 6:              8.8

For the new columns, we need to specify the column names. The “:=” operator indicates that the new columns are defined by the function on the right-hand side.


Note: This article was created in collaboration with Anna-Lena Wölwer. Anna-Lena is a researcher and programmer who creates tutorials on statistical methodology as well as on the R programming language. You may find more info about Anna-Lena and her other articles on her profile page .


data.table in R – The Complete Beginners Guide

  • November 13, 2019
  • Selva Prabhakaran

data.table is a package used for working with tabular data in R. It provides the efficient data.table object, which is a much improved version of the default data.frame. It is super fast and has intuitive and terse syntax. If you know the R language and haven’t picked up the `data.table` package yet, then this tutorial guide is a great place to start.

  • Introduction: Why data.table?
  • How to install data.table package
  • Importing Data
  • How to convert data.frame to data.table
  • How to convert data.table to data.frame
  • Filtering rows based on conditions
  • How to select given columns
  • How to select multiple columns using a character vector
  • How to drop columns
  • How to rename columns
  • Creating a new column from existing columns
  • How to create new columns using character vector
  • What does .N and .I do
  • What is .SD and How to write functions inside data.table
  • How to join two or more datatables
  • How to merge multiple data.tables in one shot
  • Pivot Table operations
  • set() – A magic function for fast assignment operations

1. Introduction: Why data.table?

The data.table is an alternative to R’s default data.frame to handle tabular data. The reason it’s so popular is because of the speed of execution on larger data and the terse syntax.

So, effectively you type less code and get much faster speed. It is one of the most downloaded packages in R and is preferred by Data Scientists. It is probably one of the best things that have happened to R programming language as far as speed is concerned.

Though data.table provides a slightly different syntax from the regular R data.frame, it is quite intuitive. So once you get it, it feels so obvious and natural that you wouldn’t want to go back to the base R data.frame syntax.

By the end of this guide you will understand the fundamental syntax of data.table and the structure behind it. You will also know all the core data manipulation functions of data.table, in what scenarios they are used and how to use them, along with some advanced tricks and tips. data.table is authored by Matt Dowle with significant contributions from Arun Srinivasan and many others.

Related Post: 101 R data.table Exercises

2. How to install data.table package

Installing the data.table package is no different from other R packages. It’s recommended to run install.packages() to get the latest version from the CRAN repository. However, if you want to use the latest development version, you can get it from GitHub as well.

If you want to revert back to the CRAN version do:

3. Importing Data

The way you work with data.tables is quite different from how you’d work with data.frames. Let’s understand these differences first while you gain mastery over this fantastic package.

The fread() function, short for fast read, is data.table’s version of read.csv(). Like read.csv(), it works for a file on your local computer as well as a file hosted on the internet. Plus, it is at least 20 times faster in the benchmark shown later in this section. Let’s import the mtcars dataset stored as a csv file.

The imported data is stored directly as a data.table . As you see from the above output, the data.table inherits from a data.frame class and therefore is a data.frame by itself.

So, functions that accept a data.frame will work just fine on a data.table as well. Because the dataset we imported was small, the speed of read.csv() would have been good enough.

However, the speed gain becomes evident when you import a large dataset (millions of rows). To get a flavor of how fast fread() is, run the below code. It creates a 1M rows csv file. Then reads it back again. The time taken by fread() and read.csv() functions gets printed in console.
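The original benchmark code is not shown here, so the following is a hedged sketch of such a comparison. It uses 100,000 rows rather than 1M to keep the run short, and the column names and temporary file are arbitrary choices:

```r
library(data.table)

set.seed(100)
n <- 1e5
df <- data.frame(x = runif(n), y = runif(n),
                 z = sample(letters, n, replace = TRUE))

# Write a csv to a temporary file, then time both readers on it
csv_path <- tempfile(fileext = ".csv")
fwrite(df, csv_path)

t_base  <- system.time(base_df  <- read.csv(csv_path))
t_fread <- system.time(fread_dt <- fread(csv_path))

print(t_base["elapsed"])
print(t_fread["elapsed"])

unlink(csv_path)
```

Actual timings depend on the machine, the file size and the number of threads fread is allowed to use.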


That’s about 20x faster. The time difference gets wider when the filesize increases.

4. How to convert data.frame to data.table

You can convert any `data.frame` into `data.table` using one of the approaches:

  • data.table(df) or as.data.table(df)
  • setDT(df)

The difference between the two approaches is: data.table(df) function will create a copy of df and convert it to a data.table . Whereas, setDT(df) converts it to a data.table inplace.

That means, the df itself gets converted to a data.table and you don’t have to assign it to a different object.

As a result, there is no copy made and no duplication of the same data. Let’s reload the mtcars dataframe from R’s default datasets package.

Important : The data.table() does not have any rownames. So if the data.frame has any rownames, you need to store it as a separate column before converting to data.table.

Alternately, use setDT() to convert it to data.table in place.
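A sketch of both conversions on a small made-up data.frame with row names (the keep.rownames argument is used here instead of adding the column by hand):

```r
library(data.table)

df <- data.frame(value = c(2.5, 3.5, 4.5), row.names = c("a", "b", "c"))

# Copying conversion; row names are kept in a column called "name"
dt_copy <- as.data.table(df, keep.rownames = "name")

# In-place conversion; keep.rownames = TRUE stores row names in column "rn"
setDT(df, keep.rownames = TRUE)

# And back to a plain data.frame
back <- as.data.frame(dt_copy)
```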

5. How to convert data.table to data.frame

Conversely, use as.data.frame(dt) or setDF(dt) to convert a data.table to a data.frame .

6. Filtering rows based on conditions

The main difference with data.frame is: data.table is aware of its column names. So while filtering, passing only the columns names inside the square brackets is sufficient.
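For instance, the same filter written both ways on mtcars:

```r
library(data.table)

dt <- as.data.table(mtcars)

# data.frame: columns must be qualified with the table name
base_rows <- mtcars[mtcars$cyl == 6 & mtcars$gear == 4, ]

# data.table: column names can be used directly inside [ ]
dt_rows <- dt[cyl == 6 & gear == 4]
```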


7. How to select given columns

Now, let’s see how to subset columns. The most unexpected thing you will notice with data.table is that, in older versions, you can’t select a column by its numbered position. For example, you can expect the following to work in a data.frame.

In data.table versions before 1.9.8 this would just return ‘1’; newer versions return the first column as a one-column data.table.

If you want to get that column by position alone in those older versions, add the argument with=FALSE (it also works in current versions).

The returned output is a 1-column data.table. An alternate way and a better practice is to pass in the actual column name.

Notice here that the ‘mpg’ is not a string as it’s not written within quotes.
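A short sketch of these selections on mtcars:

```r
library(data.table)

dt <- as.data.table(mtcars)

v  <- dt[, mpg]              # bare column name: returns a plain vector
d1 <- dt[, .(mpg)]           # wrapped in .(): returns a 1-column data.table
d2 <- dt[, 1, with = FALSE]  # by position
```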

8. How to select multiple columns using a character vector

What if the column name is present as a string in another variable (vector)? In that case, you can’t use mpg directly. You need to additionally pass with=FALSE .

The same principle applies if you have multiple columns to be selected.

To make the above command work, you need to pass with=FALSE inside the square brackets

Clear? If you want to select multiple columns directly, then enclose all the required column names within list.
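The options can be sketched as below; the `..` prefix is an additional alternative available in more recent data.table versions and is not mentioned in the text above:

```r
library(data.table)

dt <- as.data.table(mtcars)
cols <- c("mpg", "cyl", "gear")

a <- dt[, cols, with = FALSE]   # character vector plus with = FALSE
b <- dt[, ..cols]               # '..' looks cols up outside the data.table
d <- dt[, .(mpg, cyl, gear)]    # names written directly inside .() / list()
```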

9. How to drop columns

How to drop the mpg , cyl and gear columns alone? Place them in a vector and use the ! in front to drop them. This effectively returns all columns except those present in the vector.
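For example:

```r
library(data.table)

dt <- as.data.table(mtcars)
drop_cols <- c("mpg", "cyl", "gear")

# Returns a new data.table without the listed columns; dt is unchanged
kept <- dt[, !drop_cols, with = FALSE]
```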

10. How to rename columns

The setnames() function is used for renaming columns. It takes the data.table (or data.frame), current name and new name as arguments and changes the column names in place without any copying of data. Always recommended!
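A minimal example, renaming the `vs` column to the made-up name `engine_type`:

```r
library(data.table)

dt <- as.data.table(mtcars)

# Renames in place; no assignment needed
setnames(dt, old = "vs", new = "engine_type")
```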

11. Creating a new column from existing columns

You can always create a new column as you do with a data.frame, but, data.table lets you create column from within square brackets. This saves key strokes.

To create multiple new columns at once, use the special assignment symbol as a function.

To select only specific columns, use the list or dot symbol instead.

Now let’s see a special but frequently used case. Let’s suppose you have the column names in a character vector and want to select those columns alone from the data.table. Passing it inside the square brackets doesn’t work.

12. How to create new columns using character vector

Suppose you want to create a new column but you have the name of that new column in another character vector. How to create the new column without using the actual column name?

For example, you have the new column name in the myvar vector. And, you want to assign some value, say the value of 1 to this column. Doing this will create a new column named ‘myvar’. And not var1 as intended.

To create a column named ‘var1’ instead, keep myvar inside a vector.

Finally, if you want to delete columns, assign NULL to them.
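The three steps of this section can be sketched as follows:

```r
library(data.table)

dt <- as.data.table(mtcars)[, .(mpg, cyl)]
myvar <- "var1"

dt[, myvar := 1]       # creates a column literally named "myvar"
dt[, c(myvar) := 2]    # creates a column named "var1"
names_after_add <- copy(names(dt))

# Delete both columns by assigning NULL
dt[, c("myvar", "var1") := NULL]
```

`copy()` is used when saving the names because `names(dt)` can otherwise change along with the table when columns are modified by reference.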

13. Grouping

Now, let’s move on to the second major and awesome feature of R data.table: grouping using by . In base R, grouping is accomplished using the aggregate() function.

It’s a bit cumbersome and hard to remember the syntax.

All the functionalities can be accomplished easily using the ‘by’ argument within square brackets. For example, in mtcars data, how to get the mean mileage for each cylinder type? Answer: Since you want to see the mileage by cyl column, set by = 'cyl' inside the square brackets.

That’s really useful, isn’t it? You can even add multiple columns to the ‘by’ argument.
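For example, with a result column name (`mean_mileage`) chosen for illustration:

```r
library(data.table)

dt <- as.data.table(mtcars)

# Mean mileage for each cylinder type
by_cyl <- dt[, .(mean_mileage = mean(mpg)), by = "cyl"]

# Grouping by more than one column
by_cyl_gear <- dt[, .(mean_mileage = mean(mpg)), by = .(cyl, gear)]
```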

14. A slightly complex groupby problem

Now, let’s see some really special cases.

How do you select the first occurring value of mpg for each unique cyl value? That is, instead of taking the mean of mileage for every cylinder, you want to select the first occurring value of mileage. How to do that?

What to do if you want the second value? Just replace the 1 with 2.

And what if you want the last value? You can either use length(mpg) or .N :
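All three lookups in one sketch:

```r
library(data.table)

dt <- as.data.table(mtcars)

res <- dt[, .(first_mpg  = mpg[1],
              second_mpg = mpg[2],
              last_mpg   = mpg[.N]),   # same as mpg[length(mpg)]
          by = cyl]
```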

15. What does .N and .I do

.N contains the number of rows present. So the following will get the number of rows for each unique value of cyl .

Now, how to create row numbers of items? It can be done using the .I variable, short for ‘index’ (I guess). Let’s first understand what .I returns.

It returns all the row numbers. Now, how to return the row numbers where cyl=6 ? This can get confusing in the beginning so pay close attention. If you want to get the row numbers of items that satisfy a given condition, you might tend to write like this:

But this returns the wrong answer because, `data.table` has already filtered the rows that contain cyl value of 6. So, what you want to do instead is to write that condition to subset .I alone instead of the whole `data.table`.

The result is same as using the `which()` function, which we used in `data.frames`.
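A small sketch of the two approaches contrasted above. Only the recommended form is compared against which() here; the exact output of the tempting form is left unstated because it depends on how i and .I interact in your data.table version.

```r
library(data.table)

mtcars_dt <- as.data.table(mtcars)

# .I on its own gives every row number
all_rows <- mtcars_dt[, .I]

# Tempting form: filter in i, then ask for .I.
# The tutorial warns this is not the row numbers in the full table.
tempting <- mtcars_dt[cyl == 6, .I]

# Recommended form: subset .I itself inside j
rows_cyl6 <- mtcars_dt[, .I[cyl == 6]]

# Same answer as base R's which()
base_rows <- which(mtcars$cyl == 6)
```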

16. Chaining

data.table offers unique features that make it even more powerful and truly a Swiss army knife for data manipulation. First, let’s understand what chaining is.

Using chaining, you can do multiple data.table operations one after the other without having to store intermediate results. For example, instead of writing two statements you can do it in one. The code below sorts after grouping by cyl:

With chaining, that is, by attaching the square brackets at the end, it’s done in one step.

Actually, chaining is available in dataframes as well, but with features like by , it becomes convenient to use on a data.table.
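To make the comparison concrete, here is a sketch of the two-statement version next to the chained version. The intermediate name dt1 and the column name mean_mpg are assumptions for illustration.

```r
library(data.table)

mtcars_dt <- as.data.table(mtcars)

# Two statements: group first, then sort the stored result
dt1 <- mtcars_dt[, .(mean_mpg = mean(mpg)), by = cyl]
two_step <- dt1[order(cyl)]

# Chained: attach a second pair of square brackets to the first result
chained <- mtcars_dt[, .(mean_mpg = mean(mpg)), by = cyl][order(cyl)]
```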

17. What is .SD and How to write functions inside data.table

Next, let’s see how to write functions within the data.table square brackets. Let’s suppose you want to compute the mean of all the variables, grouped by ‘cyl’.

How to do that?

You can create the columns one by one by writing them out by hand. Or, you can use the `lapply()` function to do it all in one go. But `lapply()` takes the data.frame as the first argument. Then, how to use `lapply()` inside a data.table? You can use the .SD object as the first argument for `lapply()`. But, what is the .SD object?

It is nothing but a data.table that contains all the columns of the original datatable except the column specified in ‘by’ argument. So, here is what it looks like.

So, now you can pass this as the first argument in `lapply()`. The 11th column in `.SD` is rownames, so let’s include only the first 10.

Optionally, instead of subsetting .SD like this, you can specify the columns that should be part of .SD using the .SDcols argument.

The output now contains only the specified columns.
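The two variants above can be sketched like this. It assumes, as the text implies, that the car names were added as a last column called carname before converting to a data.table; the three columns passed to .SDcols are an arbitrary choice.

```r
library(data.table)

# Rebuild the tutorial's table: mtcars plus a trailing carname column
df <- mtcars
df$carname <- rownames(df)
mtcars_dt <- as.data.table(df)

# .SD holds every column except the 'by' column. Its 11th column is the
# character carname, so keep only the first 10 before taking means.
means_all <- mtcars_dt[, lapply(.SD[, 1:10, with = FALSE], mean), by = cyl]

# Alternative: declare up front which columns .SD should contain
means_some <- mtcars_dt[, lapply(.SD, mean), by = cyl,
                        .SDcols = c("mpg", "disp", "hp")]
```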

18. Keys

Now, we have come to the ‘key’ concept for data.tables: Keys

Let’s understand why keys can be useful and how to set them. Setting one or more keys on a data.table enables it to perform binary search, which is many orders of magnitude faster than linear search, especially for large data.

As a result, the filtering operations are super fast after setting the keys. There is a side effect though. By setting a key, the `data.table` gets sorted by that key. So how to set a key? Just use the setkey function.

It’s so fast that it looks like nothing happened. But internally, the data.table was sorted with ‘carname’ as the key.

If you notice, this table is sorted by ‘carname’ variable. To check the keys for a data table, you can use the key() function.
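A short sketch of setting and inspecting a key, assuming the same mtcars_dt with a carname column as before. The lookup of "Merc 230" is just an example value.

```r
library(data.table)

df <- mtcars
df$carname <- rownames(df)
mtcars_dt <- as.data.table(df)

# Set the key: sorts the table by carname, by reference
setkey(mtcars_dt, carname)

# Inspect the current key
current_key <- key(mtcars_dt)

# With a key set, a character value in i is looked up by binary search
merc <- mtcars_dt["Merc 230"]
```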

Once the key is set, merging data.tables is very direct. I have distributed a few columns of mtcars across the following data.tables.

You can join these two datatables:

This returns the rows of dt1 that match dt2, based on the keys of these data.tables. You can also set multiple keys if you wish.

Now, how to remove the keys? Use setkey() and set the key to NULL. But the data.table will not go back to its original row arrangement.
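As an illustration of the keyed join and key removal above, here is a sketch. The exact columns placed in dt1 and dt2 are assumptions, since the original split of mtcars is not shown.

```r
library(data.table)

df <- mtcars
df$carname <- rownames(df)
mtcars_dt <- as.data.table(df)

# Hypothetical split of mtcars columns across two tables
dt1 <- mtcars_dt[, .(carname, mpg, cyl)]
dt2 <- mtcars_dt[1:10, .(carname, gear)]

# Key both tables on the shared column
setkey(dt1, carname)
setkey(dt2, carname)

# For each row of dt2, look up the matching rows of dt1 by key
joined <- dt1[dt2]

# Remove the key; the rows stay in their sorted order
setkey(dt1, NULL)
```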

Another aspect of setting keys is the ‘keyby’ argument. Using keyby you can do grouping and set the by column as a key in one go. For instance, in the chaining example we saw earlier, you can skip the chaining by using keyby instead of just by.

As a result, the output has the key as cyl.
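A sketch of keyby replacing the group-then-sort chain; the column name mean_mpg is an assumption.

```r
library(data.table)

mtcars_dt <- as.data.table(mtcars)

# Group by cyl, and key (and therefore sort) the result by cyl in one step
keyed <- mtcars_dt[, .(mean_mpg = mean(mpg)), keyby = cyl]

# Same rows as grouping with 'by' and then chaining [order(cyl)]
chained <- mtcars_dt[, .(mean_mpg = mean(mpg)), by = cyl][order(cyl)]
```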

19. How to join two or more datatables

The data.table package provides a faster implementation of the merge() function. The syntax is pretty much the same as base R’s merge() .
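For example, with two small tables carved out of mtcars (the column split is hypothetical), an inner and a left join might be written as:

```r
library(data.table)

df <- mtcars
df$carname <- rownames(df)
mtcars_dt <- as.data.table(df)

dt1 <- mtcars_dt[, .(carname, mpg, cyl)]
dt2 <- mtcars_dt[1:10, .(carname, gear)]

# Inner join: only car names present in both tables
inner <- merge(dt1, dt2, by = "carname")

# Left join: keep every row of dt1, fill gear with NA where unmatched
left <- merge(dt1, dt2, by = "carname", all.x = TRUE)
```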

20. How to merge multiple data.tables in one shot

This is a bit of a hack: it uses the Reduce() function to repeatedly merge multiple data.tables stored in a list. Reduce() takes a function that is applied consecutively (merge_func in this case) and a list that stores the arguments for that function.
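One possible shape for this, as a sketch: merge_func is the name used in the text, while its body and the three example tables are assumptions.

```r
library(data.table)

df <- mtcars
df$carname <- rownames(df)
mtcars_dt <- as.data.table(df)

# Three hypothetical tables sharing the carname column
dt1 <- mtcars_dt[, .(carname, mpg, cyl)]
dt2 <- mtcars_dt[, .(carname, gear)]
dt3 <- mtcars_dt[, .(carname, hp)]

# Function that merges two tables on carname
merge_func <- function(x, y) merge(x, y, by = "carname")

# Reduce() applies merge_func pairwise down the list:
# merge_func(merge_func(dt1, dt2), dt3)
merged_all <- Reduce(merge_func, list(dt1, dt2, dt3))
```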

21. Pivot Table operations

dcast.data.table() is the function used for pivot-table-like operations, as seen in spreadsheet software such as Microsoft Excel or Google Sheets.

Depending on the data.table version, a plain data.frame may need to be converted first with as.data.table() or setDT(). Let’s create a pivot table showing the mean mileage (mpg) for Cylinders (cyl) vs Carburetor (carb).

So how to understand the syntax? There are 4 primary arguments:

  • data: the data.table to reshape
  • formula: Rows of the pivot table on the left of ‘~’ and columns of the pivot on the right
  • value.var: column whose values should be used to fill up the pivot table
  • fun.aggregate: the function used to aggregate the value.var column.
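Putting the arguments together, the pivot described above could be sketched as below. It calls dcast(), which dispatches to dcast.data.table() for a data.table in recent versions of the package.

```r
library(data.table)

mtcars_dt <- as.data.table(mtcars)

# Rows: cyl, columns: carb, cells: mean of mpg
pivot <- dcast(mtcars_dt, cyl ~ carb, value.var = "mpg", fun.aggregate = mean)
```

Combinations of cyl and carb that do not occur in the data are filled with a value derived from the aggregation function unless you pass fill explicitly.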

22. set() – A magic function for fast assignment operations

The set() command is an incredibly fast way to assign values to a column. The syntax is: set(dt, i, j, value), where i is the row number(s) and j is the column number or name.

As a best practice, always explicitly use integers for i and j, that is, use 10L instead of 10. It is usually used in for-loops, where it can be far faster than the equivalent data.frame assignment.

Yes, it is so fast even when used within a for-loop, which is proof that for-loop is not really a bottleneck for speed. It is the underlying data structure related overhead that causes for-loop to be slow, which is exactly what set() avoids. Besides, it works on a data.frame object as well.

It works like magic! Below is an example to illustrate the power of set() taken from official documentation itself. The speed benchmark may be outdated, but, run and check the speed by yourself to believe it.
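This is not the documentation's benchmark, just a minimal sketch of set() inside a for-loop on a made-up table, so you can see the call shape before timing it yourself.

```r
library(data.table)

# Hypothetical table: an id column and a numeric column to fill in
DT <- data.table(id = 1:1000, v = 0)

# Update one cell per iteration; i is an integer row number, j a column name
for (i in seq_len(nrow(DT))) {
  set(DT, i = i, j = "v", value = i * 2)
}
```

To compare speed, you could wrap this loop and an equivalent data.frame loop in system.time().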

23. Conclusion

We have covered all the core concepts needed to work with the data.table package. To gain more practice, try the 101 R data.table Exercises. I’d be interested to know your comments as well, so please share your thoughts in the comments section below.

Group data.table by Multiple Columns in R (Example)

This tutorial illustrates how to group a data table based on multiple variables in R programming .

Let’s dig in:

Example Data & Add-On Packages

We first need to install and load the data.table package , if we want to use the corresponding functions:

The data table below is used as the basis for this R tutorial.

Table 1 shows that our example data consists of twelve rows and four columns. The variables gr1 and gr2 are our grouping columns.

Example: Group Data Table by Multiple Columns Using list() Function

The following syntax illustrates how to group our data table based on multiple columns.

Have a look at the R code below:

Table 2 illustrates the output of the previous R code – A data table with an additional column showing the group sums of each combination of our two grouping variables.
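Since the original code and tables are not reproduced here, the following is only a sketch under assumptions: the example data (twelve rows, four columns, with grouping columns gr1 and gr2) is made up, and the names x, y and group_sum are hypothetical.

```r
library(data.table)

# Hypothetical example data: twelve rows, four columns
data <- data.table(x   = 1:12,
                   y   = 12:1,
                   gr1 = rep(LETTERS[1:3], each = 4),
                   gr2 = rep(letters[1:2], times = 6))

# Work on a copy so the example data stays unchanged
data_grouped <- copy(data)

# Add the sum of x within each combination of gr1 and gr2
data_grouped[, group_sum := sum(x), by = list(gr1, gr2)]
```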

In this article you have learned how to group data tables in R programming. In case you have further questions, let me know in the comments.

assign: Assignment by reference (:=), from the R documentation of data.table: Extension of `data.frame`

Description

Fast add, remove and update subsets of columns, by reference. := operator can be used in two ways: LHS := RHS form, and Functional form . See Usage .

set is a low-overhead loop-able version of := . It is particularly useful for repetitively updating rows of certain columns by reference (using a for-loop). See Examples . It can not perform grouping operations.

let is an alias for the functional form and behaves exactly like `:=` .

Arguments

LHS: A character vector of column names (or numeric positions) or a variable that evaluates as such. If the column doesn't exist, it is added, by reference.

RHS: A list of replacement values. It is recycled in the usual way to fill the number of rows satisfying i, if any. To remove a column use NULL.

x: A data.table. Or, set() accepts data.frame, too.

i: Optional. Indicates the rows on which the values must be updated. If not provided, implies all rows. The := form is more powerful as it allows subset- and join-based add/update of columns by reference. See Details.

In set, only integer type is allowed in i, indicating which rows value should be assigned to. NULL represents all rows more efficiently than creating a vector such as 1:nrow(x).

j: Column name(s) (character) or number(s) (integer) to be assigned value when column(s) already exist, and only column name(s) if they are to be created.

value: A list of replacement values to assign by reference to x[i, j].

:= is defined for use in j only. It adds or updates or removes column(s) by reference. It makes no copies of any part of memory at all. Please read vignette("datatable-reference-semantics") and follow with examples. Some typical usages are:
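The documentation's own usage list is not reproduced here. As a stand-in, the sketch below shows a few common forms on a small made-up table; all column names are hypothetical.

```r
library(data.table)

DT <- data.table(a = c("x", "x", "y"), b = 1:3)

# Add a new column (or update it if it already exists)
DT[, c := b * 2L]

# Update only the rows selected by i
DT[b > 1L, c := 0L]

# Grouped assignment: each row receives its group's mean
DT[, grp_mean := mean(b), by = a]

# Functional form: several columns at once
DT[, `:=`(d = b + 1L, e = b - 1L)]

# Remove columns by assigning NULL
DT[, c("d", "e") := NULL]
```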

The .Last.updated variable contains the number of rows updated by the most recent := or set calls, which may be useful, for example, in production settings for testing assumptions about the number of rows affected by a statement; see .Last.updated for details.

Note that for efficiency no check is performed for duplicate assignments, i.e. if multiple values are passed for assignment to the same index, assignment to this index will occur repeatedly and sequentially; for a given use case, consider whether it makes sense to create your own test for duplicates, e.g. in production code.

All of the following result in a friendly error (by design) :
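The original list of erroring forms is missing here. One case that can be shown safely is using := outside of a data.table's j, sketched below with the error captured rather than raised. The variable names are made up for illustration.

```r
library(data.table)

DT <- data.table(col1 = 1:3)

# Using := outside DT[...] is an error by design; capture the message
msg <- tryCatch(x := 1L, error = function(e) conditionMessage(e))

# The supported way to assign two columns in one query is the functional form
DT[, `:=`(col2 = 1L, col3 = 2L)]
```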

For additional resources, please read vignette("datatable-faq") . Also have a look at StackOverflow's data.table tag .

:= in j can be combined with all types of i (such as binary search), and all types of by . This is one reason why := has been implemented in j . Please see vignette("datatable-reference-semantics") and also FAQ 2.16 for analogies to SQL.

When LHS is a factor column and RHS is a character vector with items missing from the factor levels, the new level(s) are automatically added (by reference, efficiently), unlike base methods.

Unlike <- for data.frame , the (potentially large) LHS is not coerced to match the type of the (often small) RHS. Instead the RHS is coerced to match the type of the LHS, if necessary. Where this involves double precision values being coerced to an integer column, a warning is given when fractional data is truncated. It is best to get the column types correct up front and stick to them. Changing a column type is possible but deliberately harder: provide a whole column as the RHS. This RHS is then plonked into that column slot and we call this plonk syntax , or replace column syntax if you prefer. By needing to construct a full length vector of a new type, you as the user are more aware of what is happening and it is clearer to readers of your code that you really do intend to change the column type; e.g., DT[, colA:=as.integer(colA)] . A plonk occurs whenever you provide a RHS value to ':=' which is nrow long. When a column is plonked , the original column is not updated by reference because that would entail updating every single element of that column whereas the plonk is just one column pointer update.
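A brief sketch of both behaviours described above, on a made-up table: the RHS is coerced to the column's type on sub-assignment, while supplying a full-length RHS replaces (plonks) the column and can change its type.

```r
library(data.table)

DT <- data.table(colA = c(1, 2, 3), n = 1:3)

# Sub-assignment: the double 10 is coerced to match the integer column n
DT[2, n := 10]
type_after_subassign <- class(DT$n)

# Plonk: a whole-column RHS of a new type replaces the column
DT[, colA := as.integer(colA)]
type_after_plonk <- class(DT$colA)
```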

data.table s are not copied-on-change by := , setkey or any of the other set* functions. See copy .

DT is modified by reference and returned invisibly. If you require a copy, take a copy first (using DT2 = copy(DT) ).

Advanced (internals):

It is easy to see how sub-assigning to existing columns is done internally. Removing columns by reference is also straightforward by modifying the vector of column pointers only (using memmove in C). However adding (new) columns is more tricky as to how the data.table can be grown by reference : the list vector of column pointers is over-allocated , see truelength . By defining := in j we believe update syntax is natural, and scales, but it also bypasses [<- dispatch and allows := to update by reference with no copies of any part of memory at all.

Since [.data.table incurs overhead to check the existence and type of arguments (for example), set() provides direct (but less flexible) assignment by reference with low overhead, appropriate for use inside a for loop. See examples. := is more powerful and flexible than set() because := is intended to be combined with i and by in single queries on large datasets.

DT[a > 4, b := c] is different from DT[a > 4][, b := c] . The first expression updates (or adds) column b with the value c on those rows where a > 4 evaluates to TRUE . DT is updated by reference , therefore no assignment is needed.

The second expression on the other hand updates a new data.table that's returned by the subset operation. Since the subsetted data.table is ephemeral (it is not assigned to a symbol), the result would be lost; unless the result is assigned, for example, as follows: ans <- DT[a > 4][, b := c] .
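The difference can be seen on a small made-up table. The value 99L in the second query is arbitrary and only there to make the two results easy to tell apart.

```r
library(data.table)

DT <- data.table(a = 1:6, b = 0L, c = 11:16)

# Updates DT itself, only on rows where a > 4
DT[a > 4, b := c]

# Subsets first (creating a new data.table), then updates that new table;
# DT is untouched, so the result must be assigned to keep it
ans <- DT[a > 4][, b := 99L]
```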

See also: data.table, copy, setalloccol, truelength, set, .Last.updated

Column assignment and reference semantics in {data.table}.

Posted on February 18, 2024 by Toby Hocking

The goal of this blog post is to explain some similarities and differences between the base R data.frame object type, and the data.table object type. We will focus on accessing and assigning values, and discuss two major differences: syntax and semantics.

Syntax means the structure of the code that is used: the characters and symbols that execute tasks. The data.table package uses a syntax where most operations can be done within the square brackets: DT[i, j, by] .

Semantics refers to the internal structure of an object or variable. We say that a data.table object has reference semantics , meaning we can modify a data.table from within a function, and see those modifications after the function is done executing. In other words, two different R variables can point to, and modify, the same data.table .

Difference in syntax

To break down the similarities and differences in syntax, consider the table below:

| type  | name     | columns  | code                             |
|-------|----------|----------|----------------------------------|
| frame | literal  | one      | df$col_name <- value             |
| table | literal  | one      | DT[, col_name := value]          |
| frame | variable | multiple | df[, col_names_list] <- values   |
| table | variable | multiple | DT[, (col_names_list) := values] |

The table above defines the different syntax required to do column assignment in data tables ( DT ) and frames ( df ).

type indicates object type: frame or table .

name indicates whether the column(s) to assign are literally written in the code ( col_name ), or if the names are stored in a variable ( col_names_list ).

columns indicates whether only one or multiple (one or more) columns can be assigned using the syntax.

code is the exact syntax of the R code used for the assignment.

Note that there are other ways to do column assignment. For example,

df[["col_name"]] <- value can also be used for single column assignment in a data frame.

set(DT, j=col_names_list, value=values) is a more efficient version of column assignment for data tables, that is recommended for use in loops, as it avoids the overhead of the [.data.table method.

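Below is a reshaped version of the table above, to facilitate easier comparison between frame and table versions:

| name     | columns  | frame                          | table                            |
|----------|----------|--------------------------------|----------------------------------|
| literal  | one      | df$col_name <- value           | DT[, col_name := value]          |
| variable | multiple | df[, col_names_list] <- values | DT[, (col_names_list) := values] |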

The table above shows the equivalent code for assignment of columns using either a data.frame or data.table . In fact, the code in the frame column above can also be used for assignment of a data.table , but it may be less efficient than the data table square brackets, as we will discuss in the next section.

One reason why data.table uses a custom assignment syntax is for consistency: the same syntax can be used, with square brackets and := , for one or multiple column assignment. (Note the use of parentheses around col_names_list in the second row of the table column above, to indicate that the left side of := is a variable storing column names or numbers, instead of a direct unquoted column name.)
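The four forms from the table, plus set(), can be tried side by side. This is a sketch on made-up data; the column names x, y, a and b and the assigned values are assumptions.

```r
library(data.table)

df <- data.frame(x = 1:3)
DT <- data.table(x = 1:3)

# One column, name written literally
df$y <- 0
DT[, y := 0]

# Several columns, names stored in a variable
col_names_list <- c("a", "b")
df[, col_names_list] <- list(1, 2)
DT[, (col_names_list) := list(1, 2)]

# set() performs the same kind of assignment with less overhead
set(DT, j = col_names_list, value = list(10, 20))
```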

Another reason why data.table uses a custom assignment syntax is for efficiency, as we see in the next section.

Base “copy on write” versus data.table reference semantics

R has “copy on write” semantics, meaning that in base R if a variable is modified inside a function, a copy is made of the whole variable. For example, consider the code below

In the code above, we pass dt_outside to the base_assign function, which makes a copy called dt_inside before it is modified, so that the data in dt_outside is unchanged after the function is done. Compare with the code below,

The output above shows that by using the square brackets and := assignment, we can modify data.table objects in functions without copying them. Here, the variables dt_inside and dt_outside point to the same underlying data.
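The post's original code is not shown here, so the following is a reconstruction under assumptions. The names base_assign, dt_inside and dt_outside come from the text; the function bodies, the second function name dt_assign, and the column x are hypothetical.

```r
library(data.table)

dt_outside <- data.table(x = 1:3)

# Base-style assignment inside a function: dt_inside becomes a copy
base_assign <- function(dt_inside) {
  dt_inside$x <- 0L
  dt_inside
}
result_base <- base_assign(dt_outside)
unchanged_after_base <- identical(dt_outside$x, 1:3)

# data.table assignment inside a function: modifies the caller's data
dt_assign <- function(dt_inside) {
  dt_inside[, x := 0L]
  invisible(dt_inside)
}
dt_assign(dt_outside)
changed_after_dt <- all(dt_outside$x == 0L)
```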

Efficiency of reference semantics

The code in this section used a data.table object in both function calls to illustrate the constant time/space assignment which is possible, but the visualized result also applies to other data structures.

As an exercise, add two more expressions to the atime benchmark: base_assign with a data.frame object and tibble object. You should see linear time/space for both.

Conclusions

In this post we have explored the syntax and semantics for assignment using base R and data.table square brackets with := , and we have seen how the reference semantics of data.table can be very beneficial for computational efficiency.

Frequently Asked Questions about data.table

  • 1.1 Why do DT[ , 5] and DT[2, 5] return a 1-column data.table rather than vectors like data.frame?
  • 1.2 Why does DT[,"region"] return a 1-column data.table rather than a vector?
  • 1.3 Why does DT[, region] return a vector for the “region” column? I’d like a 1-column data.table.
  • 1.4 Why does DT[ , x, y, z] not work? I wanted the 3 columns x, y and z.
  • 1.5 I assigned a variable mycol="x" but then DT[, mycol] returns an error. How do I get it to look up the column name contained in the mycol variable?
  • 1.6 What are the benefits of being able to use column names as if they are variables inside DT[...]?
  • 1.7 OK, I’m starting to see what data.table is about, but why didn’t you just enhance data.frame in R? Why does it have to be a new package?
  • 1.8 Why are the defaults the way they are? Why does it work the way it does?
  • 1.9 Isn’t this already done by with() and subset() in base?
  • 1.10 Why does X[Y] return all the columns from Y too? Shouldn’t it return a subset of X?
  • 1.11 What is the difference between X[Y] and merge(X, Y)?
  • 1.12 Anything else about X[Y, sum(foo*bar)]?
  • 1.13 That’s nice. How did you manage to change it given that users depended on the old behaviour?
  • 2.1 How can I avoid writing a really long j expression? You’ve said that I should use the column names, but I’ve got a lot of columns.
  • 2.2 Why is the default for mult now "all"?
  • 2.3 I’m using c() in j and getting strange results.
  • 2.4 I have built up a complex table with many columns. I want to use it as a template for a new table; i.e., create a new table with no rows, but with the column names and types copied from my table. Can I do that easily?
  • 2.5 Is a null data.table the same as DT[0]?
  • 2.6 Why has the DT() alias been removed?
  • 2.7 But my code uses j = DT(...) and it works. The previous FAQ says that DT() has been removed.
  • 2.8 What are the scoping rules for j expressions?
  • 2.9 Can I trace the j expression as it runs through the groups?
  • 2.10 Inside each group, why are the group variables length-1?
  • 2.11 Only the first 10 rows are printed, how do I print more?
  • 2.12 With an X[Y] join, what if X contains a column called "Y"?
  • 2.13 X[Z[Y]] is failing because X contains a column "Y". I’d like it to use the table Y in calling scope.
  • 2.14 Can you explain further why data.table is inspired by A[B] syntax in base?
  • 2.15 Can base be changed to do this then, rather than a new package?
  • 2.16 I’ve heard that data.table syntax is analogous to SQL.
  • 2.17 What are the smaller syntax differences between data.frame and data.table?
  • 2.18 I’m using j for its side effect only, but I’m still getting data returned. How do I stop that?
  • 2.19 Why does [.data.table now have a drop argument from v1.5?
  • 2.20 Rolling joins are cool and very fast. Was that hard to program?
  • 2.21 Why does DT[i, col := value] return the whole of DT? I expected either no visible value (consistent with <-), or a message or return value containing how many rows were updated. It isn’t obvious that the data has indeed been updated by reference.
  • 2.22 OK, thanks. What was so difficult about the result of DT[i, col := value] being returned invisibly?
  • 2.23 Why do I have to type DT sometimes twice after using := to print the result to console?
  • 2.24 I’ve noticed that base::cbind.data.frame (and base::rbind.data.frame) appear to be changed by data.table. How is this possible? Why?
  • 2.25 I’ve read about method dispatch (e.g. merge may or may not dispatch to merge.data.table) but how does R know how to dispatch? Are dots significant or special? How on earth does R know which function to dispatch and when?
  • 3.1 I have 20 columns and a large number of rows. Why is an expression of one column so quick?
  • 3.2 I don’t have a key on a large table, but grouping is still really quick. Why is that?
  • 3.3 Why is grouping by columns in the key faster than an ad hoc by?
  • 3.4 What are primary and secondary indexes in data.table?
  • 4.1 “Could not find function DT”
  • 4.2 “unused argument(s) (mysum = sum(v))”
  • 4.3 “translateCharUTF8 must be called on a CHARSXP”
  • 4.4 cbind(DT, DF) returns a strange format, e.g. integer,5
  • 4.5 “cannot change value of locked binding for .SD”
  • 4.6 “cannot change value of locked binding for .N”
  • 5.1 “The following object(s) are masked from package:base: cbind, rbind”
  • 5.2 “Coerced numeric RHS to integer to match the column’s type”
  • 5.3 Reading data.table from RDS or RData file
  • 6.1 v1.3 appears to be missing from the CRAN archive
  • 6.2 Is data.table compatible with S-plus?
  • 6.3 Is it available for Linux, Mac and Windows?
  • 6.4 I think it’s great. What can I do?
  • 6.5 I think it’s not great. How do I warn others about my experience?
  • 6.6 I have a question. I know the r-help posting guide tells me to contact the maintainer (not r-help), but is there a larger group of people I can ask?
  • 6.7 Where are the datatable-help archives?
  • 6.8 I’d prefer not to post on the Issues page, can I mail just one or two people privately?
  • 6.9 I have created a package that uses data.table. How do I ensure my package is data.table-aware so that inheritance from data.frame works?

The first section, Beginner FAQs, is intended to be read in order, from start to finish. It’s just written in a FAQ style to be digested more easily. It isn’t really the most frequently asked questions. A better measure for that is looking on Stack Overflow.

This FAQ is required reading and considered core documentation. Please do not ask questions on Stack Overflow or raise issues on GitHub until you have read it. We can all tell when you ask that you haven’t read it. So if you do ask and haven’t read it, don’t use your real name.

This document has been quickly revised given the changes in v1.9.8 released Nov 2016. Please do submit pull requests to fix mistakes or improvements. If anyone knows why the table of contents comes out so narrow and squashed when displayed by CRAN, please let us know. This document used to be a PDF and we changed it recently to HTML.

1. Beginner FAQs

1.1 Why do DT[ , 5] and DT[2, 5] return a 1-column data.table rather than vectors like data.frame?

For consistency so that when you use data.table in functions that accept varying inputs, you can rely on DT[...] returning a data.table. You don’t have to remember to include drop=FALSE like you do in data.frame. data.table was first released in 2006 and this difference to data.frame has been a feature since the very beginning.

You may have heard that it is generally bad practice to refer to columns by number rather than name, though. If your colleague comes along and reads your code later they may have to hunt around to find out which column is number 5. If you or they change the column ordering higher up in your R program, you may produce wrong results with no warning or error if you forget to change all the places in your code which refer to column number 5. That is your fault not R’s or data.table’s. It’s really really bad. Please don’t do it. It’s the same mantra as professional SQL developers have: never use select * , always explicitly select by column name to at least try to be robust to future changes.

Say column 5 is named "region" and you really must extract that column as a vector not a data.table. It is more robust to use the column name and write DT$region or DT[["region"]] ; i.e., the same as base R. Using base R’s $ and [[ on data.table is encouraged. Not when combined with <- to assign (use := instead for that) but just to select a single column by name they are encouraged.

There are some circumstances where referring to a column by number seems like the only way, such as a sequence of columns. In these situations just like data.frame, you can write DT[, 5:10] and DT[,c(1,4,10)] . However, again, it is more robust (to future changes in your data’s number of and ordering of columns) to use a named range such as DT[,columnRed:columnViolet] or name each one DT[,c("columnRed","columnOrange","columnYellow")] . It is harder work up front, but you will probably thank yourself and your colleagues might thank you in the future. At least you can say you tried your best to write robust code if something does go wrong.

However, what we really want you to do is DT[,.(columnRed,columnOrange,columnYellow)] ; i.e., use column names as if they are variables directly inside DT[...] . You don’t have to prefix each column with DT$ like you do in data.frame. The .() part is just an alias for list() and you can use list() instead if you prefer. You can place any R expression of column names, using any R package, returning different types of different lengths, right there. We wanted to encourage you to do that so strongly in the past that we deliberately didn’t make DT[,5] work at all. Before v1.9.8 released Nov 2016, DT[,5] used to just return 5 . The thinking was that we could more simply teach one fact that the parts inside DT[...] get evaluated within the frame of DT always (they see column names as if they are variables). And 5 evaluates to 5 so that behaviour was consistent with the single rule. We asked you to go through an extra deliberate hurdle DT[,5,with=FALSE] if you really wanted to select a column by name or number. Going forward from Nov 2016, you don’t need to use with=FALSE and we’ll see how greater consistency with data.frame in this regard will help or hinder both new and long-time users. The new users who don’t read this FAQ, not even this very first entry, will hopefully not stumble as soon with data.table as they did before if they had expected it to work like data.frame. Hopefully they will not miss out on understanding our intent and recommendation to place expressions of columns inside DT[i, j, by] . If they use data.table like data.frame they won’t gain any benefits. If you know anyone like that, please give them a friendly nudge to read this document like you are.

Reminder: you can place any R expression inside DT[...] using column names as if they are variables; e.g., try DT[, colA*colB/2] . That does return a vector because you used column names as if they are variables. Wrap with .() to return a data.table; i.e. DT[,.(colA*colB/2)] . Name it: DT[,.(myResult = colA*colB/2)] . And we’ll leave it to you to guess how to return two things from this query. It’s also quite common to do a bunch of things inside an anonymous body: DT[, { x<-colA+10; x*x/2 }] or call another package’s function: DT[ , fitdistr(columnA, "normal")] .
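As a minimal sketch of the reminder above, assuming a small made-up table (the table `DT` and its values are illustrative only), the vector, unnamed and named forms might look like this:

```r
library(data.table)

# Hypothetical example table; names and values are made up for illustration
DT <- data.table(colA = c(2, 4, 6), colB = c(1, 3, 5))

v  <- DT[, colA * colB / 2]                # plain vector
r1 <- DT[, .(colA * colB / 2)]             # one-column data.table
r2 <- DT[, .(myResult = colA * colB / 2)]  # same, with a chosen column name

# Several statements in an anonymous body; the last value is returned
r3 <- DT[, { x <- colA + 10; x * x / 2 }]
```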

See the answer above . Try DT$region instead. Or DT[["region"]] .

Try DT[ , .(region)] instead. .() is an alias for list() and ensures a data.table is returned.

Also continue reading and see the FAQ after next. Skim whole documents before getting stuck in one part.

The j expression is the 2nd argument. Try DT[ , c("x","y","z")] or DT[ , .(x,y,z)] .

The error is that a column named "mycol" cannot be found, and this error is correct. data.table's scoping is different to data.frame in that you can use column names as if they are variables directly inside DT[...] without prefixing each column name with DT$; see FAQ 1.1 above.

To use mycol to select the column x from DT , there are a few options:
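The list of options that originally followed is not reproduced here. The sketch below shows some approaches that are known to work, on a made-up table; the table contents and the result names `a`, `b`, `d` and `e` are assumptions for illustration.

```r
library(data.table)

DT <- data.table(x = 1:3, y = c("a", "b", "c"))  # hypothetical table
mycol <- "x"

a <- DT[, ..mycol]               # .. prefix: look mycol up in the calling scope
b <- DT[, mycol, with = FALSE]   # data.frame-style evaluation of j
d <- DT[[mycol]]                 # list-style extraction; returns a vector
e <- DT[, .SD, .SDcols = mycol]  # .SD restricted to the named column
```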

See ?data.table for more details about the .. prefix.

The with argument takes its name from the base function with(). When with=TRUE (default), data.table operates similarly to with(), i.e. DT[, mycol] behaves like with(DT, mycol). When with=FALSE, the standard data.frame evaluation rules apply to all variables in j and you can no longer use column names directly.

j doesn’t have to be just column names. You can write any R expression of column names directly in j , e.g. , DT[ , mean(x*y/z)] . The same applies to i , e.g. , DT[x>1000, sum(y*z)] .

This runs the j expression on the set of rows where the i expression is true. You don’t even need to return data, e.g., DT[x>1000, plot(y, z)]. You can do j by group simply by adding by =; e.g., DT[x>1000, sum(y*z), by = w]. This runs j for each group in column w but just over the rows where x>1000. By placing the 3 parts of the query (i=where, j=select and by=group by) inside the square brackets, data.table sees this query as a whole before any part of it is evaluated. Thus it can optimize the combined query for performance. It can do this because the R language uniquely has lazy evaluation (Python and Julia do not). data.table sees the expressions inside DT[...] before they are evaluated and optimizes them before evaluation. For example, if data.table sees that you’re only using 2 columns out of 100, it won’t bother to subset the 98 that aren’t needed by your j expression.
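A small sketch of the where / select / group-by pattern described above, using a made-up table whose columns `w`, `x`, `y` and `z` mirror the names in the text (the values are assumptions):

```r
library(data.table)

# Hypothetical data; only rows 2, 3 and 5 satisfy x > 1000
DT <- data.table(
  w = c("a", "a", "b", "b", "b"),
  x = c(500, 1500, 2000, 800, 3000),
  y = c(1, 2, 3, 4, 5),
  z = c(10, 10, 10, 10, 10)
)

total <- DT[x > 1000, sum(y * z)]          # j over the filtered rows only
by_w  <- DT[x > 1000, sum(y * z), by = w]  # same j, once per group of w
```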

As highlighted above , j in [.data.table is fundamentally different from j in [.data.frame . Even if something as simple as DF[ , 1] was changed in base R to return a data.frame rather than a vector, that would break existing code in many 1000’s of CRAN packages and user code. As soon as we took the step to create a new class that inherited from data.frame, we had the opportunity to change a few things and we did. We want data.table to be slightly different and to work this way for more complicated syntax to work. There are other differences, too (see below ).

Furthermore, data.table inherits from data.frame . It is a data.frame , too. A data.table can be passed to any package that only accepts data.frame and that package can use [.data.frame syntax on the data.table. See this answer for how that is achieved.

We have proposed enhancements to R wherever possible, too. One of these was accepted as a new feature in R 2.12.0:

unique() and match() are now faster on character vectors where all elements are in the global CHARSXP cache and have unmarked encoding (ASCII). Thanks to Matt Dowle for suggesting improvements to the way the hash code is generated in unique.c.

A second proposal was to use memcpy in duplicate.c, which is much faster than a for loop in C. This would improve the way that R copies data internally (on some measures by 13 times). The thread on r-devel is here .

A third more significant proposal that was accepted is that R now uses data.table’s radix sort code as from R 3.3.0:

The radix sort algorithm and implementation from data.table (forder) replaces the previous radix (counting) sort and adds a new method for order(). Contributed by Matt Dowle and Arun Srinivasan, the new algorithm supports logical, integer (even with large values), real, and character vectors. It outperforms all other methods, but there are some caveats (see ?sort).

This was a big event for us and we celebrated until the cows came home. (Not really.)

The simple answer is because the main author originally designed it for his own use. He wanted it that way. He finds it a more natural, faster way to write code, which also executes more quickly.

Some of the features discussed so far are, yes. The package builds upon base functionality. It does the same sorts of things but with less code required and executes many times faster if used correctly.

This was changed in v1.5.3 (Feb 2011). Since then X[Y] includes Y’s non-join columns. We refer to this feature as join inherited scope because not only are X columns available to the j expression, so are Y columns. The downside is that X[Y] is less efficient since every item of Y’s non-join columns is duplicated to match the (likely large) number of rows in X that match. We therefore strongly encourage X[Y, j] instead of X[Y]. See next FAQ.

X[Y] is a join, looking up X ’s rows using Y (or Y ’s key if it has one) as an index.

Y[X] is a join, looking up Y ’s rows using X (or X ’s key if it has one) as an index.

merge(X,Y) 1 does both ways at the same time. The number of rows of X[Y] and Y[X] usually differ, whereas the number of rows returned by merge(X, Y) and merge(Y, X) is the same.
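To make the row counts concrete, here is a hedged sketch on two made-up tables. It uses an explicit `on = "id"` rather than keys, and the column names `id`, `foo` and `bar` are assumptions for illustration.

```r
library(data.table)

X <- data.table(id = c(1L, 2L, 3L), foo = c(10, 20, 30))
Y <- data.table(id = c(2L, 3L, 4L, 5L), bar = c(1, 2, 3, 4))

xy <- X[Y, on = "id"]         # one row per row of Y; unmatched ids get NA in foo
yx <- Y[X, on = "id"]         # one row per row of X; unmatched ids get NA in bar
m1 <- merge(X, Y, by = "id")  # only ids present in both tables
m2 <- merge(Y, X, by = "id")  # same rows, columns in a different order
```

With these inputs, `X[Y]` and `Y[X]` differ in length while both `merge()` calls return the same number of rows.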

BUT that misses the main point. Most tasks require something to be done on the data after a join or merge. Why merge all the columns of data, only to use a small subset of them afterwards? You may suggest merge(X[ , ColsNeeded1], Y[ , ColsNeeded2]) , but that requires the programmer to work out which columns are needed. X[Y, j] in data.table does all that in one step for you. When you write X[Y, sum(foo*bar)] , data.table automatically inspects the j expression to see which columns it uses. It will subset those columns only; the others are ignored. Memory is only created for the columns j uses and Y columns enjoy standard R recycling rules within the context of each group. Let’s say foo is in X and bar is in Y (along with 20 other columns in Y ). Isn’t X[Y, sum(foo*bar)] quicker to program and quicker to run than a merge of everything wastefully followed by a subset ?

This behaviour changed in v1.9.4 (Sep 2014). It now does the X[Y] join and then runs sum(foo*bar) over all the rows; i.e., X[Y][ , sum(foo*bar)] . It used to run j for each group of X that each row of Y matches to. That can still be done as it’s very useful but you now need to be explicit and specify by = .EACHI , i.e. , X[Y, sum(foo*bar), by = .EACHI] . We call this grouping by each i .

For example, (further complicating it by using join inherited scope , too):
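A minimal sketch of the difference, assuming two small made-up tables joined with an explicit `on = "grp"` (the data and the result names are illustrative only). Note that `bar` lives in `Y` but is visible to `j` through join inherited scope.

```r
library(data.table)

X <- data.table(grp = c("a", "a", "b", "b", "b", "c", "c"), foo = 1:7)
Y <- data.table(grp = c("b", "c"), bar = c(4, 2))

# j runs once over all rows of the join: (3+4+5)*4 + (6+7)*2
all_rows <- X[Y, sum(foo * bar), on = "grp"]

# j runs once per row of Y (grouping by each i)
each_i <- X[Y, sum(foo * bar), on = "grp", by = .EACHI]
```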

The request to change came from users. The feeling was that if a query is doing grouping then an explicit by= should be present for code readability reasons. An option was provided to return the old behaviour: options(datatable.old.bywithoutby) , by default FALSE . This enabled upgrading to test the other new features / bug fixes in v1.9.4, with later migration of any by-without-by queries when ready by adding by=.EACHI to them. We retained 47 pre-change tests and added them back as new tests, tested under options(datatable.old.bywithoutby=TRUE) . We added a startup message about the change and how to revert to the old behaviour. After 1 year the option was deprecated with warning when used. After 2 years the option to revert to old behaviour was removed.

Of the 66 packages on CRAN or Bioconductor that depended on or import data.table at the time of releasing v1.9.4 (it is now over 300), only one was affected by the change. That could be because many packages don’t have comprehensive tests, or just that grouping by each row in i wasn’t being used much by downstream packages. We always test the new version with all dependent packages before release and coordinate any changes with those maintainers. So this release was quite straightforward in that regard.

Another compelling reason to make the change was that previously, there was no efficient way to achieve what X[Y, sum(foo*bar)] does now. You had to write X[Y][ , sum(foo*bar)] . That was suboptimal because X[Y] joined all the columns and passed them all to the second compound query without knowing that only foo and bar are needed. To solve that efficiency problem, extra programming effort was required: X[Y, list(foo, bar)][ , sum(foo*bar)] . The change to by = .EACHI has simplified this by allowing both queries to be expressed inside a single DT[...] query for efficiency.

2. General Syntax

When grouping, the j expression can use column names as variables, as you know, but it can also use a reserved symbol .SD which refers to the S ubset of the D ata.table for each group (excluding the grouping columns). So to sum up all your columns it’s just DT[ , lapply(.SD, sum), by = grp] . It might seem tricky, but it’s fast to write and fast to run. Notice you don’t have to create an anonymous function. The .SD object is efficiently implemented internally and more efficient than passing an argument to a function. But if the .SD symbol appears in j then data.table has to populate .SD fully for each group even if j doesn’t use all of it.

So please don’t do, for example, DT[ , sum(.SD[["sales"]]), by = grp] . That works but is inefficient and inelegant. DT[ , sum(sales), by = grp] is what was intended, and it could be 100s of times faster. If you use all of the data in .SD for each group (such as in DT[ , lapply(.SD, sum), by = grp] ) then that’s very good usage of .SD . If you’re using several but not all of the columns, you can combine .SD with .SDcols ; see ?data.table .
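The following sketch contrasts the direct column form with `.SD` plus `.SDcols`, on a made-up table (column names other than `sales` and `grp` are assumptions). Restricting `.SDcols` also keeps the character column `note` away from `sum()`.

```r
library(data.table)

DT <- data.table(
  grp   = c("a", "a", "b"),
  sales = c(1, 2, 3),
  cost  = c(4, 5, 6),
  note  = c("p", "q", "r")
)

# Direct use of the column: only sales is needed per group
fast <- DT[, sum(sales), by = grp]

# Several but not all columns: restrict .SD with .SDcols
sums <- DT[, lapply(.SD, sum), by = grp, .SDcols = c("sales", "cost")]
```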

In v1.5.3 the default was changed to "all" . When i (or i ’s key if it has one) has fewer columns than x ’s key, mult was already set to "all" automatically. Changing the default makes this clearer and easier for users as it came up quite often.

In versions up to v1.3, "all" was slower. Internally, "all" was implemented by joining using "first" , then again from scratch using "last" , after which a diff between them was performed to work out the span of the matches in x for each row in i . Most often we join to single rows, though, where "first" , "last" and "all" return the same result. We preferred maximum performance for the majority of situations so the default chosen was "first" . When working with a non-unique key (generally a single column containing a grouping variable), DT["A"] returned the first row of that group so DT["A", mult = "all"] was needed to return all the rows in that group.

In v1.4 the binary search in C was changed to branch at the deepest level to find first and last. That branch will likely occur within the same final pages of RAM so there should no longer be a speed disadvantage in defaulting mult to "all" . We warned that the default might change and made the change in v1.5.3.

A future version of data.table may allow a distinction between a key and a unique key . Internally mult = "all" would perform more like mult = "first" when all x ’s key columns were joined to and x ’s key was a unique key. data.table would need checks on insert and update to make sure a unique key is maintained. An advantage of specifying a unique key would be that data.table would ensure no duplicates could be inserted, in addition to performance.

This is a common source of confusion. In data.frame you are used to, for example:

which returns the two columns. In data.table you know you can use the column names directly and might try:

but this returns one vector. Remember that the j expression is evaluated within the environment of DT and c() returns a vector. If 2 or more columns are required, use list() or .() instead:

c() can be useful in a data.table too, but its behaviour is different from that in [.data.frame .
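The three cases discussed above can be sketched as follows, assuming a made-up three-column table (the names `DF`, `DT`, `x`, `y` and `z` are illustrative):

```r
library(data.table)

DF <- data.frame(x = 1:3, y = 4:6, z = 7:9)
DT <- as.data.table(DF)

two_cols_df <- DF[, c("x", "y")]  # data.frame: two columns
one_vector  <- DT[, c(x, y)]      # data.table: c() concatenates into one vector
two_cols_dt <- DT[, .(x, y)]      # data.table: list()/.() returns two columns
```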

Yes. If your complex table is called DT , try NEWDT = DT[0] .

No. By “null data.table” we mean the result of data.table(NULL) or as.data.table(NULL) ; i.e. ,

The null data.table|frame is NULL with some attributes attached, which means it’s no longer NULL. In R only pure NULL is NULL as tested by is.null(). When referring to the “null data.table” we use lower case null to help distinguish from upper case NULL. To test for the null data.table, use length(DT) == 0 or ncol(DT) == 0 (length is slightly faster as it’s a primitive function).

An empty data.table ( DT[0] ) has one or more columns, all of which are empty. Those empty columns still have names and types.
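A short sketch of the distinction between an empty and a null data.table, using a made-up table:

```r
library(data.table)

DT <- data.table(a = 1:3, b = c("x", "y", "z"))

empty   <- DT[0]            # zero rows, but column names and types survive
null_dt <- data.table(NULL) # no columns at all

names(empty)         # "a" "b"
is.null(null_dt)     # FALSE: it is not the R object NULL
length(null_dt) == 0 # TRUE: the suggested test for a null data.table
```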

DT was introduced originally as a wrapper for a list of j expressions. Since DT was an alias for data.table, this was a convenient way to take care of silent recycling in cases where each item of the j list evaluated to different lengths. The alias was one reason grouping was slow, though.

As of v1.3, list() or .() should be passed instead to the j argument. These are much faster, especially when there are many groups. Internally, this was a non-trivial change. Vector recycling is now done internally, along with several other speed enhancements for grouping.

Then you are using a version prior to 1.5.3. Prior to 1.5.3 [.data.table detected use of DT() in the j and automatically replaced it with a call to list() . This was to help the transition for existing users.

Think of the subset as an environment where all the column names are variables. When a variable foo is used in the j of a query such as X[Y, sum(foo)] , foo is looked for in the following order:

  • The scope of X ’s subset; i.e. , X ’s column names.
  • The scope of each row of Y ; i.e. , Y ’s column names ( join inherited scope )
  • The scope of the calling frame; e.g. , the line that appears before the data.table query.
  • Exercise for reader: does it then ripple up the calling frames, or go straight to globalenv() ?
  • The global environment

This is lexical scoping as explained in R FAQ 3.3.1 . The environment in which the function was created is not relevant, though, because there is no function . No anonymous function is passed to j . Instead, an anonymous body is passed to j ; for example,

Some programming languages call this a lambda .
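As an illustration of an anonymous body in `j`, here is a minimal sketch on a made-up table; `z` is a temporary variable that exists only while `j` runs for each group.

```r
library(data.table)

DT <- data.table(x = rep(c("a", "b"), c(2, 3)), y = 1:5)

# The braces form a body, not a function: no arguments, no function() call
res <- DT[, { z <- sum(y); z + 3 }, by = x]
```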

Try something like this:
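The code that originally followed is not reproduced here. One possible sketch, assuming the aim is to watch `j` as it runs through the groups of a made-up table grouped by `x`, is:

```r
library(data.table)

DT <- data.table(x = rep(c("a", "b"), c(2, 3)), y = 1:5)

res <- DT[, {
  # Print what j sees for the current group before computing the result
  cat("group x =", x, " y =", y, "\n")
  sum(y)
}, by = x]
```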

Above , x is a grouping variable and (as from v1.6.1) has length 1 (if inspected or used in j ). It’s for efficiency and convenience. Therefore, there is no difference between the following two statements:

If you need the size of the current group, use .N rather than calling length() on any column.
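A hedged sketch of the point about the grouping variable, on a made-up table: since `x` has length 1 inside `j`, using `x` or `x[1]` gives the same result, and `.N` gives the group size.

```r
library(data.table)

DT <- data.table(x = rep(c("a", "b"), c(2, 3)), y = 1:5)

r1 <- DT[, .(groupname = x,    total = sum(y)), by = x]
r2 <- DT[, .(groupname = x[1], total = sum(y)), by = x]

sizes <- DT[, .N, by = x]  # number of rows in each group
```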

There are two things happening here. First, if the number of rows in a data.table is large ( > 100 by default), then a summary of the data.table is printed to the console by default. Second, the summary of a large data.table is printed by taking the top and bottom n ( = 5 by default) rows of the data.table and only printing those. Both of these parameters (when to trigger a summary and how much of a table to use as a summary) are configurable by R’s options mechanism, or by calling the print function directly.

For instance, to enforce the summary of a data.table to only happen when a data.table is greater than 50 rows, you could options(datatable.print.nrows = 50) . To disable the summary-by-default completely, you could options(datatable.print.nrows = Inf) . You could also call print directly, as in print(your.data.table, nrows = Inf) .

If you want to show more than just the top (and bottom) 10 rows of a data.table summary (say you like 20), set options(datatable.print.topn = 20) , for example. Again, you could also just call print directly, as in print(your.data.table, topn = 20) .
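Both mechanisms can be sketched as follows on a made-up 200-row table; the variable names `shown` and `old` are assumptions for illustration.

```r
library(data.table)

DT <- data.table(id = 1:200, value = (1:200) / 2)

# Per-call control: summarise, showing the top 3 and bottom 3 rows
shown <- capture.output(print(DT, topn = 3))

# Session-wide control through options(); keep the old values to restore them
old <- options(datatable.print.nrows = 50, datatable.print.topn = 20)
print(DT)     # summary now shows 20 rows at each end
options(old)  # restore the previous settings
```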

When i is a single name such as Y it is evaluated in the calling frame. In all other cases such as calls to .() or other expressions, i is evaluated within the scope of X . This facilitates easy self-joins such as X[J(unique(colA)), mult = "first"] .
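A minimal sketch of such a self-join, assuming a made-up table `X` keyed on `colA`. Because `i` is a call rather than a single name, `colA` inside `J()` is found among `X`'s own columns.

```r
library(data.table)

X <- data.table(colA = c("a", "a", "b", "c", "c"), val = 1:5)
setkey(X, colA)

# First row of X for each distinct value of colA
firsts <- X[J(unique(colA)), mult = "first"]
```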

The Z[Y] part is not a single name so that is evaluated within the frame of X and the problem occurs. Try tmp = Z[Y]; X[tmp] . This is robust to X containing a column "tmp" because tmp is a single name. If you often encounter conflicts of this type, one simple solution may be to name all tables in uppercase and all column names in lowercase, or some similar scheme.

Consider A[B] syntax using an example matrix A :

To obtain cells (1, 2) = 5 and (3, 3) = 11 many users (we believe) may try this first:

However, this returns the union of those rows and columns. To reference the cells, a 2-column matrix is required. ?Extract says:

When indexing arrays by [ a single argument i can be a matrix with as many columns as there are dimensions of x ; the result is then a vector with elements corresponding to the sets of indices in each row of i .

Let’s try again.
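The example matrix itself is not shown above. The sketch below assumes `A` is `matrix(1:12, nrow = 4)`, which is consistent with the stated cells (1, 2) = 5 and (3, 3) = 11, and shows both attempts.

```r
# Assumed example matrix: filled column by column with 1..12
A <- matrix(1:12, nrow = 4)

# First attempt: separate row and column vectors give the 2 x 2 union
union_block <- A[c(1, 3), c(2, 3)]

# Second attempt: a 2-column matrix, one (row, column) pair per line
B <- cbind(c(1, 3), c(2, 3))
cells <- A[B]
```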

A matrix is a 2-dimensional structure with row names and column names. Can we do the same with names?
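A sketch of the same lookup by name, again assuming the 4 x 3 example matrix `matrix(1:12, nrow = 4)` and made-up row and column names:

```r
A <- matrix(1:12, nrow = 4,
            dimnames = list(letters[1:4], LETTERS[1:3]))

# One (row name, column name) pair per line
B <- cbind(c("a", "c"), c("B", "C"))
named_cells <- A[B]
```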

So yes, we can. Can we do the same with a data.frame ?

But, notice that the result was coerced to character. R coerced A to matrix first so that the syntax could work, but the result isn’t ideal. Let’s try making B a data.frame .
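Both data.frame attempts can be sketched as follows. The contents of `A` are assumptions chosen to mix column types, and `tryCatch()` is used only so that the sketch runs to completion when base R raises an error.

```r
# A as a data.frame with mixed column types and row names
A <- data.frame(A = 1:4, B = letters[11:14], C = pi * 1:4)
rownames(A) <- letters[1:4]

# Subsetting by a character matrix works, but A is coerced to a
# character matrix first, so every result is a string
B <- cbind(c("a", "c"), c("B", "C"))
by_matrix <- A[B]

# Subsetting by a data.frame of the same pairs fails in base R;
# by_df holds the error message
B_df <- data.frame(c("a", "c"), c("B", "C"))
by_df <- tryCatch(A[B_df], error = function(e) conditionMessage(e))
```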

So we can’t subset a data.frame by a data.frame in base R. What if we want row names and column names that aren’t character but integer or float ? What if we want more than 2 dimensions of mixed types? Enter data.table.

Furthermore, matrices, especially sparse matrices, are often stored in a 3-column tuple: (i, j, value) . This can be thought of as a key-value pair where i and j form a 2-column key. If we have more than one value, perhaps of different types, it might look like (i, j, val1, val2, val3, ...) . This looks very much like a data.frame . Hence data.table extends data.frame so that a data.frame X can be subset by a data.frame Y , leading to the X[Y] syntax.

data.frame is used everywhere and so it is very difficult to make any changes to it. data.table inherits from data.frame . It is a data.frame , too. A data.table can be passed to any package that only accepts data.frame . When that package uses [.data.frame syntax on the data.table, it works. It works because [.data.table looks to see where it was called from. If it was called from such a package, [.data.table diverts to [.data.frame .

  • i \(\Leftrightarrow\) where
  • j \(\Leftrightarrow\) select
  • := \(\Leftrightarrow\) update
  • by \(\Leftrightarrow\) group by
  • i \(\Leftrightarrow\) order by (in compound syntax)
  • i \(\Leftrightarrow\) having (in compound syntax)
  • nomatch = NA \(\Leftrightarrow\) outer join
  • nomatch = NULL \(\Leftrightarrow\) inner join
  • mult = "first"|"last" \(\Leftrightarrow\) N/A because SQL is inherently unordered
  • roll = TRUE \(\Leftrightarrow\) N/A because SQL is inherently unordered

The general form is:
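As one concrete, hypothetical instance of a query built from these pieces (the table, its columns and the SQL in the comment are made up for illustration):

```r
library(data.table)

DT <- data.table(
  region = c("N", "N", "S", "S", "E"),
  year   = c(2020L, 2021L, 2020L, 2021L, 2021L),
  sales  = c(10, 20, 5, 40, 7)
)

# Roughly: SELECT region, SUM(sales) AS total FROM DT WHERE year = 2021
#          GROUP BY region HAVING SUM(sales) > 10 ORDER BY total DESC
res <- DT[year == 2021L, .(total = sum(sales)), by = region][
  total > 10][                # "having", as i of a compound query
  order(-total)]              # "order by", as i of a further compound query
```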

A key advantage of column vectors in R is that they are ordered , unlike SQL 2 . We can use ordered functions in data.table queries such as diff() and we can use any R function from any package, not just the functions that are defined in SQL. A disadvantage is that R objects must fit in memory, but with several R packages such as ff , bigmemory , mmap and indexing , this is changing.

  • DT[3] refers to the 3rd row , but DF[3] refers to the 3rd column
  • DT[3, ] == DT[3] , but DF[ , 3] == DF[3] (somewhat confusingly in data.frame, whereas data.table is consistent)
  • For this reason we say the comma is optional in DT , but not optional in DF
  • DT[[3]] == DF[, 3] == DF[[3]]
  • DT[i, ] , where i is a single integer, returns a single row, just like DF[i, ] , but unlike a matrix single-row subset which returns a vector.
  • DT[ , j] where j is a single integer returns a one-column data.table, unlike DF[, j] which returns a vector by default
  • DT[ , "colA"][[1]] == DF[ , "colA"] .
  • DT[ , colA] == DF[ , "colA"] (currently in data.table v1.9.8 but is about to change, see release notes)
  • DT[ , list(colA)] == DF[ , "colA", drop = FALSE]
  • DT[NA] returns 1 row of NA , but DF[NA] returns an entire copy of DF containing NA throughout. The symbol NA is type logical in R and is therefore recycled by [.data.frame . The user’s intention was probably DF[NA_integer_] . [.data.table diverts to this probable intention automatically, for convenience.
  • DT[c(TRUE, NA, FALSE)] treats the NA as FALSE , but DF[c(TRUE, NA, FALSE)] returns NA rows for each NA
  • DT[ColA == ColB] is simpler than DF[!is.na(ColA) & !is.na(ColB) & ColA == ColB, ]
  • data.frame(list(1:2, "k", 1:4)) creates 3 columns, data.table creates one list column.
  • check.names is by default TRUE in data.frame but FALSE in data.table, for convenience.
  • data.table has always set stringsAsFactors=FALSE by default. In R 4.0.0 (Apr 2020), data.frame ’s default was changed from TRUE to FALSE and there is no longer a difference in this regard; see stringsAsFactors, Kurt Hornik, Feb 2020 .
  • Atomic vectors in list columns are collapsed when printed using ", " in data.frame , but "," in data.table with a trailing comma after the 6th item to avoid accidental printing of large embedded objects.
  • Unlike data.frames a data.table cannot store rows with no columns, as rows are considered to be the children of columns: nrow(DF[, 0]) returns the number of rows, while nrow(DT[, 0]) always returns 0; but see issue #2422 .
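A few of the differences listed above can be checked with a small sketch. The table is made up, and the data.frame row subsets are written with an explicit comma (`DF[NA, ]`), which is an assumption about the intended comparison.

```r
library(data.table)

DF <- data.frame(a = 1:3, b = c("x", "y", "z"), c = c(1.5, 2.5, 3.5))
DT <- as.data.table(DF)

DT[3]    # 3rd row
DF[3]    # 3rd column
DT[, 1]  # one-column data.table
DF[, 1]  # vector

DT[NA]                    # a single row of NA
DF[NA, ]                  # NA recycled: as many NA rows as DF has rows
DT[c(TRUE, NA, FALSE)]    # NA treated as FALSE: 1 row
DF[c(TRUE, NA, FALSE), ]  # NA kept: 2 rows, one of them all NA
```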

In [.data.frame we very often set drop = FALSE . When we forget, bugs can arise in edge cases where single columns are selected and all of a sudden a vector is returned rather than a single column data.frame . In [.data.table we took the opportunity to make it consistent and dropped drop .

When a data.table is passed to a data.table-unaware package, that package is not concerned with any of these differences; it just works.

In this case j can be wrapped with invisible() ; e.g., DT[ , invisible(hist(colB)), by = colA] 3

So that data.table can inherit from data.frame without using ... . If we used ... then invalid argument names would not be caught.

The drop argument is never used by [.data.table . It is a placeholder for non-data.table-aware packages when they use the [.data.frame syntax directly on a data.table.

The prevailing row on or before the i row is the final row the binary search tests anyway. So roll = TRUE is essentially just a switch in the binary search C code to return that row.

This has changed in v1.8.3 to meet your expectations. Please upgrade.

The whole of DT is returned (now invisibly) so that compound syntax can work; e.g. , DT[i, done := TRUE][ , sum(done)] . The number of rows updated is returned when verbose is TRUE , either on a per-query basis or globally using options(datatable.verbose = TRUE) .
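A minimal sketch of the compound syntax, assuming a made-up table that already has a logical `done` column:

```r
library(data.table)

DT <- data.table(id = 1:5, value = c(3, 8, 1, 9, 4), done = FALSE)

# Update by reference, then query the updated table in the same statement
n_done <- DT[value > 3, done := TRUE][, sum(done)]
```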

R internally forces visibility on for [ . The value of FunTab’s eval column (see src/main/names.c ) for [ is 0 meaning “force R_Visible on” (see R-Internals section 1.6 ). Therefore, when we tried invisible() or setting R_Visible to 0 directly ourselves, eval in src/main/eval.c would force it on again.

To solve this problem, the key was to stop trying to stop the print method running after a := . Instead, inside := we now (from v1.8.3) set a global flag which the print method uses to know whether to actually print or not.

This is an unfortunate downside of getting #869 to work. If a := is used inside a function with no DT[] before the end of the function, then the next time DT is typed at the prompt, nothing will be printed. A repeated DT will print. To avoid this: include a DT[] after the last := in your function. If that is not possible (e.g., it’s not a function you can change) then print(DT) and DT[] at the prompt are guaranteed to print. As before, adding an extra [] on the end of a := query is a recommended idiom to update and then print; e.g., DT[, foo := 3L][].
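A sketch of the two idioms mentioned above. The function name `add_flag` and the column names are hypothetical, and returning the table invisibly is a design choice of this sketch rather than something the text prescribes.

```r
library(data.table)

# Hypothetical helper that updates its argument by reference
add_flag <- function(DT) {
  DT[, flag := TRUE]  # last := in the function
  DT[]                # trailing DT[] as recommended above
  invisible(DT)
}

DT <- data.table(a = 1:3)
add_flag(DT)

# Update and print in one statement
DT[, foo := 3L][]
```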

It was a temporary, last resort solution before rbind and cbind S3 method dispatch was fixed in R >= 4.0.0. Essentially, the issue was that data.table inherits from data.frame , and base::cbind and base::rbind (uniquely) do their own S3 dispatch internally as documented by ?cbind . The data.table workaround was adding one for loop to the start of each function directly in base . That modification was made dynamically, i.e. , the base definition of cbind.data.frame was fetched, the for loop added to the beginning, and then assigned back to base . This solution was designed to be robust to different definitions of base::cbind.data.frame in different versions of R, including unknown future changes. It worked well. The competing requirements were:

cbind(DT, DF) needs to work. Defining cbind.data.table didn’t work because base::cbind does its own S3 dispatch and required (before R 4.0.0) that the first cbind method for each object it is passed is identical . This is not true in cbind(DT, DF) because the first method for DT is cbind.data.table but the first method for DF is cbind.data.frame . base::cbind then fell through to its internal bind code which appears to treat DT as a regular list and returns very odd looking and unusable matrix output. See below . We cannot just advise users not to call cbind(DT, DF) because packages such as ggplot2 make such a call ( test 167.2 ).

This naturally led to trying to mask cbind.data.frame instead. Since a data.table is a data.frame , cbind would find the same method for both DT and DF . However, this didn’t work either because base::cbind appears to find methods in base first; i.e. , base::cbind.data.frame isn’t maskable.

Finally, we tried masking cbind itself (v1.6.5 and v1.6.6). This allowed cbind(DT, DF) to work, but introduced compatibility issues with package IRanges, since IRanges also masks cbind. It worked if IRanges was lower on the search() path than data.table, but if IRanges was higher, then data.table’s cbind would never be called and the strange-looking matrix output occurred again (see below).

Many thanks to the R core team for fixing the issue in Sep 2019. data.table v1.12.6+ no longer applies the workaround in R >= 4.0.0.

This comes up quite a lot but it’s really earth-shatteringly simple. A function such as merge is generic if it consists of a call to UseMethod . When you see people talking about whether or not functions are generic functions they are merely typing the function without () afterwards, looking at the program code inside it and if they see a call to UseMethod then it is generic . What does UseMethod do? It literally slaps the function name together with the class of the first argument, separated by period ( . ) and then calls that function, passing along the same arguments. It’s that simple. For example, merge(X, Y) contains a UseMethod call which means it then dispatches (i.e. calls) paste("merge", class(X), sep = ".") . Functions with dots in their name may or may not be methods. The dot is irrelevant really, other than dot being the separator that UseMethod uses. Knowing this background should now highlight why, for example, it is obvious to R folk that as.data.table.data.frame is the data.frame method for the as.data.table generic function. Further, it may help to elucidate that, yes, you are correct, it is not obvious from its name alone that ls.fit is not the fit method of the ls generic function. You only know that by typing ls (not ls() ) and observing it isn’t a single call to UseMethod .

You might now ask: where is this documented in R? Answer: it’s quite clear, but, you need to first know to look in ?UseMethod and that help file contains:

When a function calling UseMethod('fun') is applied to an object with class attribute c('first', 'second') , the system searches for a function called fun.first and, if it finds it, applies it to the object. If no such function is found a function called fun.second is tried. If no class name produces a suitable function, the function fun.default is used, if it exists, or an error results.

Happily, an internet search for “How does R method dispatch work” (at the time of this writing) returns the ?UseMethod help page in the top few links. Admittedly, other links rapidly descend into the intricacies of S3 vs S4, internal generics and so on.

However, features like basic S3 dispatch (pasting the function name together with the class name) are why some R folk love R. It’s so simple. No complicated registration or signature is required. There isn’t much to learn. To create the merge method for data.table all that was required, literally, was to create a function called merge.data.table.
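The dispatch rule described above can be sketched with a made-up generic. The name `describe` and its methods are hypothetical and exist only for this illustration.

```r
library(data.table)

# A generic is just a function whose body calls UseMethod()
describe <- function(x, ...) UseMethod("describe")

# Methods are ordinary functions named <generic>.<class>
describe.default    <- function(x, ...) "something else"
describe.data.frame <- function(x, ...) "a data.frame"
describe.data.table <- function(x, ...) "a data.table"

DT <- data.table(a = 1)
class(DT)                      # "data.table" "data.frame"
describe(DT)                   # first class wins: describe.data.table
describe(data.frame(a = 1))    # describe.data.frame
describe(1)                    # no class method found: describe.default
```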

3. Questions relating to compute time

Several reasons:

  • Only that column is grouped, the other 19 are ignored because data.table inspects the j expression and realises it doesn’t use the other columns.
  • One memory allocation is made for the largest group only, then that memory is re-used for the other groups. There is very little garbage to collect.
  • R is an in-memory column store; i.e., the columns are contiguous in RAM. Page fetches from RAM into L2 cache are minimised.

data.table uses radix sorting. This is significantly faster than other sort algorithms. See our presentations for more information, in particular from useR!2015 Denmark.

This is also one reason why setkey() is quick.

When no key is set, or we group in a different order from that of the key, we call it an ad hoc by .

Because each group is contiguous in RAM, which minimises page fetches, and because memory can be copied in bulk ( memcpy in C) rather than by looping in C.

Manual: ?setkey S.O.: What is the purpose of setting a key in data.table?

setkey(DT, col1, col2) orders the rows by column col1 then within each group of col1 it orders by col2 . This is a primary index . The row order is changed by reference in RAM. Subsequent joins and groups on those key columns then take advantage of the sort order for efficiency. (Imagine how difficult looking for a phone number in a printed telephone directory would be if it wasn’t sorted by surname then forename. That’s literally all setkey does. It sorts the rows by the columns you specify.) The index doesn’t use any RAM. It simply changes the row order in RAM and marks the key columns. Analogous to a clustered index in SQL.

However, you can only have one primary key because data can only be physically sorted in RAM in one way at a time. Choose the primary index to be the one you use most often (e.g. [id,date] ). Sometimes there isn’t an obvious choice for the primary key or you need to join and group many different columns in different orders. Enter a secondary index. This does use memory ( 4*nrow bytes regardless of the number of columns in the index) to store the order of the rows by the columns you specify, but doesn’t actually reorder the rows in RAM. Subsequent joins and groups take advantage of the secondary key’s order but need to hop via that index so aren’t as efficient as primary indexes. But still, a lot faster than a full vector scan. There is no limit to the number of secondary indexes since each one is just a different ordering vector. Typically you don’t need to create secondary indexes. They are created automatically and used for you automatically by using data.table normally; e.g. DT[someCol == someVal, ] and DT[someCol %in% someVals, ] will create, attach and then use the secondary index. This is faster in data.table than a vector scan so automatic indexing is on by default since there is no up-front penalty. There is an option to turn off automatic indexing; e.g. , if somehow many indexes are being created and even the relatively small amount of extra memory becomes too large.
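A hedged sketch of a primary key and a manually created secondary index, on a made-up table (the column names `id`, `date` and `grp` are assumptions). It uses `setindex()` explicitly rather than relying on automatic indexing.

```r
library(data.table)

DT <- data.table(
  id   = c(3L, 1L, 2L, 1L),
  date = as.IDate(c("2024-01-02", "2024-01-03", "2024-01-01", "2024-01-01")),
  grp  = c("b", "a", "b", "a")
)

# Primary key: physically reorders the rows by id, then date
setkey(DT, id, date)
key(DT)

# Secondary index: stores an ordering for grp without moving any rows
setindex(DT, grp)
indices(DT)

# A subset on the indexed column
a_rows <- DT[grp == "a"]
```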

We use the words index and key interchangeably.

4. Error messages

See above here and here .

This error is generated by DT[ , MySum = sum(v)] . DT[ , .(MySum = sum(v))] was intended, or DT[ , j = .(MySum = sum(v))] .
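A short sketch of the two working forms, using a hypothetical table with a grouping column `g` and a value column `v`:

```r
library(data.table)

# Hypothetical data for illustration
DT <- data.table(g = c("a", "a", "b"), v = 1:3)

# Name the result inside .() so that it is part of j
res <- DT[, .(MySum = sum(v))]

# Equivalent, naming the j argument explicitly and grouping by g
by_g <- DT[, j = .(MySum = sum(v)), by = g]
```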

This error (and similar, e.g. , “ getCharCE must be called on a CHARSXP ”) may have nothing to do with character data or locale. Instead, this can be a symptom of an earlier memory corruption. To date these have been reproducible and fixed (quickly). Please report it to our issues tracker .

This occurs prior to v1.6.5, for rbind(DT, DF) too. Please upgrade to v1.6.7 or later.

.SD is locked by design. See ?data.table . If you’d like to manipulate .SD before using it, or returning it, and don’t wish to modify DT using := , then take a copy first (see ?copy ), e.g. ,
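A minimal sketch of that pattern, with hypothetical table and column names:

```r
library(data.table)

# Hypothetical data for illustration
DT <- data.table(g = c("a", "a", "b"), v = 1:3)

res <- DT[, {
  tmp <- copy(.SD)      # unlocked copy of this group's subset
  tmp[, v2 := v * 2L]   # modifying the copy is allowed
  tmp                   # return the modified copy for this group
}, by = g]

names(DT)   # DT itself is unchanged
```

The copy costs memory for each group, so this is best kept for cases where modifying `DT` directly with `:=` is not acceptable.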

Please upgrade to v1.8.1 or later. From this version, if .N is returned by j it is renamed to N to avoid any ambiguity in any subsequent grouping between the .N special variable and a column called ".N" .

The old behaviour can be reproduced by forcing .N to be called .N , like this:

If you are already running v1.8.1 or later then the error message is now more helpful than the “cannot change value of locked binding” error, as you can see above, since this vignette was produced using v1.8.1 or later.

The more natural syntax now works:

5. Warning messages

This warning was present in v1.6.5 and v1.6.6 only, when loading the package. The motivation was to allow cbind(DT, DF) to work, but as it transpired, this broke (full) compatibility with package IRanges . Please upgrade to v1.6.7 or later.

Hopefully, this is self-explanatory. The full message is:

Coerced numeric RHS to integer to match the column’s type; may have truncated precision. Either change the column to numeric first by creating a new numeric vector length 5 (nrows of entire table) yourself and assigning that (i.e. ‘replace’ column), or coerce RHS to integer yourself (e.g. 1L or as.integer) to make your intent clear (and for speed). Or, set the column type correctly up front when you create the table and stick to it, please.

To generate it, try:
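A minimal sketch, assuming a hypothetical one-column integer table; the exact wording of the warning varies between data.table versions, so the message is collected rather than matched:

```r
library(data.table)

# Hypothetical integer column
DT <- data.table(a = 1:5)

# Assign a fractional double into the integer column and record any warning
msgs <- character()
invisible(withCallingHandlers(
  DT[2L, a := 1.5],
  warning = function(w) {
    msgs <<- c(msgs, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
))

class(DT$a)        # the column keeps its integer type

# Making the intent explicit with an integer RHS avoids the coercion
DT[3L, a := 10L]
```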

*.RDS and *.RData are file types which can store in-memory R objects on disk efficiently. However, a data.table stored in such a binary file loses its column over-allocation. This isn’t a big deal: your data.table will be copied in memory on the next by-reference operation, which will throw a warning. Therefore it is recommended to call setalloccol() on each data.table loaded with readRDS() or load() calls.
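A sketch of that recommendation, using a temporary file and a hypothetical table:

```r
library(data.table)

# Hypothetical table written to a temporary .rds file
DT <- data.table(a = 1:3)
path <- tempfile(fileext = ".rds")
saveRDS(DT, path)

# Reload, then restore the over-allocated column slots
DT2 <- readRDS(path)
setalloccol(DT2)

# Adding a column by reference afterwards
DT2[, b := a * 2L]
```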

6. General questions about the package

That is correct. v1.3 was available on R-Forge only. There were several large changes internally and these took some time to test in development.

Not currently.

  • A few core parts of the package are written in C and use internal R functions and R structures.
  • The package uses lexical scoping which is one of the differences between R and S-plus explained by R FAQ 3.3.1

Yes, for both 32-bit and 64-bit on all platforms. Thanks to CRAN. There are no special or OS-specific libraries used.

Please file suggestions, bug reports and enhancement requests on our issues tracker . This helps make the package better.

Please do star the package on GitHub . This helps encourage the developers and helps other R users find the package.

You can submit pull requests to change the code and/or documentation yourself; see our Contribution Guidelines .

We add all articles we know about (whether positive or negative) to the Articles page. All pages in the project’s wiki on GitHub are open-access with no modify restrictions. Feel free to write an article, link to a negative one someone else wrote that you found, or add a new page to our wiki to collect your criticisms. Please make it constructive so we have a chance to improve.

Please see the support guide on the project’s homepage which contains up-to-date links.

The homepage contains links to the archives in several formats.

Sure. You’re more likely to get a faster answer from the Issues page or Stack Overflow, though. Further, asking publicly in those places helps build the general knowledge base.

Please see this answer .

Here we mean either the merge method for data.table or the merge method for data.frame since both methods work in the same way in this respect. See ?merge.data.table and below for more information about method dispatch.

It may be a surprise to learn that select top 10 * from ... does not reliably return the same rows over time in SQL. You do need to include an order by clause, or use a clustered index to guarantee row order; i.e. , SQL is inherently unordered.

e.g. , hist() returns the breakpoints in addition to plotting to the graphics device.


How to Aggregate Multiple Columns in data.table in R

In this article, we will discuss how to aggregate multiple columns of a data.table in the R programming language.

A data.table contains values that may be either duplicated or unique. Rows can therefore be divided into groups according to the values of one or more columns, and the values of the other columns can then be summed within each group.

Method 1 : Using list() method

The standard data.table indexing syntax can be used to group and aggregate the data in a data.table. The “by” argument divides the data by the column names supplied inside list(). A new column holding the per-group sum of an existing column can then be added using := together with sum().

Syntax: 

df[ , new-col-name:=sum(reqd-col-name), by = list(grouping columns)] 
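As a minimal sketch of this syntax, with a hypothetical table whose columns `grp` and `x` are made up for illustration:

```r
library(data.table)

# Hypothetical data for illustration
DT <- data.table(grp = c("a", "b", "a", "b"), x = c(1, 2, 3, 4))

# Add the per-group sum as a new column, by reference (one value per row)
DT[, x_sum := sum(x), by = list(grp)]

# Return one row per group instead of adding a column
agg <- DT[, list(x_sum = sum(x)), by = list(grp)]
```

The first form keeps every row and repeats the group total; the second returns a new table with one row per group.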
       

Method 2 : Using lapply()

The data.table package must be installed and loaded into the working space. The lapply() function can then be applied inside the data.table call to aggregate multiple columns by group. lapply() returns a list of the same length as its input, where each element is the result of applying the function FUN to the corresponding input element.

Syntax: lapply(obj, FUN, …)

Parameters :

  • obj – a vector (atomic or list) or an expression object.
  • FUN – the function to be applied to each element.

The .SD symbol refers to the subset of data for each group, which makes it convenient for computing summary statistics over a larger set of columns. Here sum is applied to every column in .SD, so the values falling within each group are added up column by column. The ‘by’ argument is equivalent to ‘group by’ in SQL when performing aggregation.

lapply(.SD, sum)
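A short sketch of the full call, assuming a hypothetical table with a grouping column `grp` and two numeric columns `x` and `y`:

```r
library(data.table)

# Hypothetical data for illustration
DT <- data.table(grp = c("a", "b", "a"), x = 1:3, y = c(10, 20, 30))

# Sum the columns listed in .SDcols within each group
res <- DT[, lapply(.SD, sum), by = grp, .SDcols = c("x", "y")]
```

`.SDcols` restricts `.SD` to the named columns; without it, every column not used in `by` is summed.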
     

Method 3: Using aggregate method

The aggregate() function in R is used to produce summary statistics for one or more variables in a data frame or a data.table.

Syntax: aggregate(sum_var ~ group_var, data = df, FUN = sum)

Parameters :

  • sum_var – the columns to compute sums for
  • group_var – the columns to group the data by
  • data – the data frame to use
  • FUN – the function to apply

When several columns are to be aggregated, cbind() is used on the left-hand side of the formula to combine them. With FUN set to sum, the result contains the sum of each of those columns within every group.
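A minimal sketch of the cbind() form, using a hypothetical data frame with columns `grp`, `x` and `y`:

```r
# Hypothetical data for illustration
df <- data.frame(grp = c("a", "b", "a"),
                 x = c(1, 2, 3),
                 y = c(10, 20, 30))

# Sum x and y within each level of grp
res <- aggregate(cbind(x, y) ~ grp, data = df, FUN = sum)
```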


StatAnalytica

Mastering R Append List: Techniques and Examples


Lists in R are helpful because they can store different types of data together, like numbers, text, and even other lists. This makes them great for working with complex data in tasks like data analysis. Sometimes, you may need to add new items to a list, such as extra data or new results.

In this article, we’ll focus on how to use R Append List techniques to do just that. We’ll explain simple ways to append items to a list in R, show when to use each method, and provide easy examples to help you understand and apply these techniques.

What is a List in R?


In R, a list is a tool for storing different types of data together. Unlike vectors or matrices, which can only hold one kind of data, a list can include numbers, text, vectors, and even other lists.

Here’s what you should know about lists:

  • Variety : Lists can hold many types of data at once. For example, you can have a list with a number, a word, and another list all in one.
  • Names for Items : You can name the items in your list, which makes them easier to find and use.
  • Accessing Items : To get items from a list, you use double square brackets [[]]. You can also use the names you gave to find specific items.

# Creating a list

my_list <- list(name = "Alice", age = 25, scores = c(90, 85, 88))

# Accessing elements

my_list$name      # Gives "Alice"

my_list$scores    # Gives c(90, 85, 88)

Lists are useful when you need to store different types of information together or work with complex data.

R Append List: 10 Easy Methods to Append List Items in R

Here are ten easy methods you can use to append items to a list in R.

1. Using the c() Function

Explanation : The c() function lets you combine lists, so it’s a simple way to add new items to the end of your existing list.

# Starting list

my_list <- list(name = "John", age = 30)

# Adding new items

my_list <- c(my_list, list(favorite_fruits = c("Apple", "Banana", "Cherry")))

# Result

$name
[1] "John"

$age
[1] 30

$favorite_fruits
[1] "Apple"  "Banana" "Cherry"

2. Using the append() Function

Explanation : The append() function adds new items to the end of your list. You can also choose where to add items if needed.

# Starting list

my_list <- list(name = "Sarah", age = 25)

# Adding new items at the end

my_list <- append(my_list, list(hobbies = c("Reading", "Hiking", "Cooking")))

# Result

$name
[1] "Sarah"

$age
[1] 25

$hobbies
[1] "Reading" "Hiking"  "Cooking"

3. Using Indexing

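Explanation : You can add new items by putting them in a new position in your list. Using length(my_list) + 1 makes sure your new item goes to the end. Note that in this example the new item is itself a list, so it is stored as a single unnamed element that contains pets.

# Starting list

my_list <- list(name = "Mike", age = 40)

# Adding a new item at the end

my_list[[length(my_list) + 1]] <- list(pets = c("Dog", "Cat"))

# Result

$name
[1] "Mike"

$age
[1] 40

[[3]]
[[3]]$pets
[1] "Dog" "Cat"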

4. Using the list() Function

Explanation : Create a new list with the items you want to add, and then combine it with your original list using c().

# Starting list

my_list <- list(name = "Emily", age = 35)

# New items to add

new_items <- list(favorite_colors = c("Red", "Blue", "Green"))

# Combining the lists

my_list <- c(my_list, new_items)

# Result

$name
[1] "Emily"

$age
[1] 35

$favorite_colors
[1] "Red"   "Blue"  "Green"

5. Using modifyList()

Explanation : The modifyList() function updates your list with new items. It’s a good way to merge or add elements.

# Starting list

my_list <- list(name = "Laura", age = 28)

# New items to add

additional_items <- list(favorite_books = c("1984", "To Kill a Mockingbird"))

# Updating the list

my_list <- modifyList(my_list, additional_items)

# Result

$name
[1] "Laura"

$age
[1] 28

$favorite_books
[1] "1984"                  "To Kill a Mockingbird"

6. Using the <<- Operator

Explanation : The <<- operator changes a variable in a parent environment, which is useful for updating lists from different places in your code.

# Starting list in the global environment

my_list <<- list(name = "David", age = 50)

# Adding new items

my_list <<- c(my_list, list(countries_visited = c("France", "Japan", "Brazil")))

# Result

$name
[1] "David"

$age
[1] 50

$countries_visited
[1] "France" "Japan"  "Brazil"

7. Using rbindlist() from the data.table Package

Explanation : If your list contains data frames, rbindlist() can stack them into one table, which is also a way to add new rows. This is great for managing tables of data. Note that the result is a single data.table rather than a list.

library(data.table)

# Starting list with a data frame

my_list <- list(data.frame(name = "Anna", age = 22))

# Adding a new data frame

new_df <- data.frame(name = "Mark", age = 29)

my_list <- rbindlist(c(my_list, list(new_df)))

# Result (the exact print layout depends on the data.table version)

   name age
1: Anna  22
2: Mark  29

8. Using c() with Named Elements

Explanation : When you add new items, you can use c() with names to keep your list organized and clear.

# Starting list

my_list <- list(name = "Tina", age = 33)

# Adding named items

my_list <- c(my_list, list(favorite_songs = c("Song A", "Song B"), city = "Chicago"))

# Result

$name
[1] "Tina"

$age
[1] 33

$favorite_songs
[1] "Song A" "Song B"

$city
[1] "Chicago"

9. Using lapply() for Multiple Appends

Explanation : Use lapply() to add multiple items to your list in one step. lapply() applies a function to each name and returns a list of results, which is then assigned to the matching named positions of your list.

# Starting list

my_list <- list(name = "Rachel", age = 27)

# Items to add

items_to_add <- list(hobbies = c("Yoga", "Photography"), favorite_foods = c("Pizza", "Pasta"))

# Adding items: look up each item by name and assign all of them at once

my_list[names(items_to_add)] <- lapply(names(items_to_add), function(x) {

  items_to_add[[x]]

})

# Result

$name
[1] "Rachel"

$age
[1] 27

$hobbies
[1] "Yoga"        "Photography"

$favorite_foods
[1] "Pizza" "Pasta"

10. Using data.frame() for Structured Lists

Explanation : If your list contains data frames, you can use data.frame() to add new rows or columns. It’s useful for managing structured data.

# Starting data frame list

my_list <- list(data.frame(name = "Chris", age = 45))

# Adding a new row

new_row <- data.frame(name = "Ella", age = 34)

my_list <- list(rbind(my_list[[1]], new_row))

# Result

[[1]]
   name age
1 Chris  45
2  Ella  34

These explanations and examples help you understand how to add items to lists in R simply and straightforwardly.

How to Prevent Common Errors When Appending to Lists in R

1. Mixing Up c() and append()

  • Mistake : Using c() when you need to add items at a specific place in your list.
  • How to Avoid : Use c() to add items to the end of the list. If you need to put items in a certain spot, use append().
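As a brief sketch of the difference (the list contents are hypothetical), the `after` argument of append() controls where the new items land:

```r
# Hypothetical starting list
my_list <- list(name = "Ana", age = 30)

# c() always adds at the end
end_added <- c(my_list, list(city = "Lima"))

# append() with after = 1 inserts right after the first element
mid_added <- append(my_list, list(city = "Lima"), after = 1)

names(end_added)
names(mid_added)
```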

2. Flattening the List

  • Mistake : Accidentally changing your list’s shape when using c() or append(). For example, adding a plain vector spreads its values out as separate list elements.
  • How to Avoid : Check the list after adding items. If you want a vector to stay together as one item, wrap it in list() before adding it.
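A small sketch of the flattening effect, using made-up values:

```r
# Hypothetical starting list
my_list <- list(name = "Ana")

# Adding a bare vector: each number becomes its own list element
flat <- c(my_list, c(1, 2, 3))

# Wrapping the vector in list() keeps it together as one element
kept <- c(my_list, list(scores = c(1, 2, 3)))

length(flat)
length(kept)
```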

3. Replacing Existing Items

  • Mistake : Changing items you already have in the list when you just want to add new ones.
  • How to Avoid : Be careful where you add new items. Make sure you add them without changing the existing items.

4. Misusing modifyList()

  • Mistake : Expecting modifyList() to only append. When the new list shares names with the old one, the matching items are replaced, and items set to NULL are removed.
  • How to Avoid : Use modifyList() when you want to update items by name; items with new names are still added. To leave existing items untouched, use c() or append().
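A short sketch of how modifyList() treats matching names, new names and NULL (the values are hypothetical):

```r
# Hypothetical starting list
base <- list(name = "Ana", age = 30)

# A matching name (age) is replaced; a new name (city) is added
updated <- modifyList(base, list(age = 31, city = "Lima"))

# Setting an element to NULL removes it
removed <- modifyList(base, list(age = NULL))

names(updated)
names(removed)
```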

5. Forgetting to Install or Load Packages

  • Mistake : Trying to use functions from packages like data.table without installing or loading them first.
  • How to Avoid : Install the package with install.packages(“data.table”) and load it with library(data.table) before using its functions.

6. Ignoring Data Types

  • Mistake : Adding different types of data to your list without checking if they mix well.
  • How to Avoid : Make sure the data types you’re adding work well together or convert them if needed.

7. Assuming Functions Are the Same

  • Mistake : Thinking that all functions like c(), append(), and modifyList() work the same way.
  • How to Avoid : Know what each function does. Use c() for quick additions, append() for specific spots, and modifyList() for updating lists with similar names.

Final Words

Knowing how to add items to lists in R is really useful for working with data. You can use c() to add things to the end of your list quickly, append() to put items in a specific spot, and modifyList() to update lists with matching names.

Be careful to avoid common mistakes, like using the wrong function or mixing up data types. By following the tips we discussed, you’ll manage your lists better and avoid errors.

Keep practicing with these methods, and you’ll become comfortable appending to lists in R. Happy coding!

What should I do if my list has different types of data?

R lists can handle different types of data, but make sure the new data you add fits well with the existing ones. Managing data types carefully helps avoid problems.

What if I forget to install or load a package like data.table?

If you try to use a function from a package that isn’t installed or loaded, R will give you an error. To fix this, install the package with install.packages(“data.table”) and load it with library(data.table).

Can I use indexing to add items to a specific place in a list?

Yes, you can use indexing (like list[[index]] <- new_item) to put items in a specific spot. Just remember, this will replace the item at that spot. If you want to add without replacing, use append().



R Data.Table Multiple Assignment IfElse

I have 'data1' and wish for 'data2' by using fifelse from data.table. The rule is, if Group = 'C', then columns 'A' and 'B' equal to zero and column 'C' equals to -99.

bvowe's user avatar

3 Answers 3

Using conditional replacement in data.table ,

Note that the final dataframe is data1 ( := updates by reference), not data2
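A hedged sketch of this approach, assuming a made-up data1 with integer columns A, B and C (the question's actual data is not reproduced here):

```r
library(data.table)

# Hypothetical stand-in for data1
data1 <- data.table(Group = c("A", "B", "C", "C"),
                    A = 1:4, B = 5:8, C = 9:12)

# For rows where Group is "C", update three columns at once, by reference
data1[Group == "C", `:=`(A = 0L, B = 0L, C = -99L)]
```

Only the rows selected in `i` are touched, so no `fifelse()` call is needed for this rule.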

linog's user avatar

Here, is one option with data.table

Or another option is set
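A sketch of what a set()-based version might look like, again with a made-up data1:

```r
library(data.table)

# Hypothetical stand-in for data1
data1 <- data.table(Group = c("A", "B", "C", "C"),
                    A = 1:4, B = 5:8, C = 9:12)

# set() needs integer row positions in i
rows <- which(data1$Group == "C")

# Zero out A and B in a loop, then set C, all by reference
for (col in c("A", "B")) set(data1, i = rows, j = col, value = 0L)
set(data1, i = rows, j = "C", value = -99L)
```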

In base R , we can do
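One possible base R version, sketched with a made-up data frame standing in for the question's data:

```r
# Hypothetical stand-in for the input data
data2 <- data.frame(Group = c("A", "B", "C", "C"),
                    A = 1:4, B = 5:8, C = 9:12)

# Logical index of the rows to change
is_c <- data2$Group == "C"

# Replace A and B together, then C
data2[is_c, c("A", "B")] <- 0L
data2$C[is_c] <- -99L
```

Unlike the data.table versions, this modifies a copy of the data frame rather than updating it by reference.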

akrun's user avatar

Here is another approach using dplyr and rbind

sachin2014's user avatar


r data.table multiple column assignment

IMAGES

  1. Summarize Multiple Columns of data.table by Group in R (Example)

  2. data.table in R

  3. [R Beginners]: Spread one column values to multiple columns in R

  4. Add Multiple New Columns to data.table in R

  5. R : How to group data.table by multiple columns?

  6. Add Row & Column to data.table in R (4 Examples)

VIDEO

  1. Drop Multiple Columns from Data Frame Using dplyr Package in R (Example)

  2. R Tutorial

  3. Das große R Data Table Tutorial Teil 1: Einleitung & Installation (Deutsch)

  4. Column Properties in CSS

  5. Merge Two Matrices by Columns in R (2 Examples)

  6. Convert row names to column in R

COMMENTS

  1. Assign multiple columns using := in data.table, by group

    Nov 29, 2016 at 17:49. 1. @PanFrancisco Without by=a it will work, but return a different answer. The mean(a) and sum(a) aggregates are being recycled within each group when by=a. Without by=a it just sticks the mean and sum for the entire column into each cell (i.e. different numbers). - Matt Dowle.

  2. r

    I am trying to figure out an elegant way to use := assignment to replace many columns at once in a data.table by applying a shared function. A typical use of this might be to apply a string function (e.g., gsub) to all character columns in a table.It is not difficult to extend the data.frame way of doing this to a data.table, but I'm looking for a method consistent with the data.table way of ...

  3. How to Group data.table by Multiple Columns in R

    by Zach Bobbitt May 10, 2024. Often you may want to group the rows of a data.table in R based on multiple columns. You can use the following basic syntax to do so: dt[, mean_points:=mean(points), by=c('team', 'position')] This particular example will group the rows of the data.table named dt by the team and position columns and then create a ...

  4. Assignment by reference

    Assignment by reference. Fast add, remove and update subsets of columns, by reference. := operator can be used in two ways: LHS := RHS form, and Functional form. See Usage. set is a low-overhead loop-able version of :=. It is particularly useful for repetitively updating rows of certain columns by reference (using a for-loop).

  5. Advanced tips and tricks with data.table

    Fast looping with set. Using shift for to lead/lag vectors and lists. Create multiple columns with := in one statement. Assign a column with := named with a character object. 2. BY. Calculate a function over a group (using by) excluding each entity in a second category. METHOD 1: in-line. METHOD 2: using {} and .SD.

  6. Introduction to data.table

    Data analysis using data.table. Data manipulation operations such as subset, group, update, join, etc. are all inherently related.Keeping these related operations together allows for:. concise and consistent syntax irrespective of the set of operations you would like to perform to achieve your end goal.. performing analysis fluidly without the cognitive burden of having to map each operation ...

  7. Applying Function to Multiple Columns of data.table in R (4 Examples)

    Example 1: Calculating the Sum Values of Multiple Variables. In this example, we want to use lapply to apply the sum() function to multiple data.table columns. We do this by use of .SD and .SDcols. .SD serves as a placeholder for each of the columns which we put in .SDcols. iris_DT[ , lapply(.SD, sum), .

  8. Add Multiple New Columns to data.table in R (Example)

    Example: Add Multiple Columns to data.table Using `:=`. This example illustrates how to add several vectors as new variables to a data.table object in only one function call. Have a look at the following line of R code: data[ , `:=`(y1 = y1, y2 = y2)] # Add columns to data.table. data # Print updated data.table.

  9. data.table in R

    setDT(df) The difference between the two approaches is: data.table(df) function will create a copy of df and convert it to a data.table. Whereas, setDT(df) converts it to a data.table inplace. That means, the df itself gets converted to a data.table and you don't have to assign it to a different object.

  10. Add Multiple New Columns to data.table in R

    Method 1: Using :=. A column can be added to an existing data table using the := operator. Syntax: `:=`(col1 = value1, col2 = value2). Here := is a single operator that assigns the given values to the named columns by reference.

  11. Group data.table by Multiple Columns in R (Example)

    Example: Group Data Table by Multiple Columns Using list() Function. The following syntax illustrates how to group our data table based on multiple columns. Have a look at the R code below: data_grouped <- data # Duplicate data table. data_grouped[ , sum := sum(value), by = list(gr1, gr2)] # Add grouped column.

  12. assign : Assignment by reference

    Advanced (internals): It is easy to see how sub-assigning to existing columns is done internally. Removing columns by reference is also straightforward by modifying the vector of column pointers only (using memmove in C). However adding (new) columns is more tricky as to how the data.table can be grown by reference: the list vector of column ...

  13. Advanced Tips and Tricks with data.table

    Tips and tricks learned along the way 1. DATA STRUCTURES & ASSIGNMENT => Columns of lists => Suppressing intermediate output with {} => Fast looping with set => Using shift for to lead/lag vectors and lists => Create multiple columns with := in one statement => Assign a column with := named with a character object 2. BY => Calculate a function over a group (using by) excluding each entity in a ...

  14. Column assignment and reference semantics in {data.table}

    The table above defines the different syntax required to do column assignment in data tables (DT) and frames (df).type indicates object type: frame or table.. name indicates whether the column(s) to assign are literally written in the code (col_name), or if the names are stored in a variable (col_names_list).. columns indicates whether only one or multiple (one or more) columns can be assigned ...

  15. Frequently Asked Questions about data.table

    data.frame(list(1:2, "k", 1:4)) creates 3 columns, data.table creates one list column. check.names is by default TRUE in data.frame but FALSE in data.table, for convenience. data.table has always set stringsAsFactors=FALSE by default.

  16. Group data.table by Multiple Columns in R

    library(data.table) The column at a specified index can be extracted using the list subsetting, i.e. [, operator. The new column can be added in the second argument assigned to a predefined or a user-defined function defined over a set of columns of data.table. The by argument can be added to group the data using a set of columns from the data ...

  17. How to Add Multiple Columns to Data Frame in R

    Method 2: Add Multiple Columns to data.table Object. library (data.table) df[ , ':='(new_col1 = new_col1, new_col2 = new_col2, new_col3 = new_col3)] The following examples show how to use each method in practice. Example 1: Add Multiple Columns to data.frame Object. Suppose we have the following data frame in R:

  18. How to Aggregate multiple columns in Data.table in R

    Method 2 : Using lapply () The data.table library can be installed and loaded into the working space. The lapply () method can then be applied over this data.table object, to aggregate multiple columns using a group. The lapply () method is used to return an object of the same length as that of the input list.

  19. Mastering R Append List: Techniques and Examples

    2. Flattening the List. Mistake: Accidentally changing your list's shape when using c() or append().; How to Avoid: Check the list after adding items.If you want to keep the list's shape, use functions that handle lists correctly. 3. Replacing Existing Items

  20. Assign value to specific data.table columns and rows

    R data table: Assign a value to column based on reference column. 1. data.table: Assign a column value where the column is specified by another column ... Find and delete files from unix directory of multiple patterns Not a cross, not a word (number crossword) Does an airplane fly less or more efficiently after an mid-flight engine failure ...

  21. R Data.Table Multiple Assignment IfElse

    R Data.Table Multiple Assignment IfElse. Ask Question Asked 4 years, 5 months ago. Modified 4 years, 5 months ago. Viewed 273 times ... Assigning multiple columns in data.table() with conditional function. 3. R: Assigning variable in data.table using logical statement inside "()" in ifelse function ...