Understanding the Changes in dplyr 0.7.5: select() Behavior

The dplyr package is a widely used tool for data manipulation in R, with functions to filter, sort, and transform datasets. New releases occasionally change how existing functions behave. In this article, we’ll look at one such change between dplyr 0.7.4 and 0.7.5: how select() treats a named character vector passed as the column specification.

Introduction to select()

The select() function is a crucial part of the dplyr package, allowing users to choose specific columns from a dataset. This function can be used in various contexts, such as data cleaning, data transformation, or data merging.

library(dplyr)
df <- data.frame(a = 1:5, b = 6:10, c = 11:15)
print(df)
#   a  b  c
# 1 1  6 11
# 2 2  7 12
# 3 3  8 13
# 4 4  9 14
# 5 5 10 15

# Using select() function
df_selected <- df %>% 
  select(a, b)
print(df_selected)
#   a  b
# 1 1  6
# 2 2  7
# 3 3  8
# 4 4  9
# 5 5 10
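
select() can also rename columns while selecting them, using the new_name = old_name form. This is useful background for the named-vector behavior discussed next. A minimal sketch on the same data frame (the names x and y are arbitrary choices for illustration):

```r
library(dplyr)

df <- data.frame(a = 1:5, b = 6:10, c = 11:15)

# new_name = old_name: keep columns a and b, but call them x and y
df_renamed <- df %>%
  select(x = a, y = b)

names(df_renamed)
# [1] "x" "y"
```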

Named Vectors and select() Behavior

In dplyr 0.7.4, when a named character vector was passed to select(), the resulting columns were named after the vector's values (the original column names), and the vector's names were ignored. For example:

# Run with dplyr 0.7.4
cols <- c(x = 'a', y = 'b', z = 'c')
df_selected <- df %>% 
  select(cols)
print(df_selected)
#   a  b  c
# 1 1  6 11
# 2 2  7 12
# 3 3  8 13
# 4 4  9 14
# 5 5 10 15

However, in dplyr 0.7.5, the same call names the resulting columns after the vector's names, so the selected columns are effectively renamed:

# Run with dplyr 0.7.5
cols <- c(x = 'a', y = 'b', z = 'c')
df_selected <- df %>% 
  select(cols)
print(df_selected)
#   x  y  z
# 1 1  6 11
# 2 2  7 12
# 3 3  8 13
# 4 4  9 14
# 5 5 10 15
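
If downstream code expects the original column names, one option is to strip the names from the vector before selecting. The following is a sketch rather than an official migration path; it assumes tidyselect 1.0.0 or later, where all_of() is available (an assumption, since that helper is newer than dplyr 0.7.5):

```r
library(dplyr)

df <- data.frame(a = 1:5, b = 6:10, c = 11:15)
cols <- c(x = "a", y = "b", z = "c")

# unname() drops x, y, z so only the column names "a", "b", "c" remain.
# all_of() is an assumption here: it needs tidyselect 1.0.0 or later;
# on older installs, select(df, unname(cols)) is the closest equivalent.
df_original_names <- df %>%
  select(all_of(unname(cols)))

names(df_original_names)
# [1] "a" "b" "c"
```

Dropping the names makes the result independent of how a given version treats named vectors.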

Why This Change?

The change in behavior is attributed to tidyselect, the package that implements the selection logic behind select(). With the updated tidyselect, the vars_select() function handles named vectors more robustly and keeps their names instead of dropping them.

library(tidyselect)
# select.data.frame() is an unexported dplyr method, hence the ':::'
dplyr:::select.data.frame(mtcars, c(a = "mpg", b = "disp"))
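
To see what the selection step itself produces, vars_select() can be called directly on a character vector of column names. This is a minimal sketch that assumes an installed tidyselect that still exports vars_select() and has the newer behavior described here; on such a version the result is a character vector whose values are the matched columns and whose names are the output names.

```r
library(tidyselect)

# Select from a plain character vector of column names
sel <- vars_select(names(mtcars), c(a = "mpg", b = "disp"))

names(sel)    # output names, taken from the names of the vector
unname(sel)   # columns that were matched in mtcars
```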

Unlike the old behavior, tidyselect no longer drops the names attached to quoted (character) column names. Instead, it evaluates the selection expressions with the overscope_eval_next() function from the rlang package.

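library(rlang)
# Excerpt from the tidyselect internals, not runnable on its own:
# ind_list and vars are local to the selection code.
# map_if() comes from purrr; match_var() is not defined here.
ind_list <- map_if(ind_list, is_character, match_var, table = vars)
eval_tidy(ind_list[[1]])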

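The new vars_c() function in tidyselect also plays a crucial role in resolving this issue. It is an internal (unexported) function, so it has to be accessed with ':::':

# Internal function: it may not exist in every tidyselect version
tidyselect:::vars_c(a = "mpg", b = "disp")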

Impact and Conclusion

This change in behavior may seem minor at first, but it can break code that refers to columns by their original names after a select() call, because those columns now carry the names of the vector instead. The selection logic in tidyselect treats the names of a named vector as the new column names, so named vectors are handled consistently.

When working with dplyr 0.7.5 or later, it’s essential to be aware of this change and adjust your code accordingly. By understanding how tidyselect handles column selection, you can write more effective and reliable code for data manipulation tasks.

library(dplyr)
df <- data.frame(a = 1:5, b = 6:10, c = 11:15)

# Using select() with named vector
cols <- c(x = 'a', y = 'b', z = 'c')
df_selected <- df %>% 
  select(cols)

# Printing the result
print(df_selected)
#   x  y  z
# 1 1  6 11
# 2 2  7 12
# 3 3  8 13
# 4 4  9 14
# 5 5 10 15

library(tidyselect)
# Passing the named vector inline behaves the same way: columns are renamed
df_selected <- df %>%
  select(c(x = "a", y = "b"))

# Printing the result
print(df_selected)
#   x  y
# 1 1  6
# 2 2  7
# 3 3  8
# 4 4  9
# 5 5 10
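
Because the result depends on the installed versions, it can help to check them and to state the intent explicitly. The sketch below assumes tidyselect 1.0.0 or later for all_of() (an assumption, not something the versions discussed above provide). It uses rename(), which keeps every column and applies the new names, so the renaming is deliberate rather than a side effect of select().

```r
library(dplyr)

df <- data.frame(a = 1:5, b = 6:10, c = 11:15)
cols <- c(x = "a", y = "b")

# Versions that determine how named vectors are treated
packageVersion("dplyr")
packageVersion("tidyselect")

# Explicit rename: a -> x, b -> y; column c is kept unchanged
df_renamed <- df %>%
  rename(all_of(cols))

names(df_renamed)
# [1] "x" "y" "c"
```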

Last modified on 2024-09-19