Name repair for duplicated columns inconsistent between read_csv and spec_csv
#1387
Comments
NB: I dug through some readr code and it seems to me that
if (anyDuplicated(col_names)) {
  dups <- duplicated(col_names)
  old_names <- col_names
  col_names <- make.unique(col_names, sep = "_")
  warning(
    "Duplicated column names deduplicated: ",
    paste0(
      encodeString(old_names[dups], quote = "'"),
      " => ",
      encodeString(col_names[dups], quote = "'"),
      " [", which(dups), "]",
      collapse = ", "
    ),
    call. = FALSE
  )
}
And It seems to me that the code to make unique names for duplicated column names is not correct and should follow the same pattern as vroom/vroom_ or vice versa. Edit: read_csv uses |
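For illustration only, the two naming schemes discussed in this thread can be reproduced outside readr. This is a minimal sketch: the `make.unique()` call is the one quoted in the snippet above, while the use of `vctrs::vec_as_names()` with `repair = "unique"` to obtain the `...N` names is an assumption about what the vroom-based path does, not something confirmed in this thread. The vector `nms` is a made-up header.

```r
# Hypothetical header containing a duplicated column name
nms <- c("a", "error", "b", "error", "c", "error")

# Edition-1 style: only the second and later duplicates get a "_<k>" suffix
first_ed <- make.unique(nms, sep = "_")
# a, error, b, error_1, c, error_2

# "unique" repair (assumed to match the edition-2 / vroom names):
# every duplicated name gets a "...<position>" suffix, including the first
second_ed <- vctrs::vec_as_names(nms, repair = "unique", quiet = TRUE)
# a, error...2, b, error...4, c, error...6

# The two schemes disagree on every duplicated column
first_ed != second_ed
```

The point of the comparison is that a spec built with one scheme cannot be matched by name against data read with the other.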
Sometimes I know the duplicated columns exist, but I just want to ignore them. However, with the current default name_repair, which produces names like "num...1", I cannot ignore them; I have to deal with them. |
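One possible way to leave duplicated names alone, offered as a sketch rather than an endorsed fix: `read_csv()` documents a `name_repair` argument, and `"minimal"` is described there as performing no repair beyond checking that names exist. The input string below is hypothetical, and the assumption that duplicated names pass through unchanged in edition 2 should be checked against the installed readr version.

```r
library(readr)

# Hypothetical input with a duplicated "num" column
value <- I("num,num\n1,2")

# Assumption: "minimal" leaves the duplicated names as they are
dat_min <- read_csv(value, name_repair = "minimal", show_col_types = FALSE)
names(dat_min)

# A function is also accepted; this one mimics the edition-1 "_<k>" suffixes
dat_1e <- read_csv(
  value,
  name_repair = function(nms) make.unique(nms, sep = "_"),
  show_col_types = FALSE
)
names(dat_1e)
```

With duplicated names kept, selecting a column by name is ambiguous, so positional access is the safer choice afterwards.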
Note to self: Right now Lines 441 to 458 in 85cf1e8
The fact that we force
value <- I("a,error,b,error,c,error\n1,string1,2,string2,3,string3")
# note that edition 1 actually guesses column types
readr::with_edition(1, readr::read_csv(value, n_max = 0, guess_max = 1000))
#> Warning: Duplicated column names deduplicated: 'error' => 'error_1' [4], 'error'
#> => 'error_2' [6]
#> # A tibble: 0 × 6
#> # … with 6 variables: a <dbl>, error <chr>, b <dbl>, error_1 <chr>, c <dbl>,
#> # error_2 <chr>
#> # ℹ Use `colnames()` to see all variable names
# while edition 2 returns all characters
readr::read_csv(value, n_max = 0, guess_max = 1000)
#> New names:
#> Rows: 0 Columns: 6
#> ── Column specification
#> ──────────────────────────────────────────────────────── Delimiter: "," chr
#> (6): a, error...2, b, error...4, c, error...6
#> ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
#> Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> • `error` -> `error...2`
#> • `error` -> `error...4`
#> • `error` -> `error...6`
#> # A tibble: 0 × 6
#> # … with 6 variables: a <chr>, error...2 <chr>, b <chr>, error...4 <chr>,
#> # c <chr>, error...6 <chr>
#> # ℹ Use `colnames()` to see all variable names
Created on 2022-08-23 by the reprex package (v2.0.1.9000)
There are three options for moving forward:
|
It seems like a stopgap readr 2e / vroom version of
(BTW various things seem wonky with this printed output)
library(readr)
value <- I("a,error,b,error,c,error\n1,string1,2,string2,3,string3")
dat <- read_csv(value)
#> New names:
#> Rows: 1 Columns: 6
#> ── Column specification
#> ──────────────────────────────────────────────────────── Delimiter: "," chr
#> (3): error...2, error...4, error...6 dbl (3): a, b, c
#> ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
#> Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> • `error` -> `error...2`
#> • `error` -> `error...4`
#> • `error` -> `error...6`
spec(dat)
#> cols(
#> a = col_double(),
#> error...2 = col_character(),
#> b = col_double(),
#> error...4 = col_character(),
#> c = col_double(),
#> error...6 = col_character()
#> )
readr 2e / vroom column names: a, error...2, b, error...4, c, error...6
What happens if we just ask for the spec from the same input?
spec_csv(value)
#> Warning: Duplicated column names deduplicated: 'error' => 'error_1' [4], 'error'
#> => 'error_2' [6]
#> cols(
#> a = col_double(),
#> error = col_character(),
#> b = col_double(),
#> error_1 = col_character(),
#> c = col_double(),
#> error_2 = col_character()
#> )
Problem: we get readr 1e-style column names: a, error, b, error_1, c, error_2
It does seem reasonable to expect
A simple fix would be to call
spec_csv_vroom <- function(..., guess_max = 1000) {
  tmp <- readr::read_csv(..., n_max = guess_max, guess_max = guess_max)
  spec(tmp)
}
spec_csv_vroom(value)
#> New names:
#> Rows: 1 Columns: 6
#> ── Column specification
#> ──────────────────────────────────────────────────────── Delimiter: "," chr
#> (3): error...2, error...4, error...6 dbl (3): a, b, c
#> ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
#> Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> • `error` -> `error...2`
#> • `error` -> `error...4`
#> • `error` -> `error...6`
#> cols(
#> a = col_double(),
#> error...2 = col_character(),
#> b = col_double(),
#> error...4 = col_character(),
#> c = col_double(),
#> error...6 = col_character()
#> )
Is there a reason not to do something along these lines? |
Performance? I guess? i.e. this commit from Jim talks about vroom not supporting guessing without parsing But since |
Same as @DavisVaughan, it would solve the problem especially in the long term for removing readr edition 1 code, but having |
Yeah, the only thing I could think of was performance (or elegance). But I think returning the wrong column names is a much bigger sin than the downsides of a simple "readr doesn't have to parse to guess col types but vroom does" seems a bit misleading. readr is definitely consulting (tokenizing and typing) up to
How so? It seems like the point of |
I know that |
But you can't guess column types without reading from the file. |
When trying to create column specifications for a file that contains duplicated variable names, spec_csv() renames the variables differently (e.g., "error_1") than read_csv() does ("error...1").
Let test.csv be a simple CSV with duplicated variables "error" and one observation: