r - read.csv and readr::read_delim yield different ID variables. Why? - Stack Overflow


I'm trying to import a data frame that contains a household ID (not unique) and an individual ID (unique). The data is in a .csv file that I have not modified since downloading it.

The unique individual ID is usually just the household ID (a long sequence of digits) followed by a number denoting the rank of the individual. For instance, if the household ID is 11587492056, individual 1 from this household will have 1158749205601, individual 2 will have 1158749205602, and so on. This is how the documentation describes it.

When I import the .csv file using base read.csv (I didn't write any code, I just used the import command in the RStudio menu), here are the first three observations:

print(readcsv2010[1:3, 1:2])

structure(list(IDENTHH = c(258001990644611232, 258001990644611232, 
258001990644611232), IDENTIND = c(25800199064461123584, 25800199064461123584, 
25800199064461123584)), row.names = c(NA, 3L), class = "data.frame")

As you can see, the first three rows of column IDENTIND are all identical when they should be unique. However, when I use the readr command, I get the following data frame:

print(readr2010[1:3, 1:2])

structure(list(IDENTHH = c("00258001990644611237", "00258001990644611237", 
"00258001990644611237"), IDENTIND = c("0025800199064461123701", 
"0025800199064461123702", "0025800199064461123703")), row.names = c(NA, 
-3L), class = c("tbl_df", "tbl", "data.frame"))

Now I have the right ID structure (each IDENTIND is the concatenation of IDENTHH and a unique two-digit number), but I somehow have leading zeros everywhere.

It seems that readr, although imperfect, is the way to go here. But can someone explain in detail why this discrepancy appears? Does it have to do with my data, or with the functions?

asked Jan 8 at 13:42 by Val
  • 7 To help us help you, it would be useful if you posted a sample of your csv file (just a few lines of the two columns of interest) and the command that RStudio issues (you can see it in the box in the lower-right corner of the dialog window). – Claudio Commented Jan 8 at 14:04
  • 4 I suspect the leading zeros are already in the file. Reading these IDs as character strings seems like the right choice here. By default, read.csv reads them as numeric data, and floating-point numbers have limited precision. – Roland Commented Jan 8 at 14:11
  • Roland is absolutely correct. For example, if you type 0025800199064461123201 in the console, you get 25800199064461123584. So I agree: as these are IDs and not numbers, reading them as character (or factor) is the way to go. – Godrim Commented Jan 8 at 14:31
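
To illustrate the precision limit raised in the comments, here is a minimal sketch. It assumes the IDs shown in the readr output are what the file actually contains; the variable names ids_chr and ids_num are made up for the example. R stores numeric values as doubles, which represent every integer exactly only up to 2^53, so three 20-digit IDs that differ in their last digit can collapse to a single value:

```r
# Doubles have a 53-bit significand: above 2^53, consecutive integers
# can no longer all be told apart.
(2^53 + 1) == 2^53   # TRUE: the +1 is lost

# IDs copied from the readr output in the question
ids_chr <- c("0025800199064461123701",
             "0025800199064461123702",
             "0025800199064461123703")

# Converting to numeric drops the leading zeros and rounds each value
# to the nearest representable double.
ids_num <- as.numeric(ids_chr)

length(unique(ids_chr))   # three distinct strings
length(unique(ids_num))   # the numeric versions are no longer distinct
```

Kept as strings, the IDs stay distinct and the leading zeros are preserved as part of the identifier.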

1 Answer

I think @Roland's comment is correct: you need to tell read.csv not to convert the IDs to numeric, but to leave them as character strings:

readcsv2010 <- read.csv("input.csv", colClasses = "character")
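
A small self-contained check of this, as a sketch: the inline csv_text below is a hypothetical stand-in for the real file, built from the values in the question and assuming the leading zeros are present in the file itself. It compares the default import with the colClasses = "character" import:

```r
# Hypothetical sample standing in for the real file (assumed layout)
csv_text <- "IDENTHH,IDENTIND
00258001990644611237,0025800199064461123701
00258001990644611237,0025800199064461123702
00258001990644611237,0025800199064461123703"

# Default import: both columns are converted to numeric (double)
as_numeric <- read.csv(text = csv_text)

# Import with every column kept as character
as_character <- read.csv(text = csv_text, colClasses = "character")

# With numeric IDs the individual IDs collide; with character IDs they do not
anyDuplicated(as_numeric$IDENTIND) > 0     # TRUE
anyDuplicated(as_character$IDENTIND) == 0  # TRUE

# The household ID can be recovered by dropping the last two characters
hh_from_ind <- substr(as_character$IDENTIND, 1,
                      nchar(as_character$IDENTIND) - 2)
all(hh_from_ind == as_character$IDENTHH)   # TRUE
```

If only some columns are IDs, colClasses also accepts a named vector such as c(IDENTHH = "character", IDENTIND = "character"), which leaves the remaining columns to be converted as usual.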
