I'm trying to import a data frame that has a household ID (not unique) and an individual ID (unique). The data frame is in a .csv file that I have not modified since downloading it.
The unique individual ID is usually just the household ID (a long sequence of digits) followed by another number denoting the rank of the individual. For instance, if the household ID is 11587492056, individual 1 from this household will have 1158749205601, individual 2 from this household will have 1158749205602, etc. This is stated in the documentation.
When I import the .csv file using base read.csv (I didn't even write code; I just used the import command in the RStudio menu), here are the first three observations:
dput(readcsv2010[1:3, 1:2])
structure(list(IDENTHH = c(258001990644611232, 258001990644611232,
258001990644611232), IDENTIND = c(25800199064461123584, 25800199064461123584,
25800199064461123584)), row.names = c(NA, 3L), class = "data.frame")
As you can see, the first three rows of column IDENTIND are all the same when they should be unique. However, when I use readr, I get the following data frame:
dput(readr2010[1:3, 1:2])
structure(list(IDENTHH = c("00258001990644611237", "00258001990644611237",
"00258001990644611237"), IDENTIND = c("0025800199064461123701",
"0025800199064461123702", "0025800199064461123703")), row.names = c(NA,
-3L), class = c("tbl_df", "tbl", "data.frame"))
Now I have the right ID structure (each IDENTIND is the concatenation of IDENTHH and one unique number), but I somehow have leading 0s everywhere.
It seems readr, although imperfect, is the way to go here. But can someone explain to me in detail why such a discrepancy appears? Does it have to do with my data frame, or with the functions?
I think @Roland's answer is correct: you need to tell read.csv not to convert the IDs to numeric, but to leave them as character:
readcsv2010 <- read.csv("input.csv", colClasses = "character")
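To illustrate the difference, here is a minimal sketch that reads the same three rows twice. It assumes an inline string (`csv_text`, a hypothetical stand-in for the real file, passed through the `text` argument) holding the values shown in the question's readr output; the variable names are made up for the example.

```r
# Inline stand-in for the real file; values copied from the question's readr output.
csv_text <- "IDENTHH,IDENTIND
00258001990644611237,0025800199064461123701
00258001990644611237,0025800199064461123702
00258001990644611237,0025800199064461123703"

# Default behaviour: columns that look like numbers are converted to doubles.
df_numeric <- read.csv(text = csv_text)

# Keep every column as text, so no digit is lost.
df_character <- read.csv(text = csv_text, colClasses = "character")

# Count distinct individual IDs under each import.
n_unique_numeric <- length(unique(df_numeric$IDENTIND))
n_unique_character <- length(unique(df_character$IDENTIND))

# Optional: drop the leading zeros while staying in character form.
ids_without_zeros <- sub("^0+", "", df_character$IDENTIND)
```

With the default import the three distinct IDs collapse to a single value, while the character import keeps all three. The leading zeros appear to be part of the file itself, so if they are unwanted they can be stripped as text, as in the last line, without going through a numeric type.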
read.csv by default reads them as numeric data (and there is limited precision in floating-point numbers). – Roland Commented Jan 8 at 14:11
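The precision limit mentioned in the comment can be checked directly. The sketch below assumes the usual IEEE 754 double with a 53-bit significand, which is what R uses on common platforms; the two ID strings are taken from the question with their leading zeros removed, and the variable names are made up for the example.

```r
# Number of significand bits in a double (53 on IEEE 754 platforms).
significand_bits <- .Machine$double.digits

# Above 2^53, consecutive integers are no longer all representable.
collides_above_limit <- (2^53 == 2^53 + 1)

# Two different IDs from the question, about 2.58e19, far above 2^53 (about 9.0e15).
id_a <- as.numeric("25800199064461123701")
id_b <- as.numeric("25800199064461123702")
ids_collide <- (id_a == id_b)

# As character strings they stay distinct.
strings_differ <- ("25800199064461123701" != "25800199064461123702")
```

A double holds roughly 15 to 16 significant decimal digits, so the 18-digit household IDs and 20-digit individual IDs here cannot survive a numeric import unchanged.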