I'm trying to parallelise with openxlsx
and its function writeData
the export of many Excel individual files that should be somehow summarised in a central Excel file.
As can be shown in the reprex below I first create the central Excel file before creating the parallel processes and then I create the individual files in their respective parallel clusters.
The problem is that nothing (i.e. no 'summary', here simply the Row
counter) is written in the central Excel file, although I exported its associated workbook object to the parallel processes.
lapply(c("openxlsx", "parallel"), library, character.only = TRUE)
NameOutputFolder <- "Output"
NameOutputSubfolder <- "Individual files"
OutputFolder <- file.path(".", NameOutputFolder)
if(!dir.exists(OutputFolder)) dir.create(OutputFolder)
OutputSubfolder <- file.path(".", NameOutputFolder, NameOutputSubfolder)
if(!dir.exists(OutputSubfolder)) dir.create(OutputSubfolder)
OutputFile_Central <- file.path(OutputFolder, "Excel_Central.xlsx")
Workbook_Central <- createWorkbook()
addWorksheet(wb = Workbook_Central, sheetName = "Summary", zoom = 80, gridLines = FALSE)
no_cores <- detectCores()
print(paste0("Configuring parallelisation (", no_cores, " cores found) and setting up clusters"))
MyCluster <- makePSOCKcluster(no_cores - 1)
clusterEvalQ(MyCluster, {
library(openxlsx)
})
clusterExport(MyCluster, c("OutputSubfolder", "Workbook_Central"))
parLapply(cl = MyCluster, X = 1:10, fun = function(Row){
OutputFile_Individual <- file.path(OutputSubfolder, paste0("Excel_Individual_", Row, ".xlsx"))
Workbook_Individual <- createWorkbook()
writeData(wb = Workbook_Central, sheet = "Summary", x = Row, startCol = 1, startRow = Row)
saveWorkbook(wb = Workbook_Individual, file = OutputFile_Individual, overwrite = TRUE)
})
stopCluster(MyCluster)
saveWorkbook(wb = Workbook_Central, file = OutputFile_Central, overwrite = TRUE)
I guess the solution would be to create temporary central Excel files in each parallel process (indexed by Sys.getpid()
) and then to merge them in the unique central Excel file after the parallel code has run, right?
No other solution?
I'm trying to parallelise with openxlsx
and its function writeData
the export of many Excel individual files that should be somehow summarised in a central Excel file.
As can be shown in the reprex below I first create the central Excel file before creating the parallel processes and then I create the individual files in their respective parallel clusters.
The problem is that nothing (i.e. no 'summary', here simply the Row
counter) is written in the central Excel file, although I exported its associated workbook object to the parallel processes.
lapply(c("openxlsx", "parallel"), library, character.only = TRUE)
NameOutputFolder <- "Output"
NameOutputSubfolder <- "Individual files"
OutputFolder <- file.path(".", NameOutputFolder)
if(!dir.exists(OutputFolder)) dir.create(OutputFolder)
OutputSubfolder <- file.path(".", NameOutputFolder, NameOutputSubfolder)
if(!dir.exists(OutputSubfolder)) dir.create(OutputSubfolder)
OutputFile_Central <- file.path(OutputFolder, "Excel_Central.xlsx")
Workbook_Central <- createWorkbook()
addWorksheet(wb = Workbook_Central, sheetName = "Summary", zoom = 80, gridLines = FALSE)
no_cores <- detectCores()
print(paste0("Configuring parallelisation (", no_cores, " cores found) and setting up clusters"))
MyCluster <- makePSOCKcluster(no_cores - 1)
clusterEvalQ(MyCluster, {
library(openxlsx)
})
clusterExport(MyCluster, c("OutputSubfolder", "Workbook_Central"))
parLapply(cl = MyCluster, X = 1:10, fun = function(Row){
OutputFile_Individual <- file.path(OutputSubfolder, paste0("Excel_Individual_", Row, ".xlsx"))
Workbook_Individual <- createWorkbook()
writeData(wb = Workbook_Central, sheet = "Summary", x = Row, startCol = 1, startRow = Row)
saveWorkbook(wb = Workbook_Individual, file = OutputFile_Individual, overwrite = TRUE)
})
stopCluster(MyCluster)
saveWorkbook(wb = Workbook_Central, file = OutputFile_Central, overwrite = TRUE)
I guess the solution would be to create temporary central Excel files in each parallel process (indexed by Sys.getpid()
) and then to merge them in the unique central Excel file after the parallel code has run, right?
No other solution?
With clusterExport(MyCluster, c("OutputSubfolder", "Workbook_Central"))
you are copying master's OutputSubfolder
& Workbook_Central
objects to node processes, all changes are just local. And not just local to the function environment, those object live in separate R processes.
You might want to return summary data (Row
in this example) from parLapply
ed function and handle Workbook_Central
in your master process when working with a parLapply()
result. This would also align better with common lapply()
usage pattern -- stuff goes in, transformed stuff with the same length comes out, no (or minimal) side effects from applied function.
lapply(c("openxlsx", "parallel"), library, character.only = TRUE)
NameOutputFolder <- "Output"
NameOutputSubfolder <- "Individual files"
OutputFolder <- file.path(".", NameOutputFolder)
if(!dir.exists(OutputFolder)) dir.create(OutputFolder)
OutputSubfolder <- file.path(".", NameOutputFolder, NameOutputSubfolder)
if(!dir.exists(OutputSubfolder)) dir.create(OutputSubfolder)
OutputFile_Central <- file.path(OutputFolder, "Excel_Central.xlsx")
no_cores <- detectCores()
print(paste0("Configuring parallelisation (", no_cores, " cores found) and setting up clusters"))
#> [1] "Configuring parallelisation (8 cores found) and setting up clusters"
MyCluster <- makePSOCKcluster(no_cores - 1)
clusterEvalQ(MyCluster, {
library(openxlsx)
})
clusterExport(MyCluster, c("OutputSubfolder"))
# collect summary data into resulting_list
resulting_list <- parLapply(cl = MyCluster, X = 1:10, fun = function(Row){
OutputFile_Individual <- file.path(OutputSubfolder, paste0("Excel_Individual_", Row, ".xlsx"))
Workbook_Individual <- createWorkbook()
saveWorkbook(wb = Workbook_Individual, file = OutputFile_Individual, overwrite = TRUE)
# return summary (a scalar, vector, list, data.frame, ...):
Row
})
stopCluster(MyCluster)
# structure of resulting_list
str(resulting_list)
#> List of 10
#> $ : int 1
#> $ : int 2
#> $ : int 3
#> $ : int 4
#> $ : int 5
#> $ : int 6
#> $ : int 7
#> $ : int 8
#> $ : int 9
#> $ : int 10
# transform it for writeData(), writing a list would store it at A1:J1
(summary_data <- unlist(resulting_list))
#> [1] 1 2 3 4 5 6 7 8 9 10
# create & save Workbook_Central
Workbook_Central <- createWorkbook()
addWorksheet(wb = Workbook_Central, sheetName = "Summary", zoom = 80, gridLines = FALSE)
writeData(wb = Workbook_Central, sheet = "Summary", x = summary_data, startCol = 1, startRow = 1)
saveWorkbook(wb = Workbook_Central, file = OutputFile_Central, overwrite = TRUE)
Resulting files:
fs::dir_tree(OutputFolder)
#> ./Output
#> ├── Excel_Central.xlsx
#> └── Individual files
#> ├── Excel_Individual_1.xlsx
#> ├── Excel_Individual_10.xlsx
#> ├── Excel_Individual_2.xlsx
#> ├── Excel_Individual_3.xlsx
#> ├── Excel_Individual_4.xlsx
#> ├── Excel_Individual_5.xlsx
#> ├── Excel_Individual_6.xlsx
#> ├── Excel_Individual_7.xlsx
#> ├── Excel_Individual_8.xlsx
#> └── Excel_Individual_9.xlsx