Using R Quarto, how can I ensure PDF documents have text that can be accurately copied and pasted? - Stack Overflow

If I render the following Quarto document to PDF, the copy-and-paste behavior depends on the program used to view the resulting document.

I'm using pdflatex here because I regularly generate 200 PDFs with render_quarto, and it runs much quicker than xelatex. In addition, xelatex gives me trouble when using a watermark with the 'background' LaTeX package.

---
title: "Untitled"
format: 
  pdf:
    pdf-engine: pdflatex
---

## Quarto

```{r}
my_pipeline <- 
  mtcars |> 
  summary()
```

When I copy this code from the RStudio viewer and paste it, I get the following. Where does the 1 come from? How can I preserve the spacing and line breaks?

my_pipeline <-mtcars |>summary()1

When I view the PDF in MS Edge and copy and paste the code, I get the following. This one does a better job of preserving the line breaks, but it refuses to capture the dash in <-!

 my_pipeline <
mtcars |>
 summary()

asked Jan 30 at 19:09 by kputschko
  • If you are planning on sticking with the PDF format, you might have to navigate dealing with a TeX-based solution. I found this link on a Quarto discussion similar to what you are experiencing. – Quinton.Quagliano Commented Jan 30 at 19:23
  • Unfortunately, my org is tied to PDF documents. Thanks for the links. Sounds like using a dedicated PDF viewer like Sumatra might be the best recommendation. – kputschko Commented Jan 30 at 20:12
  • I don't have any problematic datasets stored in PDF format. We mostly use the PDF files for code documentation, where copying and pasting lines of "echoed" lines of code is helpful. – kputschko Commented Jan 30 at 21:26
  • 'We mostly use the PDF files for code documentation, where copying and pasting lines of "echoed" lines of code is helpful.' I don't know details but that sounds vaguely like a bad approach. I suspect the actual solution would be to improve your process. You shouldn't need to copy code from PDF files. – Roland Commented Jan 31 at 6:08
  • @Roland, unfortunately our org doesn't allow sharing of HTML documents. I'm open to other suggestions! I just figured if the choice is between PDF and DOCX, PDF would be less problematic. – kputschko Commented Jan 31 at 13:57

1 Answer

I would not recommend SumatraPDF (which I support) as the best tool for this task, since it suffers from all the same problems as any other PDF reader. A PDF can have poorly defined fonts, and it contains no mechanical layout data, so there are:

  • no form feeds
  • no line feeds
  • no tabs

Even indents are not physical, just binary offset values.

Text extraction has been removed as a function because it is too variable. However, you can use better programmable extractors on the page, such as here, but you will need to write custom command-line code using XPDF, Balabolka, or Poppler.
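
As a rough illustration of the command-line route, here is a minimal Python sketch that shells out to Poppler's pdftotext with the -layout flag, which asks it to keep the physical layout of the page as far as it can (indentation is approximated with spaces). The helper names build_pdftotext_cmd and extract_text are hypothetical, and the sketch assumes pdftotext is installed and on PATH. It cannot recover characters that the PDF's fonts do not map back to text, so a dash that is missing on copy may be missing here too. The same command could be run from R with system2().

```python
import shutil
import subprocess


def build_pdftotext_cmd(pdf_path, first_page=None, last_page=None):
    """Build the argument list for Poppler's pdftotext (hypothetical helper)."""
    # -layout keeps the physical layout; -enc UTF-8 sets the output encoding
    cmd = ["pdftotext", "-layout", "-enc", "UTF-8"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]  # first page to convert
    if last_page is not None:
        cmd += ["-l", str(last_page)]  # last page to convert
    # "-" as the output file name sends the extracted text to stdout
    cmd += [pdf_path, "-"]
    return cmd


def extract_text(pdf_path, first_page=None, last_page=None):
    """Run pdftotext and return the extracted text as a string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext (Poppler) was not found on PATH")
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path, first_page, last_page),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8")


# Example use, assuming a rendered file named report.pdf exists:
# print(extract_text("report.pdf", first_page=1, last_page=1))
```

Limiting the page range keeps the output small when only one code block is needed; whether the result is cleaner than a viewer's copy and paste will depend on how the PDF was produced.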
