
Convert a document (.docx, .pdf, .odt, .rtf, or .html) to a plain text file
Source: R/doc_to_txt.R
doc_to_txt.Rd
This function used to be a thin wrapper around textreadr::read_document() that also wrote the result to output, doing its best to correctly write UTF-8 (based on the approach recommended in this blog post). However, textreadr was archived from CRAN. The function now directly wraps the functions that textreadr wrapped: pdftools::pdf_text() and striprtf::read_rtf(). It uses xml2 to import .docx and .odt files and rvest to import .html files, using code from the textreadr package.
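The mapping from file type to importing package described above can be sketched in a few lines of base R. This is only an illustration: reader_for() is a hypothetical helper, not part of rock, and it returns the name of the reader instead of calling it, so it runs without any of those packages installed.

```r
# Hypothetical helper (not part of rock): report which package or function
# the description above names for a given input file.
reader_for <- function(path) {
  ext <- tolower(tools::file_ext(path))
  switch(
    ext,
    pdf  = "pdftools::pdf_text",
    rtf  = "striprtf::read_rtf",
    docx = ,                      # falls through to the next entry
    odt  = "xml2",
    html = "rvest",
    stop("Unsupported extension: '", ext, "'")
  )
}

reader_for("interview.docx")
reader_for("transcript.pdf")
```

The lookup is keyed on the lower-cased extension, so "notes.PDF" and "notes.pdf" are treated alike; whether rock itself is case-insensitive here is an assumption of this sketch.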
Arguments
- input
The path to the input file.
- output
The path and filename to write to. If this is a path to an existing directory (without a filename specified), the input filename will be used, and the extension will be replaced with extension.
- encoding
The encoding to use when writing the text file.
- newExt
The extension to append: only used if output = NULL and newExt is not NULL, in which case the output will be written to a file with the same name as input but with newExt as extension.
- preventOverwriting
Whether to prevent overwriting existing files.
- silent
Whether to be silent or chatty.
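The rules for output and newExt can be easier to follow as code. The sketch below is one possible reading of the argument descriptions, not the package's actual implementation: resolve_output() is a hypothetical helper, and the "txt" fallback used when output is a directory and newExt is NULL is an assumption.

```r
# Hypothetical helper (not part of rock): work out where the text would be
# written, following the argument descriptions above.
resolve_output <- function(input, output = NULL, newExt = NULL) {
  base <- tools::file_path_sans_ext(basename(input))
  if (is.null(output)) {
    # Assumed: with neither output nor newExt, no file is written.
    if (is.null(newExt)) return(NULL)
    # Same name and location as the input, but with newExt as extension.
    return(file.path(dirname(input), paste0(base, ".", newExt)))
  }
  if (dir.exists(output)) {
    # Existing directory: reuse the input filename with a new extension.
    ext <- if (is.null(newExt)) "txt" else newExt  # "txt" is an assumption
    return(file.path(output, paste0(base, ".", ext)))
  }
  # Otherwise output is taken as the full path and filename.
  output
}

resolve_output("data/interview.docx", newExt = "txt")
resolve_output("data/interview.docx", output = tempdir())
```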
Examples
### This example requires the {xml2} package
if (requireNamespace("xml2", quietly = TRUE)) {
  print(
    rock::doc_to_txt(
      input = system.file(
        "extdata/doc-to-test.docx", package = "rock"
      )
    )
  );
}
#> [1] "This is a word document."
#> [2] "It doesn’t have much fancy content, but it’s 12kb large nonetheless."
#> [3] "Because some people use Word to transcribe, it can be useful to import Word files."
#> [4] "Note that this does mean you’ll lose markup."
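A variation on this example that also writes the converted text to disk might look as follows. It uses only the arguments documented above; the temporary output path is chosen purely for illustration, and the sketch assumes both {rock} and {xml2} are installed.

```r
### This sketch requires the {rock} and {xml2} packages
if (requireNamespace("rock", quietly = TRUE) &&
    requireNamespace("xml2", quietly = TRUE)) {

  # Illustrative output location; any writable path would do.
  out_file <- tempfile(fileext = ".txt")

  rock::doc_to_txt(
    input = system.file("extdata/doc-to-test.docx", package = "rock"),
    output = out_file,
    encoding = "UTF-8",
    preventOverwriting = TRUE,
    silent = TRUE
  )

  # Read the file back to inspect what was written.
  if (file.exists(out_file)) {
    print(readLines(out_file, encoding = "UTF-8"))
  }
}
```

Because tempfile() returns a path that does not exist yet, preventOverwriting = TRUE should not block the write here.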