
Convert a document (.docx, .pdf, .odt, .rtf, or .html) to a plain text file
Source: R/doc_to_txt.R
This function used to be a thin wrapper around textreadr::read_document() that also
wrote the result to output, doing its best to write UTF-8 correctly
(based on the approach recommended in this blog post). However,
textreadr was archived from CRAN, so the function now directly wraps the functions
that textreadr wrapped: pdftools::pdf_text() and striprtf::read_rtf().
It uses xml2 to import .docx and .odt files and rvest to import
.html files, using the code from the textreadr package.
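The UTF-8 writing step mentioned above can be sketched in base R. This is an illustration under assumptions, not the code rock uses: write_utf8 is a hypothetical helper, and the pattern shown (enc2utf8() combined with writeLines(useBytes = TRUE) while the encoding option is set to "native.enc") is one commonly recommended approach that may or may not match the blog post referred to.

```r
# Hypothetical helper (not part of rock): write a character vector as UTF-8.
write_utf8 <- function(text, path) {
  # Temporarily stop connections from re-encoding; restore the option on exit
  oldOpts <- options(encoding = "native.enc")
  on.exit(options(oldOpts), add = TRUE)
  # Convert to UTF-8, then write the bytes unchanged
  writeLines(enc2utf8(text), con = path, useBytes = TRUE)
  invisible(path)
}

txt <- c("It doesn\u2019t have much fancy content.", "Caf\u00e9")
tmp <- tempfile(fileext = ".txt")
write_utf8(txt, tmp)

# Read the file back, marking the strings as UTF-8, and compare
readBack <- readLines(tmp, encoding = "UTF-8")
identical(readBack, txt)
```

Setting useBytes = TRUE keeps writeLines() from translating the already-converted strings back to the native encoding, which is where non-ASCII characters such as curly apostrophes are typically lost on Windows.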
Arguments
- input
The path to the input file.
- output
The path and filename to write to. If this is a path to an existing directory (without a filename specified), the
input filename will be used, and the extension will be replaced with extension.
- encoding
The encoding to use when writing the text file.
- newExt
The extension to append: only used if
output = NULL and newExt is not NULL, in which case the output will be written to a file with the same name as input but with newExt as extension.
- preventOverwriting
Whether to prevent overwriting existing files.
- silent
Whether to be silent or chatty.
Examples
### This example requires the {xml2} package
if (requireNamespace("xml2", quietly = TRUE)) {
  print(
    rock::doc_to_txt(
      input = system.file(
        "extdata/doc-to-test.docx", package="rock"
      )
    )
  );
}
#> [1] "This is a word document."
#> [2] "It doesn’t have much fancy content, but it’s 12kb large nonetheless."
#> [3] "Because some people use Word to transcribe, it can be useful to import Word files."
#> [4] "Note that this does mean you’ll lose markup."
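The output argument described above can be exercised in the same way. The following is a minimal sketch, assuming that passing a file path as output writes the converted text to that path and that silent = TRUE suppresses messages; the name outFile is introduced here only for illustration, and the block does nothing if rock or xml2 is not installed.

```r
if (requireNamespace("rock", quietly = TRUE) &&
    requireNamespace("xml2", quietly = TRUE)) {
  # Temporary output path; it does not exist yet, so nothing is overwritten
  outFile <- tempfile(fileext = ".txt")
  rock::doc_to_txt(
    input = system.file("extdata/doc-to-test.docx", package = "rock"),
    output = outFile,
    silent = TRUE
  )
  # Inspect what was written (assumes the file was written as UTF-8)
  print(readLines(outFile, encoding = "UTF-8"))
}
```

Using tempfile() keeps the example from touching the working directory, so the preventOverwriting argument can stay at its default.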