R/doc_to_txt.R
doc_to_txt.Rd
This used to be a thin wrapper around textreadr::read_document() that also writes the result to output, doing its best to correctly write UTF-8 (based on the approach recommended in this blog post). However, textreadr was archived from CRAN. The function now directly wraps the functions that textreadr wrapped: pdftools::pdf_text() and striprtf::read_rtf(). It uses xml2 to import .docx and .odt files and rvest to import .html files, using code from the textreadr package.
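As a rough illustration of the dispatch described above, the sketch below picks a reader based on the file extension. The function name read_by_extension is hypothetical and is not part of rock; only the .pdf and .rtf branches are shown, and they assume the pdftools and striprtf packages are installed. The real function may structure this differently.

```r
# Hypothetical helper (not part of rock): choose a reader by file extension.
read_by_extension <- function(input) {
  # Lower-case the extension so that "FILE.PDF" and "file.pdf" match
  ext <- tolower(tools::file_ext(input))
  switch(
    ext,
    # pdf_text() returns one character string per page
    pdf = pdftools::pdf_text(input),
    # read_rtf() returns a character vector with the plain text
    rtf = striprtf::read_rtf(input),
    # Unnamed last argument is the fallback for any other extension
    stop("No reader sketched for extension '", ext, "'")
  )
}
```

Calling the helper with an extension it does not know fails early with an informative error, rather than silently returning nothing.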
The path to the input file.
The path and filename to write to. If this is a path to an existing directory (without a filename specified), the input filename will be used, and the extension will be replaced with newExt.
The encoding to use when writing the text file.
The extension to append: only used if output = NULL and newExt is not NULL, in which case the output will be written to a file with the same name as input but with newExt as extension.
Whether to prevent overwriting existing files.
Whether to be silent or chatty.
The converted source, as a character vector.
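The interplay between output and newExt described in the arguments can be sketched in base R as follows. The helper derive_output_path is hypothetical and not part of rock; whether rock expects newExt with or without a leading dot is an assumption here, so the sketch accepts both.

```r
# Hypothetical helper (not part of rock): decide where the text is written.
derive_output_path <- function(input, output = NULL, newExt = NULL) {
  # An explicit output path always wins
  if (!is.null(output)) {
    return(output)
  }
  # Without output and without newExt, nothing is written
  if (is.null(newExt)) {
    return(NULL)
  }
  # Same name as input, with the extension swapped for newExt;
  # a leading dot in newExt is dropped so "txt" and ".txt" behave the same
  paste0(tools::file_path_sans_ext(input), ".", sub("^\\.", "", newExt))
}

derive_output_path("transcripts/interview.docx", newExt = "txt")
#> [1] "transcripts/interview.txt"
```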
### This example requires the {xml2} package
if (requireNamespace("xml2", quietly = TRUE)) {
  print(
    rock::doc_to_txt(
      input = system.file(
        "extdata/doc-to-test.docx", package = "rock"
      )
    )
  )
}
#> [1] "This is a word document."
#> [2] "It doesn’t have much fancy content, but it’s 12kb large nonetheless."
#> [3] "Because some people use Word to transcribe, it can be useful to import Word files."
#> [4] "Note that this does mean you’ll lose markup."
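The output above contains non-ASCII characters (curly apostrophes), which is where the encoding used for writing matters. The sketch below shows one common base R way to write a character vector as UTF-8 regardless of the session's native encoding. The helper name write_utf8 is hypothetical, and this is not necessarily how doc_to_txt writes its output.

```r
# Hypothetical helper (not part of rock): write text to disk as UTF-8.
write_utf8 <- function(text, path) {
  # Open the connection without re-encoding on output
  con <- file(path, open = "w", encoding = "native.enc")
  on.exit(close(con))
  # Convert to UTF-8 first, then write the bytes unchanged
  writeLines(enc2utf8(text), con, useBytes = TRUE)
  invisible(path)
}

tmp <- tempfile(fileext = ".txt")
write_utf8("It doesn\u2019t have much fancy content.", tmp)
# Read the file back, marking the strings as UTF-8
readLines(tmp, encoding = "UTF-8")
```

Using useBytes = TRUE keeps writeLines() from translating the strings back to the native encoding, which could otherwise lose characters on systems whose locale is not UTF-8.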