
Cleaning & editing sources
Source: R/clean_source.R, R/clean_sources.R, R/search_and_replace_in_source.R, and 1 more
cleaning_sources.Rd
These functions can be used to 'clean' one or more sources or to perform search-and-replace tasks. Cleaning consists of two operations: splitting the source at utterance markers, and performing search-and-replace operations using regular expressions.
Usage
clean_source(
  input,
  output = NULL,
  replacementsPre = rock::opts$get("replacementsPre"),
  replacementsPost = rock::opts$get("replacementsPost"),
  extraReplacementsPre = NULL,
  extraReplacementsPost = NULL,
  removeNewlines = FALSE,
  removeTrailingNewlines = TRUE,
  rlWarn = rock::opts$get(rlWarn),
  utteranceSplits = rock::opts$get("utteranceSplits"),
  preventOverwriting = rock::opts$get("preventOverwriting"),
  encoding = rock::opts$get("encoding"),
  silent = rock::opts$get("silent")
)
clean_sources(
  input,
  output,
  outputPrefix = "",
  outputSuffix = "_cleaned",
  recursive = TRUE,
  filenameRegex = ".*",
  replacementsPre = rock::opts$get(replacementsPre),
  replacementsPost = rock::opts$get(replacementsPost),
  extraReplacementsPre = NULL,
  extraReplacementsPost = NULL,
  removeNewlines = FALSE,
  utteranceSplits = rock::opts$get(utteranceSplits),
  preventOverwriting = rock::opts$get(preventOverwriting),
  encoding = rock::opts$get(encoding),
  silent = rock::opts$get(silent)
)
search_and_replace_in_source(
  input,
  replacements = NULL,
  output = NULL,
  preventOverwriting = TRUE,
  encoding = "UTF-8",
  rlWarn = rock::opts$get(rlWarn),
  silent = FALSE
)
search_and_replace_in_sources(
  input,
  output,
  replacements = NULL,
  outputPrefix = "",
  outputSuffix = "_postReplacing",
  preventOverwriting = rock::opts$get("preventOverwriting"),
  recursive = TRUE,
  filenameRegex = ".*",
  encoding = rock::opts$get("encoding"),
  silent = rock::opts$get("silent")
)
Arguments
- input
For clean_source and search_and_replace_in_source, either a character vector containing the text of the relevant source or a path to a file that contains the source text; for clean_sources and search_and_replace_in_sources, a path to a directory that contains the sources to clean.
- output
For clean_source and search_and_replace_in_source, if not NULL, this is the name (and path) of the file in which to save the processed source (if it is NULL, the result will be returned visibly). For clean_sources and search_and_replace_in_sources, output is mandatory and is the path to the directory where to store the processed sources. This path will be created with a warning if it does not exist. An exception is if "same" is specified: in that case, every file will be written to the same directory it was read from.
- replacementsPre, replacementsPost
Each is a list of two-element vectors, where the first element in each vector contains a regular expression to search for in the source(s), and the second element contains the replacement (these are passed as perl regular expressions; see regex for more information). Instead of regular expressions, simple words or phrases can also be entered of course (since those are valid regular expressions). replacementsPre are executed before the utteranceSplits are applied; replacementsPost afterwards.
- extraReplacementsPre, extraReplacementsPost
To perform more replacements than the default set, these can be conveniently specified in extraReplacementsPre and extraReplacementsPost. This saves you from having to manually copy-paste the list of defaults to retain it.
- removeNewlines
Whether to remove all newline characters from the source before cleaning it. Be careful: if the source contains YAML fragments, these will also be affected, and will probably become invalid!
- removeTrailingNewlines
Whether to remove trailing newline characters (i.e. at the end of a character value in a character vector).
- rlWarn
Whether to let readLines() warn, e.g. if files do not end with a newline character.
- utteranceSplits
This is a vector of regular expressions that specify where to insert breaks between utterances in the source(s). Such breaks are specified using the utteranceMarker ROCK setting.
- preventOverwriting
Whether to prevent overwriting of output files.
- encoding
The encoding of the source(s).
- silent
Whether to suppress the warning about not editing the cleaned source.
- outputPrefix, outputSuffix
The prefix and suffix to add to the filenames when writing the processed files to disk.
- recursive
Whether to search all subdirectories as well (TRUE) or not.
- filenameRegex
A regular expression to match against located files; only files matching this regular expression are processed.
- replacements
The strings to search & replace, as a list of two-element vectors, where the first element in each vector contains a regular expression to search for in the source(s), and the second element contains the replacement (these are passed as perl regular expressions; see regex for more information). Instead of regular expressions, simple words or phrases can also be entered of course (since those are valid regular expressions).
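The "list of two-element vectors" format used by replacements, replacementsPre and replacementsPost can be illustrated with a minimal base R sketch. Note that apply_replacements is a hypothetical helper written for this illustration; it is not part of rock and may differ from how the package applies replacements internally. It only assumes that each pair is applied in order as a Perl-compatible regular expression.

```r
# Hypothetical helper (not part of rock): apply each c(pattern, replacement)
# pair in order, treating the pattern as a Perl-compatible regular expression.
apply_replacements <- function(x, replacements) {
  for (pair in replacements) {
    x <- gsub(pair[1], pair[2], x, perl = TRUE)
  }
  x
}

replacements <- list(
  c("\u2026", "..."),                  # a plain string is a valid pattern too
  c("\\.(\\s*)([a-z])", ".\\1\\U\\2")  # \\U upper-cases the captured letter
)

result <- apply_replacements("that depends\u2026 then. maybe", replacements)
print(result)
#> [1] "that depends... Then. Maybe"
```

Because the pairs are applied in sequence, later patterns see the output of earlier ones: here the ellipsis is first turned into three periods, and the capitalization rule then acts on the result.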
Details
The cleaning functions, when called with their default arguments, will do the following:
- Double periods (..) will be replaced with single periods (.)
- Four or more periods (.... or .....) will be replaced with three periods
- Three or more newline characters will be replaced by one newline character (which will become more, if the sentence before that character marks the end of an utterance)
- All sentences will become separate utterances (in a semi-smart manner; specifically, breaks in speaking, if represented by three periods, are not considered sentence ends, whereas ellipses ("…" or unicode 2026, see the example) are)
- If there are commas without a space following them, a space will be inserted.
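The default behaviour described above can be approximated in base R. The patterns below are assumptions chosen to mimic the description; they are not the package's actual default replacementsPre, replacementsPost or utteranceSplits, and sketch_clean is a hypothetical name used only for this illustration.

```r
# Illustrative patterns only; these are assumptions, not rock's real defaults.
sketchReplacements <- list(
  c("\\.{4,}", "..."),              # four or more periods become three
  c("(?<!\\.)\\.\\.(?!\\.)", "."),  # exactly two periods become one
  c("\\n{3,}", "\n"),               # three or more newlines become one
  c(",(?=\\S)", ", ")               # add a space after a comma that lacks one
)

# Break after '.', '?', '!' or an ellipsis character, but not after three
# periods (those represent a break in speaking, not a sentence end).
sketchUtteranceSplit <- "(?<=[.?!\u2026])(?<!\\.{3})[ \\t]+"

sketch_clean <- function(x) {
  for (pair in sketchReplacements) {
    x <- gsub(pair[1], pair[2], x, perl = TRUE)
  }
  gsub(sketchUtteranceSplit, "\n", x, perl = TRUE)
}

cleaned <- sketch_clean(
  "Wait.. I mean,yes..... Sure\u2026 but maybe... not. Fine."
)
cat(cleaned, sep = "\n")
```

In this sketch the text after "yes..." and "maybe..." stays on the same line, while the ellipsis character and ordinary sentence ends each start a new utterance.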
Examples
exampleSource <-
"Do you like icecream?


Well, that depends\u2026 Sometimes, when it's..... Nice. Then I do,
but otherwise... not really, actually."
### Default settings:
cat(clean_source(exampleSource));
#> Do you like icecream?
#>
#> Well, that depends…
#> Sometimes, when it's... Nice.
#> Then I do,
#> but otherwise... not really, actually.
### First remove existing newlines:
cat(clean_source(exampleSource,
                 removeNewlines=TRUE));
#> Do you like icecream?
#> Well, that depends…
#> Sometimes, when it's... Nice.
#> Then I do, but otherwise... not really, actually.
### Example with a YAML fragment
exampleWithYAML <-
  c(
    "Do you like icecream?",
    "",
    "",
    "Well, that depends\u2026 Sometimes, when it's..... Nice.",
    "Then I do,",
    "but otherwise... not really, actually.",
    "",
    "---",
    "This acts as some YAML. So this won't be split.",
    "Not real YAML, mind... It just has the delimiters, really.",
    "---",
    "This is an utterance again."
  );
cat(
  rock::clean_source(
    exampleWithYAML
  ),
  sep="\n"
);
#> Do you like icecream?
#>
#>
#> Well, that depends…
#> Sometimes, when it's... Nice.
#> Then I do,
#> but otherwise... not really, actually.
#>
#> ---
#> This acts as some YAML. So this won't be split.
#> Not real YAML, mind... It just has the delimiters, really.
#> ---
#> This is an utterance again.
exampleSource <-
"Do you like icecream?


Well, that depends\u2026 Sometimes, when it's..... Nice. Then I do,
but otherwise... not really, actually."
### Simple text replacements:
cat(search_and_replace_in_source(exampleSource,
                                 replacements=list(c("\u2026", "..."),
                                                   c("Nice", "Great"))));
#> Do you like icecream?
#>
#>
#> Well, that depends... Sometimes, when it's..... Great. Then I do,
#> but otherwise... not really, actually.
### Using a regular expression to capitalize the first letter of
### every word that follows a period:
cat(search_and_replace_in_source(exampleSource,
                                 replacements=list(c("\\.(\\s*)([a-z])", ".\\1\\U\\2"))));
#> Do you like icecream?
#>
#>
#> Well, that depends… Sometimes, when it's..... Nice. Then I do,
#> but otherwise... Not really, actually.
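The examples above only cover the single-source functions. The following is a minimal sketch of the directory-based variant, using only the arguments listed under Usage. The directory and file names are made up for illustration, and the call is guarded so that it is skipped when rock is not installed; the exact names of the written files depend on outputPrefix and outputSuffix, so no output is shown.

```r
### Hypothetical temporary directories and file, created for this sketch only
inputDir <- tempfile("rock_sources_in_");
outputDir <- tempfile("rock_sources_out_");
dir.create(inputDir);
dir.create(outputDir);
writeLines("It was nice, really nice.",
           file.path(inputDir, "interview.txt"));

### Only run the replacement if the rock package is available
if (requireNamespace("rock", quietly = TRUE)) {
  rock::search_and_replace_in_sources(
    input = inputDir,
    output = outputDir,
    replacements = list(c("nice", "great")),
    silent = TRUE
  );
  ### Inspect which files were written
  print(list.files(outputDir));
}
```

Passing output = "same" instead of a separate directory would, per the argument description, write each processed file next to the file it was read from.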