validUTF8 | R Documentation
Check if each element of a character vector is valid in its implied encoding.
validUTF8(x)
validEnc(x)

x: a character vector.
These use similar checks to those used by functions such as grep.

validUTF8 ignores any marked encoding (see Encoding) and so checks directly whether the bytes in each string are valid UTF-8. (For the validity of ‘noncharacters’ see the help for intToUtf8.)

validEnc regards character strings as validly encoded unless their encodings are marked as UTF-8 or they are unmarked and the R session is in a UTF-8 or other multi-byte locale. (The checks in other multi-byte locales depend on the OS and, as with iconv, not all invalid inputs may be detected.)
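As a small illustration of the difference described above, the following sketch uses a Latin-1 byte string (the variable names `s_latin1` and `s_utf8` are chosen here for illustration only). The byte 0xE7 followed by ASCII letters is not a valid UTF-8 sequence, so validUTF8 rejects it whatever the mark says, whereas the result of validEnc follows the declared encoding.

```r
## "fa\xe7ile" holds the single byte 0xE7, which is c-cedilla in Latin-1
## but an incomplete multi-byte lead byte when read as UTF-8.
s_latin1 <- "fa\xe7ile"
Encoding(s_latin1) <- "latin1"

validUTF8(s_latin1)  # FALSE: the raw bytes are checked, the mark is ignored
validEnc(s_latin1)   # TRUE: a string marked as Latin-1 is regarded as valid

## The same bytes, now (wrongly) declared to be UTF-8
s_utf8 <- s_latin1
Encoding(s_utf8) <- "UTF-8"

validEnc(s_utf8)     # FALSE: marked UTF-8, so the UTF-8 check applies
```

For an unmarked copy of the same string the validEnc result would depend on the locale of the session, so it is not shown here.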
A logical vector of the same length as x. NA elements are regarded as validly encoded.
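A minimal sketch of the shape of the result, using a made-up vector `v` (not part of the original examples): the return value lines up element by element with the input, so it can be used directly as a logical index, and an NA element counts as valid.

```r
## Hypothetical input: one ASCII string, one NA, one byte that can never
## occur in valid UTF-8 (0xFF).
v <- c("abc", NA, "\xff")

ok <- validUTF8(v)
ok                 # TRUE TRUE FALSE
length(ok) == length(v)

## Keep only the elements whose bytes are valid UTF-8 (NA is retained)
v_clean <- v[ok]
```

Whether NA elements should then be dropped separately (for example with `is.na`) is a choice for the calling code; the check itself does not treat them as invalid.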
It would be possible to check for the validity of character strings in a Latin-1 encoding, but extensions such as CP1252 are widely accepted as ‘Latin-1’ and 8-bit encodings rarely need to be checked for validity.
x <-
  ## from example(text)
  c("Jetz", "no", "chli", "z\xc3\xbcrit\xc3\xbc\xc3\xbctsch:",
    "(noch", "ein", "bi\xc3\x9fchen", "Z\xc3\xbc", "deutsch)",
    ## from a CRAN check log
    "\xfa\xb4\xbf\xbf\x9f")
validUTF8(x)
validEnc(x) # depends on the locale
Encoding(x) <- "UTF-8"
validEnc(x) # typically the last, x[10], is invalid

## Maybe advantageous to declare it "unknown":
G <- x ; Encoding(G[!validEnc(G)]) <- "unknown"
try( substr(x, 1,1) ) # gives 'invalid multibyte string' error in a UTF-8 locale
try( substr(G, 1,1) ) # works in a UTF-8 locale
nchar(G) # fine, too
## but it is not "more valid" typically:
all.equal(validEnc(x), validEnc(G)) # typically TRUE