Welcome back, and happy new year everyone!
Well – this was meant to be a weekly post – and I’m almost 3 months late for that. But – better late than never!
If you’re new to this blog series, head on over to part 1 to catch up on what we’re doing here. That first post sets the stage, gives motivation to the problem, and sets the parameters of the challenge.
In the last solution, one of the biggest issues is the redundancy of wrapping the string. The solution is largely based upon the fact that we wrap and then find the segments we need to correct. In this attempt, I try to rectify this by instead identifying the strings that need to be hyphenated upfront. So how do we do that?
Now, the recursive approach is out. Instead I try to find a way to split the string upfront, but I do some wonky things. I wanted to modify the string in place as I go, and in the process I exercise some bad practice. I declare insert_break
inside this function and do so to leverage the name spaces, because when I walk over the string, I look back outside to grab text
and the buffer count, which ultimately makes things more complicated. But in this example I manage to only run str_wrap()
once. Unfortunately, I’m still left with quite a bit of overhead, making this solution still undesirable. Nonetheless – there’s still a learning experience to be had!
#' @param text String to be wrapped
#' @param width Longest allowable width of a line
#'
#' @return Wrapped string
overflow_wrap2 <- function(text, width) {
# Don't go down this hole if we can avoid it
if (nchar(text) < width) {
return(text)
}
# Find where the splits need to happen
sections <- str_locate_all(text, paste0("\\w{", width-1, "}(?=\\w)"))[[1]] #[1]
if (!is_empty(sections)) {
# Trackers
buff <- 0 #[2]
insert_break <- function(loc) {
st <- str_sub(text, end = loc + buff) #[3]
en <- str_sub(text, start = loc + 1 + buff) #[4]
buff <<- buff + 2 #[5]
text <<- paste(c(st, en), collapse = "- ") #[6]
}
# Walk the identified sections to insert the hyphenation
# Use the 'end' column of the sections matrix
walk(sections[,2], ~ insert_break(.))
}
# Wrap the final string
str_wrap(text, width) #[7]
}
# Vectorized version of overflow_wrap2
overflow_wrapper2 <- Vectorize(overflow_wrap2, vectorize.args = "text") #[8]
What I’m actually doing here is inserting "- "
into words longer than the alotted width. Going back to our desired goal, we’re using hyphens to break those long running words so they wrap properly. By inserting a hyphen and a space, when I run str_wrap()
, the wrapping happens the way I’d like it to.
To break down the algorithm a gain, here’s what I do:
(numbers line up with the numbers in the comments within the code above)
- Identify any word within the input string that’s longer than the specified width
- Create a buffer that will track the extending length of the string
- Slice out the beginning of the string to an identified break point
- Slice out the break point to the end of the string
- Increment the buffer by the length of the text I’m inserting
- Paste the text back together, inserting
"- "
in between the split sections of string - Wrap the text on the newly prepared string using
str_wrap()
Note that points 3 through 6 happen within the insert_break()
function. Because there can be multiple sections that need breaks inserted, I iterate over each identified section using purrr::walk()
.
Lastly, there’s item number 8 from the code above. This function was originally written to take in a single element text vector. Using the Vectorize()
function, this allows us to vectorize our function over multi-element character vectors, as we would encounter in a data frame.
Sidebar about regular expressions
If you haven’t dabbled into the world of regular expressions, I highly recommend it. When you’re working with text, you’re inevitably going to come across some situation where a good regex will lead you to a nice and elegant solution. Plus, there’s always something satisfying about working out a monster like this and realizing that it finally works:
(a(\+\d+)?|x+)(\.(a(\+\d+)?|x+)?)?
You tell ME what it does.
As a quick aside, if you want to quickly learn the basics of regular expressions, I highly recommend this tutorial. Furthermore, regex101.com is an incredible tool any time you’re trying to work the kinks out of an expression.
Ok, moving on.
With that outline of the algorithm broken down, let’s dabble into some of the nuance. Using regular expressions I’m able to find each string segment in which I’d need to insert a hyphen upfront using stringr::str_locate_all
. This ends up being quite nice and concise:
sections <- str_locate_all(text, paste0("\w{", width-1, "}(?=\w)"))[[1]]
A few things worth calling out – stringr::str_locate_all()
returns a list containing a matrix. It returns a list because this function is vectorized, but here I’m pulling the matrix out directly, hence the [[1]]
(we’ll come back to this)
A “fun” thing about strings in R is that backslashes (\
) are by default a break character. Regular expressions make heavy use backslashes, and to use them in a regex in R, you have to break the break character (fun, right?). So this in turn makes them pretty nasty to look at. In R 4.0.0 and later, we get the wonderful addition of raw strings. As an avid Python programmer as well, I was thrilled to see this addition. That said, given that I’m concatenating multiple strings together here to form the regex I want, since the raw strings use the C++ style syntax and must be wrapped in ()
or {}
this would make the whole thing a bit uglier.
Knowledge Check
Let’s look back at our test set of data.
text <- c(
"Pneumonoultramicroscopicsilicovolcanoconiosis is a really long word",
"It's a name that has been invented for a lung disease caused by breathing in very small pieces of ash or dust",
"Pneumono refers to the lung. Ultra means extremely, microscopic means tiny, silica is sand, volcano is self-evident, and coniosis is scarring.")
On your own, run overflow_wrapper2()
on text
and look at the results.
A few questions to check your knowledge:
- In overflow_wrap2, why do I use
width-1
? - Check out the link right here to see how this regex evaluates when width is set to 11.
- Why isn’t “extremely,” matched?
- Why isn’t “self-evident” matched?
Exploring Environments
Alright, moving forward. The next think you might notice is that I declare another function inside this one, and I use the super assign (<<-
). This probably seems a bit weird, but bear with me.
My goal here was to modify the variable text
in place. While we’ll find out soon that this was a bad idea, there are some interesting lessons to be learned. The entire point of the buff
variable within the function is handle the fact that the string is changing length as each hyphen is inserted. I insert a hyphen (-) and some white space so that I can eventually feed this new string into stringr::str_wrap()
once (note that this was my original goal; to eliminate the redundant calls to stringr::str_wrap()
). Since I’m inserting two character, I buffer my sub-stringing by two characters after each insertion. So I sub-string, I increment my buffer, and I concatenate my string back to the text
variable.
So that’s what’s happening, but let’s explore the how. There are two quirks you might question:
insert_break()
is inside this function. I didn’t want to do that, but for this to work the way I wanted it to – I had to.- In the call to
purrr::walk()
I use a lambda function instead of just passing ininsert_break
itself.
This largely comes down to the way that R handles namespaces. This is worth playing around with to understand. Try this:
- Take the
insert_break()
function out ofoverflow_wrap2()
and put it in the global environment. Try rerunningoverflow_wrap2()
again and see what happens. Play around with the function by insertingbrowser()
calls and see if you can spot what’s happening. - Put
insert_break()
back where it was and change the linewalk(sections[,2], ~ insert_break(.))
towalk(sections[,2], insert_break)
to pass theinsert_break()
function intowalk()
directly instead of as an anonymous function using{purrr}
lambda syntax. Now what happens?
So we’ve got some quirks that we’ve now worked through. Let’s take a look at performance.
wrap2_bench <- microbenchmark::microbenchmark(
overflow_wrapper2(rep(text, 1000), width=10),
unit='s'
)
ggplot2::autoplot(wrap2_bench)
Ok. We made some improvements – we trimmed about 1.08 of a second per test. But we still have a lot of overhead compared to the baseline – about 1.32 seconds per test. What else can we do? See you (hopefully) next week to find out!
As a reminder – you can submit your solutions in our GitHub repo, which has submission instructions in the README.
Knowledge check answers:
Why do I use width-1
?
In the regex, I’m using something called a positive lookahead. This means that I only want to match cases there are width-1
characters followed by another character. This effectively gives me the end points of the strings where I need to insert the hyphens, and only if a hyphen actually needs to be inserted.
If my desired width is 10, and I instead just searched for a continuous string of 9 characters, I’d end up making matches that I didn’t want. This is because I just want the end point of 9
characters when that 9
character sequence is actually 10 characters long.
Why isn’t “extremely,” matched? and Why isn’t “self-evident” matched?
The \w
token in regular expressions means “Any word character”, and as such commas and hyphens are not matched. Similarly, str_wrap()
will handle these situations properly.