Ruby regex problem: matching optional white space at start of line

rubyjohn · December 23, 2020, 2:51pm

I am writing a regular expression for Ruby. I want it to match all lines that begin with a *, skipping optional white space. The following regular expression does this correctly on rubular.com:

/^\s*\*.*$/

For example, in the following text, this regular expression correctly matches the first two lines, and correctly does not match the third line:

   * text
 *text
text*text

But when I use this regular expression in Ruby, none of these three lines match. If I remove the ^, which indicates start of line, Ruby and rubular.com give the same result: they match the first two lines fully, and match the third line starting at the *. But that is not what I am trying to do. My current solution uses /\s*\*.*$/ (or, equivalently, /\*.*$/), both of which unfortunately also match text*text.

Is there a reason why Ruby is not matching the behavior I see on rubular? I would like to get the regex listed at the top of this post to work in Ruby the same way it does on rubular.

I am a complete newcomer to Ruby, so any tips or suggestions are much appreciated!

SouravGoswami · December 23, 2020, 5:13pm

When you are using this:

/^\s*\*.*$/

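With String#match or =~ it will only return the first match, which is the first line. Adding the m flag lets . match newlines, so a single match can span multiple lines: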

/^\s*\*.*$/m

But there’s a caveat. What you are doing is actually matching if:

^\s*: The line starts with 0 or more occurrences of white space.
\*: There is a literal * after the white space.
.*$: Match 0 or more characters after the *, up to the end of the line (or, with the m flag, up to the end of the string).

But your string is:

"   * text\n *text\ntext*text\n"

Your string starts with white space.
Your string has a * after the white space.
Your string has 0 or more character(s) after the *, running to the end of the string.

So your regexp perfectly matches the whole string with the multiline match.
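A small sketch of the effect described above, assuming the three-line sample from the first post (the variable names are only for illustration). Note that in Ruby the m flag only changes what . matches; ^ and $ anchor at line boundaries with or without it.

```ruby
# Sample text from the first post (assumed: three lines, trailing newline).
text = "   * text\n *text\ntext*text\n"

# Without m, . stops at a newline, so the first match covers one line.
single = text[/^\s*\*.*$/]

# With m, . also matches "\n", so the greedy .* runs to the end of the string.
spanning = text[/^\s*\*.*$/m]

p single    # the first line only
p spanning  # the whole string, including the unwanted third line
```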

To get the desired result, the match has to be repeated line by line: wherever there is a newline, matching has to start again from the beginning of the next line. That is, it has to loop the match.

You can do that in another simple way, without using regexp:

string.each_line.select { |x|
    # lstrip returns a stripped copy, so the selected line keeps its indentation
    x.lstrip.start_with?('*'.freeze)
}.join

String#lstrip strips the leading white space (spaces, \t, \r, \n, etc.) from the beginning (left) of the string and returns a new string without modifying the original. (String#lstrip!, with the bang, would modify each line in place and drop its indentation from the result.)
'*'.freeze makes a frozen string that’s not allocated in every loop.

Here you are selecting only the lines that start with ‘*’ once their leading white space is stripped, which gives you an array that you then join together as a string. So the return value will look like:

"   * text\n *text\n"

I can’t think of a regexp alternative that’s faster than this, though…
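For comparison, a regexp-based version of the same per-line selection might look like the sketch below. It makes no claim about speed. The sample string and the variable names are assumptions for illustration, and [ \t]* is used in place of \s* so that the leading white space cannot run across a newline.

```ruby
# Sample text from the first post (assumed: three lines, trailing newline).
text = "   * text\n *text\ntext*text\n"

# Enumerable#grep keeps the lines for which the pattern matches.
# \A anchors at the start of each yielded line.
starred = text.each_line.grep(/\A[ \t]*\*/).join

p starred
```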

rubyjohn · December 23, 2020, 6:08pm

Thank you for the reply! I did not quite understand it, so let me provide some additional context.

I am working on a new lexer for Rouge. An example is available here. Line 62 provides a regular expression for an in-line comment, which is close to what I am trying to accomplish:

/#.*?$/

Here is some example text. I want to get rouge to identify the first three lines as single-line comments, but not the fourth line.

*text
   * text
 *text
text*text

To accomplish this, I make two changes to the template code /#.*?$/.

(1) The comment will be indicated by *, not #. That changes the above regex line to:

/\*.*?$/

I have confirmed that this change works as expected. However, this corresponds to an in-line comment, so it matches the second half of the fourth line above, text*text. Thus, a second change is necessary to accomplish what I want.

(2) Require the match to only work if the * is preceded by optional white space at the start of a line:

/^\s*\*.*?$/

However, when I test this out, rouge only matches the first line below:

*text
   * text
 *text
text*text

This is not related to needing an m, because I am not interested in a multi-line comment here. And in any case, if I add more text above this sample, it does not change anything: the first line, *text, still gets matched.

rubyjohn · December 24, 2020, 1:22am

I have solved this problem. It turns out there was interference in another part of the code where whitespace was being parsed.

To sum up, the following regex does indeed work for this situation:

/^\s*\*.*$/

There is no inconsistency between ruby and rubular.
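A quick way to check this in plain Ruby, outside of any lexer, is a sketch like the following. It assumes the three-line sample from the first post; the variable names are only for illustration.

```ruby
# Sample text from the first post (assumed: three lines, trailing newline).
text = "   * text\n *text\ntext*text\n"

# ^ and $ anchor at every line boundary in Ruby without any flag,
# so scan returns one match per qualifying line.
anchored = text.scan(/^\s*\*.*$/)

# Without ^, the third line also matches, starting at its *.
unanchored = text.scan(/\*.*$/)

p anchored    # the first two lines
p unanchored  # three matches, one per line, each starting at the *
```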

SouravGoswami · December 24, 2020, 7:07am

Well I was talking about this:

I thought you wanted to select every line beginning with *.