Trying regular expressions in Ruby

17th August 2020

Do you find regular expressions difficult to understand? I often do, so I decided to read and reread about the topic. I wrote down some notes that I think will be enough to work with regular expressions whenever I need them, and I would like to share them with you:

We can create a Ruby regular expression with // or %r{}, which are its literal constructors.

/abc/
%r{abc}

Both are instances of the Regexp class:

/abc/.class
=> Regexp

%r{abc}.class
=> Regexp
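
The %r{} form is handy when the pattern itself contains forward slashes, because they do not need to be escaped. A minimal sketch of the difference (the constant names and the URL are only illustrative values, not part of the original notes):

```ruby
# Both literals describe the same pattern: "http" or "https" followed by "://".
SLASH_LITERAL   = /https?:\/\//   # forward slashes must be escaped inside / /
PERCENT_LITERAL = %r{https?://}   # no escaping needed inside %r{ }

url = "https://example.com"
p SLASH_LITERAL.match?(url)    # => true
p PERCENT_LITERAL.match?(url)  # => true
```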

The methods match, match? and =~

Method	Description
match?	Returns a boolean: true if the match was successful and false if not

Example:

/ruby/.match?("ruby regex are not too complicated")
=> true

The String class also has the match? method:

"ruby regex are not too complicated".match? /ruby/
=> true

Method	Description
=~	If the match was successful, the index of the first position of the matched word is returned, otherwise, nil is returned.

Example:

/are/ =~ "ruby regex are not too complicated"
=> 11

The number 11 was returned because the expression matched the word are, whose first character is at index 11. We can verify it:

"ruby regex are not too complicated"[11]
=> "a"

The String class also has the =~ method:

"ruby regex are not too complicated" =~ /are/
=> 11

Method	Description
match	If the match was successful, a MatchData object is returned with the matched information, otherwise, `nil` is returned

Example:

"ruby regex are not too complicated".match(/ruby/)
=> #<MatchData "ruby">
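
To compare the three methods side by side, here is a small sketch. The helper match_summary is a hypothetical name I made up for illustration; it only collects what each method returns for the same pattern and string, including the case where nothing matches:

```ruby
# Hypothetical helper: collects the return value of each matching method.
def match_summary(pattern, text)
  {
    boolean: pattern.match?(text),  # true or false
    index: pattern =~ text,         # Integer position or nil
    data: pattern.match(text)       # MatchData object or nil
  }
end

hit  = match_summary(/regex/, "ruby regex are not too complicated")
miss = match_summary(/python/, "ruby regex are not too complicated")

p hit[:boolean]  # => true
p hit[:index]    # => 5
p hit[:data][0]  # => "regex"
p miss.values    # => [false, nil, nil]
```

Since match and =~ return nil when nothing is found, it is worth checking the result before calling methods on it.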

Special characters ^ $ ? . / \ [] {} () + *

To match a special character literally in a regex, it needs to be escaped. For example:

/\./.match?("Hello.")
=> true

An unescaped dot also produces a match, but not the one we are looking for. Let's prove it:

/./.match("Hello.")
=> #<MatchData "H">

/\./.match("Hello.")
=> #<MatchData ".">

To escape a character, place a backslash \ just before it, as with the dot \. in the previous example.
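
The same idea applies to the other special characters. As a sketch, here is the plus sign, which has its own meaning (one or more repetitions) when it is not escaped. The constant names are only illustrative:

```ruby
UNESCAPED_PLUS = /1+1/   # "+" is special here: one or more "1", then another "1"
ESCAPED_PLUS   = /1\+1/  # "\+" is a literal plus sign

p UNESCAPED_PLUS.match?("1+1")  # => false
p UNESCAPED_PLUS.match?("11")   # => true
p ESCAPED_PLUS.match?("1+1")    # => true
```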

The dot `.`

Used in a regex, this character matches any single character in a string except a newline \n

/.ated/.match?("ruby regex are not too complicated")
=> true

If we inspect what the expression matched, we get:

/.ated/.match("ruby regex are not too complicated")
=> #<MatchData "cated">

This expression requires one character, whatever it is, just before the pattern ated

The dot . doesn't match a newline \n

Example:

/.complicated/.match?("ruby regex are not too \ncomplicated")
=> false

Character class [ ]

Several characters can be listed inside the brackets, and any one of them can be matched

Example: test whether a string contains a vowel:

/[aeiou]/.match?("Hello")
=> true

And more characters can be added to the expression just after the brackets:

/[rty]uby/.match?("ruby regex are not too complicated")
=> true

In this example, the character r inside the brackets plus the characters uby outside them match the word ruby

Ranges inside character class [a-z] [0-9] [A-Z]

Ranges can be created inside the brackets. For example:

Syntax	Description
/[a-z]/	This range matches any lowercase letter from `a` to `z`
/[0-9]/	This range matches any digit
/[1-5]/	This range is equivalent to the character list `/[12345]/`

/[a-z]/.match("Ruby regex are not too complicated")
=> #<MatchData "u">

/[0-9]/.match("I am 31 years old")
=> #<MatchData "3">
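
Ranges and single characters can be mixed in the same class, and several classes can be placed one after another. A short sketch (the constant names and sample strings are my own, chosen only for illustration):

```ruby
# A class can combine several ranges.
HEX_DIGIT = /[0-9a-fA-F]/
# Two classes in a row: an uppercase letter followed by a lowercase one.
CAPITALIZED_PAIR = /[A-Z][a-z]/

p HEX_DIGIT.match("xyz7")[0]               # => "7"
p HEX_DIGIT.match?("xyz")                  # => false
p CAPITALIZED_PAIR.match("ruby Regex")[0]  # => "Re"
```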

Ranges Abbreviations

Syntax	Abbreviation	Description
/[0-9]/	/\d/	This range matches any digit
/[0-9a-zA-Z_]/	/\w/	This range matches any digit, any letter from `a` to `z` or `A` to `Z`, and the underscore `_`

Abbreviation	Description
/\s/	This expression is not a range, but it belongs to the abbreviations. It covers whitespace: spaces, tabs, and newlines

All of these abbreviations have a negated version that matches the opposite of the positive version

Abbreviation	Description
/\D/	This matches anything that is not a digit
/\W/	This matches anything that is not a digit, letter or underscore
/\S/	This matches anything that is not a space, newline or tab
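
To see the abbreviations and their negated versions in action, this sketch asks each one for its first match in the same string. SAMPLE and FIRST_MATCHES are names I invented for the example:

```ruby
SAMPLE = "Order #42 shipped"

# First match of each abbreviation and of its negated counterpart.
FIRST_MATCHES = {
  '\d' => /\d/.match(SAMPLE)[0],  # "4", the first digit
  '\D' => /\D/.match(SAMPLE)[0],  # "O", the first non-digit
  '\w' => /\w/.match(SAMPLE)[0],  # "O", the first word character
  '\W' => /\W/.match(SAMPLE)[0],  # " ", the first non-word character
  '\s' => /\s/.match(SAMPLE)[0],  # " ", the first whitespace character
  '\S' => /\S/.match(SAMPLE)[0]   # "O", the first non-whitespace character
}

FIRST_MATCHES.each { |shorthand, found| puts "#{shorthand} -> #{found.inspect}" }
```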

Captures

Syntax	Description
()	The pattern indicated inside the parentheses will be captured

For example, let's capture the strings "Lenin Godinez" and "40" from the next string:

str = "Lenin Godinez,RoR Developer,40 years"

/([A-Za-z]+\s[A-Za-z]+),.+,(\d+)/.match str
=> #<MatchData "Lenin Godinez,RoR Developer,40" 1:"Lenin Godinez" 2:"40">

With the pattern ([A-Za-z]+\s[A-Za-z]+) the words "Lenin Godinez" are captured and, with the pattern (\d+), the number "40" is captured

All the captures are automatically assigned to global variables. In the previous example, the two captures were stored in the global variables $1 and $2, and we can test it using puts:

puts "Name: #{$1}, Age: #{$2}"
Name: Lenin Godinez, Age: 40
=> nil

The captures can also be accessed the same way we get an element from an array, by passing an index:

m = /([A-Za-z]+\s[A-Za-z]+),.+,(\d+)/.match str

puts "Name: #{m[1]}, Age: #{m[2]}"
Name: Lenin Godinez, Age: 40
=> nil

If zero is passed as the index to m, the complete match is returned:

m[0]
=> "Lenin Godinez,RoR Developer,40"

A useful method of the MatchData object is captures, which returns an array with the captures:

m.captures
=> ["Lenin Godinez", "40"]
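
Because match returns nil when the pattern is not found, calling m[1] or m.captures on a failed match raises an error. A possible way to guard against that is sketched below; parse_person and PERSON_PATTERN are hypothetical names, and converting the age to an integer is my own addition:

```ruby
PERSON_PATTERN = /([A-Za-z]+\s[A-Za-z]+),.+,(\d+)/

# Hypothetical helper: returns [name, age] or nil when the line does not match.
def parse_person(line)
  m = PERSON_PATTERN.match(line)
  return nil unless m  # nothing matched, so there are no captures to read

  name, age = m.captures
  [name, age.to_i]
end

p parse_person("Lenin Godinez,RoR Developer,40 years")  # => ["Lenin Godinez", 40]
p parse_person("no commas here")                        # => nil
```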

Named Captures

Syntax	Description
(?<capture_name>)	Names the capture. The captures are still stored in the global variables as well

Example:

str = "Lenin Godinez,RoR Developer,40 years"
re = /(?<name>[A-Za-z]+\s[A-Za-z]+),.+,(?<age>\d+)/

re.match str
=> #<MatchData "Lenin Godinez,RoR Developer,40" name:"Lenin Godinez" age:"40">

The captures can be accessed the same way we get a value from a hash: by passing a key, which is the name of the capture

Example:

m = re.match str

puts "Name: #{m[:name]}, Age: #{m[:age]}"
Name: Lenin Godinez, Age: 40
=> nil

There is also a useful method to get the named captures: named_captures, which returns a hash with the captures:

m.named_captures
=> {"name"=>"Lenin Godinez", "age"=>"40"}
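
Note that the keys returned by named_captures are strings, not symbols. If symbol keys are more convenient, one option is to convert them, as in this sketch (person_to_hash and NAMED_PERSON are hypothetical names used only here):

```ruby
NAMED_PERSON = /(?<name>[A-Za-z]+\s[A-Za-z]+),.+,(?<age>\d+)/

# Hypothetical helper: returns the named captures with symbol keys,
# or nil when the line does not match.
def person_to_hash(line)
  m = NAMED_PERSON.match(line)
  return nil unless m

  m.named_captures.transform_keys(&:to_sym)
end

person = person_to_hash("Lenin Godinez,RoR Developer,40 years")
p person[:name]  # => "Lenin Godinez"
p person[:age]   # => "40"
```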

Modifiers ? + * {}

Modifier	Description
?	Represents 0 or 1. It marks a character as optional, and it is placed just after the optional character

Example:

/Mrs?\./.match? "Welcome Mrs. Smith"
=> true

If we remove the modifier ? from the pattern and the letter s from "Mrs." in the string, the test fails, because the pattern now requires the literal text Mrs.

/Mrs\./.match? "Welcome Mr. Smith"
=> false

Now, if we put the modifier ? back in the pattern, we get a match with "Mr." because the character s in the pattern is optional

/Mrs?\./.match? "Welcome Mr. Smith"
=> true

Modifier	Description
+	Represents one or more. This modifier absorbs all characters it can as long as the pattern continues matching

Example:

/\d+/.match("I am 31 years old")
=> #<MatchData "31">

If the modifier is removed from this expression, only the first digit of 31 is matched

/\d/.match("I am 31 years old")
=> #<MatchData "3">

Modifier	Description
*	Represents zero or more. This modifier absorbs all characters it can and matches them even if there are no characters

Example:

/[a-zA-Z]+:\s[a-zA-Z]+.*/.match("Name: Lenin, Role: RoR Dev")
=> #<MatchData "Name: Lenin, Role: RoR Dev">

# This /[a-zA-Z]+:\s[a-zA-Z]+/ matches "Name: Lenin"
# And this .* placed after the previous pattern matches the rest of the string

If we modify the string to be only "Name: Lenin", the expression continues matching since more characters after the word "Lenin" are just optional, thanks to the modifier *

/[a-zA-Z]+:\s[a-zA-Z]+.*/.match("Name: Lenin")
=> #<MatchData "Name: Lenin">

Modifier	Description
{}	This modifier is useful when we want to indicate an exact number of repetitions on the match

Example:

Suppose I want to match a phone number in the format 111-111-1111:

/\d{3}-\d{3}-\d{4}/.match? "312-123-1234"
=> true

/\d{3}-\d{3}-\d{4}/.match? "312-1234-1234"
=> false

/\d{3}-\d{3}-\d{4}/.match? "3-123-1234"
=> false

We can also indicate a minimum and a maximum number of repetitions with a range such as {min,max}. The first number is the minimum and the second is the maximum.

Example:

/\d{1,4}/.match("0123456789")
=> #<MatchData "0123">

In this example, four digits are matched, because we asked for at least one digit and at most four, and the modifier takes as many as it can

If the string has fewer characters than the maximum indicated in the pattern, then all of them are matched

/\d{1,6}/.match("0123")
=> #<MatchData "0123">

If the minimum amount is not reached, then the expression returns nil

/\d{5,8}/.match("0123")
=> nil

If the second number of the range is omitted, only the minimum is enforced and the maximum is left open:

/\d{3,}/.match("0123456789")
=> #<MatchData "0123456789">

Again, if the minimum amount is not reached, the expression returns nil

/\d{3,}/.match("01")
=> nil
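
As a recap of this section, the sketch below applies each modifier to the same run of three "a" characters, and then to a string without any "a". QUANTIFIER_RESULTS is just a name I picked for the example:

```ruby
# The same modifiers applied to the string "aaa".
QUANTIFIER_RESULTS = {
  "a?"     => /a?/.match("aaa")[0],      # "a"   (zero or one)
  "a+"     => /a+/.match("aaa")[0],      # "aaa" (one or more)
  "a*"     => /a*/.match("aaa")[0],      # "aaa" (zero or more)
  "a{2}"   => /a{2}/.match("aaa")[0],    # "aa"  (exactly two)
  "a{2,}"  => /a{2,}/.match("aaa")[0],   # "aaa" (two or more)
  "a{1,2}" => /a{1,2}/.match("aaa")[0]   # "aa"  (between one and two)
}

QUANTIFIER_RESULTS.each { |quantifier, matched| puts "#{quantifier} -> #{matched.inspect}" }

# With no "a" in the string, only the modifiers that accept zero repetitions match.
p(/a*/.match("bbb")[0])  # => ""
p(/a+/.match("bbb"))     # => nil
```

The empty string returned by a* is still a successful match, which is easy to overlook when testing with match?.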

Anchors ^ $ \A \z \Z \b

Anchor	Description
^	Indicates the beginning of a line. The match occurs at the beginning of it

Example:

/^\s*#/.match(" # A ruby comment")
=> #<MatchData " #">

The match can also be found at the beginning of a new line inside the string

/^\s*#/.match("x = 1\n # A ruby comment")
=> #<MatchData " #">

If the comment appears in the middle of a line instead of at its beginning, we won't get a match and nil will be returned

/^\s*#/.match(" x = 1 # A ruby comment")
=> nil

Anchor	Description
$	Indicates the end of a line. The match occurs at the end of it

Example:

Suppose we want to match the dot that appears at the end of the first line of the next string:

/\.$/.match("ruby regex are not too complicated.\ncurrently I am working with them")
=> #<MatchData ".">

Anchor	Description
\A	Indicates the beginning of a string. It is not the same as the beginning of a line

/\Aruby regex/.match("ruby regex are not too complicated.\ncurrently I am working with them")
=> #<MatchData "ruby regex">

If we try to match the word "currently" that starts on a new line, it is not matched, because \A only matches at the very start of the string.

/\Acurrently/.match("ruby regex are not too complicated.\ncurrently I am working with them")
=> nil

Anchor	Description
\z	Indicates the end of a string. It is not the same as the end of a line

/them\z/.match("ruby regex are not too complicated.\ncurrently I am working with them")
=> #<MatchData "them">

If we try to match the end of a line, it will return nil

/complicated\.\z/.match("ruby regex are not too complicated.\ncurrently I am working with them")
=> nil

Anchor	Description
\Z	Indicates the end of a string, ignoring a newline at the very end of the string if there is one

/them\Z/.match("ruby regex are not too complicated.\ncurrently I am working with them\n")
=> #<MatchData "them">

If the same string is evaluated with the anchor \z, then we won't get a match due to the new line indicated at the end of the string

/them\z/.match("ruby regex are not too complicated.\ncurrently I am working with them\n")
=> nil

Anchor	Description
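\b	Indicates a word boundary: the position between a word character (\w) and a non-word character (\W) or an edge of the string. It matches a position, so the surrounding symbols are not part of the match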

/\b\w+\b/.match("%%%Ruby###")
=> #<MatchData "Ruby">
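
The difference between line anchors and string anchors matters when a regex is used to validate a whole value. As a sketch, here is the phone number pattern from the previous section in three variants; the constant names and sample strings are my own:

```ruby
PHONE_ANYWHERE = /\d{3}-\d{3}-\d{4}/      # the format may appear anywhere
PHONE_LINE     = /^\d{3}-\d{3}-\d{4}$/    # some line must be exactly a phone number
PHONE_WHOLE    = /\A\d{3}-\d{3}-\d{4}\z/  # the whole string must be a phone number

# Extra digits around the number still contain a valid substring.
p PHONE_ANYWHERE.match?("9312-123-12345")    # => true
p PHONE_WHOLE.match?("9312-123-12345")       # => false

# A multi-line string passes the line anchors but not the string anchors.
p PHONE_LINE.match?("hello\n312-123-1234")   # => true
p PHONE_WHOLE.match?("hello\n312-123-1234")  # => false

p PHONE_WHOLE.match?("312-123-1234")         # => true
```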

LookAhead Assertions

Syntax	Description
(?=)	The pattern inside the lookahead must appear right after the current position for the match to succeed, but it is not included in the match

Example:

There is a list of numbers and I want to match the number that ends with a dot, without including the dot in the result of the match. The lookahead assertion needed is (?=\.) with the dot escaped inside it:

str = "123 456. 789"
/\d+(?=\.)/.match(str)
=> #<MatchData "456">

On the other hand, there is a negative version of this lookahead assertion

Syntax	Description
(?!)	The match succeeds only if the pattern inside the lookahead does not appear right after the current position

Example:

/\d+(?!\.)/.match(str)
=> #<MatchData "123">

The result is "123" because it is the first match that satisfies "a series of digits that is not followed by a dot"
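
One detail to keep in mind with the negative lookahead: when \d+ is followed by a dot, the engine can give back one digit and succeed with a shorter match. The sketch below shows that behavior and one possible way around it, which is to also forbid a digit inside the lookahead. It uses scan to collect every match; the constant names are only illustrative:

```ruby
NUMBERS = "123 456. 789"

BEFORE_DOT     = /\d+(?=\.)/     # digits that are followed by a dot
NOT_BEFORE_DOT = /\d+(?![\d.])/  # digits followed by neither a digit nor a dot

p NUMBERS.scan(BEFORE_DOT)      # => ["456"]
p NUMBERS.scan(NOT_BEFORE_DOT)  # => ["123", "789"]

# With only the dot inside the lookahead, the engine backtracks one digit
# and reports part of the number instead of rejecting it:
p(/\d+(?!\.)/.match("456.")[0])  # => "45"
```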

LookBehind Assertions

Syntax	Description
(?<=)	This matches a pattern if it is preceded by the pattern indicated inside the lookbehind assertion

Example:

I want to match the word "regex" only if the word "ruby" is placed just before it in the next string:

str = "ruby regex are not too complicated"
re = /(?<=ruby\s)regex/

re.match str
=> #<MatchData "regex">

And its negative version is:

Syntax	Description
(?<!)	This matches a pattern if it is not preceded by the pattern indicated inside the lookbehind assertion

str = "ruby regex are not too complicated"
re = /(?<!ruby\s)regex/
re.match str
=> nil

In this example, nil is returned because the "regex" word is preceded by the word "ruby", so the match is not successful. To get a match, the string needs to be modified.
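
Here is a short sketch of both lookbehind versions with other strings, including a modified string that makes the negative version succeed. The price example and the constant names are my own and not part of the original notes:

```ruby
PRICE          = /(?<=\$)\d+/        # digits preceded by a dollar sign
NOT_AFTER_RUBY = /(?<!ruby\s)regex/  # "regex" not preceded by "ruby "

p PRICE.match("total: $25 for 3 items")[0]  # => "25"
p NOT_AFTER_RUBY.match("ruby regex")        # => nil
p NOT_AFTER_RUBY.match("python regex")[0]   # => "regex"
```

In the first case the dollar sign is required but it is not part of the result, just as with the lookahead.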

Modifiers /i/m

Syntax	Description
/i	This modifier makes the expression case insensitive

Example:

/abc/i.match? "ABC"
=> true

Syntax	Description
/m	This modifier makes the dot also match newlines

Example:

/.+/m.match "abc\ndef"
=> #<MatchData "abc\ndef">
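
To make the effect of /m easier to see, this sketch runs the same patterns with and without it, and also combines it with /i. TWO_LINES is just a sample string for the example:

```ruby
TWO_LINES = "abc\nDEF"

p(/.+/.match(TWO_LINES)[0])       # => "abc"       (the dot stops at the newline)
p(/.+/m.match(TWO_LINES)[0])      # => "abc\nDEF"  (with m, the dot also matches "\n")
p(/abc.def/.match?(TWO_LINES))    # => false
p(/abc.def/mi.match?(TWO_LINES))  # => true        (m and i can be combined)
```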

Interpolation

Syntax	Description
/#{string}/	It is possible to interpolate strings inside a regex, just as is done with strings

str = "def"
/abc#{str}/
=> /abcdef/

Regexp.escape() method

This method escapes the characters of a string that have a special meaning in a regex, so the string can be matched literally.

Example:

Regexp.escape("a.c")
=> "a\\.c"

str = "a.c"
re = /#{Regexp.escape(str)}/

re.match("a.c")
=> #<MatchData "a.c">
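
The escaped pattern above matches "a.c", but so would the unescaped one. The difference only shows up with a string such as "abc", as in this sketch. The helper literal_pattern is a hypothetical name used for illustration:

```ruby
# Hypothetical helper: builds a pattern that matches the given text literally.
def literal_pattern(text)
  /#{Regexp.escape(text)}/
end

raw = /#{"a.c"}/  # interpolated without escaping: the dot still matches any character

p raw.match?("abc")                     # => true
p literal_pattern("a.c").match?("abc")  # => false
p literal_pattern("a.c").match?("a.c")  # => true
```

This is especially relevant when the interpolated string comes from user input.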

Scan method

This method evaluates a string from left to right and returns an array with all the matches found

Example:

str = "testing 1 2 3 testing 4 5 6"
str.scan(/\d/)
=> ["1", "2", "3", "4", "5", "6"]

It is possible to make captures with the scan method: if something is matched, an array of arrays is returned with the results

Example:

str = "RoR y Ruby JS y React"
str.scan(/([A-Za-z]+)\sy\s([A-Za-z]+)/)
=> [["RoR", "Ruby"], ["JS", "React"]]
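
Since scan returns plain arrays, its result can be chained with the usual Array methods. A small sketch with two hypothetical helpers, sum_of_digits and pairs_to_hash, that reuse the two patterns above:

```ruby
# Without captures, scan returns an array of strings.
def sum_of_digits(text)
  text.scan(/\d/).map(&:to_i).sum
end

# With two captures, scan returns an array of pairs, which to_h turns into a hash.
def pairs_to_hash(text)
  text.scan(/([A-Za-z]+)\sy\s([A-Za-z]+)/).to_h
end

p sum_of_digits("testing 1 2 3 testing 4 5 6")  # => 21

pairs = pairs_to_hash("RoR y Ruby JS y React")
p pairs["RoR"]  # => "Ruby"
p pairs["JS"]   # => "React"
```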

Grep method

If a regex is passed as the argument to this method, it tests each element of the array against the pattern and returns an array with all the elements that match

Example:

["USA", "UK", "Francia", "Mexico"].grep(/[a-z]/)
=> ["Francia", "Mexico"]

Only 2 elements from the array matched the expression, and it can be proved as follows:

/[a-z]/.match? "USA"
=> false
/[a-z]/.match? "Francia"
=> true
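
Two related variations that I find handy: grep accepts a block that transforms each matching element, and grep_v returns the elements that do not match. A sketch with the same array (COUNTRIES and the variable names are only illustrative):

```ruby
COUNTRIES = ["USA", "UK", "Francia", "Mexico"]

with_lowercase = COUNTRIES.grep(/[a-z]/)                  # elements that match
shouted        = COUNTRIES.grep(/[a-z]/) { |c| c.upcase } # matches, transformed by the block
all_uppercase  = COUNTRIES.grep_v(/[a-z]/)                # elements that do NOT match

p with_lowercase  # => ["Francia", "Mexico"]
p shouted         # => ["FRANCIA", "MEXICO"]
p all_uppercase   # => ["USA", "UK"]
```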

And that is it! All you need to do is practice and practice with them to get more familiar. Feel free to reference this blog post if you have doubts.


Lenin Godinez
