Trying regular expressions in Ruby
17th August 2020
Do you find regular expressions difficult to understand? I often do, so I decided to read and reread about the topic. I wrote down some notes that I think will be enough to work with them whenever I need to, and I would like to share them with you:
We can create a Ruby regex with // or %r{}, which are its literal constructors.
/abc/
%r{abc}
Both are instances of the Regexp class
/abc/.class
=> Regexp
%r{abc}.class
=> Regexp
The methods match, match? and =~
Method | Description |
---|---|
match? | Returns a boolean: true if the match was successful and false if not |
Example:
/ruby/.match?("ruby regex are not too complicated")
=> true
The String class also has the match? method:
"ruby regex are not too complicated".match? /ruby/
=> true
Method | Description |
---|---|
=~ | If the match was successful, the index of the first position of the matched word is returned, otherwise, nil is returned. |
Example:
/are/ =~ "ruby regex are not too complicated"
=> 11
The number 11 was returned because the expression matched the word are, and its first character is at index 11. We can verify it:
"ruby regex are not too complicated"[11]
=> "a"
The String class also has the =~ method:
"ruby regex are not too complicated" =~ /are/
=> 11
Method | Description |
---|---|
match | If the match was successful, a MatchData object is returned with the matched information, otherwise, nil is returned |
Example:
"ruby regex are not too complicated".match(/ruby/)
=> #<MatchData "ruby">
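To see the three methods side by side, here is a minimal sketch. The helper name compare_match_methods is my own (it is not part of Ruby); it only collects what match?, =~ and match return for the same input, including the nil results when nothing matches.

```ruby
# Hypothetical helper (not part of Ruby): runs the three methods
# against the same input and collects what each one returns.
def compare_match_methods(pattern, text)
  data = pattern.match(text)        # MatchData or nil
  {
    boolean: pattern.match?(text),  # true or false
    index:   pattern =~ text,       # Integer or nil
    matched: data && data[0]        # matched text or nil
  }
end

hit  = compare_match_methods(/regex/,  "ruby regex are not too complicated")
miss = compare_match_methods(/python/, "ruby regex are not too complicated")

puts hit[:index]    # 5
puts hit[:matched]  # regex
p miss[:index]      # nil
p miss[:matched]    # nil
```

Since match returns nil on failure, it is worth checking the result before asking it for [0], as the helper does.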
Special characters ^ $ ? . / \ [] {} () + *
To match a special character literally in a regex, it needs to be escaped. For example:
/\./.match?("Hello.")
=> true
It is also possible to get a match if the dot sign is not escaped, but it is not the match we are looking for, let's prove it:
/./.match("Hello.")
=> #<MatchData "H">
/\./.match("Hello.")
=> #<MatchData ".">
To escape a character, place the backslash \ just before it, as in the previous example with the dot: \.
The dot .
Used in a regex, this character matches any single character in a string except a newline \n
/.ated/.match?("ruby regex are not too complicated")
=> true
If we inspect what the expression matched, we get:
/.ated/.match("ruby regex are not too complicated")
=> #<MatchData "cated">
This expression requires one character (any character) just before the pattern ated
The dot . doesn't match a newline \n
Example:
/.complicated/.match?("ruby regex are not too \ncomplicated")
=> false
Character class [ ]
Inside of the brackets, many characters can be listed, and any of them can be matched
Example: test whether a string contains a vowel:
/[aeiou]/.match?("Hello")
=> true
And more characters can be added to the expression just after the brackets:
/[rty]uby/.match?("ruby regex are not too complicated")
=> true
In this example, the character r inside the brackets plus the characters uby outside them match the word ruby
Ranges inside character class [a-z] [0-9] [A-Z]
Ranges can be created inside the brackets, example:
Syntax | Description |
---|---|
/[a-z]/ | This range matches any single lowercase letter between a and z |
/[0-9]/ | This range matches any single digit |
/[1-5]/ | This range is equivalent to the character list /[12345]/ |
/[a-z]/.match("Ruby regex are not too complicated")
=> #<MatchData "u">
/[0-9]/.match("I am 31 years old")
=> #<MatchData "3">
Range abbreviations
Syntax | Abbreviation | Description |
---|---|---|
/[0-9]/ | /\d/ | Matches any single digit |
/[0-9a-zA-Z_]/ | /\w/ | Matches any single digit, any letter from a to z or A to Z, or the underscore _ |
Abbreviation | Description |
---|---|
/\s/ | This expression is not a range, but it belongs to the abbreviations. It matches a whitespace character: space, tab, newline, carriage return, form feed or vertical tab |
All of these abbreviations have a negated version that matches the opposite of the positive version
Abbreviation | Description |
---|---|
/\D/ | Matches any character that is not a digit |
/\W/ | Matches any character that is not a digit, letter or underscore |
/\S/ | Matches any character that is not whitespace (space, newline, tab and so on) |
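A quick way to get a feel for these abbreviations is to run them over one sample string. The sketch below uses the scan method (covered later in this post) and a helper name, classify_chars, that I made up for the illustration.

```ruby
# Hypothetical helper: collects what \d, \w+, \s and \D pick out of a string.
def classify_chars(text)
  {
    digits:          text.scan(/\d/),        # every single digit
    words:           text.scan(/\w+/),       # runs of letters, digits and _
    whitespace:      text.scan(/\s/),        # spaces, tabs, newlines...
    non_digit_count: text.scan(/\D/).length  # everything that is not a digit
  }
end

result = classify_chars("Room 42_b\tis free")

p result[:digits]      # ["4", "2"]
p result[:words]       # ["Room", "42_b", "is", "free"]
p result[:whitespace]  # [" ", "\t", " "]
```

Note how 42_b stays together as one word, because \w covers digits and the underscore as well as letters.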
Captures
Syntax | Description |
---|---|
() | The pattern indicated inside the parentheses will be captured |
For example, let's capture the strings "Lenin Godinez" and "40" from the next string:
str = "Lenin Godinez,RoR Developer,40 years"
/([A-Za-z]+\s[A-Za-z]+),.+,(\d+)/.match str
=> #<MatchData "Lenin Godinez,RoR Developer,40" 1:"Lenin Godinez" 2:"40">
With the pattern ([A-Za-z]+\s[A-Za-z]+) the words "Lenin Godinez" are captured and, with the pattern (\d+), the number "40" is captured.
All captures are automatically assigned to global variables. In the previous example, the two captures were stored in the global variables $1 and $2, and we can check it using puts:
puts "Name: #{$1}, Age: #{$2}"
Name: Lenin Godinez, Age: 40
=> nil
The captures can also be accessed the same way we get an element from an array, by passing an index:
m = /([A-Za-z]+\s[A-Za-z]+),.+,(\d+)/.match str
puts "Name: #{m[1]}, Age: #{m[2]}"
Name: Lenin Godinez, Age: 40
=> nil
If zero is passed as the index to the m variable, the complete match is returned:
m[0]
=> "Lenin Godinez,RoR Developer,40"
A useful method of the MatchData object is captures, which returns an array with the captures:
m.captures
=> ["Lenin Godinez", "40"]
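Two things are easy to trip over here: match returns nil when nothing matches (so m[1] would raise NoMethodError), and $1 and $2 are overwritten by every later match. A small sketch that avoids both, using captures and a nil check; the names PERSON_PATTERN and parse_person are just illustrative.

```ruby
# Same pattern as in the examples above, stored in a constant (name is my own).
PERSON_PATTERN = /([A-Za-z]+\s[A-Za-z]+),.+,(\d+)/

# Hypothetical helper: returns [name, age] or nil, without relying on $1/$2.
def parse_person(line)
  m = PERSON_PATTERN.match(line)
  return nil unless m          # no match: avoid calling [] on nil

  name, age = m.captures       # destructure the array of captures
  [name, age.to_i]             # captures are strings, so convert the age
end

p parse_person("Lenin Godinez,RoR Developer,40 years")  # ["Lenin Godinez", 40]
p parse_person("this line has no commas")               # nil
```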
Named Captures
Syntax | Description |
---|---|
(?<capture_name>) | Names the capture. The captures continue to be stored in global variables |
Example:
str = "Lenin Godinez,RoR Developer,40 years"
re = /(?<name>[A-Za-z]+\s[A-Za-z]+),.+,(?<age>\d+)/
re.match str
=> #<MatchData "Lenin Godinez,RoR Developer,40" name:"Lenin Godinez" age:"40">
The captures can be accessed the same way we get a value from a hash, by passing a key, where the key is the name of the capture
Example:
m = re.match str
puts "Name: #{m[:name]}, Age: #{m[:age]}"
Name: Lenin Godinez, Age: 40
=> nil
There is also a useful method to get the named captures, named_captures, which returns a hash with the captures:
m.named_captures
=> {"name"=>"Lenin Godinez", "age"=>"40"}
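named_captures returns string keys, while m[:name] accepts symbols. If you prefer a hash with symbol keys, one possible sketch is below. The helper and constant names are my own, and transform_keys assumes Ruby 2.5 or newer.

```ruby
# Same named pattern as above, stored in a constant (name is my own).
NAMED_PERSON_PATTERN = /(?<name>[A-Za-z]+\s[A-Za-z]+),.+,(?<age>\d+)/

# Hypothetical helper: returns the named captures as a symbol-keyed hash,
# or an empty hash when the line does not match.
def person_fields(line)
  m = NAMED_PERSON_PATTERN.match(line)
  return {} unless m

  m.named_captures.transform_keys(&:to_sym)
end

person = person_fields("Lenin Godinez,RoR Developer,40 years")

puts person[:name]            # Lenin Godinez
puts person[:age]             # 40
p NAMED_PERSON_PATTERN.names  # ["name", "age"]
```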
Modifiers ? + * {}
Modifier | Description |
---|---|
? | Represents 0 or 1. It can be used to indicate if a character is optional, and we need to add it just after the optional character |
Example:
/Mrs?\./.match? "Welcome Mrs. Smith"
=> true
If we remove the modifier ? from the pattern and the letter s from "Mrs." in the string, the test fails because the pattern now requires the literal text Mrs.
/Mrs\./.match? "Welcome Mr. Smith"
=> false
Now, if we put the modifier ? back in the pattern, we get a match with "Mr." because the character s in the pattern is optional
/Mrs?\./.match? "Welcome Mr. Smith"
=> true
Modifier | Description |
---|---|
+ | Represents one or more. This modifier absorbs all characters it can as long as the pattern continues matching |
Example:
/\d+/.match("I am 31 years old")
=> #<MatchData "31">
If the modifier is removed from this expression, only the first digit of 31 is matched
/\d/.match("I am 31 years old")
=> #<MatchData "3">
Modifier | Description |
---|---|
* | Represents zero or more. This modifier absorbs all characters it can and matches them even if there are no characters |
Example:
/[a-zA-Z]+:\s[a-zA-Z]+.*/.match("Name: Lenin, Role: RoR Dev")
=> #<MatchData "Name: Lenin, Role: RoR Dev">
# This /[a-zA-Z]+:\s[a-zA-Z]+/ matches "Name: Lenin"
# And this .* placed after the previous pattern, matches the remaining of the string
If we modify the string to be only "Name: Lenin", the expression continues matching since more characters after the word "Lenin" are just optional, thanks to the modifier *
/[a-zA-Z]+:\s[a-zA-Z]+.*/.match("Name: Lenin")
=> #<MatchData "Name: Lenin">
Modifier | Description |
---|---|
{} | This modifier is useful when we want to indicate an exact number of repetitions on the match |
Example:
To match a phone number in the format 111-111-1111:
/\d{3}-\d{3}-\d{4}/.match? "312-123-1234"
=> true
/\d{3}-\d{3}-\d{4}/.match? "312-1234-1234"
=> false
/\d{3}-\d{3}-\d{4}/.match? "3-123-1234"
=> false
Also, we can indicate a minimum and a maximum number of repetitions with a range such as {1,n}
The first number indicates the minimum and the second number indicates the maximum of repetitions.
Example:
/\d{1,4}/.match("0123456789")
=> #<MatchData "0123">
In this example, four digits are matched because we asked for at least one digit and at most four
If the string has fewer digits than the maximum indicated in the pattern, then all of them are matched
/\d{1,6}/.match("0123")
=> #<MatchData "0123">
If the minimum amount is not reached, then the expression returns nil
/\d{5,8}/.match("0123")
=> nil
If the second number of the range is omitted, only the minimum is enforced and the maximum is left open:
/\d{3,}/.match("0123456789")
=> #<MatchData "0123456789">
Also, if the minimum amount is not reached, then the expression returns nil
/\d{3,}/.match("01")
=> nil
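These modifiers can be combined in one pattern. As a sketch, the phone pattern from above can be extended with an optional country code: a group followed by ? makes the whole group optional, and {1,3} limits the code to one to three digits. The format itself and the names PHONE_PATTERN and extract_phone are assumptions for the example, not a real phone number standard.

```ruby
# Assumed format: optional "+<1 to 3 digits> " prefix, then 111-111-1111.
PHONE_PATTERN = /(\+\d{1,3}\s)?\d{3}-\d{3}-\d{4}/

# Hypothetical helper: returns the first phone-like text found, or nil.
def extract_phone(text)
  m = PHONE_PATTERN.match(text)
  m && m[0]
end

p extract_phone("Call +52 312-123-1234 now")  # "+52 312-123-1234"
p extract_phone("Call 312-123-1234 now")      # "312-123-1234"
p extract_phone("Call 12-34 now")             # nil
```

The pattern has no anchors, so it finds a number anywhere inside the text rather than validating the whole string.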
Anchors ^ $ \A \z \Z \b
Anchor | Description |
---|---|
^ | Indicates the beginning of a line. The match occurs at the beginning of it |
Example:
/^\s*#/.match(" # A ruby comment")
=> #<MatchData " #">
The match can also be found at the beginning of a newline
/^\s*#/.match("x = 1\n # A ruby comment")
=> #<MatchData " #">
If the comment appears in the middle of a line, we won't get a match and nil will be returned
/^\s*#/.match(" x = 1 # A ruby comment")
=> nil
Anchor | Description |
---|---|
$ | Indicates the end of a line. The match occurs at the end of it |
Example:
If we want to match the dot that appears at the end of the line of the next string:
/\.$/.match("ruby regex are not too complicated.\ncurrently I am working with them")
=> #<MatchData ".">
Anchor | Description |
---|---|
\A | Indicates the beginning of a string. It is not the same as the beginning of a line |
/\Aruby regex/.match("ruby regex are not too complicated.\ncurrently I am working with them")
=> #<MatchData "ruby regex">
If we try to match the word "currently" that starts on a new line, it is not matched because this anchor does not work for that purpose.
/\Acurrently/.match("ruby regex are not too complicated.\ncurrently I am working with them")
=> nil
Anchor | Description |
---|---|
\z | Indicates the end of a string. It is not the same as the end of a line |
/them\z/.match("ruby regex are not too complicated.\ncurrently I am working with them")
=> #<MatchData "them">
If we try to match the end of a line, it will return nil
/complicated\.\z/.match("ruby regex are not too complicated.\ncurrently I am working with them")
=> nil
Anchor | Description |
---|---|
\Z | Indicates the end of a string. It ignores a single newline at the very end of the string |
/them\Z/.match("ruby regex are not too complicated.\ncurrently I am working with them\n")
=> #<MatchData "them">
If the same string is evaluated with the anchor \z, we won't get a match because of the newline at the end of the string
/them\z/.match("ruby regex are not too complicated.\ncurrently I am working with them\n")
=> nil
Anchor | Description |
---|---|
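\b | Indicates a word boundary: the position between a word character (\w) and a non-word character (\W), or the start or end of the string. It does not consume symbols such as % or # |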
/\b\w+\b/.match("%%%Ruby###")
=> #<MatchData "Ruby">
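The difference between line anchors and string anchors matters most when validating input. In this sketch (the method names are my own), the same multi-line input passes a check written with ^ and $ but fails the one written with \A and \z.

```ruby
# True if at least one LINE of the text consists only of digits.
def digits_on_some_line?(text)
  /^\d+$/.match?(text)
end

# True only if the WHOLE string consists of digits.
def digits_only?(text)
  /\A\d+\z/.match?(text)
end

input = "123\nabc"

p digits_on_some_line?(input)  # true  (the first line is "123")
p digits_only?(input)          # false (the string also contains "\nabc")
p digits_only?("123")          # true
p digits_only?("123\n")        # false (\z does not skip the final newline)
```

For validating a complete value, \A and \z are usually the safer choice.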
LookAhead Assertions
Syntax | Description |
---|---|
(?=) | Requires the pattern inside the lookahead to follow the match, but that pattern is not included in the result |
Example:
There is a list of numbers and I want to match the number that ends with a dot, without including the dot in the result of the match. The lookahead assertion needed is (?=\.) with the dot escaped inside it:
str = "123 456. 789"
/\d+(?=\.)/.match(str)
=> #<MatchData "456">
On the other hand, there is a negative version of this lookahead assertion
Syntax | Description |
---|---|
(?!) | Matches only if the pattern inside the lookahead does not follow immediately |
Example:
/\d+(?!\.)/.match(str)
=> #<MatchData "123">
The result is "123" because it is the first match found for "a series of digits that is not immediately followed by a dot"
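One detail of the negative lookahead is worth trying out: \d+ can give back digits until the assertion is satisfied, so on "456." it still finds "45" (the next character is then "6", not a dot). Adding \b before the lookahead is one way to force whole numbers. The method names in this sketch are my own.

```ruby
# All numbers that are immediately followed by a dot (the dot is not included).
def numbers_before_dot(text)
  text.scan(/\d+(?=\.)/)
end

# First run of digits not followed by a dot; may be only PART of a number.
def first_digits_not_before_dot(text)
  m = /\d+(?!\.)/.match(text)
  m && m[0]
end

# Same idea, but \b requires the run of digits to end at a word boundary.
def first_number_not_before_dot(text)
  m = /\d+\b(?!\.)/.match(text)
  m && m[0]
end

p numbers_before_dot("123 456. 789.")           # ["456", "789"]
p first_digits_not_before_dot("456.")           # "45"
p first_number_not_before_dot("456. 789")       # "789"
```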
LookBehind Assertions
Syntax | Description |
---|---|
(?<=) | This matches a pattern if it is preceded by the pattern indicated inside the lookbehind assertion |
Example:
I want to match the word "regex" only if the word "ruby" is placed just before it in the following string:
str = "ruby regex are not too complicated"
re = /(?<=ruby\s)regex/
re.match str
=> #<MatchData "regex">
And its negative version is:
Syntax | Description |
---|---|
(?<!) | This matches a pattern if it is not preceded by the pattern indicated inside the lookbehind assertion |
str = "ruby regex are not too complicated"
re = /(?<!ruby\s)regex/
re.match str
=> nil
In this example, nil is returned because the word "regex" is preceded by the word "ruby", so the match is not successful. To get a match, the string needs to be modified.
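For instance, changing the word before "regex" is enough to get a match from the negative lookbehind. The sketch below shows that, plus a positive lookbehind that picks out amounts written after a dollar sign. The sample strings and method names are assumptions of mine.

```ruby
# "regex" only when it is NOT preceded by "ruby " (same pattern as above).
def regex_not_after_ruby(text)
  m = /(?<!ruby\s)regex/.match(text)
  m && m[0]
end

# Digits that come right after a "$"; the "$" itself is not part of the match.
def dollar_amounts(text)
  text.scan(/(?<=\$)\d+/)
end

p regex_not_after_ruby("ruby regex are not too complicated")    # nil
p regex_not_after_ruby("python regex are not too complicated")  # "regex"
p dollar_amounts("shirt $25, socks $7, 3 items")                # ["25", "7"]
```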
Modifiers /i /m
Syntax | Description |
---|---|
/i | This modifier makes the expression case insensitive |
Example:
/abc/i.match? "ABC"
=> true
Syntax | Description |
---|---|
/m | This modifier makes the dot also match newlines |
Example:
/.+/m.match "abc\ndef"
=> #<MatchData "abc\ndef">
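To see what /m changes, it helps to run the same pattern with and without it. A minimal sketch follows; the helper name and its keyword argument are my own invention.

```ruby
# Hypothetical helper: returns what /.+/ matches, with or without the m modifier.
def first_run(text, dot_matches_newline:)
  pattern = dot_matches_newline ? /.+/m : /.+/
  m = pattern.match(text)
  m && m[0]
end

p first_run("abc\ndef", dot_matches_newline: false)  # "abc"
p first_run("abc\ndef", dot_matches_newline: true)   # "abc\ndef"

# Modifiers can be combined after the closing slash.
p(/hello.world/im.match?("HELLO\nWORLD"))            # true
```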
Interpolation
Syntax | Description |
---|---|
/#{string}/ | It is possible to interpolate strings inside a regex, just as it is done with strings |
str = "def"
/abc#{str}/
=> /abcdef/
Regexp.escape() method
This method escapes the characters in a string that have a special meaning in a regex.
Example:
Regexp.escape("a.c")
=> "a\\.c"
str = "a.c"
re = /#{Regexp.escape(str)}/
re.match("a.c")
=> #<MatchData "a.c">
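The reason to escape before interpolating becomes clear when the two versions are compared: without Regexp.escape, the dot in "a.c" keeps its special meaning and also matches "abc". A small sketch, with helper names of my own:

```ruby
# Interpolates the text as-is, so special characters stay special.
def raw_pattern(text)
  /#{text}/
end

# Escapes the text first, so it is matched literally.
def literal_pattern(text)
  /#{Regexp.escape(text)}/
end

p raw_pattern("a.c").match?("abc")      # true  (the dot matches "b")
p literal_pattern("a.c").match?("abc")  # false
p literal_pattern("a.c").match?("a.c")  # true
```

This matters most when the interpolated string comes from user input.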
Scan method
This method evaluates a string from left to right and returns an array with all the matches found
Example:
str = "testing 1 2 3 testing 4 5 6"
str.scan(/\d/)
=> ["1", "2", "3", "4", "5", "6"]
It is possible to use captures with the scan method: in that case, an array of arrays is returned, with one inner array of captures per match
Example:
str = "RoR y Ruby JS y React"
str.scan(/([A-Za-z]+)\sy\s([A-Za-z]+)/)
=> [["RoR", "Ruby"], ["JS", "React"]]
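Because scan returns plain arrays, the results chain nicely with other Array methods. As a sketch (the method names are mine): summing the digits found, and turning the captured pairs into a hash with to_h.

```ruby
# Adds up every single digit found in the text.
def digit_sum(text)
  text.scan(/\d/).map(&:to_i).sum
end

# Builds a hash from "X y Z" pairs: each two-capture match becomes key => value.
def pairs_to_hash(text)
  text.scan(/([A-Za-z]+)\sy\s([A-Za-z]+)/).to_h
end

puts digit_sum("testing 1 2 3 testing 4 5 6")  # 21

stack = pairs_to_hash("RoR y Ruby JS y React")
puts stack["RoR"]  # Ruby
puts stack["JS"]   # React
```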
Grep method
When a regex is passed to this method, it tests each element of the array and returns a new array with all the elements that match the pattern
Example:
["USA", "UK", "Francia", "Mexico"].grep(/[a-z]/)
=> ["Francia", "Mexico"]
Only 2 elements from the array matched the expression, and it can be proved as follows:
/[a-z]/.match? "USA"
=> false
/[a-z]/.match? "Francia"
=> true
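Two related variations can be handy: grep_v returns the elements that do not match (it needs Ruby 2.3 or newer), and grep accepts a block that transforms each matching element. A minimal sketch with helper names of my own:

```ruby
# Splits a list into [matching, non_matching] elements for a pattern.
def split_by_pattern(list, pattern)
  [list.grep(pattern), list.grep_v(pattern)]
end

# Returns the matching elements, upcased by the block passed to grep.
def upcased_matches(list, pattern)
  list.grep(pattern) { |item| item.upcase }
end

countries = ["USA", "UK", "Francia", "Mexico"]
matching, rest = split_by_pattern(countries, /[a-z]/)

p matching                              # ["Francia", "Mexico"]
p rest                                  # ["USA", "UK"]
p upcased_matches(countries, /[a-z]/)   # ["FRANCIA", "MEXICO"]
```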
And that is it! All you need to do is practice and practice with them to get more familiar. Feel free to reference this blog post if you have doubts.