
Regular expressions

To search for and match patterns in text and other data, regular expressions are an indispensable tool for the data scientist. Julia adheres to the Perl syntax of regular expressions. For a complete reference, refer to http://www.regular-expressions.info/reference.html. Regular expressions are represented in Julia as a double (or triple) quoted string preceded by r, such as r"..." (optionally, followed by one or more of the i, s, m, or x flags), and they are of type Regex. The regexp.jl script shows some examples.
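As a rough illustration of the literal syntax and the flags mentioned above, the following sketch checks the type of a regex literal and shows the effect of each flag. The sample strings are made up for this example and are not taken from regexp.jl.

```julia
# A regex literal is an ordinary value of type Regex.
pattern = r"julia"
println(typeof(pattern))                #> Regex

# i: case-insensitive matching
println(occursin(r"julia", "JULIA"))    #> false
println(occursin(r"julia"i, "JULIA"))   #> true

# s: the dot also matches a newline
println(occursin(r"a.b", "a\nb"))       #> false
println(occursin(r"a.b"s, "a\nb"))      #> true

# m: ^ and $ match at the start and end of each line
println(occursin(r"^b$", "a\nb"))       #> false
println(occursin(r"^b$"m, "a\nb"))      #> true

# x: whitespace inside the pattern is ignored
println(occursin(r"a b"x, "ab"))        #> true
```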

In the first example, we will match an email address (#> shows the result):

email_pattern = r".+@.+" 
input = "john.doe@mit.edu" 
println(occursin(email_pattern, input)) #> true 

The regular expression pattern .+ matches any non-empty sequence of characters (by default, the dot matches any character except a newline). Thus, this pattern matches any string that contains @ with at least one character on either side of it.

In the second example, we will check whether a credit card number has the format of a Visa number, that is, a 4 followed by 12 or 15 more digits:

visa = r"^(?:4[0-9]{12}(?:[0-9]{3})?)$"  # the pattern 
input = "4457418557635128" 
occursin(visa, input)  #> true 
if occursin(visa, input) 
    println("credit card found") 
    m = match(visa, input) 
    println(m.match) #> 4457418557635128 
    println(m.offset) #> 1 
    println(m.offsets) #> Int64[] (empty, the pattern has no capture groups)
end 

The occursin(regex, string) function returns true or false, depending on whether the given regex matches the string, so we can use it in an if expression. If you want the detailed information of the pattern matching, use match instead of occursin. This either returns nothing when there is no match, or an object of type RegexMatch when the pattern is found (nothing is, in fact, a value to indicate that nothing is returned or printed, and it has a type of Nothing).
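Because match can return nothing, it is usually wise to test the result before accessing its fields. The following is a minimal sketch of that check; the describe function is a hypothetical helper introduced only for this illustration.

```julia
visa = r"^(?:4[0-9]{12}(?:[0-9]{3})?)$"

# Hypothetical helper: report what match found, or that it found nothing.
function describe(pattern, number)
    m = match(pattern, number)
    if m === nothing            # no match, so there are no fields to read
        return "no match"
    end
    return "matched $(m.match) at offset $(m.offset)"
end

println(describe(visa, "4457418557635128")) #> matched 4457418557635128 at offset 1
println(describe(visa, "1234"))             #> no match
println(typeof(match(visa, "1234")))        #> Nothing
```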

The RegexMatch object has the following properties:

  • match contains the entire substring that matches (in this example, it contains the complete number)
  • offset states at what position the matching begins (here, it is 1)
  • offsets gives the same information as the preceding line, but for each of the captured substrings
  • captures contains the captured substrings as an array (refer to the following example)

Besides checking whether a string matches a particular pattern, regular expressions can also be used to capture parts of the string. We can do this by enclosing parts of the pattern in parentheses ( ). For instance, to capture the username and hostname in the email address pattern used earlier, we modify the pattern as follows:

email_pattern = r"(.+)@(.+)" 

Notice how the characters before and after @ are enclosed in parentheses. This tells the regular expression engine that we want to capture these two groups of characters. To see how this works, consider the following example:

email_pattern = r"(.+)@(.+)" 
input = "john.doe@mit.edu" 
m = match(email_pattern, input) 
println(m.captures) #> Union{Nothing, SubString{String}}["john.doe", "mit.edu"]
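The captured groups can also be read one at a time. The sketch below indexes the RegexMatch object directly and destructures the captures array. The second input string, "a@b@c", is an invented example showing that .+ is greedy, so the first group extends up to the last @.

```julia
email_pattern = r"(.+)@(.+)"
m = match(email_pattern, "john.doe@mit.edu")

# Index the match object to get a single captured group.
println(m[1])               #> john.doe
println(m[2])               #> mit.edu

# Or unpack the captures array into separate variables.
user, host = m.captures
println("$user at $host")   #> john.doe at mit.edu

# .+ is greedy: with two @ signs, the first group runs up to the last one.
m2 = match(email_pattern, "a@b@c")
println(m2[1])              #> a@b
println(m2[2])              #> c
```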

Here is another example:

m = match(r"(ju|l)(i)?(a)", "Julia") 
println(m.match) #> lia 
println(m.captures) #> Union{Nothing, SubString{String}}["l", "i", "a"] 
println(m.offset) #> 3 
println(m.offsets) #> [3, 4, 5] 

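The findfirst and replace functions also take regular expressions as arguments (findfirst replaces the search function of earlier Julia versions); for example, replace("Julia", r"u[\w]*l" => "red") returns "Jredia". If you want to work with all the matches, eachmatch comes in handy (the older matchall function was removed in Julia 1.0):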
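A short sketch of searching and replacing with regular expressions follows. The patterns r"l.a" and r"xyz" and the substitution string are made up for this illustration. The s"..." form is a substitution string in which \1 and \2 refer to the captured groups.

```julia
word = "Julia"

# replace takes a regex => replacement pair.
println(replace(word, r"u[\w]*l" => "red"))     #> Jredia

# findfirst returns the index range of the first match, or nothing.
println(findfirst(r"l.a", word))                #> 3:5
println(findfirst(r"xyz", word) === nothing)    #> true

# A substitution string can reuse captured groups in the replacement.
swapped = replace("john.doe@mit.edu", r"(.+)@(.+)" => s"\2 hosts \1")
println(swapped)                                #> mit.edu hosts john.doe
```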

str = "The sky is blue"
reg = r"[\w]{3,}" # matches words of 3 chars or more 
r = collect((m.match for m = eachmatch(reg, str)))
show(r) #> SubString{String}["The", "sky", "blue"]

iter = eachmatch(reg, str) 
for i in iter 
    println("\"$(i.match)\" ") 
end 

Here, the collect function gathers the matched substrings (the match field of each RegexMatch) into an array. eachmatch returns an iterator, iter, over all the matches, each of which is a RegexMatch object, so we can loop through them with a simple for loop. The screen output is "The", "sky", and "blue", printed on consecutive lines.
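Since each item produced by eachmatch is a full RegexMatch, fields other than match are available inside the loop as well. As a small sketch, the following prints where each word starts. The variable names words and starts are chosen for this example only.

```julia
str = "The sky is blue"
reg = r"[\w]{3,}"   # words of 3 characters or more

# Each iteration yields a RegexMatch, so offset is available too.
for m in eachmatch(reg, str)
    println(m.match, " starts at index ", m.offset)
end
#> The starts at index 1
#> sky starts at index 5
#> blue starts at index 12

# Comprehensions collect selected fields into arrays.
words  = [m.match for m in eachmatch(reg, str)]
starts = [m.offset for m in eachmatch(reg, str)]
```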
