Dec 13th, 2007

How to use Regular Expressions

Filed under C# | Posted by Amit

Hi.
Ever needed to parse a web page and get all the links in it (the hrefs)? The easy way is to use this regular expression to get the href:

Regex r = new Regex("href.*");

For those of you who don't know, this means: get me something that starts with -href- and then whatever follows, up to the end of the line. That's what the -.*- is for. The problem is that now we have to work on the results in order to get the actual link.
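To see why the results need more work, here is a minimal sketch that runs the naive pattern on one anchor tag. The class name, method name and sample HTML are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NaiveHrefDemo
{
    // Returns the raw text of the first match of the naive pattern.
    public static string FirstNaiveMatch(string html)
    {
        Regex r = new Regex("href.*");
        return r.Match(html).Value;
    }

    public static void Main()
    {
        // Hypothetical sample input.
        string html = "<a href=\"http://example.com/\">Example</a>";

        // The match starts at "href" and runs to the end of the line,
        // so the URL is still buried inside the result.
        Console.WriteLine(FirstNaiveMatch(html));
        // prints: href="http://example.com/">Example</a>
    }
}
```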

Extra work? I don’t think so…

We want to use groups, so the regular expression will look like this:

href.*?"(?<href>.*?)"

Or in code (each quote needs a \ in front of it to escape it inside a C# string):

Regex MyRegex = new Regex("href.*?\"(?<href>.*?)\"", RegexOptions.Multiline);

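RegexOptions.Multiline makes ^ and $ match at the start and end of every line instead of only at the start and end of the whole input. Our pattern uses neither anchor, so the option is harmless here; a multiline string can be passed as input with or without it. Let's break the pattern down: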

href.*?"(?<href>.*?)"

The beginning is the same, -href.*-: get everything that starts with href. Now comes the twist.

The -?- after -.*- makes the match lazy, so -.*?"- means stop on the first " you find. If we drop the -?-, the match runs on to the last -"- on the line that still lets the rest of the pattern match (greedy!!!).
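The difference is easy to check with a small sketch that runs both versions of the pattern on a line holding two links. The class name, helper name and sample HTML are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class GreedyVsLazyDemo
{
    // Collects the value of the "href" group from every match.
    public static List<string> Capture(string pattern, string input)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            results.Add(m.Groups["href"].Value);
        }
        return results;
    }

    public static void Main()
    {
        // Hypothetical sample input: two links on one line.
        string html = "<a href=\"a.html\">A</a> <a href=\"b.html\">B</a>";

        // Lazy: stops at the first quote after each href.
        Console.WriteLine(string.Join(", ", Capture("href.*?\"(?<href>.*?)\"", html)));
        // prints: a.html, b.html

        // Greedy: the first href swallows the line up to the last usable quote,
        // so only one match is found and it captures the second link.
        Console.WriteLine(string.Join(", ", Capture("href.*\"(?<href>.*?)\"", html)));
        // prints: b.html
    }
}
```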

Now comes the definition of the group, -(?<href>.*?)-. The syntax for defining a named group is:

(?<GroupName>Rule)

What comes after the group name is the regular expression for the group. In our case the end looks like this:

.*?)"

which means get everything until the first " you see.

That way we get the "clean" URL inside the href group!

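To use the groups, use this code:

// Requires: using System; and using System.Text.RegularExpressions;
public static void GetMatches(string s)
{
    // Capture the text between the first pair of quotes after "href" in a group named "href".
    Regex MyRegex = new Regex("href.*?\"(?<href>.*?)\"", RegexOptions.Multiline);
    MatchCollection mc1 = MyRegex.Matches(s);

    // Print the pattern itself, then the URL captured by each match.
    Console.WriteLine(MyRegex.ToString());
    foreach (Match m1 in mc1)
    {
        Console.WriteLine("URL: {0}", m1.Groups["href"].Value);
    }
}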
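If you would rather get the links back than print them, something along these lines should work. It is only a sketch built on the same pattern; the HrefExtractor class, the GetLinks method and the sample HTML are made-up names for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefExtractor
{
    // Same pattern and option as in the post, built once and reused.
    private static readonly Regex HrefRegex =
        new Regex("href.*?\"(?<href>.*?)\"", RegexOptions.Multiline);

    // Returns the value of the "href" group for every match in the input.
    public static List<string> GetLinks(string html)
    {
        var links = new List<string>();
        foreach (Match m in HrefRegex.Matches(html))
        {
            links.Add(m.Groups["href"].Value);
        }
        return links;
    }

    public static void Main()
    {
        // Hypothetical sample input spread over two lines.
        string html =
            "<a href=\"http://example.com/\">Home</a>\n" +
            "<a class=\"nav\" href=\"/about.html\">About</a>";

        foreach (string link in GetLinks(html))
        {
            Console.WriteLine("URL: {0}", link);
        }
        // prints:
        // URL: http://example.com/
        // URL: /about.html
    }
}
```

Keep in mind that the pattern only looks for double quotes, so links written with single quotes or with no quotes at all will not be picked up.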

Credit to Shahar A.

Have fun!!

Amit
