Hi.
Ever needed to parse a web page and get all the links in it (the hrefs)? The easy way is to use this regular expression to get the href:
Regex r = new Regex("href.*");
For those of you who don't know, this means: get me something that starts with -href- and then whatever follows, up to the end of the line. That's what the -.*- is for. The problem is that now we have to work on the results in order to get the actual link.
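To make the problem concrete, here is a minimal sketch of what the naive pattern returns. The class name, method name, and sample markup are made up for illustration; they are not part of the original code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NaiveHrefDemo
{
    // Returns the raw text matched by the naive pattern (first hit only).
    public static string FirstNaiveMatch(string html)
    {
        Match m = Regex.Match(html, "href.*");
        return m.Success ? m.Value : string.Empty;
    }

    public static void Main()
    {
        // Hypothetical sample markup, for illustration only.
        string html = "<a href=\"http://example.com/a\">A</a> text";

        // Prints: href="http://example.com/a">A</a> text
        Console.WriteLine(FirstNaiveMatch(html));
    }
}
```

The match contains the link, but also the attribute name, the quotes, and everything after them on the same line, so it would still need trimming.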
Extra work? I don’t think so…
We want to use groups, so the regular expression will look like this:
href.*?"(?<href>.*?)"
Or in code (we need to add \ to escape the quotation marks):
Regex MyRegex = new Regex("href.*?\"(?<href>.*?)\"", RegexOptions.Multiline);
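RegexOptions.Multiline makes -^- and -$- match at the start and end of every line rather than only at the start and end of the whole input. A multiline string can be passed as input either way, and since this pattern uses neither anchor, the option is harmless here but not required. Let's break it down: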
href.*?"(?<href>.*?)"
The beginning is the same: -href.*- gets everything that starts with href. Now comes the twist.
The -?- after -.*- makes the match lazy, so -.*?"- stops at the first " it finds. If we drop the -?-, the match will run to the last -"- on the line (greedy!!!).
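The difference between the lazy and the greedy form is easiest to see side by side. The sketch below uses a simplified pattern with a plain numbered group and a made-up tag that has a second quoted attribute; the names are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazyDemo
{
    // Returns the text captured by the first group of the pattern, or "" if there is no match.
    public static string Capture(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }

    public static void Main()
    {
        // Hypothetical sample: a second quoted attribute follows the href.
        string html = "<a href=\"a.html\" title=\"First\">A</a>";

        // Lazy: stops at the first closing quote.
        Console.WriteLine(Capture(html, "href=\"(.*?)\""));  // a.html

        // Greedy: runs on to the last quote on the line.
        Console.WriteLine(Capture(html, "href=\"(.*)\""));   // a.html" title="First
    }
}
```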
Now comes the definition of the group, -(?<href>.*?)-. The syntax for defining a named group is:
(?<GroupName>Rule)
What comes after the group name is the regular expression for the group. In our case the end looks like this:
.*?)"
which means: capture everything up to the first " you see.
That way we get the "clean" URL inside the href group!
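To read the groups, use this code:
// Requires: using System; using System.Text.RegularExpressions;
public static void GetMatches(string s)
{
    // Lazily skip from "href" to the opening quote, then capture up to the closing quote.
    Regex MyRegex = new Regex("href.*?\"(?<href>.*?)\"", RegexOptions.Multiline);
    MatchCollection mc1 = MyRegex.Matches(s);
    Console.WriteLine(MyRegex.ToString());
    foreach (Match m1 in mc1)
    {
        // The named group holds just the URL, without the surrounding quotes.
        Console.WriteLine("URL: {0}", m1.Groups["href"].Value);
    }
}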
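For trying this out, a complete program might look roughly like the sketch below. It uses the same pattern, but returns the links as a list instead of printing inside the loop. The class name, the ExtractLinks method, and the sample HTML are assumptions added for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefExtractor
{
    // Same pattern as above: lazily skip from "href" to the first quote,
    // then lazily capture up to the next quote into the group named "href".
    private static readonly Regex HrefRegex =
        new Regex("href.*?\"(?<href>.*?)\"", RegexOptions.Multiline);

    // Collects the value of the "href" group from every match.
    public static List<string> ExtractLinks(string html)
    {
        var links = new List<string>();
        foreach (Match m in HrefRegex.Matches(html))
        {
            links.Add(m.Groups["href"].Value);
        }
        return links;
    }

    public static void Main()
    {
        // Hypothetical two-line sample input.
        string html =
            "<a href=\"http://example.com/one\">One</a>\n" +
            "<a class=\"x\" href = \"/two.html\">Two</a>";

        foreach (string link in ExtractLinks(html))
        {
            Console.WriteLine("URL: {0}", link);
        }
        // Prints:
        // URL: http://example.com/one
        // URL: /two.html
    }
}
```

Note that this pattern only handles double-quoted attribute values; an href written with single quotes or without quotes would not be captured correctly.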
Have fun!!
Amit
Andreja Ilic Said on Sep 15, 2008 :
Very nice… Recently I had a similar problem, and I used my own classes.
What I was going to ask is: do you know how fast this is? Is it implemented as a grammar or as string manipulation?
Thanks
Andreja Ilic
Amit Said on Sep 16, 2008 :
It is as fast as it gets :).
Especially if you add
RegexOptions.Compiled
With that option, the expression is compiled into IL code when the Regex object is constructed, so most of the work is done once up front rather than on every match. The trade-off is that constructing the Regex takes longer.
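A minimal sketch of how the option might be used, assuming the Regex is built once and then reused for many inputs (which is where the one-time cost tends to pay off). The class and method names are illustrative, and no timing figures are claimed here.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CompiledHrefDemo
{
    // Built once and reused, so the extra construction cost of
    // RegexOptions.Compiled is paid a single time.
    private static readonly Regex HrefRegex =
        new Regex("href.*?\"(?<href>.*?)\"", RegexOptions.Compiled);

    // Returns the first link found, or "" if the input contains none.
    public static string FirstLink(string html)
    {
        Match m = HrefRegex.Match(html);
        return m.Success ? m.Groups["href"].Value : string.Empty;
    }

    public static void Main()
    {
        // Hypothetical sample input.
        Console.WriteLine(FirstLink("<a href=\"/index.html\">Home</a>"));  // /index.html
    }
}
```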