How to extract URLs using regex
Scroll down to see why you came here. First, let's have a look at some theory based on FreeBSD's re_format(7):
Regular expression theory for sysadmins
What is an extended aka modern regular expression? One or more non-empty branches, separated by |.
What does that mean?
A branch is one or more pieces, concatenated. It matches a match for the first piece, followed by a match for the second, and so on.
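To make the branch vocabulary concrete, here is a minimal sketch using grep -E. The sample words are made up for illustration, and -x is only used to show the difference between matching anywhere in a line and matching the whole line.

```shell
# Hypothetical sample input, one candidate per line.
sample='cat
dog
catdog
bird'

# Two branches separated by |; a line is printed if either branch matches.
# Each branch is a concatenation of three single-character pieces.
either=$(printf '%s\n' "$sample" | grep -E 'cat|dog')

# -x only accepts matches that cover the whole line.
whole=$(printf '%s\n' "$sample" | grep -Ex 'cat|dog')

printf 'either branch:\n%s\n\nwhole line:\n%s\n' "$either" "$whole"
```

The first command also prints catdog because a branch may match anywhere in the line; the second one does not because neither branch covers the whole line.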
A piece is an atom, possibly followed by *, +, ? or a bound.
- An atom followed by a * matches a sequence of 0 or more matches of the atom.
- An atom followed by a + matches a sequence of 1 or more matches of the atom.
- An atom followed by a ? matches a sequence of 0 or 1 matches of the atom.
- A bound is written in curly brackets, for example {2,8}. The numbers must be integers from 0 to 255 (inclusive), and the first number must not exceed the second.
- An atom followed by {8} matches a sequence of exactly 8 matches of the atom.
- An atom followed by {8,} matches a sequence of 8 or more matches of the atom.
- An atom followed by {0,42} matches a sequence of 0 through 42 (inclusive) matches of the atom.
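The repetition rules above can be checked quickly on the command line. This is a small sketch, not part of re_format(7): the sample lines and the helper name count are made up, and the atom being repeated is simply the letter b.

```shell
# Hypothetical sample input: a, then 0 to 4 times b, then c.
sample='ac
abc
abbc
abbbc
abbbbc'

# count PATTERN: number of sample lines that PATTERN matches in full (-x).
count() { printf '%s\n' "$sample" | grep -Ecx -e "$1"; }

for re in 'ab*c' 'ab+c' 'ab?c' 'ab{2}c' 'ab{3,}c' 'ab{0,2}c'; do
    printf '%-10s matches %s of 5 lines\n' "$re" "$(count "$re")"
done
```

Without -x the counts would be misleading for some inputs, because grep would accept a match anywhere inside a longer line.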
An atom is:
- A regular expression enclosed in (), matching a match for said regular expression.
- An empty set of (), matching the null string.
- A dot . which matches any single character.
- A caret ^ which matches the null string at the beginning of a line.
- A dollar sign $ which matches the null string at the end of a line.
- A single escaped special character, matching the special character:
- Escape character: \
- Characters to be escaped if you want to match them: ^ . [ $ ( ) | * + ? { \
- A single combination of the escape character \ and any character that does not need escaping, matching that character as if the \ had not been present.
- A single ordinary character, matching that character.
- NOTE: an opening curly bracket { followed by a character other than a digit is an ordinary character, not the beginning of a bound. It is illegal to end an RE with the escape character \.
- A bracket expression is a list of characters enclosed in []. With the exception of the following, and of some combinations using [, all special characters, including the escape character \, lose their special significance within a bracket expression:
- It normally matches any single character from the list: [xyz] would match either a single x, a single y or a single z.
- If the list begins with a ^, as in [^xyz], it matches any single character not from the rest of the list.
- If two characters in the list are separated by a hyphen -, as in [a-f] or [0-9], this is shorthand for the full range of characters between the two (inclusive) in the collating sequence. The two characters are called the endpoints of the range. Two ranges cannot share an endpoint, so the following is illegal: [n-o-t].
- To include a literal ] in the list, make it the first character of the list (after the ^, if any): []xyz], or with negation [^]xyz]. Without the letters this becomes []] and [^]].
- To include a literal hyphen - in the list, make it the first or last character of the list, as in [-xyz] or [xyz-], or the second endpoint of a range, as in [+--]. To use the hyphen as the first endpoint of a range, enclose it in [. and .] to make it a collating element: [[.-.]-9]
- Within a bracket expression a collating element is:
- A character.
- A multi-character sequence that collates as if it were a single character.
- Or a collating sequence name for either.
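- A collating element enclosed in [. and .] stands for the sequence of characters of that collating element.
- The resulting sequence is a single element of the bracket expression's list, so a bracket expression containing a multi-character collating element can match more than one character.
- Example: if the collating sequence includes a ch collating element, the RE [[.ch.]]*c matches the first five characters of chchcc.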
- Within a bracket expression a collating element enclosed in [= and =] is an equivalence class:
- An equivalence class stands for the sequence of characters of all collating elements equivalent to that one, including itself.
- If there are no other equivalent collating elements, the treatment is as if the enclosing delimiters were [. and .]
- An equivalence class may not be an endpoint of a range.
- Example: if x and y are the members of an equivalence class, then [[=x=]], [[=y=]] and [xy] are all synonyms.
- Within a bracket expression the name of a character class enclosed in [: and :] stands for the list of all characters belonging to that class:
- A character class cannot be used as an endpoint of a range. The classes are based on ctype(3) (its man page references the man pages of the equivalent functions), so their exact membership depends on locale settings and implementation choices rather than on exact science. The ranges listed here are what you get in the C locale.
- It matches any single character of the character class.
- alnum: [a-z], [A-Z] and [0-9]
- alpha: [a-z] and [A-Z]
- cntrl: a bunch of control chars such as EOT (end of transmission, octal value 004)
- digit: [0-9]
- graph: alnum and punct
- lower: [a-z]
- print: graph and " " (space)
- punct: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
- upper: [A-Z]
- xdigit: hexadecimal numbers, i.e. [0-9], [a-f] and [A-F]
- space: "\t" "\n" "\v" "\f" "\r" " " (tab, newline, vertical tabulation, form feed, carriage return, space)
- blank: "\t" (tab) and " " (space)
- Example: [[:alnum:]] or, negated, [^[:lower:]]
- A word (in the context of the following two null-string word delimiters) is a sequence of word characters that is neither preceded nor followed by a word character. A word character is an alnum character (see above) or an underscore.
- [[:<:]] matches the null string at the beginning of a word.
- [[:>:]] matches the null string at the end of a word.
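The bracket expression rules are easier to remember after trying them once. The following is a rough sketch under a few assumptions: the sample characters and the helper name pick are made up, and the C locale is forced so that ranges and classes behave as in the lists above. The collating element, equivalence class and word boundary forms are left out because support for them varies between implementations.

```shell
# Use the C locale so ranges and classes behave as in the lists above.
LC_ALL=C
export LC_ALL

# Hypothetical sample input: one character per line.
sample='x
q
]
-
7
A'

# pick PATTERN: the sample characters matched by PATTERN, joined on one line.
pick() { printf '%s\n' "$sample" | grep -Ex -e "$1" | tr -d '\n'; }

for re in '[xyz]' '[^xyz]' '[]xyz]' '[-xyz]' '[p-r]' \
          '[[:digit:]]' '[^[:lower:]]' '[[:alnum:]_]'; do
    printf '%-14s -> %s\n' "$re" "$(pick "$re")"
done
```

Note how []xyz] picks up the literal ] and [-xyz] the literal hyphen, purely because of their position in the list.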
Regular expression to extract URLs
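# NOTE: source HTML contains HTML escapes
# NOTE: inside a bracket expression \ is an ordinary character, so \t, \n and \r would not mean tab, newline and carriage return there; [:space:] is used instead
grep -E '(((http|https|ftp|gopher)|mailto):(//)?[^[:space:]<>"]*|(www|ftp)[0-9]?\.[-a-z0-9.]+)[^[:space:].,:;<">)]?[^,[:space:]<>"]*[^[:space:].,:;<">)]' myURLs.txt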
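The grep command above prints every line that contains a match. If only the URLs themselves are wanted, the -o option (available in GNU and FreeBSD grep, though not required by POSIX) prints just the matched text. The sketch below uses a deliberately simplified pattern and made-up sample text, both of which are assumptions for illustration and not a replacement for the full expression.

```shell
# Hypothetical sample text; the example.com and example.org URLs are made up.
text='Visit http://example.com/test.html, then read on.
<a href="https://example.org/a?b=1">link</a>
no links here'

# Simplified pattern (an assumption, not the full expression above):
#   https?://              scheme
#   [^[:space:]<>"]*       URL body, stopping at whitespace, <, > or "
#   [^[:space:]<>".,;:)]   last character, so trailing punctuation is dropped
pattern='https?://[^[:space:]<>"]*[^[:space:]<>".,;:)]'

# -o prints each match on its own line instead of the whole input line.
urls=$(printf '%s\n' "$text" | grep -Eo -e "$pattern")

printf '%s\n' "$urls"
```

The final bracket expression does the same job as the last one in the full expression: it keeps a comma or full stop that merely follows the URL in running text out of the result.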