Saturday, January 19, 2013

Delphi String Parsing Functions

Intro

To 'parse' a string of characters is to filter it in some programmatic way that lets you pull a piece, or pieces, of useful data out of that string.

Most of the time I create (and re-create) functions that let me parse HTML code from websites; after all, HTML code is just one long string of characters. The kind of data you can extract is anything you'd find in an HTML web page, e.g. image links and text values (think eBay prices, tables, articles, JavaScript/CSS links, etc.).

Seeing as my memory is terrible, I have a bad habit of re-creating code instead of just re-using what I have previously written. This post will serve not only as a tutorial for those of you who need to use these functions, but also as code documentation for my own future reference. I hope it is easy enough to understand :)

Necessary Functions

Example case of finding substringwewant within MainString:

MainString = Now is the winter of our discontent.
StringA = the
StringB = our
substringwewant = winter of

All string parsing functions rely on being able to find a substring (substringwewant) within a larger string (MainString). To do that, they first have to find two other substrings within the main string: StringA, a sequence of characters immediately before substringwewant, and StringB, a sequence of characters immediately after it. Once StringA & StringB are located, it is reasonably straightforward to extract the text between them using existing functions (Copy(), Pos() & PosEx()). This is done in stages:
  1. Locate StringA, within the MainString: The Pos() function is perfect for this, it takes two parameters (String A, Main String) and gives us the index of String A within the main string. The function would be written like this: Pos(StringA,MainString);.
  2. Locate StringB, within the MainString (search only the characters after StringA's position): The PosEx() function is perfect for this, even more so than regular Pos(). The reason is that PosEx() takes a third parameter, an offset, which tells the function at which character of the main string to start looking for StringB. In this case, we want PosEx() to look for StringB only AFTER the location of StringA. This offset search has two benefits: a) it avoids re-checking the characters before StringA, so the search is faster, and b) it makes sure you find the first occurrence of StringB after StringA, and not some other, random occurrence of StringB.
  3. Get the length of StringA: This is done simply by putting StringA into a function called Length(StringA), which returns a number indicating StringA's length.
  4. Get the characters between StringA & StringB: The function called Copy() is what we will use for this (As well as the information gathered in the previous steps). Copy() returns text from within MainString, after it takes three parameters: MainString,Index,Count. Where: 
    1. Index = the location of the first character after StringA = StringAlocation + Length(StringA).
    2. Count = StringBlocation - Index.
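
Written out by hand for the example case above, the four steps look roughly like the sketch below. The program name and the variable names (PosA, PosB, Index, Count) are made up for illustration, and the sketch assumes both StringA and StringB really are present in MainString. It does no checking for a failed search.

```pascal
program ManualExtract;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  StrUtils; // PosEx() lives here; Pos(), Length() and Copy() are built in.

const
  MainString = 'Now is the winter of our discontent.';
  StringA = 'the';
  StringB = 'our';

var
  PosA, PosB, Index, Count: Integer;

begin
  // Step 1: locate StringA within MainString.
  PosA := Pos(StringA, MainString);
  // Step 3: use the length of StringA to find the first character after it.
  Index := PosA + Length(StringA);
  // Step 2: locate StringB, searching only the characters after StringA.
  PosB := PosEx(StringB, MainString, Index);
  // Step 4: copy everything from Index up to (not including) StringB.
  Count := PosB - Index;
  // The brackets make the surrounding spaces visible.
  WriteLn('[', Copy(MainString, Index, Count), ']'); // prints "[ winter of ]"
end.
```

Note that the result keeps the space on either side of "winter of", because those spaces sit between "the" and "our". Include the spaces in StringA and StringB if you don't want them.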
Ahh! So much complexity for what seemed like such a straightforward task: getting a substring from within a main string. Unfortunately, all of these steps are necessary just to extract one substring. If you have to programmatically extract tens or hundreds of substrings from a main string, the process soon becomes very boring, which makes it easy to stuff up. I've done it hundreds of times AND stuffed up heaps along the way. So I decided to pile all of this into one function, CopyBetween(), which takes four parameters and does the lot!

Get text between two strings (CopyBetween())

CopyBetween() returns the string between StringA and StringB after it takes four parameters: FirstString (StringA), SecondString (StringB), SourceString (MainString) and Offset (the 1-based character position in MainString at which the search starts; the default of 1 searches the whole string).

Uses
  StrUtils; // Necessary for PosEx(). Pos() is built in (System unit).


function CopyBetween(const FirstString, SecondString, SourceString: String; Offset: Integer = 1): String;
var
  index1, index2, StartIndex: Integer;
begin
  // Default to an empty string, in case either string is not found.
  Result := '';
  // Store the location of the prefix text (FirstString).
  index1 := PosEx(FirstString, SourceString, Offset);
  // PosEx() returns 0 when nothing is found; give up if there is no prefix.
  if index1 = 0 then
    Exit;
  // Store the location of the first character after the prefix text.
  StartIndex := index1 + Length(FirstString);
  // Store the location of the suffix text (SecondString), searching only after the prefix.
  index2 := PosEx(SecondString, SourceString, StartIndex);
  // Give up if there is no suffix after the prefix.
  if index2 = 0 then
    Exit;
  // Return the text between the two strings.
  Result := Copy(SourceString, StartIndex, index2 - StartIndex);
end;

Copy the above function into your code somewhere, and add StrUtils to the uses clause at the top of your code, or download the source code file at the bottom of this page.
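
Here is a rough usage sketch, including the Offset parameter. The HTML fragment and the program name are invented for the example, and the function is repeated in condensed form only so that the program compiles on its own.

```pascal
program CopyBetweenDemo;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  StrUtils; // For PosEx().

// Condensed copy of CopyBetween(), so that this demo compiles on its own.
function CopyBetween(FirstString, SecondString, SourceString: String; Offset: Integer = 1): String;
var
  Start, Stop: Integer;
begin
  // First character after FirstString.
  Start := PosEx(FirstString, SourceString, Offset) + Length(FirstString);
  // Location of SecondString, searching only after FirstString.
  Stop := PosEx(SecondString, SourceString, Start);
  Result := Copy(SourceString, Start, Stop - Start);
end;

const
  // Made-up HTML fragment, purely for illustration.
  Html = '<b>Price:</b> <i>$12.50</i> <b>Postage:</b> <i>$3.00</i>';

var
  PostagePos: Integer;

begin
  // No Offset given: the search starts at the first character.
  WriteLn(CopyBetween('<i>', '</i>', Html)); // prints "$12.50"

  // Start the search at the word "Postage" to skip the first <i>...</i> pair.
  PostagePos := Pos('Postage', Html);
  WriteLn(CopyBetween('<i>', '</i>', Html, PostagePos)); // prints "$3.00"
end.
```

Passing the position of a nearby landmark as the Offset is a handy way to pick out one value among several that share the same surrounding tags.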

Find out if strings occur in the correct order (IsInOrder())

Sometimes, before you can go ahead and get the text from between two strings, you have to know whether both strings occur in the expected order, e.g. StringA occurring before StringB, and not after it.
Here is a function which returns true if the strings occur in order, or false if they do not.
It takes four parameters: OccursFirst (StringA), OccursSecond (StringB), SourceString (MainString) & Offset (the 1-based character position in MainString at which the searches start; the default of 1 searches the whole string).

Uses
  StrUtils; // Necessary for PosEx(). Pos() is built in (System unit).


function IsInOrder(const OccursFirst, OccursSecond, SourceString: String; Offset: Integer = 1): Boolean;
var
  index1, index2: Integer;
begin
  // Store the location of the first string text.
  index1 := PosEx(OccursFirst, SourceString, Offset);
  // Store the location of the second string text.
  index2 := PosEx(OccursSecond, SourceString, Offset);

  // Return true only if the first string was found and the second string
  // occurs after it. PosEx() returns 0 for "not found", so without the
  // (index1 > 0) check a missing first string would wrongly count as in order.
  Result := (index1 > 0) and (index2 > index1);
end;

Copy the above function into your code somewhere, and add StrUtils to the uses clause at the top of your code (the Windows unit is only needed if you declare the result as Bool rather than Boolean), or download the source code file at the bottom of this page.
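
A quick sketch of how the check behaves on the example sentence from earlier. The program name is made up, and the function is repeated in condensed form only so that the program compiles on its own. In this version a missing OccursFirst (index 0) counts as out of order.

```pascal
program IsInOrderDemo;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  StrUtils; // For PosEx().

// Condensed copy of IsInOrder(), so that this demo compiles on its own.
function IsInOrder(OccursFirst, OccursSecond, SourceString: String; Offset: Integer = 1): Boolean;
var
  index1, index2: Integer;
begin
  index1 := PosEx(OccursFirst, SourceString, Offset);
  index2 := PosEx(OccursSecond, SourceString, Offset);
  // True only if the first string was found and the second one starts after it.
  Result := (index1 > 0) and (index2 > index1);
end;

const
  MainString = 'Now is the winter of our discontent.';

begin
  // "the" comes before "our", so this returns True.
  if IsInOrder('the', 'our', MainString) then
    WriteLn('"the" occurs before "our"');

  // Swapping the arguments returns False.
  if not IsInOrder('our', 'the', MainString) then
    WriteLn('"our" does not occur before "the"');

  // A string that is missing altogether also returns False.
  if not IsInOrder('summer', 'our', MainString) then
    WriteLn('"summer" was not found');
end.
```

Keep in mind that the function compares the first occurrence of each string after Offset, so a SourceString in which OccursSecond appears both before and after OccursFirst is reported as out of order.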

Source code file (DParseUtils.pas):
 - http://www.filehosting.org/file/details/413075/DParseUtils.pas

Any suggested modifications, edits, additions or constructive criticism welcome in the comments section below :)