Document Filtering Using Regular Expressions


This is from a project to extract certain data from legal (real estate) documents that had been scanned into TIFF format. The documents were OCR'd to make them machine readable, then subjected to string searches that extracted the data. Because the various creators of the documents worded them differently, a document type was identified by the appearance of different bracketing phrases in the document. In addition, the scanning process blurred the documents in unpredictable places, which led to 'misspellings' in the OCR output.

I found it useful to write regular-expression-based string processors that could account for different spellings, wordings, and OCR deficiencies, along with routines to delete up to and through the found strings, both from the left and from the right.
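
As an illustration of the kind of pattern this calls for, here is a minimal sketch of an OCR-tolerant expression. The pattern, the class name OcrPatternSketch, and the sample strings are assumptions made up for this example and are not taken from the project; the character classes simply allow for the common o/0 and i/l/1 confusions and for a missing or extra space.

```csharp
using System;
using System.Text.RegularExpressions;

static class OcrPatternSketch
{
    // Hypothetical pattern: accepts 'o' or '0', and 'i', 'l' or '1',
    // with any amount of white space (including none) between the words.
    public const string BorrowerIs = @"\bb[o0]rr[o0]wer\s*[il1]s\b";

    public static bool Matches(string text)
    {
        return Regex.IsMatch(text, BorrowerIs, RegexOptions.IgnoreCase);
    }

    static void Main()
    {
        Console.WriteLine(Matches("The Borrower is John Doe"));   // clean text
        Console.WriteLine(Matches("The B0rrower 1s John Doe"));   // OCR damage
        Console.WriteLine(Matches("The lender is Acme Bank"));    // unrelated phrase
    }
}
```

The looser the character classes, the more OCR damage is tolerated, but also the greater the chance of a false hit, so in practice each pattern has to be tuned against real documents.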

Processing a document, then, meant searching for the various spellings of a pre-amble word or phrase and deleting from the left through that phrase, then searching for a post-amble in like manner and deleting from the right through that phrase.
The result was the text being sought.
Of course, the item being sought could be introduced by several different pre-amble phrases, so it was necessary to devise a routine that searched the document for several phrases and returned the earliest 'hit'.
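
The delete-left, delete-right idea can be sketched in a few lines. This is a simplified, self-contained illustration and not the project code: the class ExtractSketch, the method Between, the two patterns, and the sample sentence are all hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class ExtractSketch
{
    // Returns the text between the first pre-amble match and the first
    // post-amble match that follows it, or null if either is missing.
    public static string Between(string doc, string preamble, string postamble)
    {
        Match pre = Regex.Match(doc, preamble, RegexOptions.IgnoreCase);
        if (!pre.Success)
            return null;
        // delete from the left through the pre-amble
        string rest = doc.Substring(pre.Index + pre.Length);

        Match post = Regex.Match(rest, postamble, RegexOptions.IgnoreCase);
        if (!post.Success)
            return null;
        // delete from the right through the post-amble
        return rest.Substring(0, post.Index).Trim();
    }

    static void Main()
    {
        // Made-up sample text
        string doc = "THIS MORTGAGE is executed between John Q. Doe, " +
                     "hereinafter called the Mortgagor, and Acme Bank.";
        Console.WriteLine(Between(doc, @"executed\s+between", @",\s*hereinafter"));
    }
}
```

With the sample text above, what remains after both deletions is the borrower name.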

using System;
using System.Collections.Generic;
using System.Collections;
using System.Text;
using System.IO;
using System.Text.RegularExpressions;
using System.Windows.Forms;        // MessageBox; assumes a Windows Forms project

namespace docfilter
{
	static class Rgx
	{
		public static MatchCollection oMatches;      // collection of matches
		public static Match oMatch;                  // a match item 

		// -----------------------------------------------------------------------
		//  Search 'strng' for 'patrn' taken literally: regex sensitive symbols
		//  in 'patrn' are escaped before the search
		//  useful when an extracted string is used as a search expression
		//  Returns 1 on a hit (and sets oMatch), otherwise 0
		// -----------------------------------------------------------------------
		public static int RgxExecNrmlzd(string patrn, string strng)
		{
			if (string.IsNullOrEmpty(patrn))         // if empty
				return 0;

			// Regex.Escape handles \ * + ? | { [ ( ) ^ $ . # and white space,
			// including a ^ or $ that is not at the start or end of the string
			return (RgxExec(Regex.Escape(patrn), strng));
		}

		// -----------------------------------------------------------------------
		//  Search target for the first occurrence of patrn (case is ignored)
		//  Sets the match object oMatch
		//  Return 1 on a hit, otherwise 0
		// -----------------------------------------------------------------------
		public static int RgxExec(string patrn, String target)
		{
			Regex rgxo = new Regex(patrn, RegexOptions.IgnoreCase);
			oMatch = rgxo.Match(target);
			return oMatch.Success ? 1 : 0;
		}

		// -----------------------------------------------------------------------
		//  Search target for every occurrence of patrn (case is ignored)
		//  Establish Collection Object oMatches with hits
		//  Return # of hits (entries in oMatches)
		// -----------------------------------------------------------------------
		public static int RgxExecAll(string patrn, String target)
		{
			Regex rgxo = new Regex(patrn, RegexOptions.IgnoreCase);
			oMatches = rgxo.Matches(target);
			return oMatches.Count;
		}

		// ---------------------------------------------
		// deletes target right from the match
		// must be called IMMEDIATELY after rgxexec()
		// ---------------------------------------------
		public static string DeleteRightToMatch(String target)
		{
			int pos = oMatch.Index + oMatch.Value.Length;
			string result = target.Remove(pos, target.Length - pos);
			result = result.Replace("  ", " ");
			return result.Trim();
		}

		// ---------------------------------------------
		// deletes target right including the match
		// removes possible trailing space
		// must be called IMMEDIATELY after rgxexec()
		// ---------------------------------------------
		public static string DeleteRightThroughMatch(String target)
		{
			int pos = oMatch.Index;
			string result = target.Remove(pos, target.Length - pos);
			if (result.Length > 0)
			{
				if (result[result.Length - 1] == ' ')
				{
					result = result.Remove(result.Length - 1, 1);
				}
			}
			return result.Trim();
		}

		// ---------------------------------------------
		// deletes target left to the match
		// must be called IMMEDIATELY after rgxexec()
		// ---------------------------------------------
		public static string DeleteLeftToMatch(String target)
		{
			int pos = oMatch.Index;
			string result = target.Remove(0, pos);
			return result.Trim();
		}

		// ---------------------------------------------
		// deletes target left through the match
		// removes possible leading space
		// must be called IMMEDIATELY after rgxexec()
		// ---------------------------------------------
		public static string DeleteLeftThroughMatch(String target)
		{
			int pos = oMatch.Index + oMatch.Value.Length;
			string result = target.Remove(0, pos);
			return result.Trim();
		}

		// ---------------------------------------------
		//  remove the match from the target
		// ---------------------------------------------
		public static string DeleteTheMatch(String target)
		{
			string result = target.Remove(oMatch.Index, oMatch.Value.Length);
			result = result.Replace("  ", " ");
			return result.Trim();
		}

		// ---------------------------------------------
		//  replace the match in the target
		//  must be called IMMEDIATELY after rgxexec()
		// ---------------------------------------------
		public static string ReplaceTheMatch(String target, string repl)
		{
			int end = oMatch.Index + oMatch.Value.Length;
			string result = target.Substring(0, oMatch.Index);   // left part, up to the match
			result += repl;                                      // replacement
			result += target.Substring(end);                     // right part, after the match
			result = result.Replace("  ", " ");
			return result.Trim();
		}

		// --------------------------------------------------------------------------
		//  load field discovery constraints into an array and rtn it
		//  constraints begin with the passed key and end with a blank line
		//  tabs are stripped and everything up to and including the first / and
		//  everything from and including the last / is stripped
		//  a line may be commented by placing a single quote in col 1
		//
		//  an empty array is returned if the key is not found
		//
		//		The constraint file name is in Glob.inifile
		// --------------------------------------------------------------------------
		public static ArrayList LoadConstraints (string constraint) {
			ArrayList ar = new ArrayList();
			string ln;
			bool found = false;

			using (StreamReader sr = File.OpenText(Glob.inifile)) {
				// --- find start of constraints ---
				while ((ln = sr.ReadLine()) != null) {
					if (ln.Contains(constraint)) {
						found = true;
						break;
					}
				}

				if (!found) {
					MessageBox.Show(constraint + " not found in " + Glob.inifile);
					return (ar);
				}

				// --- unload constraints into arraylist ---
				while ((ln = sr.ReadLine()) != null) {
					ln = ln.Trim();
					if (ln.Length < 2)                  // empty line signals end
						break;
					if (ln[0] == '\'')                  // comment
						continue;
					int p1 = ln.IndexOf('/') + 1;       // first char after leading slash
					int p2 = ln.LastIndexOf('/') - 1;   // last char before trailing slash
					if (p2 - p1 + 1 < 0)                // no /phrase/ pair on this line
						continue;
					ar.Add(ln.Substring(p1, (p2 - p1 + 1)));
				}
			}
			return (ar);
		}

		// --------------------------------------------------------------------------
		//  find the earliest occurrence of the pattern strings in the search string
		//  if any found set the match object and return true
		// --------------------------------------------------------------------------
		public static bool EarliestOccurrence(string ss, ArrayList ar)
		{
			Match earliest = null;       // earliest hit found so far
			for (int i = 0; i < ar.Count; i++)
			{
				if (Rgx.RgxExec((string)ar[i], ss) > 0)
				{
					if (earliest == null || oMatch.Index < earliest.Index)
					{
						earliest = oMatch;
					}
				}
			}

			if (earliest != null)
			{
				oMatch = earliest;        // set oMatch to earliest hit
				return (true);
			}
			return (false);
		}
	}
}


Of particular interest is the function 'EarliestOccurrence(string ss, ArrayList ar)'.
Many of the documents being filtered have different pre-ambles for the sought-for terms because they were prepared by different firms. For example, the mortgage borrower's name may be prefaced by 'executed between', 'between the mortgagor', or 'borrower is', depending upon who prepared the document.
To solve this problem the function 'EarliestOccurrence()' is called with ss set to the document as a string and ar set to an array of the previously mentioned phrases. If a match is made the function returns 'true' and the match object is set to the earliest occurring phrase. You can then call DeleteLeftThroughMatch() and the borrower name will lead the result string.
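
The earliest-hit selection can be shown on its own. The following is a minimal sketch rather than the Rgx class itself: it returns the Match instead of setting a shared match object, and the names EarliestSketch and Earliest, as well as the sample text, are made up for the illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class EarliestSketch
{
    // Tries every pattern and returns the match that starts first in the
    // document, or null when no pattern matches at all.
    public static Match Earliest(string doc, IEnumerable<string> patterns)
    {
        Match best = null;
        foreach (string p in patterns)
        {
            Match m = Regex.Match(doc, p, RegexOptions.IgnoreCase);
            if (m.Success && (best == null || m.Index < best.Index))
                best = m;
        }
        return best;
    }

    static void Main()
    {
        // Made-up sample text containing two of the pre-amble phrases
        string doc = "Mortgage dated 1 May. Borrower is Jane Roe. " +
                     "Executed between the parties.";
        string[] preambles = { @" executed between ", @"\bborrower ?is " };

        Match hit = Earliest(doc, preambles);
        if (hit != null)
        {
            // Equivalent of deleting left through the match
            Console.WriteLine(doc.Substring(hit.Index + hit.Length).Trim());
        }
    }
}
```

Both phrases occur in the sample, but 'Borrower is' comes first, so the name that follows it leads the remaining text.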

The function 'LoadConstraints()' provides a way that the different phrases can be packaged in a text file and retrieved. The constraint name is enclosed in square brackets [name] and followed on separate lines by the phrases enclosed in forward slashes /phrase/.
Here is a sample for Duluth County, Minnesota...
	[AZC_MOR:grantor_preamble]
		/ executed between /
		/ between the mortgagor(\(s\))?/
		/\bborrower ?is /
		/borrower signature name/
		/\bmortgagor ?is /
		/\bmortgagor:/
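
To show how such a section is turned into a list of patterns, here is a minimal sketch that parses the same layout from any TextReader. It follows the rules described above (key line, /phrase/ lines, single-quote comments, blank line as terminator) but is not the LoadConstraints() code: the class ConstraintSketch, the method Parse, and the in-memory sample (including its disabled line and second section) are assumptions for the illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class ConstraintSketch
{
    // Reads the phrases that follow the line containing 'key', up to the
    // first blank line. Each phrase is the text between the first and the
    // last forward slash on its line.
    public static List<string> Parse(TextReader reader, string key)
    {
        var phrases = new List<string>();
        bool inSection = false;
        string ln;
        while ((ln = reader.ReadLine()) != null)
        {
            ln = ln.Trim();                      // strips tabs and outer spaces
            if (!inSection)
            {
                inSection = ln.Contains(key);    // still looking for the key line
                continue;
            }
            if (ln.Length < 2) break;            // blank line ends the section
            if (ln[0] == '\'') continue;         // commented-out line
            int first = ln.IndexOf('/');
            int last = ln.LastIndexOf('/');
            if (first < 0 || last <= first) continue;   // no /phrase/ pair
            phrases.Add(ln.Substring(first + 1, last - first - 1));
        }
        return phrases;
    }

    static void Main()
    {
        // Made-up file content in the layout described above
        string ini =
            "[AZC_MOR:grantor_preamble]\n" +
            "\t/ executed between /\n" +
            "\t'/ disabled phrase /\n" +
            "\t/\\bborrower ?is /\n" +
            "\n" +
            "[other_section]\n" +
            "\t/ignored/\n";

        foreach (string p in Parse(new StringReader(ini), "[AZC_MOR:grantor_preamble]"))
            Console.WriteLine("'" + p + "'");
    }
}
```

Note that spaces inside the slashes are kept, which is what lets a phrase such as / executed between / require a space on either side.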