Foxhound is the better* Database Monitor for SQL Anywhere.
*better: More thorough, more relevant, more effective.
...more Alerts, more All Clears, more details, more control in your hands.


Breck Carter
Last modified: June 14, 1996
mail to: bcarter@bcarter.com



The Dreaded Meta-Tilde (II)

...continued from The Dreaded Meta-Tilde (I)

"I can't tell you how frustrating it has been working with Match(). Please tell me how to let the user type car and have it match Breck Carter, carrot, car lot, etc."

Yowza! The roller coaster just went ballistic and I'm gonna lose my lunch. SetFilter gets really interesting when you include calls to the Match() function to do powerful searches. Listing 3 shows a simple SetFilter call that matches all rows that have the string "car" embedded anywhere within string_column.

Listing 3: SetFilter With Match

   li_RC = dw_groups.SetFilter &
      ( 'match ( string_column, ' &
      + '".*[Cc][Aa][Rr].*" )' )
   if li_RC <> 1 then
      MessageBox ( 'Error', 'SetFilter failed' )
      return
   end if

The periods, asterisks and square brackets are called "metacharacters" in PowerBuilder-speak, meaning they have special meanings other than simply being characters that match themselves. The letters CcAaRr are non-metacharacters because they do match themselves. The whole bizarre string is sometimes called a "regular expression", one of those oxymorons we're so fond of using like "transparent interface" and "portable software".

The code in Listing 3 can be expressed in English as follows: Show only the rows where string_column contains one of the letters C or c, followed by A or a, then R or r, preceded by zero or more occurrences (represented by the first asterisk) of any character (the period), and followed by zero or more occurrences of any character.

Whew! In other words, it matches the substring "car" in upper or lower case embedded anywhere in string_column. I once coded Hewlett Packard Printer Control Language (PCL) by hand, and it wasn't much harder than getting regular expressions right. Both are tiny little programming languages where single characters are combined to form powerful operators.

"But I want the user to just type 'car', not [Cc][Aa][Rr]. How is that accomplished?"

Well, you could break the value into single characters and build the match string, but it's a lot simpler to convert both the user input and the DataWindow column to lower case. Listing 4 shows how the user can type a word in a SingleLineEdit and have it used in a case insensitive match. Note that leading and trailing spaces are removed from the user's input but not the DataWindow column.

Listing 4: Case Insensitive Match

   ls_match = trim ( lower ( sle_match.text ) )
   ls_match &
      = 'match( lower( string_column ), ".*' &
      + ls_match &
      + '.*" )'

   li_RC = dw_groups.SetFilter ( ls_match )
   if li_RC <> 1 then
      MessageBox ( 'Error', 'SetFilter failed' )
      return
   end if

"What about matching 'car' at the beginning of words like 'carrot' but not embedded within words like 'incarnate'?"

Incarnate indeed. Match can do what you want but first you must figure out exactly what that is; i.e., how is "the beginning of a word" defined? The answer might be "a prefix is a substring that appears at the beginning of the string, or anywhere in the string if it's preceded by a blank." Listing 5 shows how the caret metacharacter can be used to match the beginning of the string, and how the "or" operator may be used to combine multiple match calls in one SetFilter call. These two match calls find 'car' at the beginning of the string, or one blank plus 'car' anywhere.

Listing 5: Simple Prefix Match

   ls_match &
      = 'match( lower( string_column ), "^' &
      + ls_match &
      + '.*" ) or ' &
      + 'match( lower( string_column ), ".* ' &
      + ls_match &
      + '.*" )'

"That works on 'Breck Carter' but what about 'Jones,Carmen'?"

Yes, well, I guess that calls into doubt our definition of prefix. Let's refine it to include the words "following a separator character" instead of just a blank. And we'll include a few other possibilities besides the comma. Listing 6 shows how the square bracket metacharacters can be used to match one blank, comma, colon, semicolon or underscore so it will find 'car' in 'Jones;Carmen' as well as 'Jones,Carmen'.

Listing 6: Prefixes Follow Separators

   ls_match &
      = 'match( lower( string_column ), "^' &
      + ls_match &
      + '.*" ) or ' &
      + 'match( lower( string_column ), ' &
      + '".*[ ,_:;]' &
      + ls_match &
      + '.*" )'

"What about finding 'ted' or 'day' in Edward "Ted" O'Day? How about 'com' in fpostle@powersoft.com?"

Listing 2 dealt with searching for quotes. Now we're talking about treating quotes as separator characters. In the first case, the program handled quotes embedded in the search string, while in this case the program must embed quotes in the match string. Eventually we'll want to do both, but first it's time to talk seriously about tildes. Here are some hints that might help:

  1. Tildes serve two purposes: to denote special characters like tilde-t for tab, and to embed tildes and quotes within a string literal.
  2. PowerBuilder will strip one layer of tildes from a string literal whenever it is parsed.
  3. Stripping of tildes means to repeatedly look for the next tilde, throw it away, and then store following character as-is in the case of quotes and tildes, or as the real character it represents as in the case of tilde-t for tab.
  4. Once a special character pair like tilde-t has been replaced with its corresponding real character, nothing further can happen to it. In other words, it's now a tab and will remain a tab. Special characters aren't the problem. It's character pairs like tilde-tilde and tilde-quote that cause all the confusion.
  5. Parsing of string literals and the associated stripping of tildes can happen twice to the same string value.
  6. The first time parsing happens is when a string literal is assigned to a PowerScript variable or passed as an argument in a function call and the script is compiled and saved. Think of this as the first compile.
  7. The second happens at runtime when a string literal or variable has been passed to some DataWindow function like Modify or SetFilter, and it contains a quoted string embedded within it that is parsed by the DataWindow engine. Think of this as a second compile.
  8. Single and double quotes are interchangeable in pairs, meaning you are completely free to choose either kind to wrap around a string literal.
  9. Single quotes may be embedded within a double quoted string literal, and vice versa, without the need for tildes. This helps a lot for the first level of nesting. For a second level of nesting (quotes within quotes within quotes) tildes are a necessity.
  10. Nesting level is different from the number of times a string is parsed. They are related, and they certainly affect one another, but it helps to understand them separately.
  11. If you want a tilde to survive the first compile and still be a tilde when the string is parsed for a second time, you must code it as two tildes. For example, tilde-tilde-quote will be tilde-quote after the first compile, and just a quote after the second.
  12. If you really want a tilde to survive both compiles, and still be a tilde when all is said and done, you must code it as four tildes. These four tildes will become two tildes after the first compile, and one tilde after the second.

Does your head hurt now? I once knew a Physics professor who used post-graduate thesis topics as questions on first-year mid-terms. The class would average about 10% but every once in a while some front-row genius would come up with a useful idea. If he was teaching PowerBuilder, here is what one of those questions might be: One, two or four tildes are all you ever need, never three. Two tildes are often superfluous, and the need for four is rare. The key is to carefully choose double quotes inside singles, or vice versa. True or false? Please explain.

I think I know the answer, but my head hurts too. Send your answer to bcarter@bcarter.com. For extra points, explain why the four tildes in Listing 2 are an example of point 12 above. And have a look at Listing 7, and explain why single and double-tildes are used.

Listing 7: Quotes as Separators

   ls_match &
      = 'match( lower( string_column ), "^' &
      + ls_match &
      + '.*" ) or ' &
      + 'match( lower( string_column ), ' &
      + '".*[ ,_:;~'~~"]' &
      + ls_match &
      + '.*" )'

"What about finding 'com' in fpostle@powersoft.com? Or 'car' in Smythe-Carter?"

The problem with treating the period as a separator is that it's a metacharacter itself. Putting it inside a match string has the effect of matching any character, not just the period. To embed a metacharacter as an ordinary character within a regular expression you must precede it with a backslash. This applies to several characters like the dollar sign, asterisk and question mark, all listed in the PowerBuilder Help.

But wait! There's one missing from the list! The hyphen in Smythe-Carter is also a metacharacter when used inside square brackets. For example [a-z] means "match any single character in the range from a to z" but not the hyphen itself. And that's exactly where we want to put a hyphen, inside the list of separators. It too must also be preceded by a backslash as shown in Listing 8.

Listing 8: Metacharacters as Separators

   ls_match &
      = 'match( lower( string_column ), "^' &
      + ls_match &
      + '.*" ) or ' &
      + 'match( lower( string_column ), ' &
      + '".*[ ,_:;~'~~"\.\-]' &
      + ls_match &
      + '.*" )'

This business of the hyphen not being explicitly documented as a metacharacter is troublesome. What if some other character becomes a metacharacter in a future release of PowerBuilder? What if the tilde became a metacharacter?

Here's another hint to add to the list: Backslashes are processed after all the tilde stripping is finished. And although the documentation says to use backslash before a metacharacter, it looks like backslashes are stripped from in front of any character. So today's solution to future metacharacters might be to put a backslash in front of every single character you want embedded in the match string. But not inserted inside tilde pairs that are going to be reduced by the earlier stripping processes, just ahead each tilde sequence that's going to end up as a single character. The mind boggles at the monstrosity of Listing 9, but Figure 2 is proof that it works.

Listing 9: The Dreaded Meta-Tilde

   ls_match &
      = 'match( lower( string_column ), "^' &
      + ls_match &
      + '.*" ) or ' &
      + 'match( lower( string_column ), ' &
      + '".*[\ \,\_\:\;\~'\~~"\.\-\~~~~]' &
      + ls_match &
      + '.*" )'

Figure 2: It Actually Works

"What if the user includes metacharacters and quotation marks in the search string?"

Enough! The ride is over, time to get off. As that lazy professor might have done, let's assign this problem as homework. The user wants to search for the prefix "test." so that it matches "test.dta" and "run test.com" but not "xtest.com" or "thorough testing". The search string might include quotes and tildes as well as metacharacters. In fact, the user might want to search this web page for some very strange strings indeed. Hint: combine the code from Listings 4 and 9 and add some logic to preprocess the user's input like the code in Listing 2.


See also: The Dreaded Meta-Tilde (I)

Breck Carter can be reached by phone at (416) 763-5200 or via email at bcarter@bcarter.com.