Sunday, June 17, 2007

Char_Instr - Another Matching Algorithm

I've using Soundex, Double Metaphone, Levenshtein Distance and other string matching algorithms, but somethings it is the simple things that work best.

Here is a function I wrote in PL/SQL utilizing the INSTR() function to traverse through one string, checking each character as to whether or not it exists in the another string.

This has been useful in checking for duplicates in mailing lists as well as identifying potential duplicates that should be reviewed. The function returns the number of characters from the second string that are the same as the first string.

I've found the use of the Soundex to be too loose and the Levenshtein Distance algorithm to be lacking in its handling of transposed characters. Use of this function will give you a count of how many chars are the same regardless of position in the string.


/*==========================================================
Function: char_instr (str1, str2)

Purpose: To check how many characters from one string

exist in another string. The return value is the number
of characters that exist in the in both strings.
==========================================================*/

FUNCTION char_instr
(
str1 IN VARCHAR2,
str2 IN VARCHAR2
)
RETURN NUMBER
AS
v_exists NUMBER := 0;
v_occurs NUMBER := 0;
v_length NUMBER;
BEGIN
/*==========================================================
Get the length of the str2
==========================================================*/
v_length := LENGTH(str2);
/*==========================================================
Loop through str2 and check if char exists in str1
=========================================================*/
IF( v_length > 0 ) THEN
FOR x IN 1..v_length LOOP
v_exists := INSTR(UPPER(str1), SUBSTR(UPPER(str2), x, 1));
IF(v_exists>0) THEN
v_occurs := v_occurs + 1;
END IF;
END LOOP;
END IF;
RETURN v_occurs;
END char_instr;
*