Wednesday, October 3, 2007

Recent international marketing database build

I commonly refer to these types of databases as data staging areas. The reason is that most large companies I work for already have many database systems in active use. The staging area is a place where data pulled from all over an organization has been cleaned, standardized, and de-duped, and is then kept in operational shape for regular use in marketing programs.

So what were we dealing with to get started?

A series of large data extracts was sent to me with information from existing internal databases. This consisted of about three files in relatively good order. Next came an additional set of 250 files in every shape, size and format. Some files had only 15 or 20 records; others had thousands of records from many different countries.
My team and I spent the better part of a month evaluating and structuring these files for use. This included a substantial amount of communication with the client to clearly identify what every file meant and how it should be used and prioritized in the new staging area.

Just a few facts

We were looking at over 1,000 different field names relating to a clean list of about 220 actual field names. There were over 24,000 unique job titles that had to be translated into a standard set of job functions to make selections on the data possible. We identified over 6,000 industry descriptions in all types of industry classifications. Of course values like country and city names were also coded in numerous ways – human creativity is endless when it comes to doing the same thing in many different ways. In the end we mapped over 35,000 values to a standard list of variables in our data dictionary which made effective selections possible.
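To give a concrete picture of what that mapping looked like in practice, here is a minimal sketch in Python of the kind of lookup we relied on. The field names, job titles and job functions below are made-up examples, not entries from the actual data dictionary.

FIELD_NAME_MAP = {
    "company": "company_name",
    "co_name": "company_name",
    "organisation": "company_name",
    "jobtitle": "job_title",
    "title_of_contact": "job_title",
}

JOB_FUNCTION_MAP = {
    "cio": "IT Management",
    "chief information officer": "IT Management",
    "it manager": "IT Management",
    "purchasing director": "Procurement",
}

def canonical_field(raw_field):
    # Translate a raw source column name into the standard field name.
    return FIELD_NAME_MAP.get(raw_field.strip().lower(), raw_field)

def job_function(raw_title):
    # Map a free-text job title onto a standard job function used for selections.
    return JOB_FUNCTION_MAP.get(raw_title.strip().lower(), "Unclassified")

print(canonical_field("Organisation"))            # company_name
print(job_function("Chief Information Officer"))  # IT Management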
This number may seem impressive, but field name matching is straightforward if you only have to match one field name against another. Often, however, you face different data sources, some of which keep account and contact data separate, while others list them together or do not distinguish between the two levels at all. Normalization to accounts and contacts becomes even more difficult if you allow the same type of value to be linked to either a contact or an account (e.g. an address or telephone number).
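A rough illustration of that account/contact normalization, with invented field groupings rather than the real schema: a flat source row is split into an account-level record and a contact-level record.

ACCOUNT_FIELDS = {"company_name", "company_address", "company_phone", "website"}
CONTACT_FIELDS = {"first_name", "last_name", "job_title", "email", "direct_phone"}

def split_record(row):
    # Separate a combined source row into an account record and a contact record.
    account = {k: v for k, v in row.items() if k in ACCOUNT_FIELDS}
    contact = {k: v for k, v in row.items() if k in CONTACT_FIELDS}
    return account, contact

row = {
    "company_name": "Example GmbH",
    "website": "www.example.de",
    "first_name": "Anna",
    "last_name": "Schmidt",
    "job_title": "IT Manager",
    "direct_phone": "+49 89 1234567",
}
account, contact = split_record(row)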

Step by step

When all the entry data was ready with correct column mapping, we went down to the lowest level of detail and searched for invalid characters. At this stage we attempted to recover original values wherever we found data broken by different encodings. Often it is possible to recover original Cyrillic or Eastern European strings even from an unreadable state, and this is precisely what we had to do in some of the files that had been saved with the wrong locale settings in the past. After successful recovery, we noted the column names of local values for the later creation of column versions that we named “Latin” or “ND”, for Latin characters only or No Diacritics, respectively.
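As an illustration of that recovery and of the “ND” columns, here is a small Python sketch. The round trip below assumes a Cyrillic (cp1251) file that was read with a Latin-1 codepage; real files needed other codepage pairs, so treat this as one hypothetical repair rather than the routine we actually ran.

import unicodedata

def recover_cp1251(garbled):
    # Attempt to repair Cyrillic text that was decoded with the wrong (Latin-1) codepage.
    try:
        return garbled.encode("latin-1").decode("cp1251")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return garbled  # leave the value unchanged if the round trip is not possible

def strip_diacritics(text):
    # Build the "ND" (no diacritics) version of a field value.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_diacritics("Kraków"))  # Krakow
print(strip_diacritics("Müller"))  # Muller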
Next we dealt with issues relating to communication details: company names, contact names, addresses, telephone and fax numbers, as well as email and website addresses. Once the communication details were amended, we were able to run effective de-duplication routines on the files.
We spent a lot of time removing noise values that are commonly found in company names and telephone numbers. Next we ran PAF (Postal Address File) validation. This provided us with clean address records according to local postal file standards and also gave us a good idea of the correctness and accuracy of each file. For telephone numbers we checked their length in digits against the standards for each country, and for websites we ran a direct internet lookup to check for valid sites linked to the addresses.
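The noise removal and telephone checks boil down to simple rules like the ones sketched below; the noise words and per-country digit lengths here are illustrative assumptions, not the full rule set we used.

import re

NOISE_WORDS = re.compile(r"\b(unknown|n/?a|none|test|do not use)\b", re.IGNORECASE)

# Assumed national digit counts for a few countries; the real table covered every market.
PHONE_LENGTH = {"PL": [9], "FR": [9], "GB": [10], "DE": [10, 11]}

def clean_company_name(name):
    # Strip common noise tokens and collapse repeated whitespace.
    cleaned = NOISE_WORDS.sub("", name)
    return re.sub(r"\s+", " ", cleaned).strip()

def phone_length_ok(number, country):
    # Compare the digit count of a number against the expected national length(s).
    digits = re.sub(r"\D", "", number)
    expected = PHONE_LENGTH.get(country)
    if expected is None:
        return True  # no rule defined for this country, accept as-is
    return len(digits) in expected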
Next we merged the files and ran our de-duplication software on all the data. This is the core of data cleaning and requires exceptional care, first with duplicate identification and then with record merging. The de-duplication or “de-dupe” is first run on the company level and then on the contact level. Usually when working with various purchased data from outside vendors you can expect a duplicate rate of anywhere from 5 to 20%. In the case of internal data from one large company, you are talking about duplicate rates of more than 50%, because the same data is found in multiple sources throughout the organization. We were able to correctly identify and merge duplicates only thanks to data policies carefully designed together with our client.
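A much simplified picture of the company-level pass: build a normalized match key per record and treat records sharing a key as one duplicate cluster. The real routines added fuzzy comparison and per-country rules, so this is only a sketch of the idea.

import re
from collections import defaultdict

def company_key(rec):
    # Normalized match key for company-level de-duplication (simplified).
    name = rec.get("company_name", "").lower()
    name = re.sub(r"\b(ltd|limited|gmbh|inc|llc|sa)\b\.?", "", name)  # drop legal forms
    name = re.sub(r"[^a-z0-9]", "", name)                             # letters and digits only
    postcode = rec.get("postcode", "").replace(" ", "").upper()
    return (rec.get("country", ""), postcode, name)

def duplicate_clusters(records, key_func):
    # Group records that share a match key; each multi-record group is a duplicate cluster.
    groups = defaultdict(list)
    for rec in records:
        groups[key_func(rec)].append(rec)
    return [g for g in groups.values() if len(g) > 1]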
All the processes mentioned above involved a lot of legwork, but our crowning achievement was to design the proper merging of hundreds of thousands of records. This took into account the source of each record, the priority of the file it came from, as well as judgments on the richness of the data.
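Conceptually the surviving record was built roughly like this: order the duplicates by source priority and by how richly populated they are, then keep the first non-empty value for each field. The priority figures and source names below are hypothetical.

def merge_cluster(cluster, source_priority):
    # Order duplicates by source priority, then by data richness (populated field count),
    # and keep the first non-empty value seen for every field.
    def rank(rec):
        priority = source_priority.get(rec.get("source", ""), 0)
        richness = sum(1 for v in rec.values() if v)
        return (priority, richness)

    merged = {}
    for rec in sorted(cluster, key=rank, reverse=True):
        for field, value in rec.items():
            if value and not merged.get(field):
                merged[field] = value
    return merged

priorities = {"crm_export": 3, "event_registrations": 2, "purchased_list": 1}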
This is a great moment in a data staging project like this one: you can finally view all the records merged and cleaned in one format, and you know there are only a few more tasks left to further improve the quality of the data. We capitalized the data, assigned gender and local-language salutations to contacts, and generated the Latin field versions mentioned before for easy handling.
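Those finishing touches are mechanical once the data is in one format. The snippet below shows the idea with an invented salutation table and first-name lookup; the production tables were of course far larger and maintained per country.

SALUTATIONS = {
    ("DE", "F"): "Sehr geehrte Frau",
    ("DE", "M"): "Sehr geehrter Herr",
    ("FR", "F"): "Chère Madame",
    ("FR", "M"): "Cher Monsieur",
}

FIRST_NAME_GENDER = {"anna": "F", "hans": "M", "marie": "F", "pierre": "M"}

def finish_contact(rec):
    # Proper-case the name, look up gender from the first name, and pick a local salutation.
    rec["first_name"] = rec.get("first_name", "").strip().title()
    rec["last_name"] = rec.get("last_name", "").strip().title()
    gender = FIRST_NAME_GENDER.get(rec["first_name"].lower(), "")
    rec["gender"] = gender
    rec["salutation"] = SALUTATIONS.get((rec.get("country", ""), gender), "Dear")
    return rec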
I am truly proud of this build. We now offer our customer the ability to see and use all the data found throughout their organization, and we are talking about data from all over the world: Europe, Asia, North & South America, Africa, and the Middle East.
