Monthly Archives: June 2014

Data Acquisition and Cleansing using IBM InfoSphere DataStage and QualityStage

Let's take each question in the series one by one.
1. How is the information generated?
As we have seen, data is generated over a period of time from various sources. What sources are we looking at?

Let's take the example of a CRM in a mart:
1. Data generated when material is acquired from vendors for sale.
2. Data relating to the sales units (shops) and the materials to be dispatched, i.e. the quantity and type of materials.
3. Data generated by the sale of material.
4. Data generated by accounting.
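The four event types above can be pictured as raw records that a mart's OLTP systems might emit. This is only an illustrative sketch in Python; all field names and values here are hypothetical, not part of any DataStage schema.

```python
# Hypothetical raw records for the four data-generating events in a mart.
procurement = {"event": "acquisition", "vendor": "V001", "material": "M100", "qty": 500}
dispatch = {"event": "dispatch", "shop": "S01", "material": "M100", "qty": 50}
sale = {"event": "sale", "shop": "S01", "material": "M100", "qty": 2, "amount": 19.98}
ledger = {"event": "accounting", "account": "sales", "debit": 0.0, "credit": 19.98}

# Together these form the raw feed -- data, but not yet information.
raw_feed = [procurement, dispatch, sale, ledger]
```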

Looking at the statements above, we know where the data is generated. This data is not information yet; we need to cleanse it for it to become information.
How do we acquire the data and build a warehouse?
There are various tools that let you move data from an OLTP system to a warehouse. One such suite is
IBM InfoSphere Information Server. It provides a complete set of tools that help the user extract data from different sources, transform it, load the warehouse, and feed the consumer with the intelligence that turns it into information.
Broadly, the tools map to tasks as follows:
Data extraction and transformation – InfoSphere DataStage
Data cleansing, standardization, and analysis – InfoSphere QualityStage and InfoSphere Information Analyzer

Using the tools above, we can move data from a source to a target warehouse. In the process, the data is cleansed using QualityStage and Information Analyzer.
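The extract-transform-load flow described above can be sketched in a few lines of Python. This is not DataStage itself, only a minimal illustration of the three stages; the row layout and the `cust_id`/`name` fields are assumptions for the example.

```python
def extract(source_rows):
    """Extract: read raw rows from an OLTP source (here, an in-memory list)."""
    return list(source_rows)

def transform(rows):
    """Transform: trim and upper-case names; drop rows missing an id."""
    out = []
    for r in rows:
        if not r.get("cust_id"):
            continue  # incomplete row, not loaded
        out.append({"cust_id": r["cust_id"], "name": r["name"].strip().upper()})
    return out

def load(rows, warehouse):
    """Load: append cleansed rows into the target warehouse table."""
    warehouse.extend(rows)
    return warehouse

warehouse = []
source = [{"cust_id": "C1", "name": " alice "}, {"cust_id": "", "name": "bob"}]
load(transform(extract(source)), warehouse)
```

Only the complete row reaches the warehouse; the row with a blank id is filtered out during the transform step.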

The process starts with investigating the data, for example through Information Analyzer, to find the dependency of one column's data on other columns, with weightages and graphs. We can also define rules on how the data is distributed or investigated.

QualityStage has a stage called Investigate which provides similar analysis, but its output can be reused in further data-cleansing processing. Information Analyzer's analysis is also being integrated into DataStage to provide complete analysis and use of the data.
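One common investigation technique is character pattern analysis: each value in a column is reduced to a mask (letters, digits, specials) and the masks are counted, which quickly reveals inconsistent formats. The sketch below imitates the idea in plain Python; it is not QualityStage's Investigate stage, just an assumed simplification of it.

```python
from collections import Counter

def char_mask(value):
    """Map each character to a class: 'c' for a letter, 'n' for a digit;
    special characters are kept as-is."""
    return "".join("c" if ch.isalpha() else "n" if ch.isdigit() else ch
                   for ch in value)

def investigate(values):
    """Return pattern frequencies for a column, most common first."""
    return Counter(char_mask(v) for v in values).most_common()

# Two well-formed 6-digit postcodes and one stray value:
patterns = investigate(["560001", "560034", "AB12"])
```

A dominant pattern with a few outliers, as here, tells the analyst which rows need standardization.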

Once the investigation is completed, the data is standardized based on the rules defined.

Address standardization modules are certified by some government postal authorities, which also offer discounts on mail processing when these tools are used for address standardization (e.g. CASS for US addresses, DPID for Australian address verification, SERP for Canadian addresses, and AddressDoctor).
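At its core, standardization applies rules that bring free-text values into one canonical form. The sketch below shows the idea for addresses with a tiny hypothetical rule set; real certified tools (CASS, DPID, SERP) ship far larger, authority-approved rule sets.

```python
import re

# Hypothetical abbreviation rules; real tools use certified rule sets.
ABBREV = {"ST": "STREET", "RD": "ROAD", "AVE": "AVENUE"}

def standardize_address(addr):
    """Upper-case, replace punctuation with spaces, collapse whitespace,
    and expand common street-type abbreviations."""
    addr = re.sub(r"[.,]", " ", addr.upper())
    tokens = [ABBREV.get(t, t) for t in addr.split()]
    return " ".join(tokens)
```

After this step, "12 main st." and "12 Main Street" collapse to the same string, which makes the later matching step far more effective.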

Once standardization is done, matching (comparison) removes the duplicates in the data, and only unique records are pushed forward. Now we have clean data.
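The matching step above can be sketched as building a match key from standardized fields and keeping one record per key. This is a deliberately simple exact-key dedup; QualityStage's matching is probabilistic and far more sophisticated, and the key fields chosen here are assumptions.

```python
def match_key(record):
    """Hypothetical match key: standardized name plus postcode."""
    return (record["name"].strip().upper(), record["postcode"])

def dedupe(records):
    """Keep the first record seen for each match key; drop later duplicates."""
    seen, unique = set(), []
    for r in records:
        key = match_key(r)
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

customers = [
    {"name": "Alice", "postcode": "560001"},
    {"name": " alice ", "postcode": "560001"},  # duplicate after standardization
    {"name": "Bob", "postcode": "560002"},
]
clean = dedupe(customers)
```

Standardization before matching matters: without it, " alice " and "Alice" would not share a key and the duplicate would slip through.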

In a typical customer dataset, at least 30-40% of the data is duplicate or incomplete, and not fit to feed the decision-making process.


There are a few virtual sessions running in parallel at IBM:

Is Your Data Secure? A Conversation with Experts – June 17 / 2:30 p.m. ET
Join us for a live, interactive virtual event to explore the issues and challenges facing organizations as they look to keep their most sensitive data private and secure. Register today –