Monday, 14 March 2011

How to Transform and Filter Data in DataStage

It's a very common situation, and a good practice, to design DataStage jobs in which the data flow goes the following way:

EXTRACT SOURCE -> DATA VALIDATION, REFINING, CLEANSING -> MAPPING -> DESTINATION
The data refining, validation and mapping part of the process is mainly handled by a transformer stage. A transformer stage doesn't extract data or write data to a target database; it handles the extracted data, performs conversions, mappings and validations, passes values and controls the data flow.
Transformer stages can have any number of input and output links. Input links can be primary or reference (used for lookups); there can be only one primary input and any number of reference inputs.
Please refer to the examples below to see how transformers are used.
The job design depicted below implements a typical job flow. The job is used for loading customers into the data warehouse. The data is extracted from an ODBC data source, then filtered, validated and refined in the first transformer. Rejected (not validated) records are logged to a sequential file. The second transformer performs a lookup (into a country dictionary hash file) and does some other data mappings. The data is loaded into an Oracle database.
Design of a common DataStage job with validations and mapping

Design of a simple transformer

Design of a sample transformer with data filter and reject flow
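As a sketch of how such filtering can be expressed, the constraint grid of the first transformer could carry a validation condition on the main output link and its negation on the reject link. The link and column names below are invented for illustration, not taken from the job above:

   Constraint of the output link to the second transformer:
      Len(Trim(in.CUST_ID)) > 0 And Not(IsNull(in.COUNTRY_CODE))
   Constraint of the reject link to the sequential file:
      Not(Len(Trim(in.CUST_ID)) > 0 And Not(IsNull(in.COUNTRY_CODE)))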



SCD type 2 implementation in Datastage

Slowly Changing Dimension Type 2 is a model in which the whole history is stored in the database. An additional dimension record is created for each change, so the segmentation between the old record values and the new (current) value is easy to extract and the history is clear.
The fields 'effective date' and 'current indicator' are very often used in that dimension, and the fact table usually stores the dimension key and version number.
SCD 2 implementation in Datastage
The job described and depicted below shows how to implement SCD Type 2 in Datastage. It is one of many possible designs which can implement this dimension.
For this example, we will use a table with customer data (its name is D_CUSTOMER_SCD2) which has the following structure and data:
D_CUSTOMER dimension table before loading

Datastage SCD2 job design


The most important facts and stages of the CUST_SCD2 job processing:
• The dimension table with customers is refreshed daily and one of the data sources is a text file. For the purpose of this example, the record with CUST_ID=ETIMAA5 differs from the one stored in the database and is the only record with changed data. The file has the following structure and data:
SCD 2 - Customers file extract:


• There is a hashed file (Hash_NewCust) which handles a lookup of the new data coming from the text file.
• A T001_Lookups transformer does a lookup into a hashed file and maps new and old values to separate columns.
SCD 2 lookup transformer

• A T002_Check_Discrepacies_exist transformer compares old and new values of records and passes through only records that differ.
SCD 2 check discrepancies transformer

• A T003 transformer handles the UPDATE and INSERT actions for a record. The old record is updated with the current indicator flag set to no, and the new record is inserted with the current indicator flag set to yes, the record version increased by 1 and the effective date set to the current date (a derivation sketch follows this list).
SCD 2 insert-update record transformer

• ODBC Update stage (O_DW_Customers_SCD2_Upd) - update action 'Update existing rows only'; the selected key columns are CUST_ID and REC_VERSION, so they will appear in the WHERE clause of the generated SQL statement.
• ODBC Insert stage (O_DW_Customers_SCD2_Ins) - insert action 'insert rows without clearing' and the key column is CUST_ID.
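A minimal sketch of the derivations the T003 transformer might carry on its two output links is shown below. The left-hand side is the output column and the right-hand side its derivation; the link and column names (in, lkp_old) and the date format are assumptions for illustration, not taken from the job itself:

   Update link (to O_DW_Customers_SCD2_Upd):
      CUST_ID        = lkp_old.CUST_ID
      REC_VERSION    = lkp_old.REC_VERSION
      CURRENT_IND    = "N"
   Insert link (to O_DW_Customers_SCD2_Ins):
      CUST_ID        = in.CUST_ID
      REC_VERSION    = lkp_old.REC_VERSION + 1
      CURRENT_IND    = "Y"
      EFFECTIVE_DATE = Oconv(Date(), "D-YMD[4,2,2]")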
D_CUSTOMER dimension table after Datawarehouse refresh

How to perform a Lookup in Datastage

The data in Datastage can be looked up from a hashed file or from a database (ODBC/ORACLE) source. Lookups are always managed by the transformer stage.

A hashed file is a reference table based on key fields which provides fast access for lookups. Hashed files are very useful as a temporary or non-volatile program storage area. An advantage of using them is that they can be populated locally with remote data, for better performance.
To increase performance, hashed files can be preloaded into memory for fast reads and support write-caching for fast writes.

There are also situations where loading a hashed file and using it for lookups is much more time-consuming than accessing a database table directly. This usually happens when there is a need to access more complex data than a simple key-value mapping, for example when the data comes from multiple tables, or must be grouped or processed in a database-specific way. In that case it's worth considering using an ODBC or Oracle stage.
Please refer to the examples below to see how lookups are used in DataStage.
In the transformer depicted below there is a lookup into a country dictionary hash file. If a country is matched it is written to the right-hand side column; if not, a "not found" string is generated.

Design of a datastage transformer with lookup
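Assuming the reference link from the hash file exposes the usual NOTFOUND link variable of server jobs, the derivation of the target column could be sketched like this (link and column names are illustrative):

   COUNTRY_NAME = If lkp_country.NOTFOUND Then "not found" Else lkp_country.COUNTRY_NAME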


In the job depicted below there is a sequential file lookup, linked together with a hash file which stores the temporary data.
Sequential file lookup

Monday, 28 February 2011

DataStage FAQs

1) What is a job sequence used for? What are batches? What is the difference between a job sequence and batches?
A job sequence allows you to specify a sequence of server or parallel jobs to run. The sequence can also contain control information; for example, you can specify different courses of action to take depending on whether a job in the sequence succeeds or fails. Once you have defined a job sequence, it can be scheduled and run using the DataStage Director. It appears in the DataStage Repository and in the DataStage Director client as a job.
2) Why is a hash file faster than a sequential file and the ODBC stage?
A hash file is indexed and organized by a hashing algorithm on the key, which is why searches in a hash file are faster.
What is the Complex Flat File (CFF) stage? In which situations is it used?
The CFF stage is used to read files in EBCDIC format, mainly mainframe files with REDEFINES clauses.
3) What are the main differences between server jobs and parallel jobs in DataStage?
Server jobs have fewer stages; they are mainly logic intensive (the transformer is used for most things) and do not use MPP systems.
Parallel jobs have many more stages and are stage intensive (there are built-in stages for particular tasks), and they make use of SMP/MPP systems.
4) Differentiate between pipeline and partition parallelism.
Pipeline parallelism - consider three CPUs connected in series. As data is fed into the first one it starts processing, and at the same time data is being transferred into the second CPU, and so on. You can compare it with three sections of a pipe: as water enters the pipe it starts moving through all the sections.
Partition parallelism - consider three CPUs connected in parallel and fed with data at the same time, which reduces the load and improves efficiency. You can compare it to a single big pipe with three smaller pipes built in: as water is fed to them, a large quantity passes through in less time.
5) How do you read data from Excel files? My data contains commas, but the delimiter we use is |. How do you read the data? Explain the steps.
1. Create a DSN for your Excel file using the Microsoft Excel driver.
2. Use an ODBC stage as the source.
3. Configure the ODBC stage with the DSN details.
4. While importing metadata for the Excel sheet, make sure you select the 'system tables' check box.
5. Note: in the Excel sheet the first line should contain the column names.
6) What is meant by performance tuning? Give an example.
Performance tuning means taking action to increase the performance of a slowly running job, for example:
1) Use link partitioner and link collector stages to speed up processing.
2) Use sorted data for aggregation.
3) Sort at the source side and aggregate at the target side.
4) Tune the OCI stage 'Array Size' and 'Rows per Transaction' values for faster inserts, updates and selects.
5) Do not use an IPC stage at the target side.
Is this only related to server jobs? In Parallel Extender these things are taken care of by the stages.

7) How to distinguish the surrogate key in different dimensional tables?
The surrogate key is the key field in each dimension table.
8) How do you read data from Excel files? Explain the steps.
To read data from an Excel file:
* Save the file as .csv (comma-separated values).
* Use a flat file (sequential file) stage in the DataStage job.
* Double-click the flat file stage and assign the stored .csv file as its input file.
* Import metadata for the file (once you have imported or typed the metadata, click View Data to check the data values).
* Then do the remaining transformations as needed.
Alternatively, create a new DSN for the Excel driver and choose the workbook from which you want data. Then select the ODBC stage and access the Excel file through it, i.e. import the Excel sheet using the new DSN created for it.
9) How can we generate a surrogate key in server/parallel jobs?
In parallel jobs we can use surrogate key generator stage.
In server jobs we can use the built-in routine KeyMgtGetNextValue.
You can also generate the surrogate key in the database using the sequence generator.
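For the server-job routine mentioned above, a sketch of the derivation of the key column might look like this; the output column name and the sequence name passed to the routine are made up for illustration:

   Derivation of output column CUST_KEY:
      KeyMgtGetNextValue("D_CUSTOMER")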

10) What is an environment variable?
An environment variable is a predefined variable that we can use while creating a DataStage job. It can be set at either project level or job level; once set, the variable is available to the project/job.
We can also define new environment variables; for that we go to the DataStage Administrator. For further details refer to the DataStage Administrator guide.
Environment variables are used at the project or job level. We can use them to configure the job, e.g. to associate the configuration file (without this you cannot run your job) or to increase the sequential file or dataset read/write buffers.
Ex: $APT_CONFIG_FILE
Like the one above, there are many environment variables. Go to job properties, click on the Parameters tab and then click on "Add Environment Variable" to see most of them.
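As an example, a minimal single-node configuration file of the kind $APT_CONFIG_FILE points to could look like the sketch below; the node name, host name and directories are placeholders:

   {
      node "node1"
      {
         fastname "etl_host"
         pools ""
         resource disk "/data/datasets" {pools ""}
         resource scratchdisk "/data/scratch" {pools ""}
      }
   }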


12) How can we test the jobs?
Testing of jobs can be performed at many different levels: Unit testing, SIT and UAT phases.
Testing basically involves functionality and performance tests.
First, data for the job needs to be created to test the functionality. By changing the data we can see whether the requirements are met by the existing code. Every iteration of code change should be accompanied by a testing iteration.
Performance tests basically involve load tests and check how well the existing code performs within a finite period of time. Performance tuning can be applied to the SQL, the job design or the BASIC/OSH code for faster processing times.
In addition, all job designs should include error handling and failover support so that the code is robust.
13) What is the use of a hash file? Why can't we use a sequential file instead?
A hash file is used to eliminate duplicate rows based on the hash key, and is also used for lookups. DataStage does not allow a sequential file to be used as a lookup; the primary use of the hash file is to do a lookup. You could use a sequential file for a lookup, but you would need to write your own routine to match the columns, so coding time and execution time would be more expensive. When you generate a hash file, it indexes the key with a built-in hashing algorithm, so lookups are much faster; it also eliminates duplicate rows. Hash files can be held in memory, hence faster performance than a sequential file.
14) What is a routine?
Routines are stored in the Routines branch of the DataStage Repository, where you can create, view or edit them. The following are the different types of routines:
1) Transform functions
2) Before-after job subroutines
3) Job Control routines
A routine is a user-defined function that can be reused within the project.
15) How can we create environment variables in DataStage?
This mostly comes under the Administrator. As a Designer we can add one directly via Designer > View > Job Properties > Parameters > Add Environment Variable > User Defined, and then add it.
16) How do you eliminate duplicate rows in DataStage? Duplicate rows can be removed in more than one way:
1. In DataStage there is a stage called Remove Duplicates where you can specify the key.
2. Alternatively, you can specify the key in another stage so that the stage itself removes the duplicate rows based on the key at processing time.
17) What is the Pivot stage? What is it used for?
The Pivot stage is used to map sets of columns into rows.
The Pivot stage supports only horizontal pivoting - columns into rows.
The Pivot stage doesn't support vertical pivoting - rows into columns.
Example: in the source table below there are two columns with quarterly sales of a product, but the business requirement is that the target should contain a single column representing quarterly sales. We can achieve this with the Pivot stage, i.e. horizontal pivoting.
Source Table
ProdID Q1_Sales Q2_Sales
1010 123450 234550
Target Table
ProdID Quarter_Sales Quarter
1010 123450 Q1
1010 234550 Q2
18) What are the various processes which start when the DataStage engine starts?
What changes need to be made on the database side if I have to use the DB2 stage?
Is the DataStage engine responsible for compilation, execution, or both?
Three processes start when the DataStage engine starts:
1. DSRPC
2. DataStage Engine Resources
3. DataStage telnet Services
19) What is the difference between a reference link and a straight link?
The straight link is the one where data is passed directly to the next stage; the reference link is the one that provides a reference (lookup) to another table.
For example, in Oracle the EMP table has a reference to the DEPT table.
In DataStage this could be two table stages as sources (one on a straight link and one on a reference link) feeding one transformer stage, or two file stages as sources (one on a straight link and one going through a hash file used as the reference) feeding one transformer stage.


20) What is Runtime Column Propagation and how is it used?
If the data has more columns than are defined in the job metadata and runtime column propagation is enabled, the extra columns are propagated through the rest of the job.
Can both the source system (Oracle, SQL Server, etc.) and the target data warehouse (Oracle, SQL Server, etc.) be on a Windows environment, or should one of the systems be on UNIX/Linux?
Your source systems can be anything (Oracle, SQL Server, DB2, flat files, etc.), but your target system for the complete data warehouse should be a single database (Oracle, SQL Server, DB2, etc.).
21) What is the difference between OCI stage and ODBC stage?
Oracle OCI:
We can write the source query in this stage, but we can't write a lookup query in it; instead we use a hash file stage for the lookup.
We have the facility to write multiple statements before (Oracle OCI/Output/SQL/Before) or after (Oracle OCI/Output/SQL/After) executing the actual query.
We don't have a multi-row lookup facility in this stage.

ODBC:
We can write both the source query and the lookup query in this stage itself.
We do not have the facility to write multiple queries in this stage.
We do have a multi-row lookup facility in this stage.
22) How do you find the process ID? Explain the steps.
In UNIX you can use the ps -ef command; it displays all the processes currently running on the system along with their process IDs.
From the DataStage Director, follow the path:
Job > Cleanup Resources.
There you can also see the PID; it displays all the currently running processes.
Depending on your environment, you may have lots of process IDs. From one of the DataStage docs: you can try this on any given node: $ ps -ef | grep dsuser, where dsuser is the account for DataStage. If the ps command doesn't make sense, you'll need some background theory about how processes work in UNIX (or the MKS environment when running on Windows). Also from the DataStage docs: APT_PM_SHOW_PIDS - if this variable is set, players will output an informational message upon startup, displaying their process ID.
23) How can I schedule the cleaning of the file &PH& by dsjob?
Create a job with a dummy transformer and a sequential file stage. In the before-job subroutine, use ExecTCL to execute the following command:
CLEAR.FILE &PH&
24) If we are using two sources with the same metadata, how do we check whether the data in the two sources is the same? And if the data is not the same, how do we abort the job?
Use a Change Capture stage and output it into a transformer.
Write a routine that aborts the job and call it from the transformer, triggered when @INROWNUM = 1.
If the data does not match, the mismatching rows are passed to the transformer and the job is aborted.
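One way to sketch the aborting routine mentioned above is a server transform routine whose body calls DSLogFatal, which writes a fatal entry to the job log and stops the job. The routine name, message and unused argument are placeholders:

   * Body of a hypothetical transform routine "AbortOnMismatch" (Arg1 is unused here)
   Call DSLogFatal("Source and reference data do not match", "AbortOnMismatch")
   Ans = 1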
25) What is the difference between ETL and ELT?
ETL usually scrubs the data and then loads it into the data mart or data warehouse, whereas ELT loads the data first and then uses the RDBMS to scrub it and reload it into the data mart or data warehouse.
ETL = Extract >>> Transform >>> Load
ELT = Extract >>> Load >>> Transform
In ETL the transformation takes place in the staging area;
in ELT the transformation takes place at either the source side or the target side.
26) Can you tell me for what purpose .dsx files are used in the DataStage?

.dsx is the standard file extension of exported DataStage jobs. Whenever we export a job or a sequence, the file is exported in .dsx format. A standard usage is that we develop the job in our test environment and, after testing, we export the file and save it as x.dsx. This can be done using the DataStage Manager.

27) How do you remove duplicates without using the Remove Duplicates stage?
In the target, make the column a key column and run the job.
Using a Sort stage, set the property Allow Duplicates to false.
Or do a hash partition of the input data and check the Sort and Unique options.
28) How do you call procedures in DataStage?
Use the Stored Procedure Stage
29) What is an environment variable? What is it used for?
An environment variable is a predefined variable that we can use while creating a DataStage job. It can be set at either project level or job level; once set, the variable is available to the project/job.
We can also define new environment variables; for that we go to the DataStage Administrator. For further details refer to the DataStage Administrator guide.
30) How can we create read-only jobs in DataStage?
In the export dialog there is an option: click on the Options tab and, under the Include section, you will find 'Read Only'; just enable it.
31) How do you run a job from the command prompt in UNIX?
Use the dsjob command, for example:
dsjob -run -jobstatus <project name> <job name>
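A slightly fuller invocation, assuming the job takes a parameter (the project, job and parameter names below are placeholders). With -jobstatus the command waits for the job to finish and its exit code reflects the job's finishing status; check the dsjob documentation for your version for the exact codes:

   dsjob -run -mode NORMAL -param SRC_DIR=/data/input -jobstatus MyProject MyJob
   echo $?   # e.g. 1 = finished OK, 2 = finished with warnings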
32) What is the difference between Transform and Routine in DataStage?
A transform converts the data from one form to another, whereas a routine describes the business logic.
33) How do you clean the DataStage repository?
Remove log files periodically
CLEAR.FILE &PH&
34) What is the transaction size and array size in OCI stage? How these can be used?
Transaction Size - This field exists for backward compatibility, but it is ignored for release 3.0 and later of the Plug-in. The transaction size for new jobs is now handled by Rows per transaction on the Transaction Handling tab on the Input page.
Rows per transaction - The number of rows written before a commit is executed for the transaction. The default value is 0, that is, all the rows are written before being committed to the data table.
Array Size - The number of rows written to or read from the database at a time. The default value is 1, that is, each row is written in a separate statement.
35) How do you know the number of records in a sequential file before running a server job?
If your environment is UNIX, you can check with the wc -l <filename> command.
36) Other than Round Robin, What is the algorithm used in link collector? Also Explain How it will work?
Other than round robin, the other algorithm is Sort/Merge.
Using the sort/merge method the stage reads multiple sorted inputs and writes one sorted output.
37) How do you improve the performance of a hash file?
You can improve the performance of a hashed file by:
1. Preloading the hash file into memory - this can be done by enabling the preload option in the hash file output stage.
2. Write caching - data is written into a cache before being flushed to disk. You can enable this to ensure that hash file writes go to the cache in order before being flushed to disk, instead of in the order in which individual rows are written.
3. Preallocation - estimating the approximate size of the hash file so that the file does not need to be split too often after write operations.
38) What is the size of a flat file?
The flat file size depends on the amount of data contained in the file.
39) What is the DataStage engine? What is its purpose?
The DataStage server contains the DataStage engine; the server interacts with the client components and the repository. The engine is used to develop and run jobs - only when the engine is running can we develop jobs.
40) How do you implement slowly changing dimensions in DataStage?
Slowly changing dimensions are a data warehousing concept.
DataStage is a tool for ETL; it is not dedicated to slowly changing dimensions.
In Informatica PowerCenter there is a wizard to implement slowly changing dimensions. DataStage does not have that type of wizard, so SCD has to be implemented with manual logic.
41) What is the difference between symmetric multiprocessing and massively parallel processing?
Symmetric multiprocessing (SMP) - some hardware resources may be shared between processors. The processors communicate via shared memory and there is a single operating system.
Cluster or massively parallel processing (MPP) - known as shared-nothing, in which each processor has exclusive access to its hardware resources. Cluster systems can be physically dispersed. The processors have their own operating system and communicate via a high-speed network.
42) Give one real-time situation where the Link Partitioner stage is used.
If we want to move data from the source to the targets more quickly, we use the Link Partitioner stage in server jobs; it can create a maximum of 64 partitions. It is an active stage. Normally two active stages cannot be connected, but this stage is allowed to connect to a transformer or aggregator stage. The data sent from the Link Partitioner is collected by the Link Collector, with a maximum of 64 partitions. The Link Collector is also an active stage, so to avoid connecting the active transformer directly to the Link Collector we use an Inter-Process Communication (IPC) stage. As the IPC stage is passive, the data can then be collected by the Link Collector. We can use inter-process communication only when the target is a passive stage.
43) What does separation option in static hash-file mean?
The different hashing algorithms are designed to distribute records evenly among the groups of the file, based on characters and their position in the record IDs.
When a hashed file is created, separation and modulo respectively specify the group buffer size and the number of buffers allocated for the file. When a static hash file is created, DataStage creates a file that contains the number of groups specified by modulo.
Size of hash file = modulo (number of groups) * separation (buffer size)
44) What is the purpose of the Exception activity in DataStage 7.5?
It is used to catch exceptions raised while running a job.
The stages following the Exception activity are executed whenever an unknown error occurs while running the job sequencer.
45) What is the difference between a sequential file and a dataset? When do you use the Copy stage?
A sequential file stores a small amount of data in plain format with any extension, such as .txt, whereas a dataset stores a large amount of data in the parallel engine's internal format and is opened only with the .ds extension. The Copy stage copies a single input data set to a number of output data sets; each record of the input data set is copied to every output data set. Records can be copied without modification, or you can drop or change the order of columns.

46) Where do we use the Link Partitioner in a DataStage job? Explain with an example.
We use Link Partitioner in DataStage Server Jobs. The Link Partitioner stage is an active stage, which takes one input and allows you to distribute partitioned rows to up to 64 output links.
47) How do you kill a job in DataStage?
By killing the respective process ID.
You should use kill -14 so the job ends nicely; using kill -9 sometimes leaves things in a bad state.
48) How do you parameterize a field in a sequential file? I am using DataStage as the ETL tool and a sequential file as the source.
We cannot parameterize a particular field in a sequential file; instead we can parameterize the source file name of the sequential file, e.g. with a job parameter such as:
#FILENAME#
49) How do you drop the index before loading data into the target, and how do you rebuild it, in DataStage?
This can be achieved with the 'Direct Load' option of the SQL*Loader utility.
50) If the size of the hash file exceeds 2 GB, what happens? Does it overwrite the current rows?
It overwrites the file
51) Is it possible for two users to access the same job at the same time in DataStage?
No, it is not possible for two users to access the same job at the same time; DataStage will produce the error "Job is accessed by other user".
It is possible, but first you have to kill that job in the DataStage Director using the "Cleanup Resources" option.
52) How do you find errors in a job sequence?
Using the DataStage Director we can find the errors in a job sequence.
53) What is job control? How is it used? Explain the steps.
Job control (sometimes referred to as JCL, Job Control Language) is used to run a number of jobs at a time, with or without loops. Steps: click Edit in the menu bar, select Job Properties and enter the parameters, for example:
Parameter  Prompt    Type
STEP_ID    STEP_ID   String
Source     SRC       String
DSN        DSN       String
Username   unm       String
Password   pwd       String
After editing the above, go to the Job Control tab, select the jobs from the list box and run the job.
54) What is job control? How it is developed? Explain with steps?
Controlling DataStage jobs through other DataStage jobs. Example: consider two jobs, XXX and YYY. Job YYY can be executed from job XXX by using the DataStage job control functions in a routine.
To execute one job from another, the following steps need to be followed in the routine (a sketch follows the list):
1. Attach the job using the DSAttachJob function.
2. Run the job using the DSRunJob function.
3. Stop the job using the DSStopJob function.
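A minimal sketch of such a routine in DataStage BASIC is shown below. The controlled job name, parameter name and messages are placeholders, and error handling is kept to the bare minimum; treat it as an illustration of the API calls above rather than production code:

   $INCLUDE DSINCLUDE JOBCONTROL.H
   * Attach the controlled job (placeholder name "YYY")
   hJob = DSAttachJob("YYY", DSJ.ERRFATAL)
   * Optionally pass a parameter value to the controlled job
   ErrCode = DSSetParam(hJob, "SRC_DIR", "/data/input")
   * Run the job and wait for it to finish
   ErrCode = DSRunJob(hJob, DSJ.RUNNORMAL)
   ErrCode = DSWaitForJob(hJob)
   * DSStopJob(hJob) could be used instead to stop the job prematurely
   * Check the finishing status and log a fatal message if it failed
   Status = DSGetJobInfo(hJob, DSJ.JOBSTATUS)
   If Status = DSJS.RUNFAILED Then
      Call DSLogFatal("Job YYY failed", "JobControlRoutine")
   End
   ErrCode = DSDetachJob(hJob)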

55) How can we call a routine in a DataStage job? Explain the steps.
Routines are used for implementing business logic. There are two types: 1) before subroutines and 2) after subroutines. Steps: double-click the transformer stage, right-click on any one of the mapping fields and select the [DS Routines] option; within the edit window enter the business logic and select either of the options (Before / After Subroutines).
56) What are the most important aspects that a beginner must consider doing his first DS project?
He should be good at Data Warehousing Concepts and he should be familiar with all stages
57) What are the different types of lookups in DataStage?
- Look-up file stage - Generally used with Look Up stage
- Hash Look-up
- You can also implement a "look up" using Merge stage
There are two types of lookups: the Lookup stage and the Lookup File Set. Lookup: references another stage or a database to get data from it and passes it on to another stage or database. Lookup File Set: allows you to create a lookup file set or reference one for a lookup. The stage can have a single input link or a single output link; the output link must be a reference link. The stage can be configured to execute in parallel or sequential mode when used with an input link. When creating lookup file sets, one file is created for each partition. The individual files are referenced by a single descriptor file, which by convention has the suffix .fs.
58) Where are flat files actually stored? What is the path?
Flat files store the data, and the path can be given in the General tab of the sequential file stage.
Normally flat files are stored on FTP servers or in local folders; the .CSV, .XLS and .TXT file formats are commonly used for flat files.
59) How do you find the number of rows in a sequential file?
Using the row count system variable (e.g. @INROWNUM / @OUTROWNUM in a transformer).
60) How to implement type2 slowly changing dimension in DataStage? Give me with example?
Slowly changing dimensions are a common problem in data warehousing. For example: there is a customer called Lisa in a company ABC and she lives in New York. Later she moves to Florida. The company must modify her address now. In general there are 3 ways to solve this problem:

Type 1: The new record replaces the original record, no trace of the old record at all, Type 2: A new record is added into the customer dimension table. Therefore, the customer is treated essentially as two different people. Type 3: The original record is modified to reflect the changes.

In Type 1 the new value overwrites the existing one, which means no history is maintained; the history of where the person stayed previously is lost. It is simple to use.

In Type 2 a new record is added, so both the original and the new record are present and the new record gets its own primary key. The advantage of Type 2 is that historical information is maintained, but the size of the dimension table grows, so storage and performance can become a concern.
Type2 should only be used if it is necessary for the data warehouse to track the historical changes.

In Type3 there will be 2 columns one to indicate the original value and the other to indicate the current value. Example a new column will be added which shows the original address as New York and the current address as Florida. Helps in keeping some part of the history and table size is not increased. But one problem is when the customer moves from Florida to Texas the New York information is lost. So Type 3 should only be used if the changes will only occur for a finite number of times.

61) What is the difference between the Merge stage and the Join stage?
Join can have a maximum of two input datasets (for a full outer join); Merge can have more than two input datasets.
Merge and Join stage differences:
1. Merge has reject links.
2. Merge can take multiple update links.
3. If you use Merge for comparison, the first matching record will be the output, because it uses the update links to extend the primary details coming from the master link.
Someone was saying that Join does not support more than two inputs, while Merge supports two or more inputs (one master and one or more update links). That is incomplete information: Join does support two or more input links (left, right and possibly intermediate links), but if you are talking about a full outer join then more than two links are not supported.
Coming back to main question of difference between Join and Merge Stage, the other significant differences that I have noticed are:
1) Number Of Reject Link
(Join) does not support reject link.
(Merge) has as many reject link as the update links (if there are n-input links then 1 will be master link and n-1 will be the update link).


2) Data Selection
(Join) There are various ways in which data is being selected. E.g. we have different types of joins, inner, outer (left, right, full), cross-join, etc. So, you have different selection criteria for dropping/selecting a row.
(Merge) Data in Master record and update records are merged only when both have same value for the merge key columns.
62) What is the difference between 'Validated OK' and 'Compiled' in DataStage?
When you compile a job, it ensures that basic things like all the important stage parameters have been set, mappings are correct, etc. and then it creates an executable job.
You validate a compiled job to make sure that all the connections are valid. All the job parameters are set and a valid output can be expected after running this job. It is like a dry run where you don't actually play with the live data but you are confident that things will work.
When we say, "Validating a Job", we are talking about running the Job in the "check only" mode. The following checks are made:
- Connections are made to the data sources or data warehouse.
- SQL SELECT statements are prepared.
- Files are opened. Intermediate files in Hashed File, UniVerse, or ODBC stages that use the local data source are created, if they do not already exist.
63) What are the environment variables in DataStage? Give some examples.
Environment variables are used at the project or job level. We can use them to configure the job, e.g. to associate the configuration file (without this you cannot run your job) or to increase the sequential file or dataset read/write buffers.
Ex: $APT_CONFIG_FILE
Like the one above, there are many environment variables. Go to job properties and click on "Add Environment Variable" to see most of them.
64) What is the purpose of using keys, and what is the difference between a surrogate key and a natural key?
We use keys to provide relationships between the entities (Tables). By using primary and foreign key relationship, we can maintain integrity of the data.
The natural key is the one coming from the OLTP system.
The surrogate key is the artificial key, which we are going to create in the target DW. We can use these surrogate keys instead of using natural key. In the SCD2 scenarios surrogate keys play a major role
65) How do you do Usage analysis in DataStage?
1. If you want to know whether a job is part of a sequence, right-click the job in the Manager and select Usage Analysis. It will show all the job's dependents.
2. To find how many jobs are using a particular table.
3. To find how many jobs are using a particular routine.
In this way you can find all the dependents of a particular object.
It is a nested view: you can move forward and backward and see all the dependents.
66) How to remove duplicates in server job
1) Use a hashed file stage or
2) If you use the sort command in UNIX (in a before-job subroutine), you can reject duplicate records using the -u parameter, or
3) using a Sort stage
67) Will DataStage consider the second constraint in the transformer if the first constraint is satisfied (if link ordering is given)?"
Answer: Yes.
68) What are constraints and derivation?
Explain the process of taking backup in DataStage?
What are the different types of lookups available in DataStage?
Constraints are used to check for a condition and filter the data. Example: Cust_Id<>0 is set as a constraint, and it means only those records meeting this condition will be processed further.
Derivation is a method of deriving field values, for example if you need to compute a SUM, AVG, etc.
Constraints are conditions; only records meeting them are processed further. Example: process all records where cust_id<>0.
Derivations are derived expressions, for example a SUM of salary or an interest rate calculation.
69) What is a project? Specify its various components?
You always enter DataStage through a DataStage project. When you start a DataStage client you are prompted to connect to a project. Each project contains:
DataStage jobs.
Built-in components. These are predefined components used in a job.
User-defined components. These are customized components created using the DataStage Manager or DataStage Designer.