Datastage Questions1
SALE_HEADER_XXXXX_YYYYMMDD.PSV
SALE_LINE_XXXXX_YYYYMMDD.PSV
OCI supports OraBulk and Plug-in stages; using these, bulk loads can be performed quickly.
2 GB
1. Preloading the hash file into memory --> this can be done by enabling the preload option in the hash file output stage.
2. Write caching options --> data is written into a cache before being flushed to disk. You can enable this so that hash file writes are ordered in the cache before being flushed to disk, instead of being flushed in the order in which individual rows are written.
3. Pre-allocating --> estimating the approximate size of the hash file so that the file does not need to be split too often after write operations.
Type 1: The new record replaces the original record, with no trace of the old record at all. Type 2: A new record is added into the customer dimension table; the customer is therefore treated essentially as two different people. Type 3: The original record is modified to reflect the changes.
In Type 1 the new record overwrites the existing one, which means no history is maintained; the history of where the person stayed last is lost. It is simple to use.
In Type 2 a new record is added, so both the original and the new record will be present; the new record gets its own primary key. The advantage of Type 2 is that historical information is maintained, but the dimension table grows in size, so storage and performance can become a concern.
Type 2 should only be used if it is necessary for the data warehouse to track historical changes.
In Type 3 there are two columns, one to hold the original value and the other to hold the current value. For example, a new column is added which shows the original address as New York and the current address as Florida. This helps keep part of the history without increasing the table size. One problem is that when the customer moves from Florida to Texas, the New York information is lost. So Type 3 should only be used if the changes will occur only a finite number of times.
========
We can handle SCD in the following ways
Type 1: Just use, “Insert rows Else Update rows”
Or
“Update rows Else Insert rows”, in update action of target
Type 2: Use the steps as follows
a) You have to use one hash file to look up the target
b) Take 3 instances of the target
c) Give different conditions depending on the process
d) Give different update actions in the target
e) Use system variables like Sysdate and Null.
========
To develop SCD Type 2, use the update action in the target as “insert new rows only".
For this you need to maintain the primary key as a composite key in the target table. It is better to use a timestamp column as one of the key columns in the target table.
What is the difference between symmetric multiprocessing (SMP) and massively parallel processing (MPP)?
How to remove the locked jobs using Datastage?
Go to Director -> Tools and use the Clear Job Resources option; note the PID number there. Then:
1. Go to Administrator
2. Properties
3. Command
4. Type DS.TOOLS
5. You will get a list of options
6. Select the 4th option, then the 6th option
7. Enter the PID number
How to use Excel file as input in Datastage??
Store the input Excel file as a .csv file and use the Sequential File stage to read that file.
==========
You can use an Excel file as input by importing the .xls file.
Step 1 --> Go to Administrative Tools -> Data Sources (ODBC) --> System DSN. Click the Add button and configure the corresponding .xls file in your System DSN. Make sure the workbook contains the name of your Excel sheet.
Step 2 --> Import the Excel file into DataStage as an ODBC table definition.
You should be able to use the Excel file very effectively. Please let me know if you face any problem.
How can I identify the duplicate rows in a seq or comma delimited file?
The case is: the source has 4 fields, like agent id, agent name, etc. Our requirement is that the ID shouldn't be repeated. So how can I identify the duplicate rows, set a flag and send the rejects to the specified reject file? The source system's data is given to us directly; that's why we are getting these duplicates. If it had a primary key set up already, it would have been very easy.
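One quick way to check for such duplicates outside of DataStage is a shell pass over the delimited file. The sketch below is only illustrative; the file name agents.csv and the assumption that the agent id is the first comma-delimited column are made up for the example.
# List agent ids that appear more than once in the comma-delimited file.
cut -d',' -f1 agents.csv | sort | uniq -d > duplicate_ids.txt
# Split the file: the first occurrence of each id goes to clean.csv, repeats go to rejects.csv.
awk -F',' 'seen[$1]++ { print > "rejects.csv"; next } { print > "clean.csv" }' agents.csv
Within a job, the same idea can be built with a hash file keyed on the id (hashed files are used to remove duplicates, as noted later in this document) and a constraint routing the repeats to the reject file.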
· Developer Productivity –
· Informatica was 6-times more productive (It took 6 Datastage mappings compared to 1
Informatica mapping).
· Informatica mappings required no custom coding. Ascential Datastage required 20% custom
coding.
I would like a suggestion from someone that has worked on both the tools.
I am new to DataStage. We need to load an Excel file which has both ARABIC and ENGLISH text. Here my source is IBM DB2. If I want to display the Arabic content in my browser I need to store it in UNICODE format in the database. Can anyone tell me how to store the content of an Excel file as UNICODE (NLS)? In DataStage the option has been provided only for sequential files.
Please let me know the detailed steps to pass a parameter through a shell script.
I am trying to execute a batch file which has a dsjob command to run a DataStage job from the CLI. I am getting error code = 81011 (project not found, but the project is there). I can run other jobs; I am running into this problem only for one particular batch.
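For reference, a minimal shell wrapper that passes parameters to a job with dsjob might look like the sketch below. The dsenv path, project name, job name and parameter names are assumptions for the example, not taken from the question above.
# Source the DataStage engine environment (installation path is an assumption).
. /opt/Ascential/DataStage/DSEngine/dsenv
# Run the job, supplying values for its job parameters with -param, and wait for its status.
dsjob -run -param SRC_FILE=/data/in/sales.psv -param LOAD_DATE=20240101 -jobstatus MyProject MyJob
RC=$?
echo "dsjob finished with status $RC"
The -jobstatus option makes dsjob wait for the job to finish and reflect the job status in its exit code, which is what a calling batch script usually needs to check.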
How to eliminate the errors in delimited flat file?
Can a Sequential File stage read the path from another sequential file in which the data is stored?
I have defined all the user-defined environment variables and assigned values to them, but I am not able to read the encrypted password. It is read only in the server job, not from the sequencer, where it gives a password error.
What are the different types of links in Datastage?
There are 3 different links in DataStage:
1. Stream link, which is a straight link.
2. Reference link, which acts like a lookup.
3. Reject link, used in parallel jobs.
=======================
The Link Partitioner takes the data from a single input link and diverts it to a maximum of 64 output links at the same time.
The Link Collector stage collects the data from those links back into a single flow.
Yes, we can rename a column in a table using the following query: ALTER TABLE table_name RENAME COLUMN old_name TO new_name.
If a Datastage job has aborted after say 1000 records, how to continue the load from 1000th
record after fixing the error?
What are Static Hash files and Dynamic Hash files?
Hashed files have a default size established by their modulus and separation when you create them, and this can be static or dynamic.
Overflow space is only used when data grows beyond the space reserved for one of the groups (sectors) within the file. There are as many groups as specified by the modulus.
What does separation option in static hash-file mean?
What is the difference between sequential file and a dataset? When to use the copy stage?
A sequential file stores a small amount of data in a file with any extension, such as .txt, whereas a DataSet stores a huge amount of data and opens only files with the .ds extension.
====
The Sequential File stage stores a small amount of data in a file with any extension in order to access the file, whereas a DataSet is used to store a huge amount of data and opens only with the .ds extension. The Copy stage copies a single input data set to a number of output data sets. Each record of the input data set is copied to every output data set. Records can be copied without modification, or you can drop or change the order of columns.
What is the purpose of exception activity in data stage 7.5?
It is used to catch the exception raised while running the job
Where we use link partitioner in data stage job? Explain with example?
We use Link Partitioner in Datastage Server Jobs. The Link Partitioner stage is an active stage
which takes one input and allows you to distribute partitioned rows to up to 64 output links.
How to kill the job in data stage?
By killing the respective process ID
How to parameterize a field in a sequential file? I am using Datastage as ETL Tool, Sequential
file as source.
We cannot parameterize a particular field in a sequential file; instead we can parameterize the source file name of the sequential file, e.g. #FILENAME#.
How to drop the index before loading data in target and how to rebuild it in data stage?
If the size of the Hash file exceeds 2GB..What happens? Does it overwrite the current rows?
It overwrites the file
Is it possible for two users to access the same job at a time in DataStage?
No, it is not possible for two users to access the same job at the same time. DS will produce the following error: "Job is accessed by other user".
How to find errors in job sequence?
Using Datastage Director we can find the errors in job sequence
What is job control? How can it used explain with steps?
Job control (sometimes referred to as JCL, Job Control Language) is used to run a number of jobs at a time, with or without loops, from a controlling job. Steps: click Edit in the menu bar and select 'Job Properties', then enter the parameters as prompt/type pairs, for example STEP_ID string, SRC string, DSN string, Username string, Password string. After editing the above, go to the job control tab, select the jobs from the list box and run the job.
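As a hedged alternative sketch, the same "run several jobs from one controlling place" idea can also be done from the shell with dsjob; the project and job names below are invented for the example.
# Run three jobs one after another, stopping at the first failure.
for JOB in LoadCustomers LoadProducts LoadSales; do
    dsjob -run -jobstatus MyProject "$JOB" || { echo "$JOB failed"; exit 1; }
done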
How we can call the routine in Datastage job? Explain with steps?
Routines are used for implementing business logic. They are of two types: 1) Before subroutines and 2) After subroutines. Steps: double click on the Transformer stage, right click on any one of the mapping fields and select the [DS Routines] option; within the edit window give the business logic and select either of the options (Before / After Subroutines).
What are the most important aspects that a beginner must consider doing his first DS project?
He should be good at Data Warehousing Concepts and he should be familiar with all stages
What are the different types of lookups in Datastage?
- Hash Look-up
Join can have max of two input datasets; Merge can have more than two input datasets.
3. If you used it for comparison, then first matching data will be the output.
Because it uses the update links to extend the primary details which come from the master link.
Someone was saying that Join does not support more than two inputs, while Merge supports two or more inputs (one master and one or more update links). I would say that is incomplete information. The fact is that Join does support two or more input links (left, right and possibly intermediate links). But yes, if you are talking about a full outer join, then more than two links are not supported.
Coming back to main question of difference between Join and Merge Stage, the other significant
differences that I have noticed are:
1) Reject links
(Merge) has as many reject links as update links (if there are n input links then 1 will be the master link and n-1 will be update links).
2) Data Selection
(Join) There are various ways in which data is being selected. e.g. we have different types of
joins, inner, outer( left, right, full), cross join, etc. So, you have different selection criteria for
dropping/selecting a row.
(Merge) Data in Master record and update records are merged only when both have same value
for the merge key columns.
What is the difference between validated ok and compiled in Datastage.
When you compile a job, it ensures that basic things like all the important stage parameters have
been set, mappings are correct, etc. and then it creates an executable job.
You validate a compiled job to make sure that all the connections are valid. All the job
parameters are set and a valid output can be expected after running this job. It is like a
dry run where you don't actually play with the live data but you are confident that things
will work.
When we say "Validating a Job", we are talking about running the Job in the "check only" mode.
The following checks are made:
These are variables used at the project or job level. We can use them to configure the job, i.e. we can associate the configuration file (without this you cannot run your job), or increase the sequential file or dataset read/write buffer.
ex: $APT_CONFIG_FILE
Like above we have so many environment variables. Please go to job properties and click on "add
environment variable" to see most of the environment variables.
What is the purpose of using keys, and what is the difference between surrogate keys and natural keys?
We use keys to provide relationships between the entities (Tables). By using primary and foreign
key relationship, we can maintain integrity of the data.
The natural key is the one coming from the OLTP system.
The surrogate key is the artificial key which we are going to create in the target DW. We can use
these surrogate keys instead of using natural key. In the SCD2 scenarios surrogate keys play a
major role
1. If you want to know whether a job is part of a sequence, then in the Manager right click the job and select Usage Analysis. It will show all the job's dependents.
2. To find how many jobs are using a particular table.
It is like nested navigation: you can move forward and backward and see all the dependents.
Will DataStage consider the second constraint in the transformer once the first constraint is satisfied (if link ordering is given)?
A: Yes.
Constraints are used to check for a condition and filter the data. Example: Cust_Id <> 0 is set as a constraint, and it means only those records meeting this condition will be processed further.
Derivation is a method of deriving the fields, for example if you need to get some SUM, AVG
etc.
You always enter Datastage through a Datastage project. When you start a Datastage client you
are prompted to connect to a project. Each project contains:
• Datastage jobs.
• Built-in components. These are predefined components used in a job.
• User-defined components. These are customized components created using the Datastage
Manager or Datastage Designer
Datastage Director. A user interface used to validate, schedule, run, and monitor
Datastage jobs.
Datastage Manager. A user interface used to view and edit the contents of the
Repository.
Select the data source stage depending upon the source, for example flat file, database, XML, etc.
Select the required stages for the transformation logic, such as Transformer, Link Collector, Link Partitioner, Aggregator, Merge, etc.
Select the final target stage where you want to load the data, whether it is a data warehouse, data mart, ODS, staging area, etc.
What is the meaning of file extender in DataStage server jobs? Can we run a DataStage job from one job to another job, where is that file data stored, and what is the file extender in dsjobs?
File extender means adding columns or records to an already existing file in DataStage.
We can run a DataStage job from one job to another job in DataStage.
The DRS and ODBC stages are similar as both use Open Database Connectivity to connect to a database. Performance-wise there is not much of a difference. We use the DRS stage in parallel jobs.
In-process is where the server transfers only one row at a time to the target, and inter-process means that the server sends a group of rows to the target table. Both of these are available on the Tunables tab of the Administrator client component.
Yes, we can convert by attaching one hierarchy to lowest level of another hierarchy.
Is it possible to move data from an Oracle warehouse to a SAP warehouse using the DataStage tool?
We can use the DataStage Extract Pack for SAP R/3 and the DataStage Load Pack for SAP BW to transfer the data from Oracle to the SAP warehouse. These plug-in packs are available with DataStage version 7.5.
If a Datastage job aborts after say 1000 records, how to continue the job from 1000th record after
fixing the error?
By specifying checkpointing in the job sequence properties; if we restart the job, it will start by skipping up to the failed record. This option is available in the 7.5 edition.
What is OCI?
If you mean the Oracle Call Interface (OCI), it is a set of low-level APIs used to interact with Oracle databases. It allows one to perform operations like logon, execute, parse, etc. from a C or C++ program.
Hashing is key-to-address translation. This means the value of a key is transformed into a disk
address by means of an algorithm, usually a relative block and anchor point within the block. It's
closely related to statistical probability as to how well the algorithms work.
It sounds fancy but these algorithms are usually quite simple and use division and remainder
techniques. Any good book on database systems will have information on these techniques.
Interesting to note that these approaches are called "Monte Carlo Techniques" because the
behavior of the hashing or randomizing algorithms can be simulated by a roulette wheel where
the slots represent the blocks and the balls represent the records (on this roulette wheel there are
many balls not just one).
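To make the division-and-remainder idea concrete, here is a tiny illustrative sketch (not DataStage's actual algorithm) that maps a numeric key to one of the groups given a modulus; the key and modulus values are made up.
# Illustrative only: map a numeric key to a group (block) number by taking the remainder.
KEY=1234567
MODULUS=97          # number of groups in the hashed file
GROUP=$(( KEY % MODULUS ))
echo "key $KEY falls into group $GROUP"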
It is possible to call one job in another job in server jobs?
We cannot call one job within another in Datastage, however we can write a wrapper to access
the jobs in a stated sequence. We can also use sequencer to sequence the series of jobs.
There you can add a new environment variable or you can use an existing one.
2] You can view all the environment variables in Designer. You can check them in Job Properties, and you can add and access environment variables from Job Properties.
What are the enhancements made in Datastage 7.5 compare with 7.0?
Many new stages were introduced compared to DataStage version 7.0. In server jobs we have the stored procedure stage and command stage, and a generate report option was added in the File tab. In job sequences many stages like start loop activity, end loop activity, terminate loop activity and user variables activity were introduced. In parallel jobs the surrogate key stage and stored procedure stage were introduced. For all other specifications,
I assume you are referring to a Lookup File Set; it is used only for lookup stages. Dataset: DataStage parallel extender jobs use data sets to manage data within a job. You can think of each link in a job as carrying a data set. The Data Set stage allows you to store data being operated on in a persistent form, which can then be used by other DataStage jobs. FileSet: DataStage can generate and name exported files, write them to their destination, and list the files it has generated in a file whose extension is, by convention, .fs. The data files and the file that lists them are called a file set. This capability is useful because some operating systems impose a 2 GB limit on the size of a file and you need to distribute files among nodes to prevent overruns.
How the hash file is doing lookup in serverjobs? How is it comparing the key values?
A hashed file is used for two purposes: 1. to remove duplicate records, and 2. for reference lookups. The hashed file has 3 parts: each record has a hashed key, a key header and a data portion. By using the hashing algorithm and the key value, the lookup is faster.
What are the differences between the data stage 7.0 and 7.5
in server jobs?
There are a lot of differences: many new stages are available in DS 7.5, for example the CDC stage, stored procedure stage, etc.
2] We can handle rejected rows in two ways with the help of constraints in a Transformer: 1) by setting the reject option in the constraints grid of the Transformer properties where we write our constraints, or 2) by using REJECTED in the expression editor of the constraint. Create a hash file as temporary storage for rejected rows, create a link and use it as one of the outputs of the transformer, and apply either of the two steps above on that link. All the rows which are rejected by all the constraints will go to the hash file.
ORABULK is used to load bulk data into a single table of a target Oracle database.
BCP is used to load bulk data into a single table for Microsoft SQL Server and Sybase.
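For illustration only, a typical bcp invocation looks roughly like the line below; the database, table, file, delimiter and connection details are assumptions, so check the bcp documentation for your server version.
# Bulk-load a pipe-delimited file into a single SQL Server table.
bcp SalesDB.dbo.SALE_LINE in /data/out/sale_line.psv -c -t'|' -S sqlhost -U loaduser -P secret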
How is Datastage 4.0 functionally different from the enterprise edition now? What are the exact
changes?
The main difference between DataStage and Informatica is SCALABILITY; Informatica is more scalable than DataStage.
In my view DataStage is also scalable; the difference lies in the number of built-in functions, which makes DataStage more user friendly.
Datastage from Staging to MDW is only running at 1 row per second! What do we do to remedy?
I am assuming that there are too many stages, which is causing problem and providing the
solution.
In general, if you have too many stages (especially transformers and hash lookups), there will be a lot of overhead and the performance will degrade drastically. I would suggest you write a query instead of doing several lookups. It may seem embarrassing to have a tool and still write a query, but that is best at times.
If there are too many look ups that are being done, ensure that you have appropriate indexes while
querying. If you do not want to write the query and use intermediate stages, ensure that you use
proper elimination of data between stages so that data volumes do not cause overhead. So, there
might be a re-ordering of stages needed for good performance.
1) For massive transaction set hashing size and buffer size to appropriate values to perform as
much as possible in memory and there is no I/O overhead to disk.
2) Enable row buffering and set appropriate size for row buffering
What is the user variable activity, when is it used and how is it used? Where is it used, with a real example?
By using the user variable activity we can create variables in the job sequence; these variables are available to all the activities in that sequence.
There are three different types of user-created stages available for PX.
What are they? Which would you use? What is the disadvantage for using each type?
These are the three different stages: i) Custom ii) Build iii) Wrapped
What is the exact difference between Join, Merge and Lookup Stage??
If the reference datasets are big enough to cause trouble, use a join. A join does a high-speed sort on the driving and reference datasets. This can involve I/O if the data is big enough, but the I/O is all highly optimized and sequential. Once the sort is over, the join processing is very fast and never involves paging or other I/O.
Unlike Join stages and Lookup stages, the Merge stage allows you to specify several reject links, as many as there are update links.
Can anyone tell me how to extract data from more than one heterogeneous source, for example a sequential file, Sybase and Oracle, in a single job?
Yes, you can extract data from two heterogeneous sources in DataStage using the transformer stage; it is simple, you just need to form a link between the two sources in the transformer stage.
2] You can convert all heterogeneous sources into sequential files and join them using Merge, or you can write a user-defined query in the source itself to join them.
How can I specify a filter command for processing data while defining sequential file output
data?
We have something called an after-job subroutine and a before-job subroutine, with which we can execute UNIX commands. Here we can use a sort command or a filter command.
We can also check whether the input files exist or not, whether the input tables exist or not, and also check usernames, data source names, passwords and the like.
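As a hedged example of such a filter, the pipeline below could be run as a before/after command, or typed into the Sequential File stage's filter command property (in which case the input file name and output redirection are dropped, since the stage supplies the data on stdin); the file name, '|' delimiter and column positions are assumptions.
# Drop comment lines, keep rows whose 3rd pipe-delimited field is non-empty, sort by field 1.
grep -v '^#' sale_line.psv | awk -F'|' '$3 != ""' | sort -t'|' -k1,1 > sale_line_clean.psv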
If data is partitioned in your job on key 1 and then you aggregate on key 2, what issues could
arise?
The data would need to be repartitioned (for example, hash partitioned) on key 2 before the aggregation; otherwise rows belonging to the same group can land in different partitions and the aggregate results will be incomplete. The extra repartitioning also adds to the execution time.
If you’re running 4 ways parallel and you have 10 stages on the canvas, how many processes does
Datastage create?
The answer is 40: you have 10 stages, and each stage can be partitioned and run on 4 nodes, which makes the total number of processes generated 40.
You can create a table where you store the last successful refresh time for each table/dimension.
Then, in the source query, taking the delta between the last successful refresh time and sysdate should give you the incremental load.
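A rough shell sketch of that approach is shown below; it uses a control file instead of a control table purely to keep the example short, and the paths, project, job and parameter names are invented.
# Read the last successful refresh time, run the delta load, then record the new time.
LAST_REFRESH=$(cat /etl/ctl/customer_dim.last_refresh)
NOW=$(date '+%Y-%m-%d %H:%M:%S')
if dsjob -run -param EXTRACT_FROM="$LAST_REFRESH" -param EXTRACT_TO="$NOW" -jobstatus MyProject LoadCustomerDim
then
    echo "$NOW" > /etl/ctl/customer_dim.last_refresh   # only advance the marker on success
fi
The job's source query would then use EXTRACT_FROM and EXTRACT_TO in its WHERE clause to pull only the changed rows.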
Does Enterprise Edition only add the parallel processing for better performance?
Are any stages/transformations available in the enterprise edition only?
Datastage Standard Edition was previously called Datastage and Datastage Server Edition. •
Datastage Enterprise Edition was originally called Orchestrate, then renamed to Parallel Extender
when purchased by Ascential. • Datastage Enterprise: Server jobs, sequence jobs, parallel jobs.
The enterprise edition offers parallel processing features for scalable high volume solutions.
Designed originally for Unix, it now supports Windows, Linux and Unix System Services on
mainframes. • Datastage Enterprise MVS: Server jobs, sequence jobs, parallel jobs, mvs jobs.
MVS jobs are jobs designed using an alternative set of stages that are generated into cobol/JCL
code and are transferred to a mainframe to be compiled and run. Jobs are developed on a UNIX
or Windows server transferred to the mainframe to be compiled and run. The first two versions
share the same Designer interface but have a different set of design stages depending on the type
of job you are working on. Parallel jobs have parallel stages but also accept some server stages
via a container. Server jobs only accept server stages, MVS jobs only accept MVS stages. There
are some stages that are common to all types (such as aggregation) but they tend to have different
fields and options within that stage.
A complex design means having more joins and more lookups; such a job design is called a complex job. We can easily implement any complex design in DataStage by following simple tips, also in terms of increasing performance. There is no limitation on the number of stages in a job, but for better performance use at most about 20 stages in each job; if it exceeds 20 stages then go for another job. Use not more than 7 lookups in a transformer, otherwise include one more transformer. I hope that answers your abstract question.
Yes, we can join a flat file and database in an indirect way. First create a job which can populate
the data from database into a Sequential file and name it as Seq_ First. Take the flat file which
you are having and use a Merge Stage to join these two files. You have various join types in
Merge Stage like Pure Inner Join, Left Outer Join, Right Outer Join etc., You can use any one of
these which suits your requirements.
1) If an input file has an excessive number of rows and can be split up, then use standard logic to run jobs in parallel.
If you have SMP machines you can use the IPC, Link Collector and Link Partitioner stages for performance tuning.
What is the meaning of "Try to have the constraints in the 'Selection' criteria of the jobs itself; this will eliminate the unnecessary records even getting in before joins are made"?
It probably means that you can put the selection criteria in the WHERE clause, i.e. whatever data you need to filter, filter it out in the SQL rather than carrying it forward and then filtering it out.
Constraints are nothing but restrictions to data. Here it is restriction to data at entry itself, as he
told it will avoid unnecessary data entry.
This means try to improve the performance by avoiding use of constraints wherever possible and
instead using them while selecting the data itself using a where clause. This improves
performance.
Take the source file (Excel file) in .csv format and apply the conditions which satisfy the data mart.
What is difference between serverjobs & parallel jobs?
Server jobs. These are available if you have installed Datastage Server. They run on the
Datastage Server, connecting to other data sources as necessary.
Parallel jobs. These are only available if you have installed Enterprise Edition. These run on
Datastage servers that are SMP, MPP, or cluster systems. They can also run on a separate z/OS
(USS) machine if required.
Merge is sometimes described as nothing more than filter conditions applied while combining inputs.
2] Merge is a stage that is available in both parallel and server jobs. The merge stage is used to
join two tables (server/parallel) or two tables/datasets (parallel). Merge requires that the master
table/dataset and the update table/dataset to be sorted. Merge is performed on a key field, and the
key field is mandatory in the master and update dataset/table.
How we use NLS function in Datastage? What are advantages of NLS function? Where we can
use that one? Explain briefly?
DataStage understands the architecture of the system through this file (APT_CONFIG_FILE). For example, this file contains information about node names, disk storage, etc.
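For illustration, a minimal one-node configuration file can be created and pointed at from the shell as in the sketch below; the host name and directories are placeholders, and real configurations usually define several nodes.
# Write a one-node APT configuration file and make jobs use it.
cat > /tmp/one_node.apt <<'EOF'
{
  node "node1"
  {
    fastname "etlhost"
    pools ""
    resource disk "/data/ds/dataset" {pools ""}
    resource scratchdisk "/data/ds/scratch" {pools ""}
  }
}
EOF
export APT_CONFIG_FILE=/tmp/one_node.apt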
OCI doesn't mean Orabulk data. It actually uses the Oracle Call Interface of Oracle to load the data; it is the lowest level of Oracle being used for loading the data.
How can I extract data from DB2 (on IBM iSeries) to the data warehouse with DataStage as the ETL tool? I mean, do I first need to use ODBC to create connectivity and use an adapter for the extraction and transformation of data? Thanks so much if anybody could provide an answer.
If you connect the output of a hash file to a transformer, it will act like a reference; there are no errors at all. It can be used in implementing SCDs.
Version Control stores different versions of DS jobs, runs different versions of the same job, reverts to a previous version of a job, and views version histories.
What are the Repository Tables in Datastage and what are they?
A data warehouse is a repository (centralized as well as distributed) of data, able to answer any ad hoc, analytical, historical or complex query. Metadata is data about data. Examples of metadata include data element descriptions, data type descriptions, attribute/property descriptions, range/domain descriptions, and process/method descriptions. The repository environment encompasses all corporate metadata resources: database catalogs, data dictionaries, and navigation services. Metadata includes things like the name, length, valid values, and description of a data element. Metadata is stored in a data dictionary and repository. It insulates the data warehouse from changes in the schema of operational systems.
In DataStage, under the I/O and Transfer interface tab you will find the input, output and transfer pages; there are 4 tabs and the last one is Build, under which you can find the TABLE NAME.
The DataStage client components are: Administrator, which administers DataStage projects and conducts housekeeping on the server; Designer, which creates DataStage jobs that are compiled into executable programs; Director, which is used to run and monitor DataStage jobs; and Manager, which allows you to view and edit the contents of the repository.
Question: How do you execute DataStage job from command line prompt?
A: Using "dsjob" command as follows.
Dsjob -run -jobstatus projectname jobname
Question: What is the default cache size? How do you change the cache size if needed?
A: Default cache size is 256 MB. We can increase it by going into Datastage
Administrator and selecting the Tunable Tab and specify the cache size over there.
Question: How to run a Shell Script within the scope of a Data stage job?
A: By using "ExcecSH" command at Before/After job properties
Link Collector: It collects the data coming from partitions, merges it into a single data
flow and loads to target.
Question: Suppose if there are million records did you use OCI? if not then what stage
do you prefer
Question: What are the often used Stages or stages you worked with in your last project?
A: A) Transformer, ORAOCI8/9, ODBC, Link-Partitioner, Link-Collector, Hash, ODBC,
Aggregator, Sort.
Question: How many jobs have you created in your last project?
A: 100+ jobs for every 6 months if you are in Development, if you are in testing 40 jobs
for every 6 months although it need not be the same number for everybody
Question: Did you Parameterize the job or hard-coded the values in the jobs?
A: Always parameterize the job. Either the values come from Job Properties or from a 'Parameter Manager', a third party tool. There is no way you will hard-code some parameters in your jobs. The often-parameterized variables in a job are: DB DSN name, username, password, and the dates with respect to which the data is to be looked at.
Question: Have you ever involved in updating the DS versions like DS 5.X, if so tell us
some the steps you have taken in doing so?
A: Yes. The following are some of the steps; I have taken in doing so:
1) Definitely take a back up of the whole project(s) by exporting the project as a
.dsx file
2) See that you are using the same parent folder for the new version also, so that your old jobs using hard-coded file paths still work.
3) After installing the new version import the old project(s) and you have to
compile them all again. You can use 'Compile All' tool for this.
4) Make sure that all your DB DSN's are created with the same name as old one's.
This step is for moving DS from one machine to another.
5) In case if you are just upgrading your DB from Oracle 8i to Oracle 9i there is
tool on DS CD that can do this for you.
6) Do not stop the 6.0 server before the upgrade; version 7.0 install process
collects project information during the upgrade. There is NO rework (recompilation of
existing jobs/routines) needed after the upgrade.
Question: What are Static Hash files and Dynamic Hash files?
A: As the names themselves suggest what they mean. In general we use Type-30 dynamic hash files. The data file has a default size of 2 GB and the overflow file is used if the data exceeds the 2 GB size.
Question: What are OConv () and Iconv () functions and where are they used?
A: IConv () - Converts a string to an internal storage format
OConv () - Converts an expression to an output format
Question: What are Routines and where/how are they written and have you written any
routines before?
A: Routines are stored in the Routines branch of the DataStage Repository, where you
can create, view or edit. The following are different types of routines:
1) Transform functions
2) Before-after job subroutines
3) Job Control routines
Question: If worked with DS6.0 and latest versions what are Link-Partitioner and Link-
Collector used for?
A: Link Partitioner - Used for partitioning the data.
Link Collector - Used for collecting the partitioned data.
Question: What are other Performance tunings you have done in your last project to
increase the performance of slowly running jobs?
A:
• Staged the data coming from ODBC/OCI/DB2UDB stages or any database on the
server using Hash/Sequential files for optimum performance also for data
recovery in case job aborts.
• Tuned the OCI stage for 'Array Size' and 'Rows per Transaction' numerical values
for faster inserts, updates and selects.
• Tuned the 'Project Tunables' in Administrator for better performance.
• Used sorted data for Aggregator.
• Sorted the data as much as possible in DB and reduced the use of DS-Sort for
better performance of jobs
• Removed the data not used from the source as early as possible in the job.
• Worked with DB-admin to create appropriate Indexes on tables for better
performance of DS queries
• Converted some of the complex joins/business in DS to Stored Procedures on DS
for faster execution of the jobs.
• If an input file has an excessive number of rows and can be split-up then use
standard logic to run jobs in parallel.
• Before writing a routine or a transform, make sure that the functionality required is not
already available in one of the standard routines supplied in the sdk or ds
utilities categories.
• Constraints are generally CPU intensive and take a significant amount of time to
process. This may be the case if the constraint calls routines or external macros
but if it is inline code then the overhead will be minimal.
• Try to have the constraints in the 'Selection' criteria of the jobs itself. This will
eliminate the unnecessary records even getting in before joins are made.
• Tuning should occur on a job-by-job basis.
• Use the power of DBMS.
• Try not to use a sort stage when you can use an ORDER BY clause in the
database.
• Using a constraint to filter a record set is much slower than performing a SELECT
… WHERE….
• Make every attempt to use the bulk loader for your particular database. Bulk
loaders are generally faster than using ODBC or OLE.
Question: What will you do in a situation where somebody wants to send you a file and use
that file as an input or reference and then run the job?
A: A. Under Windows: use 'WaitForFileActivity' under the Sequencers and then run
the job. Maybe you can schedule the sequencer around the time the file is expected to
arrive.
B. Under UNIX: poll for the file. Once the file has arrived, start the job or sequencer
depending on the file.
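A hedged sketch of the UNIX polling approach (file path, polling interval, project and job names are all assumptions):
# Poll every 60 seconds until the expected file shows up, then kick off the job.
FILE=/data/in/SALE_HEADER.PSV
until [ -f "$FILE" ]; do
    sleep 60
done
dsjob -run -jobstatus MyProject LoadSaleHeader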
Question: How would call an external Java function which are not supported by
DataStage?
A: Starting from DS 6.0 we have the ability to call external Java functions using a Java
package from Ascential. In this case we can even use the command line to invoke the
Java function, write the return values from the Java program (if any) to a file, and use that
file as a source in a DataStage job.
Question: What is the utility you use to schedule the jobs on a UNIX server other than
using Ascential Director?
A: Use the crontab utility along with the dsexecute() function, with the proper parameters
passed.
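For example, a crontab entry along those lines might look like the sketch below; the dsenv path, project and job names are assumptions, and dsjob is shown here as the command-line runner rather than the dsexecute() function mentioned above.
# Run the job at 02:00 every day and append its output to a log file.
0 2 * * * . /opt/Ascential/DataStage/DSEngine/dsenv && dsjob -run -jobstatus MyProject NightlyLoad >> /var/log/nightly_load.log 2>&1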
Question: What are the command line functions that import and export the DS jobs?
A: A. dsimport.exe- imports the DataStage components.
B. dsexport.exe- exports the DataStage components.
Question: How will you determine the sequence of jobs to load into data warehouse?
A: First we execute the jobs that load the data into Dimension tables, then Fact tables,
then load the Aggregator tables (if any).
Question: The above might raise another question: why do we have to load the
dimension tables first, then the fact tables?
A: As we load the dimensional tables the keys (primary) are generated and these keys
(primary) are Foreign keys in Fact tables.
Question: Tell me one situation from your last project, where you had faced problem and
How did u solve it?
A: A. The jobs in which data is read directly from OCI stages are running extremely
slow. I had to stage the data before sending to the transformer to make the jobs run faster.
B. The job aborts in the middle of loading some 500,000 rows. We have the option of
either cleaning/deleting the loaded data and then running the fixed job, or running the job
again from the row at which it aborted. To make sure the load was proper we opted for the former.
Question: Does the selection of 'Clear the table and Insert rows' in the ODBC stage send
a Truncate statement to the DB or does it do some kind of Delete logic.
A: There is no TRUNCATE on ODBC stages. It is Clear table blah blah and that is a
delete from statement. On an OCI stage such as Oracle, you do have both Clear and
Truncate options. They are radically different in permissions (Truncate requires you to
have alter table permissions where Delete doesn't).
Question: How do you rename all of the jobs to support your new File-naming
conventions?
A: Create an Excel spreadsheet with new and old names. Export the whole project as a
dsx. Write a Perl program, which can do a simple rename of the strings looking up the
Excel file. Then import the new dsx file probably into a new project for testing.
Recompile all jobs. Be cautious that the name of the jobs has also been changed in your
job control jobs or Sequencer jobs. So you have to make the necessary changes to these
Sequencers.
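A shell-based sketch of the same renaming idea (instead of Perl) is shown below, assuming the old and new names have been saved from the spreadsheet as a two-column CSV called names.csv and the project export is project.dsx; both file names are hypothetical.
# Apply each old-name,new-name pair from the CSV to a copy of the exported .dsx file.
cp project.dsx project_renamed.dsx
while IFS=',' read -r OLD NEW; do
    sed -i "s/$OLD/$NEW/g" project_renamed.dsx    # GNU sed; names must not contain '/' or regex metacharacters
done < names.csv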
Question: What is difference between operational data stage (ODS) & data warehouse?
Question: What are the difficulties faced in using DataStage? Or what are the constraints
in using DataStage?
A: Many, depends on project/transformation rules.
Question: Suppose if there are million records, do you use OCI? if not then what stage
do you prefer?
A: That depends on the target database; using native database stage is faster than
using ODBC stage.
Question: How do you pass the parameter to the job sequence if the job is running at
night?
A: Using Scheduling tools or OS scripts using command line.
Question: What is project life cycle and how do you implement it?
A: Go through SDLC (Software Development Life Cycle) from any Software
Engineering book. (Requirement gathering, gap analysis, design target model (Logical
and physical), design the data flow, development, implementation.)
Question: How will you call external function or subroutine from datastage?
A: External function using before or after subroutines.
Sub routines using Transformer derivation.
Question: How do you do oracle 4 way inner join if there are 4 oracle input files?
A: Use the single OCI/ODBC stage and join the tables.
Question: What is the order of execution done internally in the transformer, with the
stage editor having input links on the left hand side and output links?
Question: How can we join one Oracle source and Sequential file?
A: First store the Oracle into seq file and join 2 seq files using merge stage server edition.
Use merge stage to join the oracle and seq file in PX.
Question: How can we implement Lookup in DataStage Server jobs using hash files?
Question: What are all the third party tools used in DataStage?
A: Shell scripts, oracle procedures etc.
Question: What is the difference between a routine, a transform and a function?
A: A transform is a single-line function and a routine is a multi-line function. Routines are
executed outside of the job; they are not compiled with the job executable.
A transform is executed within the job as it is compiled into the job executable.
Hence using a transform is faster than a routine.
8. What is NLS?
A: National Language Support; it means DataStage supports any type of character set.
10. What is a job sequence and what are the types of triggers you have in a job sequence?
A: To run jobs, either server or parallel, in sequence mode we can use a job sequence,
and it also lets us run the jobs by defining triggers.
16. What are the Merge and Join stages and what is the difference between them?
A: Merge is for merging two sequential files, whereas Join is for joining any two database inputs.