I would start by saying that knowing shell scripting is not mandatory. It is a good-to-have skill that gives you an edge over others who do not have it, especially during interview selection.
But, considering the complexity of the architecture and the varied types of data sourced onto DWH systems, for any real project and implementation you have to step into the shoes of a UNIX developer and put some scripts in place to automate things, from both a development and an administration perspective.
Let’s take a simple example that is becoming very common these days. A typical business requirement is to process data in real time or near real time and perform some basic data quality checks before the data is loaded into the database (either staging or the data warehouse).
In this kind of situation, you can automate the identification of file arrival and trigger the downstream jobs using a third-party scheduling tool like Control-M or Autosys. You can also use the Informatica scheduler with a few workarounds.
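When a scheduling tool is not available, even a small polling script can do the file-arrival detection. Here is a minimal sketch; the directory, file pattern, and function name are all hypothetical, and the demo simulates an arrival in a scratch directory:

```shell
#!/bin/sh
# Sketch of file-arrival polling (names and paths are hypothetical).
# wait_for_file DIR PATTERN MAX_TRIES SLEEP_SECS
#   prints the first matching file and returns 0, or returns 1 on timeout.
wait_for_file() {
    dir=$1; pattern=$2; max_tries=$3; sleep_secs=$4
    i=0
    while [ "$i" -lt "$max_tries" ]; do
        # unquoted $pattern is intentional so the glob expands
        for f in "$dir"/$pattern; do
            [ -f "$f" ] && { echo "$f"; return 0; }
        done
        i=$((i + 1))
        sleep "$sleep_secs"
    done
    echo "no file matching $pattern arrived in $dir" >&2
    return 1
}

# Demo: simulate a file landing in a temporary directory.
demo_dir=$(mktemp -d)
touch "$demo_dir/sales_20240101.dat"
wait_for_file "$demo_dir" 'sales_*.dat' 5 1
rm -rf "$demo_dir"
```

In production you would point this at the real landing directory and let the scheduler (or cron) invoke it ahead of the load job.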
Now, consider a situation where you have to identify the file arrival and also verify that the total number of records in the file matches the number given in the control file (the control file is an additional file that just carries the name of the parent file, the date, and the total record count of the parent file) or in the footer of the same file.
File identification alone will not suffice here. As a second step, you need a job that verifies the total number of source records against the footer or the control file.
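That count check is a few lines of shell. A minimal sketch, assuming a pipe-delimited control-file layout of `parent_file|date|record_count` (the layout and file names are assumptions for illustration):

```shell
#!/bin/sh
# Sketch of a record-count check against a control file.
# Assumed control-file layout: one line "parent_file|date|record_count".
# verify_count DATA_FILE CTL_FILE -> 0 if counts match, 1 otherwise.
verify_count() {
    data_file=$1; ctl_file=$2
    expected=$(awk -F'|' '{print $3}' "$ctl_file")   # 3rd field = record count
    actual=$(wc -l < "$data_file")                   # count data records
    if [ "$actual" -eq "$expected" ]; then
        echo "OK: $data_file has $actual records"
        return 0
    else
        echo "MISMATCH: $data_file has $actual records, control file says $expected" >&2
        return 1
    fi
}

# Demo with throwaway files.
dir=$(mktemp -d)
printf 'rec1\nrec2\nrec3\n' > "$dir/sales.dat"
printf 'sales.dat|20240101|3\n' > "$dir/sales.ctl"
verify_count "$dir/sales.dat" "$dir/sales.ctl"
rm -rf "$dir"
```

The script's exit status (0 or 1) is what a scheduler or a downstream wrapper script would key off to decide whether to start the load.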
Yes, there are options to set this up in Informatica mappings, but take a situation where you have to do this for a thousand files at the same time. Even with concurrent runs enabled for Informatica workflows, there will be performance issues, and if there is any failure you have to start all over again. Tracking the success and failure of the overall process also becomes a complex task.
Now, if the file-arrival identification, the record-count check, and any other basic checks are performed by a UNIX script (this can also be done in Perl, Python, or .NET), the whole step becomes much quicker. Once these checks are done, you can use Informatica or any other ETL tool to apply the rest of the business-specific rules or calculations and process the data.
Another common use of shell scripting in DWH is to trigger jobs or workflows via pmcmd. In some projects this is called a wrapper script: it validates the files’ presence and then triggers the workflows. For validating the files, knowing the basic commands plus grep, awk, and sed is usually enough.
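A typical wrapper looks something like the sketch below. The service, domain, folder, and workflow names are all placeholders, and the pmcmd command line is printed rather than executed so the sketch runs anywhere; in a real wrapper you would execute it (and usually pass credentials via `-u`/`-p` or a password environment variable):

```shell
#!/bin/sh
# Sketch of a pmcmd "wrapper" script (all names below are placeholders).
# It validates that the trigger file exists and is non-empty before
# handing off to the Informatica workflow.
INT_SERVICE="IS_DWH"        # assumed Integration Service name
DOMAIN="Domain_DWH"         # assumed domain name
FOLDER="FLD_SALES"          # assumed repository folder
WORKFLOW="wf_load_sales"    # assumed workflow name

run_workflow() {
    trigger_file=$1
    if [ ! -s "$trigger_file" ]; then
        echo "Trigger file $trigger_file missing or empty; workflow not started" >&2
        return 1
    fi
    # A real wrapper would execute this; printed here so the sketch is runnable.
    echo "pmcmd startworkflow -sv $INT_SERVICE -d $DOMAIN -f $FOLDER -wait $WORKFLOW"
}

# Demo with a throwaway trigger file.
tf=$(mktemp)
echo "sales.dat arrived" > "$tf"
run_workflow "$tf"
rm -f "$tf"
```

The `-wait` flag makes pmcmd block until the workflow finishes, so the wrapper’s exit status can drive the next step in the chain.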
ETL tools these days are advanced and have a lot of features built in with basic configuration, but scripting gives you that additional flexibility which normal jobs don’t.
Another example, from the administration perspective, is to programmatically define environments, interface with Subversion to stage deployments, and copy application code to target locations, as part of automated environment management and release management.
The key thing shell scripting does for us in our environment is provide the entry point for applications, passing in “environment” variables such as the paths to the application code and data. This is the standard way to achieve environment independence, so that the same deployment process can deploy the entire application to any environment. It even lets us use our automated deployment tool to redeploy a new version of the automated deployment tools themselves.
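The entry-point pattern can be sketched in a few lines. The variable names, paths, and the per-environment file layout below are assumptions; the point is that the same script runs in DEV, TEST, or PROD, and only the sourced environment file differs:

```shell
#!/bin/sh
# Sketch of an environment-independent entry point (names and paths are
# assumptions). run_job ENV_FILE JOB_NAME sources the per-environment
# settings and then launches the job with those paths.
run_job() {
    env_file=$1; job_name=$2
    . "$env_file"       # defines APP_HOME, DATA_DIR, etc. for this target
    echo "Running $job_name with APP_HOME=$APP_HOME and DATA_DIR=$DATA_DIR"
}

# Demo: a throwaway "dev" environment file.
d=$(mktemp -d)
cat > "$d/dev.env" <<'EOF'
APP_HOME=/opt/app/dev
DATA_DIR=/data/dev
EOF
run_job "$d/dev.env" load_sales
# prints: Running load_sales with APP_HOME=/opt/app/dev and DATA_DIR=/data/dev
rm -rf "$d"
```

Deploying to production then means pointing the same entry point at a `prod.env` file instead; nothing inside the job itself changes.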
To wrap this all up, I would say (from my personal experience): understand the basics of scripting. Get some sample scripts from the internet, read them through, understand the logic, and start making small enhancements that change the output. Once you are comfortable with that, you can start writing scripts for complex requirements and keep practicing over time.
From an interview point of view, start with shell scripting interview questions: syntax, error-handling mechanisms, and how script outputs can be read by database jobs or other shell scripts to trigger dependent jobs or other applications that take instructions from the UNIX OS.