Home » Interview Preparation » Azure Data Factory – Interview Questions
1.What is Azure Data Factory?
Answer: Azure Data Factory is a cloud-based, fully managed, server less ETL and data integration service offered by Microsoft Azure for automating data movement from its native place to say a data lake or data warehouse using ETL (extract-transform-load) OR extract-load-transform (ELT). It lets you create and run data pipelines that can help move and transform data and run scheduled pipelines.
2. Is Azure Data Factory ETL or ELT tool?
Answer: It is a cloud-based Microsoft tool that provides a cloud-based integration service for data analytics at scale and supports ETL and ELT paradigms.
Answer: With an increasing amount of big data, there is a need for a service like ADF that can orchestrate and operationalize processes to refine the enormous stores of raw business data into actionable business insights.
Answer: To effectively utilize Data Factory, it’s crucial to understand the following concepts:
- Pipelines: Logical groupings of tasks for specific functions, facilitating data movement and transformation.
- Activities: Individual processing steps within pipelines, performing tasks like data movement and control.
- Datasets: Represent data structures within data stores, used as inputs or outputs in activities.
- Linked service: Connection strings enabling Data Factory to connect to various sources.
- Integration Runtime: Computing environment ensuring activities run efficiently and securely.
- Data Flows: Visual tools for designing data transformations using backend Spark services.
Answer: There are three approaches to executing a pipeline in Data Factory:
- Debug mode serves as a tool for testing and troubleshooting pipeline code.
- Manual Execution involves triggering a pipeline directly by selecting the ‘Trigger now’ option, suitable for ad-hoc runs.
- Scheduling pipelines at predetermined times and intervals using Triggers, with three types available in Data Factory.
Answer:
Linked services serve two primary purposes within Data Factory:
- They represent Data Stores, including storage systems like Azure Blob storage, file shares, or database instances such as Oracle DB or SQL Server.
- They represent Compute resources, where the underlying virtual machine executes activities defined within the pipeline.
Answer: The Integration Runtime (IR) serves as the computational backbone for Azure Data Factory pipelines, acting as a conduit between activities and linked services. Referenced by either entity, it furnishes the computational framework for activity execution or dispatch. This setup enables activities to be conducted in the closest proximity to the target data stores or compute services, optimizing efficiency. The diagram illustrating the location settings for Data Factory and its integration runtimes can be found at Source: Microsoft Docs. Azure Data Factory supports three types of integration runtime, each tailored to specific data integration capabilities and network environment requirements:
- Azure Integration Runtime: Facilitates data copying between cloud data stores and dispatches activities to various computing services like SQL Server, Azure HDInsight, etc.
- Self-Hosted Integration Runtime: Utilized for executing copy activities between cloud data stores and data stores within private networks. This runtime is software sharing the same codebase as the Azure Integration Runtime but installed on local systems or virtual machines over a virtual network.
- Azure SSIS Integration Runtime: Enables the execution of SSIS packages in a managed environment. When transitioning SSIS packages to Data Factory, the Azure SSIS Integration Runtime is employed.
Answer: In Data Factory, the default ceiling on all entities, encompassing pipelines, datasets, triggers, linked services, Private Endpoints, and integration runtimes, is established at 5000. Should there be a necessity for a higher limit, one can escalate the issue by submitting an online support ticket.
Answer: An ARM template is a JSON (JavaScript Object Notation) file that defines the infrastructure and configuration for the data factory pipeline, including pipeline activities, linked services, datasets, etc. The template will contain essentially the same code as our pipeline.
ARM templates are helpful when we want to migrate our pipeline code to higher environments, say Production or Staging from Development, after we are convinced that the code is working correctly.
Answer:
At a very high level, we can achieve this with the below set of steps:
- Create a feature branch that will store our code base.
- Create a pull request to merge the code after we’re sure to the Dev branch.
- Publish the code from dev to generate ARM templates.
- This can trigger an automated CI/CD DevOps pipeline to promote code to higher environments like Staging or Production.
Answer: Mapping data flows are visually designed data transformations in Azure Data Factory. Data flows allow data engineers to develop data transformation logic without writing code. The resulting data flows are executed as activities within Azure Data Factory pipelines that use scaled-out Apache Spark clusters. Data flow activities can be operationalized using existing Azure Data Factory scheduling, control flow, and monitoring capabilities.
Answer: The ETL (Extract, Transform, Load) process follows four main steps:
- Connect and Collect: Connect to the data source/s and move data to local and crowdsource data storage.
- Data transformation using computing services such as HDInsight, Hadoop, Spark etc.
- Publish: To load data into Azure data lake storage, Azure SQL data warehouse, Azure SQL databases, Azure Cosmos DB, etc.
- Monitor: Azure Data Factory has built-in support for pipeline monitoring via Azure Monitor, API, PowerShell, Azure Monitor logs, and health panels on the Azure portal.
- Look-up activity can return the result of executing a query or stored procedure.
- The output can be a singleton value or an array of attributes, which can be consumed in subsequent copy data activity, or any transformation or control flow activity like ForEach activity.
Yes, parameters are a first-class, top-level concept in Data Factory. We can define parameters at the pipeline level and pass arguments as you execute the pipeline run on demand or using a trigger.
Answer: We can use execute notebook activity to pass code to our databricks cluster. We can pass parameters to a notebook activity using baseParameters property. If the parameters are not defined/ specified in the activity, default values from the notebook are executed.
Answer: Data Factory offers full support for CI/CD of your data pipelines using Azure DevOps and GitHub. This allows you to develop and deliver your ETL processes incrementally before publishing the finished product. After the raw data has been refined into a business-ready consumable form, load the data into Azure Data Warehouse or Azure SQL Azure Data Lake, Azure Cosmos DB, or whichever analytics engine your business uses can point to from their business intelligence tools.
16. What to you mean by variables in the Azure Data Factory?
Answer: Variables in the Azure Data Factory pipeline provide the functionality to hold the values. They are used for a similar reason as we use variables in any programming language and are available inside the pipeline.
Set Variable and append variable are two activities used for setting or manipulating the values of the variables. There are two types of the variables in a data factory: –
- System variables: These are fixed variables from the azure pipeline. For example, pipeline name, pipeline id, trigger name etc. You mostly need these to get the system information that might be needed in your use case.
- User variable: A user variable is declared manually in your code based on your pipeline logic.
17. What is copy activity in the ADF?
Answer: Copy activity is one of the most popular and universally used activity in the Azure data factory. It is used for ETL or Lift and Shift, where you want to move the data from one data source to another. While you copy the data, you can also do the transformation; for example, you read the data from txt/csv file, which contains 12 columns; however, while writing to your target data source, you want to keep only seven columns. You can transform it and send only the required number of columns to the destination data source.
18. Can you elaborate more on the Copy activity?
The copy activity performs the following steps at high-level:
- Read data from the source data store. (e.g., blob storage)
- Perform the following tasks on the data:
- Serialization/deserialization
- Compression/decompression
- Column mapping
- Write data to the destination data store or sink. (e.g., azure data lake) This is summarized in the below graphic:
Here you can share some of the major activities if you have used them in your career be it your work or college project. Here are a few of the most used activities :
- Copy Data Activity to copy the data between datasets.
- ForEach Activity for looping.
- Get Metadata Activity which can provide metadata about any data source.
- Set Variable Activity to define and initiate variables within pipelines.
- Lookup Activity to do a lookup to get some values from a table/file.
- Wait Activity to wait for a specified amount of time before/in between the pipeline run.
- Validation Activity will validate the presence of files within the dataset.
- Web Activity to call a custom REST endpoint from an ADF pipeline.
20. How can I schedule a pipeline?
You can use the time window trigger or scheduler trigger to schedule a pipeline. The trigger uses a wall-clock calendar schedule, which can schedule pipelines periodically or in calendar-based recurrent patterns (for example, on Mondays at 6:00 PM and Thursdays at 9:00 PM).
Currently, the service supports three types of triggers:
- Tumbling window trigger: A trigger that operates on a periodic interval while retaining a state.
- Schedule Trigger: A trigger that invokes a pipeline on a wall-clock schedule.
- Event-Based Trigger: A trigger that responds to an event. e.g., a file getting placed inside a blob.
Pipelines and triggers have a many-to-many relationship (except for the tumbling window trigger). Multiple triggers can kick off a single pipeline, or a single trigger can kick off numerous pipelines.
21. When should you choose Azure Data Factory?
One should consider using Data Factory:
- When working with big data, there is a need for a data warehouse to be implemented; you might require a cloud-based integration solution like ADF for the same.
- Not all the team members are experienced in coding and may prefer graphical tools to work with data.
- When raw business data is stored at diverse data sources, which can be on-prem and on the cloud, we would like to have one analytics solution like ADF to integrate them all in one place.
- We would like to use readily available data movement and processing solutions and like to be light in terms of infrastructure management. So, a managed solution like ADF makes more sense in this case.
Answer: The mapping data flow feature allows Azure SQL Database, Azure Synapse Analytics, delimited text files from azure storage account or Azure Data Lake Storage Gen2, and Parquet files from blob storage or Data Lake Storage Gen2 natively for source and sink data source.
Use the Copy activity to stage data from any other connectors, and then execute a Data Flow activity to transform data after it’s been staged.
23. Is it possible to calculate a value for a new column from the existing column from mapping in ADF?
We can derive transformations in the mapping data flow to generate a new column based on our desired logic. We can create a new derived column or update an existing one when generating a derived one. Enter the name of the column you’re creating in the Column textbox.
You can use the column dropdown to override an existing column in your schema. Click the Enter expression textbox to start creating the derived column’s expression. You can input or use the expression builder to build your logic.
24. How is lookup activity useful in the Azure Data Factory?
In the ADF pipeline, the Lookup activity is commonly used for configuration lookup purposes, and the source dataset is available. Moreover, it is used to retrieve the data from the source dataset and then send it as the output of the activity. Generally, the output of the lookup activity is further used in the pipeline for taking some decisions or presenting any configuration as a result.
In simple terms, lookup activity is used for data fetching in the ADF pipeline. The way you would use it entirely relies on your pipeline logic. It is possible to obtain only the first row, or you can retrieve the complete rows depending on your dataset or query.
25. Elaborate more on the Get Metadata activity in Azure Data Factory.
The Get Metadata activity is used to retrieve the metadata of any data in the Azure Data Factory or a Synapse pipeline. We can use the output from the Get Metadata activity in conditional expressions to perform validation or consume the metadata in subsequent activities.
It takes a dataset as an input and returns metadata information as output. Currently, the following connectors and the corresponding retrievable metadata are supported. The maximum size of returned metadata is 4 MB.
Please refer to the snapshot below for supported metadata which can be retrieved using the Get Metadata activity.
26. How to debus an ADF pipeline?
Debugging is one of the crucial aspects of any coding-related activity needed to test the code for any issues it might have. It also provides an option to debug the pipeline without executing it.
27. What does it mean by the breakpoint in the ADF pipeline?
To understand better, for example, you are using three activities in the pipeline, and now you want to debug up to the second activity only. You can do this by placing the breakpoint at the second activity. To add a breakpoint, you can click the circle present at the top of the activity.
28. Explain the data source in the Azure Data Factory.
The data source is the source or destination system that comprises the data intended to be utilized or executed. The type of data can be binary, text, csv files, JSON files, and it. It can be image files, video, audio, or might be a proper database.
Examples of data sources include azure data lake storage, azure blob storage, or any other database such as mysql db, azure sql database, postgres, etc.
29. Can you share any difficulties you faced while getting data from on-premises to Azure cloud Data Factory?
One of the significant challenges we face while migrating from on-prem to cloud is throughput and speed. When we try to copy the data using Copy activity from on- prem, the speed of the process is relatively slow, and hence we don’t get the desired throughput.
There are some configuration options for a copy activity, which can help in tuning this process and can give desired results.
- We should use the compression option to get the data in a compressed mode while loading from on-prem servers, which is then de-compressed while writing on the cloud storage.
- Staging area should be the first destination of our data after we have enabled the compression. The copy activity can decompress before writing it to the final cloud storage buckets.
- Degree of Copy Parallelism is another option to help improve the migration process. This is identical to having multiple threads processing data and can speed up the data copy process.
- There is no right fit-for-all here, so we must try out different numbers like 8, 16, or 32 and see which gives a good performance.
- Data Integration Unit is loosely the number of CPUs used, and increasing it may improve the performance of the copy process.
30. How to copy multiple sheet data from an Excel file?
When we use an excel connector within a data factory, we must provide a sheet name from which we have to load data. This approach is nuanced when we have to deal with a single or a handful of sheets’ data, but when we have lots of sheets (say 10+), this may become a tedious task as we have to change the hard-coded sheet name every time!
However, we can use a data factory binary data format connector for this and point it to the excel file and need not provide the sheet name/s. We’ll be able to use copy activity to copy the data from all the sheets present in the file.
31. Is it possible to have nested looping in ADF?
There is no direct support for nested looping in the data factory for any looping activity (for each / until). However, we can use one for each/until loop activity which will contain an execute pipeline activity that can have a loop activity. This way, when we call the loop activity it will indirectly call another loop activity, and we’ll be able to achieve nested looping.
32. How to copy multiple tables from on datastore to another datastore?
An efficient approach to complete this task would be:
- Maintain a lookup table/ file which will contain the list of tables and their source, which needs to be copied.
- Then, we can use the lookup activity and each loop activity to scan through the list.
- Inside the for each loop activity, we can use a copy activity or a mapping dataflow to accomplish the task of copying multiple tables to the destination datastore.
We could consider the below set of parameters for tuning the performance of a Mapping Data Flow activity we have in a pipeline.
i) We should try to leverage partitioning in the source, sink, or transformation whenever possible.
Microsoft, however, recommends that we use the default partition (size 128 MB) selected by the Data Factory as it intelligently chooses one based on our pipeline configuration.
Still, one should try out different partitions and see if they can have improved performance.
ii) We should not use a data flow activity for each loop activity. Instead, suppose we have multiple files similar in terms of structure and the processing need. In that case, we should use a wildcard path inside the data flow activity, enabling the processing of all the files within a folder.
iii) The recommended file format to use is ‘. parquet’. The reason being the pipeline will execute by spinning up spark clusters, and Parquet is the native file format
for Apache Spark; thus it will generally give good performance.
iv) Multiple logging modes are available: Basic, Verbose, and None.
We should not use verbose mode unless essential, as it will log all the details about each operation the activity is performing. e.g., It will log all the details of the operations performed for all the partitions we have. This one is useful when troubleshooting issues with the data flow.
The basic mode will give out all the necessary basic details in the log, so try to use this one whenever possible.
v) Try to break down a complex data flow activity into multiple data flow activities. Let’s say we have n number of transformations between source and sink, and by
adding more, we think the design has become complex. In this case, try to have it in multiple such activities, which will give two advantages:
a) All activities will run on separate spark clusters, so the run time will come down for the whole task.
b) The whole pipeline will be easy to understand and maintain in the future.
34. What are some of the limitations of ADF?
Azure Data Factory provides great functionalities for data movement and transformations. However, there are some limitations as well.
- We can’t have nested looping activities in the data factory, and we must use some workaround if we have that sort of structure in our pipeline. All the looping activities come under this: If, Foreach, switch, and until activities.
- The lookup activity can retrieve only 5000 rows at a time and not more than that. Again, we need to use some other loop activity along with SQL with the limit to achieve this sort of structure in the pipeline.
- We can have a maximum of 40 activities in a single pipeline, including everything: inner activity, containers, etc. To overcome this, we should try to modularize the pipelines regarding the number of datasets, activities, etc.
There are multiple ways to do this:
1. Using Logic Apps with Web/Web hook activity.
Configure a logic app that, upon getting an HTTP request, can send an email to the required set of people for failure. In the pipeline, configure the failure option to hit the URL generated by the logic app.
2. Using Alerts and Metrics from pipeline options.
We can set up this from the pipeline itself, where we get numerous options for email on any activity failure within the pipeline.
36. What is Azure SQL database? Can you integrate it with Data Factory?
Part of the Azure SQL family, Azure SQL Database is an always up-to-date, fully managed relational database service built for the cloud for storing data. We can easily design data pipelines to read and write to SQL DB using the Azure data factory.
Checkout: https://docs.microsoft.com/en-us/azure/data-factory/connector-azure-sql- database?tabs=data-factory
1.What is Azure Data Factory?
Answer: Azure Data Factory is a cloud-based, fully managed, server less ETL and data integration service offered by Microsoft Azure for automating data movement from its native place to say a data lake or data warehouse using ETL (extract-transform-load) OR extract-load-transform (ELT). It lets you create and run data pipelines that can help move and transform data and run scheduled pipelines.
2. Is Azure Data Factory ETL or ELT tool?
Answer: It is a cloud-based Microsoft tool that provides a cloud-based integration service for data analytics at scale and supports ETL and ELT paradigms.
Answer: With an increasing amount of big data, there is a need for a service like ADF that can orchestrate and operationalize processes to refine the enormous stores of raw business data into actionable business insights.
Answer: To effectively utilize Data Factory, it’s crucial to understand the following concepts:
- Pipelines: Logical groupings of tasks for specific functions, facilitating data movement and transformation.
- Activities: Individual processing steps within pipelines, performing tasks like data movement and control.
- Datasets: Represent data structures within data stores, used as inputs or outputs in activities.
- Linked service: Connection strings enabling Data Factory to connect to various sources.
- Integration Runtime: Computing environment ensuring activities run efficiently and securely.
- Data Flows: Visual tools for designing data transformations using backend Spark services.
Answer: There are three approaches to executing a pipeline in Data Factory:
- Debug mode serves as a tool for testing and troubleshooting pipeline code.
- Manual Execution involves triggering a pipeline directly by selecting the ‘Trigger now’ option, suitable for ad-hoc runs.
- Scheduling pipelines at predetermined times and intervals using Triggers, with three types available in Data Factory.
Answer:
Linked services serve two primary purposes within Data Factory:
- They represent Data Stores, including storage systems like Azure Blob storage, file shares, or database instances such as Oracle DB or SQL Server.
- They represent Compute resources, where the underlying virtual machine executes activities defined within the pipeline.
Answer: The Integration Runtime (IR) serves as the computational backbone for Azure Data Factory pipelines, acting as a conduit between activities and linked services. Referenced by either entity, it furnishes the computational framework for activity execution or dispatch. This setup enables activities to be conducted in the closest proximity to the target data stores or compute services, optimizing efficiency. The diagram illustrating the location settings for Data Factory and its integration runtimes can be found at Source: Microsoft Docs. Azure Data Factory supports three types of integration runtime, each tailored to specific data integration capabilities and network environment requirements:
- Azure Integration Runtime: Facilitates data copying between cloud data stores and dispatches activities to various computing services like SQL Server, Azure HDInsight, etc.
- Self-Hosted Integration Runtime: Utilized for executing copy activities between cloud data stores and data stores within private networks. This runtime is software sharing the same codebase as the Azure Integration Runtime but installed on local systems or virtual machines over a virtual network.
- Azure SSIS Integration Runtime: Enables the execution of SSIS packages in a managed environment. When transitioning SSIS packages to Data Factory, the Azure SSIS Integration Runtime is employed.
Answer: In Data Factory, the default ceiling on all entities, encompassing pipelines, datasets, triggers, linked services, Private Endpoints, and integration runtimes, is established at 5000. Should there be a necessity for a higher limit, one can escalate the issue by submitting an online support ticket.
Answer: An ARM template is a JSON (JavaScript Object Notation) file that defines the infrastructure and configuration for the data factory pipeline, including pipeline activities, linked services, datasets, etc. The template will contain essentially the same code as our pipeline.
ARM templates are helpful when we want to migrate our pipeline code to higher environments, say Production or Staging from Development, after we are convinced that the code is working correctly.
Answer:
At a very high level, we can achieve this with the below set of steps:
- Create a feature branch that will store our code base.
- Create a pull request to merge the code after we’re sure to the Dev branch.
- Publish the code from dev to generate ARM templates.
- This can trigger an automated CI/CD DevOps pipeline to promote code to higher environments like Staging or Production.
Answer: Mapping data flows are visually designed data transformations in Azure Data Factory. Data flows allow data engineers to develop data transformation logic without writing code. The resulting data flows are executed as activities within Azure Data Factory pipelines that use scaled-out Apache Spark clusters. Data flow activities can be operationalized using existing Azure Data Factory scheduling, control flow, and monitoring capabilities.
Answer: The ETL (Extract, Transform, Load) process follows four main steps:
- Connect and Collect: Connect to the data source/s and move data to local and crowdsource data storage.
- Data transformation using computing services such as HDInsight, Hadoop, Spark etc.
- Publish: To load data into Azure data lake storage, Azure SQL data warehouse, Azure SQL databases, Azure Cosmos DB, etc.
- Monitor: Azure Data Factory has built-in support for pipeline monitoring via Azure Monitor, API, PowerShell, Azure Monitor logs, and health panels on the Azure portal.
- Look-up activity can return the result of executing a query or stored procedure.
- The output can be a singleton value or an array of attributes, which can be consumed in subsequent copy data activity, or any transformation or control flow activity like ForEach activity.
Yes, parameters are a first-class, top-level concept in Data Factory. We can define parameters at the pipeline level and pass arguments as you execute the pipeline run on demand or using a trigger.
Answer: We can use execute notebook activity to pass code to our databricks cluster. We can pass parameters to a notebook activity using baseParameters property. If the parameters are not defined/ specified in the activity, default values from the notebook are executed.
Answer: Data Factory offers full support for CI/CD of your data pipelines using Azure DevOps and GitHub. This allows you to develop and deliver your ETL processes incrementally before publishing the finished product. After the raw data has been refined into a business-ready consumable form, load the data into Azure Data Warehouse or Azure SQL Azure Data Lake, Azure Cosmos DB, or whichever analytics engine your business uses can point to from their business intelligence tools.
16. What to you mean by variables in the Azure Data Factory?
Answer: Variables in the Azure Data Factory pipeline provide the functionality to hold the values. They are used for a similar reason as we use variables in any programming language and are available inside the pipeline.
Set Variable and append variable are two activities used for setting or manipulating the values of the variables. There are two types of the variables in a data factory: –
- System variables: These are fixed variables from the azure pipeline. For example, pipeline name, pipeline id, trigger name etc. You mostly need these to get the system information that might be needed in your use case.
- User variable: A user variable is declared manually in your code based on your pipeline logic.
17. What is copy activity in the ADF?
Answer: Copy activity is one of the most popular and universally used activity in the Azure data factory. It is used for ETL or Lift and Shift, where you want to move the data from one data source to another. While you copy the data, you can also do the transformation; for example, you read the data from txt/csv file, which contains 12 columns; however, while writing to your target data source, you want to keep only seven columns. You can transform it and send only the required number of columns to the destination data source.
18. Can you elaborate more on the Copy activity?
The copy activity performs the following steps at high-level:
- Read data from the source data store. (e.g., blob storage)
- Perform the following tasks on the data:
- Serialization/deserialization
- Compression/decompression
- Column mapping
- Write data to the destination data store or sink. (e.g., azure data lake) This is summarized in the below graphic:
Here you can share some of the major activities if you have used them in your career be it your work or college project. Here are a few of the most used activities :
- Copy Data Activity to copy the data between datasets.
- ForEach Activity for looping.
- Get Metadata Activity which can provide metadata about any data source.
- Set Variable Activity to define and initiate variables within pipelines.
- Lookup Activity to do a lookup to get some values from a table/file.
- Wait Activity to wait for a specified amount of time before/in between the pipeline run.
- Validation Activity will validate the presence of files within the dataset.
- Web Activity to call a custom REST endpoint from an ADF pipeline.
20. How can I schedule a pipeline?
You can use the time window trigger or scheduler trigger to schedule a pipeline. The trigger uses a wall-clock calendar schedule, which can schedule pipelines periodically or in calendar-based recurrent patterns (for example, on Mondays at 6:00 PM and Thursdays at 9:00 PM).
Currently, the service supports three types of triggers:
- Tumbling window trigger: A trigger that operates on a periodic interval while retaining a state.
- Schedule Trigger: A trigger that invokes a pipeline on a wall-clock schedule.
- Event-Based Trigger: A trigger that responds to an event. e.g., a file getting placed inside a blob.
Pipelines and triggers have a many-to-many relationship (except for the tumbling window trigger). Multiple triggers can kick off a single pipeline, or a single trigger can kick off numerous pipelines.
21. When should you choose Azure Data Factory?
One should consider using Data Factory:
- When working with big data, there is a need for a data warehouse to be implemented; you might require a cloud-based integration solution like ADF for the same.
- Not all the team members are experienced in coding and may prefer graphical tools to work with data.
- When raw business data is stored at diverse data sources, which can be on-prem and on the cloud, we would like to have one analytics solution like ADF to integrate them all in one place.
- We would like to use readily available data movement and processing solutions and like to be light in terms of infrastructure management. So, a managed solution like ADF makes more sense in this case.
Answer: The mapping data flow feature allows Azure SQL Database, Azure Synapse Analytics, delimited text files from azure storage account or Azure Data Lake Storage Gen2, and Parquet files from blob storage or Data Lake Storage Gen2 natively for source and sink data source.
Use the Copy activity to stage data from any other connectors, and then execute a Data Flow activity to transform data after it’s been staged.
23. Is it possible to calculate a value for a new column from the existing column from mapping in ADF?
We can derive transformations in the mapping data flow to generate a new column based on our desired logic. We can create a new derived column or update an existing one when generating a derived one. Enter the name of the column you’re creating in the Column textbox.
You can use the column dropdown to override an existing column in your schema. Click the Enter expression textbox to start creating the derived column’s expression. You can input or use the expression builder to build your logic.
24. How is lookup activity useful in the Azure Data Factory?
In the ADF pipeline, the Lookup activity is commonly used for configuration lookup purposes, and the source dataset is available. Moreover, it is used to retrieve the data from the source dataset and then send it as the output of the activity. Generally, the output of the lookup activity is further used in the pipeline for taking some decisions or presenting any configuration as a result.
In simple terms, lookup activity is used for data fetching in the ADF pipeline. The way you would use it entirely relies on your pipeline logic. It is possible to obtain only the first row, or you can retrieve the complete rows depending on your dataset or query.
25. Elaborate more on the Get Metadata activity in Azure Data Factory.
The Get Metadata activity is used to retrieve the metadata of any data in the Azure Data Factory or a Synapse pipeline. We can use the output from the Get Metadata activity in conditional expressions to perform validation or consume the metadata in subsequent activities.
It takes a dataset as an input and returns metadata information as output. Currently, the following connectors and the corresponding retrievable metadata are supported. The maximum size of returned metadata is 4 MB.
Please refer to the snapshot below for supported metadata which can be retrieved using the Get Metadata activity.
26. How to debus an ADF pipeline?
Debugging is one of the crucial aspects of any coding-related activity needed to test the code for any issues it might have. It also provides an option to debug the pipeline without executing it.
27. What does it mean by the breakpoint in the ADF pipeline?
To understand better, for example, you are using three activities in the pipeline, and now you want to debug up to the second activity only. You can do this by placing the breakpoint at the second activity. To add a breakpoint, you can click the circle present at the top of the activity.
28. Explain the data source in the Azure Data Factory.
The data source is the source or destination system that comprises the data intended to be utilized or executed. The type of data can be binary, text, csv files, JSON files, and it. It can be image files, video, audio, or might be a proper database.
Examples of data sources include azure data lake storage, azure blob storage, or any other database such as mysql db, azure sql database, postgres, etc.
29. Can you share any difficulties you faced while getting data from on-premises to Azure cloud Data Factory?
One of the significant challenges we face while migrating from on-prem to cloud is throughput and speed. When we try to copy the data using Copy activity from on- prem, the speed of the process is relatively slow, and hence we don’t get the desired throughput.
There are some configuration options for a copy activity, which can help in tuning this process and can give desired results.
- We should use the compression option to get the data in a compressed mode while loading from on-prem servers, which is then de-compressed while writing on the cloud storage.
- Staging area should be the first destination of our data after we have enabled the compression. The copy activity can decompress before writing it to the final cloud storage buckets.
- Degree of Copy Parallelism is another option to help improve the migration process. This is identical to having multiple threads processing data and can speed up the data copy process.
- There is no right fit-for-all here, so we must try out different numbers like 8, 16, or 32 and see which gives a good performance.
- Data Integration Unit is loosely the number of CPUs used, and increasing it may improve the performance of the copy process.
30. How to copy multiple sheet data from an Excel file?
When we use an excel connector within a data factory, we must provide a sheet name from which we have to load data. This approach is nuanced when we have to deal with a single or a handful of sheets’ data, but when we have lots of sheets (say 10+), this may become a tedious task as we have to change the hard-coded sheet name every time!
However, we can use a data factory binary data format connector for this and point it to the excel file and need not provide the sheet name/s. We’ll be able to use copy activity to copy the data from all the sheets present in the file.
31. Is it possible to have nested looping in ADF?
There is no direct support for nested looping in the data factory for any looping activity (for each / until). However, we can use one for each/until loop activity which will contain an execute pipeline activity that can have a loop activity. This way, when we call the loop activity it will indirectly call another loop activity, and we’ll be able to achieve nested looping.
32. How to copy multiple tables from on datastore to another datastore?
An efficient approach to complete this task would be:
- Maintain a lookup table/ file which will contain the list of tables and their source, which needs to be copied.
- Then, we can use the lookup activity and each loop activity to scan through the list.
- Inside the for each loop activity, we can use a copy activity or a mapping dataflow to accomplish the task of copying multiple tables to the destination datastore.
We could consider the below set of parameters for tuning the performance of a Mapping Data Flow activity we have in a pipeline.
i) We should try to leverage partitioning in the source, sink, or transformation whenever possible.
Microsoft, however, recommends that we use the default partition (size 128 MB) selected by the Data Factory as it intelligently chooses one based on our pipeline configuration.
Still, one should try out different partitions and see if they can have improved performance.
ii) We should not use a data flow activity for each loop activity. Instead, suppose we have multiple files similar in terms of structure and the processing need. In that case, we should use a wildcard path inside the data flow activity, enabling the processing of all the files within a folder.
iii) The recommended file format to use is ‘. parquet’. The reason being the pipeline will execute by spinning up spark clusters, and Parquet is the native file format
for Apache Spark; thus it will generally give good performance.
iv) Multiple logging modes are available: Basic, Verbose, and None.
We should not use verbose mode unless essential, as it will log all the details about each operation the activity is performing. e.g., It will log all the details of the operations performed for all the partitions we have. This one is useful when troubleshooting issues with the data flow.
The basic mode will give out all the necessary basic details in the log, so try to use this one whenever possible.
v) Try to break down a complex data flow activity into multiple data flow activities. Let’s say we have n number of transformations between source and sink, and by
adding more, we think the design has become complex. In this case, try to have it in multiple such activities, which will give two advantages:
a) All activities will run on separate spark clusters, so the run time will come down for the whole task.
b) The whole pipeline will be easy to understand and maintain in the future.
34. What are some of the limitations of ADF?
Azure Data Factory provides great functionalities for data movement and transformations. However, there are some limitations as well.
- We can’t have nested looping activities in the data factory, and we must use some workaround if we have that sort of structure in our pipeline. All the looping activities come under this: If, Foreach, switch, and until activities.
- The lookup activity can retrieve only 5000 rows at a time and not more than that. Again, we need to use some other loop activity along with SQL with the limit to achieve this sort of structure in the pipeline.
- We can have a maximum of 40 activities in a single pipeline, including everything: inner activity, containers, etc. To overcome this, we should try to modularize the pipelines regarding the number of datasets, activities, etc.
There are multiple ways to do this:
1. Using Logic Apps with Web/Web hook activity.
Configure a logic app that, upon getting an HTTP request, can send an email to the required set of people for failure. In the pipeline, configure the failure option to hit the URL generated by the logic app.
2. Using Alerts and Metrics from pipeline options.
We can set up this from the pipeline itself, where we get numerous options for email on any activity failure within the pipeline.
36. What is Azure SQL database? Can you integrate it with Data Factory?
Part of the Azure SQL family, Azure SQL Database is an always up-to-date, fully managed relational database service built for the cloud for storing data. We can easily design data pipelines to read and write to SQL DB using the Azure data factory.
Checkout: https://docs.microsoft.com/en-us/azure/data-factory/connector-azure-sql- database?tabs=data-factory
Your information will never be shared with any third party