Comprehensive Databricks PySpark Interview Questions & Answers
With over 15 years of experience in data engineering and a distinguished academic background from NIT Calicut, Atchyut brings deep expertise in Databricks, PySpark, and Azure cloud technologies. His exceptional GATE rank of 400 demonstrates his strong analytical foundation, which he now applies to help aspiring data engineers master complex interview scenarios and real-world data processing challenges.
Lazy evaluation means Spark does not immediately execute transformations (like filter, map, select). Instead, it builds a logical execution plan (a DAG) and only runs computations when an action (such as collect(), count(), or write()) is triggered.
This lets Spark optimize across all operations—for example, combining multiple filters before reading data.
Example:
df_filtered = df.filter(df.salary > 50000)
df_selected = df_filtered.select("name", "salary")
# No computation happens yet
output = df_selected.collect() # Only now Spark computes and returns results
Benefits:
• Reduces unnecessary computation
• Enables query optimization (predicate pushdown, pipelining)
• Improves cluster resource usage
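The deferred-execution idea can be illustrated outside Spark with Python generators; this is a pure-Python analogy, not Spark code, and the row data is made up:

```python
# Pure-Python analogy: generators are lazy, like Spark transformations.
processed = []

def filter_high_salary(rows):
    for row in rows:
        processed.append(row["name"])   # record which rows actually ran
        if row["salary"] > 50000:
            yield row

rows = [{"name": "Ann", "salary": 60000}, {"name": "Bob", "salary": 40000}]
pipeline = filter_high_salary(rows)     # builds the pipeline; nothing runs yet
ran_before_action = list(processed)     # still empty: no computation so far
result = list(pipeline)                 # the "action": forces evaluation
# Only now has every row been examined, and result holds Ann's row only.
```

Just as with a DataFrame, constructing the pipeline is free; work happens only when the result is materialized.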
Transformation: Creates a new RDD/DataFrame from an existing one but does not trigger execution. Examples: map(), filter(), select(), withColumn().
Action: Triggers the execution of transformations and returns a result. Examples: collect(), count(), show(), write().
Example:
# Transformation
df2 = df.filter(df['age'] > 18) # No computation yet
# Action
result = df2.count() # Triggers computation
A DAG is a graph structure where each node represents an operation (transformation or action) and edges represent dependencies between operations.
Spark builds a DAG from your series of transformations. When you trigger an action, Spark analyzes the DAG, optimizes it, and breaks it into execution stages with tasks scheduled across the cluster.
In a Databricks notebook, every chain of DataFrame operations builds and maintains a DAG until an action (like show() or write()) is called.
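To make the dependency idea concrete, here is a toy pure-Python sketch of ordering operations by their dependencies (a topological order). This illustrates the concept only; it is not Spark's actual scheduler, and the node names are invented:

```python
# Toy DAG: each node lists the nodes it depends on.
dag = {
    "read":    [],
    "filter":  ["read"],
    "select":  ["filter"],
    "collect": ["select"],
}

def topo_order(dag):
    """Return nodes so that every node appears after its dependencies."""
    order, seen = [], set()
    def visit(node):
        if node in seen:
            return
        for dep in dag[node]:
            visit(dep)          # dependencies are emitted first
        seen.add(node)
        order.append(node)
    for node in dag:
        visit(node)
    return order
```

In practice you can inspect Spark's real plan for any DataFrame with df.explain().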
Partitioning splits a dataset into smaller, logical chunks—partitions—that are processed in parallel by different executors. Too few partitions may underutilize cluster resources; too many may cause scheduling overhead and small-file problems.
Ideal partition size is often determined by data size (e.g., 100-200 MB per partition is typical for performance), and can be adjusted using repartition() or coalesce().
repartition(n): Shuffles all data for a complete redistribution into n partitions. Use when increasing partitions or when even data distribution is needed.
coalesce(n): Reduces number of partitions (without full shuffle). Use when you want fewer partitions, usually after filtering.
Example:
df2 = df.repartition(10) # More partitions (increases parallelism)
df3 = df.coalesce(2) # Fewer partitions (less parallelism, less shuffle)
Caching and persisting store intermediate DataFrames or RDDs in memory (or disk), allowing future actions to reuse results instead of re-computing.
Use cases:
• When a DataFrame is reused multiple times (e.g., in iterative algorithms, in multiple downstream operations of a complex ETL).
Usage:
• Use cache() for the default storage level (MEMORY_ONLY for RDDs; MEMORY_AND_DISK for DataFrames).
• Use persist(StorageLevel.MEMORY_AND_DISK) when you want an explicit storage level, e.g., for larger data or when memory is limited.
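The payoff of caching is avoiding recomputation. A plain-Python analogy (not the Spark API, just the idea) with a counter showing how many times the "transformation" actually runs:

```python
calls = {"n": 0}

def expensive_transform(data):
    calls["n"] += 1                      # count recomputations
    return [x * 2 for x in data]

data = [1, 2, 3]
# Without caching: each "action" recomputes the whole transformation.
total = sum(expensive_transform(data))   # 1st computation
count = len(expensive_transform(data))   # 2nd computation
# With "caching": materialize once, then reuse (like df.cache()).
cached = expensive_transform(data)       # 3rd and final computation
total2 = sum(cached)                     # no recomputation
count2 = len(cached)                     # no recomputation
```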
Key metrics to check:
• Stage and task duration: to locate bottlenecks.
• Shuffle read/write metrics: to detect expensive data shuffles.
• Skewed partitions: some tasks taking much longer (data skew).
• Executor metrics: to check CPU/memory resource utilization.
Identify slow stages, skew, and resource bottlenecks to optimize the pipeline.
Clauses (in execution order, not written order):
1. FROM (specifies source table(s))
2. WHERE (filters rows)
3. GROUP BY (forms groups)
4. HAVING (filters groups)
5. SELECT (selects columns/expressions)
6. ORDER BY (sorts results)
Written SQL:
SELECT col1, COUNT(*) AS cnt
FROM mytable
WHERE col3 > 100
GROUP BY col1
HAVING COUNT(*) > 2
ORDER BY cnt DESC;
Execution order ensures filters and grouping happen before selection and sorting.
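The execution order can be verified with a small in-memory SQLite table from Python; the table shape matches the query above, and the sample data is made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (col1 TEXT, col2 INTEGER, col3 INTEGER)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [("a", 1, 150), ("a", 2, 200), ("a", 3, 120),
     ("b", 4, 50),  ("b", 5, 300), ("b", 6, 180)],
)
rows = conn.execute("""
    SELECT col1, COUNT(*) AS cnt   -- 5. SELECT runs after grouping
    FROM mytable                   -- 1. FROM names the source
    WHERE col3 > 100               -- 2. WHERE filters rows first
    GROUP BY col1                  -- 3. GROUP BY forms groups
    HAVING COUNT(*) > 2            -- 4. HAVING filters whole groups
    ORDER BY cnt DESC              -- 6. ORDER BY sorts last
""").fetchall()
# Group "a" keeps 3 rows after WHERE and survives HAVING; "b" keeps only 2.
```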
INNER JOIN: Rows must match in both tables.
Scenario: Find customers who placed orders.
LEFT JOIN: All rows from left table, matching from right.
Scenario: List all customers, including those without orders.
RIGHT JOIN: All rows from right table, matching from left.
Scenario: List all orders, including those without a matching customer (rare).
FULL JOIN: All rows from both tables, matched where possible.
Scenario: List all customers and orders, showing matches and unmatched from both.
GROUP BY creates groups of rows with the same value(s) in specified column(s).
HAVING filters after grouping, typically using aggregates.
WHERE filters rows before grouping.
Example:
SELECT dept, COUNT(*) as emp_count
FROM employees
WHERE status = 'ACTIVE'
GROUP BY dept
HAVING COUNT(*) > 5;
This finds departments with more than five active employees.
Parameters: Passed at pipeline execution; set externally. Used for configuration (e.g., filename, date).
Variables: Mutable, scoped within pipeline run, used to store and change state during execution.
Example:
• Parameter: @pipeline().parameters.SourceFilePath
• Variable: Set or append during loop for accumulating file counts.
ADF activities are discrete pipeline steps, such as data read/write, data transformation, or workflow control.
Three commonly used activities:
• Copy Activity: Transfer data between source and sink.
• Lookup Activity: Fetch reference data or metadata for conditional logic.
• ForEach Activity: Iterate over a collection (e.g., file list) to process multiple objects.
Scenario: you need to load daily CSV files from Blob Storage into a SQL database. Use Get Metadata to list the files, then ForEach to loop through the filenames and invoke a Copy or Data Flow activity for each.
ADF supports incremental loads using watermark columns (e.g., last modified timestamps) and parameters. Use Lookup/GetMetadata to get latest processed value, and pass that parameter to the next pipeline/dataflow to fetch only new or changed records.
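The watermark pattern reduces to "keep only rows newer than the last processed value." A plain-Python sketch with made-up data (in ADF this would be a Lookup activity feeding a parameterized source query):

```python
# Hypothetical source rows with a last-modified watermark column.
source = [
    {"id": 1, "modified": "2024-01-01"},
    {"id": 2, "modified": "2024-01-02"},
    {"id": 3, "modified": "2024-01-03"},
]
last_watermark = "2024-01-01"   # value a Lookup would read from a control table
# Incremental load: fetch only rows changed after the watermark.
delta_rows = [r for r in source if r["modified"] > last_watermark]
# Persist the new high-water mark for the next pipeline run.
next_watermark = max(r["modified"] for r in delta_rows)
```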
Lookup Activity:
Reads a row/rows from a table/file and makes it available as object/array (e.g., read latest watermark or config values).
Get Metadata Activity:
Fetches file or folder metadata (e.g., file list, size, modified date) from storage.
Usage: Use Lookup for content, Get Metadata for structure/info.
Blob Storage: General object/file storage, simple flat namespace, widely used for all kinds of files.
ADLS Gen2: Built on Blob, but with hierarchical namespace (folders), fine ACLs, optimized for analytics (big data/Hadoop).
Usage:
• Use Blob: Simple file storage, simple security.
• Use ADLS Gen2: Structured analytics, advanced access controls, big data workloads.
Hierarchical namespace supports directory and file structure, enabling faster file operations (move/rename), efficient directory-level security, and compatibility with analytics frameworks like Hadoop/Spark.
Databricks File System (DBFS) is a virtual file system layer on top of cloud storage (Blob or ADLS), providing a POSIX-like interface (/dbfs/). Users access it with standard file APIs or %fs commands, simplifying file I/O in notebooks.
Delta Lake is a storage layer on top of data lake formats (Parquet) that provides ACID transactions, scalable metadata handling, schema enforcement, and time travel/versioning.
Benefits: Reliable batch/stream processing, data consistency (no dirty reads), rollback/versioning, and simplified pipeline design.
Delta Lake stores a transaction log for all table changes. Using SQL or PySpark, you can query/restore data "as of" a specific timestamp or version.
Example:
SELECT * FROM delta.`/path/table` VERSION AS OF 10;
RESTORE TABLE my_table TO VERSION AS OF 10;
Use case: After accidental deletion/corruption, restore data to a previous version with zero data loss.
SQL Example:
MERGE INTO targetTable AS t
USING sourceTable AS s
ON t.key = s.key
WHEN MATCHED THEN UPDATE SET t.val = s.val
WHEN NOT MATCHED THEN INSERT (key, val) VALUES (s.key, s.val);
PySpark Example:
from delta.tables import DeltaTable

dt = DeltaTable.forPath(spark, "/path/to/table")
dt.alias("t").merge(
source=sourceDF.alias("s"),
condition="t.key = s.key"
).whenMatchedUpdate(set={"val": "s.val"}) \
.whenNotMatchedInsert(values={"key": "s.key", "val": "s.val"}) \
.execute()
ADF:
Use built-in Monitoring tab for pipeline runs, activity status, trigger history. Enable diagnostic logs to export to Log Analytics/Storage/Event Hub for detailed monitoring and alerting.
Databricks:
Use Job UI, Spark History Server, and logging to track notebook/job outcomes. Log custom metrics with MLflow or structured logging within notebooks.
Azure Key Vault is a cloud service for securely storing secrets, keys, and certificates.
Usage:
• Store DB connection strings and secrets.
• Grant pipelines managed identity access to read secrets at runtime.
• Reference secrets in ADF linked services or Databricks notebooks using native integrations.
Medallion Architecture:
Bronze: Raw ingested data (minimal transformation, append only).
Silver: Cleansed, filtered, and joined data; business logic applied.
Gold: Aggregated, analytics-ready data for reporting.
Data flows: ingest raw → clean and enrich → aggregate and publish.
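The bronze → silver → gold flow can be sketched with plain Python lists (illustrative only; in Databricks each layer would typically be a Delta table, and the rows here are invented):

```python
# Bronze: raw ingested rows, appended as-is, including bad records.
bronze = [
    {"customer": "alice", "amount": "100"},
    {"customer": "bob", "amount": None},     # bad record: missing amount
    {"customer": "alice", "amount": "50"},
]
# Silver: cleanse (drop bad rows) and apply types/business logic.
silver = [{"customer": r["customer"], "amount": int(r["amount"])}
          for r in bronze if r["amount"] is not None]
# Gold: aggregate to an analytics-ready shape for reporting.
gold = {}
for r in silver:
    gold[r["customer"]] = gold.get(r["customer"], 0) + r["amount"]
```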
Use user stories to break down features (e.g., "As an analyst, I want to load customer data daily so I can view up-to-date reports").
Sprint planning organizes work into deliverable increments (e.g., finishing the ingestion workflow, implementing SCD handling, building a monitoring dashboard). Tracking velocity, updates, and blockers ensures consistent delivery and collaboration.
© 2024 DataSpark Academy - Empowering Data Engineers
Answer: In PySpark, operations on data are divided into two categories: transformations and actions. These two types of operations are fundamental to understanding how PySpark works, especially in the context of distributed data processing.
Transformations are operations on a DataFrame that return a new DataFrame. They are lazy in nature, meaning that they do not immediately compute their results. Instead, they build up a logical plan of transformations that will be applied when an action is performed. Transformations are used to create a pipeline of operations.
• map(): Applies a function to each element of the RDD (Resilient Distributed Dataset) and returns a new RDD.
• filter(): Returns a new RDD containing only the elements that satisfy a predicate.
• select(): Selects a subset of columns from a DataFrame.
• where(): Filters rows using a given condition.
• groupBy(): Groups the DataFrame using the specified columns.
• agg(): Performs aggregate calculations.
Let's consider a simple example where we filter and select data from a DataFrame:
from pyspark.sql.functions import col
# Create a sample DataFrame
data = [("Alice", 34), ("Bob", 45), ("Catherine", 29)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
# Transformation: filter and select
filtered_df = df.filter(col("Age") > 30).select("Name", "Age")
# Show the transformation result
filtered_df.show()