Comprehensive Databricks PySpark Interview Questions & Answers
With over 15 years of experience in data engineering and a distinguished academic background from NIT Calicut, Atchyut brings deep expertise in Databricks, PySpark, and Azure cloud technologies. His exceptional GATE rank of 400 demonstrates his strong analytical foundation, which he now applies to help aspiring data engineers master complex interview scenarios and real-world data processing challenges.
Lazy evaluation means Spark does not immediately execute transformations (like filter, map, select). Instead, it builds a logical execution plan (a DAG) and only runs computations when an action (such as collect(), count(), or write()) is triggered.
This lets Spark optimize across all operations—for example, combining multiple filters before reading data.
Example:
df_filtered = df.filter(df.salary > 50000)
df_selected = df_filtered.select("name", "salary")
# No computation happens yet
output = df_selected.collect() # Only now Spark computes and returns results
Benefits:
• Reduces unnecessary computation
• Enables query optimization (predicate pushdown, pipelining)
• Improves cluster resource usage
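The deferred-execution idea can be illustrated outside Spark with Python generators; this is a pure-Python analogy, not Spark code, and the row data is made up:

```python
# Pure-Python analogy: generators are lazy, like Spark transformations.
processed = []

def filter_high_salary(rows):
    for row in rows:
        processed.append(row["name"])   # record which rows actually ran
        if row["salary"] > 50000:
            yield row

rows = [{"name": "Ann", "salary": 60000}, {"name": "Bob", "salary": 40000}]
pipeline = filter_high_salary(rows)     # builds the pipeline; nothing runs yet
ran_before_action = list(processed)     # still empty: no computation so far
result = list(pipeline)                 # the "action": forces evaluation
# Only now has every row been examined, and result holds Ann's row only.
```

Just as with a DataFrame, constructing the pipeline is free; work happens only when the result is materialized.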
Transformation: Creates a new RDD/DataFrame from an existing one but does not trigger execution. Examples: map(), filter(), select(), withColumn().
Action: Triggers the execution of transformations and returns a result. Examples: collect(), count(), show(), write().
Example:
# Transformation
df2 = df.filter(df['age'] > 18) # No computation yet
# Action
result = df2.count() # Triggers computation
A DAG is a graph structure where each node represents an operation (transformation or action) and edges represent dependencies between operations.
Spark builds a DAG from your series of transformations. When you trigger an action, Spark analyzes the DAG, optimizes it, and breaks it into execution stages with tasks scheduled across the cluster.
In a Databricks notebook, every chain of DataFrame operations builds and maintains a DAG until an action (like show() or write()) is called.
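To make the dependency idea concrete, here is a toy pure-Python sketch of ordering operations by their dependencies (a topological order). This illustrates the concept only; it is not Spark's actual scheduler, and the node names are invented:

```python
# Toy DAG: each node lists the nodes it depends on.
dag = {
    "read":    [],
    "filter":  ["read"],
    "select":  ["filter"],
    "collect": ["select"],
}

def topo_order(dag):
    """Return nodes so that every node appears after its dependencies."""
    order, seen = [], set()
    def visit(node):
        if node in seen:
            return
        for dep in dag[node]:
            visit(dep)          # dependencies are emitted first
        seen.add(node)
        order.append(node)
    for node in dag:
        visit(node)
    return order
```

In practice you can inspect Spark's real plan for any DataFrame with df.explain().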
Partitioning splits a dataset into smaller, logical chunks—partitions—that are processed in parallel by different executors. Too few partitions may underutilize cluster resources; too many may cause scheduling overhead and small-file problems.
Ideal partition size is often determined by data size (e.g., 100-200 MB per partition is typical for performance), and can be adjusted using repartition() or coalesce().
repartition(n): Shuffles all data for a complete redistribution into n partitions. Use when increasing partitions or when even data distribution is needed.
coalesce(n): Reduces number of partitions (without full shuffle). Use when you want fewer partitions, usually after filtering.
Example:
df2 = df.repartition(10) # More partitions (increases parallelism)
df3 = df.coalesce(2) # Fewer partitions (less parallelism, less shuffle)
Caching and persisting store intermediate DataFrames or RDDs in memory (or disk), allowing future actions to reuse results instead of re-computing.
Use cases:
• When a DataFrame is reused multiple times (e.g., in iterative algorithms, in multiple downstream operations of a complex ETL).
Usage:
• Use cache() for the default storage level (MEMORY_ONLY for RDDs; MEMORY_AND_DISK for DataFrames).
• Use persist(StorageLevel.MEMORY_AND_DISK) when you want an explicit storage level, e.g., for larger data or when memory is limited.
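The payoff of caching is avoiding recomputation. A plain-Python analogy (not the Spark API, just the idea) with a counter showing how many times the "transformation" actually runs:

```python
calls = {"n": 0}

def expensive_transform(data):
    calls["n"] += 1                      # count recomputations
    return [x * 2 for x in data]

data = [1, 2, 3]
# Without caching: each "action" recomputes the whole transformation.
total = sum(expensive_transform(data))   # 1st computation
count = len(expensive_transform(data))   # 2nd computation
# With "caching": materialize once, then reuse (like df.cache()).
cached = expensive_transform(data)       # 3rd and final computation
total2 = sum(cached)                     # no recomputation
count2 = len(cached)                     # no recomputation
```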
Key metrics to check:
• Stage and task duration: to locate bottlenecks.
• Shuffle read/write metrics: to detect expensive data shuffles.
• Skewed partitions: some tasks taking much longer (data skew).
• Executor metrics: to check CPU/memory resource utilization.
Identify slow stages, skew, and resource bottlenecks to optimize the pipeline.
Clauses (in execution order, not written order):
1. FROM (specifies source table(s))
2. WHERE (filters rows)
3. GROUP BY (forms groups)
4. HAVING (filters groups)
5. SELECT (selects columns/expressions)
6. ORDER BY (sorts results)
Written SQL:
SELECT col1, COUNT(*) AS cnt
FROM mytable
WHERE col3 > 100
GROUP BY col1
HAVING COUNT(*) > 2
ORDER BY cnt DESC;
Execution order ensures filters and grouping happen before selection and sorting.
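The execution order can be verified with a small in-memory SQLite table from Python; the table shape matches the query above, and the sample data is made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (col1 TEXT, col2 INTEGER, col3 INTEGER)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [("a", 1, 150), ("a", 2, 200), ("a", 3, 120),
     ("b", 4, 50),  ("b", 5, 300), ("b", 6, 180)],
)
rows = conn.execute("""
    SELECT col1, COUNT(*) AS cnt   -- 5. SELECT runs after grouping
    FROM mytable                   -- 1. FROM names the source
    WHERE col3 > 100               -- 2. WHERE filters rows first
    GROUP BY col1                  -- 3. GROUP BY forms groups
    HAVING COUNT(*) > 2            -- 4. HAVING filters whole groups
    ORDER BY cnt DESC              -- 6. ORDER BY sorts last
""").fetchall()
# Group "a" keeps 3 rows after WHERE and survives HAVING; "b" keeps only 2.
```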
INNER JOIN: Rows must match in both tables.
Scenario: Find customers who placed orders.
LEFT JOIN: All rows from left table, matching from right.
Scenario: List all customers, including those without orders.
RIGHT JOIN: All rows from right table, matching from left.
Scenario: List all orders, including those without a matching customer (rare).
FULL JOIN: All rows from both tables, matched where possible.
Scenario: List all customers and orders, showing matches and unmatched from both.
GROUP BY creates groups of rows with the same value(s) in specified column(s).
HAVING filters after grouping, typically using aggregates.
WHERE filters rows before grouping.
Example:
SELECT dept, COUNT(*) as emp_count
FROM employees
WHERE status = 'ACTIVE'
GROUP BY dept
HAVING COUNT(*) > 5;
This finds departments with more than five active employees.
Parameters: Passed at pipeline execution; set externally. Used for configuration (e.g., filename, date).
Variables: Mutable, scoped within pipeline run, used to store and change state during execution.
Example:
• Parameter: @pipeline().parameters.SourceFilePath
• Variable: Set or append during loop for accumulating file counts.
ADF activities are discrete pipeline steps, such as data read/write, data transformation, or workflow control.
Three commonly used activities:
• Copy Activity: Transfer data between source and sink.
• Lookup Activity: Fetch reference data or metadata for conditional logic.
• ForEach Activity: Iterate over a collection (e.g., file list) to process multiple objects.
Scenario: you need to load daily CSV files from Blob Storage into a SQL database. Use Get Metadata to list the files, then ForEach to loop through the filenames and invoke a Copy or Data Flow activity for each.
ADF supports incremental loads using watermark columns (e.g., last modified timestamps) and parameters. Use Lookup/GetMetadata to get latest processed value, and pass that parameter to the next pipeline/dataflow to fetch only new or changed records.
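The watermark pattern reduces to "keep only rows newer than the last processed value." A plain-Python sketch with made-up data (in ADF this would be a Lookup activity feeding a parameterized source query):

```python
# Hypothetical source rows with a last-modified watermark column.
source = [
    {"id": 1, "modified": "2024-01-01"},
    {"id": 2, "modified": "2024-01-02"},
    {"id": 3, "modified": "2024-01-03"},
]
last_watermark = "2024-01-01"   # value a Lookup would read from a control table
# Incremental load: fetch only rows changed after the watermark.
delta_rows = [r for r in source if r["modified"] > last_watermark]
# Persist the new high-water mark for the next pipeline run.
next_watermark = max(r["modified"] for r in delta_rows)
```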
Lookup Activity:
Reads a row/rows from a table/file and makes it available as object/array (e.g., read latest watermark or config values).
Get Metadata Activity:
Fetches file or folder metadata (e.g., file list, size, modified date) from storage.
Usage: Use Lookup for content, Get Metadata for structure/info.
Blob Storage: General object/file storage, simple flat namespace, widely used for all kinds of files.
ADLS Gen2: Built on Blob, but with hierarchical namespace (folders), fine ACLs, optimized for analytics (big data/Hadoop).
Usage:
• Use Blob: Simple file storage, simple security.
• Use ADLS Gen2: Structured analytics, advanced access controls, big data workloads.
Hierarchical namespace supports directory and file structure, enabling faster file operations (move/rename), efficient directory-level security, and compatibility with analytics frameworks like Hadoop/Spark.
Databricks File System (DBFS) is a virtual file system layer on top of cloud storage (Blob or ADLS), providing a POSIX-like interface (/dbfs/). Users access it with standard file APIs or %fs commands, simplifying file I/O in notebooks.
Delta Lake is a storage layer on top of data lake formats (Parquet) that provides ACID transactions, scalable metadata handling, schema enforcement, and time travel/versioning.
Benefits: Reliable batch/stream processing, data consistency (no dirty reads), rollback/versioning, and simplified pipeline design.
Delta Lake stores a transaction log for all table changes. Using SQL or PySpark, you can query/restore data "as of" a specific timestamp or version.
Example:
SELECT * FROM delta.`/path/table` VERSION AS OF 10;
RESTORE TABLE my_table TO VERSION AS OF 10;
Use case: After accidental deletion/corruption, restore data to a previous version with zero data loss.
SQL Example:
MERGE INTO targetTable AS t
USING sourceTable AS s
ON t.key = s.key
WHEN MATCHED THEN UPDATE SET t.val = s.val
WHEN NOT MATCHED THEN INSERT (key, val) VALUES (s.key, s.val);
PySpark Example:
from delta.tables import DeltaTable

dt = DeltaTable.forPath(spark, "/path/to/table")
dt.alias("t").merge(
source=sourceDF.alias("s"),
condition="t.key = s.key"
).whenMatchedUpdate(set={"val": "s.val"}) \
.whenNotMatchedInsert(values={"key": "s.key", "val": "s.val"}) \
.execute()
ADF:
Use built-in Monitoring tab for pipeline runs, activity status, trigger history. Enable diagnostic logs to export to Log Analytics/Storage/Event Hub for detailed monitoring and alerting.
Databricks:
Use Job UI, Spark History Server, and logging to track notebook/job outcomes. Log custom metrics with MLflow or structured logging within notebooks.
Azure Key Vault is a cloud service for securely storing secrets, keys, and certificates.
Usage:
• Store DB connection strings and secrets.
• Grant pipelines managed identity access to read secrets at runtime.
• Reference secrets in ADF linked services or Databricks notebooks using native integrations.
Medallion Architecture:
Bronze: Raw ingested data (minimal transformation, append only).
Silver: Cleansed, filtered, and joined data; business logic applied.
Gold: Aggregated, analytics-ready data for reporting.
Data flows: ingest raw → clean and enrich → aggregate and publish.
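The bronze → silver → gold flow can be sketched with plain Python lists (illustrative only; in Databricks each layer would typically be a Delta table, and the rows here are invented):

```python
# Bronze: raw ingested rows, appended as-is, including bad records.
bronze = [
    {"customer": "alice", "amount": "100"},
    {"customer": "bob", "amount": None},     # bad record: missing amount
    {"customer": "alice", "amount": "50"},
]
# Silver: cleanse (drop bad rows) and apply types/business logic.
silver = [{"customer": r["customer"], "amount": int(r["amount"])}
          for r in bronze if r["amount"] is not None]
# Gold: aggregate to an analytics-ready shape for reporting.
gold = {}
for r in silver:
    gold[r["customer"]] = gold.get(r["customer"], 0) + r["amount"]
```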
Use user stories to break down features (e.g., "As an analyst, I want to load customer data daily so I can view up-to-date reports").
Sprint planning organizes work into deliverable increments (e.g., finishing the ingestion workflow, implementing SCD handling, building a monitoring dashboard). Tracking velocity, updates, and blockers ensures consistent delivery and collaboration.
© 2024 DataSpark Academy - Empowering Data Engineers
Answer: In PySpark, operations on data are divided into two categories: transformations and actions. These two types of operations are fundamental to understanding how PySpark works, especially in the context of distributed data processing.
Transformations are operations on a DataFrame that return a new DataFrame. They are lazy in nature, meaning that they do not immediately compute their results. Instead, they build up a logical plan of transformations that will be applied when an action is performed. Transformations are used to create a pipeline of operations.
• map(): Applies a function to each element of the RDD (Resilient Distributed Dataset) and returns a new RDD.
• filter(): Returns a new RDD containing only the elements that satisfy a predicate.
• select(): Selects a subset of columns from a DataFrame.
• where(): Filters rows using a given condition.
• groupBy(): Groups the DataFrame using the specified columns.
• agg(): Performs aggregate calculations.
Let's consider a simple example where we filter and select data from a DataFrame:
from pyspark.sql.functions import col
# Create a sample DataFrame
data = [("Alice", 34), ("Bob", 45), ("Catherine", 29)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
# Transformation: filter and select
filtered_df = df.filter(col("Age") > 30).select("Name", "Age")
# Show the transformation result
filtered_df.show()