
Pass Databricks Associate-Developer-Apache-Spark Exam Info and Free Practice Test
New 2023 Latest Questions Associate-Developer-Apache-Spark Dumps - Use Updated Databricks Exam
How to Register for the Databricks Associate-Developer-Apache-Spark Exam
The on-screen steps will show you how to arrange an exam with our partner.
You can see all the available certification exams by clicking on the Certifications tab.
Go to create an account.
You can register for the exam by clicking the Register button.
Prerequisites for the Databricks Associate Developer Apache Spark Exam
- You need a basic understanding of the Spark architecture, including Adaptive Query Execution.
- You need to be able to complete individual data manipulation tasks with the Spark DataFrame API.
NEW QUESTION 93
Which of the following code blocks concatenates rows of DataFrames transactionsDf and transactionsNewDf, omitting any duplicates?
- A. spark.union(transactionsDf, transactionsNewDf).distinct()
- B. transactionsDf.union(transactionsNewDf).distinct()
- C. transactionsDf.join(transactionsNewDf, how="union").distinct()
- D. transactionsDf.union(transactionsNewDf).unique()
- E. transactionsDf.concat(transactionsNewDf).unique()
Answer: B
Explanation:
DataFrame.unique() and DataFrame.concat() do not exist, and union() is not a method of the SparkSession. In addition, DataFrame.join() has no "union" option for its how argument.
More info: pyspark.sql.DataFrame.union - PySpark 3.1.2 documentation
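The correct answer can be wrapped in a small helper, shown here as a minimal sketch. The function name concat_without_duplicates is hypothetical, and the code assumes both DataFrames share the same column order, because union() matches columns by position.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # used for type hints only, so no Spark import happens at runtime
    from pyspark.sql import DataFrame


def concat_without_duplicates(left: "DataFrame", right: "DataFrame") -> "DataFrame":
    """Append the rows of `right` to `left`, then drop duplicate rows."""
    # union() keeps duplicates (like SQL UNION ALL); distinct() removes them.
    return left.union(right).distinct()
```

With the DataFrames from the question, concat_without_duplicates(transactionsDf, transactionsNewDf) would be equivalent to answer B.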
NEW QUESTION 94
The code block displayed below contains an error. The code block should write DataFrame transactionsDf as a parquet file to location filePath after partitioning it on column storeId. Find the error.
Code block:
transactionsDf.write.partitionOn("storeId").parquet(filePath)
- A. No method partitionOn() exists for the DataFrame class, partitionBy() should be used instead.
- B. Column storeId should be wrapped in a col() operator.
- C. The partitioning column as well as the file path should be passed to the write() method of DataFrame transactionsDf directly and not as appended commands as in the code block.
- D. The operator should use the mode() option to configure the DataFrameWriter so that it replaces any existing files at location filePath.
- E. The partitionOn method should be called before the write method.
Answer: A
Explanation:
No method partitionOn() exists for the DataFrame class, partitionBy() should be used instead.
Correct! Find out more about partitionBy() in the documentation (linked below).
The operator should use the mode() option to configure the DataFrameWriter so that it replaces any existing files at location filePath.
No. There is no information about whether files should be overwritten in the question.
The partitioning column as well as the file path should be passed to the write() method of DataFrame transactionsDf directly and not as appended commands as in the code block.
Incorrect. To write a DataFrame to disk, you need to work with a DataFrameWriter object, which you get access to through the DataFrame.write property - no parentheses involved.
Column storeId should be wrapped in a col() operator.
No, this is not necessary - the problem is in the partitionOn command (see above).
The partitionOn method should be called before the write method.
Wrong. First of all partitionOn is not a valid method of DataFrame. However, even assuming partitionOn would be replaced by partitionBy (which is a valid method), this method is a method of DataFrameWriter and not of DataFrame. So, you would always have to first call DataFrame.write to get access to the DataFrameWriter object and afterwards call partitionBy.
More info: pyspark.sql.DataFrameWriter.partitionBy - PySpark 3.1.2 documentation
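The corrected call chain can be sketched as a small helper. The function name write_partitioned_parquet and its column parameter are hypothetical. The question does not say whether existing files should be overwritten, so no mode() is set here.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # used for type hints only, so no Spark import happens at runtime
    from pyspark.sql import DataFrame


def write_partitioned_parquet(df: "DataFrame", path: str, column: str = "storeId") -> None:
    """Write `df` as parquet files under `path`, partitioned by `column`."""
    # df.write is a property (no parentheses) that returns a DataFrameWriter;
    # partitionBy() belongs to that writer, not to the DataFrame itself.
    df.write.partitionBy(column).parquet(path)
```

Calling write_partitioned_parquet(transactionsDf, filePath) would then issue the same chain as the corrected code block.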
NEW QUESTION 95
Which of the following code blocks returns a single-column DataFrame of all entries in Python list throughputRates, which contains only float-type values?
- A. spark.DataFrame(throughputRates, FloatType)
- B. spark.createDataFrame(throughputRates)
- C. spark.createDataFrame(throughputRates, FloatType())
- D. spark.createDataFrame(throughputRates, FloatType)
- E. spark.createDataFrame((throughputRates), FloatType)
Answer: C
Explanation:
spark.createDataFrame(throughputRates, FloatType())
Correct! spark.createDataFrame is the correct operator to use here and the type FloatType() which is passed in for the command's schema argument is correctly instantiated using the parentheses.
Remember that it is essential in PySpark to instantiate types when passing them to SparkSession.createDataFrame. And, in Databricks, spark returns a SparkSession object.
spark.createDataFrame((throughputRates), FloatType)
No. While packing throughputRates in parentheses does not do anything to the execution of this command, not instantiating the FloatType with parentheses as in the previous answer will make this command fail.
spark.createDataFrame(throughputRates, FloatType)
Incorrect. Given that it does not matter whether you pass throughputRates in parentheses or not, see the explanation of the previous answer for further insights.
spark.DataFrame(throughputRates, FloatType)
Wrong. There is no SparkSession.DataFrame() method in Spark.
spark.createDataFrame(throughputRates)
False. Avoiding the schema argument will have PySpark try to infer the schema. However, as you can see in the documentation (linked below), the inference will only work if you pass in an "RDD of either Row, namedtuple, or dict" for data (the first argument to createDataFrame). But since you are passing a Python list, Spark's schema inference will fail.
More info: pyspark.sql.SparkSession.createDataFrame - PySpark 3.1.2 documentation
NEW QUESTION 96
Which of the following code blocks shows the structure of a DataFrame in a tree-like way, containing both column names and types?
- A. print(itemsDf.columns)
  print(itemsDf.dtypes)
- B. itemsDf.printSchema()
- C. itemsDf.print.schema()
- D. spark.schema(itemsDf)
- E. itemsDf.rdd.printSchema()
Answer: B
Explanation:
itemsDf.printSchema()
Correct! Here is an example of what itemsDf.printSchema() shows. You can see the tree-like structure containing both column names and types:
root
 |-- itemId: integer (nullable = true)
 |-- attributes: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- supplier: string (nullable = true)
itemsDf.rdd.printSchema()
No, the DataFrame's underlying RDD does not have a printSchema() method.
spark.schema(itemsDf)
Incorrect, there is no spark.schema command.
print(itemsDf.columns)
print(itemsDf.dtypes)
Wrong. While the output of this code block contains both column names and column types, the information is not arranged in a tree-like way.
itemsDf.print.schema()
No, DataFrame does not have a print method.
NEW QUESTION 97
The code block displayed below contains an error. The code block should trigger Spark to cache DataFrame transactionsDf in executor memory where available, writing to disk where insufficient executor memory is available, in a fault-tolerant way. Find the error.
Code block:
transactionsDf.persist(StorageLevel.MEMORY_AND_DISK)
- A. The code block uses the wrong operator for caching.
- B. Data caching capabilities can be accessed through the spark object, but not through the DataFrame API.
- C. Caching is not supported in Spark, data are always recomputed.
- D. The DataFrameWriter needs to be invoked.
- E. The storage level is inappropriate for fault-tolerant storage.
Answer: E
Explanation:
The storage level is inappropriate for fault-tolerant storage.
Correct. Typically, when thinking about fault tolerance and storage levels, you would want to store redundant copies of the dataset. This can be achieved by using a storage level such as StorageLevel.MEMORY_AND_DISK_2.
The code block uses the wrong command for caching.
Wrong. In this case, DataFrame.persist() needs to be used, since this operator supports passing a storage level.
DataFrame.cache() does not support passing a storage level.
Caching is not supported in Spark, data are always recomputed.
Incorrect. Caching is an important component of Spark, since it can help to accelerate Spark programs to great extent. Caching is often a good idea for datasets that need to be accessed repeatedly.
Data caching capabilities can be accessed through the spark object, but not through the DataFrame API.
No. Caching is either accessed through DataFrame.cache() or DataFrame.persist().
The DataFrameWriter needs to be invoked.
Wrong. The DataFrameWriter can be accessed via DataFrame.write and is used to write data to external data stores, mostly on disk. Here, we find keywords such as "cache" and "executor memory" that point us away from using external data stores. We aim to save data to memory to accelerate the reading process, since reading from disk is comparatively slower. The DataFrameWriter does not write to memory, so we cannot use it here.
More info: Best practices for caching in Spark SQL | by David Vrba | Towards Data Science
NEW QUESTION 98
Which of the following describes characteristics of the Spark driver?
- A. In a non-interactive Spark application, the Spark driver automatically creates the SparkSession object.
- B. The Spark driver's responsibility includes scheduling queries for execution on worker nodes.
- C. The Spark driver requests the transformation of operations into DAG computations from the worker nodes.
- D. If set in the Spark configuration, Spark scales the Spark driver horizontally to improve parallel processing performance.
- E. The Spark driver processes partitions in an optimized, distributed fashion.
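Answer: B
Explanation:
The Spark driver's responsibility includes scheduling queries for execution on worker nodes.
Correct. The driver plans the work and schedules the resulting tasks on executors, which run on the worker nodes.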
The Spark driver requests the transformation of operations into DAG computations from the worker nodes.
No, the Spark driver transforms operations into DAG computations itself.
If set in the Spark configuration, Spark scales the Spark driver horizontally to improve parallel processing performance.
No. There is always a single driver per application, but one or more executors.
The Spark driver processes partitions in an optimized, distributed fashion.
No, this is what executors do.
In a non-interactive Spark application, the Spark driver automatically creates the SparkSession object.
Wrong. In a non-interactive Spark application, you need to create the SparkSession object. In an interactive Spark shell, the Spark driver instantiates the object for you.
NEW QUESTION 99
Which of the following code blocks shuffles DataFrame transactionsDf, which has 8 partitions, so that it has 10 partitions?
- A. transactionsDf.repartition(transactionsDf.rdd.getNumPartitions()+2)
- B. transactionsDf.coalesce(10)
- C. transactionsDf.coalesce(transactionsDf.getNumPartitions()+2)
- D. transactionsDf.repartition(transactionsDf.getNumPartitions()+2)
- E. transactionsDf.repartition(transactionsDf._partitions+2)
Answer: A
Explanation:
transactionsDf.repartition(transactionsDf.rdd.getNumPartitions()+2)
Correct. The repartition operator is the correct one for increasing the number of partitions. Calling getNumPartitions() on DataFrame.rdd returns the current number of partitions.
transactionsDf.coalesce(10)
No, after this command transactionsDf will continue to have only 8 partitions. This is because coalesce() can only decrease the number of partitions, but not increase it.
transactionsDf.repartition(transactionsDf.getNumPartitions()+2)
Incorrect, there is no getNumPartitions() method for the DataFrame class.
transactionsDf.coalesce(transactionsDf.getNumPartitions()+2)
Wrong, coalesce() can only be used for reducing the number of partitions and there is no getNumPartitions() method for the DataFrame class.
transactionsDf.repartition(transactionsDf._partitions+2)
No, DataFrame has no _partitions attribute. You can find out the current number of partitions of a DataFrame with the DataFrame.rdd.getNumPartitions() method.
More info: pyspark.sql.DataFrame.repartition - PySpark 3.1.2 documentation, pyspark.RDD.getNumPartitions - PySpark 3.1.2 documentation
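The correct answer reads the partition count from the underlying RDD and passes the increased number to repartition(). A minimal sketch, with a hypothetical helper name add_partitions and a hypothetical extra parameter:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # used for type hints only, so no Spark import happens at runtime
    from pyspark.sql import DataFrame


def add_partitions(df: "DataFrame", extra: int = 2) -> "DataFrame":
    """Return `df` shuffled into `extra` more partitions than it has now."""
    # The partition count is exposed on the RDD, not on the DataFrame.
    current = df.rdd.getNumPartitions()
    # repartition() performs a full shuffle and can increase the count;
    # coalesce() would leave an 8-partition DataFrame at 8 partitions.
    return df.repartition(current + extra)
```

For the 8-partition transactionsDf from the question, add_partitions(transactionsDf) would request 10 partitions.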
NEW QUESTION 100
Which of the following describes a difference between Spark's cluster and client execution modes?
- A. In cluster mode, the cluster manager resides on a worker node, while it resides on an edge node in client mode.
- B. In cluster mode, a gateway machine hosts the driver, while it is co-located with the executor in client mode.
- C. In cluster mode, the driver resides on a worker node, while it resides on an edge node in client mode.
- D. In cluster mode, the Spark driver is not co-located with the cluster manager, while it is co-located in client mode.
- E. In cluster mode, executor processes run on worker nodes, while they run on gateway nodes in client mode.
Answer: C
Explanation:
In cluster mode, the driver resides on a worker node, while it resides on an edge node in client mode.
Correct. The idea of Spark's client mode is that workloads can be executed from an edge node, also known as gateway machine, from outside the cluster. The most common way to execute Spark however is in cluster mode, where the driver resides on a worker node.
In practice, data transfer between the edge node and the cluster in client mode is typically more constrained than data transfer between worker nodes inside the cluster. Also, any job that is executed in client mode will fail if the edge node fails. For these reasons, client mode is usually not used in a production environment.
In cluster mode, the cluster manager resides on a worker node, while it resides on an edge node in client execution mode.
No. In both execution modes, the cluster manager may reside on a worker node, but it does not reside on an edge node in client mode.
In cluster mode, executor processes run on worker nodes, while they run on gateway nodes in client mode.
This is incorrect. Only the driver runs on gateway nodes (also known as "edge nodes") in client mode, but not the executor processes.
In cluster mode, the Spark driver is not co-located with the cluster manager, while it is co-located in client mode.
No, in client mode, the Spark driver is not co-located with the cluster manager. The whole point of client mode is that the driver is outside the cluster and not associated with the resource that manages the cluster (the machine that runs the cluster manager).
In cluster mode, a gateway machine hosts the driver, while it is co-located with the executor in client mode.
No, it is exactly the opposite: There are no gateway machines in cluster mode, but in client mode, they host the driver.
NEW QUESTION 101
Which of the following code blocks produces the following output, given DataFrame transactionsDf?
Output:
root
 |-- transactionId: integer (nullable = true)
 |-- predError: integer (nullable = true)
 |-- value: integer (nullable = true)
 |-- storeId: integer (nullable = true)
 |-- productId: integer (nullable = true)
 |-- f: integer (nullable = true)
DataFrame transactionsDf:
+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
+-------------+---------+-----+-------+---------+----+
- A. transactionsDf.rdd.formatSchema()
- B. print(transactionsDf.schema)
- C. transactionsDf.printSchema()
- D. transactionsDf.rdd.printSchema()
- E. transactionsDf.schema.print()
Answer: C
Explanation:
The output is the typical output of a DataFrame.printSchema() call. The DataFrame's RDD representation does not have a printSchema or formatSchema method (find available methods in the RDD documentation linked below). The output of print(transactionsDf.schema) is this:
StructType(List(StructField(transactionId,IntegerType,true),StructField(predError,IntegerType,true),StructField(value,IntegerType,true),StructField(storeId,IntegerType,true),StructField(productId,IntegerType,true),StructFiel
It includes the same information as the nicely formatted original output, but is not nicely formatted itself. Lastly, the DataFrame's schema attribute does not have a print() method.
More info:
- pyspark.RDD: pyspark.RDD - PySpark 3.1.2 documentation
- DataFrame.printSchema(): pyspark.sql.DataFrame.printSchema - PySpark 3.1.2 documentation
NEW QUESTION 102
The code block shown below should return a DataFrame with columns transactionId, predError, value, and f from DataFrame transactionsDf. Choose the answer that correctly fills the blanks in the code block to accomplish this.
transactionsDf.__1__(__2__)
- A. 1. filter
  2. "transactionId", "predError", "value", "f"
- B. 1. select
  2. col(["transactionId", "predError", "value", "f"])
- C. 1. select
  2. "transactionId, predError, value, f"
- D. 1. select
  2. ["transactionId", "predError", "value", "f"]
- E. 1. where
  2. col("transactionId"), col("predError"), col("value"), col("f")
Answer: D
Explanation:
Correct code block:
transactionsDf.select(["transactionId", "predError", "value", "f"])
DataFrame.select returns specific columns from the DataFrame and accepts a list of column names as its only argument (passing the names as separate arguments also works). Thus, this is the correct choice here.
The option using col(["transactionId", "predError", "value", "f"]) is invalid, since inside col(), one can only pass a single column name, not a list. Likewise, all columns being specified in a single string like "transactionId, predError, value, f" is not valid syntax.
filter and where filter rows based on conditions; they do not control which columns to return.
NEW QUESTION 103
Which of the following is not a feature of Adaptive Query Execution?
- A. Split skewed partitions into smaller partitions to avoid differences in partition processing time.
- B. Reroute a query in case of an executor failure.
- C. Coalesce partitions to accelerate data processing.
- D. Replace a sort merge join with a broadcast join, where appropriate.
- E. Collect runtime statistics during query execution.
Answer: B
Explanation:
Reroute a query in case of an executor failure.
Correct. Although this feature exists in Spark, it is not a feature of Adaptive Query Execution. The cluster manager keeps track of executors and will work together with the driver to launch an executor and assign the workload of the failed executor to it (see also link below).
Replace a sort merge join with a broadcast join, where appropriate.
No, this is a feature of Adaptive Query Execution.
Coalesce partitions to accelerate data processing.
Wrong, Adaptive Query Execution does this.
Collect runtime statistics during query execution.
Incorrect, Adaptive Query Execution (AQE) collects these statistics to adjust query plans. This feedback loop is an essential part of accelerating queries via AQE.
Split skewed partitions into smaller partitions to avoid differences in partition processing time.
No, this is indeed a feature of Adaptive Query Execution. Find more information in the Databricks blog post linked below.
More info: Learning Spark, 2nd Edition, Chapter 12; On which way does RDD of spark finish fault-tolerance? - Stack Overflow; How to Speed up SQL Queries with Adaptive Query Execution
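The AQE features named in this question are controlled through Spark SQL configuration. The following is a minimal sketch of switching them on in Spark 3.x. The helper name enable_aqe and the constant AQE_SETTINGS are hypothetical, and depending on the Spark version some of these settings may already be enabled by default.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # used for type hints only, so no Spark import happens at runtime
    from pyspark.sql import SparkSession

# Spark 3.x configuration keys related to Adaptive Query Execution.
AQE_SETTINGS = {
    # Master switch: re-optimize query plans using runtime statistics.
    "spark.sql.adaptive.enabled": "true",
    # Coalesce small shuffle partitions after a shuffle.
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    # Split skewed partitions in sort-merge joins.
    "spark.sql.adaptive.skewJoin.enabled": "true",
}


def enable_aqe(spark: "SparkSession") -> None:
    """Apply the AQE-related settings to an existing SparkSession."""
    for key, value in AQE_SETTINGS.items():
        spark.conf.set(key, value)
```

None of these settings concerns executor failure, which is consistent with answer B not being an AQE feature.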
NEW QUESTION 104
Which of the following code blocks displays the 10 rows with the smallest values of column value in DataFrame transactionsDf in a nicely formatted way?
- A. transactionsDf.orderBy("value").asc().show(10)
- B. transactionsDf.sort(col("value").asc()).print(10)
- C. transactionsDf.sort(asc(value)).show(10)
- D. transactionsDf.sort(col("value")).show(10)
- E. transactionsDf.sort(col("value").desc()).head()
Answer: D
Explanation:
show() is the correct method to look for here, since the question specifically asks for displaying the rows in a nicely formatted way. Here is the output of show (only a few rows shown):
+-------------+---------+-----+-------+---------+----+---------------+
|transactionId|predError|value|storeId|productId|   f|transactionDate|
+-------------+---------+-----+-------+---------+----+---------------+
|            3|        3|    1|     25|        3|null|     1585824821|
|            5|     null|    2|   null|        2|null|     1575285427|
|            4|     null|    3|      3|        2|null|     1583244275|
+-------------+---------+-----+-------+---------+----+---------------+
With regard to the sorting, specifically in ascending order since the smallest values should be shown first, the following expressions are valid:
- transactionsDf.sort(col("value")) ("ascending" is the default sort direction in the sort method)
- transactionsDf.sort(asc(col("value")))
- transactionsDf.sort(asc("value"))
- transactionsDf.sort(transactionsDf.value.asc())
- transactionsDf.sort(transactionsDf.value)
Also, orderBy is just an alias of sort, so all of these expressions work equally well using orderBy.
NEW QUESTION 105
Which of the following statements about lazy evaluation is incorrect?
- A. Execution is triggered by transformations.
- B. Predicate pushdown is a feature resulting from lazy evaluation.
- C. Lineages allow Spark to coalesce transformations into stages.
- D. Accumulators do not change the lazy evaluation model of Spark.
- E. Spark will fail a job only during execution, but not during definition.
Answer: A
Explanation:
Execution is triggered by transformations.
Correct. Execution is triggered by actions only, not by transformations.
Lineages allow Spark to coalesce transformations into stages.
Incorrect. In Spark, lineage means a recording of transformations. This lineage enables lazy evaluation in Spark.
Predicate pushdown is a feature resulting from lazy evaluation.
Wrong. Predicate pushdown means that, for example, Spark will execute filters as early in the process as possible so that it deals with the least possible amount of data in subsequent transformations, resulting in performance improvements.
Accumulators do not change the lazy evaluation model of Spark.
Incorrect. In Spark, accumulators are only updated when the query that refers to them is actually executed. In other words, they are not updated if the query is not (yet) executed due to lazy evaluation.
Spark will fail a job only during execution, but not during definition.
Wrong. During definition, due to lazy evaluation, the job is not executed and thus certain errors, for example reading from a non-existing file, cannot be caught. To be caught, the job needs to be executed, for example through an action.
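The difference between transformations and actions can be illustrated with a toy model in plain Python. This is not Spark code and not how Spark is implemented. The class LazyPipeline and the function noisy_double are hypothetical names used only to show that recorded steps (the lineage) run when an action is called, and not before.

```python
class LazyPipeline:
    """Toy stand-in for a lazily evaluated dataset (illustration only)."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)  # recorded lineage of transformations

    def map(self, fn):
        # Transformation: records the step and returns a new pipeline.
        return LazyPipeline(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        # Transformation: records the step and returns a new pipeline.
        return LazyPipeline(self._source, self._steps + (("filter", predicate),))

    def collect(self):
        # Action: replays the recorded steps over the source data.
        rows = list(self._source)
        for kind, fn in self._steps:
            if kind == "map":
                rows = [fn(row) for row in rows]
            else:
                rows = [row for row in rows if fn(row)]
        return rows


calls = []  # records every value the mapped function actually sees


def noisy_double(x):
    calls.append(x)
    return 2 * x


pipeline = LazyPipeline([1, 2, 3]).map(noisy_double).filter(lambda x: x > 2)
calls_before_action = len(calls)  # still 0: defining the pipeline ran nothing
result = pipeline.collect()       # the action triggers execution
```

In the same way, an error inside noisy_double would only surface at collect(), which mirrors why Spark fails a job during execution rather than during definition.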
NEW QUESTION 106
Which of the following code blocks returns a single-row DataFrame that only has a column corr which shows the Pearson correlation coefficient between columns predError and value in DataFrame transactionsDf?
- A. transactionsDf.select(corr(predError, value).alias("corr"))
- B. transactionsDf.select(corr(col("predError"), col("value")).alias("corr")).first()
- C. transactionsDf.select(corr("predError", "value"))
- D. transactionsDf.select(corr(col("predError"), col("value")).alias("corr"))
- E. transactionsDf.select(corr(["predError", "value"]).alias("corr")).first()
Answer: D
Explanation:
In difficulty, this question is above what you can expect from the exam. What this question wants to teach you, however, is to pay attention to the useful details included in the documentation.
pyspark.sql.functions.corr is not a very common method, but it deals with Spark's data structure in an interesting way.
The command takes two columns over multiple rows and returns a single row - similar to an aggregation function. When examining the documentation (linked below), you will find this code example:
a = range(20)
b = [2 * x for x in range(20)]
df = spark.createDataFrame(zip(a, b), ["a", "b"])
df.agg(corr("a", "b").alias('c')).collect()
[Row(c=1.0)]
See how corr just returns a single row? Once you understand this, you should be suspicious about answers that include first(), since there is no need to just select a single row. A reason to eliminate those answers is that DataFrame.first() returns an object of type Row, but not DataFrame, as requested in the question.
transactionsDf.select(corr(col("predError"), col("value")).alias("corr"))
Correct! After calculating the Pearson correlation coefficient, the resulting column is correctly renamed to corr.
transactionsDf.select(corr(predError, value).alias("corr"))
No. In this answer, Python will interpret column names predError and value as variable names.
transactionsDf.select(corr(col("predError"), col("value")).alias("corr")).first()
Incorrect. first() returns a row, not a DataFrame (see above and linked documentation below).
transactionsDf.select(corr("predError", "value"))
Wrong. While this statement returns a DataFrame in the desired shape, the column will have the name corr(predError, value) and not corr.
transactionsDf.select(corr(["predError", "value"]).alias("corr")).first()
False. In addition to first() returning a row, this code block also uses the wrong call structure for command corr, which takes two arguments (the two columns to correlate).
More info:
- pyspark.sql.functions.corr - PySpark 3.1.2 documentation
- pyspark.sql.DataFrame.first - PySpark 3.1.2 documentation
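To see why the documentation example yields a correlation of 1.0, the Pearson coefficient can be computed by hand. This is a plain-Python sketch of the formula, not Spark's implementation. The function name pearson is hypothetical, and the input lists mirror the a and b columns from the documentation example above.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances, each without the common 1/n factor (it cancels).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


a = list(range(20))
b = [2 * x for x in range(20)]
r = pearson(a, b)  # b is an exact linear function of a, so r is 1.0 up to rounding
```

Many input rows collapse into one number here, which is why the corr expression behaves like an aggregation and produces a single-row result.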
NEW QUESTION 107
Which of the following describes properties of a shuffle?
- A. Operations involving shuffles are never evaluated lazily.
- B. Shuffles involve only single partitions.
- C. In a shuffle, Spark writes data to disk.
- D. Shuffles belong to a class known as "full transformations".
- E. A shuffle is one of many actions in Spark.
Answer: C
Explanation:
In a shuffle, Spark writes data to disk.
Correct! Spark's architecture dictates that intermediate results during a shuffle are written to disk.
A shuffle is one of many actions in Spark.
Incorrect. A shuffle is a transformation, but not an action.
Shuffles involve only single partitions.
No, shuffles involve multiple partitions. During a shuffle, Spark generates output partitions from multiple input partitions.
Operations involving shuffles are never evaluated lazily.
Wrong. A shuffle is a costly operation and Spark will evaluate it as lazily as other transformations. That is, it is not evaluated until a subsequent action triggers its evaluation.
Shuffles belong to a class known as "full transformations".
Not quite. Shuffles belong to a class known as "wide transformations". "Full transformation" is not a relevant term in Spark.
More info: Spark - The Definitive Guide, Chapter 2 and Spark: disk I/O on stage boundaries explanation - Stack Overflow
NEW QUESTION 108
Which of the following code blocks sorts DataFrame transactionsDf both by column storeId in ascending and by column productId in descending order, in this priority?
- A. transactionsDf.sort("storeId").sort(desc("productId"))
- B. transactionsDf.order_by(col(storeId), desc(col(productId)))
- C. transactionsDf.sort("storeId", asc("productId"))
- D. transactionsDf.sort("storeId", desc("productId"))
- E. transactionsDf.sort(col(storeId)).desc(col(productId))
Answer: D
Explanation:
In this question it is important to realize that you are asked to sort transactionsDf by two columns. This means that the sorting of the second column depends on the sorting of the first column.
So, any option that sorts the entire DataFrame twice (through chaining sort statements) will not work. The two columns need to be passed in the same call to sort().
Also, order_by is not a valid DataFrame API method.
More info: pyspark.sql.DataFrame.sort - PySpark 3.1.2 documentation
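Besides wrapping a column in desc(), sort() also accepts a list of column names together with an ascending argument. The sketch below uses that form so it needs no function imports. The helper name sort_by_store_then_product is hypothetical.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # used for type hints only, so no Spark import happens at runtime
    from pyspark.sql import DataFrame


def sort_by_store_then_product(df: "DataFrame") -> "DataFrame":
    """Sort by storeId ascending, then by productId descending within each storeId."""
    # Both columns go into one sort() call; the ascending list is matched
    # to the column list position by position.
    return df.sort(["storeId", "productId"], ascending=[True, False])
```

This should give the same ordering as transactionsDf.sort("storeId", desc("productId")) from answer D.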
NEW QUESTION 109
Which of the following code blocks returns a one-column DataFrame for which every row contains an array of all integer numbers from 0 up to, but not including, the number given in column predError of DataFrame transactionsDf, and null if predError is null?
Sample of DataFrame transactionsDf:
+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
|            4|     null| null|      3|        2|null|
|            5|     null| null|   null|        2|null|
|            6|        3|    2|     25|        2|null|
+-------------+---------+-----+-------+---------+----+
- A.
  def count_to_target(target):
      if target is None:
          return

      result = list(range(target))
      return result

  transactionsDf.select(count_to_target(col('predError')))
- B.
  def count_to_target(target):
      if target is None:
          return

      result = list(range(target))
      return result

  count_to_target_udf = udf(count_to_target)

  transactionsDf.select(count_to_target_udf('predError'))
- C.
  def count_to_target(target):
      result = list(range(target))
      return result

  count_to_target_udf = udf(count_to_target, ArrayType(IntegerType()))

  df = transactionsDf.select(count_to_target_udf('predError'))
- D.
  def count_to_target(target):
      if target is None:
          return

      result = [range(target)]
      return result

  count_to_target_udf = udf(count_to_target, ArrayType[IntegerType])

  transactionsDf.select(count_to_target_udf(col('predError')))
- E.
  def count_to_target(target):
      if target is None:
          return

      result = list(range(target))
      return result

  count_to_target_udf = udf(count_to_target, ArrayType(IntegerType()))

  transactionsDf.select(count_to_target_udf('predError'))
Answer: E
Explanation:
Correct code block:
def count_to_target(target):
    if target is None:
        return

    result = list(range(target))
    return result

count_to_target_udf = udf(count_to_target, ArrayType(IntegerType()))

transactionsDf.select(count_to_target_udf('predError'))
Output of correct code block:
+--------------------------+
|count_to_target(predError)|
+--------------------------+
|                 [0, 1, 2]|
|        [0, 1, 2, 3, 4, 5]|
|                 [0, 1, 2]|
|                      null|
|                      null|
|                 [0, 1, 2]|
+--------------------------+
This question is not exactly easy. You need to be familiar with the syntax around UDFs (user-defined functions). Specifically, in this question it is important to pass the correct types to the udf method - returning an array of a specific type rather than just a single type means you need to think harder about type implications than usual.
Remember that in Spark, you always pass types in an instantiated way like ArrayType(IntegerType()), not like ArrayType(IntegerType). The parentheses () are the key here - make sure you do not forget those.
You should also pay attention that you actually pass the UDF count_to_target_udf, and not the Python method count_to_target to the select() operator.
Finally, null values are always a tricky case with UDFs. So, take care that the code can handle them correctly.
More info: How to Turn Python Functions into PySpark Functions (UDF) - Chang Hsin Lee - Committing my thoughts to words.
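The null handling and the instantiated return type can be checked separately. The sketch below keeps the plain Python function testable on its own and only imports PySpark inside the wrapper. The wrapper name build_count_to_target_udf is hypothetical and not part of the original answer.

```python
def count_to_target(target):
    """Return [0, 1, ..., target - 1], or None when target is None."""
    if target is None:
        # Spark passes null column values to Python UDFs as None.
        return None
    return list(range(target))


def build_count_to_target_udf():
    """Wrap count_to_target as a UDF that returns an array of integers."""
    # Imported lazily so the plain function above works without PySpark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, IntegerType

    # Types are passed as instances: ArrayType(IntegerType()),
    # not ArrayType[IntegerType] or ArrayType(IntegerType).
    return udf(count_to_target, ArrayType(IntegerType()))
```

With PySpark available, transactionsDf.select(build_count_to_target_udf()('predError')) would correspond to the correct code block.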
NEW QUESTION 110
Which of the following describes Spark actions?
- A. Actions are Spark's way of modifying RDDs.
- B. The driver receives data upon request by actions.
- C. Writing data to disk is the primary purpose of actions.
- D. Stage boundaries are commonly established by actions.
- E. Actions are Spark's way of exchanging data between executors.
Answer: B
Explanation:
The driver receives data upon request by actions.
Correct! Actions trigger the distributed execution of tasks on executors which, upon task completion, transfer result data back to the driver.
Actions are Spark's way of exchanging data between executors.
No. In Spark, data is exchanged between executors via shuffles.
Writing data to disk is the primary purpose of actions.
No. The primary purpose of actions is to access data that is stored in Spark's RDDs and return the data, often in aggregated form, back to the driver.
Actions are Spark's way of modifying RDDs.
Incorrect. Firstly, RDDs are immutable - they cannot be modified. Secondly, Spark generates new RDDs via transformations and not actions.
Stage boundaries are commonly established by actions.
Wrong. A stage boundary is commonly established by a shuffle, for example caused by a wide transformation.
NEW QUESTION 111
The code block shown below should return a new 2-column DataFrame that shows one attribute from column attributes per row next to the associated itemName, for all suppliers in column supplier whose name includes Sports. Choose the answer that correctly fills the blanks in the code block to accomplish this.
Sample of DataFrame itemsDf:
+------+----------------------------------+-----------------------------+-------------------+
|itemId|itemName                          |attributes                   |supplier           |
+------+----------------------------------+-----------------------------+-------------------+
|1     |Thick Coat for Walking in the Snow|[blue, winter, cozy]         |Sports Company Inc.|
|2     |Elegant Outdoors Summer Dress     |[red, summer, fresh, cooling]|YetiX              |
|3     |Outdoors Backpack                 |[green, summer, travel]      |Sports Company Inc.|
+------+----------------------------------+-----------------------------+-------------------+
Code block:
itemsDf.__1__(__2__).select(__3__, __4__)
- A. 1. filter
2. col("supplier").contains("Sports")
3. "itemName"
4. explode("attributes")
- B. 1. where
2. "Sports".isin(col("Supplier"))
3. "itemName"
4. array_explode("attributes")
- C. 1. filter
2. col("supplier").isin("Sports")
3. "itemName"
4. explode(col("attributes"))
- D. 1. where
2. col("supplier").contains("Sports")
3. "itemName"
4. "attributes"
- E. 1. where
2. col(supplier).contains("Sports")
3. explode(attributes)
4. itemName
Answer: A
Explanation:
Output of correct code block:
+----------------------------------+------+
|itemName                          |col   |
+----------------------------------+------+
|Thick Coat for Walking in the Snow|blue  |
|Thick Coat for Walking in the Snow|winter|
|Thick Coat for Walking in the Snow|cozy  |
|Outdoors Backpack                 |green |
|Outdoors Backpack                 |summer|
|Outdoors Backpack                 |travel|
+----------------------------------+------+
The key to solving this question is knowing about Spark's explode operator. Using this operator, you can extract values from arrays into single rows. The following guidance steps through the answers systematically from the first to the last gap. Note that there are many ways to solve gap questions and filter out wrong answers. You do not always have to start from the first gap; you can also exclude some answers based on obvious problems you see with them.
The answers to the first gap present you with two options: filter and where. These two are actually synonyms in PySpark, so using either of those is fine. The answer options to this gap therefore do not help us in selecting the right answer.
The second gap is more interesting. One answer option includes "Sports".isin(col("Supplier")). This construct does not work, since Python's string type does not have an isin method. Another option contains col(supplier). Here, Python will try to interpret supplier as a variable. We have not set this variable, so this is not a viable answer. Then, you are left with answer options that include col("supplier").contains("Sports") and col("supplier").isin("Sports"). The question states that we are looking for suppliers whose name includes Sports, so we have to go for the contains operator here.
We would use the isin operator if we wanted to filter for supplier names that exactly match any entry in a list of supplier names.
Finally, we are left with two answers that fill the third gap with "itemName" and the fourth gap with either explode("attributes") or "attributes". While both are correct Spark syntax, only explode("attributes") will help us achieve our goal. Specifically, the question asks for one attribute from column attributes per row - this is what the explode() operator does.
One answer option also includes array_explode() which is not a valid operator in PySpark.
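To make the combined effect of contains() and explode() concrete, here is a plain-Python model of what the correct code block computes on the sample data. It does not use Spark at all; the helper name filter_and_explode is made up for illustration, and the dictionaries simply mirror the rows of itemsDf shown above.

```python
# Sample rows copied from the itemsDf table in the question.
items = [
    {"itemId": 1, "itemName": "Thick Coat for Walking in the Snow",
     "attributes": ["blue", "winter", "cozy"], "supplier": "Sports Company Inc."},
    {"itemId": 2, "itemName": "Elegant Outdoors Summer Dress",
     "attributes": ["red", "summer", "fresh", "cooling"], "supplier": "YetiX"},
    {"itemId": 3, "itemName": "Outdoors Backpack",
     "attributes": ["green", "summer", "travel"], "supplier": "Sports Company Inc."},
]


def filter_and_explode(rows):
    """Hypothetical helper modelling filter(...contains...) + explode."""
    result = []
    for row in rows:
        # Models col("supplier").contains("Sports"): a substring match.
        if "Sports" in row["supplier"]:
            # Models explode("attributes"): one output row per array element.
            for attribute in row["attributes"]:
                result.append((row["itemName"], attribute))
    return result


exploded = filter_and_explode(items)
for item_name, attribute in exploded:
    print(item_name, attribute)
```

The two Sports Company Inc. items have three attributes each, which is why the output of the correct code block has six rows.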
More info: pyspark.sql.functions.explode - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3
NEW QUESTION 112
Which of the following code blocks saves DataFrame transactionsDf in location /FileStore/transactions.csv as a CSV file and throws an error if a file already exists in the location?
- A. transactionsDf.write.format("csv").mode("error").save("/FileStore/transactions.csv")
- B. transactionsDf.write.format("csv").mode("ignore").path("/FileStore/transactions.csv")
- C. transactionsDf.write("csv").mode("error").save("/FileStore/transactions.csv")
- D. transactionsDf.write.save("/FileStore/transactions.csv")
- E. transactionsDf.write.format("csv").mode("error").path("/FileStore/transactions.csv")
Answer: A
Explanation:
Static notebook | Dynamic notebook: See test 1
(https://flrs.github.io/spark_practice_tests_code/#1/28.html ,
https://bit.ly/sparkpracticeexams_import_instructions)
NEW QUESTION 113
The code block displayed below contains an error. The code block should use Python method find_most_freq_letter to find the letter present most in column itemName of DataFrame itemsDf and return it in a new column most_frequent_letter. Find the error.
Code block:
find_most_freq_letter_udf = udf(find_most_freq_letter)
itemsDf.withColumn("most_frequent_letter", find_most_freq_letter("itemName"))
- A. The UDF method is not registered correctly, since the return type is missing.
- B. Spark is not using the UDF method correctly.
- C. UDFs do not exist in PySpark.
- D. Spark is not adding a column.
- E. The "itemName" expression should be wrapped in col().
Answer: B
Explanation:
Correct code block:
find_most_freq_letter_udf = udf(find_most_freq_letter)
itemsDf.withColumn("most_frequent_letter", find_most_freq_letter_udf("itemName"))
Spark should use the previously registered find_most_freq_letter_udf method here - but it is not doing that in the original code block. There, it just uses the plain Python function instead of the UDF.
Note that typically, we would have to specify a return type for udf(). In this case we do not, since the default return type for udf() is a string, which is what we are expecting here. If we wanted to return an integer instead, we would have to register the Python function as a UDF using find_most_freq_letter_udf = udf(find_most_freq_letter, IntegerType()).
More info: pyspark.sql.functions.udf - PySpark 3.1.1 documentation
NEW QUESTION 114
The code block displayed below contains an error. The code block should configure Spark so that DataFrames up to a size of 20 MB will be broadcast to all worker nodes when performing a join.
Find the error.
Code block:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 20)
- A. Spark will only broadcast DataFrames that are much smaller than the default value.
- B. Spark will only apply the limit to threshold joins and not to other joins.
- C. The command is evaluated lazily and needs to be followed by an action.
- D. The passed limit has the wrong variable type.
- E. The correct option to write configurations is through spark.config and not spark.conf.
Answer: A
Explanation:
This question is hard. Let's assess the different answers one by one.
Spark will only broadcast DataFrames that are much smaller than the default value.
This is correct. The default value is 10 MB (10485760 bytes). Since the configuration for spark.sql.autoBroadcastJoinThreshold expects a number in bytes (and not megabytes), the code block sets the limit to merely 20 bytes, instead of the requested 20 * 1024 * 1024 (= 20971520) bytes.
The command is evaluated lazily and needs to be followed by an action.
No, this command is evaluated right away!
Spark will only apply the limit to threshold joins and not to other joins.
There are no "threshold joins", so this option does not make any sense.
The correct option to write configurations is through spark.config and not spark.conf.
No, it is indeed spark.conf!
The passed limit has the wrong variable type.
The configuration expects the number of bytes, passed as a number. So, the type of the 20 provided in the code block is fine; only its magnitude is wrong.
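The unit conversion behind the correct answer can be checked with a few lines of plain Python. The helper megabytes_to_bytes and the variable names are made up for illustration; the spark.conf.set call is shown only as a comment because it needs a running Spark session.

```python
BYTES_PER_MEGABYTE = 1024 * 1024


def megabytes_to_bytes(megabytes: int) -> int:
    """Hypothetical helper: convert megabytes to bytes (1 MB = 1024 * 1024 bytes)."""
    return megabytes * BYTES_PER_MEGABYTE


default_threshold_bytes = megabytes_to_bytes(10)    # the 10 MB default
requested_threshold_bytes = megabytes_to_bytes(20)  # the 20 MB asked for

print(default_threshold_bytes)
print(requested_threshold_bytes)

# With a Spark session available, the corrected setting would be:
# spark.conf.set("spark.sql.autoBroadcastJoinThreshold", requested_threshold_bytes)
```

Passing 20 directly, as the original code block does, sets a threshold far below the 10485760-byte default, so only DataFrames of at most 20 bytes would be broadcast.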
NEW QUESTION 115
Which of the following statements about broadcast variables is correct?
- A. Broadcast variables are occasionally dynamically updated on a per-task basis.
- B. Broadcast variables are immutable.
- C. Broadcast variables are serialized with every single task.
- D. Broadcast variables are commonly used for tables that do not fit into memory.
- E. Broadcast variables are local to the worker node and not shared across the cluster.
Answer: B
Explanation:
Broadcast variables are local to the worker node and not shared across the cluster.
This is wrong because broadcast variables are meant to be shared across the cluster. As such, they are never just local to the worker node, but available to all worker nodes.
Broadcast variables are commonly used for tables that do not fit into memory.
This is wrong because broadcast variables can only be broadcast because they are small and do fit into memory.
Broadcast variables are serialized with every single task.
This is wrong because they are cached on every machine in the cluster, precisely to avoid having to be serialized with every single task.
Broadcast variables are occasionally dynamically updated on a per-task basis.
This is wrong because broadcast variables are immutable - they are never updated.
More info: Spark - The Definitive Guide, Chapter 14
NEW QUESTION 116
Which of the following code blocks writes DataFrame itemsDf to disk at storage location filePath, making sure to substitute any existing data at that location?
- A. itemsDf.write.option("parquet").mode("overwrite").path(filePath)
- B. itemsDf.write.mode("overwrite").parquet(filePath)
- C. itemsDf.write(filePath, mode="overwrite")
- D. itemsDf.write.mode("overwrite").path(filePath)
- E. itemsDf.write().parquet(filePath, mode="overwrite")
Answer: B
Explanation:
itemsDf.write.mode("overwrite").parquet(filePath)
Correct! itemsDf.write returns a pyspark.sql.DataFrameWriter instance whose overwriting behavior can be modified via the mode setting or by passing mode="overwrite" to the parquet() command.
Although the parquet format is not prescribed for solving this question, parquet() is a valid operator to initiate Spark to write the data to disk.
itemsDf.write.mode("overwrite").path(filePath)
No. A pyspark.sql.DataFrameWriter instance does not have a path() method.
itemsDf.write.option("parquet").mode("overwrite").path(filePath)
Incorrect, see above. In addition, a file format cannot be passed via the option() method.
itemsDf.write(filePath, mode="overwrite")
Wrong. Unfortunately, this is too simple. You need to obtain a DataFrameWriter for the DataFrame by accessing itemsDf.write, on which you can then call further methods to control how Spark should write the data to disk. You cannot, however, pass arguments to itemsDf.write directly.
itemsDf.write().parquet(filePath, mode="overwrite")
False. See above.
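As a rough mental model of what the mode setting controls, the sketch below imitates the four save modes ("append", "overwrite", "ignore" and "error"/"errorifexists") with a plain Python dictionary standing in for storage. It is not Spark code: the function write and the dictionary are made up for illustration, and Spark raises its own exception type rather than FileExistsError.

```python
def write(storage, path, rows, mode="errorifexists"):
    """Toy model of DataFrameWriter save modes, with a dict as storage."""
    exists = path in storage
    if mode in ("error", "errorifexists"):
        # Default behavior: refuse to touch existing data.
        if exists:
            raise FileExistsError(f"path {path} already exists")
        storage[path] = list(rows)
    elif mode == "overwrite":
        # Replace whatever is stored at the location.
        storage[path] = list(rows)
    elif mode == "append":
        # Add the new rows to the existing ones.
        storage[path] = storage.get(path, []) + list(rows)
    elif mode == "ignore":
        # Silently skip the write if data already exists.
        if not exists:
            storage[path] = list(rows)
    else:
        raise ValueError(f"unknown mode: {mode}")


storage = {}
write(storage, "filePath", [1, 2])                 # first write succeeds
write(storage, "filePath", [3], mode="overwrite")  # substitutes existing data
print(storage["filePath"])
```

In this model, only "overwrite" matches the question's requirement to substitute any existing data at the location.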
More info: pyspark.sql.DataFrameWriter.parquet - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3
NEW QUESTION 117
Which is the highest level in Spark's execution hierarchy?
- A. Executor
- B. Task
- C. Stage
- D. Job
- E. Slot
Answer: D