Latest Databricks-Machine-Learning-Associate free dump - Databricks Certified Machine Learning Associate

Question 1

A data scientist learned during their training to always use 5-fold cross-validation in their model development workflow. A colleague suggests that there are cases where a train-validation split could be preferred over k-fold cross-validation when k > 2.
Which of the following describes a potential benefit of using a train-validation split over k-fold cross-validation in this scenario?

A. Bias is avoidable when using a train-validation split

B. Fewer models need to be trained when using a train-validation split

C. Reproducibility is achievable when using a train-validation split

D. A holdout set is not necessary when using a train-validation split

E. Fewer hyperparameter values need to be tested when using a train-validation split

Answer: B
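
The reasoning behind B can be checked with simple counting: for every hyperparameter combination, k-fold cross-validation fits k models, while a single train-validation split fits one. Below is a minimal sketch in plain Python; the grid and the helper `count_model_fits` are hypothetical and only illustrate the arithmetic.

```python
from itertools import product


def count_model_fits(param_grid, n_folds=None):
    """Return how many models a grid search trains.

    n_folds=None stands for a single train-validation split,
    which needs one fit per hyperparameter combination.
    """
    n_combinations = len(list(product(*param_grid.values())))
    fits_per_combination = 1 if n_folds is None else n_folds
    return n_combinations * fits_per_combination


# Hypothetical grid: 3 x 2 = 6 hyperparameter combinations.
grid = {"max_depth": [2, 4, 6], "learning_rate": [0.1, 0.3]}

split_fits = count_model_fits(grid)           # one fit per combination
cv_fits = count_model_fits(grid, n_folds=5)   # five fits per combination
```

The number of hyperparameter values tested is the same in both cases, which is why option E does not apply; only the number of fits per combination changes.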

Question 2

A new data scientist has started working on an existing machine learning project. The project is a scheduled Job that retrains every day. The project currently exists in a Repo in Databricks. The data scientist has been tasked with improving the feature engineering of the pipeline's preprocessing stage. The data scientist wants to make necessary updates to the code that can be easily adopted into the project without changing what is being run each day.
Which approach should the data scientist take to complete this task?

A. They can create a new branch in Databricks, commit their changes, and push those changes to the Git provider.

B. They can create a new Git repository, import it into Databricks, and copy and paste the existing code from the original repository before making changes.

C. They can clone the notebooks in the repository into a new Databricks Repo and make the necessary changes.

D. They can clone the notebooks in the repository into a Databricks Workspace folder and make the necessary changes.

Answer: A

Question 3

A data scientist has replaced missing values in their feature set with each respective feature variable's median value. A colleague suggests that the data scientist is throwing away valuable information by doing this.
Which of the following approaches can they take to include as much information as possible in the feature set?

A. Refrain from imputing the missing values in favor of letting the machine learning algorithm determine how to handle them

B. Create a binary feature variable for each feature that contained missing values indicating whether each row's value has been imputed

C. Remove all feature variables that originally contained missing values from the feature set

D. Create a constant feature variable for each feature that contained missing values indicating the percentage of rows from the feature that was originally missing

E. Impute the missing values using each respective feature variable's mean value instead of the median value

Answer: B
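
Option B keeps the fact that a value was missing as a feature of its own. A minimal sketch of the idea in plain Python follows; the `ages` column and the helper `impute_with_indicator` are hypothetical, and `None` stands in for a missing value.

```python
from statistics import median


def impute_with_indicator(values):
    """Median-impute a column and return a 0/1 flag marking imputed rows."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    imputed = [fill if v is None else v for v in values]
    was_imputed = [1 if v is None else 0 for v in values]
    return imputed, was_imputed


# Hypothetical feature column with two missing entries.
ages = [34, None, 29, 51, None]
ages_imputed, ages_was_imputed = impute_with_indicator(ages)
```

The model then sees both `ages_imputed` and `ages_was_imputed`, so any signal carried by the missingness itself is not lost when the gaps are filled.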

Question 4

A machine learning engineer is trying to scale a machine learning pipeline by distributing its single-node model tuning process. After broadcasting the entire training data onto each core, each core in the cluster can train one model at a time. Because the tuning process is still running slowly, the engineer wants to increase the level of parallelism from 4 cores to 8 cores to speed up the tuning process. Unfortunately, the total memory in the cluster cannot be increased.
In which of the following scenarios will increasing the level of parallelism from 4 to 8 speed up the tuning process?

A. When the entire data can fit on each core

B. When the tuning process is randomized

C. When the data is particularly long in shape

D. When the data is particularly wide in shape

E. When the model is unable to be parallelized

Answer: A

Question 5

Which of the following describes the relationship between native Spark DataFrames and pandas API on Spark DataFrames?

A. pandas API on Spark DataFrames are less mutable versions of Spark DataFrames

B. pandas API on Spark DataFrames are more performant than Spark DataFrames

C. pandas API on Spark DataFrames are unrelated to Spark DataFrames

D. pandas API on Spark DataFrames are made up of Spark DataFrames and additional metadata

E. pandas API on Spark DataFrames are single-node versions of Spark DataFrames with additional metadata

Answer: D

Question 6

A data scientist has created two linear regression models. The first model uses price as a label variable and the second model uses log(price) as a label variable. When evaluating the RMSE of each model by comparing the label predictions to the actual price values, the data scientist notices that the RMSE for the second model is much larger than the RMSE of the first model.
Which of the following possible explanations for this difference is invalid?

A. The data scientist failed to take the log of the predictions in the first model prior to computing the RMSE

B. The data scientist failed to exponentiate the predictions in the second model prior to computing the RMSE

C. The RMSE is an invalid evaluation metric for regression problems

D. The first model is much more accurate than the second model

E. The second model is much more accurate than the first model

Answer: C
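
The scale mismatch described in option B is easy to reproduce. The sketch below uses made-up prices and made-up predictions on the log scale, and computes RMSE against the actual prices with and without exponentiating first.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


# Hypothetical data: actual prices and a model's predictions of log(price).
actual_price = [100.0, 200.0, 400.0]
log_predictions = [math.log(110.0), math.log(190.0), math.log(380.0)]

# Comparing log-scale predictions directly with raw prices inflates the error.
rmse_without_exp = rmse(actual_price, log_predictions)

# Exponentiating puts predictions and labels back on the same scale.
rmse_with_exp = rmse(actual_price, [math.exp(p) for p in log_predictions])
```

Here `rmse_without_exp` is far larger than `rmse_with_exp` even though the predictions are identical, so a large RMSE for the log-label model does not by itself say the model is worse.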

Question 7

A machine learning engineer is using the following code block to scale the inference of a single-node model on a Spark DataFrame with one million records:

Assuming the default Spark configuration is in place, which of the following is a benefit of using an Iterator?

A. The data will be limited to a single executor preventing the model from being loaded multiple times

B. The model only needs to be loaded once per executor rather than once per batch during the inference process

C. The data will be distributed across multiple executors during the inference process

D. The model will be limited to a single executor preventing the data from being distributed

Answer: B
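
The code block the question refers to is not reproduced on this page, so the following is not that code. It is a plain-Python analogy, with no Spark involved, of why an iterator-style function avoids repeated model loading: setup placed before the loop runs once per iterator rather than once per batch. `DummyModel`, `load_model`, and the batches are all hypothetical.

```python
from typing import Iterator, List

load_count = 0  # counts how often the (pretend) expensive load happens


class DummyModel:
    """Stand-in for a single-node model; it just doubles each input."""

    def predict(self, batch: List[int]) -> List[int]:
        return [2 * x for x in batch]


def load_model() -> DummyModel:
    global load_count
    load_count += 1
    return DummyModel()


def predict_per_batch(batch: List[int]) -> List[int]:
    # The model is loaded again on every call, i.e. once per batch.
    model = load_model()
    return model.predict(batch)


def predict_iterator(batches: Iterator[List[int]]) -> Iterator[List[int]]:
    # The model is loaded once, then reused for every batch in the iterator.
    model = load_model()
    for batch in batches:
        yield model.predict(batch)


batches = [[1, 2], [3, 4], [5, 6]]

per_batch_out = [predict_per_batch(b) for b in batches]
loads_per_batch = load_count

load_count = 0
iterator_out = list(predict_iterator(iter(batches)))
loads_iterator = load_count
```

Both versions return the same predictions; only the number of loads differs. A Spark pandas UDF typed as an iterator of batches follows the same shape, with the load placed before the loop over batches.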

Question 8

A data scientist wants to parallelize the training of trees in a gradient boosted tree to speed up the training process. A colleague suggests that parallelizing a boosted tree algorithm can be difficult.
Which of the following describes why?

A. Gradient boosting calculates gradients in evaluation metrics using all cores which prevents parallelization.

B. Gradient boosting requires access to all data at once which cannot happen during parallelization.

C. Gradient boosting is not a linear algebra-based algorithm which is required for parallelization.

D. Gradient boosting is an iterative algorithm that requires information from the previous iteration to perform the next step.

Answer: D
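
The sequential dependency in option D shows up directly in a toy implementation. The sketch below is a minimal gradient boosting loop for squared error in plain Python, using one-split stumps on a single made-up feature; it is an illustration under those assumptions, not how any particular library implements boosting. Each stage is fit to the residuals left by the stages before it, so stage t cannot start until stage t-1 has finished.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals by least squares."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_stages=5, learning_rate=0.5):
    """Run boosting stages in order and return the training MSE after each."""
    predictions = [sum(ys) / len(ys)] * len(ys)  # start from the label mean
    mse_history = []
    for _ in range(n_stages):
        # Residuals depend on the predictions of all previous stages.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        predictions = [p + learning_rate * stump(x)
                       for p, x in zip(predictions, xs)]
        mse = sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)
        mse_history.append(mse)
    return mse_history


# Hypothetical one-feature dataset.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.5, 3.0, 3.5, 6.0, 6.5]
mse_history = boost(xs, ys)
```

The loop over stages is inherently serial because `residuals` is recomputed from the latest `predictions`. Work inside a single stage, such as the search over split thresholds in `fit_stump`, is the part that can be spread across workers.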
