https://cloud.google.com/dataproc/docs/concepts/iam/iam#iam_roles_and_cloud_dataproc_operations_summary
- What is Dataproc? A managed Spark and Hadoop service that lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning
- What affects Dataproc billing? The size of the cluster and how long it runs. ⚠ Running a Dataproc cluster also incurs charges for the other Google Cloud resources it uses, such as Compute Engine and Cloud Storage.
- What jobs can Dataproc run? Spark, Spark SQL, PySpark, MapReduce, Hive, and Pig jobs.
- What cluster manager does Dataproc use for Spark? YARN
Configuration
- What flag do you use to configure a Dataproc cluster? The --properties flag, in the form file_prefix1:property1=value1,file_prefix2:property2=value2,... The file_prefix maps to a predefined configuration file, and the property maps to a property within that file.
- What do you need to do if there is a comma in a property's value? Specify your own delimiter, as follows: **^#^**file_prefix1:property1=part1,part2#file_prefix2:property2=value2
- What do you drop when configuring a job instead of a Dataproc cluster? The file prefix. Apache Hadoop YARN, HDFS, Spark, and other file-prefixed properties are applied at the cluster level when you create a cluster. Many of these properties can also be applied to specific jobs, and in that case the file prefix is not used.
- Which prefix's property values cannot be modified after the cluster is created? 'dataproc', e.g. dataproc:efm.spark.shuffle=primary-worker
- What are initialization actions? Executables or scripts that Dataproc will run on all nodes in your Dataproc cluster immediately after the cluster is set up
- On which nodes are initialization actions run? On each node during cluster creation. They are also executed on each newly added node when scaling or autoscaling clusters up.
- What can replace initialization actions? Dataproc custom images
- What does a custom machine type "custom-6-23040" represent? 6 vCPUs and 23040 MB of memory (23040 / 1024 = 22.5 GB)
- What's the problem with adding local SSDs to your Dataproc cluster? They are temporary: anything stored on them disappears when the cluster stops
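
The --properties format, the custom-delimiter rule, and the custom machine type arithmetic from the notes above can be sketched in a few lines of Python. This is only an illustration: format_properties and parse_custom_machine_type are hypothetical helper names, not part of gcloud or any Dataproc client library, and the parser assumes the plain custom-VCPUS-MEMORY_MB form used in the notes.

```python
import re


def format_properties(props, alt_delimiter="#"):
    """Build a value for the --properties flag (hypothetical helper).

    props maps (file_prefix, property) tuples to values.
    """
    pairs = [f"{prefix}:{name}={value}" for (prefix, name), value in props.items()]
    # No comma inside any pair: the default comma delimiter is safe.
    if not any("," in pair for pair in pairs):
        return ",".join(pairs)
    # A value contains a comma: switch to ^DELIM^ followed by DELIM-separated pairs.
    if any(alt_delimiter in pair for pair in pairs):
        raise ValueError(f"delimiter {alt_delimiter!r} also appears in a property")
    return f"^{alt_delimiter}^" + alt_delimiter.join(pairs)


def parse_custom_machine_type(machine_type):
    """Return (vCPUs, memory in GB) for a name like 'custom-6-23040'.

    Assumes the plain 'custom-VCPUS-MEMORY_MB' form only.
    """
    match = re.fullmatch(r"custom-(\d+)-(\d+)", machine_type)
    if match is None:
        raise ValueError(f"not a custom machine type: {machine_type!r}")
    vcpus, memory_mb = int(match.group(1)), int(match.group(2))
    return vcpus, memory_mb / 1024


if __name__ == "__main__":
    print(format_properties({
        ("file_prefix1", "property1"): "part1,part2",
        ("file_prefix2", "property2"): "value2",
    }))
    # ^#^file_prefix1:property1=part1,part2#file_prefix2:property2=value2
    print(parse_custom_machine_type("custom-6-23040"))
    # (6, 22.5)
```

The resulting string is what would be passed as the value of --properties when creating a cluster. For a job-level property, the same property name would be given without its file prefix.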
IAM
- Can a user with role 'Dataproc Viewer' submit jobs? No
- Can a user with role 'Dataproc Viewer' create Dataproc clusters? No
- What's the difference between the 'Dataproc Editor' and 'Dataproc Admin' roles? Only 'Dataproc Admin' can perform the operation 'Get/Set Dataproc IAM permissions'