org.apache.spark.SparkException: Could not initialize class com.google.cloud.spark.bigquery.SparkBigQueryConnectorUserAgentProvider

Below is the code I use to load a BigQuery table into my PySpark (Dataproc) cluster and then run the FP-growth algorithm on it. It worked before, but today the same code started failing. It still prints the schema of the loaded DataFrame with .printSchema(), but as soon as I call .show() or .fit() it throws the error shown further down.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, split
from pyspark.sql.types import ArrayType, IntegerType
from pyspark.ml.fpm import FPGrowth

# Start a session with the BigQuery connector jar on the classpath.
spark = SparkSession.builder \
    .appName('Jupyter BigQuery Storage') \
    .config('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-latest.jar') \
    .getOrCreate()

table = "project_name.dataset_name.test_table"
df = spark.read.format("bigquery").option("table", table).load()
df.printSchema()

# Turn the comma-separated "item" string into an array of integers,
# the input format FPGrowth expects.
df = df.withColumn("item", split(col("item"), ",").cast(ArrayType(IntegerType())))

df.printSchema()

df.show(2)

fpGrowth = FPGrowth(itemsCol="item", minSupport=0.01, minConfidence=0.01)
model = fpGrowth.fit(df)

This is the error I get:

Py4JJavaError                             Traceback (most recent call last)
<ipython-input-10-74ec76b0ec20> in <module>
     14     df.printSchema()
     15 
---> 16     df.show(2)
     17 
     18     fpGrowth = FPGrowth(itemsCol="item", minSupport=0.01, minConfidence=0.01)

/usr/lib/spark/python/pyspark/sql/dataframe.py in show(self, n, truncate, vertical)
    378         """
    379         if isinstance(truncate, bool) and truncate:
--> 380             print(self._jdf.showString(n, 20, vertical))
    381         else:
    382             print(self._jdf.showString(n, int(truncate), vertical))

/opt/conda/anaconda/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258 
   1259         for temp_arg in temp_args:

/usr/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

/opt/conda/anaconda/lib/python3.6/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

Py4JJavaError: An error occurred while calling o377.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 18, cluster-we8z-x-0.c.project_name.dataset_name, executor 1): java.lang.NoClassDefFoundError: Could not initialize class com.google.cloud.spark.bigquery.SparkBigQueryConnectorUserAgentProvider
    at com.google.cloud.spark.bigquery.direct.DirectBigQueryRelation$.headerProvider(DirectBigQueryRelation.scala:356)
    at com.google.cloud.spark.bigquery.direct.DirectBigQueryRelation$.createReadClient(DirectBigQueryRelation.scala:333)
    at com.google.cloud.spark.bigquery.direct.DirectBigQueryRelation$$anonfun$$lessinit$greater$default$3$1.apply(DirectBigQueryRelation.scala:42)
    at com.google.cloud.spark.bigquery.direct.DirectBigQueryRelation$$anonfun$$lessinit$greater$default$3$1.apply(DirectBigQueryRelation.scala:42)
    at com.google.cloud.spark.bigquery.direct.BigQueryRDD.compute(BigQueryRDD.scala:46)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1892)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1880)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1879)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1879)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:927)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2113)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2062)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2051)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:738)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
    at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:365)
    at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
    at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3389)
    at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2550)
    at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2550)
    at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
    at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369)
    at org.apache.spark.sql.Dataset.head(Dataset.scala:2550)
    at org.apache.spark.sql.Dataset.take(Dataset.scala:2764)
    at org.apache.spark.sql.Dataset.getRows(Dataset.scala:254)
    at org.apache.spark.sql.Dataset.showString(Dataset.scala:291)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NoClassDefFoundError: Could not initialize class com.google.cloud.spark.bigquery.SparkBigQueryConnectorUserAgentProvider
    at com.google.cloud.spark.bigquery.direct.DirectBigQueryRelation$.headerProvider(DirectBigQueryRelation.scala:356)
    at com.google.cloud.spark.bigquery.direct.DirectBigQueryRelation$.createReadClient(DirectBigQueryRelation.scala:333)
    at com.google.cloud.spark.bigquery.direct.DirectBigQueryRelation$$anonfun$$lessinit$greater$default$3$1.apply(DirectBigQueryRelation.scala:42)
    at com.google.cloud.spark.bigquery.direct.DirectBigQueryRelation$$anonfun$$lessinit$greater$default$3$1.apply(DirectBigQueryRelation.scala:42)
    at com.google.cloud.spark.bigquery.direct.BigQueryRDD.compute(BigQueryRDD.scala:46)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more



Answers (2)


I ran into this problem this morning as well. I had been using gs://spark-lib/bigquery/spark-bigquery-latest.jar when creating the Dataproc cluster:

--properties spark:spark.jars=gs://spark-lib/bigquery/spark-bigquery-latest.jar

That connector jar was switched from a Scala 2.11 build to a Scala 2.12 build yesterday.

To fix my scripts, I had to switch to the spark-bigquery-latest_2.11.jar connector:

--properties spark:spark.jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.11.jar

An issue for the new 2.12 driver has been filed in the GitHub project: https://github.com/GoogleCloudDataproc/spark-bigquery-connector/issues/187
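
If you would rather pin the jar at session level inside the notebook instead of at cluster creation, the same fix would look roughly like this (a minimal sketch reusing the question's app name; only the jar path changes):

from pyspark.sql import SparkSession

# Pin the Scala 2.11 build of the connector explicitly; the unversioned
# "latest" jar now points at a Scala 2.12 build.
spark = (SparkSession.builder
         .appName('Jupyter BigQuery Storage')
         .config('spark.jars',
                 'gs://spark-lib/bigquery/spark-bigquery-latest_2.11.jar')
         .getOrCreate())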

— Kavi Sek, 11.06.2020
Comment: Yesterday the connector got a new release, version 0.16.0, which introduced this bug (sorry). Version 0.16.1 with the fix has been released. — David Rabinowitz, 11.06.2020

Use spark-bigquery connector version 0.16.1 or later, available at gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.11-0.16.1.jar and gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.16.1.jar (pick the one matching your Spark build's Scala version). It is also available in the Maven central repository.
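
For example, session-level pinning via the Maven coordinate would look roughly like this (a minimal sketch: the coordinate mirrors the jar names above, spark.jars.packages must be set before the JVM starts, and the Scala-version check is just one way to decide between the _2.11 and _2.12 artifacts):

from pyspark.sql import SparkSession

# Pull a pinned connector release from Maven Central instead of the
# mutable "latest" GCS alias; swap _2.11 for _2.12 on a Scala 2.12 build.
spark = (SparkSession.builder
         .appName('Jupyter BigQuery Storage')
         .config('spark.jars.packages',
                 'com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.16.1')
         .getOrCreate())

# Report which Scala version this Spark build was compiled against.
print(spark.sparkContext._jvm.scala.util.Properties.versionString())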

— David Rabinowitz, 11.06.2020