AttributeError: 'NoneType' object has no attribute 'setCallSite'

In PySpark, I want to calculate the correlation between two DataFrame vectors using the following code (I have no problem importing pyspark or calling createDataFrame):

from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import Correlation
import pyspark

spark = pyspark.sql.SparkSession.builder.master("local[*]").getOrCreate()

data = [(Vectors.sparse(4, [(0, 1.0), (3, -2.0)]),),
        (Vectors.dense([4.0, 5.0, 0.0, 3.0]),)]
df = spark.createDataFrame(data, ["features"])

r1 = Correlation.corr(df, "features").head()
print("Pearson correlation matrix:\n" + str(r1[0]))

But I got the following AttributeError ('NoneType' object has no attribute 'setCallSite'):

AttributeError                            Traceback (most recent call last)
<ipython-input-136-d553c1ade793> in <module>()
      6 df = spark.createDataFrame(data, ["features"])
----> 8 r1 = Correlation.corr(df, "features").head()
      9 print("Pearson correlation matrix:\n" + str(r1[0]))

/usr/local/lib/python3.6/dist-packages/pyspark/sql/ in head(self, n)
   1130         """
   1131         if n is None:
-> 1132             rs = self.head(1)
   1133             return rs[0] if rs else None
   1134         return self.take(n)

/usr/local/lib/python3.6/dist-packages/pyspark/sql/ in head(self, n)
   1132             rs = self.head(1)
   1133             return rs[0] if rs else None
-> 1134         return self.take(n)
   1136     @ignore_unicode_prefix

/usr/local/lib/python3.6/dist-packages/pyspark/sql/ in take(self, num)
    502         [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
    503         """
--> 504         return self.limit(num).collect()
    506     @since(1.3)

/usr/local/lib/python3.6/dist-packages/pyspark/sql/ in collect(self)
    463         [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
    464         """
--> 465         with SCCallSiteSync(self._sc) as css:
    466             port = self._jdf.collectToPython()
    467         return list(_load_from_socket(port, BatchedSerializer(PickleSerializer())))

/usr/local/lib/python3.6/dist-packages/pyspark/ in __enter__(self)
     70     def __enter__(self):
     71         if SCCallSiteSync._spark_stack_depth == 0:
---> 72             self._context._jsc.setCallSite(self._call_site)
     73         SCCallSiteSync._spark_stack_depth += 1

AttributeError: 'NoneType' object has no attribute 'setCallSite'



There is an issue about this in Apache Jira, and it has been resolved.

[Note: because the issue is resolved, if you are using a Spark version released after October 2019 and still encounter this problem, please report it to Apache Jira.]

The poster suggests forcing your DataFrame's backend to sync with your Spark context:

df.sql_ctx.sparkSession._jsparkSession = spark._jsparkSession
df._sc = spark._sc


This worked for us; hopefully it works in other cases as well.
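If you need the workaround in more than one place, the two assignments can be wrapped in a small helper. This is only a sketch: `resync_dataframe` is a hypothetical name, not part of PySpark, and it relies on the same private attributes (`sql_ctx`, `_jsparkSession`, `_sc`) used in the workaround, which may differ between Spark versions.

```python
def resync_dataframe(df, spark):
    """Point an existing DataFrame at the handles of the live SparkSession.

    Hypothetical helper: it only repeats the two assignments from the
    workaround and touches private PySpark attributes, so treat it as a
    stopgap rather than a supported API.
    """
    # Re-attach the DataFrame's session wrapper to the live JVM session.
    df.sql_ctx.sparkSession._jsparkSession = spark._jsparkSession
    # Re-attach the DataFrame to the live SparkContext.
    df._sc = spark._sc
    return df
```

With a real session you would call `resync_dataframe(df, spark)` on the affected DataFrame before the action that raised the error (for example `.head()` or `.collect()`).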

