(Spark 1.0.1): I have two files on my local machine, t1 and t2.
t1:
a,1
a,2
a,3
a,4
b,6
b,7
b,8
b,8
b,10
t2:
a,matthew
b,crouse
My code (Python) looks like this:
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

# Load t1, split each line on the comma, and build dicts for schema inference
t1raw = sc.textFile("../../data/t1.csv")
t1fields = t1raw.map(lambda l: l.split(","))
t1 = t1fields.map(lambda p: {"alias": p[0], "qty": int(p[1])})
schemat1 = sqlContext.inferSchema(t1)
schemat1.registerAsTable("t1")

# Same for t2
t2raw = sc.textFile("../../data/t2.csv")
t2fields = t2raw.map(lambda l: l.split(","))
t2 = t2fields.map(lambda p: {"alias": p[0], "name": p[1]})
schemat2 = sqlContext.inferSchema(t2)
schemat2.registerAsTable("t2")

q = sqlContext.sql("SELECT * FROM t1")
q.collect()  # immediate
q = sqlContext.sql("SELECT * FROM t2")
q.collect()  # immediate
q = sqlContext.sql("SELECT * FROM t1 WHERE qty > 3")
q.collect()  # immediate
q = sqlContext.sql("SELECT t1.alias, t2.name, t1.qty FROM t1 INNER JOIN t2 ON t1.alias = t2.alias")
q.collect()  # takes about half a minute
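For reference, the result of the join itself is tiny. A plain-Python hash join over the same nine rows (data inlined here from the t1/t2 listings above) sketches what the query computes, which makes the half-minute runtime all the more surprising:

```python
# Plain-Python sketch of what the inner join computes.
# Data inlined from the t1/t2 listings above; this only shows the expected
# result, not how Spark executes the query.
t1 = [("a", 1), ("a", 2), ("a", 3), ("a", 4),
      ("b", 6), ("b", 7), ("b", 8), ("b", 8), ("b", 10)]
t2 = {"a": "matthew", "b": "crouse"}

# Hash join on alias: probe the small t2 lookup table for each t1 row
joined = [(alias, t2[alias], qty) for alias, qty in t1 if alias in t2]

print(len(joined))  # 9 rows
print(joined[0])    # ('a', 'matthew', 1)
```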
A SELECT * on a single table feels immediate, and a WHERE filter on a single table is also immediate. However, joining one table to the other takes about half a minute.
I was under the impression that it would be much faster. Is there something about inner joins that makes Spark SQL slow?
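One way to narrow this down, as a hedged sketch assuming the same `sc` and file paths as above: run the equivalent join on plain pair RDDs, bypassing Spark SQL entirely. If this is also slow, the cost is the shuffle and task scheduling that any join triggers, rather than the SQL layer itself.

```python
# Hedged comparison sketch: the same join via pair RDDs, without Spark SQL.
# Assumes the same SparkContext `sc` and CSV paths as in the question.
t1pairs = sc.textFile("../../data/t1.csv") \
            .map(lambda l: l.split(",")) \
            .map(lambda p: (p[0], int(p[1])))
t2pairs = sc.textFile("../../data/t2.csv") \
            .map(lambda l: l.split(",")) \
            .map(lambda p: (p[0], p[1]))

# join() shuffles both RDDs by key and yields (alias, (qty, name)) pairs
result = t1pairs.join(t2pairs).collect()
```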
Spark SQL Join Speed is Slow