Friday, 12 September 2014

Spark SQL Join Speed is Slow






I'm on Spark 1.0.1 and have two small CSV files on my local machine, t1.csv and t2.csv.


t1:



a,1
a,2
a,3
a,4
b,6
b,7
b,8
b,8
b,10


t2:



a,matthew
b,crouse


My code looks like this (Python):



from pyspark.sql import SQLContext

# sc is an existing SparkContext (for example, the one the pyspark shell provides)
sqlContext = SQLContext(sc)

# Load t1, split each line on commas, and build dict rows so the schema can be inferred
t1raw = sc.textFile("../../data/t1.csv")
t1fields = t1raw.map(lambda l: l.split(","))
t1 = t1fields.map(lambda p: {"alias": p[0], "qty": int(p[1])})
schemat1 = sqlContext.inferSchema(t1)
schemat1.registerAsTable("t1")

# Same steps for t2, which maps each alias to a name
t2raw = sc.textFile("../../data/t2.csv")
t2fields = t2raw.map(lambda l: l.split(","))
t2 = t2fields.map(lambda p: {"alias": p[0], "name": p[1]})
schemat2 = sqlContext.inferSchema(t2)
schemat2.registerAsTable("t2")

q = sqlContext.sql("SELECT * FROM t1")
q.collect()  # immediate

q = sqlContext.sql("SELECT * FROM t2")
q.collect()  # immediate

q = sqlContext.sql("SELECT * FROM t1 WHERE qty > 3")
q.collect()  # immediate

q = sqlContext.sql("SELECT t1.alias, t2.name, t1.qty FROM t1 INNER JOIN t2 ON t1.alias = t2.alias")
q.collect()  # about half a minute


A SELECT * on either table feels immediate, and so does a WHERE filter on a single table. However, joining one table to the other takes about half a minute.
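To put actual numbers on "immediate" versus "half a minute", each collect() could be wrapped in a small timer. This is only a rough sketch: the `timed` helper is a name made up for illustration, not part of Spark, and the stand-in call at the bottom exists only so the snippet runs without a cluster.

```python
import time

def timed(label, action):
    """Run a zero-argument callable and return (result, elapsed_seconds)."""
    start = time.time()
    result = action()
    elapsed = time.time() - start
    print("%s: %.3f s" % (label, elapsed))
    return result, elapsed

# Hypothetical usage with the sqlContext defined in the code above:
#   rows, secs = timed("join", lambda: sqlContext.sql(
#       "SELECT t1.alias, t2.name, t1.qty "
#       "FROM t1 INNER JOIN t2 ON t1.alias = t2.alias").collect())

# Stand-in workload so the sketch runs without Spark
rows, secs = timed("stand-in", lambda: sorted([3, 1, 2]))
```

Timing the same query twice in a row would also show whether the delay is a one-off startup cost or is paid on every join.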


I was under the impression that a join on data this small would be much faster. Is there something about inner joins that makes Spark SQL slow?
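For scale, the join only involves nine rows on one side and two on the other. A minimal plain-Python sketch of the same inner join shows the result I expect. The rows are inlined from the files above, and the names `t1_rows`, `t2_rows`, `name_by_alias`, and `joined` are just for illustration.

```python
# Rows inlined from t1.csv and t2.csv as shown above
t1_rows = [("a", 1), ("a", 2), ("a", 3), ("a", 4),
           ("b", 6), ("b", 7), ("b", 8), ("b", 8), ("b", 10)]
t2_rows = [("a", "matthew"), ("b", "crouse")]

# Build a lookup from alias to name (t2 has exactly one row per alias)
name_by_alias = dict(t2_rows)

# Inner join: keep only t1 rows whose alias also appears in t2
joined = [(alias, name_by_alias[alias], qty)
          for alias, qty in t1_rows
          if alias in name_by_alias]

for row in joined:
    print(row)
```

This is a baseline for comparing output, not a claim about how Spark SQL executes the query.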


