How do I preprocess JSON data before loading it into a Spark DataFrame?
I am trying to transform a large directory of JSON files into CSV files. However, before these JSON files can be turned into a CSV-friendly (or DataFrame-friendly) format, I need to transform and clip them. This is done by the transform_json function.
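Since transform_json is fed to flatMap, it presumably takes one parsed JSON object and returns zero or more cleaned-up records. A hypothetical stand-in, purely for illustration (the real function and its field names are not shown here):

def transform_json(record):
    # Hypothetical stand-in for the real transform_json: it takes one
    # parsed JSON object (a dict) and yields zero or more clipped records,
    # which is the contract flatMap expects.
    for item in record.get("items", []):
        yield {"id": record.get("id"), "value": item}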
Below is a solution that works, but it feels clumsy and slow because of the back and forth between json.loads and json.dumps.
import json

# Parse each JSON file, transform/clip the records, then serialize back
# to JSON strings so that spark_session.read.json can infer the schema.
rdd = (spark_context.textFile('*.json')
       .map(json.loads)
       .flatMap(transform_json)
       .map(json.dumps))

(spark_session.read.json(rdd)
 .write.format("com.databricks.spark.csv")
 .option("header", "true")
 .save("output_dir"))
I need to put them through a PySpark DataFrame because I don't know all of the columns beforehand, and Spark handles the schema inference for me.
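For reference, spark_session.read.json accepts an RDD of JSON strings (which is why the json.dumps step exists), and it merges the keys of all records into a single schema, filling missing columns with null. A minimal illustration, assuming a running spark_context and spark_session:

# Two records with different keys: read.json unions them into one schema.
sample_rdd = spark_context.parallelize(['{"a": 1}', '{"a": 2, "b": "x"}'])
df = spark_session.read.json(sample_rdd)
df.printSchema()  # a: long (nullable), b: string (nullable)
df.show()         # the first row has b = null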
How do I improve this code?