Monday, February 13, 2017

How do I preprocess JSON data before loading into Spark dataframe


I am trying to transform a large directory of JSON files into CSV files. However, before these JSON files can be turned into a CSV- or DataFrame-friendly format, I need to transform and clip them. This is done by the transform_json function.
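For context, a hypothetical transform_json (the real body isn't shown in the question) would take one parsed record and yield zero or more clipped records, for example:

import json

def transform_json(record):
    # Hypothetical example: keep a whitelist of fields and emit one
    # flat record per element of a nested list.
    keep = {k: v for k, v in record.items() if k in ("id", "name")}
    for item in record.get("events", []):
        out = dict(keep)
        out["event"] = item
        yield out

# Round-trip one line the way the Spark pipeline below does:
raw = '{"id": 1, "name": "a", "events": ["x", "y"], "junk": true}'
rows = [json.dumps(r) for r in transform_json(json.loads(raw))]

Because transform_json can return several records per input record, it is applied with flatMap rather than map.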

Below is a solution that works, but it feels clumsy and slow because of the json.loads/json.dumps round trip.

import json

# Parse each line, transform/clip the record, then re-serialize so
# spark_session.read.json can infer the schema from the JSON strings.
rdd = (spark_context.textFile('*.json')
        .map(json.loads)
        .flatMap(transform_json)
        .map(json.dumps))

(spark_session.read.json(rdd)
    .write.format("com.databricks.spark.csv")
    .option("header", "true")
    .save("output_dir"))

I need to put them through a PySpark DataFrame because I don't know all of the columns beforehand, and Spark will infer the schema for me.
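One small consolidation (a sketch, assuming the three steps stay exactly as written) is to fold the parse/transform/serialize chain into a single function passed to one flatMap. Spark pipelines these narrow transformations within a stage either way, so the gain is mainly readability; the json.dumps step is still needed because read.json expects JSON strings. The transform_json stub below is a placeholder for the real function:

import json

def transform_json(record):
    # Stand-in for the real transform_json, which is not shown in
    # the question; here it just passes the record through.
    yield record

def reshape(line):
    # One pass per input line: parse, transform, re-serialize.
    # Intended to replace the map/flatMap/map chain with a single
    # rdd.flatMap(reshape).
    for rec in transform_json(json.loads(line)):
        yield json.dumps(rec)

# In the pipeline this would be used as:
# rdd = spark_context.textFile('*.json').flatMap(reshape)
out = list(reshape('{"a": 1}'))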

How do I improve this code?

