The Python Oracle

How to change dataframe column names in PySpark?

Full question
https://stackoverflow.com/questions/3407...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #apachespark #pyspark #apachesparksql #rename



ACCEPTED ANSWER

Score 521


There are several ways to rename DataFrame columns:

  • Option 1. Using selectExpr.

     data = sqlContext.createDataFrame([("Alberto", 2), ("Dakota", 2)], 
                                       ["Name", "askdaosdka"])
     data.show()
     data.printSchema()
    
     # Output
     #+-------+----------+
     #|   Name|askdaosdka|
     #+-------+----------+
     #|Alberto|         2|
     #| Dakota|         2|
     #+-------+----------+
    
     #root
     # |-- Name: string (nullable = true)
     # |-- askdaosdka: long (nullable = true)
    
     df = data.selectExpr("Name as name", "askdaosdka as age")
     df.show()
     df.printSchema()
    
     # Output
     #+-------+---+
     #|   name|age|
     #+-------+---+
     #|Alberto|  2|
     #| Dakota|  2|
     #+-------+---+
    
     #root
     # |-- name: string (nullable = true)
     # |-- age: long (nullable = true)
    
  • Option 2. Using withColumnRenamed. Note that this method allows you to "overwrite" the same column. The code below uses range, which works on Python 3; on Python 2, xrange is the lazy equivalent.

     from functools import reduce
    
     oldColumns = data.schema.names
     newColumns = ["name", "age"]
    
     # Rename one column per step, threading the DataFrame through reduce.
     df = reduce(lambda acc, idx: acc.withColumnRenamed(oldColumns[idx], newColumns[idx]),
                 range(len(oldColumns)), data)
     df.printSchema()
     df.show()
    
  • Option 3. Using alias; in Scala you can also use as.

     from pyspark.sql.functions import col
    
     # Assign to df rather than data so that data keeps its original column names.
     df = data.select(col("Name").alias("name"), col("askdaosdka").alias("age"))
     df.show()
    
     # Output
     #+-------+---+
     #|   name|age|
     #+-------+---+
     #|Alberto|  2|
     #| Dakota|  2|
     #+-------+---+
    
  • Option 4. Using sqlContext.sql, which lets you run SQL queries on DataFrames registered as tables.

     # On Spark 2.0+, data.createOrReplaceTempView("myTable") is the equivalent DataFrame method.
     sqlContext.registerDataFrameAsTable(data, "myTable")
     df2 = sqlContext.sql("SELECT Name AS name, askdaosdka AS age FROM myTable")
    
     df2.show()
    
     # Output
     #+-------+---+
     #|   name|age|
     #+-------+---+
     #|Alberto|  2|
     #| Dakota|  2|
     #+-------+---+
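
When the renames are held in a dict, the expressions passed to selectExpr in Option 1 can be generated instead of typed by hand. The following is a minimal sketch and not part of the original answer: build_select_exprs and rename_map are hypothetical names, and the commented usage line assumes data is the DataFrame created in Option 1. Backticks quote the names so that columns containing spaces still parse; names that themselves contain a backtick are not handled.

```python
def build_select_exprs(columns, rename_map):
    """Return one selectExpr string per column, keeping the column order.

    Columns missing from rename_map keep their current name.
    """
    exprs = []
    for old in columns:
        new = rename_map.get(old, old)
        # Backticks let Spark SQL parse names with spaces or punctuation.
        exprs.append("`{}` AS `{}`".format(old, new))
    return exprs


rename_map = {"Name": "name", "askdaosdka": "age"}  # hypothetical mapping
exprs = build_select_exprs(["Name", "askdaosdka"], rename_map)
print(exprs)

# With a real DataFrame (assumed to be `data` from Option 1):
# df = data.selectExpr(*build_select_exprs(data.columns, rename_map))
```

Keeping the string building separate from the Spark call makes the mapping easy to inspect before anything is executed.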
    



ANSWER 2

Score 310


df = df.withColumnRenamed("colName", "newColName")\
       .withColumnRenamed("colName2", "newColName2")

The advantage of this approach: when a DataFrame has a long list of columns and you only want to change a few names, you rename just those and leave the rest untouched. It is also very useful when joining tables that have duplicate column names.
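
If the handful of renames lives in a dict, the chained calls can be written as a loop. This is a sketch under assumptions: rename_columns is a hypothetical helper, df is assumed to be a pyspark.sql.DataFrame, and the orders and customers DataFrames in the commented usage are invented for illustration.

```python
def rename_columns(df, rename_map):
    """Apply withColumnRenamed once per entry in rename_map.

    `df` is assumed to be a pyspark.sql.DataFrame. Old names that are not
    in df.columns are skipped explicitly; withColumnRenamed itself is also
    a no-op for a column that does not exist.
    """
    for old, new in rename_map.items():
        if old in df.columns:
            df = df.withColumnRenamed(old, new)
    return df


# Usage sketch (hypothetical DataFrames that both have an "id" column):
# orders = rename_columns(orders, {"id": "order_id"})
# customers = rename_columns(customers, {"id": "customer_id"})
# After this, a join of the two no longer produces two columns named "id".
```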




ANSWER 3

Score 124


If you want to change all column names, try df.toDF(*cols), where cols is a list with one new name per existing column, in order.
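
Because toDF assigns names by position, a list of the wrong length is the usual mistake. A small guard can report it with a clearer message before the call reaches Spark. This is a minimal sketch: rename_all is a hypothetical helper, and df is assumed to be a pyspark.sql.DataFrame.

```python
def rename_all(df, new_names):
    """Rename every column positionally via toDF, after checking the count."""
    new_names = list(new_names)
    if len(new_names) != len(df.columns):
        raise ValueError(
            "expected {} names, got {}".format(len(df.columns), len(new_names))
        )
    # toDF returns a new DataFrame; the original df is left unchanged.
    return df.toDF(*new_names)


# Usage sketch, assuming a two-column DataFrame `data`:
# df = rename_all(data, ["name", "age"])
```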




ANSWER 4

Score 82


If you would like to apply a simple transformation to all column names, this code does the trick (here, replacing all spaces with underscores):

new_column_name_list = [name.replace(" ", "_") for name in df.columns]

df = df.toDF(*new_column_name_list)

Thanks to @user8117731 for the toDF trick.
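
The same pattern extends to a broader cleanup. The sketch below is an illustration rather than part of the answer: normalize_name and normalized_names are hypothetical helpers, the sample column names are made up, and the rule (lowercase, non-alphanumeric runs collapsed to a single underscore) is just one possible convention. The duplicate check matters because two different names can normalize to the same string.

```python
import re


def normalize_name(name):
    """Lowercase a column name and collapse non-alphanumeric runs to '_'."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return cleaned.strip("_").lower()


def normalized_names(columns):
    """Normalize every name and fail if two columns would collide."""
    names = [normalize_name(c) for c in columns]
    if len(set(names)) != len(names):
        raise ValueError("normalization produced duplicate column names")
    return names


print(normalized_names(["First Name", "Total ($)", "age"]))

# With a real DataFrame:
# df = df.toDF(*normalized_names(df.columns))
```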