How to change dataframe column names in PySpark?
Full question
https://stackoverflow.com/questions/3407...
--
Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...
--
Tags
#python #apachespark #pyspark #apachesparksql #rename
ACCEPTED ANSWER
Score 521
There are many ways to do that:
Option 1. Using selectExpr.
data = sqlContext.createDataFrame([("Alberto", 2), ("Dakota", 2)], ["Name", "askdaosdka"])
data.show()
data.printSchema()
# Output
#+-------+----------+
#|   Name|askdaosdka|
#+-------+----------+
#|Alberto|         2|
#| Dakota|         2|
#+-------+----------+
#root
# |-- Name: string (nullable = true)
# |-- askdaosdka: long (nullable = true)

df = data.selectExpr("Name as name", "askdaosdka as age")
df.show()
df.printSchema()
# Output
#+-------+---+
#|   name|age|
#+-------+---+
#|Alberto|  2|
#| Dakota|  2|
#+-------+---+
#root
# |-- name: string (nullable = true)
# |-- age: long (nullable = true)

Option 2. Using withColumnRenamed. Notice that this method allows you to "overwrite" the same column. The code below uses range; on Python 2, xrange can be used instead.
from functools import reduce

oldColumns = data.schema.names
newColumns = ["name", "age"]

# Rename the columns one by one, pairing old and new names by position.
df = reduce(lambda data, idx: data.withColumnRenamed(oldColumns[idx], newColumns[idx]), range(len(oldColumns)), data)
df.printSchema()
df.show()

Option 3. Using alias. In Scala you can also use as.
from pyspark.sql.functions import col

# Assigned to df (not data) so the original column names stay available for Option 4.
df = data.select(col("Name").alias("name"), col("askdaosdka").alias("age"))
df.show()
# Output
#+-------+---+
#|   name|age|
#+-------+---+
#|Alberto|  2|
#| Dakota|  2|
#+-------+---+

Option 4. Using sqlContext.sql, which lets you use SQL queries on DataFrames registered as tables.
sqlContext.registerDataFrameAsTable(data, "myTable")
df2 = sqlContext.sql("SELECT Name AS name, askdaosdka as age from myTable")
df2.show()
# Output
#+-------+---+
#|   name|age|
#+-------+---+
#|Alberto|  2|
#| Dakota|  2|
#+-------+---+
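As a rough sketch of how the selectExpr option could be driven by a mapping instead of hand-written strings, the helper below builds the "old AS new" expressions from a dict. The function name rename_expressions and the mapping argument are hypothetical, introduced only for illustration. It assumes column names contain no backticks; columns missing from the mapping keep their name.

```python
def rename_expressions(columns, mapping):
    """Build selectExpr strings that rename columns according to `mapping`.

    `columns` is the current list of column names, `mapping` is a dict of
    old name -> new name. Columns without an entry keep their name.
    Assumption: no column name contains a backtick character.
    """
    exprs = []
    for name in columns:
        new_name = mapping.get(name, name)
        # Backticks quote identifiers in Spark SQL, which keeps names
        # with spaces or other special characters valid.
        exprs.append("`{}` AS `{}`".format(name, new_name))
    return exprs


# Example usage, assuming `data` is the DataFrame from Option 1:
# df = data.selectExpr(*rename_expressions(data.columns,
#                                          {"Name": "name", "askdaosdka": "age"}))
```

Keeping the string building separate from the DataFrame call makes the renaming logic easy to check without a running Spark session.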
ANSWER 2
Score 310
df = df.withColumnRenamed("colName", "newColName")\
       .withColumnRenamed("colName2", "newColName2")
An advantage of this approach: when a DataFrame has a long list of columns and you want to change only a few names, you rename just those and leave the rest untouched. It is also very useful when joining tables with duplicate column names.
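When there are more than a couple of renames, the chained calls can be folded into a small loop. This is only a sketch: rename_columns and its mapping argument are hypothetical names, and df is assumed to be an existing PySpark DataFrame. Newer Spark releases (3.4 and later, to my knowledge) also offer DataFrame.withColumnsRenamed, which accepts such a dict directly.

```python
def rename_columns(df, mapping):
    """Rename only the columns listed in `mapping` (old name -> new name).

    `df` is expected to be a PySpark DataFrame; columns that are not
    mentioned in `mapping` are left untouched.
    """
    for old_name, new_name in mapping.items():
        # withColumnRenamed returns a new DataFrame and is a no-op when
        # `old_name` is not present in the schema.
        df = df.withColumnRenamed(old_name, new_name)
    return df


# Example usage, assuming `df` is an existing DataFrame:
# df = rename_columns(df, {"colName": "newColName", "colName2": "newColName2"})
```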
ANSWER 3
Score 124
If you want to change all column names, try df.toDF(*cols), where cols is the list of new names in column order.
ANSWER 4
Score 82
In case you would like to apply a simple transformation to all column names, this code does the trick (here I am replacing all spaces with underscores):
new_column_name_list = [name.replace(" ", "_") for name in df.columns]
df = df.toDF(*new_column_name_list)
Thanks to @user8117731 for the toDF trick.
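The same idea extends to a broader cleanup. The sketch below is one possible variant, not part of the original answer: normalize_column_names is a hypothetical helper that lowercases each name and collapses any run of non-alphanumeric characters into a single underscore. Because distinct inputs can collapse to the same result, it raises an error rather than hand duplicate names to toDF.

```python
import re


def normalize_column_names(columns):
    """Return lowercase names with non-alphanumeric runs replaced by "_"."""
    normalized = []
    for name in columns:
        # Replace each run of characters outside [0-9a-zA-Z] with one
        # underscore, then trim underscores left at either end.
        cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip()).strip("_")
        normalized.append(cleaned.lower())
    if len(set(normalized)) != len(normalized):
        raise ValueError("normalization produced duplicate column names")
    return normalized


# Example usage, assuming `df` is an existing DataFrame:
# df = df.toDF(*normalize_column_names(df.columns))
```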