Show distinct column values in pyspark dataframe
Rise to the top 3% as a developer or hire one of them at Toptal: https://topt.al/25cXVn
--------------------------------------------------
Music by Eric Matyas
https://www.soundimage.org
Track title: Cosmic Puzzle
--
Chapters
00:00 Show Distinct Column Values In Pyspark Dataframe
00:29 Answer 1 Score 409
00:51 Accepted Answer Score 128
01:48 Answer 3 Score 26
02:00 Answer 4 Score 23
02:26 Answer 5 Score 14
02:38 Thank you
--
Full question
https://stackoverflow.com/questions/3938...
--
Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...
--
Tags
#python #apachespark #pyspark #apachesparksql
#avk47
ANSWER 1
Score 411
This should help to get distinct values of a column:
df.select('column1').distinct().collect()
Note that .collect() has no built-in limit on how many values it returns, so this can be slow -- use .show() instead, or add .limit(20) before .collect(), to manage this.
ACCEPTED ANSWER
Score 128
Let's assume we're working with the following representation of data (two columns, k and v, where k contains three entries, two unique):
+---+---+
|  k|  v|
+---+---+
|foo|  1|
|bar|  2|
|foo|  3|
+---+---+
With a Pandas dataframe:
import pandas as pd
p_df = pd.DataFrame([("foo", 1), ("bar", 2), ("foo", 3)], columns=("k", "v"))
p_df['k'].unique()
This returns an ndarray, i.e. array(['foo', 'bar'], dtype=object)
You asked for a "pyspark dataframe alternative for pandas df['col'].unique()". Now, given the following Spark dataframe:
s_df = sqlContext.createDataFrame([("foo", 1), ("bar", 2), ("foo", 3)], ('k', 'v'))
If you want the same result from Spark, i.e. an ndarray, use toPandas():
s_df.toPandas()['k'].unique()
Alternatively, if you don't need an ndarray specifically and just want a list of the unique values of column k:
s_df.select('k').distinct().rdd.map(lambda r: r[0]).collect()
Finally, you can also use a list comprehension to extract the values from the returned Row objects:
[i.k for i in s_df.select('k').distinct().collect()]
ANSWER 3
Score 26
You can use df.dropDuplicates(['col1','col2']) to get only distinct rows based on the columns listed in the array.
ANSWER 5
Score 14
collect_set can help to get unique values from a given column of a pyspark.sql.DataFrame:
from pyspark.sql import functions as F
df.select(F.collect_set("column").alias("column")).first()["column"]