Analyze NYC Taxi data with a Spark pool
The following PySpark cells load and analyze the NYC Taxi Trip dataset:
%%pyspark
df = spark.read.load('abfss://users@vnycdatalake.dfs.core.windows.net/NYCTaxiTrip.parquet', format='parquet')
display(df.limit(10))
# Check the schema of the DataFrame
%%pyspark
df.printSchema()
# Load the NYC Taxi data into the Spark nyctaxi database
%%pyspark
spark.sql("CREATE DATABASE IF NOT EXISTS nyctaxi")
df.write.mode("overwrite").saveAsTable("nyctaxi.trip")
# Analyze the NYC Taxi data using Spark and notebooks
%%pyspark
df = spark.sql("SELECT * FROM nyctaxi.trip")
display(df)
# Analyze the trip distance statistics per passenger count
%%pyspark
df = spark.sql("""
SELECT passenger_count,
SUM(trip_distance) as SumTripDistance,
AVG(trip_distance) as AvgTripDistance
FROM nyctaxi.trip
WHERE trip_distance > 0 AND passenger_count > 0
GROUP BY passenger_count
ORDER BY passenger_count
""")
display(df)
# Use overwrite mode so the cell can be rerun without a "table already exists" error
df.write.mode("overwrite").saveAsTable("nyctaxi.passengercountstats")
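To make the aggregation concrete, here is a minimal plain-Python sketch of what the SQL above computes, using a small hypothetical sample of (passenger_count, trip_distance) rows; the sample values are illustrative only and are not part of the NYC Taxi dataset.

```python
from collections import defaultdict

# Hypothetical sample rows: (passenger_count, trip_distance)
trips = [(1, 2.5), (1, 3.0), (2, 1.2), (2, 0.0), (0, 4.0), (3, 6.1)]

# Mirror the WHERE clause: keep only positive distances and passenger counts
filtered = [(p, d) for p, d in trips if d > 0 and p > 0]

# GROUP BY passenger_count: collect each group's distances
groups = defaultdict(list)
for p, d in filtered:
    groups[p].append(d)

# Compute SUM and AVG of trip_distance per group, ordered by passenger_count
stats = {
    p: {"SumTripDistance": sum(ds), "AvgTripDistance": sum(ds) / len(ds)}
    for p, ds in sorted(groups.items())
}
```

Note how the zero-distance and zero-passenger rows are dropped before grouping, exactly as the `WHERE trip_distance > 0 AND passenger_count > 0` predicate does in the Spark SQL query.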