

3 Analyze NYC Taxi Data Using Spark Pool

The document provides a Spark code example for analyzing NYC Taxi data stored in a Parquet file. It demonstrates how to load the data into a Spark DataFrame, create a database, and perform SQL queries to analyze passenger counts and trip distances. The results are then saved into a new table for further analysis.

Uploaded by

kasaramvenky082

Analyze NYC Taxi data with a Spark pool

NYC Taxi Trip – Spark code:

%%pyspark

df = spark.read.load('abfss://users@vnycdatalake.dfs.core.windows.net/NYCTaxiTrip.parquet', format='parquet')

display(df.limit(10))
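
The path passed to spark.read.load above follows the Azure Data Lake Storage Gen2 URI layout, abfss://<container>@<account>.dfs.core.windows.net/<path>. As a minimal sketch of how the pieces fit together, the helper below rebuilds that URI from its parts. The function name build_abfss_uri is hypothetical and is not part of Spark or any Azure SDK; the container, account, and file names are the ones used in this document.

```python
def build_abfss_uri(container: str, account: str, path: str) -> str:
    """Compose an abfss:// URI for Azure Data Lake Storage Gen2.

    Hypothetical helper for illustration only; it just formats a string.
    """
    # Drop any leading slash so the URI has exactly one separator before the path.
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


# Container, storage account, and file name taken from the load call in this document.
uri = build_abfss_uri("users", "vnycdatalake", "NYCTaxiTrip.parquet")
print(uri)
```

The resulting string can be passed to spark.read.load in place of the literal path, which keeps the container and account names in one place if several files are read.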

# Check the schema of the DataFrame

%%pyspark

df.printSchema()

# Load the NYC Taxi data into the Spark nyctaxi database

%%pyspark

spark.sql("CREATE DATABASE IF NOT EXISTS nyctaxi")

df.write.mode("overwrite").saveAsTable("nyctaxi.trip")

# Analyze the NYC Taxi data using Spark and notebooks

%%pyspark

df = spark.sql("SELECT * FROM nyctaxi.trip")

display(df)

# Analyze trip distance statistics by passenger count

%%pyspark

df = spark.sql("""
    SELECT passenger_count,
           SUM(trip_distance) AS SumTripDistance,
           AVG(trip_distance) AS AvgTripDistance
    FROM nyctaxi.trip
    WHERE trip_distance > 0 AND passenger_count > 0
    GROUP BY passenger_count
    ORDER BY passenger_count
""")
display(df)

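# Save the aggregated results as a new table. The default save mode raises an
# error if nyctaxi.passengercountstats already exists, so rerunning this cell fails.
df.write.saveAsTable("nyctaxi.passengercountstats")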
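
To make the aggregation above concrete, here is a minimal plain-Python sketch of what the query computes: rows with a non-positive trip_distance or passenger_count are dropped, the rest are grouped by passenger_count, and the sum and average distance are reported per group in ascending order. The sample rows and the function name passenger_count_stats are made up for illustration and do not come from the NYC Taxi dataset; this is not how Spark executes the query.

```python
from collections import defaultdict

# Hypothetical sample rows: (passenger_count, trip_distance)
sample_trips = [(1, 2.0), (1, 4.0), (2, 3.0), (0, 5.0), (2, 0.0), (3, 1.5)]


def passenger_count_stats(trips):
    """Return (passenger_count, sum_distance, avg_distance) tuples, sorted by passenger_count."""
    totals = defaultdict(lambda: [0.0, 0])  # passenger_count -> [distance sum, row count]
    for passenger_count, trip_distance in trips:
        # Mirrors: WHERE trip_distance > 0 AND passenger_count > 0
        if trip_distance > 0 and passenger_count > 0:
            totals[passenger_count][0] += trip_distance
            totals[passenger_count][1] += 1
    # Mirrors: GROUP BY passenger_count ORDER BY passenger_count
    return [(pc, total, total / count) for pc, (total, count) in sorted(totals.items())]


stats = passenger_count_stats(sample_trips)
for row in stats:
    print(row)
```

In this sample, the row with zero passengers and the row with zero distance are both excluded before grouping, which is why each contributes nothing to any group's sum or average.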
