Spark Window Functions

Window functions in Spark

Rank vs Dense Rank

Rank leaves gaps in the numbering after ties, whereas dense rank does not leave any gaps between the ranks.
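In PySpark these are rank() and dense_rank() from pyspark.sql.functions, applied with .over(...) on an ordered window. To make the gap behaviour concrete, here is a minimal plain-Python sketch of the two numbering schemes. It is a model of the semantics, not Spark code, and the sales values are made up for illustration.

```python
def rank_and_dense_rank(values):
    """Return (value, rank, dense_rank) tuples for values ordered descending."""
    ordered = sorted(values, reverse=True)
    result = []
    rank = dense = 0
    previous = object()  # sentinel that never equals a real value
    for position, value in enumerate(ordered, start=1):
        if value != previous:
            rank = position  # rank jumps to the row position, so ties leave gaps
            dense += 1       # dense rank advances by one per distinct value
            previous = value
        result.append((value, rank, dense))
    return result


# Hypothetical sales figures with one tie at the top.
rows = rank_and_dense_rank([300, 200, 300, 100])
for row in rows:
    print(row)
```

The two tied rows share rank 1; the next row gets rank 3 but dense rank 2.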

Lead and Lag
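lag looks at a row a fixed number of positions before the current row in the ordered window, and lead looks the same distance after it; rows that fall outside the window get a default (null in Spark). A small plain-Python sketch of that behaviour follows. The helper names and sample values are hypothetical, and this only models what lag() and lead() from pyspark.sql.functions do within a single partition.

```python
def _shift(values, delta, default):
    """Pick the value `delta` positions away from each row, or `default` if out of range."""
    shifted = []
    for i in range(len(values)):
        j = i + delta
        shifted.append(values[j] if 0 <= j < len(values) else default)
    return shifted


def lag(values, offset=1, default=None):
    # Value from `offset` rows before the current row.
    return _shift(values, -offset, default)


def lead(values, offset=1, default=None):
    # Value from `offset` rows after the current row.
    return _shift(values, offset, default)


# Hypothetical sales, already ordered within one partition.
sales = [100, 150, 120]
previous_sales = lag(sales)
next_sales = lead(sales)
print(previous_sales)
print(next_sales)
```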

Range and Row Between

Q1

Let's try to achieve this using the first and last functions.

Data:

This solution is wrong: ideally we should get 111000 in every row of the latest_sales column.

Let's look at the explain plan.

We can see that the window frame here runs from unbounded preceding to the current row, so last only sees the rows up to the current one instead of the whole window.

What do these terms mean?

Unbounded preceding: the frame starts at the first row of the window, so an operation evaluated at the current row covers every row before it in the window.
Current row: the row we are currently standing at.
Unbounded following: the opposite of unbounded preceding; the frame extends to the last row of the window.
rowsBetween(start, end): offsets are relative to the current row, which is 0; rows before it are negative numbers and rows after it are positive numbers.
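The offsets described above can be sketched in plain Python. This is only a model of a row-based frame within one partition: the frame helper and the two constants are hypothetical stand-ins for Window.unboundedPreceding and Window.unboundedFollowing in PySpark, and the values are made up.

```python
import math

# Stand-ins for the unbounded frame boundaries (assumed names, not the Spark constants).
UNBOUNDED_PRECEDING = -math.inf
UNBOUNDED_FOLLOWING = math.inf


def frame(values, current, start, end):
    """Rows visible from position `current` for offsets start..end (inclusive)."""
    last_index = len(values) - 1
    lo = 0 if start == UNBOUNDED_PRECEDING else max(0, current + start)
    hi = last_index if end == UNBOUNDED_FOLLOWING else min(last_index, current + end)
    if hi < lo:
        return []  # the frame lies entirely outside the partition
    return values[lo:hi + 1]


values = [10, 20, 30, 40, 50]
current = 2  # standing at the row holding 30

print(frame(values, current, -1, 1))                   # one row before to one row after
print(frame(values, current, UNBOUNDED_PRECEDING, 0))  # first row up to current row
print(frame(values, current, 0, UNBOUNDED_FOLLOWING))  # current row to last row
```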

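If we don't specify a frame, the default depends on the ordering: with an orderBy the frame runs from unbounded preceding (the first row of the window) to the current row, and without an orderBy it covers the whole window, from unbounded preceding to unbounded following (the last row of the window).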
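The wrong latest_sales result earlier is what a frame ending at the current row produces. Here is a plain-Python sketch of the two cases, assuming a single partition with no ties in the ordering column; only the 111000 comes from the notes above, the other values are invented. In PySpark, a likely fix is to add .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing) to the window spec, though the original query is not shown here.

```python
def last_over_growing_frame(values):
    # Frame = unbounded preceding .. current row:
    # the last visible row is always the current row itself.
    return [values[: i + 1][-1] for i in range(len(values))]


def last_over_full_frame(values):
    # Frame = unbounded preceding .. unbounded following:
    # every row sees the whole partition, so last is the same everywhere.
    return [values[-1] for _ in values]


# Hypothetical sales ordered by date; 111000 is the latest value.
sales = [100000, 105000, 111000]

wrong_latest = last_over_growing_frame(sales)
right_latest = last_over_full_frame(sales)
print(wrong_latest)
print(right_latest)
```

With the growing frame each row just gets its own value back, which is why latest_sales did not show 111000 in every row.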

Converting from string to unixtime when we have two fields date and time.

from pyspark.sql.functions import expr, from_unixtime, unix_timestamp
emp_df = emp_df.withColumn("timestamp",from_unixtime(unix_timestamp(expr("CONCAT(date,' ',time)"),"dd-MM-yyyy HH:mm")))

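Note that from_unixtime returns a string (in yyyy-MM-dd HH:mm:ss format by default), so the timestamp column is a string, not a timestamp type.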
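To see what the conversion does to a single value, here is a rough plain-Python equivalent using datetime. The combine_date_time helper and the sample date and time are hypothetical, and time zone handling (which Spark takes from the session time zone) is ignored.

```python
from datetime import datetime


def combine_date_time(date_str, time_str):
    """Join 'dd-MM-yyyy' and 'HH:mm' strings and reformat as 'yyyy-MM-dd HH:mm:ss'."""
    parsed = datetime.strptime(f"{date_str} {time_str}", "%d-%m-%Y %H:%M")
    # Like the Spark expression, the result is a formatted string.
    return parsed.strftime("%Y-%m-%d %H:%M:%S")


formatted = combine_date_time("25-12-2023", "14:30")
print(formatted)
```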