# COPY INTO in Databricks
The COPY INTO command loads files from a volume into a Databricks table and is idempotent: files that were already loaded are skipped on reruns, so data does not get duplicated in the table.
## Steps
- Create Volume
- Create folder inside volume
- Copy sample dataset into volume
```python
dbutils.fs.cp("/databricks-datasets/definitive-guide/data/retail-data/by-day/2010-12-01.csv", "/Volumes/dev/bronze/landing/input")
dbutils.fs.cp("/databricks-datasets/definitive-guide/data/retail-data/by-day/2010-12-02.csv", "/Volumes/dev/bronze/landing/input")
```
- Create bronze table
```sql
COPY INTO dev.bronze.invoice_cp
FROM '/Volumes/dev/bronze/landing/input'
FILEFORMAT = CSV
PATTERN = '*.csv'
FORMAT_OPTIONS ( -- file-level options: merge schemas across files, e.g. one file has 3 columns and another has 5
  'mergeSchema' = 'true',
  'header' = 'true'
)
COPY_OPTIONS ( -- table-level options: allow the table schema to evolve on merge
  'mergeSchema' = 'true'
)
```
- Select from the table
We can see 5217 rows.
- Run COPY INTO again
Zero rows are affected: the files were already ingested, so COPY INTO does not duplicate data. It is idempotent.
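Conceptually, COPY INTO behaves like a loader that remembers which files it has already seen and skips them on the next run. A minimal plain-Python sketch of that skip-already-loaded behavior (all names and the `loaded` set are illustrative, not Databricks APIs):

```python
def copy_into(table, files, loaded):
    """Load only files not recorded in `loaded`; return the number of new files."""
    new_files = [f for f in files if f not in loaded]
    for f in new_files:
        table.append(f"rows from {f}")  # stand-in for actual row ingestion
        loaded.add(f)                   # remember this file for future runs
    return len(new_files)

table, loaded = [], set()
files = ["2010-12-01.csv", "2010-12-02.csv"]
print(copy_into(table, files, loaded))  # 2  (first run ingests both files)
print(copy_into(table, files, loaded))  # 0  (rerun is a no-op: idempotent)
```

In the real command, the "memory" is not an in-process set but the table's Delta transaction log, so idempotency survives across sessions and clusters.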
## How does COPY INTO maintain the log of ingested files?
The table's Delta transaction log keeps versioned JSON commit files that record, among other things, the paths of the files already processed; COPY INTO consults this log to skip files it has seen before.
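Each commit file in `_delta_log` is JSON-lines, one action per line; ingested files show up as `add` actions. A minimal sketch of pulling the file paths out of one commit (the entries below are simplified, made-up examples, not real log contents):

```python
import json

# A Delta commit file (e.g. 00000000000000000001.json) is JSON-lines:
# each line is one action. These entries are simplified for illustration.
commit_lines = [
    json.dumps({"commitInfo": {"operation": "COPY INTO"}}),
    json.dumps({"add": {"path": "2010-12-01.csv", "dataChange": True}}),
    json.dumps({"add": {"path": "2010-12-02.csv", "dataChange": True}}),
]

def ingested_files(lines):
    """Collect the file paths recorded as 'add' actions in one commit."""
    actions = [json.loads(line) for line in lines]
    return [a["add"]["path"] for a in actions if "add" in a]

print(ingested_files(commit_lines))  # ['2010-12-01.csv', '2010-12-02.csv']
```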
## Custom transformations while loading
```sql
COPY INTO dev.bronze.invoice_cp_alt
FROM
(
  SELECT InvoiceNo, StockCode, cast(Quantity as DOUBLE) as Quantity, current_timestamp() as _insert_date
  FROM '/Volumes/dev/bronze/landing/input'
)
FILEFORMAT = CSV
PATTERN = '*.csv'
FORMAT_OPTIONS ( -- file-level options: merge schemas across files
  'mergeSchema' = 'true',
  'header' = 'true'
)
COPY_OPTIONS ( -- table-level options: allow the table schema to evolve on merge
  'mergeSchema' = 'true'
)
```
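The SELECT in the FROM clause rewrites each source row before it lands in the table: it keeps two columns, casts Quantity to a double, and stamps the load time. A plain-Python sketch of that same per-row transformation (the rows here are made up; column names come from the retail dataset):

```python
from datetime import datetime, timezone

# Simplified rows as they come out of the CSV files (everything is a string).
rows = [
    {"InvoiceNo": "536365", "StockCode": "85123A", "Quantity": "6"},
    {"InvoiceNo": "536366", "StockCode": "71053", "Quantity": "8"},
]

def transform(row):
    """Mirror the SELECT: keep two columns, cast Quantity, stamp load time."""
    return {
        "InvoiceNo": row["InvoiceNo"],
        "StockCode": row["StockCode"],
        "Quantity": float(row["Quantity"]),          # cast(Quantity as DOUBLE)
        "_insert_date": datetime.now(timezone.utc),  # current_timestamp()
    }

transformed = [transform(r) for r in rows]
print(transformed[0]["Quantity"])  # 6.0
```

Note that the cast needs an alias (`as Quantity`) in the real SQL, since COPY INTO matches columns to the target table by name.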