# COPY INTO in Databricks
The COPY INTO command loads files from a volume into a Databricks table and is idempotent: files that were already loaded are skipped on reruns, so data does not get duplicated in the table.
## Steps
- Create Volume
- Create folder inside volume
- Copy sample dataset into volume
```python
dbutils.fs.cp("/databricks-datasets/definitive-guide/data/retail-data/by-day/2010-12-01.csv", "/Volumes/dev/bronze/landing/input")
dbutils.fs.cp("/databricks-datasets/definitive-guide/data/retail-data/by-day/2010-12-02.csv", "/Volumes/dev/bronze/landing/input")
```
- Create bronze table
```sql
COPY INTO dev.bronze.invoice_cp
FROM '/Volumes/dev/bronze/landing/input'
FILEFORMAT = CSV
PATTERN = '*.csv'
FORMAT_OPTIONS ( -- file-level options: merge schemas across files, e.g. one file has 3 columns and another has 5
  'mergeSchema' = 'true',
  'header' = 'true'
)
COPY_OPTIONS ( -- table-level options: allow the table schema to evolve on merge
  'mergeSchema' = 'true'
)
```
- Select from the table
We can see 5217 rows.
- Run COPY INTO again
Zero rows are affected: the files were already ingested, so COPY INTO does not duplicate data. It is idempotent.
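Conceptually, COPY INTO behaves like a loader that remembers which files it has already seen and skips them on the next run. A minimal plain-Python sketch of that skip-already-loaded behavior (all names and the `loaded` set are illustrative, not Databricks APIs):

```python
def copy_into(table, files, loaded):
    """Load only files not recorded in `loaded`; return the number of new files."""
    new_files = [f for f in files if f not in loaded]
    for f in new_files:
        table.append(f"rows from {f}")  # stand-in for actual row ingestion
        loaded.add(f)                   # remember this file for future runs
    return len(new_files)

table, loaded = [], set()
files = ["2010-12-01.csv", "2010-12-02.csv"]
print(copy_into(table, files, loaded))  # 2  (first run ingests both files)
print(copy_into(table, files, loaded))  # 0  (rerun is a no-op: idempotent)
```

In the real command, the "memory" is not an in-process set but the table's Delta transaction log, so idempotency survives across sessions and clusters.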
## How does COPY INTO maintain the log of ingested files?
The table's Delta transaction log keeps versioned JSON commit files that record, among other things, the paths of the files already processed; COPY INTO consults this log to skip files it has seen before.
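Each commit file in `_delta_log` is JSON-lines, one action per line; ingested files show up as `add` actions. A minimal sketch of pulling the file paths out of one commit (the entries below are simplified, made-up examples, not real log contents):

```python
import json

# A Delta commit file (e.g. 00000000000000000001.json) is JSON-lines:
# each line is one action. These entries are simplified for illustration.
commit_lines = [
    json.dumps({"commitInfo": {"operation": "COPY INTO"}}),
    json.dumps({"add": {"path": "2010-12-01.csv", "dataChange": True}}),
    json.dumps({"add": {"path": "2010-12-02.csv", "dataChange": True}}),
]

def ingested_files(lines):
    """Collect the file paths recorded as 'add' actions in one commit."""
    actions = [json.loads(line) for line in lines]
    return [a["add"]["path"] for a in actions if "add" in a]

print(ingested_files(commit_lines))  # ['2010-12-01.csv', '2010-12-02.csv']
```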
## Custom transformations while loading
```sql
COPY INTO dev.bronze.invoice_cp_alt
FROM
(
  SELECT InvoiceNo, StockCode, cast(Quantity as DOUBLE) as Quantity, current_timestamp() as _insert_date
  FROM '/Volumes/dev/bronze/landing/input'
)
FILEFORMAT = CSV
PATTERN = '*.csv'
FORMAT_OPTIONS ( -- file-level options: merge schemas across files
  'mergeSchema' = 'true',
  'header' = 'true'
)
COPY_OPTIONS ( -- table-level options: allow the table schema to evolve on merge
  'mergeSchema' = 'true'
)
```
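The SELECT in the FROM clause rewrites each source row before it lands in the table: it keeps two columns, casts Quantity to a double, and stamps the load time. A plain-Python sketch of that same per-row transformation (the rows here are made up; column names come from the retail dataset):

```python
from datetime import datetime, timezone

# Simplified rows as they come out of the CSV files (everything is a string).
rows = [
    {"InvoiceNo": "536365", "StockCode": "85123A", "Quantity": "6"},
    {"InvoiceNo": "536366", "StockCode": "71053", "Quantity": "8"},
]

def transform(row):
    """Mirror the SELECT: keep two columns, cast Quantity, stamp load time."""
    return {
        "InvoiceNo": row["InvoiceNo"],
        "StockCode": row["StockCode"],
        "Quantity": float(row["Quantity"]),          # cast(Quantity as DOUBLE)
        "_insert_date": datetime.now(timezone.utc),  # current_timestamp()
    }

transformed = [transform(r) for r in rows]
print(transformed[0]["Quantity"])  # 6.0
```

Note that the cast needs an alias (`as Quantity`) in the real SQL, since COPY INTO matches columns to the target table by name.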