This topic provides best practices, general guidelines, and important considerations for loading staged data.
Options for selecting staged data files¶
The COPY command supports several options for loading data files from a stage:
By path (internal stages) / prefix (Amazon S3 bucket). See Organizing Data by Path for information.
Specifying a list of specific files to load.
Using pattern matching to identify specific files by pattern.
These options enable you to copy a fraction of the staged data into Snowflake with a single command. This allows you to execute concurrent COPY statements that match a subset of files, taking advantage of parallel operations.
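As an illustrative sketch of the first option, the following loads only the files under a given path prefix (the table, stage, and path names here are hypothetical):

-- Load only the files staged under the 2022/07/ prefix of a (hypothetical) named stage
COPY INTO sales FROM @my_stage/sales/2022/07/;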
Lists of files¶
The COPY INTO <table> command includes a FILES parameter for loading files by specific name.
Note
Of the three options for identifying/specifying the data files to load from a stage, providing a discrete list of files is generally the fastest; however, the FILES parameter supports a maximum of 1,000 files, meaning a COPY command executed with the FILES parameter can load up to 1,000 files.
For example:
COPY INTO load1 FROM @%load1/data1/ FILES=('test1.csv', 'test2.csv', 'test3.csv');
File lists can be combined with paths for more control over data loading.
Pattern matching¶
The COPY INTO <table> command includes a PATTERN parameter for loading files using a regular expression.
For example:
COPY INTO people_data FROM @%people_data/data1/ PATTERN='.*persondata[^0-9{1,3}$$].csv';
Regular expression pattern matching is generally the slowest of the three options for identifying/specifying data files to load from a stage. However, this option works well if you exported your files in named order from your external application and want to batch load the files in the same order.
Pattern matching can be combined with paths for more control over data loading.
Note
The regular expression is applied differently when loading bulk data than when loading Snowpipe data.
Snowpipe trims any path segments in the stage definition from the storage location and applies the regular expression to the remaining path segments and filenames. To view the stage definition, execute the DESCRIBE STAGE command for the stage. The URL property consists of the bucket or container name and zero or more path segments. For example, if the FROM location in a COPY INTO <table> statement is @s/path1/path2/ and the URL value for stage @s is s3://mybucket/path1/, then Snowpipe trims /path1/ from the storage location in the FROM clause and applies the regular expression to path2/ plus the filenames in the path. Bulk data loads apply the regular expression to the entire storage location in the FROM clause.
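As a rough sketch of this trimming behavior in a Snowpipe definition (the table, stage, bucket, and integration names are hypothetical, and credentials/notification setup are omitted):

-- Stage whose URL already ends in /path1/
CREATE STAGE s
  URL = 's3://mybucket/path1/'
  STORAGE_INTEGRATION = my_s3_integration;  -- hypothetical integration

-- Snowpipe trims /path1/ and applies the pattern to path2/ plus the filenames
CREATE PIPE mypipe AUTO_INGEST = TRUE AS
  COPY INTO mytable
  FROM @s/path1/path2/
  PATTERN = '.*[.]csv';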
Snowflake recommends enabling cloud event filtering for Snowpipe to reduce cost, event noise, and latency. Use the PATTERN option only if your cloud provider's event filtering capabilities are not sufficient. For more information on configuring event filtering for each cloud provider, see the following pages:
Configuring Event Notifications Using Object Key Name Filtering - Amazon S3
Understanding event filtering for Event Grid subscriptions - Azure
Executing parallel COPY statements that reference the same data files¶
When a COPY statement is executed, Snowflake sets a load status in the table metadata for the data files referenced in the statement. This prevents parallel COPY statements from loading the same files into the table, avoiding data duplication.
When the COPY statement finishes processing, Snowflake updates the load status of the data files accordingly. If one or more data files fail to load, Snowflake sets the load status of those files to load failed. These files are available for loading by a subsequent COPY statement.
Loading older files¶
This section describes how the COPY INTO <table> command prevents data duplication differently depending on whether the load status of a file is known or unknown. If you partition your data by date using logical, granular paths (as described in Organizing Data by Path) and load the data within a short period of time after staging it, this section largely does not apply to you. However, if the COPY command skips older files (i.e. historical data files) when loading data, this section describes how to bypass the default behavior.
Load metadata¶
Snowflake maintains detailed metadata for each table into which data is loaded, including:
Name of each file from which the data was loaded
File size
ETag for the file
Number of rows parsed in the file
Timestamp of the last load for the file
Information about any errors encountered in the file during loading
This load metadata expires after 64 days. If the LAST_MODIFIED date for a staged data file is less than or equal to 64 days, the COPY command can determine its load status for a given table and prevent reloading (and data duplication). The LAST_MODIFIED date is the timestamp of when the file was initially staged or when it was last modified, whichever is later.
If the LAST_MODIFIED date is older than 64 days, the load status is still known if either of the following events occurred less than 64 days prior to the current date:
The file was loaded successfully.
The initial set of data for the table (i.e. the first batch after the table was created) was loaded.
However, the COPY command cannot definitively determine whether a file has already been loaded if the LAST_MODIFIED date is older than 64 days and the initial set of data was loaded into the table more than 64 days earlier (and, if the file was loaded into the table, that load also occurred more than 64 days earlier). In this case, the command skips the file by default to prevent an accidental reload.
Workarounds¶
To load files whose metadata has expired, set the LOAD_UNCERTAIN_FILES copy option to true. This copy option references load metadata, if available, to avoid data duplication, but also attempts to load files with expired load metadata.
Alternatively, set the FORCE option to load all files, ignoring load metadata if it exists. Note that this option reloads files and can duplicate data in a table.
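A minimal sketch of both workarounds, assuming a hypothetical table mytable loaded from its table stage:

-- Attempt to load files with expired load metadata, still skipping files whose
-- metadata shows they were already loaded
COPY INTO mytable
  FROM @%mytable/path1/
  FILE_FORMAT = (TYPE = CSV)
  LOAD_UNCERTAIN_FILES = TRUE;

-- Or reload everything regardless of load metadata (may duplicate rows)
COPY INTO mytable
  FROM @%mytable/path1/
  FILE_FORMAT = (TYPE = CSV)
  FORCE = TRUE;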
Examples¶
In this example:
A table is created on January 1, and the initial table load occurs on the same day.
64 days pass. On March 7, the load metadata expires.
A file is staged and loaded into the table on July 27 and 28, respectively. Because the file was staged one day prior to being loaded, the LAST_MODIFIED date was within 64 days. The load status was known. There are no data or formatting issues with the file, and the COPY command loads it successfully.
64 days pass. On September 28, the LAST_MODIFIED date of the staged file exceeds 64 days. On September 29, the load metadata for the successful file load expires.
An attempt is made to reload the file into the same table on November 1. Because the COPY command cannot determine whether the file has already been loaded, the file is skipped. The LOAD_UNCERTAIN_FILES copy option (or the FORCE copy option) is required to load the file.
In this example:
A file is staged on January 1.
64 days pass. On March 7, the LAST_MODIFIED date of the staged file exceeds 64 days.
A new table is created on September 29, and the staged file is loaded into the table. Because the initial table load occurred less than 64 days prior, the COPY command can determine that the file had not been loaded yet. There are no data or formatting issues with the file, and the COPY command loads it successfully.
JSON data: Stripping "null" values¶
In a VARIANT column, NULL values are stored as a string containing the word "null", not the SQL NULL value. If the "null" values in your JSON documents indicate missing values and have no other special meaning, we recommend setting the STRIP_NULL_VALUES file format option to TRUE for the COPY INTO <table> command when loading the JSON files. Retaining the "null" values often wastes storage and slows query processing.
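For example, a hedged sketch of a JSON load with this option enabled (the table and stage names are hypothetical):

-- "null" values in the JSON documents are omitted from the stored VARIANT values
COPY INTO sensor_events
  FROM @my_json_stage/events/
  FILE_FORMAT = (TYPE = JSON STRIP_NULL_VALUES = TRUE);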
CSV data: Trimming leading spaces¶
If your external software exports quoted fields but inserts a leading space before the opening quote of each field, Snowflake reads the leading space instead of the leading quote as the beginning of the field. The quotes are interpreted as string data.
Use the TRIM_SPACE file format option to remove undesirable spaces during the data load.
For example, each of the following fields in a sample CSV file contains a leading space:
"value1", "Wert2", "valor3"
The following COPY command trims the leading space and removes the quotation marks enclosing each field:
COPY INTO mytable
FROM @%mytable
FILE_FORMAT = (TYPE = CSV TRIM_SPACE=TRUE FIELD_OPTIONALLY_ENCLOSED_BY = '0x22');

SELECT * FROM mytable;

+--------+--------+--------+
| col1   | col2   | col3   |
+--------+--------+--------+
| value1 | value2 | value3 |
+--------+--------+--------+
FAQs¶
What is the recommended method for loading data into Snowflake?
Bulk Loading Using the COPY Command
This option enables loading batches of data from files already available in cloud storage, or copying (i.e. staging) data files from a local machine to an internal (i.e. Snowflake) cloud storage location before loading the data into tables using the COPY command.
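As an illustrative sketch of that second path (staging local files, then copying; the file path and table name are hypothetical, and the PUT step runs from a client such as SnowSQL):

-- Upload local CSV files to the table's internal stage
PUT file:///tmp/data/mydata*.csv @%mytable;

-- Load the staged files into the table
COPY INTO mytable
  FROM @%mytable
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);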
When loading data in Snowflake, which two are true?
- You can query a portion of data in external cloud storage without loading it into Snowflake.
- The Snowflake Data Load wizard has a limitation on file size.
What is the difference between Snowpipe and bulk loading?
Bulk data load: The load history is stored in the target table's metadata for 64 days. Snowpipe: The pipe's metadata stores the load history for 14 days. The history can be requested from a REST endpoint, an SQL table function, or the ACCOUNT_USAGE view.
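For instance, a sketch of checking recent load history for a table through the Information Schema table function (the table name is hypothetical):

-- Copy/load activity for the table over the last 24 hours
SELECT *
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
  TABLE_NAME => 'MYTABLE',
  START_TIME => DATEADD(hours, -24, CURRENT_TIMESTAMP())));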
How is ETL done in Snowflake?
Snowflake ETL means applying the process of ETL to load data into the Snowflake Data Warehouse. This comprises the extraction of relevant data from data sources, making the necessary transformations to make the data analysis-ready, and then loading it into Snowflake.
What types of data files can be loaded in Snowflake?
Currently supported semi-structured data formats include JSON, Avro, ORC, Parquet, and XML. For JSON, Avro, ORC, and Parquet data, each top-level, complete object is loaded as a separate row in the table. Each object can contain newline characters and spaces as long as the object is valid.
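A minimal sketch of loading one of these semi-structured formats into a VARIANT column, one object per row (the table and stage names are hypothetical):

CREATE TABLE raw_parquet (v VARIANT);

-- Each top-level object in the staged Parquet files becomes one row
COPY INTO raw_parquet
  FROM @mystage/parquet/
  FILE_FORMAT = (TYPE = PARQUET);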
Which statement is used to load data from a file?
The LOAD DATA statement reads rows from a text file into a table at a very high speed. The file can be read from the server host or the client host, depending on whether the LOCAL modifier is given.
What command is used to load or unload data in Snowflake?
Bulk Unloading Process
From a Snowflake stage, use the GET command to download the data file(s). From S3, use the interfaces/tools provided by Amazon S3 to get the data file(s). From Azure, use the interfaces/tools provided by Microsoft Azure to get the data file(s).
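A rough sketch of the stage-based path, assuming a hypothetical table mytable unloaded to its table stage and a client such as SnowSQL for the download:

-- Unload query results to the table's internal stage, then download them locally
COPY INTO @%mytable/unload/ FROM mytable FILE_FORMAT = (TYPE = CSV);
GET @%mytable/unload/ file:///tmp/unload/;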
Loading data into Snowflake is fast and flexible. You get the greatest speed when working with CSV files, but Snowflake's expressiveness in handling semi-structured data allows even complex partitioning schemes for existing ORC and Parquet data sets to be easily ingested into fully structured Snowflake tables.
Should I use a Snowflake internal or external stage to load data?
For data we don't intend to keep as flat files, we use internal stages, which are then loaded and deleted. For data where we need more auditing, we use external stages.
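By way of illustration, a sketch of creating each kind of stage (the stage, bucket, and integration names are hypothetical):

-- Named internal stage: files are stored and billed inside Snowflake
CREATE STAGE my_internal_stage;

-- External stage over existing cloud storage, typically kept for auditing
CREATE STAGE my_external_stage
  URL = 's3://mybucket/load/'
  STORAGE_INTEGRATION = my_s3_integration;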
How do I load bulk data into Snowflake?
The typical sequence of steps is as follows (see the sketch after this list):
- Create File Format Objects.
- Create Stage Objects.
- Stage the Data Files.
- Copy Data into the Target Tables.
- Resolve Data Load Errors.
- Remove the Successfully Copied Data Files.
- Clean Up.
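A condensed sketch of those steps, assuming hypothetical object names and local file paths (the PUT step runs from a client such as SnowSQL):

-- 1. Create file format and stage objects
CREATE FILE FORMAT my_csv_format TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"' SKIP_HEADER = 1;
CREATE STAGE my_csv_stage FILE_FORMAT = my_csv_format;

-- 2. Stage the local data files
PUT file:///tmp/load/contacts*.csv @my_csv_stage;

-- 3. Copy data into the target table, continuing past any bad rows
COPY INTO mycontacts FROM @my_csv_stage ON_ERROR = 'CONTINUE';

-- 4. Inspect rows rejected by the last load, if any
SELECT * FROM TABLE(VALIDATE(mycontacts, JOB_ID => '_last'));

-- 5. Remove the successfully copied data files and clean up
REMOVE @my_csv_stage PATTERN = '.*contacts.*';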