Using and Abusing Auto Loader's Inferred Schema
Problem Databricks' Auto Loader has the ability to infer a schema from a sample of files. This means that you don’t have to provide a schema, which is really handy when you’re dealing with an unknown schema or a wide and complex schema, which you don’t always want to define up-front. But what happens if the schema that has been inferred isn’t the schema you were expecting or it contains fields which you definitely don’t want to ingest - like PCI or PII data fields?
Using Auto Loader on Azure Databricks with AWS S3
Problem Recently on a client project, we wanted to use the Auto Loader functionality in Databricks to easily consume from AWS S3 into our Azure hosted data platform. The reason why we opted for Auto Loader over any other solution is because it natively exists within Databricks and allows us to quickly ingest data from Azure Storage Accounts and AWS S3 Buckets, while using the benefits of Structured Streaming to checkpoint which files it last loaded.
Databricks Labs: Data Generator
Databricks recently released the public preview of a Data Generator for use within Databricks to generate synthetic data. This is particularly exciting as the Information Security manager at a client recently requested synthetic data to be generated for use in all non-production environments as a feature of a platform I’ve been designing for them. The Product Owner decided at the time that it was too costly to implement any time soon, but this release from Databricks makes the requirement for synthetic data much easier and quicker to realise and deliver.