In my Data Fusion pipeline, I start by reading a GCS file as a blob. Since the file is encoded in UTF-16, I set the charset to UTF-16 during the transformation step. Wrangler does not indicate any issues.
For testing purposes, I created a file with various formatting errors and ran my pipeline against it. Instead of processing each erroneous record separately as expected, the entire input is output as a single row. No correct rows are being written (in my case, to BigQuery).
When I manually changed the file’s charset to UTF-8 in Sublime Text and read the GCS file as text, the pipeline behaved as expected, correctly treating each row as a separate record in the error output and also writing the correct rows to their destination.
I even tried adding another transformation step that sets the charset and outputs bytes again, but that always results in the same thing: no correct records in the output, and all errors treated as one.
Does this ring a bell with anyone?
One of the best approaches is to pre-process the data before it reaches Data Fusion. This gives you better control over data quality and avoids the complexity of custom parsing (such as charset conversion) inside Data Fusion. By converting the file to UTF-8 before uploading it to GCS, you eliminate the UTF-16 decoding issue within Data Fusion altogether.
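As a rough illustration, the pre-processing step could be as small as the following Node.js sketch. The file names are placeholders, and it assumes the source is UTF-16LE (the common Windows default); use the 'utf-16be' label for big-endian files.

const fs = require('fs');

function convertToUtf8(inputPath, outputPath) {
  // Read the raw bytes of the UTF-16 source file.
  const raw = fs.readFileSync(inputPath);
  // Decode assuming little-endian UTF-16; a matching leading BOM is stripped.
  const text = new TextDecoder('utf-16le').decode(raw);
  // Write the same content back out as UTF-8, ready to upload to GCS.
  fs.writeFileSync(outputPath, text, { encoding: 'utf8' });
}

convertToUtf8('input-utf16.csv', 'input-utf8.csv');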
You can also try to create a custom parser within Data Fusion using a Wrangler transformation with a JavaScript UDF. This parser would need to take the raw bytes of the UTF-16 blob, carefully decode them, and implement error handling to manage invalid characters or formatting issues. For example:
function parseUTF16Blob(blob) {
  try {
    // Decode the raw bytes as UTF-16 ('utf-16' defaults to little-endian).
    const decoder = new TextDecoder('utf-16');
    const text = decoder.decode(blob);
    // Split the decoded text into one record per line, dropping empty lines.
    const parsedRecords = text.split(/\r?\n/).filter(function (line) {
      return line.length > 0;
    });
    return parsedRecords;
  } catch (error) {
    // Return the failure along with the original bytes so it can be
    // routed to the error output instead of being silently dropped.
    return { error: error.message, blob: blob };
  }
}
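If you go this route, the parsing logic would typically be wired into the record-level hook that the JavaScript transform exposes. The sketch below assumes the CDAP-style transform(input, emitter, context) contract and a hypothetical 'body' field carrying the blob; also note that the plugin's JavaScript engine may not provide browser APIs such as TextDecoder, so treat this as a starting point to validate against your environment rather than a drop-in solution.

function transform(input, emitter, context) {
  // 'body' is a hypothetical field name for the raw blob bytes.
  const result = parseUTF16Blob(input.body);
  if (Array.isArray(result)) {
    // Emit each successfully decoded line as its own record.
    result.forEach(function (line) {
      emitter.emit({ body: line });
    });
  } else {
    // Route decoding failures to the error output as individual records.
    emitter.emitError({ errorCode: 1, errorMsg: result.error, invalidRecord: input });
  }
}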