In my Data Fusion pipeline, I start by reading a GCS file as a blob. Since the file is encoded in UTF-16, I set the charset to UTF-16 during the transformation step. Wrangler does not indicate any issues.
For testing purposes, I created a file with various formatting errors and ran my pipeline against it. Instead of processing each erroneous record separately as expected, the entire input is output as a single row. No correct rows are being written (in my case, to BigQuery).
When I manually changed the file’s charset to UTF-8 in Sublime Text and read the GCS file as text, the pipeline behaved as expected, correctly treating each row as a separate record in the error output and also writing the correct rows to their destination.
I even tried adding another transformation step that sets the charset and outputs bytes again, but that always results in the same thing: no correct records in the output, and all errors treated as one.
Does this ring a bell with anyone?
One of the best approaches is to pre-process the data before it reaches Data Fusion. This gives you better control over data quality and avoids the complexity of custom parsing (such as charset conversion) inside Data Fusion. By converting the file to UTF-8 before uploading it to GCS, you eliminate the UTF-16 decoding issue within Data Fusion altogether.
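As a rough illustration, the pre-processing step could be as small as the following Node.js sketch. The file names are placeholders, and it assumes the source is UTF-16LE (the common Windows default); use the 'utf-16be' label for big-endian files.

const fs = require('fs');

function convertToUtf8(inputPath, outputPath) {
  // Read the raw bytes of the UTF-16 source file.
  const raw = fs.readFileSync(inputPath);
  // Decode assuming little-endian UTF-16; a matching leading BOM is stripped.
  const text = new TextDecoder('utf-16le').decode(raw);
  // Write the same content back out as UTF-8, ready to upload to GCS.
  fs.writeFileSync(outputPath, text, { encoding: 'utf8' });
}

convertToUtf8('input-utf16.csv', 'input-utf8.csv');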
You can also try to create a custom parser within Data Fusion using a Wrangler transformation with a JavaScript UDF. This parser would need to take the raw bytes of the UTF-16 blob, carefully decode them, and implement error handling to manage invalid characters or formatting issues. For example:
function parseUTF16Blob(blob) {
  try {
    // Decode the raw bytes as UTF-16 ('utf-16' defaults to little-endian).
    const decoder = new TextDecoder('utf-16');
    const text = decoder.decode(blob);
    // Split the decoded text into one record per line, dropping empty lines.
    const parsedRecords = text.split(/\r?\n/).filter(function (line) {
      return line.length > 0;
    });
    return parsedRecords;
  } catch (error) {
    // Return the failure along with the original bytes so it can be
    // routed to the error output instead of being silently dropped.
    return { error: error.message, blob: blob };
  }
}
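If you go this route, the parsing logic would typically be wired into the record-level hook that the JavaScript transform exposes. The sketch below assumes the CDAP-style transform(input, emitter, context) contract and a hypothetical 'body' field carrying the blob; also note that the plugin's JavaScript engine may not provide browser APIs such as TextDecoder, so treat this as a starting point to validate against your environment rather than a drop-in solution.

function transform(input, emitter, context) {
  // 'body' is a hypothetical field name for the raw blob bytes.
  const result = parseUTF16Blob(input.body);
  if (Array.isArray(result)) {
    // Emit each successfully decoded line as its own record.
    result.forEach(function (line) {
      emitter.emit({ body: line });
    });
  } else {
    // Route decoding failures to the error output as individual records.
    emitter.emitError({ errorCode: 1, errorMsg: result.error, invalidRecord: input });
  }
}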