mongodb - How to preserve markdown format in a parquet file - Stack Overflow


I have a field in a MongoDB collection called 'markdown' that looks like this:



Title

**Section 1**

Introduction

**Section 2**

Paragraph 1

**Section 3**

Paragraph 2

**Section 4**

Conclusion

## Background / Description

**Hypothesis**: Hypothesis

## Project Goals

* Goal 1
* Goal 2
* Goal 3

but when I convert this into a column in a parquet file (using pyarrow), it becomes this:

'\n\nTitle\n\n**Section 1**\n\nIntroduction\n\n**Section 2**\n\nParagraph 1\n\n**Section 3**\n\nParagraph 2\n\n**Section 4**\n\nConclusion\n\n## Background / Description\n\n**Hypothesis**: Hypothesis\n\n## Project Goals\n\n* Goal 1\n* Goal 2\n* Goal 3\n'

That escaped string is also what I get when I store the contents in an .md file, which defeats the purpose of having the markdown at all.

Is there a way to preserve markdown in a parquet file?
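One thing worth ruling out first: the quoted string with literal `\n` sequences looks like Python's `repr()` of the value, not necessarily the value itself. The sketch below uses a shortened, hypothetical sample (`md` is not the real document) to show that the same string displays escaped under `repr()` and as normal multi-line markdown under `print()`.

```python
# Hypothetical, shortened stand-in for the 'markdown' field of one document.
md = "\n\nTitle\n\n**Section 1**\n\nIntroduction\n\n* Goal 1\n* Goal 2\n"

# repr() is what a REPL, a debugger, or printing a list/dict shows:
# newlines appear as the two characters backslash + n, wrapped in quotes.
escaped = repr(md)
print(escaped)

# print() (and file.write()) emit the real newline characters instead.
print(md)
```

If the escaped form is what ends up in the .md file, the likely cause would be writing `repr(value)` or `str()` of a containing list or Series rather than the string itself, though that is an assumption about code not shown here.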

Edit: I am converting the MongoDB documents into a parquet file using the following code:

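import logging

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# db (a pymongo database handle) and MONGO_COLLECTION are defined elsewhere.
bson_data = list(db[MONGO_COLLECTION].find())
logging.info(f"{len(bson_data)} documents found.")

# Build a DataFrame from the documents, then write it out as parquet.
df = pd.DataFrame(bson_data)
table = pa.Table.from_pandas(df)
pq.write_table(table, "/tmp/output.parquet")
logging.info("Conversion to parquet completed successfully.")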