Spark JDBC table to Dataframe: no partitionCol to use


I have a MySQL RDBMS table (3 million rows, of which only about 209K are returned by my filter) like the one below that I need to load into a Spark dataframe from Python. The problem is that the load is really slow (1.5 hours at minimum), so I need to read it concurrently, but as you can see the table has no obvious column I can use for the "upperBound" and "lowerBound" that the JDBC reader needs. So my question is how to load this table concurrently. I can't change the table, and I can't find an example of a table like this being loaded into a dataframe with concurrency.

Please let me know if I am being dense about this, but I just haven't come across this issue before.

The USERJSON column holds a long JSON string; the sample below shows only its character count.

+------------------------------------+-------------------------------------+-------------------------------------+---------+-------------+---------------------+-----------------+------------------------+
|SESSIONID                           |PARTID                               |USERID                               |USERNAME |ACTIVE_FLAG  |LOGINTIMESTAMP       |LOGOUTTIMESTAMP  |USERJSON                | 
+------------------------------------+-------------------------------------+-------------------------------------+---------+-------------+---------------------+-----------------+------------------------+
|00000123-e63b-4b65-a47a-c84620ae4d20|d6ee09a5-1a16-4a0a-9e2a-3b9ffd9cf1d0 |null                                 |null     |1            |2024-09-25 08:43:44  |null             |null                    |
|000012e8-8baf-4adc-bb1e-4c3aead53e60|d6ee09a5-1a16-4a0a-9e2a-3b9ffd9cf1d0 |2ab6e89b-1dc0-e8a2-ad32-87296434b69a |null     |1            |2023-09-22 00:00:00  |null             |[2,620 CHARACTER_JSON]  |
|000022b4-ad4a-4cef-8285-e65d35b7b106|c59ba81c-5e2f-4760-bf44-24432f1e76fc |252ea556-7eb1-336e-bec5-36df57b8ecee |null     |1            |2023-12-23 11:20:34  |null             |[2,554 CHARACTER_JSON]  |
|000034d2-5607-472d-a8d3-ecf81c76a4cf|d6ee09a5-1a16-4a0a-9e2a-3b9ffd9cf1d0 |da192ec4-97ef-34dc-70d2-3b7b17fd6dcc |null     |1            |2023-06-19 00:00:00  |null             |[2,526 CHARACTER_JSON]  |
+------------------------------------+-------------------------------------+-------------------------------------+---------+-------------+---------------------+-----------------+------------------------+


df_session = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:mysql://127.0.0.1:3317/sesdb?useSSL=false") \
    .option("driver", "com.mysql.jdbc.Driver") \
    .option("user", "spark") \
    .option("password", "[PASS]") \
    .option("query", "select * from sesdb.session where PARTID IN ('c59ba81c-5e2f-4760-bf44-24432f1e76fc', '992f6369-bf10-4b2e-bd97-b7c99ec4d6f9', 'd6ee09a5-1a16-4a0a-9e2a-3b9ffd9cf1d0')") \
    .load()
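
One workaround I have seen suggested, sketched below (PART_KEY and the CRC32 hash are my own additions, not columns in the table), is to synthesize a numeric partition column inside a dbtable subquery, here using MySQL's CRC32 over SESSIONID, and hand that to Spark as the partitionColumn. Note that the "query" option cannot be combined with partitioning, so the filter moves into the subquery:

# Sketch: derive a numeric key with MySQL's CRC32 so Spark can range-partition.
# PART_KEY is a made-up alias; CRC32 returns an unsigned 32-bit integer,
# so 0 .. 4294967295 are valid bounds.
subquery = """
    (select s.*, crc32(s.SESSIONID) as PART_KEY
     from sesdb.session s
     where s.PARTID in ('c59ba81c-5e2f-4760-bf44-24432f1e76fc',
                        '992f6369-bf10-4b2e-bd97-b7c99ec4d6f9',
                        'd6ee09a5-1a16-4a0a-9e2a-3b9ffd9cf1d0')) as t
"""

df_session = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:mysql://127.0.0.1:3317/sesdb?useSSL=false") \
    .option("driver", "com.mysql.jdbc.Driver") \
    .option("user", "spark") \
    .option("password", "[PASS]") \
    .option("dbtable", subquery) \
    .option("partitionColumn", "PART_KEY") \
    .option("lowerBound", "0") \
    .option("upperBound", "4294967295") \
    .option("numPartitions", 16) \
    .option("fetchsize", 5000) \
    .load()

Because CRC32 hashes roughly uniformly, the 16 ranges should each pull a similar share of the ~209K rows. The trade-off is that the hash is computed per partition query on the MySQL side and cannot use an index, so each partition still scans the filtered rows.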

EDIT:

Hi @Jonathan, yes, I tried that, but the problem is that I don't want the read keyed to LOGINTIMESTAMP; I need all records, regardless of LOGINTIMESTAMP. The code below gets me closer, though.

.option("dbtable", "sesdb.session") \
.option("numPartitions", 200) \
.option("fetchsize", 5000) \
.option("partitionColumn", "LOGINTIMESTAMP") \
.option("lowerBound", "2018-07-06 00:00:00") \
.option("upperBound", "2025-07-20 00:00:00") \