I have a MySQL RDBMS table (3 million rows, only ~209K returned by my query) that I need to load into a Spark DataFrame from Python. The problem is that the load is really slow (1.5 hours minimum), so I need to read it concurrently, but as you can see there is no numeric column I can use for the "upperBound" and "lowerBound" that the JDBC partitioned read needs. So my question is: how do I load this table concurrently? I can't change the table, and I can't find an example of a table like this being loaded into a DataFrame with concurrency.
Please let me know if I am being dense about this, but I just haven't come across this issue before.
The USERJSON column is a long JSON string.
+------------------------------------+-------------------------------------+-------------------------------------+---------+-------------+---------------------+-----------------+------------------------+
|SESSIONID |PARTID |USERID |USERNAME |ACTIVE_FLAG |LOGINTIMESTAMP |LOGOUTTIMESTAMP |USERJSON |
+------------------------------------+-------------------------------------+-------------------------------------+---------+-------------+---------------------+-----------------+------------------------+
|00000123-e63b-4b65-a47a-c84620ae4d20|d6ee09a5-1a16-4a0a-9e2a-3b9ffd9cf1d0 |null |null |1 |2024-09-25 08:43:44 |null |null |
|000012e8-8baf-4adc-bb1e-4c3aead53e60|d6ee09a5-1a16-4a0a-9e2a-3b9ffd9cf1d0 |2ab6e89b-1dc0-e8a2-ad32-87296434b69a |null |1 |2023-09-22 00:00:00 |null |[2,620 CHARACTER_JSON] |
|000022b4-ad4a-4cef-8285-e65d35b7b106|c59ba81c-5e2f-4760-bf44-24432f1e76fc |252ea556-7eb1-336e-bec5-36df57b8ecee |null |1 |2023-12-23 11:20:34 |null |[2,554 CHARACTER_JSON] |
|000034d2-5607-472d-a8d3-ecf81c76a4cf|d6ee09a5-1a16-4a0a-9e2a-3b9ffd9cf1d0 |da192ec4-97ef-34dc-70d2-3b7b17fd6dcc |null |1 |2023-06-19 00:00:00 |null |[2,526 CHARACTER_JSON] |
+------------------------------------+-------------------------------------+-------------------------------------+---------+-------------+---------------------+-----------------+------------------------+
df_session = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:mysql://127.0.0.1:3317/sesdb?useSSL=false") \
    .option("driver", "com.mysql.cj.jdbc.Driver") \
    .option("user", "spark") \
    .option("password", "[PASS]") \
    .option("query", "select * from sesdb.session where PARTID IN ('c59ba81c-5e2f-4760-bf44-24432f1e76fc', '992f6369-bf10-4b2e-bd97-b7c99ec4d6f9', 'd6ee09a5-1a16-4a0a-9e2a-3b9ffd9cf1d0')") \
    .load()
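One approach that avoids the numeric-bound requirement entirely is the `predicates` argument of `spark.read.jdbc`: you hand Spark an explicit list of non-overlapping WHERE clauses, and it runs one concurrent query per clause. Since the GUID columns are hex strings, a numeric bucket can be derived in MySQL with `CONV()`/`MOD()`. This is a sketch, not from the question: the bucket count, bucketing expression, and connection details are assumptions to illustrate the technique.

```python
# Sketch: concurrent JDBC read without a numeric partition column, using
# the `predicates` argument of spark.read.jdbc (one MySQL query per clause).
# num_buckets and the SESSIONID-hashing expression are illustrative choices.

num_buckets = 16  # one Spark partition (and one concurrent query) per bucket

part_ids = (
    "'c59ba81c-5e2f-4760-bf44-24432f1e76fc'",
    "'992f6369-bf10-4b2e-bd97-b7c99ec4d6f9'",
    "'d6ee09a5-1a16-4a0a-9e2a-3b9ffd9cf1d0'",
)
partid_filter = "PARTID IN (%s)" % ", ".join(part_ids)

# Derive a numeric bucket from the hex GUID: CONV() converts the first
# 8 hex characters of SESSIONID to a number, MOD() spreads it over buckets.
# The clauses are disjoint and together cover every row, so no row is
# duplicated or dropped.
predicates = [
    "%s AND MOD(CONV(SUBSTR(SESSIONID, 1, 8), 16, 10), %d) = %d"
    % (partid_filter, num_buckets, b)
    for b in range(num_buckets)
]

def load_sessions(spark):
    """Read sesdb.session with one concurrent query per predicate."""
    return spark.read.jdbc(
        url="jdbc:mysql://127.0.0.1:3317/sesdb?useSSL=false",
        table="sesdb.session",
        predicates=predicates,
        properties={
            "user": "spark",
            "password": "[PASS]",
            "driver": "com.mysql.cj.jdbc.Driver",
            "fetchsize": "5000",
        },
    )
```

One caveat: each predicate triggers a separate scan on the MySQL side, so without an index that supports the bucket expression the database does `num_buckets` full scans; keeping the bucket count modest (8-32) is usually the right trade-off.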
EDIT:
Hi @Jonathan, yes I tried that, but the problem is that I don't want to filter by LOGINTIMESTAMP; I need all records, regardless of LOGINTIMESTAMP. The code below gets me closer, though.
.option("dbtable", "sesdb.session") \
.option("numPartitions", 200) \
.option("fetchsize", 5000) \
.option("partitionColumn", "LOGINTIMESTAMP") \
.option("lowerBound", "2018-07-06 00:00:00") \
.option("upperBound", "2025-07-20 00:00:00") \