I am stuck on this SAS code that I have to rewrite for SQL (PySpark specifically).
data data2 data3;
    merge input_2(in=in2)
          input_1(in=in1);
    by col_1 col_2;
    if in1 and in2 then do;
        new_col = 'yes';
        output data3;
    end;
    else if in1 then output data2;
run;
For "if in1 and in2", I believe that's like a SQL inner join. But for "else if in1", this would be a left join, yes?
If so, does the order of "merge input_2 input_1" matter? Is input_2 equivalent to the "left" of a SQL left join?
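You can try the following. Checking col_1 and col_2 for nulls after a left join cannot detect a match, because the join keys always come from input_1, so input_2 is flagged before the join to play the role of SAS's in2:
from pyspark.sql.functions import col, lit, when

# Flag every input_2 row so a match is detectable after the join (the role of in2)
input_2_flagged = input_2_df.withColumn("_in2", lit(True))
merged_df = input_1_df.join(input_2_flagged, on=["col_1", "col_2"], how="left")
# Create new_col based on the SAS logic: 'yes' only when input_2 had a matching row
result_df = merged_df.withColumn(
    "new_col",
    when(col("_in2").isNotNull(), lit("yes"))
).select(
    # Keeps only input_1's columns plus new_col; assumes no non-key column names are shared with input_2
    *input_1_df.columns, "new_col"
)
# Filter into separate outputs
data3_df = result_df.filter(col("new_col") == "yes")  # in1 and in2
data2_df = result_df.filter(col("new_col").isNull())  # in1 only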
"else in1" precludes data2 from being the left join. data2 contains the data of input_1 that is not paired to that in input_2. data2 is input_1 EXCEPT input_2 – Richard Commented Jan 30 at 17:20
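The routing described in this comment can be checked on toy data without Spark. Below is a minimal plain-Python sketch, not a translation of the full DATA step: it assumes each BY key appears at most once per input, it tracks which rows land in which output rather than the full variable list SAS would write, and the function name route_rows and the sample rows are made up for illustration.

```python
def route_rows(input_1, input_2, keys=("col_1", "col_2")):
    """Model the DATA step's row routing for inputs with unique BY keys.

    Returns (data2, data3): data2 holds input_1 rows with no match in
    input_2; data3 holds matched rows with new_col = 'yes'.
    """
    def key_of(row):
        return tuple(row[k] for k in keys)

    lookup_2 = {key_of(row): row for row in input_2}

    data2, data3 = [], []
    for row_1 in input_1:               # in1 is true for every row visited here
        row_2 = lookup_2.get(key_of(row_1))
        if row_2 is not None:           # in1 and in2
            # input_1 is listed last in the MERGE, so its values win on overlap
            data3.append({**row_2, **row_1, "new_col": "yes"})
        else:                           # in1 only
            data2.append(dict(row_1))
    # Rows present only in input_2 satisfy neither branch and are dropped.
    return data2, data3


# Hypothetical sample data: x and y stand in for non-key columns.
input_1 = [
    {"col_1": 1, "col_2": "a", "x": 10},
    {"col_1": 2, "col_2": "b", "x": 20},
]
input_2 = [
    {"col_1": 1, "col_2": "a", "y": 100},
    {"col_1": 3, "col_2": "c", "y": 300},
]

data2, data3 = route_rows(input_1, input_2)
print("data2:", data2)
print("data3:", data3)
```

With this sample, the (1, "a") row goes to data3, the (2, "b") row goes to data2, and the (3, "c") row from input_2 goes nowhere. On the ordering question: the in= flags belong to each dataset, so listing input_2 first does not make it the "left" side; the order only decides whose values win when both inputs share a non-key variable.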