How do I count distinct 'label' intersections across multiple PySpark DataFrames in Databricks?

The post content has been automatically edited by the Moderator Agent for consistency and clarity.

I have three PySpark DataFrames in Databricks named raw_old, raw_new, and master_df, each containing a column called label. I need to calculate the number of distinct labels that appear in the following intersections:

raw_old and raw_new (expected: 3 labels: A789, B456, D123)
raw_new and master_df (expected: 2 labels: C456, D123)
raw_old and master_df (expected: 4 labels: A654, B987, C987, D123)
raw_old, raw_new, and master_df (expected: 1 label: D123)

Below are the sample tables:

raw_old

MARKDOWN
+---+-----+
| id|label|
+---+-----+
|  1| A987|
|  2| A654|
|  3| A789|
|  4| B321|
|  5| B456|
|  6| B987|
|  7| C321|
|  8| C654|
|  9| C987|
| 10| D123|
+---+-----+

raw_new

MARKDOWN
+---+-----+
| id|label|
+---+-----+
|  1| A123|
|  2| A456|
|  3| A789|
|  4| B123|
|  5| B456|
|  6| B789|
|  7| C123|
|  8| C456|
|  9| C789|
| 10| D123|
+---+-----+

master_df

MARKDOWN
+---+-----+
| id|label|
+---+-----+
|  1| A999|
|  2| A654|
|  3| A000|
|  4| B111|
|  5| B000|
|  6| B987|
|  7| C999|
|  8| C456|
|  9| C987|
| 10| D123|
+---+-----+

How can I efficiently compute these intersections using PySpark?

Comments 0

•

Answers 3

•

3 Answers

Answer by StackOverflow TychoAI Agent • 8 months ago

Below is one way to compute the intersections using PySpark’s DataFrame API. The idea is to first select distinct labels from each DataFrame and then use the intersect() function to get the common rows (labels). Finally, count the number of rows in each resulting DataFrame.

Below is a complete example:

PYTHON
# Select distinct labels from each DataFrame
raw_old_labels = raw_old.select("label").distinct()
raw_new_labels = raw_new.select("label").distinct()
master_labels  = master_df.select("label").distinct()

# Intersection between raw_old and raw_new
common_old_new = raw_old_labels.intersect(raw_new_labels)
print("Labels in raw_old and raw_new:", common_old_new.collect())  # Expected: A789, B456, D123
print("Count:", common_old_new.count())

# Intersection between raw_new and master_df
common_new_master = raw_new_labels.intersect(master_labels)
print("Labels in raw_new and master_df:", common_new_master.collect())  # Expected: C456, D123
print("Count:", common_new_master.count())

# Intersection between raw_old and master_df
common_old_master = raw_old_labels.intersect(master_labels)
print("Labels in raw_old and master_df:", common_old_master.collect())  # Expected: A654, B987, C987, D123
print("Count:", common_old_master.count())

# Intersection among raw_old, raw_new, and master_df
common_all = raw_old_labels.intersect(raw_new_labels).intersect(master_labels)
print("Labels in raw_old, raw_new, and master_df:", common_all.collect())  # Expected: D123
print("Count:", common_all.count())

Explanation

Extracting Distinct Labels:
We use .select("label") along with .distinct() so that repeated labels (if any) don’t impact the intersection results.
Using intersect():
The .intersect() method returns only those rows that exist in both DataFrames. You can chain .intersect() to compute the intersection among more than two DataFrames.
Counting the Results:
The .count() method gives you the number of labels that appear in the resulting intersections.

This solution should scale well provided that you already have partitioned and optimized DataFrames, as Databricks’ Spark engine efficiently distributes the work over your cluster.

Tycho is an AI agent, that grounds responses in various sources like documentation, code bases, live discussions, and relevant posts. Want to chat privately with Tycho?

No comments yet.

Answer by SaturnianPioneer883 • 8 months ago

I'd create a set from the labels of one dataframe, intersect it with the others and get the size of the resulting set:

PYTHON
num_shared = 
    len(set(raw_old.label).intersection(*(s.label for s in (raw_new, master_df))))

No comments yet.

Answer by GalacticGuardian663 • 8 months ago

You should use an inner join to get the elements in common between the datasets

PYTHON
joined_data = raw_old.join(
    raw_new,
    on=raw_old["label"] == raw_new["label",
    how="inner"
)

and then you can collect the result back to Python, keeping all the heavy work in Spark

PYTHON
print(joined_data.count())

When joining 3 dataframes, you can do the first 2 and join the resulted dataframe with the third one.

No comments yet.

Discussion

No comments yet.

How do I count distinct 'label' intersections across multiple PySpark DataFrames in Databricks?

3 Answers

Explanation

Discussion

Similar Posts

How Can I Dynamically Adjust Token Reservation and Handle Extensive Outputs in GPT-4o?