
Asked 1 month ago by ZenithCommander210

Why Do Two Hashtag Counting Methods in a Twitter DataFrame Yield Different Results?


I am analyzing a pandas DataFrame created from the cyberbullying dataset that contains two columns: 'tweet_text' and 'cyberbullying_type'.

I want to count the total number of hashtags used in each 'cyberbullying_type' category. To do this, I tried two different methods which I believe both count duplicates, but surprisingly, they produce different totals.


First Method:

PYTHON
import re

# Define the pattern for valid hashtags
hashtag_pattern = r'#[A-Za-z0-9]+'

# Function to count the total number of hashtags in a dataframe
def count_total_hashtags(dataframe):
    return dataframe['tweet_text'].str.findall(hashtag_pattern).apply(len).sum()

for category in df['cyberbullying_type'].unique():
    count = count_total_hashtags(df[df['cyberbullying_type'] == category])
    print(f"Number of hashtags in all tweets for the '{category}' category: {count}")

Output: 'not_cyberbullying': 3265, 'gender': 2691, 'religion': 1798, 'other_cyberbullying': 1625, 'age': 728, 'ethnicity': 1112


Second Method:

This manual approach counts hashtags per tweet for each category:

PYTHON
def count_hashtags_by_category(dataframe):
    hashtag_counts = {}
    for category in dataframe['cyberbullying_type'].unique():
        # Filter tweets by category
        category_tweets = dataframe[dataframe['cyberbullying_type'] == category]
        # Count hashtags in each tweet
        hashtag_counts[category] = category_tweets['tweet_text'].apply(
            lambda text: sum(1 for word in text.split()
                             if word.startswith('#') and word[1:].isalnum())
        ).sum()
    return hashtag_counts

# Count hashtags for each category
hashtags_per_category = count_hashtags_by_category(df)
print(hashtags_per_category)

The output here is: {'not_cyberbullying': 3018, 'gender': 2416, 'religion': 1511, 'other_cyberbullying': 1465, 'age': 679, 'ethnicity': 956}


Why do these two methods produce different hashtag counts?


3 Answers

Answer by StackOverflowTychoAI Agent 1 month ago


The two methods extract hashtags differently:

  • Regex Method:
    The regular expression r'#[A-Za-z0-9]+' scans the entire tweet text for any sequence that starts with a '#' followed by one or more letters or digits. This method will match hashtags even if they are adjacent to punctuation or not clearly separated by whitespace.

  • Manual Split Method:
    The manual method uses text.split(), which splits the tweet by whitespace. It then checks if each token starts with a '#' and if the rest of the token is fully alphanumeric (using word[1:].isalnum()). This approach misses hashtags that are attached to punctuation (e.g., "#example," where the trailing comma causes isalnum() to return False) or hashtags that are not a separate whitespace-delimited token.

Because of these differences in extraction logic, the counts vary between the two methods.
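As a rough illustration of the difference, the sketch below runs both counting rules on a few made-up tweets (the sample strings and the helper names `count_regex` and `count_split` are assumptions for this example, not part of the dataset or the original code):

```python
import re

hashtag_pattern = r'#[A-Za-z0-9]+'

def count_regex(text):
    # Same rule as the first method: every '#' followed by ASCII letters/digits
    return len(re.findall(hashtag_pattern, text))

def count_split(text):
    # Same rule as the second method: whole whitespace-delimited tokens only
    return sum(1 for word in text.split()
               if word.startswith('#') and word[1:].isalnum())

# Made-up sample tweets, for illustration only
samples = [
    "love this #example, so true",  # trailing comma after the hashtag
    "#one#two glued together",      # two hashtags with no whitespace between
    "plain #hashtag here",          # cleanly separated hashtag
]

regex_counts = [count_regex(t) for t in samples]
split_counts = [count_split(t) for t in samples]
print(regex_counts)  # [1, 2, 1]
print(split_counts)  # [0, 0, 1]
```

Only the cleanly separated hashtag is counted the same way by both rules, which is consistent with the regex totals being higher in every category.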


Answer by EtherealWayfarer995 2 months ago


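Your two methods are not strictly identical. For instance, #YolsuzlukVeRüşvetYılı2014 is only partially matched by the regex, which stops at the first non-ASCII character and yields #YolsuzlukVeR, while the split+alnum approach accepts it whole, since str.isalnum() is true for non-ASCII letters. A hashtag that starts with a non-ASCII letter is not matched by the regex at all. Also note that hashtags containing _ are valid but mishandled by both approaches: the regex truncates them at the underscore, and the split+alnum approach skips them entirely.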
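A quick check of how each approach treats these cases, using the two patterns from the question on made-up strings (the helper `split_matches` and the strings `#foo_bar` and `#über` are assumptions for illustration):

```python
import re

hashtag_pattern = r'#[A-Za-z0-9]+'

def split_matches(text):
    # Tokens accepted by the split + isalnum rule from the question
    return [w for w in text.split() if w.startswith('#') and w[1:].isalnum()]

turkish = "#YolsuzlukVeRüşvetYılı2014"
underscore = "#foo_bar"        # hypothetical hashtag with an underscore
non_ascii_start = "#über"      # hypothetical hashtag starting with a non-ASCII letter

regex_turkish = re.findall(hashtag_pattern, turkish)
split_turkish = split_matches(turkish)
regex_underscore = re.findall(hashtag_pattern, underscore)
split_underscore = split_matches(underscore)
regex_start = re.findall(hashtag_pattern, non_ascii_start)
split_start = split_matches(non_ascii_start)

print(regex_turkish, split_turkish)
print(regex_underscore, split_underscore)
print(regex_start, split_start)
```

The regex returns a truncated match for the first two strings and nothing for the third, while the split rule accepts the first and third whole and rejects the one with the underscore.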

I would suggest a simpler approach: combine str.count and groupby.sum. This is shorter and much more efficient than manually looping over the categories:

PYTHON
import pandas as pd

hashtag_pattern = r'#[\w_]+' # short regex for hashtags

df = pd.read_csv('twitter_parsed_dataset.csv')
df['Text'].str.count(hashtag_pattern).groupby(df['Annotation']).sum()

Example output:

PLAINTEXT
Annotation
none      6402.0
racism     287.0
sexism    2103.0
Name: Text, dtype: float64
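Adapted to the column names from the question ('tweet_text' and 'cyberbullying_type'), the same idea might look like the sketch below. The four rows are invented purely so the snippet runs on its own; they are not taken from the real dataset.

```python
import pandas as pd

# Made-up rows for illustration only
df = pd.DataFrame({
    'tweet_text': ['#a #b hello', 'no tags here', '#c, done', 'plain #d'],
    'cyberbullying_type': ['age', 'age', 'gender', 'gender'],
})

hashtag_pattern = r'#\w+'  # \w already includes the underscore

# Count matches per tweet, then sum the counts within each category
totals = (
    df['tweet_text']
    .str.count(hashtag_pattern)
    .groupby(df['cyberbullying_type'])
    .sum()
)
print(totals)
```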

If you want a more complex regex to extract hashtags (e.g. to ignore #1 as a hashtag), you can refer to this question.
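As one possible sketch of such a stricter pattern (an assumption for illustration, not the pattern from the linked question), the regex below requires that the '#' is not glued to a preceding word character and that the tag contains at least one non-digit word character:

```python
import re

# Hypothetical stricter pattern:
#   (?<!\w)         '#' must not directly follow a word character
#   (?=\w*[^\W\d])  the tag must contain at least one word character that is not a digit
#   \w+             the tag itself (letters, digits, underscore; Unicode-aware for str)
strict_pattern = r'(?<!\w)#(?=\w*[^\W\d])\w+'

text = "ranked #1 in #2014goals, see page#3 and #tag_one"
strict_matches = re.findall(strict_pattern, text)
print(strict_matches)  # ['#2014goals', '#tag_one']
```

The same pattern string can be passed to `Series.str.count`, since pandas hands it to Python's `re` module.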


Answer by MeteorEngineer376 2 months ago


Why not use the count method on strings?

PYTHON
s = "#hello #world"
s.count("#")  # 2
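One caveat worth checking before relying on this: counting the raw '#' character also counts signs that do not start a hashtag. A small comparison on a made-up string (the string and variable names are assumptions for illustration):

```python
import re

s = "C# is fun # 1 #hello #world"

raw_count = s.count("#")                   # every '#' character
regex_count = len(re.findall(r'#\w+', s))  # only '#' followed by word characters

print(raw_count)    # 4
print(regex_count)  # 2
```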
