Cookie Thread Act 6: Cookie & Thread

orangeandblack5 · September 11, 2024, 9:27pm

I’m not super familiar with web scraping but there are python libraries that make this pretty easy

orangeandblack5 · September 11, 2024, 9:27pm

I have, uh, definitely done this to try and validate meta reads on your activity in the past

orangeandblack5 · September 11, 2024, 9:28pm

genuinely though ChatGPT could actually be a good starting point here if nobody has more experience and more specific advice lol

Arete · September 11, 2024, 9:29pm

wait, when

the last time we were in a game together and not hydraing was… possibly Autumn Invitational but there might be something I’m forgetting? were you trying to meta me while speccing champs or something??

lilith · September 11, 2024, 9:38pm

hi hello done this on a couple of occasions
itll be… tedious either way. But with a combination of some python libraries (i believe beautifulsoup4 is a good one for html, plus an http request library for sure) and some analysis of where thread links are and how to get just the post text from the posts, its definitely possible
a guide on web scraping will cover those fundamentals, and from there you have to find the links and text and query for everything with certain IDs or tags. This changes with forum software but generally these are pretty stable. If you reload the site and the IDs change youll need to use something like selenium though and automate actually clicking and searching text. this is a bit of a messy explanation but if you still need help tmrw let me know and ill go through it a lot better - for now im off to bed
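
A rough sketch of the "query for everything with certain tags" step described above. To stay dependency-free it swaps in Python's built-in `html.parser` instead of beautifulsoup4, and the class name `post-body` is purely an assumption; the real class or ID has to be read off the forum's actual HTML.

```python
from html.parser import HTMLParser


class PostTextExtractor(HTMLParser):
    """Collects the visible text of every element carrying a given CSS class."""

    # Void elements have no closing tag, so they must not change the depth.
    VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # greater than 0 while inside a matching element
        self.posts = []     # one cleaned-up string per matching element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0:
            self.depth += 1
        elif self.target_class in classes:
            self.depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag in self.VOID_TAGS or self.depth == 0:
            return
        self.depth -= 1
        if self.depth == 0:
            # Join the pieces with spaces, then collapse repeated whitespace.
            self.posts.append(" ".join(" ".join(self._chunks).split()))

    def handle_data(self, data):
        if self.depth > 0:
            self._chunks.append(data)


def extract_posts(html, target_class="post-body"):
    """Return the text of each element whose class list includes target_class."""
    parser = PostTextExtractor(target_class)
    parser.feed(html)
    parser.close()
    return parser.posts


sample = """
<div class="post-body"><p>hello <b>there</b></p><p>second line</p></div>
<div class="sidebar">ignore me</div>
<div class="post-body">another post<br>with a break</div>
"""
print(extract_posts(sample))
```

Joining the pieces with spaces keeps words in neighbouring paragraphs apart, at the cost of splitting a word that is only partly bold or italic. For a rough word count that trade-off is probably acceptable.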

notblackorwhite · September 11, 2024, 9:39pm

Anywhere from trivial to marginally difficult depending on the specific forum software in question. Discourse would be close to trivial unless word count isn’t in its API (but I suspect it is). If you don’t have access to an API (either because you’re not authorized or it doesn’t exist), it becomes more difficult, but not necessarily hard yet. If the software already has a way to get stats for a page that includes word count, it’s not Hard, but it may not be Easy if you’re totally new to something like this. You would just need a script that can traverse each thread in the sub-board and each page to grab the word count and sum it.
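
A minimal sketch of the "traverse each thread and each page, grab the word count, and sum it" script, assuming the fetching is handled elsewhere. `fetch_page` is a hypothetical callback (not part of any forum API) that returns the post texts for one page of one thread, and an empty list once the page number runs past the end. The in-memory `FAKE_FORUM` only stands in for real requests.

```python
def count_words(text):
    """Count whitespace-separated words in one post."""
    return len(text.split())


def total_word_count(thread_ids, fetch_page):
    """Sum word counts over every page of every thread.

    fetch_page(thread_id, page) is a caller-supplied (hypothetical) function
    returning a list of post texts, or an empty list past the last page.
    """
    total = 0
    for thread_id in thread_ids:
        page = 1
        while True:
            posts = fetch_page(thread_id, page)
            if not posts:
                break
            total += sum(count_words(post) for post in posts)
            page += 1
    return total


# Stand-in data: thread ID -> list of pages -> list of post texts.
FAKE_FORUM = {
    101: [["first post here", "a reply"], ["page two post"]],
    102: [["only one page"]],
}


def fake_fetch_page(thread_id, page):
    pages = FAKE_FORUM[thread_id]
    return pages[page - 1] if page <= len(pages) else []


print(total_word_count([101, 102], fake_fetch_page))
```

If the forum's stats feature already reports a word count per page, `fetch_page` could return that number directly and the counting step would drop out.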

lilith · September 11, 2024, 9:41pm

last time i did anything like this was for FAM4 actually. i scraped the pfp and banners of every user if you remember. i had to use selenium for that one to load the banners

lilith · September 11, 2024, 9:41pm

HA i posted first

notblackorwhite · September 11, 2024, 9:41pm

You can almost certainly inspect the request that’s made when you click the button to see the stats, and then leverage that to avoid actually navigating to each page and having the script interact with the stats button.

(This is what I did with the VC Bot to avoid having it interact with the frontend beyond posting. I just made the same request the VC plugin does based on the thread ID and latest post number.)

Arete · September 11, 2024, 9:44pm

the link for each individual thread on the ~forum will always be the same, and the link for the stats page for a given forum thread will also always be the same

(I think this answers your question but I’m not positive)

I don’t think I have access to its API, I’m just like a normal user. I know the people who run it and could theoretically ask them for access but I’d feel weird about doing that just to get, like, minorly interesting information

notblackorwhite · September 11, 2024, 9:49pm

so yeah the simplest plan of attack would be to write a scraper that grabs just the necessary thread/board hierarchy IDs and page bounds, and just iterate over the set using the stats page request.
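
One way that plan might look once the thread IDs and page bounds are collected, as a sketch only. The URL template and the `thread` and `page` parameter names are made up for illustration; the real ones have to be copied from the request visible in the browser's network tab when the stats button is clicked.

```python
from urllib.parse import urlencode

# Hypothetical template: replace with the request actually observed in the
# browser's developer tools.
STATS_URL_TEMPLATE = "https://forum.example.com/stats?{query}"


def stats_requests(thread_pages):
    """Yield one stats URL per (thread ID, page) pair.

    thread_pages maps each thread ID to its last page number.
    """
    for thread_id, last_page in sorted(thread_pages.items()):
        for page in range(1, last_page + 1):
            query = urlencode({"thread": thread_id, "page": page})
            yield STATS_URL_TEMPLATE.format(query=query)


urls = list(stats_requests({42: 2, 7: 1}))
for url in urls:
    print(url)
```

Building the full list of URLs up front makes it easy to check the count before sending a single request.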

lilith · September 11, 2024, 9:49pm

okay this is just what i said but coherent and well phrased what the hell gray

notblackorwhite · September 11, 2024, 9:50pm

get absolutely destroyed lil sib

Arete · September 11, 2024, 9:51pm

okay

…how do I figure out how to do that

lilith · September 11, 2024, 9:52pm

I will surpass you.

Litten · September 11, 2024, 9:53pm

To write a scraper for a Discourse forum that grabs the necessary thread/board hierarchy IDs and page bounds, you would follow these general steps:

Steps to Scrape a Discourse Forum:

Understand the Structure:
Discourse exposes a REST-style API that serves data as JSON. You can make requests to specific endpoints to retrieve data.
Identify Relevant Endpoints:
You can start by inspecting the network activity (using browser developer tools) when navigating a Discourse forum. The primary endpoints to look for include:
- Category API Endpoint: /categories.json to get all categories and their respective IDs.
- Topics in a Category: /c/{category_id}/l/latest.json to get threads/topics under a specific category.
- Posts in a Topic: /t/{topic_id}.json to get the posts within a specific topic.
Thread Hierarchy:
You can gather the category IDs from the /categories.json endpoint, then use the category IDs to get topic IDs within each category, and finally gather posts from each topic.
Pagination:
Pagination is often included in the JSON response, usually in the form of a page parameter or a more_topics_url. You can iterate through pages by appending ?page={number} to your requests.
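
The pagination step could be sketched like this, assuming the topic list response carries a `more_topics_url` key while more pages remain. `fetch_json` is a hypothetical caller-supplied function; on a real forum it would also need to turn the value of `more_topics_url` into a full JSON URL, since it may come back as a relative path. The `PAGES` dictionary is fake data standing in for real responses.

```python
def iter_topics(first_url, fetch_json):
    """Yield every topic, following more_topics_url until it is absent."""
    url = first_url
    seen = set()  # guards against a response that points back at itself
    while url and url not in seen:
        seen.add(url)
        topic_list = fetch_json(url)["topic_list"]
        yield from topic_list["topics"]
        url = topic_list.get("more_topics_url")


# Fake responses keyed by URL, shaped like the topic list JSON described above.
PAGES = {
    "/c/5/l/latest.json": {
        "topic_list": {
            "topics": [{"id": 1}, {"id": 2}],
            "more_topics_url": "/c/5/l/latest.json?page=1",
        }
    },
    "/c/5/l/latest.json?page=1": {
        "topic_list": {"topics": [{"id": 3}]}
    },
}

ids = [topic["id"] for topic in iter_topics("/c/5/l/latest.json", PAGES.__getitem__)]
print(ids)
```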

Example Approach:

Fetch Categories:
```
https://forum.example.com/categories.json
```
This will return JSON data with all categories, including their IDs and titles.
Fetch Topics in a Category:
Once you have a category ID, fetch the list of topics:
```
https://forum.example.com/c/{category_id}/l/latest.json?page=1
```
Here, {category_id} is replaced by the actual category ID, and you can iterate over the pages.
Fetch Posts in a Topic:
For a specific topic (thread), grab the posts:
```
https://forum.example.com/t/{topic_id}.json
```
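This will give you the hierarchy and content of posts in the thread. For long topics the response only includes the first batch of posts; the full list of post IDs is in post_stream.stream.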
Iterate Over Pages:
To handle large numbers of topics or posts, use the page query parameter (or follow more_topics_url when the response includes it).

Basic Python Example Using `requests`:

import time

import requests

BASE_URL = "https://forum.example.com"  # placeholder; use the real forum address


def get_json(path):
    """Fetch one JSON endpoint, failing loudly on HTTP errors."""
    response = requests.get(f"{BASE_URL}{path}", timeout=30)
    response.raise_for_status()
    time.sleep(1)  # crude pause between requests to stay under rate limits
    return response.json()


# Get the categories
categories = get_json("/categories.json")['category_list']['categories']

for category in categories:
    category_id = category['id']
    print(f"Category: {category['name']} (ID: {category_id})")

    # Get the topics in this category (first page only)
    topics = get_json(f"/c/{category_id}/l/latest.json")['topic_list']['topics']

    for topic in topics:
        topic_id = topic['id']
        print(f"  Topic: {topic['title']} (ID: {topic_id})")

        # Get the posts in this topic (only the first batch the endpoint returns)
        posts = get_json(f"/t/{topic_id}.json")['post_stream']['posts']

        for post in posts:
            print(f"    Post ID: {post['id']}, Content: {post['cooked']}")

Notes:

Respect Rate Limits: Discourse has rate limits in place. Ensure your scraper obeys these limits, and implement sleep or backoff strategies to avoid getting banned.
Handle JSON Structure: Discourse’s API is highly structured, so parse the JSON response carefully to extract the desired information.
Authentication: Some forums may require authentication (bearer tokens or cookies), so ensure you handle authentication if necessary.
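
The "sleep or backoff" note might be sketched as a small retry wrapper. Everything named here is an assumption for illustration: `RateLimited` is a hypothetical exception the caller's own fetch function would raise on an HTTP 429 response, and the delays (1 second, doubling each retry) are arbitrary starting values, not limits documented by any forum.

```python
import time


class RateLimited(Exception):
    """Hypothetical error raised by the caller's fetch function on HTTP 429."""


def with_backoff(fetch, url, retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), waiting 1x, 2x, 4x... base_delay after each rate limit."""
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except RateLimited:
            if attempt == retries:
                raise  # out of retries, let the caller see the error
            sleep(base_delay * 2 ** attempt)


# Demo with a fake fetch that is rate limited twice, then succeeds.
calls = {"n": 0}


def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] <= 2:
        raise RateLimited(url)
    return "ok"


delays = []  # record the requested delays instead of actually sleeping
result = with_backoff(flaky_fetch, "https://forum.example.com/latest.json",
                      sleep=delays.append)
print(result, delays)
```

Passing `sleep` in as a parameter keeps the wrapper testable without waiting; in real use the default `time.sleep` applies.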

Would this approach work for your use case, or do you need a more specific part of the Discourse forum targeted?

Litten · September 11, 2024, 9:53pm

I have no idea if this works or not

lilith · September 11, 2024, 9:54pm

thank god i dont have college tomorrow. Arete if youre still having trouble I’ll like do a tutorial or something for you then

lilith · September 11, 2024, 9:54pm

what the fuck

lilith · September 11, 2024, 9:54pm

Steelkitten???
