I’m not very experienced with machine learning (and I’m late to the party), but I can supply my own two cents. I read the paper, although I skimmed the sections that seemed to be explaining forum mafia. Apologies for this being a pretty hefty wall of text.
How they got to their benchmark (unimportant read)
Under section 3 (Machine Learning Benchmark), the paper says:
We propose the following benchmark task: Take a twenty-fold stratified shuffle split of all documents with a word count of 50 or higher. For every fold, fit a fresh instance of the pipeline on the training set. Use that instance to generate predictions for the test set. These predictions should at least be scored using the area under the precision-recall curve. The baseline score set by this paper is 0.286.
In plain English, this is saying that they:
- Took every “document” (their term for the collection of a single player’s posts in a single game, filtered down to what’s deemed fit for the algorithm) with 50+ words, then
- performed twenty independent stratified shuffle splits, each of which randomly partitions the documents into a training portion and a test portion (“stratified” just means the village:mafia ratio is kept intact in both portions; the exact test fraction isn’t stated), then
- weighted the data (because, according to the paper, there are a few more factors to consider; for example, a single 10,000-word document is less valuable than 100 independent 100-word documents), then
- trained a fresh logit (logistic regression) model on each of the twenty training portions (so the twenty models differ only in which documents landed in their split) and scored each one on its corresponding test portion; see the sketch after the TL;DR below.
- Combined, they got an AUCPR (area under the precision-recall curve) score of 0.286. I don’t think you can convert this to a percentage for how often it correctly guesses a wolf, but it gives a general guideline of how well the model balances precision and recall (which, in this case, it seems to do fairly poorly).
However, some of the above was only provided later in the paper.
TL;DR: They trained twenty different logistic regression models, one per random train/test split, and got an AUCPR score of 0.286. I don’t think that directly translates to a percentage or accuracy rating, but it shows that the benchmark is a decently low bar.
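Here’s a minimal sketch of that procedure, as I understand it, in scikit-learn. The feature extraction (TF-IDF), the test fraction, and the `load_documents()` loader are all my own placeholders, not the paper’s actual pipeline:

```python
# A minimal sketch of the benchmark as I understand it. TF-IDF features,
# the 25% test fraction, and load_documents() are my own placeholders,
# NOT the paper's actual pipeline.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.pipeline import make_pipeline

# documents: one text blob per player per game, pre-filtered to 50+ words
# labels: 1 = mafia, 0 = village (hypothetical loader and encoding)
documents, labels = load_documents()
labels = np.asarray(labels)

splitter = StratifiedShuffleSplit(n_splits=20, test_size=0.25, random_state=0)
scores = []
for train_idx, test_idx in splitter.split(documents, labels):
    # a fresh pipeline per split, as the benchmark specifies
    # (the paper also weights the documents; that step is omitted here)
    pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    pipeline.fit([documents[i] for i in train_idx], labels[train_idx])
    wolf_probs = pipeline.predict_proba([documents[i] for i in test_idx])[:, 1]
    scores.append(average_precision_score(labels[test_idx], wolf_probs))

print(f"mean AUCPR over the 20 splits: {np.mean(scores):.3f}")  # paper: 0.286
```

`average_precision_score` is the usual scikit-learn stand-in for the area under the precision-recall curve.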
Later, in Table 2, the authors report an average precision in the range of 0.27-0.28, where chance is 0.23 (presumably the fraction of mafia-aligned documents, since a scorer with no signal gets an average precision near the positive-class base rate). So the model did better than chance, but it’s still a pretty poor tool for reading players. It’s unclear whether this is a refined version of the benchmark model using the same logit method or simply the benchmark model itself, however.
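You can sanity-check that 0.23 chance figure with a quick simulation of a scorer with zero signal (my own toy code, not from the paper):

```python
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
# ~23% positives, mirroring the paper's reported class balance
y = (rng.random(100_000) < 0.23).astype(int)
noise = rng.random(100_000)  # a "model" that ranks players at random
print(average_precision_score(y, noise))  # ~0.23, the base rate
```

A random ranker’s average precision converges to the fraction of positives, which is why 0.23 is the floor to beat here.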
The model did score better, though, when only analyzing >5,000-word “documents” (again, an aggregate of everything a player has said in a game). That makes me think that, theoretically, as the total number of words a player posts increases, the machine approaches 100% accuracy on that player.
Some potential mafia-related takeaways
According to section 5.1 (Statistical Analysis), deception was positively correlated with post length, usage of the word “but,” and a few other features; meanwhile, it was negatively correlated with sentence length and post frequency. Word length and type-token ratio (the number of unique words divided by the total number of words) didn’t correlate at all.
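For anyone unfamiliar with type-token ratio, here’s roughly what that computation looks like (my own sketch, using crude whitespace tokenization):

```python
def type_token_ratio(text: str) -> float:
    """Unique words ("types") divided by total words ("tokens")."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0

print(type_token_ratio("to be or not to be"))  # 4 unique / 6 total ~= 0.67
```

A low ratio means lots of repeated words; a high ratio means varied vocabulary.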
You can see this in Table 1 (an AUROC below 0.5 means the feature is more prevalent among mafia-aligned players):
[Table 1 image from the paper (flashbang warning)]
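To make the AUROC direction concrete, here’s a toy example with made-up post lengths. I’m assuming the paper treats village as the positive class, which is my reading of the table’s convention rather than something it states outright:

```python
from sklearn.metrics import roc_auc_score

# made-up feature values: four village players and four mafia players,
# with mafia writing longer posts on average (as section 5.1 suggests)
village_lengths = [40, 55, 60, 80]
mafia_lengths = [70, 90, 110, 130]

y_true = [1] * 4 + [0] * 4  # 1 = village (assumed positive class), 0 = mafia
y_score = village_lengths + mafia_lengths
print(roc_auc_score(y_true, y_score))  # 0.0625, well below 0.5
```

So a feature that runs higher among mafia players lands below 0.5, which matches the “AUROC < 0.5 means more mafia-prevalent” reading.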
I agree with the paper where it says:
[…] the shorter posts from townsfolk may be the result of relatively unfiltered expression. Townsfolk may believe that they can broadcast any idea they come up with the moment they come up with it, since they know their thoughts are genuine. Mafia may feel the need to add more detail to their personal narrative.
However, I would take these with a grain of salt. The statistics aggregate a presumably diverse group of players, and writing longer sentences or posting more frequently may simply be an individual player’s mannerism, regardless of alignment. It’s still fairly interesting information, though.
As far as interpreting the precision score goes, I believe you’re correct: it means that when the model guesses a player is a wolf, it’s right 39% of the time.
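A quick toy check of that reading, with numbers made up so the precision comes out to 0.39:

```python
from sklearn.metrics import precision_score

# hypothetical tally: the model flags 100 players as wolves;
# 39 of them really are wolves, 61 are villagers
y_true = [1] * 39 + [0] * 61  # 1 = wolf, 0 = village
y_pred = [1] * 100            # every one of these got called "wolf"
print(precision_score(y_true, y_pred))  # 39 / (39 + 61) = 0.39
```

Precision only measures how trustworthy a “wolf” call is; it says nothing about the wolves the model misses entirely (that’s recall).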