Schachcomputer.info Community


#1
01.06.2024, 14:31
spacious_mind
Living Forum Legend
 
Registered since: 29.06.2006
Location: Alabama, USA
Posts: 2,030
Thanks given: 167
Received 544 thanks for 287 posts
Some thoughts about Strength Tests

Hi Everyone,

Since I have been creating strength tests for more than ten years now, it has also been more than ten years of personal uncertainty, knowing that whenever I reference strength results to ELO I am providing technically incorrect references. The ELO system was created to rate game performance: wins, draws and losses are stored within a pool of results, and a predictor derived from that pool estimates the likelihood of each opponent winning or losing should they meet in a match. New match games are then added to the data pool in order to grow and improve the prediction formula.

This is specifically explained in the Wikipedia article linked below.

https://en.wikipedia.org/wiki/Elo_rating_system

Here is a summary of some key Wikipedia points:

"Elo ratings are comparative only and are valid only within the rating pool in which they were calculated, rather than being an absolute measure of a player's strength.

Elo's central assumption was that the chess performance of each player in each game is a normally distributed random variable. Although a player might perform significantly better or worse from one game to the next, Elo assumed that the mean value of the performances of any given player changes only slowly over time. Elo thought of a player's true skill as the mean of that player's performance random variable.

A further assumption is necessary because chess performance in the above sense is still not measurable. One cannot look at a sequence of moves and derive a number to represent that player's skill. Performance can only be inferred from wins, draws and losses. Therefore, if a player wins a game, they are assumed to have performed at a higher level than their opponent for that game. Conversely, if the player loses, they are assumed to have performed at a lower level. If the game is a draw, the two players are assumed to have performed at nearly the same level.

To simplify computation even further, Elo proposed a straightforward method of estimating the variables in his model (i.e., the true skill of each player). One could calculate relatively easily from tables how many games players would be expected to win based on comparisons of their ratings to those of their opponents. The ratings of a player who won more games than expected would be adjusted upward, while those of a player who won fewer than expected would be adjusted downward. Moreover, that adjustment was to be in linear proportion to the number of wins by which the player had exceeded or fallen short of their expected number.

From a modern perspective, Elo's simplifying assumptions are not necessary because computing power is inexpensive and widely available. Several people, most notably Mark Glickman, have proposed using more sophisticated statistical machinery to estimate the same variables. On the other hand, the computational simplicity of the Elo system has proven to be one of its greatest assets. With the aid of a pocket calculator, an informed chess competitor can calculate to within one point what their next officially published rating will be, which helps promote a perception that the ratings are fair.

The phrase "Elo rating" is often used to mean a player's chess rating as calculated by FIDE. However, this usage may be confusing or misleading because Elo's general ideas have been adopted by many organizations, including the USCF (before FIDE), many other national chess federations, the short-lived Professional Chess Association (PCA), and online chess servers including the Internet Chess Club (ICC), Free Internet Chess Server (FICS), Lichess, Chess.com, and Yahoo! Games. Each organization has a unique implementation, and none of them follows Elo's original suggestions precisely.

Instead one may refer to the organization granting the rating. For example: "As of August 2002, Gregory Kaidanov had a FIDE rating of 2638 and a USCF rating of 2742." The Elo ratings of these various organizations are not always directly comparable, since Elo ratings measure the results within a closed pool of players rather than absolute skill."
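
To make the quoted mechanics concrete, here is a small sketch of the expected-score curve and the proportional update described above (my own illustration in Python, not part of the Wikipedia text; the K-factor of 20 and the example ratings are arbitrary):

Code:
# Minimal sketch of the classic Elo expected score and update rule.
# K = 20 and the example ratings are arbitrary; federations use different K-factors.

def expected_score(rating_a, rating_b):
    # Expected score of A against B on the usual 400-point logistic scale.
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update(rating_a, rating_b, score_a, k=20):
    # New rating for A after scoring score_a (1, 0.5 or 0) against B.
    return rating_a + k * (score_a - expected_score(rating_a, rating_b))

# Example: a 2400 player beats a 2500 player and gains about 13 points.
print(round(update(2400, 2500, 1.0)))   # -> 2413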


Everything written in the above summary is in my opinion correct, except for one key assumption which is nowadays completely outdated:

"A further assumption is necessary because chess performance in the above sense is still not measurable. One cannot look at a sequence of moves and derive a number to represent that player's skill."

This may have been true 50-70 years ago, when the necessary computing power did not exist, but nowadays the statement is incorrect. Several chess websites provide strength analysis, and ever since the very beginnings of chess programs, move strength has been calculated and accepted as a method of analyzing the strength of the moves played in a game.

Some day in the future (maybe not in our lifetime), I fully expect that you will be able to take the complete game database of, for example, Wilhelm Steinitz or Garry Kasparov, plug it into the master computer program and get the strength results, overall and year by year, for each and every player, and accurately compare the evolving improvement of chess knowledge throughout history for players and computer programs alike. You will someday be able to track it much as Eric (Tibono) recently did with his King Performance testing of the Easy and Fun levels: just put the year on the chart instead, and you could replace Fun with Human and Easy with Computer, for example. You will probably also be able to distinguish between Learned Theory (Brain Memory) and Calculation (Brain Calculation) and rate both.
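
As a rough sketch of what such an analysis can already look like today (only an illustration of the idea, using the python-chess library and any UCI engine; the engine path, the search depth and average centipawn loss as the metric are my assumptions, not a finished method):

Code:
# Rough sketch: average centipawn loss of one game, judged by a "master" UCI engine.
# Assumptions: python-chess is installed, a UCI engine is available as "stockfish",
# depth 18 is arbitrary, and centipawn loss is only one possible strength metric.
import chess.pgn
import chess.engine

def average_centipawn_loss(pgn_path, engine_path="stockfish", depth=18):
    losses = []
    engine = chess.engine.SimpleEngine.popen_uci(engine_path)
    try:
        with open(pgn_path) as pgn:
            game = chess.pgn.read_game(pgn)
        board = game.board()
        limit = chess.engine.Limit(depth=depth)
        for move in game.mainline_moves():
            mover = board.turn
            # Evaluation of the position before the move, from the mover's point of view.
            best = engine.analyse(board, limit)["score"].pov(mover).score(mate_score=10000)
            board.push(move)
            # Evaluation after the played move, still from the mover's point of view.
            after = engine.analyse(board, limit)["score"].pov(mover).score(mate_score=10000)
            losses.append(max(0, best - after))
    finally:
        engine.quit()
    return sum(losses) / len(losses) if losses else 0.0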

The EDOChess website (http://www.edochess.ca/index.html) similarly acknowledges that using the name ELO would be potentially misleading, which is why he uses EDO; after all, ELO did not exist in the years before 1970. He also does a masterful job of developing his EDO ratings year by year for historical players. I think he uses, in part, a Bayesian ELO approach to derive his ratings.

He also acknowledges obvious problems with most of today's ELO rating systems:

"Arpad Elo put ratings on the map when he introduced his rating system first in the United States in 1960 and then internationally in 1970. There were, of course, earlier rating systems but Elo was the first to attempt to put them on a sound statistical footing. Richard Eales says in Chess: The History of a Game (1985) regarding Murray's definitive 1913 volume, A History of Chess that "The very excellence of his work has had a dampening effect on the subject," since historians felt that Murray had had the last word. The same could be said of Elo and his contribution to rating theory and practice. However, Elo, like Murray, is not perfect and there are many reasons for exploring improvements. The most obvious at present is the steady inflation in international Elo ratings [though this should probably not be blamed on Elo, as the inflation started after Elo stopped being in charge of F.I.D.E.'s ratings (added Jan. 2010)]. Another is that the requirements of a rating system for updating current players' ratings on a day-to-day basis are different from those of a rating system for players in some historical epoch. Retroactive rating is a different enterprise than the updating of current ratings.

In fact, when Elo attempted to calculate ratings of players in history, he did not use the Elo rating system at all! Instead, he applied an iterative method to tournament and match results over five-year periods to get what are essentially performance ratings for each period and then smoothed the resulting ratings over time. This procedure and its results are summarized in his book, The Rating of Chessplayers Past and Present (1978), though neither the actual method of calculation nor full results are laid out in detail. We get only a list of peak ratings of 475 players and a series of graphs indicating the ratings over time of a few of the strongest players, done by fitting a smooth curve to the annual ratings of players with results over a long enough period. When it came to initializing the rating of modern players, Elo collected results of international events over the period 1966-1969 and applied a similar iterative method. Only then could the updating system everyone knows come into effect - there had to be a set of ratings to start from.

Iterative methods rate a pool of players simultaneously, rather than adjusting each individual player's rating sequentially, after each event or rating period. The idea is to find the set of ratings for which the observed results are collectively most likely. But they are basically applicable only to static comparisons, giving the most accurate assessment possible of relative strengths at a given time. Elo's idea in his historical rating attempt was to smooth these static annual ratings over time.

While we can safely bet that Elo did a careful job of rating historical players, inevitably many choices have to be made in such an attempt, and other approaches could be taken. Indeed, Clarke made an attempt at rating players in history before Elo. Several other approaches have recently appeared, including the Chessmetrics system by Jeff Sonas, the Glicko system(s) of Mark Glickman, a version of which has been adopted by the US Chess Federation, and an unnamed rating method applied to results from 1836-1863 and published online on the Avler Chess Forum by Jeremy Spinrad. By and large these others have applied sequential updating methods, though the new (2005) incarnation of the Chessmetrics system is an interesting exception (see below) and Spinrad achieved a kind of simultaneous rating (at least more symmetric in time) by running an update algorithm alternately forwards and backwards in time.

There are pros and cons to all of these."
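
To illustrate what rating a pool of players simultaneously means in practice, here is a toy sketch of the iterative idea described above (my own simplification; it is not Elo's historical procedure and not the EDO method): start everyone at the same rating and repeatedly nudge each rating until expected and observed scores agree across the whole pool.

Code:
# Toy illustration of simultaneous (iterative) rating of a closed pool.
# Not Elo's historical method and not EDO; just the general fixed-point idea.
# results: list of (player_a, player_b, score_for_a), e.g. ("Steinitz", "Zukertort", 1.0)

def expected(ra, rb):
    return 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))

def rate_pool(results, iterations=500, step=5.0, start=2000.0):
    players = {p for a, b, _ in results for p in (a, b)}
    ratings = {p: start for p in players}
    for _ in range(iterations):
        actual = {p: 0.0 for p in players}
        expect = {p: 0.0 for p in players}
        for a, b, score_a in results:
            actual[a] += score_a
            actual[b] += 1.0 - score_a
            expect[a] += expected(ratings[a], ratings[b])
            expect[b] += expected(ratings[b], ratings[a])
        # Nudge every rating toward the value that explains its observed score.
        for p in players:
            ratings[p] += step * (actual[p] - expect[p])
    return ratings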


Therefore, in summary, I am more convinced than ever that the future can only lie along the lines of calculating strength within a pool of games, as this would be the only way to accurately calculate and chart the strength of chess, and its progression, throughout history. I believe my tests are a small step into that future.

Performance results, such as ELO or EDO, will also always be relevant, as there is a difference between performance and strength. For example, De La Bourdonnais may have performed at a level of 2600 against his opponents, but the strength of play in the early 1800s was almost certainly lower than it is today. That is the difference between strength and performance.

In summary, similar to EDOChess, who rates his results with EDO, and because my strength tests are not performance based, I have changed all references to ELO to STR, which is short for STRENGTH or SPACIOUSMIND TESTS RATING, whichever you prefer.

My STR ratings formula under Renaissance was of course created to approximate ELO, in order to allow a fun comparison. The exact same calculations will be used for all future tests, and the test results will likely differ across the history of the games; however, the constant will always remain the same: the distance between the Master and its subjects, game by game and on average.
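
Purely to illustrate the shape of the idea (a simplified sketch only, not the actual Renaissance/STR spreadsheet formula; the anchor value and scale are placeholders):

Code:
# Simplified sketch of the idea only - not the actual Renaissance/STR formula.
# distances: one value per game, the average gap between the Master's choices
# and the subject's moves; master_str and scale are placeholder constants.

def str_rating(distances, master_str=2500.0, scale=400.0):
    avg_distance = sum(distances) / len(distances)
    return master_str - scale * avg_distance   # larger distance -> lower STR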

In my tests I have changed ELO Sneak to STR Sneak, which you will see in the future updates that I will provide for download.

I know this is a lot of reading, but I would love to hear from you, to see whether my assumptions about the future are sound, or whether you disagree or see a different future.

Best regards,
Nick

PS: I typed this in English, as the subject would be too hard for me to write in German.

Last edited by spacious_mind (01.06.2024 at 20:28)
The following 3 users thank spacious_mind for this useful post:
applechess (01.06.2024), Tibono (01.06.2024), Wandersleben (01.06.2024)
#2
03.06.2024, 10:12
CC 7
Mephisto London 68030
 
Registered since: 10.12.2004
Posts: 370
Thanks given: 0
Received 368 thanks for 165 posts
Re: Some thoughts about Strength Tests

Quote: Originally Posted by spacious_mind
Although a player might perform significantly better or worse from one game to the next, Elo assumed that the mean value of the performances of any given player changes only slowly over time.
The situation for chess computers is even better: they never change their strength and performance in the long run, which is optimal for Elo calculations.
I like the Elo system a lot - as far as I can judge, the ratings and rankings it produces are pretty fair and very useful for making predictions.

Quote: Originally Posted by spacious_mind
Everything written in the above summary is in my opinion correct, except for one key assumption which is nowadays completely outdated:

"A further assumption is necessary because chess performance in the above sense is still not measurable. One cannot look at a sequence of moves and derive a number to represent that player's skill."
To move or not to move - that is the question!
Why not - let's give it a try!
Thanks to your work (and Eric's) it is possible to see the results, make the comparison and come to a final judgement.

I do believe that it is also possible to use a large position test (with at least 200 positions and lots of chess computers) in order to get quite reliable Elo ratings - with Elo-Stat by Frank Schubert.
But that's another story.

Quote: Originally Posted by spacious_mind
Some day in the future (maybe not in our lifetime), I fully expect that you will be able to take the complete game database of, for example, Wilhelm Steinitz or Garry Kasparov, plug it into the master computer program and get the strength results, overall and year by year, for each and every player, and accurately compare the evolving improvement of chess knowledge throughout history for players and computer programs alike.
Elo and imagination - I do like your fantasy... "Nick Verne".
Might it one day be possible to run our Schachcomputer.info tournament list through it...?

Quote: Originally Posted by spacious_mind
Another is that the requirements of a rating system for updating current players' ratings on a day-to-day basis are different from those of a rating system for players in some historical epoch.
That's not clear to me.
You need a starting point for initializing "historical epoch Elo" -
and you needed a starting point in the modern Elo calculations as well ("initializing 1966-69").
Where is the difference at the start - the smaller database - what else?

Quote: Originally Posted by spacious_mind
While we can safely bet that Elo did a careful job of rating historical players, inevitably many choices have to be made in such an attempt, and other approaches could be taken. Indeed, Clarke
made an attempt at rating players in history before Elo. Several other approaches have recently appeared, including the Chessmetrics system by Jeff Sonas, the Glicko system(s) of Mark Glickman,
a version of which has been adopted by the US Chess Federation, and an unnamed rating method applied to results from 1836-1863 and published online on the Avler Chess Forum by Jeremy Spinrad.
By and large these others have applied sequential updating methods, though the new (2005) incarnation of the Chessmetrics system is an interesting exception (see below) and Spinrad achieved
a kind of simultaneous rating (at least more symmetric in time) by running an update algorithm alternately forwards and backwards in time.

There are pros and cons to all of these."[/I]
I have to confess that most of these systems I either don't know at all or know only very rudimentarily.
Many years ago there was an article in CSS about Jeff Sonas - his contribution was not convincing to me.
The Sonas rating formula - better than Elo? Not in my opinion.
AFAIK John Nunn criticized a change concerning the prediction of game results proposed by Jeff Sonas (Nunn on the K-factor: show me the proof, 2009).

Mark Glickman - AFAIK his Glicko systems (contrary to the Elo system) aren't totally stable - they can be flawed when data is manipulated intentionally - of course that's not the case here.
https://www.reddit.com/r/TheSilphRoa...jor_flaw_in_a/

Is there a Glicko program that can use PGN data to calculate ratings - or are only Excel sheets provided?

Quote: Originally Posted by spacious_mind
... as there is a difference between performance and strength. For example, De La Bourdonnais may have performed at a level of 2600 against his opponents, but the strength of play in the early 1800s was almost certainly lower than it is today. That is the difference between strength and performance.
Absolutely, there is a difference between performance and strength!
The Elo system only measures performance - nothing else.

So IMHO that is the drawback in your system of using the moves of a game - you are considering not the performance but only the strength of a player, which is not the aim of the Elo system.

To put it more drastically: the strength of a person or computer is totally irrelevant for calculating Elo - only performance matters!
Nothing else than 1-0, 1/2-1/2, 0-1, simple as that!

Quote: Originally Posted by spacious_mind
In summary, similar to EDOChess, who rates his results with EDO, and because my strength tests are not performance based, I have changed all references to ELO to STR, which is short for STRENGTH or SPACIOUSMIND TESTS RATING, whichever you prefer.
...In my tests I have changed ELO Sneak to STR Sneak, which you will see in the future updates that I will provide for download.
That's the consequence!
You shouldn't call it Elo... STR... or maybe StrElo.

Quote: Originally Posted by spacious_mind
My STR ratings formula under Renaissance was of course created to approximate ELO, in order to allow a fun comparison. The exact same calculations will be used for all future tests, and the test results will likely differ across the history of the games; however, the constant will always remain the same: the distance between the Master and its subjects, game by game and on average.
Keep going with your epoch tests!

The main thing, after all - it's fun!

Regards
Hans-Jürgen
The following 4 users thank CC 7 for this useful post:
kamoj (03.06.2024), spacious_mind (03.06.2024), Tibono (03.06.2024), Wandersleben (03.06.2024)
#3
03.06.2024, 16:35
spacious_mind
Living Forum Legend
 
Registered since: 29.06.2006
Location: Alabama, USA
Posts: 2,030
Thanks given: 167
Received 544 thanks for 287 posts
Re: Some thoughts about Strength Tests

Hi Hans-Jürgen

Thank you for your lengthy response. One observation, though: I wish there were a way to separate "Originally Posted by spacious_mind" so as to reference the origin more accurately. Most of what you highlighted are not my words but quoted documentation from other websites, in this case Wikipedia and EDOChess. I did this to highlight different schools of thought and to get reactions to them.

I am not smart enough to think up and write all of this myself, and I would hate for some future reader to think otherwise and take me for more than what I am, which is a lowly peasant!

Quote: Originally Posted by CC 7
Elo and imagination - I do like your fantasy... "Nick Verne".
Might it one day be possible to run our Schachcomputer.info tournament list through it...?
There is a science fiction writer called Vernor Vinge. I wonder if that is his real name or a play on words (Vernor, Verne). You know what they say: "What is science fiction today becomes reality tomorrow."

So why not? Why would you not be able to take the 50,000+ Schachcomputer.info games (300 programs) (180S and 30S) and run them through a future master computer? What would be more accurate in the future - performance or strength? Or are both needed? What if the future shows the statistics needed corrections? It has happened before.

An example... someone plays 100 games with two programs, and a year later plays another 100 games with the same two programs. Will the results be the same? Add another computer - what happens then? What if you add 50,000? The argument will always be "Ah yes, but it falls within the tolerance." Is that argument good enough? Maybe yes, maybe no.
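
Just to put a rough number on that tolerance (a back-of-the-envelope sketch that ignores draws and treats games as independent):

Code:
# Back-of-the-envelope tolerance of a match result, converted to an Elo range.
# Ignores draws and correlations between games; purely illustrative.
import math

def elo_margin(games, score_fraction=0.5, z=1.96):
    se = math.sqrt(score_fraction * (1.0 - score_fraction) / games)   # standard error
    low, high = score_fraction - z * se, score_fraction + z * se
    to_elo = lambda p: -400.0 * math.log10(1.0 / p - 1.0)
    return to_elo(low), to_elo(high)

print(elo_margin(100))     # a drawn 100-game match: roughly -70 to +70 Elo
print(elo_margin(50000))   # with 50,000 games the band shrinks to about +/- 3 Elo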

How do you compare hardware A to hardware B? Human to computer? How do you compare history with today - actual history, regarding actual knowledge and the actual progression of strength throughout history? It is very simple: you can't. Therefore you will look for different approaches, if that is something that interests you. If it does not, then reading this post will also be of little interest, I am sure.

I cannot answer some of your questions. As I stated, I only posted the quotes to show that there is a lot of interest in alternative ideas and methods - otherwise, why create the websites and methods? Maybe someone can prove or disprove the comments; I can't.

It goes back to the old question... is the world flat or is it round? What happened to the early "round" believers? They were burned at the stake or hanged as heretics, until finally reality set in.

A master program would not care who or what it is analyzing or who you are - whether you are having a bad day or a good day, whether you intimidate people or are shy. It just evaluates the games played, sends out the results and gives a number.

Quote: Originally Posted by CC 7
Absolutely, there is a difference between performance and strength!
The Elo system only measures performance - nothing else.

So IMHO that is the drawback in your system of using the moves of a game - you are considering not the performance but only the strength of a player, which is not the aim of the Elo system.

To put it more drastically: the strength of a person or computer is totally irrelevant for calculating Elo - only performance matters!
Nothing else than 1-0, 1/2-1/2, 0-1, simple as that!

That's the consequence!
You shouldn't call it Elo... STR... or maybe StrElo.
Correct! That is why I decided to change it to "STR"; there will be no future mention of ELO, as these are not "performance" calculations. Only "STR", strength!

BTW. thanks again for your great response and deep thoughts!

Best regards
Nick

Last edited by spacious_mind (03.06.2024 at 17:11)
The following 3 users thank spacious_mind for this useful post:
applechess (03.06.2024), kamoj (03.06.2024), Wandersleben (03.06.2024)