sho-das-96/PUBG-Player-Ranking-Prediction

PUBG Game Prediction

About Dataset:

In a PUBG game, up to 100 players start in each match (matchId). Players can be on teams (groupId) which get ranked at the end of the game (winPlacePerc) based on how many other teams are still alive when they are eliminated. In game, players can pick up different munitions, revive downed-but-not-out (knocked) teammates, drive vehicles, swim, run, shoot, and experience all of the consequences -- such as falling too far or running themselves over and eliminating themselves.

You are provided with a large number of anonymized PUBG game stats, formatted so that each row contains one player's post-game stats. The data comes from matches of all types: solos, duos, squads, and custom; there is no guarantee of there being 100 players per match, nor at most 4 players per group.

Data Description:

  • DBNOs - Number of enemy players knocked.
  • assists - Number of enemy players this player damaged that were killed by teammates.
  • boosts - Number of boost items used.
  • damageDealt - Total damage dealt. Note: Self inflicted damage is subtracted.
  • headshotKills - Number of enemy players killed with headshots.
  • heals - Number of healing items used.
  • Id - Player’s Id
  • killPlace - Ranking in match of number of enemy players killed.
  • killPoints - Kills-based external ranking of player. (Think of this as an Elo ranking where only kills matter.) If there is a value other than -1 in rankPoints, then any 0 in killPoints should be treated as a “None”.
  • killStreaks - Max number of enemy players killed in a short amount of time.
  • kills - Number of enemy players killed.
  • longestKill - Longest distance between player and player killed at time of death. This may be misleading, as downing a player and driving away may lead to a large longestKill stat.
  • matchDuration - Duration of match in seconds.
  • matchId - ID to identify match. There are no matches that are in both the training and testing set.
  • matchType - String identifying the game mode that the data comes from. The standard modes are “solo”, “duo”, “squad”, “solo-fpp”, “duo-fpp”, and “squad-fpp”; other modes are from events or custom matches.
  • rankPoints - Elo-like ranking of player. This ranking is inconsistent and is being deprecated in the API’s next version, so use with caution. Value of -1 takes place of “None”.
  • revives - Number of times this player revived teammates.
  • rideDistance - Total distance traveled in vehicles measured in meters.
  • roadKills - Number of kills while in a vehicle.
  • swimDistance - Total distance traveled by swimming measured in meters.
  • teamKills - Number of times this player killed a teammate.
  • vehicleDestroys - Number of vehicles destroyed.
  • walkDistance - Total distance traveled on foot measured in meters.
  • weaponsAcquired - Number of weapons picked up.
  • winPoints - Win-based external ranking of player. (Think of this as an Elo ranking where only winning matters.) If there is a value other than -1 in rankPoints, then any 0 in winPoints should be treated as a “None”.
  • groupId - ID to identify a group within a match. If the same group of players plays in different matches, they will have a different groupId each time.
  • numGroups - Number of groups we have data for in the match.
  • maxPlace - Worst placement we have data for in the match. This may not match with numGroups, as sometimes the data skips over placements.
  • winPlacePerc - The target of prediction. This is a percentile winning placement, where 1 corresponds to 1st place, and 0 corresponds to last place in the match. It is calculated off of maxPlace, not numGroups, so it is possible to have missing chunks in a match.
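The target can be illustrated concretely. The exact formula is not published in the dataset description, but a placement-to-percentile mapping consistent with "calculated off of maxPlace" could look like this sketch (the function name and edge-case handling are assumptions):

```python
# Hypothetical reconstruction of the winPlacePerc target:
# map a group's final placement (1 = winner, max_place = last)
# onto [0, 1], where 1.0 is 1st place and 0.0 is last place.
def win_place_perc(win_place: int, max_place: int) -> float:
    if max_place <= 1:          # degenerate match with a single group
        return 0.0
    return (max_place - win_place) / (max_place - 1)

print(win_place_perc(1, 96))    # winner     -> 1.0
print(win_place_perc(96, 96))   # last group -> 0.0
```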

Tool and Libraries Used:

  • Tool:
    • Python 3.11.7
  • Standard Libraries:
    • warnings
    • numpy (imported as np)
    • pandas (imported as pd)
  • Visualization Libraries:
    • matplotlib.pyplot (imported as plt)
    • seaborn (imported as sns)
  • Machine Learning Libraries:
    • sklearn.preprocessing (specifically StandardScaler)
    • sklearn.model_selection (specifically train_test_split)
    • catboost (imported as cb)
    • sklearn.metrics (specifically mean_squared_error and r2_score)

Table of Content

  1. Importing Libraries
  2. Reading the Data
  3. Data Wrangling
  4. Feature Engineering
  5. ML - CatBoost Model

Importing Libraries

## handling warnings

import warnings
warnings.filterwarnings("ignore")

##standard libraries

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

## visualisation

import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams["figure.figsize"] = (11,5)

import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

## !pip install catboost (for jupyter/colab)

import catboost as cb

from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

Reading the Data

## load the data

df = pd.read_csv("pubg_game_prediction.csv")

## glimpse of the data

df.head(2)

## data dimension

df.shape
## data information

df.info()

Data Wrangling

Check for the rows with missing win prediction value

## check row with NULL win prediction value

df[df['winPlacePerc'].isnull()]
## remove the data row - 2744604

df.drop(2744604, inplace = True)

Understanding Players distribution in a game

## prepare new parameter to know how many players are in a game

df['playersJoined'] = df.groupby('matchId')['matchId'].transform('count')
df.head(1)
## visualize matches where players joined >= 75

sns.countplot(data = df[df['playersJoined']>=75],x = 'playersJoined')
plt.show()

Observation:

Most matches in the data have at least 75 players, with the bulk of matches having 95-98 players.

Analysing the data

Kills Without Moving?

It is not possible to kill even one player without moving at least one unit. The following are techniques commonly used by cheaters (players who interfere with the game's normal mechanics):
  • Aimbots
  • Wallhacks
  • Triggerbots
  • ESP (Extra Sensory Perception)
  • Silent Aim
## prepare a data parameter to gather the information of the total distance travelled

df['totalDistance'] = df['rideDistance'] + df['walkDistance'] + df['swimDistance']

## prepare a data parameter to check for anamoly detection that
## the person has not moved but still managed to do the kills

df['killswithoutMoving'] = ((df['kills'] > 0) & (df['totalDistance'] == 0))
## check data for people who have killed without moving

df[df['killswithoutMoving'] == True].head(2)
## check total kills without moving data

df[df['killswithoutMoving'] == True].shape
Observation:

1535 instances record kills with zero distance travelled — these players either used hacks or got anomalously lucky. Such data cannot be generalised by the model, so we drop these instances.

## drop the instances

df.drop(df[df['killswithoutMoving'] == True].index , inplace = True)

Extra-ordinary Road Kills !

## check data for roadkills > 5

df[df['roadKills'] > 5].shape
Observation:

Killing more than five players from a vehicle in a single match would require exceptional skill, so these 46 instances are treated as anomalies and dropped from the data frame.

## drop the instance

df.drop(df[df['roadKills'] > 5].index, inplace = True)

So many KILLS - how ???

## visualize data for No. of players | Kills

sns.countplot(data = df, x = 'kills').set_title("Distribution of KILLS by a player")
plt.ylabel("Count of players")
plt.xlabel("Number of Kills")
plt.show()

Observation:

Most players record few kills, with the distribution tailing off by around 12 kills.

## visualize data for No. of players | Kills >= 15

sns.countplot(data = df[df['kills']>=15],x='kills').set_title("Distribution of KILLS by a player")
plt.ylabel("Count of players")
plt.xlabel("Number of Kills")
plt.show()

## kills > 20 cannot be generalized

df[df['kills'] > 20].shape
Observation:

Kills beyond 20 are rare and cannot be treated as a general case. Hence, dropping these instances.

## drop the instances

df.drop(df[df['kills'] > 20].index, inplace = True)

Head Shot

## calculate headshot rate

df['headshot_rate'] = df['headshotKills']/df['kills']

## fill with 0 where kills == 0 (0/0 yields NaN)

df['headshot_rate'] = df['headshot_rate'].fillna(0)
## plot the headshot rate distribution

sns.histplot(df['headshot_rate'], bins = 10, kde = True).set_title("Distribution of headshot rate")
plt.ylabel("Count of players")
plt.show()

## find headshot rate == 100% with kills > 5

df[(df['headshot_rate'] == 1) & (df['kills'] > 5)].shape
Observation

A player landing every kill as a headshot across more than 5 kills in a match is highly unlikely to be legitimate. 187 instances show this anomaly, so we will drop them.

## dropping the instances

df.drop(df[(df['headshot_rate'] == 1) & (df['kills'] > 5)].index, inplace = True)

Longest Shot

The maximum realistic sniping distance in PUBG is about 1 km (1000 meters), and even that is exceptional. Hackers often use one of the following to gain such an advantage and win a match:
  • Sniper Aimbots
  • Bullet Speed/Trajectory Hacks
  • No Recoil/No Spread
  • Zoom Hacks
## visualize Number of people | Longest Kills

sns.histplot(df['longestKill'], bins = 50).set_title("Histogram showing the Longest Kill Distribution")
plt.ylabel("Count of players")
plt.show()

## calculate instances with longestKill distance >= 500 meters

df[df['longestKill']>=500].shape
Observation:

1747 instances have a longest-kill distance of 500 meters or more; hence, we will drop these.

## dropping the instances

df.drop(df[df['longestKill']>=500].index, inplace = True)

Weapon Change

In general, players acquire up to 10 weapons in a match (the average being 5 to 6). Cheaters, however, sometimes use one of the following for unlimited recoil or weapons in a single match:
  • Macro Scripts
  • Rapid Fire Hacks
  • Input Spoofing
## visualize number of players | weapon change

sns.histplot(df['weaponsAcquired'], bins = 100).set_title("Weapons Distribution")
plt.show()

## calculate instances with weapons acquired >= 15

df[df['weaponsAcquired']>=15].shape
Observation:

In 6809 instances, players acquired 15 or more weapons in a match. This is not a general use case, so we will drop these values.

## drop instance

df.drop(df[df['weaponsAcquired']>=15].index, inplace = True)

Exploratory Data Analysis

## final shape

df.shape
## total number of null values

df.isna().sum()
## correlation of parameter with Win Prediction

plt.figure(figsize=[30,30])
sns.heatmap(df.corr(numeric_only = True), annot = True)
plt.show()

Feature Engineering

## calculate normalization factor
## (100 - playersJoined)/100 = 0 for matches with 100 players
## add 1 so a full match gets factor 1 (stats unchanged)

normalising_factor = (100 - df['playersJoined']) / 100 + 1
## create new attributes with normalization factor

df['killsNorm'] = df['kills'] * normalising_factor
df['damageDealtNorm'] = df['damageDealt'] * normalising_factor
df['maxPlaceNorm'] = df['maxPlace'] * normalising_factor
df['matchDurationNorm'] = df['matchDuration'] * normalising_factor
df['traveldistance'] = df['walkDistance']+ df['swimDistance'] + df['rideDistance']
df['healsnboosts'] = df['heals'] + df['boosts']
df['assist'] = df['assists'] + df['revives']
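A quick standalone sanity check of the normalisation factor (assuming the intended formula (100 - playersJoined)/100 + 1): a full 100-player match gets factor 1.0, leaving stats unchanged, while emptier matches scale stats up.

```python
# Normalisation factor for match-size adjustment:
# factor = (100 - players_joined) / 100 + 1
def norm_factor(players_joined: int) -> float:
    return (100 - players_joined) / 100 + 1

print(norm_factor(100))  # 1.0 -> stats unchanged in a full match
print(norm_factor(90))   # 1.1 -> stats scaled up by 10%
print(norm_factor(50))   # 1.5
```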
## analyze columns

df.columns

Removing unwanted columns

## keep the cleaned data intact; build a working dataset with only the engineered features

data = df.drop(columns = ['Id', 'groupId', 'matchId', 'assists', 'boosts', 'walkDistance', 'swimDistance',
                          'rideDistance', 'heals', 'revives', 'kills', 'damageDealt', 'maxPlace', 'matchDuration'])
## check data dataframe

data.head(2)

🔝

ML - Catboost Model

Handling categorical data

x = data.drop(['winPlacePerc'], axis = 1)
y = data['winPlacePerc']

One-hot Encoding

x = pd.get_dummies(x, columns = ['matchType', 'killswithoutMoving'])
x = x.applymap(lambda v: int(v) if isinstance(v, bool) else v)
x.head()
features = x.columns

Scaling the data

## prevent model from giving undue preference
## to instances with higher values

sc = StandardScaler()
sc.fit(x)
x = pd.DataFrame(sc.transform(x), columns = features)
x.head(2)

Splitting data

## train and test within the single file

xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size = 0.3, random_state = 0)
print(xtrain.shape, ytrain.shape)
print(xtest.shape, ytest.shape)
Check:

Training samples: 3105414
Testing samples: 1330892

CatBoost Model

train_dataset = cb.Pool(xtrain, ytrain)
test_dataset = cb.Pool(xtest, ytest)
model = cb.CatBoostRegressor(loss_function='RMSE')
## GRID search
## run model one by one on all combinations
## return the best parameter combination

grid = {'iterations': [100, 150], 
       'learning_rate': [0.03, 0.1], 
       'depth': [2, 4, 6, 8]} ## runs 16 combinations here

model.grid_search(grid, train_dataset)
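The grid above covers 2 × 2 × 4 = 16 parameter combinations; the enumeration CatBoost walks through can be sketched independently with itertools:

```python
from itertools import product

grid = {'iterations': [100, 150],
        'learning_rate': [0.03, 0.1],
        'depth': [2, 4, 6, 8]}

# Cartesian product over the value lists -> one dict per combination
combos = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(combos))  # 16
print(combos[0])    # {'iterations': 100, 'learning_rate': 0.03, 'depth': 2}
```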
Observations:

The grid search runs K-fold cross-validation for each parameter combination and returns the best one.

Best Parameters:

  • 'depth': 8
  • 'learning_rate': 0.1
  • 'iterations': 150
feature_importance_df = pd.DataFrame()
feature_importance_df['features'] = features
feature_importance_df['importance'] = model.feature_importances_

feature_importance_df = feature_importance_df.sort_values(by = ['importance'], ascending=False)
feature_importance_df
plt.figure(figsize=(10, 6))  # Adjust the figure size if needed

# Set the background color of the graph
plt.gca().set_facecolor('green')

# Plot the bar chart with specified colors
bars = plt.bar(feature_importance_df.features, feature_importance_df.importance, color='yellow', edgecolor='white')

# Set the labels and their colors
plt.ylabel("CatBoost Feature Importance", color='black')
plt.xticks(rotation=90, color='black')
plt.yticks(color='black')

# Display the plot
plt.show()
Observation:

The model could be retrained after dropping the following low-importance parameters:

  • matchType_normal-squad
  • vehicleDestroys
  • headshot_rate
  • matchType_normal-solo
  • matchType_normal-solo-fpp
  • matchType_crashtpp
  • matchType_normal-duo-fpp
  • matchType_normal-duo
  • matchType_flarefpp
  • headshotKills
  • killswithoutMoving_False
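
Dropping these low-importance columns before a retrain could be sketched as follows (x stands in for the one-hot-encoded feature frame; the toy frame here is illustrative, and errors='ignore' guards against columns that are absent):

```python
import pandas as pd

# Toy stand-in for the encoded feature frame
x = pd.DataFrame({
    'killsNorm':       [1.2, 0.0],
    'vehicleDestroys': [0, 0],
    'headshotKills':   [1, 0],
})

low_importance = ['vehicleDestroys', 'headshotKills', 'headshot_rate']
x_reduced = x.drop(columns=low_importance, errors='ignore')
print(list(x_reduced.columns))  # ['killsNorm']
```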

Prediction

pred = model.predict(xtest)
## evaluate model

rmse = np.sqrt(mean_squared_error(ytest, pred)) ## average prediction error on the 0-1 target
r2 = r2_score(ytest, pred) ## closer to 1 is better (ranging from 0 to 1)

print("Testing performance")

print("RMSE: {:.2f}".format(rmse))
print("R2: {:.2f}".format(r2))
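Since winPlacePerc lives in [0, 1], an RMSE of 0.08 corresponds to being off by roughly 8 percentile points on average. A toy check of the metric itself (synthetic numbers, not the notebook's predictions):

```python
import numpy as np

y_true = np.array([0.90, 0.50, 0.10])
y_pred = np.array([0.85, 0.60, 0.05])

# RMSE: square the errors, average, take the root
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(round(rmse, 4))  # 0.0707
```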
Observation:

An RMSE of 0.08 (roughly 8 percentile points of error) with an R² value close to 1 means the model is accurate without overfitting.

Hence,

I hope you found this analysis of PUBG game ranking prediction using the CatBoost model both comprehensive and insightful! With an RMSE of 0.08 and an R² score close to 1, the model demonstrates high accuracy in predicting player rankings.

Your feedback is invaluable, please share your thoughts if you enjoyed it.

Check out more such projects here! 😄😅

🔝

About

PUBG player data (4.5 million+ rows) processed using Pandas and NumPy in Python for preprocessing, and CatBoost for match predictions. Achieved RMSE 0.08 and R² close to 1.
