
Strange Behaviour. A Single Negative Value Generated In Pandas Dataframe, When New Column Is Created

Following is the head of the Pandas dataframe that I am working on. test.head() Country_or_Other TotalCases TotalTests 9 USA 2026493 21725064 10 Br

Solution 1:

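Here pandas converts the column to int32 instead of int64, so the result of multiplying by 100 is wrong. This happens because astype(int) maps to the platform's default integer type, which is 32-bit on some setups (for example on Windows with NumPy versions before 2.0):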

df['TotalTests'] = df['TotalTests'].fillna(0).astype(int)

The column is then int32, and multiplying it by 100 overflows and produces a negative value:

test = cum_data.loc[:,['Country_or_Other', 'TotalCases','TotalTests']]

print (test.dtypes)
Country_or_Other    object
TotalCases           int64
TotalTests           int32
dtype: object


test['TotalCases_Percent'] = 100*test['TotalCases']/test['TotalCases'].sum()
test['TotalTests_Percent'] = 100*test['TotalTests']
df = test[test['Country_or_Other'] == 'USA']

print (df)
  Country_or_Other  TotalCases  TotalTests  TotalCases_Percent  \
9              USA     2026493    21725064            28.18313   

   TotalTests_Percent  
9         -2122460896  
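
The overflow can be reproduced in isolation. The following is a minimal sketch, assuming the column really is int32; the dtype is forced explicitly here so the result does not depend on the platform default, and the variable names are illustrative rather than taken from the original code.

```python
import pandas as pd

# Force int32 explicitly; the value is the USA TotalTests figure from the output above.
tests = pd.Series([21725064], dtype="int32")

# Multiplying by a plain Python int keeps the int32 dtype, so the product wraps around.
scaled = 100 * tests
print(scaled.dtype)      # int32
print(scaled.iloc[0])    # -2122460896

# The same multiplication on an int64 copy has enough headroom.
scaled64 = 100 * tests.astype("int64")
print(scaled64.iloc[0])  # 2172506400
```

No warning or error is raised for the int32 case, which is why the negative value shows up silently in the new column.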

The solution is to convert the columns to np.int64:

import numpy as np

df['TotalDeaths'] = df['TotalDeaths'].fillna(0).astype(np.int64)
df['TotalRecovered'] = df['TotalRecovered'].fillna(0).astype(np.int64)
df['TotalTests'] = df['TotalTests'].fillna(0).astype(np.int64)


test = cum_data.loc[:,['Country_or_Other', 'TotalCases','TotalTests']]

print (test.dtypes)
Country_or_Other    object
TotalCases           int64
TotalTests           int64
dtype: object

test['TotalCases_Percent'] = 100*test['TotalCases']/test['TotalCases'].sum()
test['TotalTests_Percent'] = 100*test['TotalTests']/test['TotalTests'].sum()
df = test[test['Country_or_Other'] == 'USA']

print (df)
  Country_or_Other  TotalCases  TotalTests  TotalCases_Percent  \
9              USA     2026493    21725064            28.18313   

   TotalTests_Percent  
9           21.641801  
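
As an end-to-end illustration of the fix, here is a small sketch on a made-up frame. Only the USA figure comes from the output above; the other rows and their labels are placeholders, and the missing value is there only to show why fillna(0) precedes the cast.

```python
import pandas as pd

# Placeholder data: only the USA value is taken from the post.
cum_data = pd.DataFrame({
    "Country_or_Other": ["USA", "A", "B"],
    "TotalTests": [21725064.0, None, 13000000.0],
})

# Name the width explicitly instead of relying on what `int` means on this platform.
cum_data["TotalTests"] = cum_data["TotalTests"].fillna(0).astype("int64")
print(cum_data["TotalTests"].dtype)  # int64

# 100 * TotalTests is now computed in int64 and cannot wrap at these magnitudes.
cum_data["TotalTests_Percent"] = (
    100 * cum_data["TotalTests"] / cum_data["TotalTests"].sum()
)
print(cum_data)
```

Writing "int64" (or np.int64) rather than int makes the code behave the same way regardless of the platform's default integer size.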

Solution 2:

Solution 1:

Change TotalTests column dtype to int64.

Use:

df['TotalTests'] = df['TotalTests'].fillna(0).astype('int64')   ## int64

instead of:

df['TotalTests'] = df['TotalTests'].fillna(0).astype(int)       ## int32

Why?

Mathematically, multiplying the TotalTests value for USA by 100 should give 2 172 506 400, which is larger than the int32 maximum of 2 147 483 647. The result therefore overflows and wraps around to -2 122 460 896. Changing the dtype to int64 raises the maximum to roughly 9.2 × 10^18.
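
The wraparound arithmetic can be checked by hand. This sketch uses plain Python integers, which do not overflow, and a hypothetical helper wrap_int32 that mimics what a signed 32-bit integer would store.

```python
INT32_MIN = -2**31       # -2147483648
INT32_MAX = 2**31 - 1    #  2147483647


def wrap_int32(value: int) -> int:
    """Return the value a signed 32-bit integer would hold after wraparound."""
    return (value - INT32_MIN) % 2**32 + INT32_MIN


true_product = 21725064 * 100
print(true_product)              # 2172506400
print(true_product > INT32_MAX)  # True
print(wrap_int32(true_product))  # -2122460896
```

The wrapped value equals 2 172 506 400 minus 2^32, which matches the negative number printed in the dataframe.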

Generally, prefer int64 for any column whose values, or intermediate results computed from them, may approach the int32 maximum.
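
One way to apply this rule is to test the headroom before scaling. The helper below, fits_after_scaling, is a hypothetical name and only a sketch: it assumes an integer column without missing values and compares the scaled extremes against the limits reported by np.iinfo.

```python
import numpy as np
import pandas as pd


def fits_after_scaling(column: pd.Series, factor: int) -> bool:
    """Return True if every value times `factor` fits in the column's integer dtype."""
    info = np.iinfo(column.dtype)
    # Convert to Python ints first so the check itself cannot overflow.
    low = int(column.min()) * factor
    high = int(column.max()) * factor
    return info.min <= min(low, high) and max(low, high) <= info.max


tests32 = pd.Series([21725064], dtype="int32")
print(fits_after_scaling(tests32, 100))                  # False
print(fits_after_scaling(tests32.astype("int64"), 100))  # True
```

If the check fails, casting the column to int64 before the multiplication avoids the wraparound.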

Solution 2 (naive):

Divide first and multiply by 100 afterwards. The division returns floating-point values, so no intermediate result exceeds the int32 maximum:

test['TotalTests_Percent'] = (test['TotalTests']/test['TotalTests'].sum())*100
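
A short sketch of why the reordering works, again on an explicitly int32 column. Only the first value comes from the post; the other two are placeholders chosen so that the sum itself stays within the int32 range.

```python
import pandas as pd

# Placeholder data: only the first value is taken from the post.
tests = pd.Series([21725064, 1000000, 2000000], dtype="int32")

share = tests / tests.sum()  # true division produces float64 fractions
percent = share * 100        # scaling a float cannot wrap like int32 does
print(share.dtype)           # float64
print(percent)
```

This avoids the overflow without changing the column dtype, but the column remains int32, so any later integer arithmetic on it can still wrap.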
