Open 2 excel files and create a third file with values that are averaged in python

python

#1

File 1:

ID TIME A B C
1 23:00:00 4 5 10
2 2:00:00 1 4 6
3 3:00:00 3 2 7

File 2:

ID TIME A B C
1 23:00:00 6 7 10
2 2:00:00 1 6 6
3 3:00:00 3 4 7

Expected output file:
ID TIME A B C
1 23:00:00 5 6 10
2 2:00:00 1 5 6
3 3:00:00 3 3 7

here, values of column A,B and C are averaged line by line
Also i need to use groupby (ID, TIME ) to not average on these columns


#2

Hi,

First of all, you have to merge file1 and file2. Use the following code to merge both the files:
temp = pd.merge(file1, file2, on=[β€˜ID’, β€˜TIME’], how=β€˜left’)

Now, you have to find the average of A,B and C columns, follow this to do that:
temp[β€˜A’]=(temp[β€˜A_x’] + temp[β€˜A_y’]) / 2
temp[β€˜B’]=(temp[β€˜B_x’] + temp[β€˜B_y’]) / 2
temp[β€˜C’]=(temp[β€˜C_x’] + temp[β€˜C_y’]) / 2

Now drop the extra columns:
temp.drop(temp.columns[2:7], axis=1, inplace=True)
temp.drop(β€˜C_y’, 1)

You will get the required table of averaged values after successfully running this code.


#3

im sorry , this does not work


#4

Hi,

Can you tell me what is the error you are facing?

I am sharing my approach, which will help you to understand the solution better.

First of all i created 2 excel worksheets from the data which you have given and saved them in .csv format.

Screenshot%20from%202018-03-27%2013-11-52

Then i read both these files in python using pandas and got the output using the following code:

Hope this helps.


#5


Attached is the error msg.
SyntaxError: invalid character in identifier

the original data looks like:

         ID                 TIME                   SMS      CALL   INTERNET
0           1  2013-12-02 23:00:00  0.764926  0.106876   8.718231
1           1  2013-12-02 23:10:00  0.328401  0.109826   8.948879
2           1  2013-12-02 23:20:00  0.437688  0.027925   8.509328
3           1  2013-12-02 23:30:00  0.403777  0.029088   8.748348
4           1  2013-12-02 23:40:00  0.108663  0.055225   8.290081
5           1  2013-12-02 23:50:00  0.354992  0.041263   7.901937
6           1  2013-12-03 00:00:00  0.360892  0.027300   8.030421
7           1  2013-12-03 00:10:00  0.246953  0.013650   8.287841
8           1  2013-12-03 00:20:00  0.134262  0.006825   7.598228
9           1  2013-12-03 00:30:00  0.216163  0.003413   8.115358

10 1 2013-12-03 00:40:00 0.133637 0.001706 9.318157

Values in ID and Time are same but values in SMS CALL and INTERNET change in file2


#6

Looks like you copy pasted the code from this forum so the character you are seeing in the python nodebook is not single quote try deleting quote character and add it manually


#7

Hi,

Try this: pd.concat([data1,data2]).groupby([β€˜ID’,β€˜TIME’])[β€˜A’,β€˜B’,β€˜C’].mean()


#8

Thank you for all the comments

this works perfect:

dflist = [file1,file2]

## empty data frame to use for appending
df=pd.DataFrame()

#looping through each item in list and appending to empty data frame
for i in dflist:
    df = df.append(i)

# group by and calculating mean on index
data_mean=df.groupby(level=0).mean()

#9

I dont know why your code is still giving me the values in the first file itself. and not the mean.