R code help: Removing duplicates based on one column & condition in other column

r
datamanipulation

#1

Hi,
I need help in R. As shown in the image above, I want to remove duplicates based on session_id, but if another column, “status”, shows 3, I want to keep all of those rows.

In short: remove duplicates based on session_id, but keep all rows with status 3.

I hope I have explained what I am looking for.

Cheers!
Parind


#2

It would help if you shared a reproducible example.
I think this might work, but I have not tested it:

df2[!(duplicated(df2$session_id) & df2$status != 3), ]
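
To see what the one-liner above does, it can help to split it into its two conditions. The sketch below uses a small made-up data frame; only the column names session_id and status come from the thread, the values are assumptions for illustration.

```r
# Toy data (hypothetical values; only the column names come from the thread)
df2 <- data.frame(
  session_id = c("a", "a", "b", "b", "c"),
  status     = c(1, 1, 2, 3, 3)
)

# duplicated() flags the second and later occurrences of each session_id;
# the first occurrence is never flagged
is_dup <- duplicated(df2$session_id)

# A row is dropped only when it is a repeat AND its status is not 3
drop_row <- is_dup & df2$status != 3

result <- df2[!drop_row, ]
print(result)
```

Here only the second "a" row is dropped. The second "b" row is a repeat, but it survives because its status is 3.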


#3

Hi @ParindDhillon

Just to clarify:

  1. If status == 3, keep all observations.

  2. If status != 3 and the row is not covered by 1. (the line before), keep one observation per session_id.

The problem is in case 2. For example, for the second and third rows of your table, which one do you want to keep? Does it not matter which one, or do you want the one with the earliest current_time_mysql (for example)?

If you can clarify, I think many people could help you.
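
If the answer turns out to be "keep the earliest row", one possible sketch is to sort by time first and then filter. This assumes current_time_mysql is stored as a date-time (or at least sorts correctly); the values below are made up, and this is only one reading of the two rules above.

```r
# Toy data (hypothetical values); the time column is assumed to be POSIXct
df2 <- data.frame(
  session_id = c("a", "a", "b", "b"),
  status     = c(1, 1, 2, 3),
  current_time_mysql = as.POSIXct(
    c("2020-01-01 10:05:00", "2020-01-01 10:00:00",
      "2020-01-01 11:00:00", "2020-01-01 11:30:00"),
    tz = "UTC"
  )
)

# Sort so that the earliest row of each session_id comes first
ordered <- df2[order(df2$current_time_mysql), ]

# Keep every status 3 row, plus the first (earliest) row of each session_id
kept <- ordered[ordered$status == 3 | !duplicated(ordered$session_id), ]
print(kept)
```

Because duplicated() never flags the first occurrence, sorting beforehand decides which row counts as "the" row for each session_id.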

Good luck

Alain


#4

Hi Sonny,
It works…


#5

sample.csv (575.9 KB)
I tried your solution, @sonny: df2[!(duplicated(df2$session_id) & df2$status != 3), ]
It was not able to remove all the duplicates, which I managed to do using dplyr:
samp2 <- sample %>%
  group_by(session_id) %>%
  arrange(status) %>%
  slice(1)
Doing it with dplyr gives me 713 fewer rows, which is accurate.
Can you please help me figure out why it is not working? I am curious to know the reason, because the approach seems right to me.
I have uploaded a sample file for reproduction.

Cheers!
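
A likely reason for the row-count difference, not verified against the uploaded sample.csv, is that the two snippets implement different rules. The base R one-liner keeps repeated rows whenever their status is 3, while the dplyr pipeline keeps exactly one row per session_id regardless of status. The sketch below shows this on made-up data; sample_df is a hypothetical name used here to avoid masking base::sample().

```r
library(dplyr)

# Toy data (hypothetical); session "b" has two rows, both with status 3
sample_df <- data.frame(
  session_id = c("a", "a", "b", "b", "c"),
  status     = c(1, 1, 3, 3, 2)
)

# Base R one-liner: drops a repeat only when its status is not 3
base_res <- sample_df[!(duplicated(sample_df$session_id) & sample_df$status != 3), ]

# dplyr pipeline: always keeps one row per session_id (the lowest status)
dplyr_res <- sample_df %>%
  group_by(session_id) %>%
  arrange(status) %>%
  slice(1) %>%
  ungroup()

nrow(base_res)   # 4: both "b" rows survive
nrow(dplyr_res)  # 3: one row per session_id
```

If the goal really is to keep all status 3 rows, the extra rows from the base R version are expected, and the dplyr result would be the one that removes too much.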