Need help in solving Data Analysis problem



I try to solve this using sqldf package from R but after 1st question i couldn’t figure it out how to solve next questions
I dont how to extract and separate data from columns ( events and properties, timestamp) …All suggestions are welcome


  • eventdump - contains event triggered within the app for customers

  • Info:

  • Customer information contains install-uninstall-reinstall cycles in vertical format
    [so for a customer 1 lifetime cycle is a single install-uninstall event set or a reinstall-uninstall event set]
    [so if a customer install-uninstalls-reinstalls-uninstalls then he has 2 life cycles within the app which need to be treated separately]

  • Event logs contain the events triggered within the app by these customers

  • Generate the following from the data provided [Analysis questions below are along the lines of CLTV(Customer Lifetime Value) analysis]

1- Customer retention trends from their lifetime cycles [frequency chart or histogram plot]
[retention is defined as the duration of one install-uninstall cycle, so multiple re-installs have to be treated separately]

2- Find out the time of day when the customers are most active [use your own discretion for time of day bucketing] [activity is defined on the basis of events]

3 - Purchase value buckets [find purchase/checkout events from event logs and parse the ‘properties’ column to get total value associated and generate a simple bucketed frequency chart/histogram plot]

4- Behavior of purchasing and non-purchasing customers [something along the lines of their in-app event frequency in a given install-uninstall cycle]

5- Week over Week revenue trends for purchasing customers

6- How are their purchases distributed post install? [the number and value of purchases after installing the app in one retention cycle]

7- Do they perform purchases in the 2nd,3rd etc weeks post install? [if their retention cycle is greater than 1 week]

8- Is there a steady inflow of revenue for customers with high retention? [growth can decline but is it still a positive gradient?]

9- Any other actionable insights that can be drawn from the given data?

Data Sets :—




@deva123 you can follow the code given below to extract timestamp and email from the ‘properties’ column in the events file. I have used regular expressions and stringr package.

# extract timestamp from the properties column
events$timestamp = str_match(events$properties, "timeStamp\": \"(.*?)\"")[,2] 

# extract email ID's from the properties column
events$email = unlist(regmatches(events$properties, gregexpr("([_a-z0-9-]+(\\.[_a-z0-9-]+)*@[a-z0-9-]+(\\.[a-z0-9-]+)*(\\.[a-z]{2,4}))", events$properties)))

You can further convert the timestamp into date and time.


I’m little bit late to respond …Thank you so much for the reply…
Do you know how to approach such kind of problem…? . I’m New to such problem
could you tell any reference blog or site where I can get such kind of problem for practice…?
any tool and techniques are welcome …


2nd extract email showing error
Error in $<*tmp*, email, value = c("", :
replacement has 127901 rows, data has 995304
My actual events file is 180 MB but posted small portion of that
How to fix this …?


@deva123 that code should work irrespective of the no. of rows in the dataframe. Could you share your code once?


cycles <- read.csv(“uicycles.csv”)
–# my 180mb orignal file size name eventdump
event <-read.csv(“eventdump.csv”)

loading SQL libraries
–# install.packages(‘sqldf’)
–# View(cycles)
–# View(event)
sql <- sqldf(“SELECT uuid,COUNT (uuid) as occurances,event_type
FROM cycles
GROUP BY uuid”)
–# View(sql)

sql1 <- sqldf("SELECT *,
WHEN event_type = ‘install’ THEN ‘Retain’
WHEN event_type = ‘reinstall’ THEN ‘Retain’
WHEN event_type = ‘uninstall’ THEN ‘Not_Retain’
END AS in_unin
FROM sql

–# View(sql1)
customer <- sqldf(“SELECT COUNT(in_unin) as retain ,in_unin
FROM sql1
GROUP BY in_unin”)

barplot(customer$retain,names.arg = customer$in_unin ,xlab = " status ",ylab = “No of customer”,col = “red”,
border = “green”,main = “Retention chart”)

–# left join events and uuid

left <- sqldf(“SELECT *
FROM event e
LEFT JOIN sql b ON e.uuid = b.uuid”)


–# Extract timestamp from properties

event$timestamp = str_match(event$properties, “timeStamp”: “(.*?)”")[,2]

–# extract email ID’s from the propeties column
–# insted of events i use event in my code.
event$email = unlist(regmatches(event$properties,gregexpr("([_a-z0-9-]+(\.[_a-z0-9-]+)@[a-z0-9-]+(\.[a-z0-9-]+)(\.[a-z]{2,4}))", event$properties)))