Need help in solving Data Analysis problem

r
sql

#1

I try to solve this using sqldf package from R but after 1st question i couldn’t figure it out how to solve next questions
I dont how to extract and separate data from columns ( events and properties, timestamp) …All suggestions are welcome

Problem

  • eventdump - contains event triggered within the app for customers

  • Info:

  • Customer information contains install-uninstall-reinstall cycles in vertical format
    [so for a customer 1 lifetime cycle is a single install-uninstall event set or a reinstall-uninstall event set]
    [so if a customer install-uninstalls-reinstalls-uninstalls then he has 2 life cycles within the app which need to be treated separately]

  • Event logs contain the events triggered within the app by these customers

  • Generate the following from the data provided [Analysis questions below are along the lines of CLTV(Customer Lifetime Value) analysis]

1- Customer retention trends from their lifetime cycles [frequency chart or histogram plot]
[retention is defined as the duration of one install-uninstall cycle, so multiple re-installs have to be treated separately]

2- Find out the time of day when the customers are most active [use your own discretion for time of day bucketing] [activity is defined on the basis of events]

3 - Purchase value buckets [find purchase/checkout events from event logs and parse the ‘properties’ column to get total value associated and generate a simple bucketed frequency chart/histogram plot]

4- Behavior of purchasing and non-purchasing customers [something along the lines of their in-app event frequency in a given install-uninstall cycle]

5- Week over Week revenue trends for purchasing customers

6- How are their purchases distributed post install? [the number and value of purchases after installing the app in one retention cycle]

7- Do they perform purchases in the 2nd,3rd etc weeks post install? [if their retention cycle is greater than 1 week]

8- Is there a steady inflow of revenue for customers with high retention? [growth can decline but is it still a positive gradient?]

9- Any other actionable insights that can be drawn from the given data?

Data Sets :—

cycles


events


#2

@deva123 you can follow the code given below to extract timestamp and email from the ‘properties’ column in the events file. I have used regular expressions and stringr package.

# extract timestamp from the properties column
events$timestamp = str_match(events$properties, "timeStamp\": \"(.*?)\"")[,2] 

# extract email ID's from the properties column
events$email = unlist(regmatches(events$properties, gregexpr("([_a-z0-9-]+(\\.[_a-z0-9-]+)*@[a-z0-9-]+(\\.[a-z0-9-]+)*(\\.[a-z]{2,4}))", events$properties)))

You can further convert the timestamp into date and time.


#3

I’m little bit late to respond …Thank you so much for the reply…
Do you know how to approach such kind of problem…? . I’m New to such problem
could you tell any reference blog or site where I can get such kind of problem for practice…?
any tool and techniques are welcome …


#4

2nd extract email showing error
Error in $<-.data.frame(*tmp*, email, value = c("careers24@gmail.com", :
replacement has 127901 rows, data has 995304
My actual events file is 180 MB but posted small portion of that
How to fix this …?


#5

@deva123 that code should work irrespective of the no. of rows in the dataframe. Could you share your code once?


#6

cycles <- read.csv(“uicycles.csv”)
–# my 180mb orignal file size name eventdump
event <-read.csv(“eventdump.csv”)

loading SQL libraries
–# install.packages(‘sqldf’)
library(sqldf)
–# View(cycles)
–# View(event)
sql <- sqldf(“SELECT uuid,COUNT (uuid) as occurances,event_type
FROM cycles
GROUP BY uuid”)
–# View(sql)

sql1 <- sqldf("SELECT *,
CASE
WHEN event_type = ‘install’ THEN ‘Retain’
WHEN event_type = ‘reinstall’ THEN ‘Retain’
WHEN event_type = ‘uninstall’ THEN ‘Not_Retain’
END AS in_unin
FROM sql
")

–# View(sql1)
customer <- sqldf(“SELECT COUNT(in_unin) as retain ,in_unin
FROM sql1
GROUP BY in_unin”)
View(customer)

barplot(customer$retain,names.arg = customer$in_unin ,xlab = " status ",ylab = “No of customer”,col = “red”,
border = “green”,main = “Retention chart”)

–# left join events and uuid

left <- sqldf(“SELECT *
FROM event e
LEFT JOIN sql b ON e.uuid = b.uuid”)

View(left)

–# Extract timestamp from properties

library(stringr)
event$timestamp = str_match(event$properties, “timeStamp”: “(.*?)”")[,2]
event$timestamp

–# extract email ID’s from the propeties column
–# insted of events i use event in my code.
event$email = unlist(regmatches(event$properties,gregexpr("([_a-z0-9-]+(\.[_a-z0-9-]+)@[a-z0-9-]+(\.[a-z0-9-]+)(\.[a-z]{2,4}))", event$properties)))
event$email