How to extract values from a large character object with different separators between values in R?

r

#1

Hello,

I have a character object that contains data like this:

[1] "[[-8.618643,41.141412],[-8.618499,41.141376],[-8.620326,41.14251],[-8.622153,41.143815],[-8.623953,41.144373],[-8.62668,41.144778],[-8.627373,41.144697],[-8.630226,41.14521],[-8.632746,41.14692],[-8.631738,41.148225],[-8.629938,41.150385],[-8.62911,41.151213],[-8.629128,41.15124],[-8.628786,41.152203],[-8.628687,41.152374],[-8.628759,41.152518],[-8.630838,41.15268],[-8.632323,41.153022],[-8.631144,41.154489],[-8.630829,41.154507],[-8.630829,41.154516],[-8.630829,41.154498],[-8.630838,41.154489]]"

These are latitudes and longitudes. I want to extract the first and the last pair from it.
How can I do this? I tried using the strsplit function but could not get it to work.

Thanks


#2

Try:

lat_lon <- function(x) {
  s <- strsplit(x, '\\](,)\\[')
  res <- s[[1]][c(1, length(s[[1]]))]
  vals <- gsub('\\[|\\]', '', res)
  vals
}

lat_lon(teststring)
[1] "-8.618643,41.141412" "-8.630838,41.154489"

#3

This is really good, but I have a very large dataset (1710670 entries). If I run a for loop over each of them, it would take hours. Is there a way to do this without a for loop?


#4

How is it actually represented in the table? Is the whole character object a single value in one observation, or does it represent multiple observations?

An ugly, arguably insufficient piece of code typed up while trying to keep awake during an office meeting.

> latlong = as.vector("[[-8.618643,41.141412],[-8.618499,41.141376],[-8.620326,41.14251],[-8.622153,41.143815],[-8.623953,41.144373],[-8.62668,41.144778],[-8.627373,41.144697],[-8.630226,41.14521],[-8.632746,41.14692],[-8.631738,41.148225],[-8.629938,41.150385],[-8.62911,41.151213],[-8.629128,41.15124],[-8.628786,41.152203],[-8.628687,41.152374],[-8.628759,41.152518],[-8.630838,41.15268],[-8.632323,41.153022],[-8.631144,41.154489],[-8.630829,41.154507],[-8.630829,41.154516],[-8.630829,41.154498],[-8.630838,41.154489]]")
> latlong = strsplit(latlong, "\\],\\[")
> substr(head(latlong[[1]], n = 1), start = 3, stop = 25)
[1] "-8.618643,41.141412"
> substr(tail(latlong[[1]], n = 1), start = 1, stop = 19)
[1] "-8.630838,41.154489"
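
The fixed start/stop positions only work when every coordinate has the same number of digits, and the sample already contains shorter values such as 41.14251. A minimal sketch that strips the brackets instead of counting characters; the shortened sample string is made up for illustration:

```r
# Shortened sample; the last pair deliberately has fewer digits.
latlong <- "[[-8.618643,41.141412],[-8.618499,41.141376],[-8.62668,41.1447]]"

# Split into one piece per coordinate pair.
pairs <- strsplit(latlong, "],[", fixed = TRUE)[[1]]

# Keep the first and last piece, then drop any remaining square brackets.
first_last <- gsub("\\[|\\]", "", c(head(pairs, 1), tail(pairs, 1)))
first_last
```

This should give the same two strings as the substr calls whenever the widths happen to match, without depending on them.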

#5

You won’t necessarily need a for loop. ‘A very large dataset’ can mean a lot of things. Is it a data.frame? A matrix? A list? A vector? A data.table? What happens when you run str() on your data? If your data is named ourdata, enter str(ourdata) and post the output as a reply.


#6

The last column holds the data as a list in JSON format.

str(train)
'data.frame': 1710670 obs. of 9 variables:
 $ TRIP_ID     : num 1.37e+18 1.37e+18 1.37e+18 1.37e+18 1.37e+18 ...
 $ CALL_TYPE   : chr "C" "B" "C" "C" ...
 $ ORIGIN_CALL : int NA NA NA NA NA NA NA NA NA NA ...
 $ ORIGIN_STAND: int NA 7 NA NA NA NA NA NA NA NA ...
 $ TAXI_ID     : int 20000589 20000596 20000320 20000520 20000337 20000231 20000456 20000011 20000403 20000320 ...
 $ TIMESTAMP   : int 1372636858 1372637303 1372636951 1372636854 1372637091 1372636965 1372637210 1372637299 1372637274 1372637905 ...
 $ DAY_TYPE    : chr "A" "A" "A" "A" ...
 $ MISSING_DATA: chr "False" "False" "False" "False" ...
 $ POLYLINE    : chr "[[-8.618643,41.141412],[-8.618499,41.141376],[-8.620326,41.14251],[-8.622153,41.143815],[-8.623953,41.144373],[-8.62668,41.1447"| __truncated__ "[[-8.639847,41.159826],[-8.640351,41.159871],[-8.642196,41.160114],[-8.644455,41.160492],[-8.646921,41.160951],[-8.649999,41.16"| __truncated__ "[[-8.612964,41.140359],[-8.613378,41.14035],[-8.614215,41.140278],[-8.614773,41.140368],[-8.615907,41.140449],[-8.616609,41.140"| __truncated__ "[[-8.574678,41.151951],[-8.574705,41.151942],[-8.574696,41.151933],[-8.57466,41.15196],[-8.574723,41.151933],[-8.574714,41.1519"| __truncated__

Now what would be a good and fast way to get the latitudes and longitudes from the last column into separate columns?

EDIT: Okay, I used the lat_lon function with lapply instead of a for loop, and it is much faster:

k=train$POLYLINE
d=data.frame(jsoncol=k,stringsAsFactors = FALSE)
f=lapply(d[[1]],lat_lon)

This worked fine for me!

Thanks!
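
To go one step further and put the values into separate numeric columns, something along these lines may work. It is only a sketch: `lat_lon` is copied from #2, the two-row `train` is a stand-in for the real data, and the new column names are made up. Judging by the values (about -8.6 and 41.1), the first number of each pair looks like the longitude, so the names assume that order.

```r
# lat_lon() as posted in #2: first and last "x,y" pair of one string.
lat_lon <- function(x) {
  s <- strsplit(x, '\\](,)\\[')
  res <- s[[1]][c(1, length(s[[1]]))]
  vals <- gsub('\\[|\\]', '', res)
  vals
}

# Stand-in for the real data frame (made-up, shortened values).
train <- data.frame(
  POLYLINE = c("[[-8.618643,41.141412],[-8.618499,41.141376],[-8.620326,41.14251]]",
               "[[-8.639847,41.159826],[-8.640351,41.159871]]"),
  stringsAsFactors = FALSE
)

# One row per trip; column 1 is the first pair, column 2 the last pair.
f <- lapply(train$POLYLINE, lat_lon)
pairs <- do.call(rbind, f)

# Split each "x,y" into two numbers; the result has one column per trip.
start <- vapply(strsplit(pairs[, 1], ",", fixed = TRUE), as.numeric, numeric(2))
end   <- vapply(strsplit(pairs[, 2], ",", fixed = TRUE), as.numeric, numeric(2))

train$start.longitude <- start[1, ]
train$start.latitude  <- start[2, ]
train$end.longitude   <- end[1, ]
train$end.latitude    <- end[2, ]
```

Note that vapply insists on exactly two numbers per pair, so a POLYLINE value without any points (for example "[]") would have to be filtered out or handled separately first.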


#7

A better approach than splitting the string is to parse it as JSON into a list object, then tidy that up and return it as a matrix:

library(rjson)

get.taxi.route <- function(x) {
  # Given a string holding a JSON array of coordinate pairs,
  # return a numeric matrix with one row per point.
  lonlat <- fromJSON(x)

  # data.frame() puts each point in its own column; transpose so each point is a row.
  route <- data.frame(lonlat)
  route <- t(route)

  # Set the names of the dimensions of the matrix.
  colnames(route) <- c("latitude", "longitude")
  rownames(route) <- 1:nrow(route)

  return(route)
}


#8

This seems to be a better solution!

But I am still facing a problem. I want the latitude and longitude of the starting point from this data. I am trying it like this, using the function get.taxi.route:

m=c()
for(i in 1:50000)
{
m[i]=get.taxi.route(train$POLYLINE[i])[1,1]
}

But I am getting this error:

Error in `colnames<-`(`*tmp*`, value = c("latitude", "longitude")) :
  length of 'dimnames' [2] not equal to array extent

I don't get the error if I run the loop for, say, 500 iterations. Why is this happening?
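
One possible explanation, which can't be confirmed without the data: somewhere in the first 50000 rows there may be a trip whose POLYLINE holds no points (for example "[]"). The parsed route would then have zero rows and zero columns, and assigning two column names to it fails with this kind of message. The sketch below reproduces the failure with an empty matrix in base R and shows one way to find such rows up front; the tiny `polyline` vector is made up.

```r
# An empty route: a matrix with 0 rows and 0 columns.
route <- matrix(numeric(0), nrow = 0, ncol = 0)

# Assigning two column names to a 0-column matrix raises an error;
# tryCatch captures its message instead of stopping the script.
msg <- tryCatch({
  colnames(route) <- c("latitude", "longitude")
  "no error"
}, error = function(e) conditionMessage(e))
msg

# Made-up sample: the second trip has no points.
polyline <- c("[[-8.618643,41.141412],[-8.618499,41.141376]]", "[]")

# A value with no digits in it cannot hold any coordinates.
is.empty <- !grepl("[0-9]", polyline)
which(is.empty)
```

If that turns out to be the cause, skipping those rows (or storing NA for them) before calling get.taxi.route should let the loop run through.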


#9
lat_lon(train$POLYLINE)

The function accepts character vectors directly. No need for any loops.
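
One caveat, as far as I can tell from the code in #2: `lat_lon` only looks at `s[[1]]`, so when it is handed a whole vector it returns the pairs for the first element only. Here is a sketch of a variant that handles every element without an explicit loop, using two regular expressions. The name `lat_lon_all` and the sample vector are made up for illustration.

```r
lat_lon_all <- function(x) {
  # Cut everything from the first "]" onward, leaving the first pair
  # (still with its leading brackets).
  first <- sub("\\].*$", "", x)
  # The greedy ".*" runs up to the last "[", leaving the last pair
  # (still with its trailing brackets).
  last <- sub("^.*\\[", "", x)
  # Strip the remaining brackets; one row per input string.
  cbind(start = gsub("\\[|\\]", "", first),
        end   = gsub("\\[|\\]", "", last))
}

# Made-up, shortened sample values.
polyline <- c("[[-8.618643,41.141412],[-8.618499,41.141376],[-8.620326,41.14251]]",
              "[[-8.639847,41.159826],[-8.640351,41.159871]]")
res <- lat_lon_all(polyline)
res
```

Since sub and gsub work on whole character vectors, the result is a character matrix with one row per trip. An input without any points, such as "[]", would come back as two empty strings.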


#10

get.start.end.latitude.longitude <- function(x){
  #return a vector of co-ordinates of start and end of taxi route
  start.latitude <- x[1, "latitude"]
  start.longitude <- x[1, "longitude"]
  end.latitude <- x[nrow(x), "latitude"]
  end.longitude <- x[nrow(x), "longitude"]
  start.end <- c("start.latitude" = start.latitude
                 , "start.longitude" = start.longitude
                 , "end.latitude" = end.latitude
                 , "end.longitude" = end.longitude)
  
  return(start.end)
}

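# taxi.route is not defined in this thread; it is assumed here to be a list of
# route matrices, e.g. lapply(train$POLYLINE, get.taxi.route).
# sapply returns one column per route, so transpose to get one row per route.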
start.end <- sapply(taxi.route, get.start.end.latitude.longitude)
start.end <- t(start.end)

#create new columns in the training data frame which maps to the column vectors of start.end
train$start.latitude <- start.end[, "start.latitude"]
train$start.longitude <- start.end[, "start.longitude"]
train$end.latitude <- start.end[, "end.latitude"]
train$end.longitude <- start.end[, "end.longitude"]


#11

@Pierre_Lafortune,

Tried that too. Still got the same error.


#12

Does d$POLYLINE exist? As long as the lat and lon are a vector, you will not require a loop.