# A tibble: 265 × 4
location_id borough zone service_zone
<int> <chr> <chr> <chr>
1 1 EWR Newark Airport EWR
2 2 Queens Jamaica Bay Boro Zone
3 3 Bronx Allerton/Pelham Gardens Boro Zone
4 4 Manhattan Alphabet City Yellow Zone
5 5 Staten Island Arden Heights Boro Zone
6 6 Staten Island Arrochar/Fort Wadsworth Boro Zone
7 7 Queens Astoria Boro Zone
8 8 Queens Astoria Park Boro Zone
9 9 Queens Auburndale Boro Zone
10 10 Queens Baisley Park Boro Zone
# … with 255 more rows
why use str_replace_all() and not str_remove_all()
String manipulation exercise
collect(short_zones)
# A tibble: 265 × 3
zone abbr_zone abbr_zone_len
<chr> <chr> <int>
1 Prospect-Lefferts Gardens Prspct-LffrtsGrdns 18
2 Flushing Meadows-Corona Park FlshngMdws-CrnPrk 17
3 Springfield Gardens North SprngfldGrdnsNrth 17
4 Springfield Gardens South SprngfldGrdnsSth 16
5 Washington Heights North WshngtnHghtsNrth 16
6 Williamsburg (North Side) Wllmsbrg(NrthSd) 16
7 Financial District North FnnclDstrctNrth 15
8 Washington Heights South WshngtnHghtsSth 15
9 Williamsburg (South Side) Wllmsbrg(SthSd) 15
10 Financial District South FnnclDstrctSth 14
# … with 255 more rows
# A tibble: 265 × 2
pickup_location_id pickup_borough
<int> <chr>
1 1 EWR
2 2 Queens
3 3 Bronx
4 4 Manhattan
5 5 Staten Island
6 6 Staten Island
7 7 Queens
8 8 Queens
9 9 Queens
10 10 Queens
# … with 255 more rows
Why didn’t this work?
nyc_taxi |>left_join(pickup) |>collect()
Error in `collect()`:
! Invalid: Incompatible data types for corresponding join field keys: FieldRef.Name(pickup_location_id) of type int64 and FieldRef.Name(pickup_location_id) of type int32
/home/danielle/GitHub/projects/arrow/cpp/src/arrow/compute/exec/hash_join_node.cc:122 ValidateSchemas(join_type, left_schema, left_keys, left_output, right_schema, right_keys, right_output, left_field_name_suffix, right_field_name_suffix)
/home/danielle/GitHub/projects/arrow/cpp/src/arrow/compute/exec/hash_join_node.cc:697 schema_mgr->Init( join_options.join_type, left_schema, join_options.left_keys, join_options.left_output, right_schema, join_options.right_keys, join_options.right_output, join_options.filter, join_options.output_suffix_for_left, join_options.output_suffix_for_right)
For the January 2022 NYC data, we want to rank the taxi rides in chronological order by pickup time. We then want to find those taxi rides where the pickup occurred on the 59th minute of the hour, and the 59th second of minute, andp the numerical rank of that ride contains the number 59 in it. We would like to return the rank, the pickup time, and a “magic number” that removes all the digits from the rank that are not 5 or 9
# A tibble: 6,567,396 × 4
pickup_datetime pickup_longitude pickup_latitude pickup_location_id
<dttm> <dbl> <dbl> <int>
1 2019-09-01 15:14:09 NA NA 186
2 2019-09-01 15:36:17 NA NA 138
3 2019-09-01 15:29:19 NA NA 132
4 2019-09-01 15:33:09 NA NA 79
5 2019-09-01 15:57:43 NA NA 132
6 2019-09-01 15:59:16 NA NA 132
7 2019-09-01 15:20:06 NA NA 132
8 2019-09-01 15:27:54 NA NA 224
9 2019-09-01 15:35:08 NA NA 79
10 2019-09-01 15:19:37 NA NA 97
# … with 6,567,386 more rows
# A tibble: 6 × 3
location year city
<chr> <dbl> <chr>
1 Stonewall Inn 1969 New York
2 Compton's Cafeteria 1966 San Francisco
3 Cooper Do-nuts 1959 Los Angeles
4 Dewey's 1965 Philadelphia
5 King's Cross 1978 Sydney
6 La Rambla 1977 Barcelona