The intercity freight trips of heavy trucks are important data for transportation system planning and urban agglomeration management. In recent decades, the extraction of freight trips from GPS data has gradually become the main alternative to traditional surveys. Identifying the trip ends (origin and destination, OD) is the first task in trip extraction. In previous trip end identification methods, key parameters such as speed and time thresholds have mostly been defined on the basis of empirical knowledge, which inevitably limits their generality. Here, we propose a data-driven trip end identification method. First, we define a speed threshold by analyzing the speed distribution of heavy trucks and identify all truck stops from the raw GPS data. Second, we define minimum and maximum time thresholds by analyzing the distribution of heavy-truck dwell times at stop locations and classify truck stops into three types based on these time thresholds. Third, we use highway network GIS data and freight-related point-of-interest (POI) data to identify valid trip ends from among the three types of truck stops. In this step, we detect POI boundaries to determine whether a heavy truck has stopped at a freight-related location. We further analyze the spatiotemporal characteristics of intercity heavy-truck freight trips and discuss their potential applications in practice.
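A minimal sketch of the stop detection and dwell-time classification steps described above is given below; the field names, threshold values, and stop labels are illustrative assumptions rather than the data-driven values derived in the paper.

```python
# Hedged sketch of speed-threshold stop detection and dwell-time classification.
# Field names, thresholds, and stop labels are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class GpsPoint:
    truck_id: str
    timestamp: float   # seconds since epoch
    lon: float
    lat: float
    speed: float       # km/h

SPEED_THRESHOLD = 2.0                # km/h, assumed value from the speed distribution
MIN_DWELL, MAX_DWELL = 1800, 86400   # seconds, assumed data-driven time thresholds

def extract_stops(points):
    """Group consecutive low-speed points into candidate stops."""
    stops, current = [], []
    for p in sorted(points, key=lambda x: x.timestamp):
        if p.speed <= SPEED_THRESHOLD:
            current.append(p)
        elif current:
            stops.append(current)
            current = []
    if current:
        stops.append(current)
    return stops

def classify_stop(stop):
    """Label a stop by its dwell time relative to the two time thresholds."""
    dwell = stop[-1].timestamp - stop[0].timestamp
    if dwell < MIN_DWELL:
        return "temporary"           # e.g., congestion or a short rest
    if dwell > MAX_DWELL:
        return "long-term"           # e.g., overnight parking or maintenance
    return "candidate_trip_end"      # checked against freight POIs in the next step
```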
Intracity heavy truck freight trips are fundamental data for urban freight system planning and management. In the big data era, massive heavy truck GPS trajectories can be acquired cost-effectively in real time. Identifying freight trip ends (origins and destinations) from heavy truck GPS trajectories remains an open problem. Although previous studies have proposed a variety of trip end identification methods from different perspectives, they defined key threshold parameters subjectively and ignored the complex travel characteristics of intracity heavy trucks. Here, we propose a data-driven trip end identification method in which the speed threshold for identifying truck stops and the multilevel time thresholds for distinguishing temporary stops from freight trip ends are defined objectively. Moreover, an appropriate time threshold level is selected dynamically by considering the intracity activity patterns of heavy trucks. Furthermore, we use urban road networks and point-of-interest (POI) data to eliminate misidentified trip ends and improve accuracy. Validation shows that the proposed method achieves an accuracy of 87.45%. Our method incorporates the impact of the urban freight context on truck trajectory characteristics, and its results reflect the spatial distribution and chain patterns of intracity heavy truck freight trips, which have a wide range of practical applications.
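The dynamic selection of a time threshold level could be sketched as follows; the threshold levels, the activity-pattern heuristic, and the POI/road-network flags are illustrative assumptions, not the published parameters.

```python
# Hedged sketch of choosing a dwell-time threshold level per truck based on its
# intracity activity pattern, then filtering with POI and road-network checks.
# All numeric values and the pattern heuristic are assumptions for illustration.
THRESHOLD_LEVELS = {      # minutes; assumed multilevel dwell-time thresholds
    "short_haul": 20,
    "mixed": 45,
    "long_haul": 90,
}

def activity_pattern(daily_stop_counts):
    """Crude proxy for a truck's intracity activity pattern."""
    mean_stops = sum(daily_stop_counts) / len(daily_stop_counts)
    if mean_stops >= 8:
        return "short_haul"
    if mean_stops >= 3:
        return "mixed"
    return "long_haul"

def is_trip_end(dwell_minutes, daily_stop_counts, near_freight_poi, on_road_segment):
    """Keep a stop as a freight trip end only if it passes the dwell-time test for
    the truck's activity pattern and is not a misidentified roadside stop."""
    level = THRESHOLD_LEVELS[activity_pattern(daily_stop_counts)]
    return dwell_minutes >= level and near_freight_poi and not on_road_segment
```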
Heavy-tailed metrics are common and often critical to product evaluation in the online world. Even when samples are large enough for the central limit theorem to kick in, experimentation remains challenging because confidence intervals are wide. We illustrate the problem by running A/A simulations on customer spending data from a large-scale e-commerce site, and then explore solutions. On one front, we address the heavy tail directly and highlight the often-ignored nuances of winsorization; in particular, the nominal false positive rate may no longer hold. Further inspired by robust statistics, we introduce Huber regression as a better way to measure the treatment effect. On another front, we exploit covariates from the pre-experiment period. Although they are independent of assignment and can explain much of the variation in the response, the concern is that models are trained to minimize prediction error rather than the bias of the parameter of interest. We find the framework of orthogonal learning useful: instead of matching raw observations, we match residuals from two predictions, one for the response and the other for the assignment. Robust regression integrates readily, together with cross-fitting. The final design proves highly effective at driving down variance while controlling bias. It powers our daily practice and can hopefully benefit other applications in the industry.
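One way to sketch the residual-on-residual design with cross-fitting and a robust final stage is shown below; the choice of outcome and assignment models, the data layout, and the function name are assumptions for illustration, not the production pipeline.

```python
# Hedged sketch of orthogonal (double-ML style) estimation with cross-fitting
# and a Huber-regression final stage. Model choices are illustrative assumptions.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import HuberRegressor, LogisticRegression
from sklearn.model_selection import KFold

def orthogonal_effect(X_pre, treat, y, n_splits=2, seed=0):
    """X_pre: pre-experiment covariates (n x p), treat: 0/1 assignment, y: spend."""
    y_res = np.zeros(len(y))
    t_res = np.zeros(len(treat))
    for train, test in KFold(n_splits, shuffle=True, random_state=seed).split(X_pre):
        # Outcome model: predict spend from pre-experiment covariates.
        m_y = GradientBoostingRegressor().fit(X_pre[train], y[train])
        y_res[test] = y[test] - m_y.predict(X_pre[test])
        # Assignment model: predict treatment probability from the same covariates.
        m_t = LogisticRegression().fit(X_pre[train], treat[train])
        t_res[test] = treat[test] - m_t.predict_proba(X_pre[test])[:, 1]
    # Robust final stage: Huber loss limits the influence of heavy-tailed spenders.
    final = HuberRegressor().fit(t_res.reshape(-1, 1), y_res)
    return final.coef_[0]   # estimated average treatment effect
```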
Atmospheric modeling has recently experienced a surge with the advent of deep learning. Most of these models, however, predict pollutant concentrations following a data-driven approach in which the physical laws that govern their behavior and relationships remain hidden. Using real-world air quality data collected hourly at different stations throughout Madrid, we present an empirical approach based on data-driven techniques with the following goals: (1) find parsimonious systems of ordinary differential equations via sparse identification of nonlinear dynamics (SINDy) that model the concentrations of pollutants and their changes over time; (2) assess the performance and limitations of our models using stability analysis; (3) reconstruct the time series of chemical pollutants not measured in certain stations using delay coordinate embedding. Our results show that Akaike's information criterion works well in conjunction with best-subset regression to find a balance between sparsity and goodness of fit. We also find that, due to the complexity of the chemical system under study, identifying the dynamics of the system over longer periods of time requires higher levels of data filtering and smoothing. Stability analysis of the reconstructed ordinary differential equations (ODEs) reveals that more than half of the physically relevant critical points are saddle points, suggesting that the system is unstable even under the idealized assumption that all environmental conditions are constant over time.
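The core sparse-regression step of SINDy can be illustrated with a small sequential thresholded least-squares routine; the candidate library and threshold below are assumptions for a toy two-pollutant system, not the paper's configuration, which pairs best-subset regression with Akaike's information criterion.

```python
# Hedged sketch of SINDy via sequentially thresholded least squares on a toy
# two-variable system; library terms and threshold are illustrative assumptions.
import numpy as np

def build_library(X):
    """Candidate terms for two concentrations x1, x2: 1, x1, x2, x1*x2, x1^2, x2^2."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1**2, x2**2])

def stlsq(theta, dXdt, threshold=0.05, n_iter=10):
    """Sequentially thresholded least squares: prune small coefficients, refit."""
    xi = np.linalg.lstsq(theta, dXdt, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(xi) < threshold
        xi[small] = 0.0
        for k in range(dXdt.shape[1]):
            big = ~small[:, k]
            if big.any():
                xi[big, k] = np.linalg.lstsq(theta[:, big], dXdt[:, k], rcond=None)[0]
    return xi

# Toy usage: X holds hourly concentrations (n_samples x 2) sampled every dt hours.
# dXdt = np.gradient(X, dt, axis=0); xi = stlsq(build_library(X), dXdt)
```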
The identification of precipitation regimes is important for many purposes, such as agricultural planning, water resource management, and return period estimation. Since precipitation and other related meteorological data typically exhibit spatial dependency and different characteristics at different time scales, clustering such data presents unique challenges. In this paper, we develop a flexible model-based approach for clustering multi-scale spatial functional data to address these challenges. The underlying clustering model is a functional linear model, and the cluster memberships are assumed to be a realization of a Markov random field with geographic covariates. The methodology is applied to precipitation data from China to identify precipitation regimes.
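As a rough sketch of the kind of model described, with notation assumed rather than taken from the paper, one can write a cluster-specific functional linear model with a Markov random field prior on the labels:

```latex
% Hedged sketch of the model structure; notation is assumed, not the paper's.
\begin{align*}
  Y_s(t) &= \mu_{z_s}(t) + \sum_{j} X_{sj}\,\beta_{z_s j}(t) + \varepsilon_s(t), \\
  P(z_s = k \mid z_{\partial s}, W_s) &\propto
    \exp\Big\{ W_s^\top \alpha_k + \gamma \sum_{r \in \partial s} \mathbf{1}(z_r = k) \Big\},
\end{align*}
```

where $Y_s(t)$ is the precipitation curve at site $s$, $z_s$ its cluster label, $W_s$ the geographic covariates, and $\partial s$ the spatial neighbors of $s$.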
The autoregressive (AR) model is widely used to understand time series data. Traditionally, the innovations of the AR model are assumed to be Gaussian. However, many time series applications, for example financial data, are non-Gaussian; therefore, an AR model with more general heavy-tailed innovations is preferred. Another issue that frequently occurs in time series is missing values, due to system record failures or unexpected data loss. Although there are numerous works on Gaussian AR time series with missing values, as far as we know, no existing work addresses missing data for the heavy-tailed AR model. In this paper, we consider this issue for the first time and propose an efficient framework for parameter estimation from incomplete heavy-tailed time series based on stochastic approximation expectation maximization (SAEM) coupled with a Markov chain Monte Carlo (MCMC) procedure. The proposed algorithm is computationally cheap and easy to implement. The convergence of the proposed algorithm to a stationary point of the observed-data likelihood is rigorously proved. Extensive simulations and analyses of real datasets demonstrate the efficacy of the proposed framework.
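A simplified sketch of the SAEM-with-MCMC iteration for an AR(1) model with Student-t innovations and missing values is given below, using the Gaussian scale-mixture representation of the t distribution; fixing the degrees of freedom and running a single Gibbs scan per iteration are simplifications of this sketch, not the paper's algorithm.

```python
# Hedged sketch of SAEM with an MCMC simulation step for an AR(1) model with
# Student-t innovations: eps_t | tau_t ~ N(0, sigma^2/tau_t), tau_t ~ Gamma(nu/2, nu/2).
# Fixed nu and a single Gibbs scan per iteration are simplifying assumptions.
import numpy as np

rng = np.random.default_rng(0)

def saem_ar1_t(y, missing, nu=4.0, n_iter=200):
    """y: 1-D series with placeholders at missing positions; missing: boolean mask."""
    y = np.asarray(y, dtype=float).copy()
    y[missing] = y[~missing].mean()               # crude initial imputation
    phi, sigma2 = 0.0, y[~missing].var()
    s1 = s2 = s3 = 0.0                            # smoothed sufficient statistics
    for k in range(1, n_iter + 1):
        e = y[1:] - phi * y[:-1]
        # Simulation step (MCMC): latent scales, then missing interior values.
        tau = rng.gamma((nu + 1) / 2, 2.0 / (nu + e**2 / sigma2))
        for t in np.flatnonzero(missing):
            if t == 0 or t == len(y) - 1:
                continue                          # endpoints skipped for brevity
            prec = (tau[t - 1] + phi**2 * tau[t]) / sigma2
            mean = (tau[t - 1] * phi * y[t - 1] + tau[t] * phi * y[t + 1]) / (prec * sigma2)
            y[t] = rng.normal(mean, 1.0 / np.sqrt(prec))
        # Stochastic approximation of the complete-data sufficient statistics.
        gamma_k = 1.0 / k
        s1 = (1 - gamma_k) * s1 + gamma_k * np.sum(tau * y[1:] * y[:-1])
        s2 = (1 - gamma_k) * s2 + gamma_k * np.sum(tau * y[:-1] ** 2)
        s3 = (1 - gamma_k) * s3 + gamma_k * np.sum(tau * y[1:] ** 2)
        # Maximization step in closed form (weighted least squares).
        phi = s1 / s2
        sigma2 = (s3 - s1**2 / s2) / (len(y) - 1)
    return phi, sigma2
```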