Dask: df.col.unique() vs df.col.drop_duplicates()
In Dask, what is the difference between df.col.unique() and df.col.drop_duplicates()? Both return a series containing the unique elements of df.col. There is a difference in the index: the unique result is indexed 1..n, while the drop_duplicates result is indexed by an arbitrary-looking sequence of numbers.
What is the significance of the index returned by drop_duplicates?
Is there any reason to use one over the other if the index is not important?
dask.dataframe has both because pandas has both, and dask.dataframe mostly copies the pandas API. unique is a holdover from pandas' history with NumPy.
    In [1]: import pandas as pd

    In [2]: df = pd.DataFrame({'x': [1, 2, 1], 'y': [1., 2., 3.]},
       ...:                   index=pd.Index(['a', 'b', 'a'], name='i'))

    In [3]: df.x.drop_duplicates()
    Out[3]:
    i
    a    1
    b    2
    Name: x, dtype: int64

    In [4]: df.x.unique()
    Out[4]: array([1, 2])

In dask.dataframe we deviate slightly and choose to use a dask.dataframe.Series rather than a dask.array.Array, because one can't precompute the length of the array and so can't act lazily.
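On the question about the index: in pandas, drop_duplicates keeps the first occurrence of each value together with the index label that row already had, so the "arbitrary-looking" numbers are simply labels from the original data. A minimal sketch of that behaviour follows. The Series contents and index labels are made up for illustration and do not come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the index labels 10..50 are chosen only for illustration.
s = pd.Series([1, 2, 1, 3, 2], index=[10, 20, 30, 40, 50], name="x")

deduped = s.drop_duplicates()  # pandas Series, original index labels kept
uniques = s.unique()           # NumPy array, index discarded

# Each surviving row carries the label of the first occurrence of its value.
print(deduped.index.tolist())  # [10, 20, 40]
print(type(uniques).__name__)  # ndarray

# If the labels are not needed, reset_index gives a plain 0..n-1 numbering.
renumbered = deduped.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1, 2]
```

If the index does not matter, calling reset_index(drop=True) on the drop_duplicates result gives a plainly numbered series.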
In practice there is little reason to use unique over drop_duplicates.